Gini impurity and information gain in decision trees
Decision trees are among the most intuitive and widely used machine learning algorithms for classification and regression tasks. At the heart of building an effective decision tree model lies the question: how do we decide which feature to split on at each node? This is where two fundamental concepts come into play: gini impurity and information gain. Understanding these decision tree metrics is crucial for anyone looking to master the decision tree algorithm and build robust AI systems.

In this comprehensive guide, we’ll explore how gini impurity and information gain work, how they relate to entropy, and why they’re essential for feature selection in decision trees. We’ll dive deep into the mathematics, provide practical Python examples, and help you understand when to use each metric in your decision tree models.
1. Understanding decision trees and the need for splitting criteria
Before we delve into gini impurity and information gain, let’s establish why we need these metrics in the first place. A decision tree model works by recursively splitting the dataset into subsets based on the values of input features. Each split aims to create subsets that are more “pure” than the parent node—meaning the samples in each subset are more homogeneous with respect to their target class.
The fundamental question
At every node in a decision tree, we face a critical decision: which feature should we use to split the data, and at what value? If we have a dataset with dozens or even hundreds of features, trying all possible combinations would be computationally expensive and inefficient. We need a systematic way to evaluate the quality of each potential split.
This is where splitting criteria come in. These metrics quantify how “good” a split is by measuring how much it reduces the impurity or uncertainty in the resulting child nodes. The two most popular metrics are:
- Gini impurity (also known as the gini index): Measures the probability of incorrectly classifying a randomly chosen element
- Information gain: Measures the reduction in entropy (uncertainty) after a split
Both metrics serve the same purpose—helping the decision tree algorithm choose the best feature to split on—but they approach the problem from slightly different angles.
2. Gini impurity: measuring classification error probability
Gini impurity is one of the most popular metrics used in decision tree algorithms, particularly in the CART (Classification and Regression Trees) algorithm. The gini index quantifies the probability of incorrectly classifying a randomly selected element from the dataset if it were randomly labeled according to the class distribution in the subset.
Mathematical definition
For a dataset with \( C \) classes, the gini impurity is calculated as:
$$ Gini(S) = 1 - \sum_{i=1}^{C} p_i^2 $$
Where:
- \( S \) is the dataset or subset
- \( p_i \) is the proportion of samples belonging to class \( i \)
- \( C \) is the total number of classes
Interpreting gini impurity values
The gini impurity ranges from 0 to 0.5 for binary classification (or \( 1 - \frac{1}{C} \) for multi-class problems):
- Gini = 0: Perfect purity—all samples belong to a single class
- Gini = 0.5: Maximum impurity for binary classification—samples are evenly distributed across classes
- Lower values: Indicate higher purity and better splits
Practical example: calculating gini impurity
Let’s consider a concrete example. Imagine we have a dataset of 100 animals, and we’re trying to classify them as either “Dog” or “Cat”. In our current node, we have:
- 70 Dogs
- 30 Cats
The gini impurity would be:
$$ Gini = 1 - \left(\left(\frac{70}{100}\right)^2 + \left(\frac{30}{100}\right)^2\right) = 1 - (0.49 + 0.09) = 0.42 $$
Now, let’s say we split this node based on the feature “Size” (Large vs. Small):
Left child (Large animals):
- 60 Dogs
- 5 Cats
$$ Gini_{left} = 1 - \left(\left(\frac{60}{65}\right)^2 + \left(\frac{5}{65}\right)^2\right) = 1 - (0.852 + 0.006) = 0.142 $$
Right child (Small animals):
- 10 Dogs
- 25 Cats
$$ Gini_{right} = 1 - \left(\left(\frac{10}{35}\right)^2 + \left(\frac{25}{35}\right)^2\right) = 1 - (0.082 + 0.510) = 0.408 $$
Python implementation
Here’s how to calculate gini impurity in Python:
import numpy as np

def gini_impurity(labels):
    """
    Calculate Gini impurity for a list of labels.

    Parameters:
        labels (array-like): Array of class labels

    Returns:
        float: Gini impurity value
    """
    # Count unique classes and their frequencies
    classes, counts = np.unique(labels, return_counts=True)
    # Calculate proportions
    proportions = counts / len(labels)
    # Calculate Gini impurity
    gini = 1 - np.sum(proportions ** 2)
    return gini

# Example usage
labels = ['Dog'] * 70 + ['Cat'] * 30
print(f"Gini impurity: {gini_impurity(labels):.4f}")

# After split - left child
left_labels = ['Dog'] * 60 + ['Cat'] * 5
print(f"Gini impurity (left): {gini_impurity(left_labels):.4f}")

# After split - right child
right_labels = ['Dog'] * 10 + ['Cat'] * 25
print(f"Gini impurity (right): {gini_impurity(right_labels):.4f}")
Weighted gini impurity after a split
When evaluating a split, we need to calculate the weighted average of the gini impurity for the child nodes:
$$ Gini_{split} = \frac{n_{left}}{n_{total}} \times Gini_{left} + \frac{n_{right}}{n_{total}} \times Gini_{right} $$
For our example:
$$ Gini_{split} = \frac{65}{100} \times 0.142 + \frac{35}{100} \times 0.408 = 0.092 + 0.143 = 0.235 $$
The decision tree algorithm would then calculate this weighted gini impurity for all possible features and thresholds, selecting the split that minimizes this value.
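As a quick sanity check on the arithmetic above, here is a minimal sketch that reuses the gini_impurity function from the earlier example to compute the weighted impurity of the "Size" split:
# Weighted Gini impurity of the "Size" split, reusing gini_impurity() from above
left_labels = ['Dog'] * 60 + ['Cat'] * 5
right_labels = ['Dog'] * 10 + ['Cat'] * 25
n_left, n_right = len(left_labels), len(right_labels)
n_total = n_left + n_right
gini_split = (n_left / n_total) * gini_impurity(left_labels) + \
             (n_right / n_total) * gini_impurity(right_labels)
print(f"Weighted Gini after split: {gini_split:.4f}")  # roughly 0.235, matching the hand calculation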
3. Entropy and information theory foundations
To understand information gain, we first need to grasp the concept of entropy, which comes from information theory. Entropy measures the amount of uncertainty or disorder in a dataset—the more mixed the classes, the higher the entropy.
What is entropy?
In the context of decision trees, entropy quantifies how unpredictable or “surprised” we would be when randomly picking an element from the dataset. If all samples belong to the same class, there’s no surprise (entropy = 0). If classes are evenly distributed, surprise is maximum.
The entropy for a dataset is calculated as:
$$ Entropy(S) = -\sum_{i=1}^{C} p_i \log_2(p_i) $$
Where:
- \( p_i \) is the proportion of samples belonging to class \( i \)
- \( C \) is the total number of classes
- \( \log_2 \) is the logarithm base 2 (measuring information in bits)
Entropy values and interpretation
- Entropy = 0: Perfect purity—no uncertainty
- Entropy = 1 (for binary classification): Maximum uncertainty—classes are evenly split
- Higher values: Indicate more disorder and uncertainty
Calculating entropy: a step-by-step example
Using our previous animal classification example with 70 Dogs and 30 Cats:
$$ Entropy = -\left(\frac{70}{100} \log_2\left(\frac{70}{100}\right) + \frac{30}{100} \log_2\left(\frac{30}{100}\right)\right) $$
$$ Entropy = -(0.7 \times (-0.515) + 0.3 \times (-1.737)) = -(-0.360 - 0.521) = 0.881 $$
For the split on “Size”:
Left child (Large animals): $$ Entropy_{left} = -\left(\frac{60}{65} \log_2\left(\frac{60}{65}\right) + \frac{5}{65} \log_2\left(\frac{5}{65}\right)\right) = 0.391 $$
Right child (Small animals): $$ Entropy_{right} = -\left(\frac{10}{35} \log_2\left(\frac{10}{35}\right) + \frac{25}{35} \log_2\left(\frac{25}{35}\right)\right) = 0.863 $$
Python implementation for entropy
import numpy as np

def entropy(labels):
    """
    Calculate entropy for a list of labels.

    Parameters:
        labels (array-like): Array of class labels

    Returns:
        float: Entropy value
    """
    # Count unique classes and their frequencies
    classes, counts = np.unique(labels, return_counts=True)
    # Calculate proportions
    proportions = counts / len(labels)
    # Calculate entropy (avoid log(0) by filtering out zero proportions)
    entropy_value = -np.sum([p * np.log2(p) for p in proportions if p > 0])
    return entropy_value

# Example usage
labels = ['Dog'] * 70 + ['Cat'] * 30
print(f"Entropy: {entropy(labels):.4f}")

# After split
left_labels = ['Dog'] * 60 + ['Cat'] * 5
right_labels = ['Dog'] * 10 + ['Cat'] * 25
print(f"Entropy (left): {entropy(left_labels):.4f}")
print(f"Entropy (right): {entropy(right_labels):.4f}")
The relationship between entropy and information
The concept of entropy comes from Claude Shannon’s information theory. In this framework, information is inversely related to probability—rare events carry more information than common events. When we say a dataset has high entropy, we’re saying it contains a lot of “information” in the sense that we’re uncertain about what class a randomly selected sample belongs to.
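To make the idea that rare events carry more information concrete, here is a small illustrative sketch of Shannon self-information, \( -\log_2(p) \), for a few made-up event probabilities (the specific values are only for demonstration):
import numpy as np

# Self-information in bits: the rarer the event, the more bits of "surprise" it carries
for p in (0.9, 0.5, 0.01):  # illustrative probabilities, not taken from the animal example
    print(f"P(event) = {p}: self-information = {-np.log2(p):.3f} bits")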
4. Information gain: quantifying the value of a split
Information gain is the primary metric used in algorithms like ID3 and C4.5 for building decision trees. It measures how much a feature reduces entropy—in other words, how much information we gain about the class labels by knowing the value of that feature.
Mathematical definition
The calculation involves finding the difference between the entropy before the split and the weighted average entropy after the split:
$$ IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times Entropy(S_v) $$
Where:
- \( S \) is the original dataset
- \( A \) is the feature being considered for splitting
- \( S_v \) is the subset of \( S \) where feature \( A \) has value \( v \)
- \( |S| \) denotes the number of samples in set \( S \)
Calculating information gain for our example
Using our animal classification example where we split on “Size”:
$$ IG = Entropy_{parent} - \left(\frac{n_{left}}{n_{total}} \times Entropy_{left} + \frac{n_{right}}{n_{total}} \times Entropy_{right}\right) $$
Substituting our values:
$$ IG = 0.881 - \left(\frac{65}{100} \times 0.391 + \frac{35}{100} \times 0.863\right) $$
$$ IG = 0.881 - (0.254 + 0.302) = 0.881 - 0.556 = 0.325 $$
This positive information gain of 0.325 bits tells us that knowing whether an animal is “Large” or “Small” reduces our uncertainty about whether it’s a Dog or Cat by 0.325 bits of information.
Python implementation for information gain
import numpy as np

def information_gain(parent_labels, left_labels, right_labels):
    """
    Calculate information gain for a binary split.

    Parameters:
        parent_labels (array-like): Labels before split
        left_labels (array-like): Labels in left child
        right_labels (array-like): Labels in right child

    Returns:
        float: Information gain value
    """
    # Calculate parent entropy
    parent_entropy = entropy(parent_labels)
    # Calculate weighted child entropy
    n_total = len(parent_labels)
    n_left = len(left_labels)
    n_right = len(right_labels)
    weighted_entropy = (n_left / n_total) * entropy(left_labels) + \
                       (n_right / n_total) * entropy(right_labels)
    # Information gain
    ig = parent_entropy - weighted_entropy
    return ig

# Example usage
parent_labels = ['Dog'] * 70 + ['Cat'] * 30
left_labels = ['Dog'] * 60 + ['Cat'] * 5
right_labels = ['Dog'] * 10 + ['Cat'] * 25
ig = information_gain(parent_labels, left_labels, right_labels)
print(f"Information gain: {ig:.4f}")
Interpreting information gain values
- IG > 0: The split reduces entropy and provides useful information
- IG = 0: The split doesn’t help—child nodes have the same entropy as the parent
- Higher IG: Better splits that more effectively separate classes
The decision tree algorithm selects the feature and threshold that maximize information gain at each node, ensuring that each split moves us closer to pure leaf nodes.
5. Comparing gini impurity vs information gain
Both gini impurity and information gain serve the same fundamental purpose in decision tree construction, but they have subtle differences that can affect model performance and computational efficiency. Understanding when to use each metric is important for building optimal decision tree models.
Computational efficiency
Gini impurity is computationally faster because it only requires squaring and summing proportions. There are no logarithmic calculations involved, making it more efficient for large datasets.
Information gain (entropy-based) requires logarithmic calculations, which are more expensive computationally. However, with modern computing power, this difference is often negligible for most practical applications.
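If you want to gauge the difference on your own machine, a rough micro-benchmark along these lines (a sketch using NumPy and the standard timeit module; absolute timings will vary by hardware) compares the two raw computations on a made-up class distribution:
import timeit
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10))  # an arbitrary 10-class distribution, for illustration only

# Time the bare impurity formulas (no splitting logic), 100,000 repetitions each
gini_time = timeit.timeit(lambda: 1 - np.sum(probs ** 2), number=100_000)
entropy_time = timeit.timeit(lambda: -np.sum(probs * np.log2(probs)), number=100_000)
print(f"Gini:    {gini_time:.3f} s")
print(f"Entropy: {entropy_time:.3f} s")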
Sensitivity to class distributions
Gini impurity tends to favor creating pure nodes and often isolates the most frequent class into its own node. It’s less sensitive to small differences in class distributions.
Information gain is more sensitive to the overall distribution of classes and tends to create more balanced trees. It emphasizes reducing uncertainty across all classes more evenly.
Mathematical comparison
Let’s compare how these metrics behave for a binary classification problem with varying class distributions:
import numpy as np
import matplotlib.pyplot as plt

def compare_metrics(p):
    """Compare Gini and Entropy for different class proportions"""
    gini = 1 - (p**2 + (1-p)**2)
    ent = -p * np.log2(p + 1e-10) - (1-p) * np.log2(1-p + 1e-10)
    return gini, ent

# Generate proportions from 0 to 1
proportions = np.linspace(0.01, 0.99, 100)
gini_values = []
entropy_values = []
for p in proportions:
    g, e = compare_metrics(p)
    gini_values.append(g)
    entropy_values.append(e)

# Plotting would show that both curves peak at p=0.5 (maximum impurity)
# but entropy has a sharper curve
When we plot these values, we observe:
- Both metrics reach their maximum when classes are evenly split (p = 0.5)
- Both metrics reach minimum (0) at pure nodes (p = 0 or p = 1)
- Entropy has a slightly sharper curve, making it more sensitive to changes in class distribution
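If you would like to see these curves for yourself, a minimal plotting sketch using the proportions, gini_values, and entropy_values arrays computed above could look like this:
# Plot both impurity measures as a function of the class-1 proportion p
plt.plot(proportions, gini_values, label='Gini impurity')
plt.plot(proportions, entropy_values, label='Entropy')
plt.xlabel('Proportion of class 1 (p)')
plt.ylabel('Impurity')
plt.legend()
plt.show()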
Practical differences in decision tree behavior
In practice, decision trees built with gini impurity versus information gain often produce similar results, but there can be subtle differences:
Trees using gini impurity:
- Tend to isolate the largest class earlier in the tree
- May create slightly simpler trees with fewer splits
- Faster to train on large datasets
Trees using information gain:
- Create more balanced partitions
- May produce slightly deeper trees
- Better theoretical foundation in information theory
Which metric should you use?
For most practical applications, the choice between gini impurity and information gain won’t dramatically affect your model’s performance. However, here are some guidelines:
Use gini impurity when:
- Computational efficiency is critical
- You’re working with very large datasets
- You’re using the CART algorithm or scikit-learn’s default settings
Use information gain when:
- You want a stronger theoretical foundation in information theory
- You’re implementing ID3 or C4.5 algorithms
- You want more balanced tree structures
Many practitioners simply stick with the default setting of their chosen library. In scikit-learn, for example, the default criterion is gini impurity, and it works well for most cases.
6. Implementing feature selection with gini and information gain
Now that we understand both metrics, let’s implement a complete feature selection process for building a decision tree. We’ll create functions that evaluate all possible splits and select the best one based on either gini impurity or information gain.
Complete implementation example
import numpy as np
import pandas as pd
from collections import Counter

class DecisionTreeMetrics:
    """
    A class implementing Gini impurity and Information Gain
    for decision tree feature selection.
    """

    @staticmethod
    def gini_impurity(labels):
        """Calculate Gini impurity for a set of labels."""
        if len(labels) == 0:
            return 0
        counter = Counter(labels)
        impurity = 1.0
        for count in counter.values():
            prob = count / len(labels)
            impurity -= prob ** 2
        return impurity

    @staticmethod
    def entropy(labels):
        """Calculate entropy for a set of labels."""
        if len(labels) == 0:
            return 0
        counter = Counter(labels)
        entropy_val = 0.0
        for count in counter.values():
            prob = count / len(labels)
            if prob > 0:
                entropy_val -= prob * np.log2(prob)
        return entropy_val

    @staticmethod
    def weighted_impurity(left_labels, right_labels, metric='gini'):
        """Calculate weighted impurity after a split."""
        n_left = len(left_labels)
        n_right = len(right_labels)
        n_total = n_left + n_right
        if metric == 'gini':
            left_impurity = DecisionTreeMetrics.gini_impurity(left_labels)
            right_impurity = DecisionTreeMetrics.gini_impurity(right_labels)
        else:  # entropy
            left_impurity = DecisionTreeMetrics.entropy(left_labels)
            right_impurity = DecisionTreeMetrics.entropy(right_labels)
        weighted = (n_left / n_total) * left_impurity + \
                   (n_right / n_total) * right_impurity
        return weighted

    @staticmethod
    def information_gain(parent_labels, left_labels, right_labels):
        """Calculate information gain for a split."""
        parent_entropy = DecisionTreeMetrics.entropy(parent_labels)
        child_entropy = DecisionTreeMetrics.weighted_impurity(
            left_labels, right_labels, metric='entropy'
        )
        return parent_entropy - child_entropy

    @staticmethod
    def gini_gain(parent_labels, left_labels, right_labels):
        """Calculate Gini gain (reduction in Gini impurity) for a split."""
        parent_gini = DecisionTreeMetrics.gini_impurity(parent_labels)
        child_gini = DecisionTreeMetrics.weighted_impurity(
            left_labels, right_labels, metric='gini'
        )
        return parent_gini - child_gini

def find_best_split(X, y, metric='gini'):
    """
    Find the best feature and threshold for splitting.

    Parameters:
        X (pd.DataFrame or np.array): Feature matrix
        y (array-like): Target labels
        metric (str): Either 'gini' or 'entropy'

    Returns:
        dict: Best split information
    """
    best_gain = -1
    best_feature = None
    best_threshold = None
    n_features = X.shape[1]

    for feature_idx in range(n_features):
        feature_values = X.iloc[:, feature_idx] if isinstance(X, pd.DataFrame) else X[:, feature_idx]
        thresholds = np.unique(feature_values)
        for threshold in thresholds:
            # Split data
            left_mask = feature_values <= threshold
            right_mask = ~left_mask
            left_labels = y[left_mask]
            right_labels = y[right_mask]
            # Skip if split creates empty node
            if len(left_labels) == 0 or len(right_labels) == 0:
                continue
            # Calculate gain
            if metric == 'gini':
                gain = DecisionTreeMetrics.gini_gain(y, left_labels, right_labels)
            else:
                gain = DecisionTreeMetrics.information_gain(y, left_labels, right_labels)
            # Update best split
            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                best_threshold = threshold

    return {
        'feature': best_feature,
        'threshold': best_threshold,
        'gain': best_gain
    }

# Example usage with a real dataset
# Creating a simple dataset
np.random.seed(42)
X = pd.DataFrame({
    'height': np.random.randint(50, 200, 100),
    'weight': np.random.randint(30, 100, 100),
    'age': np.random.randint(1, 15, 100)
})

# Create labels based on simple rules (for demonstration)
y = np.array(['Cat' if (x['height'] < 100 and x['weight'] < 60)
              else 'Dog' for _, x in X.iterrows()])

# Find best split using Gini
print("Using Gini Impurity:")
best_split_gini = find_best_split(X, y, metric='gini')
print(f"Best feature: {X.columns[best_split_gini['feature']]}")
print(f"Best threshold: {best_split_gini['threshold']}")
print(f"Gini gain: {best_split_gini['gain']:.4f}")

print("\nUsing Information Gain:")
best_split_ig = find_best_split(X, y, metric='entropy')
print(f"Best feature: {X.columns[best_split_ig['feature']]}")
print(f"Best threshold: {best_split_ig['threshold']}")
print(f"Information gain: {best_split_ig['gain']:.4f}")
Using scikit-learn’s decision tree implementation
In practice, you’ll often use established libraries like scikit-learn. Here’s how to specify different criteria:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
# Decision tree with Gini impurity
dt_gini = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_gini.fit(X_train, y_train)
y_pred_gini = dt_gini.predict(X_test)
# Decision tree with Entropy (Information Gain)
dt_entropy = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt_entropy.fit(X_train, y_train)
y_pred_entropy = dt_entropy.predict(X_test)
# Compare performance
print(f"Accuracy with Gini: {accuracy_score(y_test, y_pred_gini):.4f}")
print(f"Accuracy with Entropy: {accuracy_score(y_test, y_pred_entropy):.4f}")
print(f"\nTree depth with Gini: {dt_gini.get_depth()}")
print(f"Tree depth with Entropy: {dt_entropy.get_depth()}")
print(f"\nNumber of leaves with Gini: {dt_gini.get_n_leaves()}")
print(f"Number of leaves with Entropy: {dt_entropy.get_n_leaves()}")
Feature importance based on impurity reduction
One of the valuable byproducts of using these metrics is that we can calculate feature importance. Features that lead to larger reductions in impurity across the tree are considered more important:
import matplotlib.pyplot as plt
# Get feature importances
feature_names = iris.feature_names
importances_gini = dt_gini.feature_importances_
importances_entropy = dt_entropy.feature_importances_
# Create comparison
feature_comparison = pd.DataFrame({
    'Feature': feature_names,
    'Gini Importance': importances_gini,
    'Entropy Importance': importances_entropy
})
print("\nFeature Importance Comparison:")
print(feature_comparison.round(4))
This implementation demonstrates how gini impurity and information gain are used in practice for feature selection and how they ultimately lead to building effective decision tree models for classification tasks.
7. Conclusion
Understanding gini impurity and information gain is fundamental to mastering decision tree algorithms and building effective classification models. While both metrics serve the same purpose—identifying the best features for splitting data—they approach the problem from slightly different mathematical perspectives. Gini impurity measures the probability of misclassification, while information gain quantifies the reduction in entropy or uncertainty.
In practice, the choice between using gini index or entropy-based information gain rarely makes a dramatic difference in model performance. Gini impurity is computationally more efficient and works well as the default choice in most decision tree implementations. However, understanding both concepts gives you the flexibility to experiment and the knowledge to make informed decisions when optimizing your decision tree models. Whether you’re working with simple binary classification or complex multi-class problems, these fundamental metrics will guide your feature selection process and help build more accurate and interpretable AI systems.