Regression Trees: Predicting Continuous Values with Trees
Machine learning offers numerous approaches to predicting continuous numerical values, and among these, tree-based methods stand out for their interpretability and effectiveness. While decision trees are often associated with classification problems, they handle regression tasks just as well. A regression tree provides an intuitive, visual approach to understanding how different features contribute to predicting continuous outcomes, from house prices to temperature forecasts.
Unlike their classification counterparts that predict discrete categories, decision tree regressors partition the feature space into regions and assign a continuous value to each region. This fundamental difference makes regression trees particularly valuable when you need both accurate predictions and transparent reasoning behind those predictions.

1. Understanding regression trees
What is a regression tree?
A regression tree is a decision tree designed specifically for predicting continuous numerical values rather than discrete classes. While the structure resembles classification trees—with nodes, branches, and leaves—the key difference lies in what happens at the leaf nodes. Instead of voting for a class label, each leaf in a decision tree regression model contains a numerical value representing the prediction for all samples that reach that leaf.
The tree works by recursively splitting the dataset based on feature values, creating a hierarchical structure of decisions. At each internal node, the algorithm asks a question about a feature (e.g., “Is the house size greater than 1,500 square feet?”). Based on the answer, samples flow down to the left or right child node. This process continues until samples reach a leaf node, where they receive their final predicted value.
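To make the mechanics concrete, here is a minimal hand-written sketch of such a traversal: a toy two-level tree with made-up thresholds and leaf values (not produced by any training algorithm), showing how threshold questions route a sample to a numeric prediction.
def predict_house_price(size_sqft, age_years):
    """Toy hand-built regression tree: each internal node asks a threshold
    question, and each leaf returns the average price of the (hypothetical)
    training houses that landed there."""
    if size_sqft > 1500:            # root question
        if age_years < 20:          # question in the right subtree
            return 420_000          # leaf: large, newer houses
        return 350_000              # leaf: large, older houses
    return 210_000                  # leaf: smaller houses

print(predict_house_price(1800, 10))  # -> 420000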
How regression trees differ from classification trees
The primary distinction between regression and classification trees lies in their objectives and output. A classification tree aims to minimize impurity measures like Gini index or entropy, predicting discrete class labels. In contrast, a regression tree minimizes error measures related to continuous values, typically mean squared error (MSE).
At the leaf nodes, classification trees use majority voting to determine the predicted class, while regression trees calculate the mean (or sometimes median) of the target values for all training samples in that leaf. This mean becomes the prediction for any new sample that falls into that region of the feature space.
The splitting criteria also differ. While classification uses information gain or Gini impurity, decision tree regression employs variance reduction or MSE reduction to determine optimal splits. The goal is to create partitions where samples within each region have similar target values, minimizing the prediction error.
The intuition behind recursive partitioning
Imagine you’re trying to predict house prices based on features like size, location, and age. Instead of finding a complex mathematical formula, a regression tree asks a series of simple yes/no questions. First, it might ask: “Is the house larger than 2,000 square feet?” Houses answering “yes” tend to be more expensive, so they go to the right branch. Houses answering “no” go left.
For the expensive houses on the right, the tree asks another question: “Is it in a premium neighborhood?” This creates further subdivisions. The process continues, creating increasingly homogeneous groups where houses have similar prices. Eventually, you reach leaf nodes containing houses with very similar characteristics, and the average price of these houses becomes your prediction.
This recursive partitioning elegantly handles non-linear relationships and interactions between features without requiring you to specify them explicitly. If size matters more for older houses than newer ones, the tree naturally discovers this by creating appropriate splits.
2. The mathematics of decision tree regression
Splitting criteria and cost functions
The foundation of building a regression tree lies in choosing optimal splits. At each node, the algorithm evaluates all possible splits across all features to find the one that best reduces prediction error. The most common criterion is mean squared error (MSE) reduction.
For a node containing a set of samples \( S \) with target values \( y_i \), the MSE is calculated as:
$$ MSE = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y})^2 $$
where \( \bar{y} \) is the mean of the target values in set \( S \).
When considering a split that divides \( S \) into left and right subsets \( S_{left} \) and \( S_{right} \), the quality of the split is measured by the weighted MSE reduction:
$$ \Delta MSE = MSE(S) - \frac{|S_{left}|}{|S|} MSE(S_{left}) - \frac{|S_{right}|}{|S|} MSE(S_{right}) $$
The split that maximizes \( \Delta MSE \) is chosen. This ensures that the resulting child nodes are more homogeneous in their target values than the parent node.
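As a quick numeric illustration of these formulas, the sketch below evaluates the weighted MSE reduction for one candidate split on a small, made-up set of target values (the numbers are arbitrary):
import numpy as np

def mse(y):
    """Mean squared error of y around its own mean."""
    return np.mean((y - np.mean(y)) ** 2)

# Made-up target values at a node, and one candidate split of them
y_parent = np.array([200, 220, 250, 400, 420, 450], dtype=float)
y_left, y_right = y_parent[:3], y_parent[3:]

# Weighted MSE reduction, exactly as in the formula above
delta = mse(y_parent) \
    - (len(y_left) / len(y_parent)) * mse(y_left) \
    - (len(y_right) / len(y_parent)) * mse(y_right)
print(f"MSE reduction for this split: {delta:.2f}")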
Variance reduction
An alternative but mathematically equivalent approach uses variance as the splitting criterion. Since variance measures the spread of values around their mean, minimizing variance creates more uniform groups. The variance for a set of samples is:
$$ Var(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y})^2 $$
This is identical to MSE, making variance reduction and MSE minimization equivalent objectives. Some implementations use the term “variance reduction” to emphasize the statistical interpretation of creating homogeneous partitions.
Prediction at leaf nodes
Once the tree is fully grown, making predictions is straightforward. For a new sample, you traverse the tree from root to leaf by answering the questions at each node. When you reach a leaf, the prediction is simply the mean of all training samples that fell into that leaf:
$$ \hat{y} = \frac{1}{|S_{leaf}|} \sum_{i \in S_{leaf}} y_i $$
This mean minimizes the squared error for all samples in that leaf. Some implementations offer alternatives like using the median, which is more robust to outliers, though the mean is standard practice.
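A quick numerical check on arbitrary leaf values confirms that claim: predicting the mean gives a smaller sum of squared errors than any other constant, including the median (which instead minimizes absolute error and is less sensitive to the outlier below):
import numpy as np

y_leaf = np.array([180, 200, 210, 240, 500], dtype=float)  # arbitrary leaf targets

def sse(constant):
    """Sum of squared errors if every sample in the leaf is predicted as `constant`."""
    return np.sum((y_leaf - constant) ** 2)

print(f"SSE using the mean   ({np.mean(y_leaf):.0f}): {sse(np.mean(y_leaf)):.0f}")
print(f"SSE using the median ({np.median(y_leaf):.0f}): {sse(np.median(y_leaf)):.0f}")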
3. Building a regression tree with CART
The CART algorithm
Classification and Regression Trees (CART) is the most widely used algorithm for building decision tree regressors. Developed as a unified framework, CART handles both classification and regression through a greedy, top-down recursive approach.
The algorithm starts with all training data at the root node and considers all possible binary splits for each feature. For continuous features, it evaluates thresholds like “feature ≤ value.” For categorical features, it considers groupings of categories. The split that produces the largest MSE reduction is selected.
After splitting, the algorithm recursively applies the same process to each child node, treating it as a new root for its subtree. This continues until a stopping criterion is met: reaching a maximum depth, having too few samples to split, or achieving sufficiently low error.
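The following sketch mimics what CART does at a single node for one continuous feature: it tries every midpoint between consecutive sorted feature values as a threshold and keeps the one with the largest MSE reduction. The data is synthetic, and a real implementation would repeat this search for every feature and then recurse into the child nodes.
import numpy as np

def best_split_1d(x, y):
    """Greedy threshold search for one continuous feature."""
    mse = lambda v: np.mean((v - v.mean()) ** 2)
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_thr, best_gain = None, 0.0
    for i in range(1, len(x_sorted)):
        thr = (x_sorted[i - 1] + x_sorted[i]) / 2       # candidate threshold
        left, right = y_sorted[:i], y_sorted[i:]
        gain = mse(y) - len(left) / len(y) * mse(left) - len(right) / len(y) * mse(right)
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = np.where(x > 5, 300.0, 150.0) + rng.normal(0, 10, 50)   # step-shaped target
print(best_split_1d(x, y))  # the chosen threshold should land near 5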
Implementation in Python with scikit-learn
Let’s implement a practical regression tree example using scikit-learn. We’ll predict housing prices based on various features:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing
# Load the California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create and train the regression tree
regressor = DecisionTreeRegressor(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42
)
regressor.fit(X_train, y_train)
# Make predictions
y_pred = regressor.predict(X_test)
# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
This code demonstrates the essential workflow: loading data, splitting it, training a decision tree regressor, and evaluating its performance. The hyperparameters control tree complexity to prevent overfitting.
Controlling tree complexity
Regression trees can easily overfit by growing too deep and memorizing training data. Several hyperparameters control complexity:
max_depth: Limits how many levels the tree can have. Deeper trees capture more complex patterns but risk overfitting.
min_samples_split: The minimum number of samples required to split a node. Higher values prevent splits on small subsets.
min_samples_leaf: The minimum number of samples required in a leaf node. This ensures predictions are based on sufficient data.
max_leaf_nodes: Directly limits the total number of leaf nodes, providing explicit control over model complexity.
Here’s how different depths affect predictions:
# Compare different tree depths
depths = [2, 5, 10, None]
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for idx, depth in enumerate(depths):
    ax = axes[idx // 2, idx % 2]
    # Train model with specific depth
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_train[:, 0].reshape(-1, 1), y_train)
    # Create predictions for plotting
    X_plot = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 300)
    y_plot = model.predict(X_plot.reshape(-1, 1))
    # Plot
    ax.scatter(X_train[:, 0], y_train, alpha=0.3, s=10)
    ax.plot(X_plot, y_plot, color='red', linewidth=2)
    ax.set_title(f'Depth = {depth if depth else "Unlimited"}')
    ax.set_xlabel('Median Income')
    ax.set_ylabel('House Price')
plt.tight_layout()
plt.show()
This visualization reveals how tree depth shapes the fitted prediction function. Shallow trees create simple, stepwise predictions, while deeper trees produce more granular, potentially overfitted patterns.
4. Random forest regression: ensemble power
From single trees to forests
While individual regression trees offer interpretability, they suffer from high variance—small changes in training data can produce very different trees. Random forest regression addresses this weakness through ensemble learning, combining multiple decision tree regressors to create more robust predictions.
A random forest regressor builds numerous trees, each trained on a random subset of the data (bootstrap sampling) and considering only random subsets of features at each split. This deliberate injection of randomness creates diverse trees that make different errors. By averaging their predictions, the forest reduces variance while maintaining low bias.
The prediction from a random forest with \( M \) trees is simply:
$$ \hat{y} = \frac{1}{M} \sum_{m=1}^{M} \hat{y}_m $$
where \( \hat{y}_m \) is the prediction from the \( m \)-th tree.
Implementing random forest regression
Building on our previous example, let’s implement a random forest regressor:
from sklearn.ensemble import RandomForestRegressor
# Create and train random forest
rf_regressor = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1
)
rf_regressor.fit(X_train, y_train)
# Make predictions
rf_pred = rf_regressor.predict(X_test)
# Evaluate
rf_mse = mean_squared_error(y_test, rf_pred)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_pred)
print(f"Random Forest RMSE: {rf_rmse:.4f}")
print(f"Random Forest R² Score: {rf_r2:.4f}")
# Feature importance
importances = rf_regressor.feature_importances_
feature_names = housing.feature_names
indices = np.argsort(importances)[::-1]
print("\nFeature Importances:")
for i in range(len(importances)):
    print(f"{feature_names[indices[i]]}: {importances[indices[i]]:.4f}")
Random forests typically outperform single trees, especially on complex datasets. The n_estimators parameter controls how many trees to build—more trees generally improve performance but increase computation time.
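To connect this back to the averaging formula in the previous subsection, a short check (reusing rf_regressor and X_test from the snippets above) confirms that the forest's output is just the mean of its individual trees' predictions:
# Average the per-tree predictions manually and compare with the forest's output
tree_preds = np.stack([tree.predict(X_test) for tree in rf_regressor.estimators_])
manual_average = tree_preds.mean(axis=0)
print(np.allclose(manual_average, rf_regressor.predict(X_test)))  # expected: True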
Advantages of ensemble methods
Random forest regression offers several compelling advantages over single decision tree regressors:
Reduced overfitting: By averaging multiple trees, random forests smooth out individual tree peculiarities and reduce variance. This makes them much more robust to noise in training data.
Feature importance: Forests naturally provide feature importance scores by measuring how much each feature decreases MSE across all trees. This helps identify which variables matter most for predictions.
Out-of-bag evaluation: Since each tree is trained on a bootstrap sample (roughly 63% of data), the remaining 37% serves as a validation set. This provides honest performance estimates without a separate validation set.
Parallelization: Each tree trains independently, allowing efficient parallel computation across multiple processors.
The tradeoff is reduced interpretability—while a single regression tree can be visualized and understood, a forest of 100+ trees becomes a “black box.” However, techniques like feature importance and partial dependence plots help maintain some interpretability.
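To illustrate the out-of-bag evaluation point above, a minimal sketch (reusing X_train and y_train from earlier) simply enables oob_score when building the forest; scikit-learn then reports an R² estimate computed only from samples each tree never saw:
from sklearn.ensemble import RandomForestRegressor

oob_forest = RandomForestRegressor(
    n_estimators=200,
    oob_score=True,      # score each sample using only trees that did not train on it
    random_state=42,
    n_jobs=-1
)
oob_forest.fit(X_train, y_train)
print(f"Out-of-bag R² estimate: {oob_forest.oob_score_:.4f}")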
5. Practical considerations and best practices
Feature scaling and preprocessing
Unlike many machine learning algorithms, decision tree regression and random forest regressors do not require feature scaling. Since trees make decisions based on threshold comparisons (e.g., “Is age > 30?”), the scale of features is irrelevant. A feature measured in meters works identically to one measured in kilometers after adjusting the threshold.
However, preprocessing can still be beneficial:
Handling missing values: Trees can handle missing data through surrogate splits, but explicitly imputing missing values often improves performance. Mean/median imputation or more sophisticated methods like KNN imputation work well.
Encoding categorical variables: Convert categorical features to numerical representations. One-hot encoding works but can create many features. For high-cardinality categories, target encoding or embedding methods may be preferable.
Outlier treatment: Regression trees are relatively robust to outliers since they partition data into regions. However, extreme outliers can still create unusual splits. Consider capping or transforming extreme values.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Example preprocessing pipeline (column positions are illustrative)
numeric_features = [0, 1, 2, 3]
categorical_features = [4, 5]
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# Combine with regressor
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42))
])
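As a usage sketch, the pipeline is fitted and applied like any other estimator. The tiny DataFrame below is hypothetical, invented only to match the column positions assumed above (four numeric columns followed by two categorical ones):
import pandas as pd

X_mixed = pd.DataFrame({
    0: [1200, 1500, None, 2200],             # e.g. size in square feet (with a gap)
    1: [3, 4, 2, 5], 2: [1, 2, 1, 3], 3: [10, 25, 40, 5],
    4: ['urban', 'rural', 'urban', np.nan],  # categorical, with a missing entry
    5: ['A', 'B', 'A', 'C'],
})
y_mixed = [210_000, 250_000, 180_000, 400_000]

model_pipeline.fit(X_mixed, y_mixed)
print(model_pipeline.predict(X_mixed))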
Hyperparameter tuning strategies
Optimizing hyperparameters significantly impacts performance. Key parameters to tune for decision tree regressors include:
For individual trees:
- max_depth: Start with 5-10, increase if underfitting
- min_samples_split: Try 10-50 for large datasets
- min_samples_leaf: Try 5-20 to prevent overfitting
- max_features: Consider feature subsets to reduce correlation
For random forest regression:
- n_estimators: More is usually better (100-500), with diminishing returns
- max_depth: Often deeper than single trees (10-30)
- min_samples_split and min_samples_leaf: Similar to single trees
- max_features: 'sqrt' or 'log2' work well for most problems
Use grid search or randomized search to explore hyperparameter space:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
# Define parameter distribution
param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(5, 30),
    'min_samples_split': randint(5, 50),
    'min_samples_leaf': randint(2, 20),
    'max_features': ['sqrt', 'log2', None]
}
# Randomized search
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,
    cv=5,
    scoring='neg_mean_squared_error',
    random_state=42,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {-random_search.best_score_:.4f}")
Interpreting model predictions
Despite their power, tree-based models can be interpreted through several techniques:
Feature importance: Shows which features contribute most to predictions across all splits. Useful for understanding which variables drive your model.
Partial dependence plots: Visualize how predictions change as a single feature varies while others remain fixed. This reveals non-linear relationships captured by the model.
Individual tree inspection: For single decision tree regressors, you can visualize the entire tree structure and trace prediction paths.
from sklearn.tree import plot_tree
from sklearn.inspection import PartialDependenceDisplay
# Visualize a single tree
plt.figure(figsize=(20, 10))
plot_tree(
    regressor,
    feature_names=housing.feature_names,
    filled=True,
    rounded=True,
    fontsize=10
)
plt.show()
# Partial dependence plots
features = [0, 2]  # MedInc and AveRooms
PartialDependenceDisplay.from_estimator(
    rf_regressor,
    X_train,
    features,
    feature_names=housing.feature_names
)
plt.show()
6. Common pitfalls and how to avoid them
Overfitting and underfitting
Overfitting occurs when a regression tree becomes too complex, memorizing training data rather than learning general patterns. Signs include excellent training performance but poor test performance. Deep trees with many leaves are especially prone to overfitting.
Prevention strategies:
- Limit tree depth with max_depth
- Require minimum samples per split (min_samples_split)
- Require minimum samples per leaf (min_samples_leaf)
- Use pruning techniques (post-pruning or cost-complexity pruning; see the sketch after this list)
- Employ ensemble methods like random forest regression
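For the pruning item in the list above, scikit-learn implements cost-complexity post-pruning through the ccp_alpha parameter. A minimal sketch, reusing the earlier train/test split: compute the pruning path on the training data, then refit with a few of the resulting alphas and watch the tree shrink.
# Effective alphas of the cost-complexity pruning path on the training data
path = DecisionTreeRegressor(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Refit with a handful of alphas; larger alpha means more aggressive pruning
for alpha in ccp_alphas[::max(1, len(ccp_alphas) // 5)]:
    pruned = DecisionTreeRegressor(random_state=42, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test R²={pruned.score(X_test, y_test):.3f}")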
Underfitting happens when the tree is too simple to capture underlying patterns. The model performs poorly on both training and test data. Shallow trees often underfit complex relationships.
Prevention strategies:
- Increase max_depth gradually
- Reduce the min_samples_split and min_samples_leaf constraints
- Ensure sufficient training data
- Verify feature quality and engineering
Handling imbalanced targets
While imbalance is typically discussed for classification, regression can face similar challenges when target values are unevenly distributed. If most houses cost between $200K-$300K but a few cost $1M+, the model may underpredict expensive houses.
Strategies to address this:
Sample weighting: Assign higher weights to underrepresented target ranges, making the model pay more attention to rare values.
Stratified sampling: When splitting data, ensure all target ranges are represented in training and test sets.
Transform targets: Apply logarithmic or other transformations to compress the range of target values, then inverse transform predictions.
# Log transform targets
y_train_log = np.log1p(y_train)
y_test_log = np.log1p(y_test)
# Train on transformed targets
regressor.fit(X_train, y_train_log)
# Inverse transform predictions
y_pred_log = regressor.predict(X_test)
y_pred = np.expm1(y_pred_log)
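For the sample-weighting strategy described above, tree and forest regressors in scikit-learn accept a sample_weight argument in fit. One possible (entirely heuristic) weighting, reusing the earlier training data, upweights the rarest, most expensive houses:
# Give samples in the top 10% of target values five times the weight of the rest
weights = np.where(y_train > np.quantile(y_train, 0.9), 5.0, 1.0)

weighted_tree = DecisionTreeRegressor(max_depth=5, random_state=42)
weighted_tree.fit(X_train, y_train, sample_weight=weights)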
Extrapolation limitations
A critical limitation of regression trees is their inability to extrapolate beyond the range of training data. Since predictions are averages of training samples in leaf nodes, the model cannot predict values outside the training target range.
For example, if you train on houses priced $100K-$500K, the tree cannot predict a $600K house accurately: its output is capped at the largest leaf average learned during training, which can never exceed the highest price in the training set.
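A small synthetic experiment makes this limitation visible: a tree fitted on y = x for x between 0 and 10 cannot predict above the largest training target, no matter how far the input lies outside the training range.
# Train on a simple linear relationship restricted to x in [0, 10]
x_demo = np.linspace(0, 10, 200).reshape(-1, 1)
y_demo = x_demo.ravel()   # y = x, so training targets never exceed 10

demo_tree = DecisionTreeRegressor(random_state=42).fit(x_demo, y_demo)
print(demo_tree.predict([[25.0]]))  # stays at about 10, not 25: no extrapolation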
Handling extrapolation:
- Ensure training data covers the full range of expected prediction scenarios
- Use domain knowledge to identify when predictions may require extrapolation
- Consider alternative models (linear regression, neural networks) if extrapolation is necessary
- Implement prediction confidence bounds to flag when inputs fall outside training distribution
7. When to use regression trees
Ideal use cases
Regression trees and random forest regressors excel in specific scenarios:
Interpretability matters: When stakeholders need to understand prediction logic, decision tree regression offers clear, human-readable rules. Each prediction path tells a story.
Mixed feature types: Trees naturally handle both numerical and categorical features without extensive preprocessing. They automatically discover relevant categories and thresholds.
Non-linear relationships: When features interact in complex, non-linear ways (e.g., location matters more for large houses), trees discover these patterns automatically without manual feature engineering.
Robust to outliers: Unlike linear models where outliers heavily influence the entire model, trees isolate outliers in specific branches, limiting their impact.
Missing data: Trees handle missing values gracefully through surrogate splits, requiring less imputation than many alternatives.
Feature selection: Built-in feature importance helps identify which variables matter, useful for both modeling and domain understanding.
Comparison with other regression methods
Linear regression: Simple and interpretable with coefficients showing feature impact. However, assumes linear relationships and requires careful feature engineering for interactions. Use when relationships are approximately linear and interpretability through coefficients is valuable.
Support Vector Regression: Excellent for high-dimensional data and can model non-linear relationships through kernels. More computationally expensive than trees and less interpretable. Choose for complex, high-dimensional problems where accuracy trumps interpretability.
Neural networks: Extremely flexible, handling any non-linear relationship given sufficient data. Require large datasets, extensive tuning, and offer minimal interpretability. Use for massive datasets with complex patterns where computational resources are available.
Gradient boosting (XGBoost, LightGBM): Often achieves the best predictive performance by iteratively building trees to correct previous errors. More complex to tune and interpret than random forests. Choose when maximizing accuracy is paramount and computational resources permit.
CART regression trees: Best for interpretability, moderate-sized datasets, and when feature interactions matter. Random forest regressors extend this with better accuracy at the cost of some interpretability.
The choice depends on your priorities: interpretability versus accuracy, dataset size, computational constraints, and the nature of feature relationships.
8. Conclusion
Regression trees represent an elegant and powerful approach to predicting continuous values through an intuitive tree structure. By recursively partitioning the feature space and averaging target values within regions, decision tree regressors create interpretable models that naturally handle non-linear relationships, feature interactions, and mixed data types. Whether using individual trees for maximum interpretability or random forest regression for enhanced predictive power, these methods provide valuable tools for diverse regression challenges.
The journey from understanding basic CART algorithms to implementing sophisticated random forest regressors equips you with practical skills for real-world machine learning problems. While regression trees have limitations—particularly in extrapolation and potential overfitting—careful hyperparameter tuning and ensemble methods mitigate these concerns. As you apply these techniques to your own problems, remember that the best model balances predictive accuracy with interpretability, matching your specific requirements and constraints.