
Regression Trees: Predicting Continuous Values with Trees

Machine learning offers numerous approaches to predict continuous numerical values, and among these, tree-based methods stand out for their interpretability and effectiveness. While decision trees are often associated with classification problems, they handle regression tasks just as well. A regression tree provides an intuitive, visual approach to understanding how different features contribute to predicting continuous outcomes, from house prices to temperature forecasts.

Unlike their classification counterparts that predict discrete categories, decision tree regressors partition the feature space into regions and assign a continuous value to each region. This fundamental difference makes regression trees particularly valuable when you need both accurate predictions and transparent reasoning behind those predictions. 


1. Understanding regression trees

What is a regression tree?

A regression tree is a decision tree designed specifically for predicting continuous numerical values rather than discrete classes. While the structure resembles classification trees—with nodes, branches, and leaves—the key difference lies in what happens at the leaf nodes. Instead of voting for a class label, each leaf in a decision tree regression model contains a numerical value representing the prediction for all samples that reach that leaf.

The tree works by recursively splitting the dataset based on feature values, creating a hierarchical structure of decisions. At each internal node, the algorithm asks a question about a feature (e.g., “Is the house size greater than 1,500 square feet?”). Based on the answer, samples flow down to the left or right child node. This process continues until samples reach a leaf node, where they receive their final predicted value.

How regression trees differ from classification trees

The primary distinction between regression and classification trees lies in their objectives and output. A classification tree aims to minimize impurity measures like Gini index or entropy, predicting discrete class labels. In contrast, a regression tree minimizes error measures related to continuous values, typically mean squared error (MSE).

At the leaf nodes, classification trees use majority voting to determine the predicted class, while regression trees calculate the mean (or sometimes median) of the target values for all training samples in that leaf. This mean becomes the prediction for any new sample that falls into that region of the feature space.

The splitting criteria also differ. While classification uses information gain or Gini impurity, decision tree regression employs variance reduction or MSE reduction to determine optimal splits. The goal is to create partitions where samples within each region have similar target values, minimizing the prediction error.

The intuition behind recursive partitioning

Imagine you’re trying to predict house prices based on features like size, location, and age. Instead of finding a complex mathematical formula, a regression tree asks a series of simple yes/no questions. First, it might ask: “Is the house larger than 2,000 square feet?” Houses answering “yes” tend to be more expensive, so they go to the right branch. Houses answering “no” go left.

For the expensive houses on the right, the tree asks another question: “Is it in a premium neighborhood?” This creates further subdivisions. The process continues, creating increasingly homogeneous groups where houses have similar prices. Eventually, you reach leaf nodes containing houses with very similar characteristics, and the average price of these houses becomes your prediction.

This recursive partitioning elegantly handles non-linear relationships and interactions between features without requiring you to specify them explicitly. If size matters more for older houses than newer ones, the tree naturally discovers this by creating appropriate splits.

2. The mathematics of decision tree regression

Splitting criteria and cost functions

The foundation of building a regression tree lies in choosing optimal splits. At each node, the algorithm evaluates all possible splits across all features to find the one that best reduces prediction error. The most common criterion is mean squared error (MSE) reduction.

For a node containing a set of samples \( S \) with target values \( y_i \), the MSE is calculated as:

$$ MSE = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y})^2 $$

where \( \bar{y} \) is the mean of the target values in set \( S \).

When considering a split that divides \( S \) into left and right subsets \( S_{left} \) and \( S_{right} \), the quality of the split is measured by the weighted MSE reduction:

$$ \Delta MSE = MSE(S) - \frac{|S_{left}|}{|S|} MSE(S_{left}) - \frac{|S_{right}|}{|S|} MSE(S_{right}) $$

The split that maximizes \( \Delta MSE \) is chosen. This ensures that the resulting child nodes are more homogeneous in their target values than the parent node.
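
To make this concrete, here is a small NumPy sketch that computes the weighted MSE reduction for one candidate split. The function names and the toy house-price data are illustrative, not taken from a real dataset:

import numpy as np

def mse(values):
    # Mean squared error of a set of target values around their own mean
    return np.mean((values - values.mean()) ** 2)

def mse_reduction(y_parent, left_mask):
    # Weighted MSE reduction achieved by splitting y_parent according to left_mask
    y_left, y_right = y_parent[left_mask], y_parent[~left_mask]
    n, n_left, n_right = len(y_parent), len(y_left), len(y_right)
    return mse(y_parent) - (n_left / n) * mse(y_left) - (n_right / n) * mse(y_right)

# Toy example: evaluate the split "house size <= 1,500 sq ft"
sizes = np.array([900, 1200, 1400, 1600, 2000, 2400])
prices = np.array([110, 140, 150, 220, 260, 300])  # prices in $1,000s
print(mse_reduction(prices, sizes <= 1500))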

Variance reduction

An alternative but mathematically equivalent approach uses variance as the splitting criterion. Since variance measures the spread of values around their mean, minimizing variance creates more uniform groups. The variance for a set of samples is:

$$ Var(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y})^2 $$

This is identical to MSE, making variance reduction and MSE minimization equivalent objectives. Some implementations use the term “variance reduction” to emphasize the statistical interpretation of creating homogeneous partitions.

Prediction at leaf nodes

Once the tree is fully grown, making predictions is straightforward. For a new sample, you traverse the tree from root to leaf by answering the questions at each node. When you reach a leaf, the prediction is simply the mean of all training samples that fell into that leaf:

$$ \hat{y} = \frac{1}{|S_{leaf}|} \sum_{i \in S_{leaf}} y_i $$

This mean minimizes the squared error for all samples in that leaf. Some implementations offer alternatives like using the median, which is more robust to outliers, though the mean is standard practice.
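
As a quick numerical illustration (toy values, not from a real dataset), the leaf prediction is simply the mean of the training targets that landed in that leaf, while the median is the more outlier-resistant alternative mentioned above:

leaf_targets = np.array([210, 225, 198, 240, 1200])  # one extreme outlier
print(np.mean(leaf_targets))    # standard leaf prediction, pulled upward by the outlier
print(np.median(leaf_targets))  # robust alternative offered by some implementations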

3. Building a regression tree with CART

The CART algorithm

Classification and Regression Trees (CART) is the most widely used algorithm for building decision tree regressors. Developed as a unified framework, CART handles both classification and regression through a greedy, top-down recursive approach.

The algorithm starts with all training data at the root node and considers all possible binary splits for each feature. For continuous features, it evaluates thresholds like “feature ≤ value.” For categorical features, it considers groupings of categories. The split that produces the largest MSE reduction is selected.

After splitting, the algorithm recursively applies the same process to each child node, treating it as a new root for its subtree. This continues until a stopping criterion is met: reaching a maximum depth, having too few samples to split, or achieving sufficiently low error.
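
The sketch below, which reuses the mse_reduction helper and the toy sizes/prices arrays from the earlier snippet, shows the greedy search over candidate thresholds for a single continuous feature. This is the core step that CART repeats at every node and for every feature:

def best_split_for_feature(x, y):
    # Return the threshold on feature x with the largest MSE reduction
    best_threshold, best_gain = None, 0.0
    # Candidate thresholds: midpoints between consecutive distinct feature values
    values = np.unique(x)
    for lo, hi in zip(values[:-1], values[1:]):
        threshold = (lo + hi) / 2
        gain = mse_reduction(y, x <= threshold)
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

print(best_split_for_feature(sizes, prices))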

Implementation in Python with scikit-learn

Let’s implement a practical regression tree example using scikit-learn. We’ll predict housing prices based on various features:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the regression tree
regressor = DecisionTreeRegressor(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42
)
regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")

This code demonstrates the essential workflow: loading data, splitting it, training a decision tree regressor, and evaluating its performance. The hyperparameters control tree complexity to prevent overfitting.

Controlling tree complexity

Regression trees can easily overfit by growing too deep and memorizing training data. Several hyperparameters control complexity:

max_depth: Limits how many levels the tree can have. Deeper trees capture more complex patterns but risk overfitting.

min_samples_split: The minimum number of samples required to split a node. Higher values prevent splits on small subsets.

min_samples_leaf: The minimum number of samples required in a leaf node. This ensures predictions are based on sufficient data.

max_leaf_nodes: Directly limits the total number of leaf nodes, providing explicit control over model complexity.

Here’s how different depths affect predictions:

# Compare different tree depths
depths = [2, 5, 10, None]
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for idx, depth in enumerate(depths):
    ax = axes[idx // 2, idx % 2]
    
    # Train model with the specified depth on a single feature (median income)
    # so the fitted function can be visualized in one dimension
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_train[:, 0].reshape(-1, 1), y_train)
    
    # Create predictions for plotting
    X_plot = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 300)
    y_plot = model.predict(X_plot.reshape(-1, 1))
    
    # Plot
    ax.scatter(X_train[:, 0], y_train, alpha=0.3, s=10)
    ax.plot(X_plot, y_plot, color='red', linewidth=2)
    ax.set_title(f'Depth = {depth if depth else "Unlimited"}')
    ax.set_xlabel('Median Income')
    ax.set_ylabel('House Price')

plt.tight_layout()
plt.show()

This visualization reveals how tree depth shapes the fitted prediction function. Shallow trees create simple, stepwise predictions, while deeper trees produce more granular, potentially overfitted patterns.

4. Random forest regression: ensemble power

From single trees to forests

While individual regression trees offer interpretability, they suffer from high variance—small changes in training data can produce very different trees. Random forest regression addresses this weakness through ensemble learning, combining multiple decision tree regressors to create more robust predictions.

A random forest regressor builds numerous trees, each trained on a random subset of the data (bootstrap sampling) and considering only random subsets of features at each split. This deliberate injection of randomness creates diverse trees that make different errors. By averaging their predictions, the forest reduces variance while maintaining low bias.

The prediction from a random forest with \( M \) trees is simply:

$$ \hat{y} = \frac{1}{M} \sum_{m=1}^{M} \hat{y}_m $$

where \( \hat{y}_m \) is the prediction from the \( m \)-th tree.

Implementing random forest regression

Building on our previous example, let’s implement a random forest regressor:

from sklearn.ensemble import RandomForestRegressor

# Create and train random forest
rf_regressor = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1
)
rf_regressor.fit(X_train, y_train)

# Make predictions
rf_pred = rf_regressor.predict(X_test)

# Evaluate
rf_mse = mean_squared_error(y_test, rf_pred)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_pred)

print(f"Random Forest RMSE: {rf_rmse:.4f}")
print(f"Random Forest R² Score: {rf_r2:.4f}")

# Feature importance
importances = rf_regressor.feature_importances_
feature_names = housing.feature_names
indices = np.argsort(importances)[::-1]

print("\nFeature Importances:")
for i in range(len(importances)):
    print(f"{feature_names[indices[i]]}: {importances[indices[i]]:.4f}")

Random forests typically outperform single trees, especially on complex datasets. The n_estimators parameter controls how many trees to build—more trees generally improve performance but increase computation time.
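
The averaging formula from earlier can be checked directly: scikit-learn exposes the fitted trees through the estimators_ attribute, and averaging their individual predictions should reproduce the forest's output. A quick sanity check, assuming the rf_regressor trained above:

# Average the per-tree predictions manually and compare with the forest's output
tree_preds = np.array([tree.predict(X_test) for tree in rf_regressor.estimators_])
manual_average = tree_preds.mean(axis=0)
print(np.allclose(manual_average, rf_pred))  # expected to print True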

Advantages of ensemble methods

Random forest regression offers several compelling advantages over single decision tree regressors:

Reduced overfitting: By averaging multiple trees, random forests smooth out individual tree peculiarities and reduce variance. This makes them much more robust to noise in training data.

Feature importance: Forests naturally provide feature importance scores by measuring how much each feature decreases MSE across all trees. This helps identify which variables matter most for predictions.

Out-of-bag evaluation: Since each tree is trained on a bootstrap sample (roughly 63% of data), the remaining 37% serves as a validation set. This provides honest performance estimates without a separate validation set.

Parallelization: Each tree trains independently, allowing efficient parallel computation across multiple processors.

The tradeoff is reduced interpretability—while a single regression tree can be visualized and understood, a forest of 100+ trees becomes a “black box.” However, techniques like feature importance and partial dependence plots help maintain some interpretability.
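
The out-of-bag evaluation described above is available in scikit-learn by passing oob_score=True, which scores each sample using only the trees that did not see it during training. A brief sketch, reusing X_train and y_train from the earlier examples:

# Out-of-bag evaluation: an internal performance estimate without a separate validation set
oob_forest = RandomForestRegressor(
    n_estimators=200,
    oob_score=True,   # compute R² on the out-of-bag samples
    random_state=42,
    n_jobs=-1
)
oob_forest.fit(X_train, y_train)
print(f"Out-of-bag R² estimate: {oob_forest.oob_score_:.4f}")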

5. Practical considerations and best practices

Feature scaling and preprocessing

Unlike many machine learning algorithms, decision tree regression and random forest regressors do not require feature scaling. Since trees make decisions based on threshold comparisons (e.g., “Is age > 30?”), the scale of features is irrelevant. A feature measured in meters works identically to one measured in kilometers after adjusting the threshold.
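
A quick way to convince yourself of this scale invariance, using the training data from the earlier examples (the 1,000× rescaling is arbitrary):

# Train the same tree on raw features and on features rescaled by a factor of 1,000
tree_raw = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_train, y_train)
tree_scaled = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_train * 1000, y_train)

# Predictions should match: only the split thresholds change, not the tree structure
print(np.allclose(tree_raw.predict(X_test), tree_scaled.predict(X_test * 1000)))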

However, preprocessing can still be beneficial:

Handling missing values: Trees can handle missing data through surrogate splits, but explicitly imputing missing values often improves performance. Mean/median imputation or more sophisticated methods like KNN imputation work well.

Encoding categorical variables: Convert categorical features to numerical representations. One-hot encoding works but can create many features. For high-cardinality categories, target encoding or embedding methods may be preferable.

Outlier treatment: Regression trees are relatively robust to outliers since they partition data into regions. However, extreme outliers can still create unusual splits. Consider capping or transforming extreme values.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Example preprocessing pipeline
numeric_features = [0, 1, 2, 3]
categorical_features = [4, 5]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Combine with regressor
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42))
])

Hyperparameter tuning strategies

Optimizing hyperparameters significantly impacts performance. Key parameters to tune for decision tree regressors include:

For individual trees:

  • max_depth: Start with 5-10, increase if underfitting
  • min_samples_split: Try 10-50 for large datasets
  • min_samples_leaf: Try 5-20 to prevent overfitting
  • max_features: Consider subsets to reduce correlation

For random forest regression:

  • n_estimators: More is usually better (100-500), with diminishing returns
  • max_depth: Often deeper than single trees (10-30)
  • min_samples_split and min_samples_leaf: Similar to single trees
  • max_features: ‘sqrt’ or ‘log2’ work well for most problems

Use grid search or randomized search to explore hyperparameter space:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define parameter distribution
param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(5, 30),
    'min_samples_split': randint(5, 50),
    'min_samples_leaf': randint(2, 20),
    'max_features': ['sqrt', 'log2', None]
}

# Randomized search
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,
    cv=5,
    scoring='neg_mean_squared_error',
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {-random_search.best_score_:.4f}")

Interpreting model predictions

Despite their power, tree-based models can be interpreted through several techniques:

Feature importance: Shows which features contribute most to predictions across all splits. Useful for understanding which variables drive your model.

Partial dependence plots: Visualize how predictions change as a single feature varies while others remain fixed. This reveals non-linear relationships captured by the model.

Individual tree inspection: For single decision tree regressors, you can visualize the entire tree structure and trace prediction paths.

from sklearn.tree import plot_tree
from sklearn.inspection import PartialDependenceDisplay

# Visualize a single tree
plt.figure(figsize=(20, 10))
plot_tree(
    regressor,
    feature_names=housing.feature_names,
    filled=True,
    rounded=True,
    fontsize=10
)
plt.show()

# Partial dependence plots
features = [0, 5]  # MedInc and AveOccup
PartialDependenceDisplay.from_estimator(
    rf_regressor,
    X_train,
    features,
    feature_names=housing.feature_names
)
plt.show()

6. Common pitfalls and how to avoid them

Overfitting and underfitting

Overfitting occurs when a regression tree becomes too complex, memorizing training data rather than learning general patterns. Signs include excellent training performance but poor test performance. Deep trees with many leaves are especially prone to overfitting.

Prevention strategies:

  • Limit tree depth with max_depth
  • Require minimum samples per split (min_samples_split)
  • Require minimum samples per leaf (min_samples_leaf)
  • Use pruning techniques such as post-pruning or cost-complexity pruning (see the sketch after this list)
  • Employ ensemble methods like random forest regression
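
For the pruning option, scikit-learn implements minimal cost-complexity pruning through the ccp_alpha parameter. A hedged sketch of a typical workflow, reusing the earlier training data (the alpha subsampling and cross-validation settings are illustrative):

from sklearn.model_selection import cross_val_score

# Compute the sequence of effective alphas for cost-complexity pruning
path = DecisionTreeRegressor(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# Evaluate a small, evenly spaced subset of alphas by cross-validation
candidate_alphas = path.ccp_alphas[:: max(1, len(path.ccp_alphas) // 10)]
cv_scores = [
    cross_val_score(
        DecisionTreeRegressor(random_state=42, ccp_alpha=alpha),
        X_train, y_train, cv=3, scoring='neg_mean_squared_error'
    ).mean()
    for alpha in candidate_alphas
]
best_alpha = candidate_alphas[int(np.argmax(cv_scores))]
print(f"Selected ccp_alpha: {best_alpha:.5f}")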

Underfitting happens when the tree is too simple to capture underlying patterns. The model performs poorly on both training and test data. Shallow trees often underfit complex relationships.

Prevention strategies:

  • Increase max_depth gradually
  • Reduce min_samples_split and min_samples_leaf constraints
  • Ensure sufficient training data
  • Verify feature quality and engineering

Handling imbalanced targets

While imbalance is typically discussed for classification, regression can face similar challenges when target values are unevenly distributed. If most houses cost between $200K-$300K but a few cost $1M+, the model may underpredict expensive houses.

Strategies to address this:

Sample weighting: Assign higher weights to underrepresented target ranges, making the model pay more attention to rare values.

Stratified sampling: When splitting data, ensure all target ranges are represented in training and test sets.

Transform targets: Apply logarithmic or other transformations to compress the range of target values, then inverse transform predictions.

# Log-transform the training targets to compress their range
y_train_log = np.log1p(y_train)

# Train on the transformed targets
regressor.fit(X_train, y_train_log)

# Predict in log space, then inverse transform back to the original scale
y_pred_log = regressor.predict(X_test)
y_pred = np.expm1(y_pred_log)

# Evaluate on the original scale, not the log scale
rmse_original = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE on original scale: {rmse_original:.4f}")
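
scikit-learn also provides TransformedTargetRegressor, which applies the forward transform when fitting and the inverse transform when predicting, so this bookkeeping is handled automatically. A minimal sketch of the same idea:

from sklearn.compose import TransformedTargetRegressor

# Wrap the tree so the log transform and its inverse are applied automatically
log_target_model = TransformedTargetRegressor(
    regressor=DecisionTreeRegressor(max_depth=5, random_state=42),
    func=np.log1p,
    inverse_func=np.expm1
)
log_target_model.fit(X_train, y_train)        # targets are log-transformed internally
ttr_pred = log_target_model.predict(X_test)   # predictions return on the original scale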

Extrapolation limitations

A critical limitation of regression trees is their inability to extrapolate beyond the range of training data. Since predictions are averages of training samples in leaf nodes, the model cannot predict values outside the training target range.

For example, if you train on houses priced $100K-$500K, the tree cannot predict a $600K house accurately—it will predict at most the highest training value seen in that region of feature space.
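
A tiny synthetic demonstration of this behavior (the linear trend and the out-of-range query are toy choices):

# A tree fit on x in [0, 10) cannot follow the trend beyond that range
x = np.arange(0, 10, 0.1).reshape(-1, 1)
y_linear = 3 * x.ravel()  # linear trend; the largest target value is just under 30

extrapolating_tree = DecisionTreeRegressor(max_depth=5, random_state=42).fit(x, y_linear)
print(extrapolating_tree.predict([[20.0]]))  # stays near the top of the training range (~29), not 60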

Handling extrapolation:

  • Ensure training data covers the full range of expected prediction scenarios
  • Use domain knowledge to identify when predictions may require extrapolation
  • Consider alternative models (linear regression, neural networks) if extrapolation is necessary
  • Implement prediction confidence bounds to flag when inputs fall outside training distribution

7. When to use regression trees

Ideal use cases

Regression trees and random forest regressors excel in specific scenarios:

Interpretability matters: When stakeholders need to understand prediction logic, decision tree regression offers clear, human-readable rules. Each prediction path tells a story.

Mixed feature types: Trees naturally handle both numerical and categorical features without extensive preprocessing. They automatically discover relevant categories and thresholds.

Non-linear relationships: When features interact in complex, non-linear ways (e.g., location matters more for large houses), trees discover these patterns automatically without manual feature engineering.

Robust to outliers: Unlike linear models where outliers heavily influence the entire model, trees isolate outliers in specific branches, limiting their impact.

Missing data: Trees handle missing values gracefully through surrogate splits, requiring less imputation than many alternatives.

Feature selection: Built-in feature importance helps identify which variables matter, useful for both modeling and domain understanding.

Comparison with other regression methods

Linear regression: Simple and interpretable with coefficients showing feature impact. However, assumes linear relationships and requires careful feature engineering for interactions. Use when relationships are approximately linear and interpretability through coefficients is valuable.

Support Vector Regression: Excellent for high-dimensional data and can model non-linear relationships through kernels. More computationally expensive than trees and less interpretable. Choose for complex, high-dimensional problems where accuracy trumps interpretability.

Neural networks: Extremely flexible, handling any non-linear relationship given sufficient data. Require large datasets, extensive tuning, and offer minimal interpretability. Use for massive datasets with complex patterns where computational resources are available.

Gradient boosting (XGBoost, LightGBM): Often achieves the best predictive performance by iteratively building trees to correct previous errors. More complex to tune and interpret than random forests. Choose when maximizing accuracy is paramount and computational resources permit.

CART regression trees: Best for interpretability, moderate-sized datasets, and when feature interactions matter. Random forest regressors extend this with better accuracy at the cost of some interpretability.

The choice depends on your priorities: interpretability versus accuracy, dataset size, computational constraints, and the nature of feature relationships.

8. Knowledge Check

Quiz 1: Fundamentals of Regression Trees

Question: What is a regression tree, and what is its primary function in machine learning?
Answer: A regression tree is a type of decision tree specifically designed for predicting continuous numerical values, such as a house price or temperature. Its primary function is to partition the feature space into distinct, non-overlapping regions and then assign a single continuous prediction value to each region.

Quiz 2: Regression vs. Classification Trees

Question: Identify the primary difference between a regression tree and a classification tree regarding their leaf node predictions and their splitting criteria.
Answer: A classification tree predicts a discrete class at its leaf nodes (typically using the majority vote of samples in the leaf) and uses impurity measures like Gini impurity or entropy for splitting. In contrast, a regression tree predicts a continuous value at its leaves (typically the mean of the samples) and uses error-reduction measures like Mean Squared Error (MSE) for splitting.

Quiz 3: The Mathematics of Splitting

Question: What is the most common mathematical criterion that a regression tree algorithm uses to determine the best split at any given node?
Answer: The most common criterion is Mean Squared Error (MSE) reduction. The algorithm evaluates all possible splits and chooses the one that maximizes this reduction. This is mathematically equivalent to maximizing variance reduction, which provides the statistical intuition that the best split is the one that creates child nodes that are more homogeneous (have lower variance) in their target values than the parent node.

Quiz 4: Making a Prediction

Question: After a regression tree is fully trained, describe the process it follows to make a prediction for a new, unseen data sample.
Answer: To make a prediction, the new sample traverses the tree from the root node down to a leaf. At each internal node, a condition based on a feature (e.g., “Is house size > 1500 sq ft?”) directs the sample to a specific child node. When it reaches a terminal leaf node, the prediction is typically the mean of the target values of all the training samples that were grouped into that leaf. While the mean is standard, some implementations allow using the median, which is more robust to outliers.

Quiz 5: Preventing Overfitting

Question: Name two key hyperparameters for controlling the complexity of a decision tree regressor and explain how they help prevent overfitting.
Answer: Key hyperparameters for controlling complexity and preventing overfitting include:
1. max_depth: Limits the maximum number of levels in the tree, preventing it from becoming too deep and creating overly specific rules.
2. min_samples_split: Sets the minimum number of samples a node must contain to be considered for splitting, preventing the model from learning from very small groups.
3. min_samples_leaf: Specifies the minimum number of samples required to be in a terminal leaf node, ensuring that every prediction is based on a reasonably sized group.
4. max_leaf_nodes: Directly limits the total number of terminal nodes in the tree.
These parameters work together to stop the tree from perfectly memorizing noise and outliers in the training data.

Quiz 6: From Trees to Forests

Question: What is the primary weakness of an individual regression tree, and how does the Random Forest ensemble method address this weakness?
Answer: The primary weakness of a single regression tree is its high variance, meaning that small changes in the training data can lead to a drastically different tree structure and unstable predictions. A Random Forest addresses this by building numerous diverse trees on different data subsets created through bootstrap sampling (sampling with replacement). By averaging the predictions of these many decorrelated trees, the Random Forest significantly reduces variance, leading to more robust and accurate results.

Quiz 7: Random Forest Benefits

Question: Beyond improved prediction accuracy, what is a key advantage of Random Forest regression for model interpretation?
Answer: While Random Forests are known for improved accuracy, they offer several other powerful advantages:
1. Feature Importance: A key advantage for interpretation is that they naturally provide feature importance scores. By measuring how much each feature contributes to reducing MSE across all trees, the model helps identify which variables are most influential.
2. Reduced Overfitting: By averaging the predictions of many trees, the model is much more robust to noise and less prone to overfitting than a single, deep decision tree.
3. Out-of-Bag (OOB) Evaluation: Since each tree is trained on a bootstrap sample (about 63% of the data), the remaining data can be used as an internal validation set to get an unbiased performance estimate without a separate test set.
4. Parallelization: The construction of each tree is an independent process, allowing the training to be efficiently parallelized across multiple CPU cores.

Quiz 8: Feature Scaling

Question: Is feature scaling (e.g., standardizing data to have a mean of 0 and variance of 1) a required preprocessing step for regression trees? Explain why or why not.
Answer: No, feature scaling is not required. Tree-based models make decisions by finding optimal split points based on thresholds (e.g., “Is age > 30?”). The outcome of these threshold-based comparisons is not affected by the absolute scale or distribution of the feature’s values. However, it is important to note that while scaling is unnecessary, other preprocessing steps like handling missing values or encoding categorical variables are still highly beneficial for tree-based models.

Quiz 9: Extrapolation Limitations

Question: What is a critical limitation of regression trees when asked to make predictions on data outside the range of the training set’s target values?
Answer: Regression trees are incapable of extrapolation. Since a tree’s prediction for a given region is based on the average of the target values it saw within that region during training, it can never predict a value higher than the maximum or lower than the minimum target value present in the training data.
To mitigate this limitation, one can:
• Ensure the training data covers the full expected range of target values.
• Consider alternative models like linear regression if extrapolation is a known requirement for the use case.

Quiz 10: When to Use a Regression Tree

Question: Describe an ideal scenario where using a single regression tree is preferable to more complex models like neural networks or gradient boosting.
Answer: A single regression tree is ideal in scenarios where the following are priorities:
1. Interpretability: When stakeholders need to understand the precise, rule-based logic behind each prediction, a tree’s visual structure provides a clear, human-readable explanation that “black box” models cannot.
2. Handling Mixed Feature Types: Trees naturally handle a mix of numerical and categorical features without extensive preprocessing.
3. Capturing Non-Linear Relationships: They can automatically discover and model complex, non-linear interactions between features without requiring manual feature engineering.
4. Robustness to Outliers: Outliers have a localized impact on specific branches of the tree rather than influencing the entire model, making trees relatively robust.