Intermediate

Implement Decision Trees & Random Forest

Decision trees are among the most commonly asked ML coding questions, second only to linear models. Build a decision tree classifier from scratch using Gini impurity or information gain (entropy) as the splitting criterion, then extend it to a random forest ensemble.

📝
Interview Question #1: “Implement a decision tree classifier from scratch. It should support both Gini impurity and information gain (entropy) as splitting criteria. Include fit and predict methods.”

Understanding the Splitting Criteria

Gini Impurity

Gini impurity measures how often a randomly chosen element from the node would be misclassified if it were labeled according to the node's class distribution. For a node with K classes, where p_k is the fraction of samples belonging to class k:

Gini(node) = 1 - sum(p_k^2) for k in 1..K

# Example: node with 40 class-0 and 60 class-1 samples
# Gini = 1 - (0.4^2 + 0.6^2) = 1 - (0.16 + 0.36) = 0.48

Information Gain (Entropy)

Entropy measures the disorder in a node. Information gain is the reduction in entropy after a split, with each child's entropy weighted by its share of the parent's samples:

Entropy(node) = -sum(p_k * log2(p_k)) for k in 1..K

Information_Gain = Entropy(parent) - weighted_avg(Entropy(children))
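As a quick numeric check, both criteria can be computed directly for the 40/60 example node above (a minimal sketch; the helper names `gini` and `entropy` are illustrative):

```python
import numpy as np

def gini(p):
    """Gini impurity from an array of class probabilities."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (base 2) from class probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zeros to avoid log2(0)
    return -np.sum(p * np.log2(p))

parent = [0.4, 0.6]                       # 40 class-0, 60 class-1 samples
print(f"Gini:    {gini(parent):.4f}")     # 0.4800
print(f"Entropy: {entropy(parent):.4f}")  # 0.9710

# A split that sends all class-0 samples left and all class-1 samples
# right yields two pure children (entropy 0), so the information gain
# equals the full parent entropy.
gain = entropy(parent) - (0.4 * entropy([1.0]) + 0.6 * entropy([1.0]))
print(f"Gain:    {gain:.4f}")             # 0.9710
```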

Decision Tree: Complete Solution

import numpy as np
from collections import Counter

class Node:
    """A node in the decision tree."""
    def __init__(self, feature=None, threshold=None, left=None,
                 right=None, value=None):
        self.feature = feature      # Index of feature to split on
        self.threshold = threshold  # Threshold value for the split
        self.left = left            # Left subtree (feature <= threshold)
        self.right = right          # Right subtree (feature > threshold)
        self.value = value          # Class label (for leaf nodes)

    def is_leaf(self):
        return self.value is not None


class DecisionTreeClassifier:
    """
    Decision tree classifier built from scratch.
    Supports Gini impurity and entropy splitting criteria.
    """

    def __init__(self, max_depth=10, min_samples_split=2,
                 criterion='gini'):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.criterion = criterion
        self.root = None

    def _gini(self, y):
        """Compute Gini impurity."""
        counts = np.bincount(y)
        probabilities = counts / len(y)
        return 1.0 - np.sum(probabilities ** 2)

    def _entropy(self, y):
        """Compute entropy."""
        counts = np.bincount(y)
        probabilities = counts / len(y)
        # Filter out zero probabilities to avoid log(0)
        probabilities = probabilities[probabilities > 0]
        return -np.sum(probabilities * np.log2(probabilities))

    def _impurity(self, y):
        """Compute impurity based on chosen criterion."""
        if self.criterion == 'gini':
            return self._gini(y)
        return self._entropy(y)

    def _information_gain(self, y, left_idx, right_idx):
        """Compute information gain from a split."""
        parent_impurity = self._impurity(y)
        n = len(y)
        n_left, n_right = len(left_idx), len(right_idx)

        if n_left == 0 or n_right == 0:
            return 0

        child_impurity = (
            (n_left / n) * self._impurity(y[left_idx]) +
            (n_right / n) * self._impurity(y[right_idx])
        )
        return parent_impurity - child_impurity

    def _best_split(self, X, y):
        """Find the best feature and threshold to split on."""
        best_gain = -1
        best_feature = None
        best_threshold = None

        n_features = X.shape[1]

        for feature_idx in range(n_features):
            # Get unique thresholds (midpoints between sorted values)
            values = np.sort(np.unique(X[:, feature_idx]))
            thresholds = (values[:-1] + values[1:]) / 2

            for threshold in thresholds:
                left_idx = np.where(X[:, feature_idx] <= threshold)[0]
                right_idx = np.where(X[:, feature_idx] > threshold)[0]

                gain = self._information_gain(y, left_idx, right_idx)

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold

        return best_feature, best_threshold, best_gain

    def _build_tree(self, X, y, depth=0):
        """Recursively build the decision tree."""
        n_samples = len(y)
        n_classes = len(np.unique(y))

        # Stopping conditions
        if (depth >= self.max_depth or
            n_classes == 1 or
            n_samples < self.min_samples_split):
            # Return leaf with majority class
            most_common = Counter(y).most_common(1)[0][0]
            return Node(value=most_common)

        # Find best split
        feature, threshold, gain = self._best_split(X, y)

        if gain <= 0:
            most_common = Counter(y).most_common(1)[0][0]
            return Node(value=most_common)

        # Split data
        left_idx = np.where(X[:, feature] <= threshold)[0]
        right_idx = np.where(X[:, feature] > threshold)[0]

        # Recursively build subtrees
        left = self._build_tree(X[left_idx], y[left_idx], depth + 1)
        right = self._build_tree(X[right_idx], y[right_idx], depth + 1)

        return Node(feature=feature, threshold=threshold,
                    left=left, right=right)

    def fit(self, X, y):
        """Build the decision tree from training data."""
        self.root = self._build_tree(X, y.astype(int))
        return self

    def _predict_single(self, x, node):
        """Traverse the tree for a single sample."""
        if node.is_leaf():
            return node.value
        if x[node.feature] <= node.threshold:
            return self._predict_single(x, node.left)
        return self._predict_single(x, node.right)

    def predict(self, X):
        """Predict class labels for all samples."""
        return np.array([self._predict_single(x, self.root) for x in X])


# ---- Test ----
np.random.seed(42)
# Simple dataset: class 1 only when both features are high (AND pattern).
# (A true XOR pattern would fail here: no single axis-aligned split has
# positive gain, so the gain <= 0 guard would turn the root into a leaf.)
X = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [0.1, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.9],
], dtype=float)
y = np.array([0, 0, 0, 1, 0, 0, 0, 1])

tree = DecisionTreeClassifier(max_depth=5, criterion='gini')
tree.fit(X, y)

predictions = tree.predict(X)
accuracy = np.mean(predictions == y)
print(f"Training accuracy: {accuracy:.2%}")  # Should be 100%
print(f"Predictions: {predictions}")
print(f"Actual:      {y}")

What interviewers look for:

  • Clean separation of the Node class and the tree-building logic
  • Correct implementation of both Gini and entropy
  • Proper stopping conditions (max depth, min samples, pure node, no gain)
  • Using midpoints between sorted values as candidate thresholds
  • Handling edge cases (empty splits, single-class nodes)

Common mistake: forgetting to handle the case where the best information gain is <= 0. Without this check, the tree keeps splitting even when no split improves purity, producing arbitrary, nonsensical branches (and, in implementations where a split can leave one side empty, infinite recursion).

📝
Interview Question #2: “Extend your decision tree to a random forest. Explain what bagging is and why random feature selection at each split reduces correlation between trees.”

Random Forest: Complete Solution

class RandomForestClassifier:
    """
    Random forest classifier using bagging and random feature subsets.
    """

    def __init__(self, n_trees=10, max_depth=10, min_samples_split=2,
                 max_features='sqrt', criterion='gini'):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.criterion = criterion
        self.trees = []
        self.feature_indices = []

    def _get_n_features(self, n_total):
        """Determine number of features to consider at each split."""
        if self.max_features == 'sqrt':
            return max(1, int(np.sqrt(n_total)))
        elif self.max_features == 'log2':
            return max(1, int(np.log2(n_total)))
        elif isinstance(self.max_features, int):
            return min(self.max_features, n_total)
        return n_total  # Use all features

    def _bootstrap_sample(self, X, y):
        """Create a bootstrap sample (sampling with replacement)."""
        n_samples = X.shape[0]
        indices = np.random.choice(n_samples, size=n_samples, replace=True)
        return X[indices], y[indices]

    def fit(self, X, y):
        """Train the random forest."""
        self.trees = []
        self.feature_indices = []
        n_total_features = X.shape[1]
        n_features = self._get_n_features(n_total_features)

        for _ in range(self.n_trees):
            # Bootstrap sample
            X_boot, y_boot = self._bootstrap_sample(X, y)

            # Random feature subset, fixed per tree. (Canonical random
            # forests re-sample the feature subset at every split; the
            # per-tree variant here is a common interview simplification.)
            feat_idx = np.random.choice(n_total_features, size=n_features,
                                        replace=False)
            self.feature_indices.append(feat_idx)

            # Train a decision tree on the bootstrap sample with subset features
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                criterion=self.criterion
            )
            tree.fit(X_boot[:, feat_idx], y_boot)
            self.trees.append(tree)

        return self

    def predict(self, X):
        """Predict using majority vote across all trees."""
        # Collect predictions from each tree
        all_preds = np.array([
            tree.predict(X[:, feat_idx])
            for tree, feat_idx in zip(self.trees, self.feature_indices)
        ])  # Shape: (n_trees, n_samples)

        # Majority vote for each sample
        predictions = np.array([
            Counter(all_preds[:, i]).most_common(1)[0][0]
            for i in range(X.shape[0])
        ])
        return predictions


# ---- Test ----
from sklearn.datasets import make_classification

np.random.seed(42)

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=5, random_state=42)

# Split manually (no sklearn train_test_split)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

rf = RandomForestClassifier(n_trees=20, max_depth=8, criterion='gini')
rf.fit(X_train, y_train)

train_acc = np.mean(rf.predict(X_train) == y_train)
test_acc = np.mean(rf.predict(X_test) == y_test)
print(f"Train accuracy: {train_acc:.2%}")
print(f"Test accuracy:  {test_acc:.2%}")

What interviewers look for:

  • Understanding of bagging: bootstrap sampling with replacement reduces variance
  • Understanding of random feature subsets: using sqrt(n_features) reduces correlation between trees
  • Majority voting: aggregating predictions from multiple trees
  • Knowing that random forests reduce overfitting compared to individual decision trees
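The variance-reduction claim is easy to verify empirically by comparing a single unconstrained tree against a forest on a held-out set. A sketch using scikit-learn so it runs standalone (the from-scratch classes above behave similarly, just slower):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as SkForest
from sklearn.tree import DecisionTreeClassifier as SkTree

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)
split = int(0.8 * len(X))
X_tr, X_te = X[:split], X[split:]
y_tr, y_te = y[:split], y[split:]

# A single fully-grown tree tends to overfit the training data ...
tree_acc = SkTree(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
# ... while averaging many decorrelated trees smooths out the noise.
forest_acc = SkForest(n_estimators=100, random_state=0).fit(
    X_tr, y_tr).score(X_te, y_te)

print(f"Single tree test accuracy: {tree_acc:.2%}")
print(f"Forest test accuracy:      {forest_acc:.2%}")
```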

📝
Interview Question #3: “How would you implement pruning for your decision tree? Explain the difference between pre-pruning and post-pruning.”

Pruning Strategies

Pre-Pruning (Already Implemented Above)

Pre-pruning stops tree growth early using constraints:

  • max_depth — limits the depth of the tree
  • min_samples_split — minimum samples required to split a node
  • min_information_gain — minimum gain required to make a split
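The first two constraints are already parameters of the DecisionTreeClassifier above; min_information_gain would simply replace the hard-coded gain <= 0 check with a configurable threshold. scikit-learn exposes the same idea as min_impurity_decrease — a quick sketch of how these knobs shrink the tree (assumes scikit-learn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier as SkTree

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Unconstrained tree vs. one grown under pre-pruning constraints.
full = SkTree(random_state=0).fit(X, y)
prepruned = SkTree(max_depth=3, min_samples_split=20,
                   min_impurity_decrease=0.01, random_state=0).fit(X, y)

# Pre-pruning yields a much smaller tree at a small cost in training fit.
print(full.tree_.node_count, prepruned.tree_.node_count)
```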

Post-Pruning (Reduced Error Pruning)

def prune(self, node, X_val, y_val):
    """
    Post-pruning (reduced error pruning) using a held-out validation set.
    Replaces any subtree whose majority-class leaf is at least as accurate.
    Intended as a DecisionTreeClassifier method: attach it with
    `DecisionTreeClassifier.prune = prune`, then call
    `tree.root = tree.prune(tree.root, X_val, y_val)` after fitting.
    """
    if node.is_leaf():
        return node

    # Determine which validation samples go left and right
    if node.feature is not None:
        left_mask = X_val[:, node.feature] <= node.threshold
        right_mask = ~left_mask

        # Recursively prune children first
        if np.sum(left_mask) > 0:
            node.left = self.prune(node.left,
                                    X_val[left_mask], y_val[left_mask])
        if np.sum(right_mask) > 0:
            node.right = self.prune(node.right,
                                     X_val[right_mask], y_val[right_mask])

    # Check if replacing this subtree with a leaf improves accuracy
    if len(y_val) > 0:
        # Current subtree accuracy
        current_preds = np.array([self._predict_single(x, node)
                                   for x in X_val])
        current_acc = np.mean(current_preds == y_val)

        # Leaf accuracy (majority class)
        majority_class = Counter(y_val).most_common(1)[0][0]
        leaf_acc = np.mean(majority_class == y_val)

        # Replace with leaf if it is at least as accurate
        if leaf_acc >= current_acc:
            return Node(value=majority_class)

    return node
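To see the idea in action without wiring up the full class, here is a compressed, standalone version of the same reduced-error pruning logic applied to a hand-built stump that splits on a feature carrying no signal (names are illustrative):

```python
import numpy as np
from collections import Counter

class Node:
    """Minimal stand-in for the article's Node class."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

    def is_leaf(self):
        return self.value is not None

def predict_single(x, node):
    if node.is_leaf():
        return node.value
    branch = node.left if x[node.feature] <= node.threshold else node.right
    return predict_single(x, branch)

def prune(node, X_val, y_val):
    """Collapse a subtree to a majority-class leaf whenever the leaf
    is at least as accurate on the validation data."""
    if node.is_leaf() or len(y_val) == 0:
        return node
    left_mask = X_val[:, node.feature] <= node.threshold
    node.left = prune(node.left, X_val[left_mask], y_val[left_mask])
    node.right = prune(node.right, X_val[~left_mask], y_val[~left_mask])
    preds = np.array([predict_single(x, node) for x in X_val])
    current_acc = np.mean(preds == y_val)
    majority = Counter(y_val.tolist()).most_common(1)[0][0]
    if np.mean(y_val == majority) >= current_acc:
        return Node(value=majority)
    return node

# A stump that splits on noise: the validation labels are all 0,
# so the split is collapsed into a single majority-class leaf.
stump = Node(feature=0, threshold=0.5,
             left=Node(value=0), right=Node(value=1))
X_val = np.array([[0.2], [0.8], [0.9]])
y_val = np.array([0, 0, 0])
pruned = prune(stump, X_val, y_val)
print(pruned.is_leaf(), pruned.value)  # True 0
```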
💡
Interview tip: When discussing pruning, always mention the bias-variance tradeoff. Pre-pruning is faster but may stop too early (underfitting). Post-pruning grows the full tree first and then removes branches, which is more thorough but computationally expensive. In practice, scikit-learn implements minimal cost-complexity pruning, exposed through the ccp_alpha parameter.
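For reference, cost-complexity pruning with scikit-learn looks roughly like this (a sketch assuming scikit-learn is installed; cost_complexity_pruning_path returns the candidate alphas):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier as SkTree

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The pruning path lists the effective alphas at which subtrees get
# pruned away: alpha = 0 keeps the full tree, larger alphas prune more.
path = SkTree(random_state=0).cost_complexity_pruning_path(X, y)

full = SkTree(ccp_alpha=0.0, random_state=0).fit(X, y)
pruned = SkTree(ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)
print(full.tree_.node_count, pruned.tree_.node_count)
```

In an interview it is usually enough to know that ccp_alpha penalizes tree size: each alpha in the path corresponds to removing the subtree whose pruning costs the least training accuracy per node removed, and the best alpha is typically chosen by cross-validation.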