Intermediate
LightGBM
Master Microsoft's LightGBM framework with its histogram-based approach, leaf-wise growth strategy, native categorical support, and blazing-fast training on large datasets.
Why LightGBM?
LightGBM is often the fastest gradient boosting framework, especially on large datasets. Beyond histogram-based split finding, it uses two key innovations: Gradient-based One-Side Sampling (GOSS), which reduces the number of data instances by keeping all large-gradient rows and sampling the rest, and Exclusive Feature Bundling (EFB), which reduces the number of features by bundling mutually exclusive ones.
Basic Usage
```python
import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=500,
    max_depth=-1,          # No limit (leaf-wise growth controls depth)
    num_leaves=31,         # Key parameter for LightGBM
    learning_rate=0.1,
    subsample=0.8,
    subsample_freq=1,      # Required: subsample is ignored when freq is 0
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=0.1,
    random_state=42,
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)],
)
```
Leaf-Wise vs Level-Wise Growth
| Aspect | Leaf-Wise (LightGBM) | Level-Wise (XGBoost) |
|---|---|---|
| Strategy | Split the leaf with highest loss reduction | Split all leaves at current depth |
| Tree shape | Asymmetric, deeper on important branches | Balanced, fixed depth |
| Speed | Faster (fewer splits needed) | Slower (many unnecessary splits) |
| Overfitting risk | Higher (use num_leaves to control) | Lower (max_depth is intuitive) |
Rule of Thumb: Set `num_leaves` below `2^max_depth`. For example, with `max_depth=7`, use `num_leaves < 128`. Start with the default `num_leaves=31` and increase carefully.
Native Categorical Features
```python
# LightGBM handles categoricals natively (no one-hot encoding needed)
categorical_cols = ["city", "product_type", "channel"]

# Convert to pandas category dtype on every frame the model will see
for col in categorical_cols:
    X_train[col] = X_train[col].astype("category")
    X_test[col] = X_test[col].astype("category")

model = lgb.LGBMClassifier()
model.fit(
    X_train, y_train,
    categorical_feature=categorical_cols,
)

# LightGBM finds optimal splits over groups of categories,
# which usually beats one-hot encoding for high-cardinality features
```
LightGBM Key Parameters
| Parameter | Default | Purpose |
|---|---|---|
| num_leaves | 31 | Max leaves per tree (main complexity control) |
| min_child_samples | 20 | Minimum samples in a leaf |
| feature_fraction | 1.0 | Column sampling ratio |
| bagging_fraction | 1.0 | Row sampling ratio |
| max_bin | 255 | Number of histogram bins (more = slower but more accurate) |
Next: CatBoost
Explore Yandex's CatBoost with its unique ordered boosting and superior categorical feature handling.
Lilly Tech Systems