Intermediate

LightGBM

Master Microsoft's LightGBM framework with its histogram-based approach, leaf-wise growth strategy, native categorical support, and blazing-fast training on large datasets.

Why LightGBM?

LightGBM is often the fastest gradient boosting framework, especially on large datasets. Beyond histogram-based split finding, it introduces two key innovations: Gradient-based One-Side Sampling (GOSS), which keeps the instances with large gradients and samples from the rest to reduce the number of rows scanned, and Exclusive Feature Bundling (EFB), which merges mutually exclusive sparse features to reduce the effective number of columns.
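To make GOSS concrete, here is a minimal NumPy sketch of the sampling step (an illustration of the idea, not LightGBM's internal implementation): keep the top `a` fraction of instances by absolute gradient, sample a `b` fraction of the rest, and up-weight the sampled small-gradient instances so the overall gradient distribution stays approximately unchanged.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Illustrative sketch of Gradient-based One-Side Sampling.

    Keeps the top `a` fraction of instances by |gradient|, randomly
    samples a `b` fraction of the remainder, and up-weights the
    sampled small-gradient instances by (1 - a) / b.
    """
    rng = np.random.default_rng(rng)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # largest gradients first
    top_k = int(a * n)
    rand_k = int(b * n)
    top_idx = order[:top_k]                  # always kept
    sampled_idx = rng.choice(order[top_k:], size=rand_k, replace=False)
    idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(idx))
    weights[top_k:] = (1 - a) / b            # compensate for undersampling
    return idx, weights
```

With the defaults, only 30% of the rows are used to build each tree, which is where much of LightGBM's speed advantage on large datasets comes from.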

Basic Usage

Python
import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=500,
    max_depth=-1,           # No limit (leaf-wise controls depth)
    num_leaves=31,           # Key parameter for LightGBM
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=0.1,
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
)

Leaf-Wise vs Level-Wise Growth

| Aspect | Leaf-Wise (LightGBM) | Level-Wise (XGBoost) |
| --- | --- | --- |
| Strategy | Split the leaf with the highest loss reduction | Split all leaves at the current depth |
| Tree shape | Asymmetric, deeper on important branches | Balanced, fixed depth |
| Speed | Faster (fewer splits needed) | Slower (many unnecessary splits) |
| Overfitting risk | Higher (use num_leaves to control) | Lower (max_depth is intuitive) |
Rule of Thumb: Set num_leaves to less than 2^max_depth. For example, if max_depth=7, use num_leaves < 128. Start with num_leaves=31 (default) and increase carefully.
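The rule of thumb can be wrapped in a tiny helper for sweeps (the name `suggested_num_leaves` and the `shrink` factor are hypothetical conveniences, not part of LightGBM):

```python
def suggested_num_leaves(max_depth, shrink=0.75):
    """Pick a num_leaves value strictly below 2**max_depth.

    `shrink` (hypothetical knob) backs off from the fully-grown
    bound to leave headroom against overfitting.
    """
    upper = 2 ** max_depth                  # leaves of a full binary tree
    return max(2, min(upper - 1, int(shrink * upper)))

suggested_num_leaves(7)   # 96, safely under the 128 bound
suggested_num_leaves(5)   # 24, under the 32 bound
```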

Native Categorical Features

Python
# LightGBM handles categoricals natively (no one-hot encoding!)
categorical_cols = ["city", "product_type", "channel"]

# Convert to pandas category dtype
for col in categorical_cols:
    df[col] = df[col].astype("category")

model = lgb.LGBMClassifier()
model.fit(
    X_train, y_train,
    categorical_feature=categorical_cols  # optional: the default "auto" already picks up category dtype columns
)

# LightGBM finds optimal splits on categories
# Much better than one-hot encoding for high-cardinality features
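What the category dtype buys you, shown with a toy pandas frame: pandas stores each category column as integer codes plus a lookup table, and LightGBM splits on those categories directly instead of on dozens of one-hot columns.

```python
import pandas as pd

df = pd.DataFrame({"city": ["nyc", "sf", "nyc", "la"]})
df["city"] = df["city"].astype("category")

# pandas keeps integer codes (sorted category order: la=0, nyc=1, sf=2)
print(df["city"].cat.codes.tolist())    # [1, 2, 1, 0]
print(list(df["city"].cat.categories))  # ['la', 'nyc', 'sf']
```

A one-hot encoding of a 1,000-level column would add 1,000 sparse features; the category dtype keeps it as a single column of small integers.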

LightGBM Key Parameters

| Parameter | Default | Purpose |
| --- | --- | --- |
| num_leaves | 31 | Max leaves per tree (main complexity control) |
| min_child_samples | 20 | Minimum samples in a leaf |
| feature_fraction | 1.0 | Column sampling ratio |
| bagging_fraction | 1.0 | Row sampling ratio |
| max_bin | 255 | Number of histogram bins (more = slower but more accurate) |
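A sketch of these parameters as a dict for LightGBM's native training API (the values are illustrative, not recommendations). One common gotcha worth noting: `bagging_fraction` has no effect unless `bagging_freq` is set above zero.

```python
# Parameter dict for the native API, i.e. lgb.train(params, lgb.Dataset(X, y)).
params = {
    "objective": "binary",
    "num_leaves": 31,           # main complexity control
    "min_child_samples": 20,    # minimum samples per leaf
    "feature_fraction": 0.8,    # use 80% of columns per tree
    "bagging_fraction": 0.8,    # use 80% of rows...
    "bagging_freq": 1,          # ...but only if bagging_freq > 0
    "max_bin": 255,             # histogram resolution
}
```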

Next: CatBoost

Explore Yandex's CatBoost with its unique ordered boosting and superior categorical feature handling.
