Intermediate
LightGBM
Master Microsoft's LightGBM framework with its histogram-based approach, leaf-wise growth strategy, native categorical support, and blazing-fast training on large datasets.
Why LightGBM?
LightGBM is often the fastest gradient boosting framework, especially on large datasets. Beyond histogram-based split finding, it uses two key innovations: Gradient-based One-Side Sampling (GOSS), which reduces the number of data instances by keeping all large-gradient rows and sampling the rest, and Exclusive Feature Bundling (EFB), which reduces the number of features by bundling mutually exclusive ones.
Basic Usage
```python
import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=500,
    max_depth=-1,          # No limit (leaf-wise growth controls depth)
    num_leaves=31,         # Key parameter for LightGBM
    learning_rate=0.1,
    subsample=0.8,
    subsample_freq=1,      # Required: subsample is ignored when freq is 0
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=0.1,
    random_state=42,
)

model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)],
)
```
Leaf-Wise vs Level-Wise Growth
| Aspect | Leaf-Wise (LightGBM) | Level-Wise (XGBoost) |
|---|---|---|
| Strategy | Split the leaf with highest loss reduction | Split all leaves at current depth |
| Tree shape | Asymmetric, deeper on important branches | Balanced, fixed depth |
| Speed | Faster (fewer splits needed) | Slower (many unnecessary splits) |
| Overfitting risk | Higher (use num_leaves to control) | Lower (max_depth is intuitive) |
Rule of Thumb: Set `num_leaves` below `2^max_depth`. For example, with `max_depth=7`, use `num_leaves < 128`. Start with the default `num_leaves=31` and increase carefully.
Native Categorical Features
```python
# LightGBM handles categoricals natively (no one-hot encoding needed)
categorical_cols = ["city", "product_type", "channel"]

# Convert to pandas category dtype on every frame the model will see
for col in categorical_cols:
    X_train[col] = X_train[col].astype("category")
    X_test[col] = X_test[col].astype("category")

model = lgb.LGBMClassifier()
model.fit(
    X_train, y_train,
    categorical_feature=categorical_cols,
)

# LightGBM finds optimal splits over groups of categories,
# which usually beats one-hot encoding for high-cardinality features
```
LightGBM Key Parameters
| Parameter | Default | Purpose |
|---|---|---|
| num_leaves | 31 | Max leaves per tree (main complexity control) |
| min_child_samples | 20 | Minimum samples in a leaf |
| feature_fraction | 1.0 | Column sampling ratio |
| bagging_fraction | 1.0 | Row sampling ratio |
| max_bin | 255 | Number of histogram bins (more = slower but more accurate) |
Next: CatBoost
Explore Yandex's CatBoost with its unique ordered boosting and superior categorical feature handling.
Lilly Tech Systems