Feature Engineering
Feature engineering is often the difference between a mediocre model and a great one. Learn how to transform, select, and create features that help your model learn.
Why Feature Engineering Matters
Features are the input variables your model uses to make predictions. Feature engineering is the process of transforming raw data into features that better represent the underlying patterns. It is often the single most impactful step in the ML pipeline — better features can improve a simple model more than a complex algorithm with poor features.
Numerical Features
Scaling and Normalization
Many algorithms (SVM, KNN, neural networks, PCA) are sensitive to the scale of features. Without scaling, a feature ranging from 0 to 1000 will dominate one ranging from 0 to 1 in any distance or gradient computation.
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# StandardScaler: mean=0, std=1 (most common)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use train stats!

# MinMaxScaler: scales to [0, 1]
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X_train)

# Log transform for skewed distributions
X['salary_log'] = np.log1p(X['salary'])
```
Categorical Features
ML algorithms work with numbers, not text. Categorical features must be encoded:
```python
import pandas as pd

# One-Hot Encoding (for nominal categories)
# Best for: colors, countries, categories without order
df_encoded = pd.get_dummies(df, columns=['color'], drop_first=True)
# color='red' -> [0, 1] (green=0, red=1)
# drop_first avoids multicollinearity

# Ordinal encoding (for ordered categories)
# Best for: sizes (S, M, L), education levels
# Note: sklearn's LabelEncoder assigns codes alphabetically (L=0, M=1, S=2),
# so use an explicit mapping to preserve the true order
df['size_encoded'] = df['size'].map({'S': 0, 'M': 1, 'L': 2})

# Target Encoding (for high-cardinality categories)
# Replace each category with the mean of the target variable
# Use with care: can cause overfitting
means = df.groupby('city')['price'].mean()
df['city_encoded'] = df['city'].map(means)
```
Feature Selection
Not all features are useful. Irrelevant or redundant features can hurt performance and slow training:
- Correlation analysis: Remove highly correlated features (correlation > 0.95). They provide redundant information.
- Feature importance: Tree-based models (Random Forest, XGBoost) provide built-in feature importance scores. Remove features with near-zero importance.
- Recursive Feature Elimination (RFE): Iteratively trains a model, removes the least important feature, and repeats until the desired number of features remains. A greedy but effective way to find a good feature subset.
- Statistical tests: Chi-squared test for categorical features, ANOVA F-test for numerical features against a categorical target.
```python
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# Feature importance from Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_

# Recursive Feature Elimination
rfe = RFE(estimator=rf, n_features_to_select=10)
rfe.fit(X_train, y_train)
selected_features = X_train.columns[rfe.support_]

# Statistical selection (top K features)
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
```
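The correlation-analysis step is not shown above; a minimal sketch, assuming features live in a pandas DataFrame (the function name and toy columns are hypothetical):

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from each pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: 'b' is an exact multiple of 'a', so one of the pair is dropped
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
reduced = drop_correlated(df)
```

Dropping the later column of each correlated pair is arbitrary; in practice you might prefer to keep whichever feature is cheaper to compute or easier to interpret.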
Feature Creation
Creating new features from existing ones can dramatically improve model performance:
- Interaction features: Multiply or combine features (e.g., price_per_sqft = price / area).
- Date/time features: Extract year, month, day of week, hour, is_weekend, days_since_event.
- Aggregation features: Group-level statistics (mean, count, max per customer, per category).
- Polynomial features: Create x^2, x^3, x1*x2 terms to capture non-linear relationships.
- Text features: Word count, character count, TF-IDF scores, sentiment scores.
- Binning: Convert continuous variables to categories (age groups, income brackets).
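A few of the techniques above, sketched in pandas (the DataFrame and its column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'price':   [300000, 450000, 150000],
    'area':    [1500, 2000, 1000],
    'sold_at': pd.to_datetime(['2023-01-14', '2023-03-02', '2023-07-30']),
    'age':     [23, 47, 65],
})

# Interaction feature: ratio of two existing columns
df['price_per_sqft'] = df['price'] / df['area']

# Date/time features extracted from a timestamp
df['month'] = df['sold_at'].dt.month
df['day_of_week'] = df['sold_at'].dt.dayofweek  # Monday=0, Sunday=6
df['is_weekend'] = df['day_of_week'] >= 5

# Binning: convert a continuous variable into ordered categories
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100],
                         labels=['young', 'middle', 'senior'])
```

Aggregation features follow the same pattern as the target-encoding snippet earlier: compute a group-level statistic with groupby and map it back onto each row.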
Missing Value Handling
Real-world data almost always has missing values. Your strategy depends on the cause and extent:
| Strategy | When to Use | Implementation |
|---|---|---|
| Drop rows | Few missing values (<5%), random missingness | df.dropna() |
| Drop columns | Feature has >50% missing values | df.drop(columns=[...]) |
| Mean/median imputation | Numerical features, random missingness | SimpleImputer(strategy='median') |
| Mode imputation | Categorical features | SimpleImputer(strategy='most_frequent') |
| KNN imputation | Complex patterns, correlated features | KNNImputer(n_neighbors=5) |
| Missing indicator | Missingness itself is informative | Add boolean column feature_is_missing |