Feature Engineering

Feature engineering is often the difference between a mediocre model and a great one. Learn how to transform, select, and create features that help your model learn.

Why Feature Engineering Matters

Features are the input variables your model uses to make predictions. Feature engineering is the process of transforming raw data into features that better represent the underlying patterns. It is often the single most impactful step in the ML pipeline — better features can improve a simple model more than a complex algorithm with poor features.

Numerical Features

Scaling and Normalization

Many algorithms (SVM, KNN, neural networks, PCA) are sensitive to the scale of features. A feature ranging 0–1000 would dominate a feature ranging 0–1.

Python (sklearn)
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# StandardScaler: mean=0, std=1 (most common)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use train stats!

# MinMaxScaler: scales to [0, 1]
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X_train)

# Log transform for skewed distributions
X['salary_log'] = np.log1p(X['salary'])
💡 Critical rule: Always fit the scaler on training data only, then transform both train and test data using the training statistics. Fitting on the full dataset causes data leakage — test set information leaks into preprocessing, giving overly optimistic results.
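One way to make this rule hard to violate is to bundle preprocessing and model into a single sklearn Pipeline, so the scaler is re-fit on each training fold automatically. A minimal sketch (the synthetic data here is purely illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative synthetic data: two features on very different scales
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

# The scaler inside the pipeline is fit only on each CV training split,
# so no test-fold statistics leak into preprocessing
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

With the scaler outside the pipeline, `cross_val_score` would evaluate on folds whose statistics already influenced the transform — exactly the leakage described above.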

Categorical Features

Most ML algorithms work with numbers, not text, so categorical features must be encoded:

Python (Encoding)
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# One-Hot Encoding (for nominal categories)
# Best for: colors, countries, categories without order
df_encoded = pd.get_dummies(df, columns=['color'],
                            drop_first=True)
# color='red' -> [0, 1] (green=0, red=1)
# drop_first avoids multicollinearity

# Ordinal Encoding (for ordered categories)
# Best for: sizes (S, M, L), education levels
# Caution: LabelEncoder assigns codes alphabetically (L=0, M=1, S=2),
# so spell out the order explicitly instead
size_order = {'S': 0, 'M': 1, 'L': 2}
df['size_encoded'] = df['size'].map(size_order)
# S=0, M=1, L=2

# Target Encoding (for high-cardinality categories)
# Replace each category with the mean of the target variable
# Use with care: compute the means on training data only,
# otherwise it overfits and leaks the target
means = df.groupby('city')['price'].mean()
df['city_encoded'] = df['city'].map(means)

Feature Selection

Not all features are useful. Irrelevant or redundant features can hurt performance and slow training:

  • Correlation analysis: Remove highly correlated features (correlation > 0.95). They provide redundant information.
  • Feature importance: Tree-based models (Random Forest, XGBoost) provide built-in feature importance scores. Remove features with near-zero importance.
  • Recursive Feature Elimination (RFE): Iteratively trains a model, removes the least important feature, and repeats until the desired number of features remains.
  • Statistical tests: Chi-squared test for categorical features, ANOVA F-test for numerical features against a categorical target.
Python (Feature Selection)
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# Feature importance from Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_

# Recursive Feature Elimination
rfe = RFE(estimator=rf, n_features_to_select=10)
rfe.fit(X_train, y_train)
selected_features = X_train.columns[rfe.support_]

# Statistical selection (top K features)
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
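The correlation-analysis step from the list above has no snippet of its own; it can be sketched with pandas like this (the toy data, column names, and 0.95 threshold are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data: feat_b is almost an exact copy of feat_a
rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    'feat_a': a,
    'feat_b': a + rng.normal(scale=0.01, size=500),  # near-perfect correlation
    'feat_c': rng.normal(size=500),
})

# Upper triangle of the absolute correlation matrix
# (so each pair is checked once)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from each pair with correlation > 0.95
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # ['feat_b']
```

Using only the upper triangle keeps one member of each correlated pair rather than dropping both.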

Feature Creation

Creating new features from existing ones can dramatically improve model performance:

  • Interaction features: Multiply or combine features (e.g., price_per_sqft = price / area).
  • Date/time features: Extract year, month, day of week, hour, is_weekend, days_since_event.
  • Aggregation features: Group-level statistics (mean, count, max per customer, per category).
  • Polynomial features: Create x^2, x^3, x1*x2 terms to capture non-linear relationships.
  • Text features: Word count, character count, TF-IDF scores, sentiment scores.
  • Binning: Convert continuous variables to categories (age groups, income brackets).
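A few of the patterns above in pandas (the toy DataFrame, column names, and bin edges are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'price': [300000, 450000, 250000],
    'area': [1500, 1800, 1250],
    'sold_at': pd.to_datetime(['2024-01-06', '2024-03-12', '2024-07-21']),
    'age': [5, 42, 67],
})

# Interaction feature: ratio of two existing columns
df['price_per_sqft'] = df['price'] / df['area']

# Date/time features extracted from a timestamp column
df['month'] = df['sold_at'].dt.month
df['day_of_week'] = df['sold_at'].dt.dayofweek  # Monday=0 ... Sunday=6
df['is_weekend'] = df['day_of_week'] >= 5

# Binning: convert a continuous variable into categories
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 40, 65, 120],
                         labels=['young', 'adult', 'middle', 'senior'])
print(df[['price_per_sqft', 'is_weekend', 'age_group']])
```

Each new column is a candidate feature; whether it helps should be checked against a validation set, not assumed.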

Missing Value Handling

Real-world data almost always has missing values. Your strategy depends on the cause and extent:

Strategy               | When to Use                                  | Implementation
Drop rows              | Few missing values (<5%), random missingness | df.dropna()
Drop columns           | Feature has >50% missing values              | df.drop(columns=[...])
Mean/median imputation | Numerical features, random missingness       | SimpleImputer(strategy='median')
Mode imputation        | Categorical features                         | SimpleImputer(strategy='most_frequent')
KNN imputation         | Complex patterns, correlated features        | KNNImputer(n_neighbors=5)
Missing indicator      | Missingness itself is informative            | Add boolean column feature_is_missing
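The imputers from the table in action (a sketch on a toy array; the values and n_neighbors are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [7.0, 6.0]])

# Median imputation: NaNs become the column medians (4.0 and 3.0 here)
median_imp = SimpleImputer(strategy='median')
X_median = median_imp.fit_transform(X)

# KNN imputation: fill each gap from the nearest complete rows
knn_imp = KNNImputer(n_neighbors=2)
X_knn = knn_imp.fit_transform(X)

# Missing indicator: preserve the fact that a value was missing
col0_missing = np.isnan(X[:, 0]).astype(int)
print(X_median)
```

As with scalers, imputers should be fit on training data only and then applied to the test set, for the same leakage reasons.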
Feature engineering workflow:
  1. Understand your data (EDA).
  2. Handle missing values.
  3. Encode categorical features.
  4. Scale numerical features.
  5. Create new features based on domain knowledge.
  6. Select the best features.
  7. Validate that your features improve model performance.