Supervised Learning
Explore the most widely used ML paradigm: learn regression algorithms for predicting continuous values and classification algorithms for predicting categories, with practical sklearn code examples.
What is Supervised Learning?
In supervised learning, you train a model on labeled data — each training example has an input (features) and a known output (label/target). The model learns a mapping function from inputs to outputs and can then predict outputs for new, unseen inputs.
Regression Algorithms
Regression predicts continuous numerical values (e.g., price, temperature, revenue).
Linear Regression
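The simplest regression algorithm. It fits a straight line (or a hyperplane when there are multiple features) through the data by minimizing the sum of squared differences between predicted and actual values, a method known as ordinary least squares. It works best when the relationship between features and target is approximately linear.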
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X.squeeze() + np.random.randn(100) * 2 + 5

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"Coefficient: {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
```
Other Regression Algorithms
- Polynomial Regression: Fits polynomial curves (quadratic, cubic) by adding polynomial features. Captures non-linear relationships.
- Decision Tree Regression: Splits data into regions based on feature thresholds. Captures non-linear patterns but prone to overfitting.
- Random Forest Regression: Ensemble of many decision trees. Reduces overfitting through averaging. Robust and versatile.
- Gradient Boosting (XGBoost, LightGBM): Builds trees sequentially, each correcting errors of the previous. Often the best performer on tabular data. XGBoost and LightGBM are optimized implementations.
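
The alternatives listed above can be compared on a small non-linear problem. The following is a minimal sketch rather than a benchmark: the sine-shaped synthetic data, the hyperparameters, and the `models` and `scores` names are all assumptions made for illustration, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost or LightGBM so that no extra packages are needed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (assumed for illustration): y = sin(x) + noise
rng = np.random.RandomState(42)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Linear": LinearRegression(),
    # Polynomial regression = polynomial features + a linear model
    "Polynomial (degree 5)": make_pipeline(
        PolynomialFeatures(degree=5), LinearRegression()),
    # max_depth limits tree growth to curb overfitting
    "Decision tree": DecisionTreeRegressor(max_depth=4, random_state=42),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=42),
    # Stand-in for XGBoost/LightGBM (same idea, different implementation)
    "Gradient boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its R-squared on the held-out test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R-squared = {scores[name]:.3f}")
```

On data like this, the plain linear model should score noticeably lower than the others because a straight line cannot follow the curve. Exact numbers depend on the data and settings chosen here.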
Classification Algorithms
Classification predicts discrete categories (e.g., spam/not spam, cat/dog, disease type).
Logistic Regression
Despite its name, logistic regression is a classification algorithm. It applies the sigmoid function to a linear combination of the input features, producing a probability between 0 and 1 that is then thresholded to pick a class. Simple, interpretable, and a good baseline.
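To make the role of the sigmoid concrete, the sketch below trains a binary logistic regression and checks that passing the model's linear score through a hand-written sigmoid reproduces `predict_proba`. The breast cancer dataset, the `StandardScaler` step, and `max_iter=1000` are illustrative choices, not part of the description above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary classification dataset bundled with scikit-learn (illustrative choice)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling helps the solver converge; max_iter is raised as a precaution
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# decision_function returns the linear score z = w . x + b
z = clf.decision_function(X_test)

# The sigmoid maps that score to the probability of the positive class
sigmoid_probs = 1.0 / (1.0 + np.exp(-z))
proba = clf.predict_proba(X_test)[:, 1]

print(f"Accuracy: {clf.score(X_test, y_test):.4f}")
print("Sigmoid matches predict_proba:", np.allclose(sigmoid_probs, proba))
```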
Support Vector Machines (SVM)
SVMs find the hyperplane that best separates classes by maximizing the margin between them. Using the "kernel trick," they can handle non-linear decision boundaries. Effective in high-dimensional spaces.
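The effect of the kernel trick can be seen by fitting the same SVM with a linear and an RBF kernel on data that is not linearly separable. This is a rough sketch: the two-moons dataset, the noise level, `C=1.0`, and the `svm_scores` dictionary are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-circles: a classic non-linearly-separable toy problem
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

svm_scores = {}
for kernel in ("linear", "rbf"):
    # SVMs are sensitive to feature scale, so standardize first
    svm = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    svm.fit(X_train, y_train)
    svm_scores[kernel] = svm.score(X_test, y_test)
    print(f"{kernel} kernel accuracy: {svm_scores[kernel]:.3f}")
```

The RBF kernel would be expected to do at least as well as the linear one here, since it can bend the decision boundary around the two curved clusters.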
K-Nearest Neighbors (KNN)
Classifies a new point by majority vote among its K nearest neighbors in the training set. Simple and intuitive, but prediction is slow on large datasets because a brute-force search must compare the new point against every training point.
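A short sketch of KNN in scikit-learn follows. Because neighbors are found by distance, features with large numeric ranges can dominate the result, so the example also compares raw and standardized inputs. The wine dataset, `n_neighbors=5`, and 5-fold cross-validation are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wine features span very different numeric ranges (illustrative dataset)
X, y = load_wine(return_X_y=True)

# Same classifier, with and without feature standardization
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Mean accuracy over 5 cross-validation folds
raw_acc = cross_val_score(knn_raw, X, y, cv=5).mean()
scaled_acc = cross_val_score(knn_scaled, X, y, cv=5).mean()

print(f"KNN on raw features:    {raw_acc:.3f}")
print(f"KNN on scaled features: {scaled_acc:.3f}")
```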
Naive Bayes
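Based on Bayes' theorem with the "naive" assumption that features are independent of one another given the class. Very fast to train and predict, and works well for text classification (spam filtering, sentiment analysis) even though the independence assumption rarely holds exactly.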
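As a rough illustration of Naive Bayes on text, the sketch below trains a spam filter on word counts. The six-message corpus and its labels are made up for this example, and `CountVectorizer` with `MultinomialNB` is one common pairing rather than the only option.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, invented for illustration only
texts = [
    "win a free prize now",
    "free money claim your prize",
    "limited offer win cash now",
    "meeting scheduled for monday morning",
    "please review the attached report",
    "lunch with the team on friday",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# CountVectorizer turns each message into word counts;
# MultinomialNB models those counts per class
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(texts, labels)

new_messages = ["claim your free cash prize", "team meeting on monday"]
predictions = nb.predict(new_messages)

for message, label in zip(new_messages, predictions):
    print(f"{label}: {message}")
```

A real filter would need far more training data; the point here is only the shape of the pipeline.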
Decision Trees and Ensembles
Decision trees for classification split data based on feature thresholds to maximize class purity. Ensemble methods combine multiple trees:
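- Random Forest: Bagging (bootstrap aggregating) of decision trees. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split. The final class is chosen by combining the trees' votes (scikit-learn averages their predicted class probabilities).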
- Gradient Boosting (XGBoost, LightGBM): Sequential tree building where each tree corrects the mistakes of the trees before it. Often among the strongest performers on tabular data.
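
"Class purity" can be made concrete with an impurity measure. The sketch below uses Gini impurity, which is one common choice (and the default criterion for scikit-learn's tree classifiers). The helper names `gini` and `weighted_gini` and the toy labels are hypothetical, written only for this illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 = pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy node with five samples of each class
parent = ["a"] * 5 + ["b"] * 5

# Candidate split A leaves one misplaced sample on each side
split_a = (["a"] * 4 + ["b"], ["a"] + ["b"] * 4)
# Candidate split B separates the classes perfectly
split_b = (["a"] * 5, ["b"] * 5)

print(f"Parent impurity:  {gini(parent):.2f}")
print(f"Split A impurity: {weighted_gini(*split_a):.2f}")
print(f"Split B impurity: {weighted_gini(*split_b):.2f}")
```

A tree-building algorithm would prefer split B because it reduces impurity the most. Real implementations evaluate many candidate thresholds per feature and pick the best one at each node.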
Classification Example
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Feature importance
for name, importance in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {importance:.4f}")
```
Algorithm Selection Guide
| Algorithm | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Linear/Logistic Regression | Simple, fast, interpretable | Cannot capture non-linear patterns without feature engineering | Baselines, linear relationships |
| Decision Tree | Interpretable, handles non-linearity | Prone to overfitting | Explainability requirements |
| Random Forest | Robust, handles noise well | Less interpretable, slower | General-purpose, medium datasets |
| XGBoost/LightGBM | Often the best accuracy on tabular data | Requires tuning, less interpretable | Competitions, production systems |
| SVM | Effective in high dimensions | Slow on large datasets | Small to medium datasets, text |
| KNN | Simple, no training needed | Slow at prediction, sensitive to scale | Small datasets, prototyping |
| Naive Bayes | Very fast, works with little data | Feature independence assumption | Text classification, spam filtering |
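
In practice, the table is a starting point, and trying several candidates with cross-validation is a common way to choose between them. The sketch below does this with default or near-default settings. The dataset, the `candidates` dictionary, and the use of `GradientBoostingClassifier` in place of XGBoost/LightGBM are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; substitute your own features and labels
X, y = load_breast_cancer(return_X_y=True)

# Scale-sensitive models get a StandardScaler inside their pipeline
candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Stand-in for XGBoost/LightGBM
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each candidate
cv_scores = {}
for name, model in candidates.items():
    cv_scores[name] = cross_val_score(model, X, y, cv=5).mean()

# Report from best to worst
for name, score in sorted(cv_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {score:.3f}")
```

Rankings from a single small dataset with untuned models should not be generalized; the ordering can change with different data, metrics, or hyperparameters.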