Beginner

Supervised Learning

Explore the most widely used ML paradigm: learn regression algorithms for predicting continuous values and classification algorithms for predicting categories, with practical sklearn code examples.

What is Supervised Learning?

In supervised learning, you train a model on labeled data: each training example pairs an input (the features) with a known output (the label or target). The model learns a mapping from inputs to outputs and can then predict outputs for new, unseen inputs.

Regression Algorithms

Regression predicts continuous numerical values (e.g., price, temperature, revenue).

Linear Regression

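Linear regression is the simplest regression algorithm. It fits a straight line (or a hyperplane when there are several features) through the data by minimizing the sum of squared errors between predictions and targets. It works best when the relationship between features and target is approximately linear.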

Python (sklearn)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X.squeeze() + np.random.randn(100) * 2 + 5

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"Coefficient: {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")

Other Regression Algorithms

Polynomial Regression: Adds polynomial features (squared, cubic, and so on) and fits a linear model on them. Captures non-linear relationships, although high degrees can overfit.
Decision Tree Regression: Splits the data into regions based on feature thresholds and predicts a constant value in each region. Captures non-linear patterns but is prone to overfitting unless depth is limited.
Random Forest Regression: Ensemble of many decision trees whose predictions are averaged. The averaging reduces overfitting. Robust and versatile.
Gradient Boosting (XGBoost, LightGBM): Builds trees sequentially, each one correcting the errors of the trees before it. Often the best performer on tabular data. XGBoost and LightGBM are optimized implementations.
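
The algorithms listed above can be compared side by side. The following is a minimal sketch, not a benchmark: the quadratic data is synthetic and chosen for illustration, the hyperparameters are arbitrary, and scikit-learn's GradientBoostingRegressor stands in for XGBoost or LightGBM so that no extra packages are needed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

# Synthetic quadratic data (an illustrative assumption, not a real dataset)
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X.squeeze() ** 2 + X.squeeze() + 2 + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

models = {
    "Linear": LinearRegression(),
    # Polynomial regression = polynomial features + a linear model
    "Polynomial (degree 2)": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()),
    # max_depth limits tree growth to curb overfitting
    "Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    # Stand-in for XGBoost/LightGBM (assumption: same idea, different library)
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its R-squared on the held-out test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R-squared = {scores[name]:.4f}")
```

Because the target is curved, the plain linear model should score noticeably lower than the others on this data. On a dataset that really is linear, the ranking can reverse.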

Classification Algorithms

Classification predicts discrete categories (e.g., spam/not spam, cat/dog, disease type).

Logistic Regression

Despite its name, logistic regression is a classification algorithm. It computes a weighted sum of the features and passes it through the sigmoid function to produce a probability between 0 and 1. Simple, interpretable, and works well as a baseline.
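
The link between the linear score and the sigmoid can be checked directly. This is a small sketch for the binary case, assuming a synthetic dataset from make_classification: it applies the sigmoid by hand to the model's linear scores and compares the result with the probabilities scikit-learn reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative assumption)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Linear score z = w . x + b for each test sample
z = clf.decision_function(X_test)

# Sigmoid maps the score to a probability for the positive class
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = clf.predict_proba(X_test)[:, 1]

print("Manual sigmoid matches predict_proba:",
      np.allclose(manual_proba, sklearn_proba))
print(f"Test accuracy: {clf.score(X_test, y_test):.4f}")
```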

Support Vector Machines (SVM)

SVMs find the hyperplane that best separates classes by maximizing the margin between them. Using the "kernel trick," they can handle non-linear decision boundaries. Effective in high-dimensional spaces.
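
To illustrate the kernel trick, the sketch below fits a linear-kernel SVM and an RBF-kernel SVM to the two-moons toy dataset, which is not linearly separable. The dataset and the values of C and gamma are assumptions picked for illustration, not tuned settings.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-circles: a classic non-linear toy problem
X, y = make_moons(n_samples=500, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

svm_scores = {}
for kernel in ("linear", "rbf"):
    # SVMs are sensitive to feature scale, so standardize first.
    # gamma only affects the RBF kernel; the linear kernel ignores it.
    model = make_pipeline(StandardScaler(),
                          SVC(kernel=kernel, C=1.0, gamma=2.0))
    model.fit(X_train, y_train)
    svm_scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel} kernel accuracy: {svm_scores[kernel]:.4f}")
```

The linear kernel can only draw a straight boundary, so it misclassifies points where the moons interleave. The RBF kernel can bend the boundary around them.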

K-Nearest Neighbors (KNN)

Classifies a new point by the majority class among its K nearest neighbors in the training set. Simple and intuitive, but slow at prediction time on large datasets, because a brute-force search compares the new point with every training point.
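
Because KNN relies on distances, features with large numeric ranges can dominate the neighbor search. The following sketch, which assumes the wine dataset bundled with scikit-learn and K=5, compares KNN with and without standardization.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wine features have very different ranges (e.g., proline vs. hue)
X, y = load_wine(return_X_y=True)

raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validated accuracy for each variant
raw_acc = cross_val_score(raw_knn, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X, y, cv=5).mean()

print(f"KNN without scaling: {raw_acc:.4f}")
print(f"KNN with scaling:    {scaled_acc:.4f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling.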

Naive Bayes

Based on Bayes' theorem with the "naive" assumption that features are independent of each other given the class. Very fast and works well for text classification (spam filtering, sentiment analysis).
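
A minimal sketch of Naive Bayes spam filtering follows. The eight training messages are made up for illustration (a real filter needs far more data), and the bag-of-words counts from CountVectorizer are one common choice of features, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, invented for this example
messages = [
    "win a free prize now",
    "free money click now",
    "claim your free prize",
    "win cash now",
    "meeting at noon tomorrow",
    "project report due tomorrow",
    "lunch with the team at noon",
    "please review the project report",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# CountVectorizer turns text into word counts;
# MultinomialNB models those counts per class
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(messages, labels)

new_messages = ["free prize now", "team meeting tomorrow"]
predictions = spam_filter.predict(new_messages)
for text, label in zip(new_messages, predictions):
    print(f"{text!r} -> {label}")
```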

Decision Trees and Ensembles

Decision trees for classification split data based on feature thresholds to maximize class purity. Ensemble methods combine multiple trees:

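Random Forest: Bagging (bootstrap aggregating) of decision trees. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split. The forest predicts by majority vote across trees (scikit-learn averages the trees' predicted class probabilities instead).
Gradient Boosting (XGBoost, LightGBM): Sequential tree building where each tree corrects the mistakes of the trees before it. Often among the strongest methods for tabular data.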
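
The two ensemble styles can be sketched on the iris dataset. The first part checks that a scikit-learn random forest's class probabilities equal the average of its individual trees' probabilities. The second part fits a boosted model, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost or LightGBM (an assumption made to avoid extra dependencies).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Bagging: many trees trained independently on bootstrap samples
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Average the per-tree class probabilities by hand
tree_probas = np.stack([tree.predict_proba(X_test)
                        for tree in rf.estimators_])
averaged = tree_probas.mean(axis=0)
matches_forest = np.allclose(averaged, rf.predict_proba(X_test))
print("Forest output equals averaged tree outputs:", matches_forest)

# Boosting: trees built one after another, each fitting the remaining errors
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)

rf_acc = rf.score(X_test, y_test)
gb_acc = gb.score(X_test, y_test)
print(f"Random Forest accuracy:     {rf_acc:.4f}")
print(f"Gradient Boosting accuracy: {gb_acc:.4f}")
```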

Classification Example

Python (sklearn)

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred,
                          target_names=iris.target_names))

# Feature importance
for name, importance in zip(iris.feature_names,
                              rf.feature_importances_):
    print(f"{name}: {importance:.4f}")

Algorithm Selection Guide

Algorithm	Strengths	Weaknesses	Best For
Linear/Logistic Regression	Simple, fast, interpretable	Cannot capture non-linear patterns	Baselines, linear relationships
Decision Tree	Interpretable, handles non-linearity	Prone to overfitting	Explainability requirements
Random Forest	Robust, handles noise well	Less interpretable, slower	General-purpose, medium datasets
XGBoost/LightGBM	Often the best accuracy on tabular data	Requires tuning, less interpretable	Competitions, production systems
SVM	Effective in high dimensions	Slow on large datasets	Small to medium datasets, text
KNN	Simple, no training needed	Slow at prediction, sensitive to scale	Small datasets, prototyping
Naive Bayes	Very fast, works with little data	Feature independence assumption	Text classification, spam filtering

Practical advice: For tabular data, start with a Random Forest baseline, then try XGBoost or LightGBM. Tree-based ensembles are often the strongest performers on structured data, though the best choice depends on the dataset, so compare candidates with cross-validation. On very large datasets, LightGBM often trains faster than XGBoost.
