Beginner

Supervised Learning

Explore the most widely used ML paradigm: learn regression algorithms for predicting continuous values and classification algorithms for predicting categories, with practical sklearn code examples.

What is Supervised Learning?

In supervised learning, you train a model on labeled data: each training example has an input (features) and a known output (label or target). The model learns a mapping function from inputs to outputs and can then predict outputs for new, unseen inputs.

Regression Algorithms

Regression predicts continuous numerical values (e.g., price, temperature, revenue).

Linear Regression

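Linear regression is the simplest regression algorithm. It fits a straight line (or a hyperplane when there are several features) through the data by minimizing the sum of squared errors between predictions and targets. It works best when the relationship between features and target is approximately linear.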

Python (sklearn)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X.squeeze() + np.random.randn(100) * 2 + 5

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"Coefficient: {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
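
To make the "sum of squared errors" objective concrete, the sketch below solves the same least-squares problem directly with NumPy and compares the result with scikit-learn. It assumes the same synthetic data as the example above and fits on all 100 points rather than on a train split; the names `A` and `theta` are illustrative and do not come from the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the example above: y is roughly 2.5 * x + 5 plus noise
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X.squeeze() + np.random.randn(100) * 2 + 5

# Append a column of ones so the intercept is learned as an extra weight
A = np.hstack([X, np.ones((X.shape[0], 1))])

# Least squares: find theta that minimizes ||A @ theta - y||^2
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = theta

# scikit-learn solves the same problem, so the parameters should agree
model = LinearRegression().fit(X, y)
print(f"NumPy:   slope={slope:.4f}, intercept={intercept:.4f}")
print(f"sklearn: slope={model.coef_[0]:.4f}, intercept={model.intercept_:.4f}")
```

Both approaches should recover a slope near 2.5 and an intercept near 5, up to the noise in the data.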

Other Regression Algorithms

  • Polynomial Regression: Fits polynomial curves (quadratic, cubic) by adding polynomial features. Captures non-linear relationships.
  • Decision Tree Regression: Splits data into regions based on feature thresholds. Captures non-linear patterns but prone to overfitting.
  • Random Forest Regression: Ensemble of many decision trees. Reduces overfitting through averaging. Robust and versatile.
  • Gradient Boosting (XGBoost, LightGBM): Builds trees sequentially, each correcting errors of the previous. Often the best performer on tabular data. XGBoost and LightGBM are optimized implementations.
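
As a rough illustration of the algorithms listed above, the following sketch fits each one to the same non-linear data and prints its test R-squared. The sine-shaped dataset, the polynomial degree, and the tree depth are assumptions chosen for illustration. scikit-learn's own `GradientBoostingRegressor` stands in for XGBoost and LightGBM here so that no extra packages are needed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (assumed for illustration): y = sin(x) + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Linear": LinearRegression(),
    # Polynomial regression = polynomial features + ordinary linear regression
    "Polynomial (degree 3)": make_pipeline(
        PolynomialFeatures(degree=3), LinearRegression()),
    # Limiting depth is one simple way to curb a single tree's overfitting
    "Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit every model on the same split and record its test R-squared
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R-squared = {scores[name]:.4f}")
```

On data like this, the plain linear model should score noticeably lower than the others because a straight line cannot follow the curve. The ranking among the non-linear models depends on the dataset and the hyperparameters.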

Classification Algorithms

Classification predicts discrete categories (e.g., spam/not spam, cat/dog, disease type).

Logistic Regression

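Despite its name, logistic regression is a classification algorithm. It passes a linear combination of the features through the sigmoid function to output a probability between 0 and 1 (multiclass variants use the softmax function instead). It is simple, interpretable, and works well as a baseline.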
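
The sketch below shows the sigmoid step in practice: it trains a binary logistic regression, then recomputes the predicted probabilities by hand from the model's linear scores. The breast cancer dataset, the `StandardScaler` step, and the name `manual_proba` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary classification dataset bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling the features helps the solver converge
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(class 1)
proba = clf.predict_proba(X_test)[:, 1]

# Recompute the same probability by hand: sigmoid of the linear score
linear_scores = clf.decision_function(X_test)
manual_proba = 1.0 / (1.0 + np.exp(-linear_scores))

accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")
print(f"Max difference from manual sigmoid: {np.abs(proba - manual_proba).max():.2e}")
```

For a binary problem the two probability vectors should match to within floating-point error, which is why the model's coefficients can be read as changes in log-odds.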

Support Vector Machines (SVM)

SVMs find the hyperplane that best separates classes by maximizing the margin between them. Using the "kernel trick," they can handle non-linear decision boundaries. Effective in high-dimensional spaces.
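
To illustrate the kernel trick, the following sketch compares a linear kernel with an RBF kernel on data that a straight line cannot separate. The two-moons dataset, the noise level, and the default `C` and `gamma` values are assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Same pipeline twice; only the kernel changes
results = {}
for kernel in ("linear", "rbf"):
    svm = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    svm.fit(X_train, y_train)
    results[kernel] = svm.score(X_test, y_test)
    print(f"{kernel} kernel accuracy: {results[kernel]:.4f}")
```

The RBF kernel can bend the decision boundary around the two moons, so it should score at least as well as the linear kernel on this data.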

K-Nearest Neighbors (KNN)

KNN classifies a new point based on the majority class among its K nearest neighbors. It is simple and intuitive, but prediction is slow on large datasets because a brute-force search compares the new point with every training point.
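
The majority-vote idea can be sketched in a few lines of NumPy. The helper `knn_predict_one` below is a hypothetical, brute-force illustration, not scikit-learn's implementation. Its predictions are compared with `KNeighborsClassifier` on the Iris dataset, and the two may differ on the occasional tie.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


def knn_predict_one(x, X_train, y_train, k=5):
    """Brute-force KNN: distance to every training point, then majority vote."""
    distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]               # indices of the k closest
    votes = Counter(y_train[nearest].tolist())        # count labels among them
    return votes.most_common(1)[0][0]                 # most frequent label


manual_pred = np.array([knn_predict_one(x, X_train, y_train) for x in X_test])

# Reference: scikit-learn's KNN with the same k
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
sk_pred = knn.predict(X_test)

manual_accuracy = np.mean(manual_pred == y_test)
agreement = np.mean(manual_pred == sk_pred)
print(f"Manual KNN accuracy: {manual_accuracy:.4f}")
print(f"Agreement with scikit-learn: {agreement:.4f}")
```

Because distances drive every prediction, features measured on larger scales dominate the vote. Standardizing the features first (for example with `StandardScaler`) is usually worthwhile.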

Naive Bayes

Naive Bayes applies Bayes' theorem with the "naive" assumption that features are independent of each other given the class. It is very fast and works well for text classification (spam filtering, sentiment analysis).
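
A minimal spam-filter sketch shows how this works on text. The tiny corpus below is invented purely for illustration. A real filter would need far more data, and the choice of `CountVectorizer` with `MultinomialNB` is one common pairing rather than the only option.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-written corpus, invented for illustration only
texts = [
    "win a free prize now",
    "free money click now",
    "claim your free prize",
    "limited offer win cash",
    "meeting at noon tomorrow",
    "please review the attached report",
    "lunch with the team tomorrow",
    "project report due on friday",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# CountVectorizer turns each message into word counts;
# MultinomialNB treats each word count as independent given the class
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(texts, labels)

new_messages = ["free cash prize", "team meeting on friday"]
predictions = spam_filter.predict(new_messages)
for message, label in zip(new_messages, predictions):
    print(f"{message!r} -> {label}")
```

The first message contains only words seen in spam and the second only words seen in ham, so the classifier labels them accordingly.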

Decision Trees and Ensembles

Decision trees for classification split data based on feature thresholds to maximize class purity. Ensemble methods combine multiple trees:

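  • Random Forest: Bagging (bootstrap aggregating) of decision trees. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split. The trees' predictions are combined by majority vote (scikit-learn averages their predicted class probabilities).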
  • Gradient Boosting (XGBoost, LightGBM): Sequential tree building where each tree corrects the mistakes of the previous ones. Frequently among the strongest methods for tabular data.

Classification Example

Python (sklearn)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred,
                          target_names=iris.target_names))

# Feature importance
for name, importance in zip(iris.feature_names,
                              rf.feature_importances_):
    print(f"{name}: {importance:.4f}")

Algorithm Selection Guide

| Algorithm | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- |
| Linear/Logistic Regression | Simple, fast, interpretable | Cannot capture non-linear patterns | Baselines, linear relationships |
| Decision Tree | Interpretable, handles non-linearity | Prone to overfitting | Explainability requirements |
| Random Forest | Robust, handles noise well | Less interpretable, slower | General-purpose, medium datasets |
| XGBoost/LightGBM | Often the best accuracy on tabular data | Requires tuning, less interpretable | Competitions, production systems |
| SVM | Effective in high dimensions | Slow on large datasets | Small to medium datasets, text |
| KNN | Simple, no training needed | Slow at prediction, sensitive to scale | Small datasets, prototyping |
| Naive Bayes | Very fast, works with little data | Feature independence assumption | Text classification, spam filtering |

Practical advice: For tabular data, start with a Random Forest baseline, then try XGBoost or LightGBM. These tree-based ensembles often outperform other algorithms on structured data, though the best choice depends on the dataset. For very large datasets, LightGBM typically trains faster than XGBoost.