Intermediate

ML with Python

Machine learning fundamentals using scikit-learn — regression, classification, clustering, and model evaluation with practical Python examples.

Scikit-learn Overview

Scikit-learn is one of the most widely used Python libraries for traditional machine learning. Its models share a consistent API: fit() to train, predict() to make predictions, and score() to evaluate (accuracy for classifiers, R-squared for regressors).

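# The scikit-learn pattern
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Example data (synthetic, for illustration; substitute your own X and y)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)

# Split data: hold out 20% for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate on data the model has not seen
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
r2 = model.score(X_test, y_test)  # score() returns R-squared for regressors
print(f"MSE: {mse:.2f}, R-squared: {r2:.3f}")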

Regression

Regression predicts continuous numerical values.

Linear Regression

Fits a straight line (a hyperplane when there are several features) that minimizes the sum of squared errors. Use when the relationship between features and target is approximately linear. Key metric: R-squared (how much of the target's variance the model explains).
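
As a minimal sketch of the idea above, the following fits a line to synthetic data. The data-generating rule (y = 3x + 2 plus noise) and all variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y = 3x + 2 plus small Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]        # learned coefficient for the single feature
intercept = model.intercept_  # learned constant term
r2 = model.score(X, y)        # for regressors, score() returns R-squared

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R-squared={r2:.3f}")
```

The learned slope and intercept should land close to the values used to generate the data, and R-squared should be close to 1 because the noise is small relative to the trend.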

Polynomial Regression

Extends linear regression to capture non-linear relationships by adding polynomial features (x^2, x^3, etc.).
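
One common way to do this in scikit-learn is to chain PolynomialFeatures with LinearRegression in a pipeline. The sketch below assumes synthetic quadratic data and compares a plain linear fit with a degree-2 fit; the data and the choice of degree are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data (assumed for illustration): y = 0.5x^2 - x + 1 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + 1.0 + rng.normal(0, 0.2, size=200)

# A straight line cannot follow the curve
linear = LinearRegression().fit(X, y)

# PolynomialFeatures adds x^2 as a feature; the model is still linear in its coefficients
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

r2_linear = linear.score(X, y)
r2_poly = poly.score(X, y)
print(f"R-squared, linear: {r2_linear:.3f}")
print(f"R-squared, degree 2: {r2_poly:.3f}")
```

Higher degrees fit the training data ever more closely, so in practice the degree is usually chosen by checking performance on held-out data.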

Regression Metrics

  • MAE — Mean Absolute Error. Average of absolute differences.
  • MSE — Mean Squared Error. Penalizes large errors more.
  • RMSE — Root MSE. Same units as target variable.
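  • R-squared — Proportion of variance explained. Usually between 0 and 1 (higher is better), but it can be negative on held-out data when the model does worse than always predicting the mean.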
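
The metrics above can be computed with sklearn.metrics. The four true and predicted values below are made up so the arithmetic is easy to check by hand: the errors are 1, 0, -2, and -1.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hand-made values (assumed for illustration)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2 + 1) / 4 = 1.0
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                        # square root of MSE, same units as y
r2 = r2_score(y_true, y_pred)              # 1 - 6 / 14.75

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R-squared={r2:.3f}")
```

Note how the single error of size 2 contributes 4 of the 6 units of squared error, which is what "penalizes large errors more" means in practice.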

Classification

Classification predicts categorical labels (classes).

K-Nearest Neighbors (KNN)

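Classifies a data point by the majority class among its K nearest neighbors in the training set. Simple but often effective. The choice of K matters: a small K gives a jagged decision boundary that is sensitive to noise, while a large K gives a smoother boundary that can blur real differences between classes.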
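
A short sketch of trying several values of K, assuming the built-in iris dataset and an arbitrary set of K values (1, 5, 15) chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train one classifier per K and record its accuracy on the held-out data
accuracy_by_k = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy_by_k[k] = knn.score(X_test, y_test)  # accuracy for classifiers

for k, acc in accuracy_by_k.items():
    print(f"K={k:>2}: test accuracy = {acc:.3f}")
```

KNN relies on distances, so features measured on very different scales are usually standardized first; the iris features are close enough in scale that this sketch skips that step.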

Decision Trees

Build a tree of if-else rules based on feature values. Easy to interpret but prone to overfitting. Use max_depth to control tree complexity.
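
To illustrate the effect of max_depth, the sketch below trains an unrestricted tree and a depth-limited tree on the same data. The synthetic dataset, its noise level, and the depth of 3 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y randomizes about 10% of the labels)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for name, depth in (("unrestricted", None), ("max_depth=3", 3)):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[name] = {
        "depth": tree.get_depth(),
        "train_acc": tree.score(X_train, y_train),
        "test_acc": tree.score(X_test, y_test),
    }

for name, r in results.items():
    print(f"{name}: depth={r['depth']}, "
          f"train={r['train_acc']:.3f}, test={r['test_acc']:.3f}")
```

The unrestricted tree keeps splitting until it classifies every training sample correctly, noise included. A large gap between training and test accuracy is the usual sign of overfitting.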

Logistic Regression

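Despite the name, it is a classification algorithm. It applies the sigmoid function to a linear combination of the features to output a probability between 0 and 1. Commonly used for binary classification (yes/no, spam/not spam); scikit-learn's LogisticRegression also handles multiclass problems.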
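
The sketch below shows the probability output in practice, assuming the built-in breast cancer dataset (a binary problem) and a StandardScaler step added to help the solver converge. It also checks that the positive-class probability equals the sigmoid of the model's linear score.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit the classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)  # one column per class; each row sums to 1
labels = clf.predict(X_test)       # the more probable class for each sample

# The positive-class probability is the sigmoid of the linear score
linear_score = clf.decision_function(X_test)
sigmoid = 1.0 / (1.0 + np.exp(-linear_score))

print("First three P(class 1):", np.round(proba[:3, 1], 3))
print("Matches sigmoid of linear score:", np.allclose(proba[:, 1], sigmoid))
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```

predict() applies a 0.5 threshold to these probabilities; working with predict_proba() directly lets you pick a different threshold when false positives and false negatives have different costs.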

Support Vector Machines (SVM)

Finds the hyperplane that separates the classes with the maximum margin. Effective in high-dimensional spaces. Can use the kernel trick (for example, an RBF kernel) to learn non-linear boundaries.
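
As a rough illustration of linear versus kernel SVMs, the sketch below compares a linear kernel with an RBF kernel on the two-moons toy dataset, where the classes cannot be separated by a straight line. The dataset and its noise level are assumptions for illustration, and default hyperparameters are used.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable
X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)  # kernel trick: curved boundary

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print(f"Linear kernel test accuracy: {linear_acc:.3f}")
print(f"RBF kernel test accuracy:    {rbf_acc:.3f}")
```

On real data, SVMs are sensitive to feature scale, so a StandardScaler step is normally placed in front of the classifier.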

Classification Metrics

  • Accuracy — Overall correct predictions. Misleading for imbalanced data.
  • Precision — True positives / (True positives + False positives).
  • Recall — True positives / (True positives + False negatives).
  • F1 Score — Harmonic mean of precision and recall.
  • Confusion Matrix — Table of TP, TN, FP, FN values.
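
The metrics above can be checked on a small hand-made example. The ten labels below are invented so the counts are easy to verify: 3 true positives, 1 false negative, 2 false positives, and 4 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels (assumed for illustration); 1 is the positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels 0/1 the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 2 / 3

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```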

Clustering

Clustering groups similar data points without labels (unsupervised learning).

K-Means Clustering

Partitions data into K clusters by minimizing the sum of squared distances from each point to its cluster center (centroid). You must specify K in advance. Use the elbow method to choose a reasonable K.
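
A minimal sketch of the elbow method, assuming synthetic data generated with four well-separated blobs. scikit-learn's KMeans exposes the within-cluster sum of squares as the inertia_ attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters (assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit K-Means for a range of K and record the within-cluster sum of squares
wcss = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_

for k, value in wcss.items():
    print(f"K={k}: WCSS={value:.1f}")
```

Plotting these values against K should show a sharp drop up to K=4 and only small gains afterward; that bend is the "elbow". On real data the bend is often less clear, so the elbow is a heuristic rather than a guarantee.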

Hierarchical Clustering

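Builds a tree (dendrogram) of clusters. Can be agglomerative (bottom-up, merging the closest clusters step by step) or divisive (top-down, splitting repeatedly). Does not require specifying K in advance: you choose the number of clusters afterward by deciding where to cut the dendrogram.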
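
scikit-learn's AgglomerativeClustering implements the bottom-up approach. The sketch below, on six hand-made points, cuts the tree with a distance threshold instead of a fixed number of clusters. The points, the threshold of 5.0, and the average linkage are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two tight groups of three points each, far apart (hand-made for illustration)
X = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)

# n_clusters=None plus distance_threshold: stop merging once clusters are
# more than 5.0 apart (average linkage), rather than fixing K up front
hc = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="average"
)
labels = hc.fit_predict(X)

print("Clusters found:", hc.n_clusters_)
print("Labels:", labels)
```

Points within each group merge at distances of about 1, while the two groups are about 14 apart, so any threshold between those values yields two clusters.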

DBSCAN

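Density-Based Spatial Clustering of Applications with Noise. Finds clusters of arbitrary shape and labels points in low-density regions as noise (outliers). Does not require specifying the number of clusters; instead you set a neighborhood radius (eps) and a minimum number of points per dense region (min_samples).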
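
As an illustration of arbitrary-shaped clusters and noise detection, the sketch below runs DBSCAN on two concentric rings plus one isolated point. The data layout and the eps and min_samples values are assumptions picked to suit this toy example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def ring(radius, n_points):
    """Return n_points evenly spaced on a circle of the given radius."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

# Inner ring (40 points), outer ring (120 points), and one far-away outlier
X = np.vstack([ring(1.0, 40), ring(4.0, 120), [[10.0, 10.0]]])

# eps is larger than the spacing along each ring but smaller than the gap between rings
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_  # cluster index per point; -1 marks noise

n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print("Clusters found:", n_clusters)
print("Noise points:", n_noise)
```

Each ring comes out as its own cluster even though one encloses the other, and the isolated point is labeled -1. Results depend heavily on eps and min_samples, so those usually need tuning for each dataset.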

💡
Key distinction: K-Means requires specifying the number of clusters (K). DBSCAN does not. K-Means works best on roughly circular/spherical clusters. DBSCAN can find clusters of any shape and flags outliers as noise. Know when to use each for the assessment.

Practice Questions

📝
Q1: You want to predict house prices (a continuous value) based on features like size, location, and age. Which algorithm is most appropriate?

A) Logistic regression
B) Linear regression
C) K-Means clustering
D) KNN classification

B) Linear regression. House price is a continuous numerical value, making this a regression problem. Linear regression learns the relationship between input features and the continuous target. Logistic regression, despite its name, is a classification algorithm that predicts categories.

📝
Q2: A decision tree model achieves 100% accuracy on training data but 65% on test data. What is happening and how do you fix it?

A) Underfitting — increase tree depth
B) Overfitting — limit tree depth with max_depth
C) Data drift — collect new data
D) Class imbalance — use SMOTE

B) Overfitting — limit tree depth with max_depth. 100% training accuracy with 65% test accuracy is classic overfitting. The tree has memorized the training data. Setting max_depth limits how deep the tree can grow, forcing it to learn general patterns rather than memorizing noise.

📝
Q3: You need to group customers into segments for a marketing campaign. You do not know how many segments exist and the groups may have irregular shapes. Which algorithm should you use?

A) K-Means
B) DBSCAN
C) Linear regression
D) Logistic regression

B) DBSCAN. DBSCAN does not require specifying the number of clusters and can find clusters of arbitrary shapes. K-Means requires you to specify K and works best on roughly spherical clusters. Since the number of segments is unknown and shapes may be irregular, DBSCAN is the better choice.

📝
Q4: In scikit-learn, which method trains a model on training data?

A) model.predict()
B) model.fit()
C) model.transform()
D) model.score()

B) model.fit(). The scikit-learn API uses fit(X_train, y_train) to train a model on data. predict() makes predictions on new data. transform() is for preprocessing (scalers, encoders). score() returns the model's default evaluation metric (accuracy for classifiers, R-squared for regressors).

📝
Q5: Which method helps determine the optimal number of clusters (K) for K-Means?

A) Cross-validation
B) The elbow method
C) Grid search
D) Gradient descent

B) The elbow method. Plot the within-cluster sum of squares (WCSS, available as inertia_ in scikit-learn's KMeans) for different values of K. The optimal K is at the "elbow" where adding more clusters provides diminishing returns. Cross-validation and grid search are mainly used to tune hyperparameters of supervised models.