ML with Python
Machine learning fundamentals using scikit-learn — regression, classification, clustering, and model evaluation with practical Python examples.
Scikit-learn Overview
Scikit-learn is the most popular Python library for traditional machine learning. Its estimators share a consistent API: fit() to train, predict() to make predictions, and score() to evaluate. Preprocessing steps such as scalers and encoders use transform() instead of predict().
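# The scikit-learn pattern
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Example data (synthetic placeholder; replace with your own X and y)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)
# Split data: hold out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
# Predict and evaluate on the held-out data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)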
Regression
Regression predicts continuous numerical values.
Linear Regression
Finds the best-fit line through the data. Use when the relationship between features and target is approximately linear. Key metric: R-squared (how much variance the model explains).
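A minimal sketch of fitting a line and reading R-squared. The data is synthetic and an assumption made for illustration (generated from y = 3x + 2 plus noise), so the recovered slope and intercept should land close to those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): y = 3x + 2 plus small Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, size=100)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]         # expected to be close to 3
intercept = model.intercept_   # expected to be close to 2
r_squared = model.score(X, y)  # for regressors, score() returns R-squared

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

Here score() is evaluated on the training data for brevity; in practice R-squared on held-out data is the more honest number.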
Polynomial Regression
Extends linear regression to capture non-linear relationships by adding polynomial features (x^2, x^3, etc.).
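One way to sketch this in scikit-learn is a pipeline that generates polynomial features and then fits an ordinary linear model. The quadratic data below is a synthetic assumption chosen so that a straight line clearly fails.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (illustrative assumption): y = x^2 plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.2, size=200)

# A straight line cannot follow the curve
linear = LinearRegression().fit(X, y)

# Adding x^2 as an extra feature lets the same linear model fit the curve
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

linear_r2 = linear.score(X, y)
poly_r2 = poly.score(X, y)
print(f"linear R^2={linear_r2:.3f}, polynomial R^2={poly_r2:.3f}")
```

The model is still linear in its coefficients; only the inputs have been expanded. Higher degrees fit the training data more closely but overfit more easily.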
Regression Metrics
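- MAE: Mean Absolute Error. Average of the absolute differences between predictions and true values.
- MSE: Mean Squared Error. Average of the squared differences, so large errors are penalized more.
- RMSE: Root Mean Squared Error. Square root of MSE, in the same units as the target variable.
- R-squared: Proportion of variance explained (higher is better). At most 1, and usually between 0 and 1, but it can be negative on test data when the model does worse than always predicting the mean.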
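The four metrics above can be checked by hand on a tiny example. The numbers below are made up for illustration; the errors are 1, 0, -1, and -2.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 2) / 4 = 1.0
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                        # sqrt(1.5), about 1.225
r2 = r2_score(y_true, y_pred)              # 1 - 6/20 = 0.7

print(mae, mse, round(float(rmse), 3), r2)
```

RMSE is computed here as the square root of MSE, which works across scikit-learn versions.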
Classification
Classification predicts categorical labels (classes).
K-Nearest Neighbors (KNN)
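Classifies a data point by the majority class among its K nearest neighbors. Simple but effective. The choice of K matters: a small K follows noise in the training data and gives jagged boundaries, while a large K gives smoother boundaries but can blur real distinctions between classes.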
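A rough sketch of how K changes the fit, assuming a synthetic two-class dataset from make_moons (the dataset and the K values tried are illustrative choices, not recommendations).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (illustrative assumption)
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Map each K to (training accuracy, test accuracy)
results = {}
for k in (1, 15, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))

for k, (train_acc, test_acc) in results.items():
    print(f"K={k}: train={train_acc:.2f}, test={test_acc:.2f}")
```

With K=1 every training point is its own nearest neighbor, so training accuracy is perfect; comparing it with the test accuracy shows how much of that is memorized noise.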
Decision Trees
Builds a tree of if-else rules based on feature values. Easy to interpret but prone to overfitting. Use max_depth to control tree complexity.
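The effect of max_depth can be sketched by comparing an unrestricted tree with a shallow one. The dataset is a synthetic assumption with some label noise (flip_y) so that there is noise to memorize.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 10% noisy labels (illustrative assumption)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# No depth limit: the tree keeps splitting until every training point is fit
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# max_depth=3: at most three levels of if-else rules
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in (("deep", deep), ("shallow", shallow)):
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train={tree.score(X_train, y_train):.2f}, "
        f"test={tree.score(X_test, y_test):.2f}"
    )
```

A large gap between training and test accuracy for the unrestricted tree is the overfitting signal; the best max_depth for a given dataset is usually found by cross-validation.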
Logistic Regression
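Despite the name, it is a classification algorithm. Uses the sigmoid function to turn a linear combination of the features into a probability between 0 and 1. Most often used for binary classification (yes/no, spam/not spam); scikit-learn's LogisticRegression also handles more than two classes.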
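A small sketch of the probability output, assuming a made-up one-feature dataset (hours studied versus pass/fail). It also recomputes the probability by applying the sigmoid to the model's linear score, to show where the number comes from.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data for illustration: hours studied -> fail (0) or pass (1)
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

new_points = np.array([[1.0], [4.5]])
proba = clf.predict_proba(new_points)  # columns: P(class 0), P(class 1)
labels = clf.predict(new_points)       # class with the higher probability

# Sigmoid of the linear score w*x + b reproduces P(class 1) in the binary case
z = clf.decision_function(new_points)
manual = 1 / (1 + np.exp(-z))

print(labels)
print(proba[:, 1], manual)
```

The default decision threshold is 0.5; predict_proba lets you pick a different threshold when false positives and false negatives have different costs.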
Support Vector Machines (SVM)
Finds the hyperplane that separates the classes with the maximum margin. Effective in high-dimensional spaces. Can use the kernel trick (for example, an RBF kernel) to learn non-linear boundaries.
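To illustrate linear versus kernel SVMs, the sketch below assumes a synthetic dataset of two concentric circles, which no straight line can separate. The kernel choices are for illustration only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings (illustrative assumption): not linearly separable
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print(f"linear kernel: {linear_acc:.2f}, RBF kernel: {rbf_acc:.2f}")
```

SVMs are sensitive to feature scale, so on real data it is common to standardize features first (for example with StandardScaler in a pipeline).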
Classification Metrics
- Accuracy: Fraction of all predictions that are correct. Misleading for imbalanced data.
- Precision: True positives / (True positives + False positives). Of the predicted positives, how many are actually positive.
- Recall: True positives / (True positives + False negatives). Of the actual positives, how many were found.
- F1 Score: Harmonic mean of precision and recall.
- Confusion Matrix: Table of TP, TN, FP, FN counts.
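The metrics above can be worked through on a small made-up example with 3 true positives, 1 false negative, 2 false positives, and 4 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels for illustration (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels 0/1, the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (3 + 4) / 10 = 0.7
precision = precision_score(y_true, y_pred)  # 3 / (3 + 2) = 0.6
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # 2 * 0.6 * 0.75 / 1.35, about 0.667

print(tn, fp, fn, tp)
print(accuracy, precision, recall, round(f1, 3))
```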
Clustering
Clustering groups similar data points without labels (unsupervised learning).
K-Means Clustering
Partitions data into K clusters by minimizing the sum of squared distances from each point to its cluster center (centroid). You must specify K in advance. Use the elbow method to choose a reasonable K.
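A sketch of the elbow method, assuming synthetic data with three well-separated blobs (the blob centers are chosen for illustration). After fitting, scikit-learn exposes the within-cluster sum of squares as the inertia_ attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs (illustrative assumption)
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=1.0,
    random_state=0,
)

# Fit K-Means for several values of K and record the inertia
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(f"K={k}: inertia={value:.1f}")
```

Plotting inertia against K should show a sharp drop up to K=3 and only small gains afterwards; that bend is the elbow. On real data the bend is often less clear-cut.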
Hierarchical Clustering
Builds a tree (dendrogram) of clusters. Can be agglomerative (bottom-up) or divisive (top-down). Does not require specifying K in advance: you choose the number of clusters afterwards by cutting the dendrogram at a chosen height.
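A minimal sketch of the bottom-up variant using scikit-learn's AgglomerativeClustering. Instead of fixing K, it cuts the tree at a merge distance; the threshold of 5.0 and the synthetic blobs are assumptions picked for this illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Three tight, well-separated blobs (illustrative assumption)
X, _ = make_blobs(
    n_samples=150,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=0,
)

# n_clusters=None plus distance_threshold stops merging once clusters are
# farther apart than the threshold, rather than at a fixed number of clusters
agg = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="average"
)
labels = agg.fit_predict(X)
n_found = agg.n_clusters_

print(f"clusters found: {n_found}")
```

To draw the dendrogram itself, scipy.cluster.hierarchy offers linkage() and dendrogram() functions.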
DBSCAN
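Density-Based Spatial Clustering of Applications with Noise. Finds clusters of arbitrary shape and identifies outliers as noise. Does not require specifying the number of clusters, but you must choose a neighborhood radius (eps) and a minimum number of points per dense region (min_samples).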
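A sketch of DBSCAN on two crescent-shaped clusters with one far-away point added as an outlier. The dataset and the eps and min_samples values are assumptions tuned to this toy example, not general defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents (illustrative assumption)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Append one isolated point that should be labeled as noise
X = np.vstack([X, [[5.0, 5.0]]])

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # noise points get the label -1

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Results depend strongly on eps: too small and most points become noise, too large and separate clusters merge.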
Practice Questions
Which algorithm is most appropriate for predicting a house price from input features?
A) Logistic regression
B) Linear regression
C) K-Means clustering
D) KNN classification
Answer:
B) Linear regression. House price is a continuous numerical value, making this a regression problem. Linear regression learns the relationship between input features and the continuous target. Logistic regression is for classification (predicting categories), not regression despite its name.
A decision tree reaches 100% training accuracy but only 65% test accuracy. What is the most likely problem, and how do you address it?
A) Underfitting — increase tree depth
B) Overfitting — limit tree depth with max_depth
C) Data drift — collect new data
D) Class imbalance — use SMOTE
Answer:
B) Overfitting — limit tree depth with max_depth. 100% training accuracy with 65% test accuracy is classic overfitting. The tree has memorized the training data. Setting max_depth limits how deep the tree can grow, forcing it to learn general patterns rather than memorizing noise.
You need to group data into segments, but the number of segments is unknown and their shapes may be irregular. Which algorithm is the best choice?
A) K-Means
B) DBSCAN
C) Linear regression
D) Logistic regression
Answer:
B) DBSCAN. DBSCAN does not require specifying the number of clusters and can find clusters of arbitrary shapes. K-Means requires you to specify K and works best when clusters are roughly spherical. Since the number of segments is unknown and shapes may be irregular, DBSCAN is the better choice.
Which scikit-learn method trains a model on data?
A) model.predict()
B) model.fit()
C) model.transform()
D) model.score()
Answer:
B) model.fit(). The scikit-learn API uses fit(X_train, y_train) to train a model on data. predict() makes predictions on new data. transform() is for preprocessing (scalers, encoders). score() returns the default evaluation metric.
Which technique is used to find the optimal K for K-Means clustering?
A) Cross-validation
B) The elbow method
C) Grid search
D) Gradient descent
Answer:
B) The elbow method. Plot the within-cluster sum of squares (WCSS) for different values of K. The optimal K is at the "elbow" where adding more clusters provides diminishing returns. Cross-validation and grid search are for supervised learning hyperparameter tuning. Gradient descent is an optimization algorithm for fitting model parameters, not a way to choose K.