Project Setup (Beginner)

In this project you will build a complete recommendation engine from the ground up. This first lesson covers the system architecture, introduces the MovieLens dataset, defines our tech stack, and walks through environment setup so you can start coding immediately.

System Architecture

Our recommendation engine follows a layered architecture that separates data ingestion, model training, and serving:

Architecture
+------------------+     +---------------------+     +----------------+
|   MovieLens      | --> |   Data Pipeline     | --> |  Model Layer   |
|   Dataset        |     |   (pandas, numpy)   |     |  - CF (sklearn)|
+------------------+     +---------------------+     |  - Content     |
                                                      |  - NCF (torch) |
                                                      +-------+--------+
                                                              |
                                                      +-------v--------+
                                                      |   FastAPI      |
                                                      |   REST API     |
                                                      +-------+--------+
                                                              |
                                                      +-------v--------+
                                                      |   Redis Cache  |
                                                      +----------------+
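The serving path in the diagram — a model layer behind a REST API, with a cache in front — can be sketched with plain-Python stand-ins. Every class below (`InMemoryCache`, `DummyModel`, `serve_recommendations`) is a hypothetical illustration of the layering, not one of the modules we build later:

```python
class InMemoryCache:
    """Stand-in for Redis: a plain dict keyed by user_id."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


class DummyModel:
    """Stand-in for the model layer: returns fixed movie ids."""
    def recommend(self, user_id, k=3):
        return [(user_id + i) % 1682 for i in range(1, k + 1)]


def serve_recommendations(user_id, cache, model):
    """API-layer logic: return a cache hit, or compute and store on a miss."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    recs = model.recommend(user_id)
    cache.set(user_id, recs)
    return recs


cache, model = InMemoryCache(), DummyModel()
first = serve_recommendations(42, cache, model)   # miss: computed by the model
second = serve_recommendations(42, cache, model)  # hit: served from the cache
```

The point of the layering is that the API code never needs to know which model produced a recommendation, and the model never needs to know about caching.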

The MovieLens Dataset

We use MovieLens 100K, a widely-used benchmark dataset from the GroupLens research lab at the University of Minnesota. It contains:

  • 100,000 ratings from 943 users on 1,682 movies
  • Ratings on a 1–5 integer scale
  • Demographic data (age, gender, occupation, zip code)
  • Movie metadata (title, release date, genres, IMDb URL)

Why MovieLens? It is the standard benchmark for recommendation system research. Results on MovieLens are directly comparable across academic papers and industry tutorials. The 100K variant is small enough to iterate quickly on a laptop but large enough to demonstrate real patterns.
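The ratings file (`u.data`) is tab-separated with no header row: user_id, item_id, rating, timestamp. Here is a quick look at parsing that format with pandas, using three made-up rows in the same layout rather than the real file:

```python
import io
import pandas as pd

# Three example rows in the u.data layout: user_id \t item_id \t rating \t timestamp
sample = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

ratings = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)

print(ratings.shape)                           # (3, 4)
print(ratings["rating"].between(1, 5).all())   # True: ratings are 1-5 integers
```

In the next lesson we will load the full file the same way, just pointing `read_csv` at `data/ml-100k/u.data` instead of a string buffer.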

Tech Stack

Component      Technology          Purpose
-------------  ------------------  ------------------------------------------------
Language       Python 3.9+         Core development language
Data           pandas, NumPy       Data loading, manipulation, and matrix operations
ML             scikit-learn        TF-IDF, cosine similarity, train/test splitting
Deep Learning  PyTorch             Neural collaborative filtering model
API            FastAPI + Uvicorn   REST API serving and async request handling
Caching        Redis               Low-latency recommendation caching
Testing        pytest, httpx       Unit and integration tests
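Once everything is installed, you can sanity-check the stack without importing the heavy libraries themselves by probing for each module spec. Note that the module names below are import names, which sometimes differ from pip package names (scikit-learn installs as `sklearn`); this checker is a convenience sketch, not part of the project code:

```python
import importlib.util

def check_stack(modules=("pandas", "numpy", "sklearn", "torch", "fastapi", "redis")):
    """Return {import_name: installed?} without actually importing anything."""
    return {m: importlib.util.find_spec(m) is not None for m in modules}

status = check_stack()
missing = [name for name, present in status.items() if not present]
print("all present" if not missing else f"missing: {missing}")
```

If anything shows up as missing, re-run the corresponding `pip install` line from the setup steps below.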

Environment Setup

Create the project structure and install all dependencies:

Bash
# Create project directory
mkdir recommendation-engine && cd recommendation-engine

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Create project structure
mkdir -p data models api tests

# Install dependencies
pip install pandas numpy scikit-learn scipy
pip install torch  # torchvision is not needed; our NCF model uses torch only
pip install fastapi uvicorn redis
pip install pytest httpx
pip install matplotlib seaborn jupyter

requirements.txt

Text
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
scipy>=1.11.0
torch>=2.0.0
fastapi>=0.100.0
uvicorn>=0.23.0
redis>=5.0.0
pytest>=7.4.0
httpx>=0.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
jupyter>=1.0.0

Download the Dataset

Python
import os
import urllib.request
import zipfile

def download_movielens(data_dir="data"):
    """Download and extract MovieLens 100K dataset."""
    url = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"
    zip_path = os.path.join(data_dir, "ml-100k.zip")
    extract_path = os.path.join(data_dir, "ml-100k")

    if os.path.exists(extract_path):
        print("Dataset already downloaded.")
        return extract_path

    os.makedirs(data_dir, exist_ok=True)
    print("Downloading MovieLens 100K...")
    urllib.request.urlretrieve(url, zip_path)

    print("Extracting...")
    with zipfile.ZipFile(zip_path, "r") as z:
        z.extractall(data_dir)

    os.remove(zip_path)
    print(f"Dataset ready at {extract_path}")
    return extract_path

# Run it
data_path = download_movielens()
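After extraction, it is worth confirming the key files actually landed on disk before moving on. A small integrity check, based on the files the 100K release ships with:

```python
import os

# Core files from the MovieLens 100K release that later lessons depend on.
EXPECTED_FILES = ["u.data", "u.item", "u.user", "u.genre", "u.info"]

def verify_dataset(extract_path):
    """Return a list of expected files missing from the dataset directory."""
    return [f for f in EXPECTED_FILES
            if not os.path.exists(os.path.join(extract_path, f))]

missing = verify_dataset("data/ml-100k")
if missing:
    print(f"Missing files: {missing} -- re-run download_movielens().")
else:
    print("All expected files present.")
```

A failed check usually means a partial download; deleting `data/ml-100k` and re-running the download function is the simplest fix.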

Project Directory Structure

Text
recommendation-engine/
  data/
    ml-100k/           # MovieLens dataset files
      u.data             # 100K ratings (user_id, item_id, rating, timestamp)
      u.item             # Movie metadata (id, title, genres...)
      u.user             # User demographics
  models/
    collaborative.py   # User-based and item-based CF
    content_based.py   # TF-IDF content filtering
    ncf_model.py       # Neural collaborative filtering (PyTorch)
    hybrid.py          # Hybrid recommender combining approaches
  api/
    main.py            # FastAPI application
    schemas.py         # Pydantic response models
    cache.py           # Redis caching layer
  tests/
    test_models.py     # Model unit tests
    test_api.py        # API integration tests
  requirements.txt
  README.md

Ready to Code: With the environment set up and dataset downloaded, you are ready to begin building. The next lesson covers data loading, exploratory analysis, and building the user-item interaction matrix that all our models will consume.

Next: Data Preparation

Load the MovieLens dataset, explore rating distributions, and build the user-item matrix for our recommendation algorithms.