Beginner

Project Setup & Architecture

In this first step, you will understand the end-to-end ASR → LLM → TTS pipeline, set up the project structure, install all dependencies, configure API keys, and verify that Whisper, OpenAI, and ElevenLabs are connected. By the end you will have a running FastAPI server ready for voice processing.

Architecture Overview

A voice assistant is a pipeline of three stages that run in sequence for every user turn. Understanding this pipeline is critical before writing any code.

        Microphone
            |
            v
   +------------------+
   |  ASR (Whisper)   |   Stage 1: Speech-to-Text
   +------------------+
            |
        transcript
            |
            v
   +------------------+
   |   LLM (GPT-4o)   |   Stage 2: Understanding + Response
   +------------------+
            |
       response text
            |
            v
   +------------------+
   | TTS (ElevenLabs) |   Stage 3: Text-to-Speech
   +------------------+
            |
        audio stream
            |
            v
         Speaker

  • ASR (Automatic Speech Recognition): Converts raw audio from the microphone into text. We use OpenAI Whisper for its accuracy across accents and noise levels.
  • LLM (Large Language Model): Processes the transcript, maintains conversation context, calls tools when needed, and generates a text response. We use GPT-4o for its speed and tool-calling ability.
  • TTS (Text-to-Speech): Converts the LLM response text into natural-sounding audio. We use ElevenLabs for premium quality and OpenAI TTS as a fallback.
💡
Latency Budget: The total round-trip time from the user finishing their sentence to hearing the first word of the response should be under 2 seconds. We achieve this by streaming at every stage: streaming audio to Whisper, streaming tokens from GPT-4o, and streaming audio chunks from TTS.
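The three stages and the timing concern can be sketched end to end with placeholder stage functions (the real Whisper, GPT-4o, and ElevenLabs clients replace these stubs in later lessons):

```python
import time

# Placeholder stages -- stand-ins for the real clients built later
def transcribe(audio: bytes) -> str:
    return "what's the weather"

def generate_reply(transcript: str) -> str:
    return "It's sunny today."

def synthesize(text: str) -> bytes:
    return b"\x00" * 1024

def run_pipeline(audio: bytes) -> bytes:
    """Run one user turn through ASR -> LLM -> TTS, timing each stage."""
    timings = {}

    t0 = time.perf_counter()
    transcript = transcribe(audio)
    timings["asr"] = time.perf_counter() - t0

    t1 = time.perf_counter()
    reply = generate_reply(transcript)
    timings["llm"] = time.perf_counter() - t1

    t2 = time.perf_counter()
    speech = synthesize(reply)
    timings["tts"] = time.perf_counter() - t2

    total = sum(timings.values())
    print({k: round(v, 3) for k, v in timings.items()}, "total:", round(total, 3))
    return speech

audio_out = run_pipeline(b"\x00" * 16000)  # one second of silent 16 kHz audio
```

With real network calls, each stage's timing tells you where the 2-second budget is being spent; streaming shrinks the gap between stages by overlapping them instead of waiting for each to finish.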

Step 1: Create the Project Structure

Create the following directory layout. You will build each of these files over the coming lessons.

voice-assistant/
├── docker-compose.yml
├── Dockerfile
├── .env
├── .env.example
├── requirements.txt
├── app/
│   ├── __init__.py
│   ├── main.py                # FastAPI application entry point
│   ├── config.py              # Environment config with pydantic-settings
│   ├── asr/
│   │   ├── __init__.py
│   │   ├── whisper_client.py  # Whisper API client
│   │   └── audio_utils.py     # Audio format conversion helpers
│   ├── llm/
│   │   ├── __init__.py
│   │   ├── conversation.py    # Dialog manager with memory
│   │   ├── tools.py           # Tool definitions (weather, timer, search)
│   │   └── engine.py          # LLM orchestrator with streaming
│   ├── tts/
│   │   ├── __init__.py
│   │   ├── elevenlabs_client.py  # ElevenLabs streaming TTS
│   │   ├── openai_tts.py         # OpenAI TTS fallback
│   │   └── voice_manager.py      # Voice selection and caching
│   └── ws/
│       ├── __init__.py
│       └── handler.py         # WebSocket connection handler
├── frontend/
│   ├── index.html             # Voice assistant UI
│   ├── app.js                 # Client-side audio + WebSocket logic
│   └── style.css              # UI styles
└── tests/
    ├── test_asr.py
    ├── test_llm.py
    └── test_tts.py

Run these commands to create the structure:

# Create project directory
mkdir -p voice-assistant/{app/{asr,llm,tts,ws},frontend,tests}

# Create __init__.py files
touch voice-assistant/app/__init__.py
touch voice-assistant/app/asr/__init__.py
touch voice-assistant/app/llm/__init__.py
touch voice-assistant/app/tts/__init__.py
touch voice-assistant/app/ws/__init__.py
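To confirm the layout, count the package markers. These commands are safe to re-run (mkdir -p and touch are idempotent) from the directory that contains voice-assistant/:

```shell
# Recreate the package directories and markers (no-ops if they already exist),
# then verify that all five __init__.py files are in place
mkdir -p voice-assistant/app/asr voice-assistant/app/llm voice-assistant/app/tts voice-assistant/app/ws
for d in "" asr/ llm/ tts/ ws/; do touch "voice-assistant/app/${d}__init__.py"; done
find voice-assistant -name "__init__.py" | sort
```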

Step 2: Define Dependencies

Create requirements.txt with all the packages we need:

# requirements.txt
fastapi==0.115.6
uvicorn[standard]==0.34.0
websockets==14.1
python-dotenv==1.0.1
pydantic-settings==2.7.1

# ASR - Speech to Text
openai==1.58.1

# LLM - Conversation Engine
# (uses openai package above)

# TTS - Text to Speech
elevenlabs==1.15.0
httpx==0.28.1

# Audio processing
numpy==2.2.1
soundfile==0.12.1
pydub==0.25.1
webrtcvad==2.0.10

# Utilities
python-multipart==0.0.20
aiofiles==24.1.0

Step 3: Environment Configuration

Create .env.example (commit this) and .env (never commit this):

# .env.example - Copy to .env and fill in your values
OPENAI_API_KEY=sk-your-openai-key-here
ELEVENLABS_API_KEY=your-elevenlabs-key-here

# ASR Settings
WHISPER_MODEL=whisper-1
WHISPER_LANGUAGE=en

# LLM Settings
LLM_MODEL=gpt-4o
LLM_TEMPERATURE=0.7
LLM_MAX_TOKENS=1024

# TTS Settings
TTS_PROVIDER=elevenlabs
ELEVENLABS_VOICE_ID=21m00Tcm4TlvDq8ikWAM
ELEVENLABS_MODEL=eleven_turbo_v2_5
OPENAI_TTS_MODEL=tts-1
OPENAI_TTS_VOICE=alloy

# Server Settings
HOST=0.0.0.0
PORT=8000
LOG_LEVEL=INFO

Now create the config module that loads these values with validation:

# app/config.py
from pydantic_settings import BaseSettings, SettingsConfigDict
from functools import lru_cache
from typing import Literal


class Settings(BaseSettings):
    """Application settings loaded from environment variables."""

    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    # API Keys
    openai_api_key: str
    elevenlabs_api_key: str = ""

    # ASR Settings
    whisper_model: str = "whisper-1"
    whisper_language: str = "en"

    # LLM Settings
    llm_model: str = "gpt-4o"
    llm_temperature: float = 0.7
    llm_max_tokens: int = 1024

    # TTS Settings
    tts_provider: Literal["elevenlabs", "openai"] = "elevenlabs"
    elevenlabs_voice_id: str = "21m00Tcm4TlvDq8ikWAM"
    elevenlabs_model: str = "eleven_turbo_v2_5"
    openai_tts_model: str = "tts-1"
    openai_tts_voice: str = "alloy"

    # Server Settings
    host: str = "0.0.0.0"
    port: int = 8000
    log_level: str = "INFO"


@lru_cache()
def get_settings() -> Settings:
    """Cached settings instance - loaded once, reused everywhere."""
    return Settings()
💡
Why pydantic-settings? It validates your environment variables at startup. If OPENAI_API_KEY is missing, you get a clear error immediately instead of a cryptic failure mid-conversation. The @lru_cache ensures the .env file is read only once.
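The caching behavior is easy to see in isolation. This standard-library sketch uses a stand-in FakeSettings class (not the real Settings) to show that the loader body runs exactly once:

```python
from dataclasses import dataclass
from functools import lru_cache

load_count = 0  # counts how many times "the .env file" gets read


@dataclass
class FakeSettings:
    llm_model: str = "gpt-4o"


@lru_cache()
def get_settings() -> FakeSettings:
    global load_count
    load_count += 1  # in the real app, this is where .env is parsed
    return FakeSettings()


a = get_settings()
b = get_settings()
assert a is b           # every caller gets the same cached instance
assert load_count == 1  # settings were loaded exactly once
print("settings loaded", load_count, "time(s)")
```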

Step 4: Create the FastAPI Entry Point

Create the initial app/main.py with health checks, CORS, and WebSocket endpoint stub:

# app/main.py
import logging
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse

from app.config import get_settings

settings = get_settings()

# Configure logging
logging.basicConfig(
    level=getattr(logging, settings.log_level),
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Create FastAPI app
app = FastAPI(
    title="Voice Assistant API",
    description="An end-to-end voice assistant with ASR, LLM, and TTS",
    version="1.0.0"
)

# CORS middleware (wide open for local development; restrict allow_origins in production)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Serve the frontend
app.mount("/static", StaticFiles(directory="frontend"), name="static")


@app.get("/")
async def root():
    """Serve the voice assistant UI."""
    return FileResponse("frontend/index.html")


@app.get("/health")
async def health_check():
    """Health check endpoint for monitoring."""
    return {
        "status": "healthy",
        "asr_model": settings.whisper_model,
        "llm_model": settings.llm_model,
        "tts_provider": settings.tts_provider,
    }


@app.websocket("/ws/voice")
async def voice_websocket(websocket: WebSocket):
    """WebSocket endpoint for real-time voice communication.

    Protocol:
    1. Client sends audio chunks (binary frames)
    2. Server transcribes with Whisper (sends transcript as text)
    3. Server generates LLM response (sends text stream)
    4. Server synthesizes speech (sends audio chunks as binary)
    """
    await websocket.accept()
    logger.info("Voice WebSocket connected")

    try:
        while True:
            # Receive audio data from client
            data = await websocket.receive_bytes()

            # TODO: Process through ASR -> LLM -> TTS pipeline
            # This will be implemented in the next lessons

            await websocket.send_json({
                "type": "status",
                "message": "Pipeline not yet implemented"
            })
    except WebSocketDisconnect:
        logger.info("Voice WebSocket disconnected")

Step 5: Start Everything and Verify

Start the server and verify that the setup works:

# 1. Create a virtual environment and install dependencies
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r requirements.txt

# 2. Copy .env.example to .env and add your API keys
cp .env.example .env
# Edit .env and set:
#   OPENAI_API_KEY=sk-your-actual-key
#   ELEVENLABS_API_KEY=your-actual-key

# 3. Start the FastAPI server
uvicorn app.main:app --reload --port 8000

# 4. Verify the API is running
curl http://localhost:8000/health
# Expected: {"status":"healthy","asr_model":"whisper-1","llm_model":"gpt-4o","tts_provider":"elevenlabs"}

Quick Smoke Test

Write a quick test to verify all three API connections:

# tests/test_connections.py
"""Smoke tests to verify all API services are connected."""
import os
from dotenv import load_dotenv

load_dotenv()


def test_whisper_connection():
    """Verify Whisper API works with a tiny audio sample."""
    from openai import OpenAI
    import numpy as np
    import soundfile as sf
    import io

    client = OpenAI()

    # Generate a 1-second silent WAV for testing
    sample_rate = 16000
    silence = np.zeros(sample_rate, dtype=np.float32)
    buffer = io.BytesIO()
    sf.write(buffer, silence, sample_rate, format="WAV")
    buffer.seek(0)
    buffer.name = "test.wav"

    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=buffer,
        language="en"
    )
    print(f"Whisper OK - transcription: '{response.text}'")


def test_llm_connection():
    """Verify GPT-4o API works."""
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Say hello in one word."}],
        max_tokens=10
    )
    reply = response.choices[0].message.content
    print(f"LLM OK - response: '{reply}'")


def test_elevenlabs_connection():
    """Verify ElevenLabs API works."""
    import httpx

    api_key = os.getenv("ELEVENLABS_API_KEY")
    if not api_key:
        print("ElevenLabs SKIPPED - no API key set")
        return

    response = httpx.get(
        "https://api.elevenlabs.io/v1/voices",
        headers={"xi-api-key": api_key}
    )
    voices = response.json().get("voices", [])
    print(f"ElevenLabs OK - {len(voices)} voices available")


if __name__ == "__main__":
    test_whisper_connection()
    test_llm_connection()
    test_elevenlabs_connection()
    print("\nAll smoke tests passed!")

# Run the smoke tests
python tests/test_connections.py
# Expected:
# Whisper OK - transcription: ''
# LLM OK - response: 'Hello!'
# ElevenLabs OK - 29 voices available
# All smoke tests passed!
📝
Checkpoint: At this point you should have a running FastAPI server on port 8000 with a health check endpoint. All three API connections (Whisper, GPT-4o, ElevenLabs) should pass the smoke tests. If any fail, double-check your API keys in the .env file.

Key Takeaways

  • A voice assistant is a three-stage pipeline: ASR (speech to text) → LLM (understanding and response) → TTS (text to speech).
  • Latency is critical — streaming at every stage keeps the round-trip under 2 seconds.
  • The project uses a clean modular structure: asr/, llm/, tts/, and ws/ are separate packages.
  • WebSocket communication enables real-time bidirectional audio streaming between the browser and server.

What's Next

In the next lesson, you will build the speech recognition module — integrating Whisper for real-time transcription with streaming audio capture, noise handling, and silence detection.