Project Setup & Architecture
In this first step, you will understand the end-to-end ASR → LLM → TTS pipeline, set up the project structure, install all dependencies, configure API keys, and verify that Whisper, OpenAI, and ElevenLabs are connected. By the end you will have a running FastAPI server ready for voice processing.
Architecture Overview
A voice assistant is a pipeline of three stages that run in sequence for every user turn. Understanding this pipeline is critical before writing any code.
Microphone
|
v
+------------------+
| ASR (Whisper) | Stage 1: Speech-to-Text
+------------------+
|
transcript
|
v
+------------------+
| LLM (GPT-4o) | Stage 2: Understanding + Response
+------------------+
|
response text
|
v
+------------------+
| TTS (ElevenLabs) | Stage 3: Text-to-Speech
+------------------+
|
audio stream
|
v
Speaker
- ASR (Automatic Speech Recognition): Converts raw audio from the microphone into text. We use OpenAI Whisper for its accuracy across accents and noise levels.
- LLM (Large Language Model): Processes the transcript, maintains conversation context, calls tools when needed, and generates a text response. We use GPT-4o for its speed and tool-calling ability.
- TTS (Text-to-Speech): Converts the LLM response text into natural-sounding audio. We use ElevenLabs for premium quality and OpenAI TTS as a fallback.
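The turn-taking flow above can be sketched as one async function per turn. This is a minimal sketch with each stage stubbed out so only the shape of the pipeline shows; the real Whisper, GPT-4o, and ElevenLabs calls are wired in during later lessons, and the stub return values here are placeholders:

```python
import asyncio

async def transcribe(audio: bytes) -> str:
    # Stage 1 (ASR): placeholder for the Whisper call built in a later lesson
    return "what's the weather?"

async def respond(transcript: str) -> str:
    # Stage 2 (LLM): placeholder for the GPT-4o call
    return f"You said: {transcript}"

async def synthesize(text: str) -> bytes:
    # Stage 3 (TTS): placeholder for the ElevenLabs call
    return text.encode("utf-8")

async def handle_turn(audio: bytes) -> bytes:
    """One user turn: audio in, audio out, through all three stages in sequence."""
    transcript = await transcribe(audio)
    reply = await respond(transcript)
    return await synthesize(reply)

if __name__ == "__main__":
    print(asyncio.run(handle_turn(b"\x00\x01")))
```

Every real stage later keeps this same signature, so the orchestration code stays unchanged as the stubs are replaced.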
Step 1: Create the Project Structure
Create the following directory layout. Every file will be built throughout this course.
voice-assistant/
├── docker-compose.yml
├── Dockerfile
├── .env
├── .env.example
├── requirements.txt
├── app/
│ ├── __init__.py
│ ├── main.py # FastAPI application entry point
│ ├── config.py # Environment config with pydantic-settings
│ ├── asr/
│ │ ├── __init__.py
│ │ ├── whisper_client.py # Whisper API client
│ │ └── audio_utils.py # Audio format conversion helpers
│ ├── llm/
│ │ ├── __init__.py
│ │ ├── conversation.py # Dialog manager with memory
│ │ ├── tools.py # Tool definitions (weather, timer, search)
│ │ └── engine.py # LLM orchestrator with streaming
│ ├── tts/
│ │ ├── __init__.py
│ │ ├── elevenlabs_client.py # ElevenLabs streaming TTS
│ │ ├── openai_tts.py # OpenAI TTS fallback
│ │ └── voice_manager.py # Voice selection and caching
│ └── ws/
│ ├── __init__.py
│ └── handler.py # WebSocket connection handler
├── frontend/
│ ├── index.html # Voice assistant UI
│ ├── app.js # Client-side audio + WebSocket logic
│ └── style.css # UI styles
└── tests/
├── test_asr.py
├── test_llm.py
└── test_tts.py
Run these commands to create the structure:
# Create project directory
mkdir -p voice-assistant/{app/{asr,llm,tts,ws},frontend,tests}
# Create __init__.py files
touch voice-assistant/app/__init__.py
touch voice-assistant/app/asr/__init__.py
touch voice-assistant/app/llm/__init__.py
touch voice-assistant/app/tts/__init__.py
touch voice-assistant/app/ws/__init__.py
Step 2: Define Dependencies
Create requirements.txt with all the packages we need:
# requirements.txt
fastapi==0.115.6
uvicorn[standard]==0.34.0
websockets==14.1
python-dotenv==1.0.1
pydantic-settings==2.7.1
# ASR - Speech to Text
openai==1.58.1
# LLM - Conversation Engine
# (uses openai package above)
# TTS - Text to Speech
elevenlabs==1.15.0
httpx==0.28.1
# Audio processing
numpy==2.2.1
soundfile==0.12.1
pydub==0.25.1
webrtcvad==2.0.10
# Utilities
python-multipart==0.0.20
aiofiles==24.1.0
Step 3: Environment Configuration
Create .env.example (commit this) and .env (never commit this):
# .env.example - Copy to .env and fill in your values
OPENAI_API_KEY=sk-your-openai-key-here
ELEVENLABS_API_KEY=your-elevenlabs-key-here
# ASR Settings
WHISPER_MODEL=whisper-1
WHISPER_LANGUAGE=en
# LLM Settings
LLM_MODEL=gpt-4o
LLM_TEMPERATURE=0.7
LLM_MAX_TOKENS=1024
# TTS Settings
TTS_PROVIDER=elevenlabs
ELEVENLABS_VOICE_ID=21m00Tcm4TlvDq8ikWAM
ELEVENLABS_MODEL=eleven_turbo_v2_5
OPENAI_TTS_MODEL=tts-1
OPENAI_TTS_VOICE=alloy
# Server Settings
HOST=0.0.0.0
PORT=8000
LOG_LEVEL=INFO
Now create the config module that loads these values with validation:
# app/config.py
from functools import lru_cache
from typing import Literal

from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Application settings loaded from environment variables."""

    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    # API Keys
    openai_api_key: str
    elevenlabs_api_key: str = ""

    # ASR Settings
    whisper_model: str = "whisper-1"
    whisper_language: str = "en"

    # LLM Settings
    llm_model: str = "gpt-4o"
    llm_temperature: float = 0.7
    llm_max_tokens: int = 1024

    # TTS Settings
    tts_provider: Literal["elevenlabs", "openai"] = "elevenlabs"
    elevenlabs_voice_id: str = "21m00Tcm4TlvDq8ikWAM"
    elevenlabs_model: str = "eleven_turbo_v2_5"
    openai_tts_model: str = "tts-1"
    openai_tts_voice: str = "alloy"

    # Server Settings
    host: str = "0.0.0.0"
    port: int = 8000
    log_level: str = "INFO"


@lru_cache
def get_settings() -> Settings:
    """Cached settings instance: loaded once, reused everywhere."""
    return Settings()
Because openai_api_key has no default, Pydantic validates it at startup: if OPENAI_API_KEY is missing, you get a clear error immediately instead of a cryptic failure mid-conversation. The @lru_cache ensures the .env file is read only once.
Step 4: Create the FastAPI Entry Point
Create the initial app/main.py with health checks, CORS, and WebSocket endpoint stub:
# app/main.py
import logging

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles

from app.config import get_settings

settings = get_settings()

# Configure logging
logging.basicConfig(
    level=getattr(logging, settings.log_level),
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

# Create FastAPI app
app = FastAPI(
    title="Voice Assistant API",
    description="An end-to-end voice assistant with ASR, LLM, and TTS",
    version="1.0.0",
)

# CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Serve the frontend
app.mount("/static", StaticFiles(directory="frontend"), name="static")


@app.get("/")
async def root():
    """Serve the voice assistant UI."""
    return FileResponse("frontend/index.html")


@app.get("/health")
async def health_check():
    """Health check endpoint for monitoring."""
    return {
        "status": "healthy",
        "asr_model": settings.whisper_model,
        "llm_model": settings.llm_model,
        "tts_provider": settings.tts_provider,
    }


@app.websocket("/ws/voice")
async def voice_websocket(websocket: WebSocket):
    """WebSocket endpoint for real-time voice communication.

    Protocol:
        1. Client sends audio chunks (binary frames)
        2. Server transcribes with Whisper (sends transcript as text)
        3. Server generates LLM response (sends text stream)
        4. Server synthesizes speech (sends audio chunks as binary)
    """
    await websocket.accept()
    logger.info("Voice WebSocket connected")
    try:
        while True:
            # Receive audio data from client
            data = await websocket.receive_bytes()
            # TODO: Process through ASR -> LLM -> TTS pipeline
            # This will be implemented in the next lessons
            await websocket.send_json({
                "type": "status",
                "message": "Pipeline not yet implemented",
            })
    except WebSocketDisconnect:
        logger.info("Voice WebSocket disconnected")
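A throwaway client can exercise the stub endpoint once the server is up. This sketch uses the websockets package already pinned in requirements.txt; the URL and the silent-PCM payload are assumptions for local testing, and parse_server_message is a hypothetical helper added here to check the reply shape:

```python
import asyncio
import json

def parse_server_message(raw: str) -> dict:
    """Decode a JSON text frame from the server and sanity-check its shape."""
    msg = json.loads(raw)
    if "type" not in msg:
        raise ValueError(f"unexpected server message: {msg}")
    return msg

async def probe_voice_endpoint(url: str = "ws://localhost:8000/ws/voice") -> dict:
    """Send one binary audio frame and return the server's first JSON reply."""
    import websockets  # pinned in requirements.txt

    async with websockets.connect(url) as ws:
        await ws.send(b"\x00" * 3200)  # ~100 ms of silent 16 kHz 16-bit PCM
        return parse_server_message(await ws.recv())

# Against a running server:
#   print(asyncio.run(probe_voice_endpoint()))
```

With the stub handler above, the reply should be the {"type": "status", ...} message; later lessons replace it with transcript, text, and audio frames.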
Step 5: Start Everything and Verify
Start the server and verify that the setup works:
# 1. Create a virtual environment and install dependencies
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
# 2. Copy .env.example to .env and add your API keys
cp .env.example .env
# Edit .env and set:
# OPENAI_API_KEY=sk-your-actual-key
# ELEVENLABS_API_KEY=your-actual-key
# 3. Start the FastAPI server
uvicorn app.main:app --reload --port 8000
# 4. Verify the API is running
curl http://localhost:8000/health
# Expected: {"status":"healthy","asr_model":"whisper-1","llm_model":"gpt-4o","tts_provider":"elevenlabs"}
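The same check can be scripted in Python. expect_health is a hypothetical helper name used here to validate the payload shape the /health endpoint reports; the live request is shown in a comment since it needs a running server:

```python
def expect_health(payload: dict) -> None:
    """Raise if the /health payload is missing the fields main.py reports."""
    for key in ("status", "asr_model", "llm_model", "tts_provider"):
        if key not in payload:
            raise AssertionError(f"missing field: {key}")
    if payload["status"] != "healthy":
        raise AssertionError(f"unexpected status: {payload['status']}")

# Against a running server (httpx is pinned in requirements.txt):
#   import httpx
#   expect_health(httpx.get("http://localhost:8000/health", timeout=5.0).json())
```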
Quick Smoke Test
Write a quick test to verify all three API connections:
# tests/test_connections.py
"""Smoke tests to verify all API services are connected."""
import os

from dotenv import load_dotenv

load_dotenv()


def test_whisper_connection():
    """Verify the Whisper API works with a tiny audio sample."""
    import io

    import numpy as np
    import soundfile as sf
    from openai import OpenAI

    client = OpenAI()
    # Generate a 1-second silent WAV for testing
    sample_rate = 16000
    silence = np.zeros(sample_rate, dtype=np.float32)
    buffer = io.BytesIO()
    sf.write(buffer, silence, sample_rate, format="WAV")
    buffer.seek(0)
    buffer.name = "test.wav"
    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=buffer,
        language="en",
    )
    print(f"Whisper OK - transcription: '{response.text}'")


def test_llm_connection():
    """Verify the GPT-4o API works."""
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Say hello in one word."}],
        max_tokens=10,
    )
    reply = response.choices[0].message.content
    print(f"LLM OK - response: '{reply}'")


def test_elevenlabs_connection():
    """Verify the ElevenLabs API works."""
    import httpx

    api_key = os.getenv("ELEVENLABS_API_KEY")
    if not api_key:
        print("ElevenLabs SKIPPED - no API key set")
        return
    response = httpx.get(
        "https://api.elevenlabs.io/v1/voices",
        headers={"xi-api-key": api_key},
    )
    voices = response.json().get("voices", [])
    print(f"ElevenLabs OK - {len(voices)} voices available")


if __name__ == "__main__":
    test_whisper_connection()
    test_llm_connection()
    test_elevenlabs_connection()
    print("\nAll smoke tests passed!")
# Run the smoke tests
python tests/test_connections.py
# Expected:
# Whisper OK - transcription: ''
# LLM OK - response: 'Hello!'
# ElevenLabs OK - 29 voices available
# All smoke tests passed!
If any of these checks fail, double-check the API keys in your .env file.
Key Takeaways
- A voice assistant is a three-stage pipeline: ASR (speech to text) → LLM (understanding and response) → TTS (text to speech).
- Latency is critical — streaming at every stage keeps the round-trip under 2 seconds.
- The project uses a clean modular structure: asr/, llm/, tts/, and ws/ are separate packages.
- WebSocket communication enables real-time bidirectional audio streaming between the browser and server.
What's Next
In the next lesson, you will build the speech recognition module — integrating Whisper for real-time transcription with streaming audio capture, noise handling, and silence detection.
Lilly Tech Systems