Beginner
Kafka Basics
Master the core Kafka concepts: topics, partitions, producers, consumers, and consumer groups that power ML data pipelines.
Core Concepts
- Broker: A Kafka server that stores data and serves client requests. A cluster has multiple brokers.
- Topic: A named stream of events (like a database table). Events are appended to topics.
- Partition: Topics are split into partitions for parallelism. Each partition is an ordered, immutable log.
- Offset: A sequential ID for each record within a partition. Consumers track their position via offsets.
- Producer: Writes events to topics.
- Consumer: Reads events from topics.
- Consumer Group: A set of consumers that cooperatively read from a topic; each partition is assigned to exactly one consumer in the group, so members share the load.
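The key-to-partition mapping can be sketched in a few lines. Kafka's default partitioner hashes the record key (with murmur2) modulo the partition count; the sketch below substitutes CRC32 as a simplified stand-in, just to show the property that matters: the same key always lands in the same partition.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Simplified stand-in for Kafka's default partitioner (which uses
    # murmur2); CRC32 here only illustrates the idea: hashing the key
    # deterministically means same key -> same partition, preserving
    # per-key ordering.
    return zlib.crc32(key) % num_partitions

# All events keyed by user "u123" land in one partition of a 6-partition topic
p = partition_for(b"u123", 6)
assert p == partition_for(b"u123", 6)  # deterministic mapping
```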
Setting Up Kafka
Bash — Quick Start with Docker
# docker-compose.yml for Kafka + Zookeeper
# Start with: docker-compose up -d
# Install Python client
pip install confluent-kafka
# Create a topic
kafka-topics --create \
  --bootstrap-server localhost:9092 \
  --topic ml-events \
  --partitions 6 \
  --replication-factor 1
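The docker-compose.yml referenced above is not shown in full; a minimal single-broker sketch, assuming the Confluent images (image tags, ports, and listener settings are illustrative and should be adapted to your environment):

```yaml
# docker-compose.yml -- minimal single-broker setup (illustrative)
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```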
Producing Messages
Python — Kafka Producer
from confluent_kafka import Producer
import json
producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'client.id': 'ml-producer'
})

def delivery_callback(err, msg):
    if err:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}")

# Send ML events
event = {
    "user_id": "u123",
    "action": "click",
    "item_id": "item456",
    "timestamp": "2026-03-15T10:30:00Z",
    "features": {"page_time": 12.5, "scroll_depth": 0.75}
}

producer.produce(
    topic='ml-events',
    key='u123',
    value=json.dumps(event).encode('utf-8'),
    callback=delivery_callback
)
producer.flush()  # block until all queued messages are delivered
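On the wire, the key and value are just bytes; a quick round-trip of the JSON encoding used above shows the producer and consumer sides agree on the format:

```python
import json

event = {"user_id": "u123", "action": "click",
         "features": {"page_time": 12.5, "scroll_depth": 0.75}}

# Serialize exactly as in the produce() call above ...
payload = json.dumps(event).encode('utf-8')
# ... and decode exactly as the consumer below does.
decoded = json.loads(payload.decode('utf-8'))
assert decoded == event  # lossless round-trip
```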
Consuming Messages
Python — Kafka Consumer
from confluent_kafka import Consumer
import json
consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'ml-feature-pipeline',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': True
})

consumer.subscribe(['ml-events'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        event = json.loads(msg.value().decode('utf-8'))
        print(f"User: {event['user_id']}, Action: {event['action']}")
        # Process for ML feature computation (compute_features is
        # application code, not part of the Kafka client)
        compute_features(event)
finally:
    consumer.close()
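With enable.auto.commit=True, offsets are committed on a timer, so a crash between processing a message and the next commit can redeliver it. One common pattern is to make processing idempotent; a minimal sketch, tracking seen (partition, offset) pairs in memory (in production this would live in a durable store):

```python
# Sketch: idempotent processing under at-least-once delivery.
# A consumer can see the same (partition, offset) twice after a crash
# or rebalance; skipping already-seen offsets makes reprocessing safe.
seen = set()

def process_once(partition: int, offset: int, event: dict, results: list) -> bool:
    if (partition, offset) in seen:
        return False          # duplicate delivery, skip
    seen.add((partition, offset))
    results.append(event)     # stand-in for compute_features(event)
    return True

out = []
process_once(0, 42, {"user_id": "u123"}, out)   # processed
process_once(0, 42, {"user_id": "u123"}, out)   # redelivery, skipped
```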
Consumer Groups for ML
Consumer groups let you scale ML processing horizontally. Each consumer in a group reads a disjoint set of partitions, so adding consumers increases throughput up to the number of partitions; consumers beyond that sit idle.
Python — Scaling with Consumer Groups
# Consumer 1 and Consumer 2 both in group "ml-inference"
# With 6 partitions, each consumer handles 3 partitions
consumer_config = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'ml-inference',        # Same group = load balanced
    'auto.offset.reset': 'latest',
    'max.poll.interval.ms': 300000,    # 5 min max between polls for ML processing
    'session.timeout.ms': 30000
}
# Run multiple instances of this consumer for parallel inference
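The load balancing above can be sketched in plain Python. The function below is a simplified round-robin stand-in for Kafka's actual group assignors (range, round-robin, cooperative-sticky), just to show how 6 partitions split across 2 consumers:

```python
def assign_partitions(num_partitions: int, consumers: list) -> dict:
    # Simplified round-robin assignment: partition p goes to consumer
    # p mod N. Kafka's real assignors are more involved, but the
    # load-balancing outcome is the same: disjoint, roughly equal shares.
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

a = assign_partitions(6, ["consumer-1", "consumer-2"])
# each consumer handles 3 of the 6 partitions, with no overlap
```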
Topic Design for ML
| Topic Pattern | Use Case | Partitioning Key |
|---|---|---|
| raw-events | Ingestion of all user events | user_id |
| ml-features | Computed features for serving | entity_id |
| ml-predictions | Model output for downstream | request_id |
| ml-feedback | Labels and outcomes for retraining | prediction_id |
Partition by entity ID: For ML pipelines, partition by user_id or entity_id. This ensures all events for the same entity go to the same partition, enabling stateful feature computation without shuffles.
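The stateful computation this enables can be sketched as follows. Because every event for a given user_id lands in the same partition, one consumer sees that user's events in order and can keep running aggregates locally; the field names mirror the example event above, and the specific features are illustrative:

```python
from collections import defaultdict

# Per-entity running aggregates, keyed by user_id. No cross-consumer
# shuffle is needed: partitioning by user_id guarantees each user's
# events all arrive at the same consumer, in order.
state = defaultdict(lambda: {"clicks": 0, "total_page_time": 0.0})

def update_features(event: dict) -> dict:
    s = state[event["user_id"]]
    if event["action"] == "click":
        s["clicks"] += 1
    s["total_page_time"] += event["features"]["page_time"]
    return s

f = update_features({"user_id": "u123", "action": "click",
                     "features": {"page_time": 12.5}})
# f now holds the running aggregates for user u123
```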
Lilly Tech Systems