Beginner

Kafka Basics

Master the core Kafka concepts: topics, partitions, producers, consumers, and consumer groups that power ML data pipelines.

Core Concepts

  • Broker: A Kafka server that stores data and serves client requests. A cluster has multiple brokers.
  • Topic: A named stream of events (like a database table). Events are appended to topics.
  • Partition: Topics are split into partitions for parallelism. Each partition is an ordered, immutable log.
  • Offset: A sequential ID for each record within a partition. Consumers track their position via offsets.
  • Producer: Writes events to topics.
  • Consumer: Reads events from topics.
  • Consumer Group: A set of consumers that share a topic's partitions to read cooperatively; each partition is consumed by exactly one consumer in the group.
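The mapping from a record key to a partition can be sketched in a few lines. This is a simplified stand-in (Kafka's actual default partitioner hashes keys with murmur2), but it shows why keyed events preserve per-key ordering:

Python — Key-to-Partition Mapping (illustrative sketch)

```python
import zlib

def pick_partition(key: str, num_partitions: int) -> int:
    # Any stable hash of the key, modulo the partition count, gives
    # a deterministic partition. Real clients use murmur2 rather
    # than crc32; the principle is identical.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition, so all events for
# one user stay in order relative to each other.
assert pick_partition("u123", 6) == pick_partition("u123", 6)
assert 0 <= pick_partition("u123", 6) < 6
```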

Setting Up Kafka

Bash — Quick Start with Docker
# docker-compose.yml for Kafka + Zookeeper
# Start with: docker-compose up -d

# Install Python client
pip install confluent-kafka

# Create a topic (run inside the broker container if the CLI is not on your host)
kafka-topics --create \
  --bootstrap-server localhost:9092 \
  --topic ml-events \
  --partitions 6 \
  --replication-factor 1
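The docker-compose.yml mentioned in the comment above is not shown; a minimal single-broker sketch might look like the following (image names, versions, and ports are illustrative assumptions, not a production configuration):

YAML — docker-compose.yml (minimal sketch)

```yaml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0   # illustrative version
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0       # illustrative version
    depends_on: [zookeeper]
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```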

Producing Messages

Python — Kafka Producer
from confluent_kafka import Producer
import json

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'client.id': 'ml-producer'
})

def delivery_callback(err, msg):
    if err:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}")

# Send ML events
event = {
    "user_id": "u123",
    "action": "click",
    "item_id": "item456",
    "timestamp": "2026-03-15T10:30:00Z",
    "features": {"page_time": 12.5, "scroll_depth": 0.75}
}

producer.produce(
    topic='ml-events',
    key='u123',
    value=json.dumps(event).encode('utf-8'),
    callback=delivery_callback
)
producer.flush()

Consuming Messages

Python — Kafka Consumer
from confluent_kafka import Consumer
import json

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'ml-feature-pipeline',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': True
})

consumer.subscribe(['ml-events'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue

        event = json.loads(msg.value().decode('utf-8'))
        print(f"User: {event['user_id']}, Action: {event['action']}")

        # Process for ML feature computation
        # (compute_features is your own downstream function)
        compute_features(event)
finally:
    consumer.close()
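With 'enable.auto.commit': True, offsets are committed periodically in the background, which gives at-least-once delivery: if the process dies after handling a record but before its offset is committed, that record is redelivered on restart. A toy model (not the client API) makes the replay behavior concrete:

Python — Why auto-commit means at-least-once (toy model)

```python
class FakePartition:
    # Toy stand-in for a partition log plus its committed offset.
    def __init__(self, records):
        self.records = records
        self.committed = 0  # where a restarted consumer resumes

    def read(self, position):
        return self.records[position] if position < len(self.records) else None

    def commit(self, position):
        self.committed = position

log = FakePartition(["e0", "e1", "e2"])

# Process two records, then "crash" before any offset is committed.
position = log.committed
for _ in range(2):
    record = log.read(position)
    position += 1

# On restart, consumption resumes from the committed offset (0), so
# e0 and e1 are delivered a second time: downstream feature
# computation must tolerate duplicates, or offsets must be committed
# manually after processing.
position = log.committed
assert log.read(position) == "e0"
```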

Consumer Groups for ML

Consumer groups let you scale ML processing horizontally. Each consumer in a group reads from a disjoint set of partitions, so adding consumers increases throughput, up to the number of partitions; any consumers beyond that sit idle.

Python — Scaling with Consumer Groups
# Consumer 1 and Consumer 2 both in group "ml-inference"
# With 6 partitions, each consumer handles 3 partitions

consumer_config = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'ml-inference',      # Same group = load balanced
    'auto.offset.reset': 'latest',
    'max.poll.interval.ms': 300000,   # 5 min max for ML processing
    'session.timeout.ms': 30000
}

# Run multiple instances of this consumer for parallel inference
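How the group splits partitions can be sketched with a simplified version of Kafka's range assignor (the real assignment is negotiated through the group coordinator; this only illustrates the arithmetic):

Python — Partition Assignment (simplified range-assignor sketch)

```python
def range_assign(num_partitions, consumers):
    # Contiguous chunks of partitions, one chunk per consumer;
    # earlier consumers absorb the remainder.
    consumers = sorted(consumers)
    per, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = per + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment

# 6 partitions, 2 consumers in group "ml-inference":
print(range_assign(6, ["consumer-1", "consumer-2"]))
# → {'consumer-1': [0, 1, 2], 'consumer-2': [3, 4, 5]}

# A 7th consumer on a 6-partition topic would sit idle:
assert range_assign(6, [f"c{i}" for i in range(1, 8)])["c7"] == []
```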

Topic Design for ML

Topic Pattern    | Use Case                            | Partitioning Key
-----------------|-------------------------------------|-----------------
raw-events       | Ingestion of all user events        | user_id
ml-features      | Computed features for serving       | entity_id
ml-predictions   | Model output for downstream         | request_id
ml-feedback      | Labels and outcomes for retraining  | prediction_id

Partition by entity ID: For ML pipelines, partition by user_id or entity_id. This ensures all events for the same entity go to the same partition, enabling stateful feature computation without shuffles.
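Because all events for one entity arrive on the same partition (and therefore at the same consumer), per-entity state can live in plain process-local memory. A minimal sketch of such stateful feature computation, assuming the event shape from the producer example above:

Python — Stateful Per-Entity Features (sketch)

```python
from collections import defaultdict

# Process-local state is safe here: partitioning by user_id
# guarantees one consumer sees every event for a given user.
state = defaultdict(lambda: {"events": 0, "page_time_sum": 0.0})

def update_features(event):
    s = state[event["user_id"]]
    s["events"] += 1
    s["page_time_sum"] += event["features"]["page_time"]
    return {
        "user_id": event["user_id"],
        "event_count": s["events"],
        "avg_page_time": s["page_time_sum"] / s["events"],
    }

update_features({"user_id": "u1", "action": "click",
                 "features": {"page_time": 10.0}})
features = update_features({"user_id": "u1", "action": "view",
                            "features": {"page_time": 20.0}})
assert features == {"user_id": "u1", "event_count": 2, "avg_page_time": 15.0}
```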