Advanced Patterns & Tips

The difference between passing and failing a systems coding interview often comes down to code organization, edge-case handling, and testing. This lesson covers the most common implementation patterns, testing strategies for infrastructure code, and a comprehensive FAQ.

Top 10 Common Mistakes

These are the mistakes that cause candidates to fail systems coding interviews most frequently:

# ---- MISTAKE 1: Using time.time() directly everywhere ----
# BAD: Impossible to test, non-deterministic
class RateLimiter:
    def allow(self):
        if time.time() - self.last > self.window:  # Untestable!
            ...

# GOOD: Inject a clock function for testability
class RateLimiter:
    def __init__(self, clock=time.time):
        self._clock = clock
    def allow(self):
        if self._clock() - self.last > self.window:  # Testable!
            ...


# ---- MISTAKE 2: Not handling capacity=0 ----
# BAD:
class LRUCache:
    def put(self, key, val):
        if len(self._cache) >= self._capacity:  # Fails if capacity=0
            self._cache.popitem(last=False)
        self._cache[key] = val

# GOOD:
class LRUCache:
    def put(self, key, val):
        if self._capacity <= 0:
            return  # Cache with 0 capacity stores nothing
        ...


# ---- MISTAKE 3: Forgetting to handle duplicate keys ----
# BAD: LRU cache counts duplicate puts as new entries
# GOOD: Check if key exists before checking capacity
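To make the fix concrete, here is a minimal sketch of a put() that handles duplicate keys before checking capacity. This is an illustrative sketch, not a complete cache:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self._capacity = capacity
        self._cache = OrderedDict()

    def put(self, key, val):
        if self._capacity <= 0:
            return  # zero-capacity cache stores nothing
        if key in self._cache:
            # Duplicate key: update in place and refresh recency.
            # Do NOT count it against capacity as a new entry.
            self._cache[key] = val
            self._cache.move_to_end(key)
            return
        if len(self._cache) >= self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
        self._cache[key] = val
```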


# ---- MISTAKE 4: Thread-unsafe code without mentioning it ----
# BAD: Silently ignoring concurrency
# GOOD: Either use locks or explicitly state "single-threaded"
import threading
from collections import OrderedDict

class ThreadSafeCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = OrderedDict()

    def get(self, key):
        with self._lock:
            return self._cache.get(key)


# ---- MISTAKE 5: Not cleaning up expired state ----
# BAD: Rate limiter stores timestamps forever
# GOOD: Clean up old entries periodically or lazily
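As one sketch of lazy cleanup, a sliding-window limiter can prune stale timestamps on every allow() call, so per-client state stays bounded. Class and parameter names here are illustrative:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    def __init__(self, max_requests: int, window: float, clock=time.time):
        self._max = max_requests
        self._window = window
        self._clock = clock
        self._timestamps = defaultdict(deque)  # client_id -> request times

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        ts = self._timestamps[client_id]
        # Lazy cleanup: drop timestamps that fell out of the window,
        # instead of storing them forever.
        while ts and now - ts[0] > self._window:
            ts.popleft()
        if len(ts) < self._max:
            ts.append(now)
            return True
        return False
```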


# ---- MISTAKE 6: O(n) operations where O(1) is expected ----
# BAD: Finding least recently used by scanning all entries
# GOOD: Use OrderedDict.popitem(last=False) for O(1)


# ---- MISTAKE 7: Inconsistent return types ----
# BAD:
def get(self, key):
    if key in self._cache:
        return self._cache[key]  # Returns value
    return False                  # Returns boolean!

# GOOD:
from typing import Any, Optional

def get(self, key) -> Optional[Any]:
    return self._cache.get(key)   # Always returns value or None


# ---- MISTAKE 8: Not validating input in constructor ----
# BAD: Silently accepting invalid config
class RateLimiter:
    def __init__(self, max_requests):
        self._max = max_requests  # What if max_requests is -1?

# GOOD: Fail fast
class RateLimiter:
    def __init__(self, max_requests):
        if max_requests < 0:
            raise ValueError("max_requests must be non-negative")
        self._max = max_requests


# ---- MISTAKE 9: Modifying data structures during iteration ----
# BAD:
for key in self._store:
    if self._is_expired(key):
        del self._store[key]  # RuntimeError!

# GOOD:
expired = [k for k in self._store if self._is_expired(k)]
for key in expired:
    del self._store[key]


# ---- MISTAKE 10: Not providing get_stats() or __repr__() ----
# Adding observability methods shows production experience
def get_stats(self):
    return {
        "size": len(self._cache),
        "hit_rate": self._hits / max(1, self._hits + self._misses),
        "evictions": self._eviction_count,
    }

Implementation Patterns Cheat Sheet

Pattern | Data Structure | Time Complexity | Use Case
LRU eviction | OrderedDict | O(1) get/put | Caches with recency-based eviction
Priority ordering | heapq | O(log n) push/pop | Schedulers, priority queues
FIFO queue | deque | O(1) append/popleft | Message queues, sliding windows
Counting/grouping | defaultdict | O(1) access | Rate limiters, pub/sub subscribers
Sorted lookup | bisect + list | O(log n) search | Consistent hashing ring
Bit array | list of bool | O(k) per op | Bloom filters
State machine | Enum + transitions | O(1) | Circuit breaker
Middleware chain | Decorator pattern | O(1) | Auth, rate limiting, logging
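The "sorted lookup" pattern can be illustrated with a minimal consistent-hashing ring. The replica count, the choice of md5, and all names are illustrative assumptions:

```python
import bisect
import hashlib

def _position(key: str) -> int:
    # Stable hash position on the ring (md5 used here for the demo only).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, replicas: int = 3):
        self._replicas = replicas
        self._ring = []           # sorted list of virtual-node positions
        self._pos_to_node = {}

    def add_node(self, node: str):
        for i in range(self._replicas):
            pos = _position(f"{node}:{i}")
            bisect.insort(self._ring, pos)
            self._pos_to_node[pos] = node

    def get_node(self, key: str):
        if not self._ring:
            return None
        # Binary search for the first position >= hash(key), wrapping around.
        idx = bisect.bisect(self._ring, _position(key)) % len(self._ring)
        return self._pos_to_node[self._ring[idx]]
```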

Testing Infrastructure Code

# ---- Testing Strategy for System Implementations ----

# 1. BASIC FUNCTIONALITY: Happy path works
cache = LRUCache(capacity=3)
cache.put("a", 1)
assert cache.get("a") == 1, "Basic get after put"

# 2. EVICTION: Correct item is evicted
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)
cache.put("d", 4)  # Should evict "a"
assert cache.get("a") is None, "LRU item should be evicted"
assert cache.get("d") == 4, "New item should exist"

# 3. EDGE CASES:
# - Empty state: get on empty cache returns None
# - Capacity 0: all puts are no-ops
# - Capacity 1: every put evicts the previous item
# - Duplicate keys: update in place, don't double-count
# - Same key overwritten multiple times

# 4. ORDERING: Verify correct ordering behavior
cache = LRUCache(capacity=3)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)
cache.get("a")       # "a" is now most recent
cache.put("d", 4)    # Should evict "b", NOT "a"
assert cache.get("b") is None, "b should be evicted (LRU after a was accessed)"
assert cache.get("a") == 1, "a should still exist"

# 5. TIME-BASED TESTING: Inject a mock clock
class MockClock:
    def __init__(self):
        self._time = 0.0
    def __call__(self):
        return self._time
    def advance(self, seconds):
        self._time += seconds

clock = MockClock()
limiter = RateLimiter(max_requests=3, window=60, clock=clock)

assert limiter.allow("user1")
assert limiter.allow("user1")
assert limiter.allow("user1")
assert not limiter.allow("user1")  # Rate limited

clock.advance(61)  # Advance past window
assert limiter.allow("user1")      # Window reset

# 6. CONCURRENCY: Test thread safety
import threading

cache = ThreadSafeCache(capacity=100)
errors = []

def writer(start):
    for i in range(start, start + 100):
        try:
            cache.put(f"key_{i}", i)
        except Exception as e:
            errors.append(e)

threads = [threading.Thread(target=writer, args=(i*100,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(errors) == 0, f"Thread safety errors: {errors}"

# 7. STRESS TEST: Large volume
cache = LRUCache(capacity=1000)
for i in range(100000):
    cache.put(f"key_{i}", i)
assert cache.size() == 1000, "Should not exceed capacity"

💡
Interview tip: After implementing your solution, write 2-3 test cases out loud. This demonstrates testing discipline. Focus on: (1) basic functionality, (2) the eviction/limit boundary, and (3) one edge case. If you use a mock clock for time-dependent code, mention it — it shows you write testable code in production.

Component Quick Reference

# ---- Complete Component Cheat Sheet ----

# LRU CACHE
# Data: OrderedDict
# get(): lookup + move_to_end -> O(1)
# put(): insert + popitem(last=False) if over capacity -> O(1)

# LFU CACHE
# Data: key_to_val + key_to_freq + freq_to_keys(OrderedDict) + min_freq
# get(): lookup + update frequency bucket -> O(1)
# put(): insert + evict min_freq LRU item -> O(1)

# FIXED WINDOW RATE LIMITER
# Data: counter + window_start per client
# allow(): check window, increment counter -> O(1)

# TOKEN BUCKET RATE LIMITER
# Data: tokens + last_refill per client
# allow(): refill based on elapsed time, consume token -> O(1)

# SLIDING WINDOW RATE LIMITER
# Data: deque of timestamps per client
# allow(): remove old timestamps, count remaining -> O(cleanup)

# MESSAGE QUEUE
# Data: dict of messages + deque of message IDs
# enqueue(): append to deque -> O(1)
# dequeue(): pop from deque, set visibility timeout -> O(1)

# PRIORITY SCHEDULER
# Data: heapq with (priority, sequence, task)
# schedule(): heappush -> O(log n)
# next(): heappop (skip cancelled) -> O(log n)

# PUB/SUB
# Data: defaultdict(dict) for topic -> subscribers
# publish(): iterate subscribers, call callbacks -> O(subscribers)
# subscribe(): add to dict -> O(1)

# CONSISTENT HASHING
# Data: sorted list of hash positions + hash_to_node map
# get_node(): bisect to find position -> O(log n)
# add_node(): insert num_replicas positions -> O(r * log n)

# BLOOM FILTER
# Data: bit array of size m
# add(): set k hash positions -> O(k)
# contains(): check k hash positions -> O(k)

# CIRCUIT BREAKER
# Data: state enum + counters + timestamps
# allow_request(): check state -> O(1)
# record_failure(): increment counter, maybe trip -> O(1)

# RETRY WITH BACKOFF
# delay = base * 2^attempt (exponential)
# delay = random(0, delay) (full jitter)
# delay = min(delay, max_delay) (cap)
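The three formulas above combine into one small helper. This is a sketch; the base and max_delay defaults are arbitrary choices:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, max_delay: float = 30.0) -> float:
    delay = min(max_delay, base * (2 ** attempt))  # exponential, capped
    return random.uniform(0.0, delay)              # full jitter
```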

Frequently Asked Questions

Which problems come up most often?

In order of frequency: (1) LRU Cache — by far the most common, appears at Google, Meta, Amazon, Microsoft, and virtually every company. It tests your knowledge of OrderedDict or HashMap + DoublyLinkedList. (2) Rate Limiter — very common at API-focused companies like Stripe, Cloudflare, and AWS. Token bucket is the most-asked variant. (3) Message Queue — common at companies with distributed systems like Uber, LinkedIn, and Confluent. (4) Consistent Hashing — common at database and infrastructure companies. Focus your preparation on LRU cache and rate limiter first.

Should I use OrderedDict or build HashMap + DoublyLinkedList from scratch?

Start with OrderedDict — it is cleaner, less error-prone, and shows you know Python well. If the interviewer specifically says "no OrderedDict" or "implement the data structure from scratch," then build a HashMap + DoublyLinkedList. In that case, implement a Node class with prev/next pointers, and maintain head (least recent) and tail (most recent) sentinel nodes. The key operations: (1) _remove(node): unlink a node from the list, (2) _add_to_tail(node): add a node before the tail sentinel, (3) get(): find node in map, remove from list, add to tail, (4) put(): if exists update, else if full evict head.next, then add to tail.
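The from-scratch design described above can be sketched as follows. The _Node class and sentinel layout follow the description; the exact names and details are illustrative:

```python
class _Node:
    __slots__ = ("key", "val", "prev", "next")
    def __init__(self, key=None, val=None):
        self.key, self.val = key, val
        self.prev = self.next = None

class LRUCache:
    def __init__(self, capacity: int):
        self._capacity = capacity
        self._map = {}
        self._head = _Node()   # sentinel: least recent end
        self._tail = _Node()   # sentinel: most recent end
        self._head.next = self._tail
        self._tail.prev = self._head

    def _remove(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _add_to_tail(self, node):
        node.prev = self._tail.prev
        node.next = self._tail
        self._tail.prev.next = node
        self._tail.prev = node

    def get(self, key):
        node = self._map.get(key)
        if node is None:
            return None
        self._remove(node)          # refresh recency
        self._add_to_tail(node)
        return node.val

    def put(self, key, val):
        if self._capacity <= 0:
            return
        if key in self._map:
            node = self._map[key]   # update in place, refresh recency
            node.val = val
            self._remove(node)
            self._add_to_tail(node)
            return
        if len(self._map) >= self._capacity:
            lru = self._head.next   # evict least recently used
            self._remove(lru)
            del self._map[lru.key]
        node = _Node(key, val)
        self._map[key] = node
        self._add_to_tail(node)
```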

How long does the interview last, and how should I budget the time?

Typically 45-60 minutes total. For a single component like an LRU cache or rate limiter, the interviewer expects a complete, working implementation in 25-30 minutes, with 10-15 minutes for follow-up questions (e.g., "how would you make this distributed?" or "what if we need thread safety?"). Budget your time: 3-5 minutes for clarifying questions and API design, 20-25 minutes for implementation, 5-10 minutes for testing and edge cases. For more complex problems (like consistent hashing + replication), you may get the full 45 minutes for implementation.

Do I need to handle thread safety?

Unless the interviewer specifically asks for thread safety, implement the single-threaded version first and say: "This implementation is not thread-safe. To make it thread-safe, I would add a threading.Lock around the critical sections." If they do ask for thread safety, use threading.Lock with a with statement for the simplest correct solution. If pressed further, mention threading.RLock (reentrant, for code that acquires the same lock recursively) or a read-write lock for read-heavy workloads (not in the standard library; available from third-party packages). Never implement your own lock: use Python's built-in threading primitives.

What follow-up questions should I expect?

The most common follow-ups after implementing a system component: (1) "How would you make this distributed?" — For cache: consistent hashing for sharding, replication for availability. For rate limiter: Redis + Lua scripts for atomic operations. (2) "How would you handle persistence?" — Write-ahead log for durability, periodic snapshots. (3) "What happens under high concurrency?" — Locking strategies, lock-free data structures, eventual consistency trade-offs. (4) "How would you monitor this in production?" — Hit rate, latency percentiles, eviction rate, queue depth. Prepare 1-2 sentences for each of these.

Should I use dataclasses, type hints, and Enums?

Yes, use them judiciously. @dataclass for simple data containers (Message, Task, ApiResponse) saves boilerplate and shows modern Python knowledge. Type hints on method signatures (but not every local variable) improve readability. Enum for states (CircuitState, MessageStatus, Priority) prevents magic strings. However, do not over-engineer: if you are running low on time, plain dicts and strings are fine. The interviewer cares more about correct logic than perfect type annotations.

How do I test time-dependent code?

Inject a clock function. Instead of calling time.time() directly, accept a clock parameter in the constructor: def __init__(self, clock=time.time). In tests, pass a MockClock that you can advance manually: clock.advance(60). This is a standard dependency injection pattern that interviewers love to see. Mention it even if you do not have time to implement it: "In production, I would inject a clock for testability." This one sentence signals strong engineering practices.

What do interviewers actually look for?

Correct time complexity is the number one differentiator. Any candidate can implement an LRU cache with a list (O(n) eviction), but the interviewer expects O(1) for both get and put. Specifically: (1) O(1) LRU cache using OrderedDict or HashMap + DoublyLinkedList, (2) O(1) rate limiter using counters or token bucket, (3) O(log n) consistent hashing using bisect. After complexity, interviewers look for clean API design (clear method names, type hints) and edge case handling (capacity 0, empty state, duplicate keys). A working O(1) solution with a clean API beats a buggy O(1) solution with messy code.

Key Takeaways

💡
  • Inject dependencies (clock, store) for testable code — never call time.time() directly
  • Validate inputs in constructors: fail fast on invalid configuration
  • Use consistent return types: Optional[T] for lookups, bool for mutations
  • Add get_stats() and __repr__() for observability — this signals production experience
  • Test edge cases: capacity 0, empty state, duplicate keys, boundary conditions
  • Use mock clocks for time-dependent tests — never use time.sleep() in tests
  • Mention thread safety explicitly: either implement it or state the assumption
  • Keep the component cheat sheet handy for quick reference during preparation