Patterns & Tips
This final lesson consolidates everything you have learned into quick-reference guides for algorithm selection, complexity analysis, and interview preparation. Use this as your revision sheet before coding interviews.
Algorithm Selection Decision Tree
Use this decision tree to quickly determine which algorithm to use based on the problem characteristics.
Algorithm Selection Guide:
Q1: Do you need ALL elements sorted, or just the top K?
|
|-- Top K only --> Q2a
|-- All sorted --> Q2b
|
Q2a: Is the data streaming (arriving one by one)?
|-- Yes --> Min-heap of size K: O(n log k)
|-- No --> Quickselect: O(n) average
|
Q2b: What type of data?
|
|-- Integers with small range (k = O(n))
|     --> Counting Sort: O(n + k)
|
|-- Integers with large range
| --> Radix Sort: O(d * n) where d = digits
|
|-- Floats in [0, 1), uniformly distributed
| --> Bucket Sort: O(n) average
|
|-- General data --> Q3
|
Q3: Do you need stability (equal elements keep order)?
|-- Yes --> MergeSort: O(n log n), O(n) space
| Or Python's sorted() which uses Timsort
|-- No --> Q4
|
Q4: Is memory a constraint?
|-- Yes --> HeapSort: O(n log n), O(1) space
|-- No --> QuickSort: O(n log n) avg, fastest in practice
Special Cases:
- Nearly sorted data --> Timsort (Python built-in): O(n) best case
- Linked list --> MergeSort: no random access needed
- External (disk) data --> External MergeSort: sequential I/O
- Need to count inversions --> Modified MergeSort
- Need median of stream --> Two heaps (max-heap + min-heap)
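The last special case above, median of a stream, can be sketched with the two-heap technique: a max-heap (negated values, since heapq is a min-heap) holds the lower half and a min-heap holds the upper half. A minimal sketch:

```python
import heapq

class StreamMedian:
    """Maintain the median of a stream in O(log n) per insertion."""

    def __init__(self):
        self.lo = []  # max-heap (values negated) for the lower half
        self.hi = []  # min-heap for the upper half

    def add(self, x):
        heapq.heappush(self.lo, -x)
        # Move the largest of the lower half up, keeping lo <= hi
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        # Rebalance so lo has equal size or one extra element
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2
```

With an odd count the median sits on top of the larger heap; with an even count it is the average of the two tops.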
Complexity Cheat Sheet
Sorting Algorithms
| Algorithm | Best | Average | Worst | Space | Stable | Notes |
|---|---|---|---|---|---|---|
| QuickSort | O(n log n) | O(n log n) | O(n²) | O(log n) | No | Fastest in practice, randomize pivot |
| MergeSort | O(n log n) | O(n log n) | O(n log n) | O(n) | Yes | Guaranteed, good for linked lists |
| HeapSort | O(n log n) | O(n log n) | O(n log n) | O(1) | No | In-place, poor cache performance |
| Timsort | O(n) | O(n log n) | O(n log n) | O(n) | Yes | Python/Java default, great on real data |
| Counting Sort | O(n+k) | O(n+k) | O(n+k) | O(k) | Yes | k = value range, integers only |
| Radix Sort | O(dn) | O(dn) | O(dn) | O(n+k) | Yes | d = digits, k = base |
| Bucket Sort | O(n+k) | O(n+k) | O(n²) | O(n+k) | Yes | Uniform distribution, floats |
Searching and Selection
| Algorithm | Time | Space | Use Case |
|---|---|---|---|
| Binary Search | O(log n) | O(1) | Sorted array, threshold finding |
| Quickselect | O(n) avg | O(1) | Kth element, batch top-K |
| Heap (top-K) | O(n log k) | O(k) | Streaming top-K, priority queue |
| Two Heaps (median) | O(log n) add | O(n) | Streaming median |
| Merge K sorted | O(N log k) | O(k) | Distributed merge, multi-source |
Common Mistakes to Avoid
1. Full Sort for Top-K
Wrong: sorted(arr)[-k:] is O(n log n). Right: heapq.nlargest(k, arr) is O(n log k). When k = 10 and n = 10 million, log k ≈ 3 versus log n ≈ 23, so the heap does several times less comparison work and never sorts elements it will discard.
2. Binary Search Off-by-One
Wrong: Mixing up lo <= hi vs lo < hi and hi = mid vs hi = mid - 1. Right: Pick one template and stick with it. Test on arrays of size 0, 1, and 2.
3. Unstable Sort for Rankings
Wrong: Using QuickSort when equal-scored items must maintain original order. Right: Use MergeSort or Python's sorted() (Timsort is stable).
4. Counting Sort with Huge Range
Wrong: Counting sort with values up to 10^9 allocates a 10^9 array. Right: Use radix sort (processes digit by digit) or a regular comparison sort.
5. Not Randomizing QuickSort Pivot
Wrong: Always picking first/last element as pivot. Sorted input gives O(n²). Right: Random pivot or median-of-three. One line: random.randint(lo, hi).
6. Forgetting Edge Cases
Wrong: Not handling empty arrays, single elements, all-duplicate arrays, or already-sorted input. Right: Always test with [], [1], [1,1,1], and sorted input.
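A sketch of the single binary-search template recommended in mistake 2, assuming a sorted input list. It behaves like bisect.bisect_left and is safe on the edge-case arrays from mistake 6 (sizes 0, 1, and 2):

```python
def lower_bound(arr, target):
    """Return the first index i with arr[i] >= target (len(arr) if none).

    Invariant: the answer always lies in [lo, hi]; loop while lo < hi.
    """
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] < target:
            lo = mid + 1   # answer is strictly right of mid
        else:
            hi = mid       # mid could be the answer
    return lo
```

Because hi starts at len(arr) and the loop condition is strictly lo < hi, there is no `mid - 1` to get wrong and the empty array falls out naturally.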
Python Standard Library Quick Reference
import heapq
import bisect
from collections import Counter
from functools import cmp_to_key
# ---- SORTING ----
# Timsort: stable, O(n log n), O(n) on nearly sorted
sorted(arr) # New sorted list
arr.sort() # In-place sort
sorted(arr, key=lambda x: -x) # Descending
sorted(arr, key=lambda x: (x[1], -x[0])) # Multi-key
# Custom comparator (when a key function isn't enough),
# e.g. concatenation order for the "largest number" problem
def cmp(a, b):
    return -1 if a + b > b + a else 1 if a + b < b + a else 0
sorted(strs, key=cmp_to_key(cmp))
# ---- BINARY SEARCH ----
# bisect: operates on sorted lists
bisect.bisect_left(arr, x) # First position where x can be inserted
bisect.bisect_right(arr, x) # Position after all existing x values
bisect.insort_left(arr, x) # Insert x maintaining sorted order
bisect.insort_right(arr, x)
# ---- HEAPS (min-heap) ----
heapq.heapify(arr) # Convert list to heap in-place: O(n)
heapq.heappush(heap, val) # Push: O(log n)
heapq.heappop(heap) # Pop smallest: O(log n)
heapq.heapreplace(heap, val) # Pop then push: O(log n)
heapq.heappushpop(heap, val) # Push then pop: O(log n)
heapq.nlargest(k, iterable) # Top K largest: O(n log k)
heapq.nsmallest(k, iterable) # Top K smallest: O(n log k)
# Max-heap trick: negate values
heapq.heappush(max_heap, -val)
largest = -heapq.heappop(max_heap)
# ---- COUNTING ----
Counter(arr).most_common(k) # K most frequent: O(n log k)
Interview Problem Pattern Recognition
| When You See... | Think... | Algorithm |
|---|---|---|
| "Find the kth largest/smallest" | Selection problem | Quickselect O(n) or Heap O(n log k) |
| "Top K" or "K most frequent" | Partial sorting | Min-heap of size K |
| "Sorted array" + "find target" | Binary search | O(log n) with appropriate template |
| "Merge sorted" anything | Merge operation | Two pointers or K-way merge with heap |
| "Overlapping intervals" | Sort by start, then merge | Sort + linear scan O(n log n) |
| "Minimum rooms/resources" | Sweep line or heap | Events sort or min-heap of end times |
| "Sort with custom order" | Comparator function | cmp_to_key or tuple keys |
| "Median of stream" | Two heaps | Max-heap + min-heap |
| "Find threshold/boundary" | Binary search on answer | Binary search with predicate function |
| "Arrange to maximize/minimize" | Greedy + custom sort | Sort with proof of correctness |
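The "binary search on answer" row can be sketched with an illustrative problem (the function name and scenario here are hypothetical): find the minimum ship capacity that moves a list of package weights, in order, within a given number of days. Feasibility is monotone in capacity, which is exactly what makes binary search over the answer space valid:

```python
def min_capacity(weights, days):
    """Smallest capacity that ships all weights (in order) in <= days."""

    def feasible(cap):
        # Greedily pack each day up to cap; count days used
        used, load = 1, 0
        for w in weights:
            if load + w > cap:
                used += 1
                load = 0
            load += w
        return used <= days

    lo, hi = max(weights), sum(weights)   # answer must lie in this range
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid          # mid works; try smaller
        else:
            lo = mid + 1      # mid fails; need more capacity
    return lo
```

The predicate costs O(n), so the whole search is O(n log(sum - max)), matching the pattern: binary search over candidate answers, linear check per candidate.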
Frequently Asked Questions
Why does Python use Timsort as its built-in sort?
Timsort (a hybrid of merge sort and insertion sort) was chosen for Python because: (1) It is stable, which Python guarantees for sorted() and list.sort(). (2) It performs O(n) on nearly-sorted data, which is common in practice (appending to sorted lists, re-sorting after small changes). (3) It has a guaranteed O(n log n) worst case, unlike QuickSort's O(n²). (4) It has good cache performance because it processes runs of already-sorted elements. Java also adopted Timsort for Arrays.sort() on objects (replacing its earlier merge sort).
When should I implement a sort from scratch versus using the built-in?
In interviews: implement from scratch when explicitly asked ("implement QuickSort") or when the problem requires a modified sort (counting inversions with MergeSort, partial sort with Quickselect). Use built-in sort when the problem is about the logic around sorting (merge intervals, meeting rooms, custom ordering). In production code: always use built-in sort unless you have a specific reason not to (non-comparison sort for integer data, partial sort for top-K). Python's Timsort is highly optimized C code that you cannot beat in Python.
When should I use a heap versus Quickselect for top-K?
Use a heap (O(n log k)) when: (1) Data is streaming and you cannot store it all in memory. (2) K might change (you need both top-5 and top-10). (3) You need the results sorted. (4) Worst-case guarantees matter. Use Quickselect (O(n) average) when: (1) All data is in memory. (2) You only need the K elements, not sorted. (3) Average-case performance is acceptable (O(n²) worst case is rare with randomization). In Python, heapq.nlargest(k, arr) is usually the best choice because it automatically picks the optimal strategy based on k and n.
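A minimal randomized Quickselect sketch matching the batch case above (all data in memory, result unsorted, O(n) average):

```python
import random

def quickselect(arr, k):
    """Return the k-th smallest element (1-indexed), O(n) average.

    Works on a copy so the caller's list is untouched.
    """
    arr = list(arr)
    lo, hi = 0, len(arr) - 1
    target = k - 1
    while True:
        if lo == hi:
            return arr[lo]
        # Random pivot avoids the O(n^2) worst case on sorted input
        p = random.randint(lo, hi)
        arr[p], arr[hi] = arr[hi], arr[p]
        pivot, store = arr[hi], lo
        for i in range(lo, hi):        # Lomuto partition
            if arr[i] < pivot:
                arr[i], arr[store] = arr[store], arr[i]
                store += 1
        arr[store], arr[hi] = arr[hi], arr[store]
        if store == target:
            return arr[store]
        elif store < target:
            lo = store + 1             # answer is in the right part
        else:
            hi = store - 1             # answer is in the left part
```

Unlike QuickSort, only one side of each partition is revisited, which is why the expected cost telescopes to O(n) rather than O(n log n).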
What sorting algorithms do ML libraries use internally?
NumPy's np.sort() defaults to introsort (QuickSort with HeapSort fallback) for numeric types and Timsort for object arrays. np.partition() uses introselect (Quickselect with median-of-medians fallback) for O(n) partial sorting. PyTorch's torch.sort() uses parallel merge sort on GPU. XGBoost and LightGBM use radix sort and counting sort for histogram-based feature binning. FAISS uses partial sorts and heaps for approximate nearest neighbor retrieval. Understanding these choices helps you optimize ML pipelines.
How should I handle NaN values when sorting?
NaN values break comparison-based sorting because NaN is not equal to, less than, or greater than any value (including itself). Strategies: (1) Filter NaNs before sorting: sorted(x for x in arr if not math.isnan(x)). (2) Replace NaNs with a sentinel value: float('inf') to sort them last or float('-inf') to sort them first. (3) Use a custom key: sorted(arr, key=lambda x: (math.isnan(x), x)) puts NaNs at the end. NumPy's np.sort() puts NaNs at the end by default. Always handle NaNs explicitly in production ML code.
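The key-based strategy can be demonstrated in a few lines; False sorts before True, so the (is_nan, value) tuples push NaNs to the end:

```python
import math

data = [2.0, float('nan'), 1.0, 3.0, float('nan')]
# (is_nan, value) tuples: real values compare normally, NaNs trail
cleaned = sorted(data, key=lambda x: (math.isnan(x), x))
assert cleaned[:3] == [1.0, 2.0, 3.0]
assert all(math.isnan(x) for x in cleaned[3:])
```

Two NaN keys tie on the boolean and never win a `<` comparison against each other, so Timsort keeps them in stable order without raising.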
Why does sorting matter for ML evaluation metrics?
Many ML evaluation metrics are fundamentally sorting operations: (1) AUC-ROC: sort predictions by score, then compute TPR/FPR at each threshold. O(n log n). (2) Precision@K and Recall@K: sort predictions descending, take top K, count true positives. (3) NDCG (Normalized Discounted Cumulative Gain): sort by predicted relevance, compare with ideal ranking. (4) MAP (Mean Average Precision): sort by score, compute precision at each relevant item. (5) Kendall tau: count inversions between two rankings using merge sort. All these metrics require sorting, which is why efficient sorting is critical for ML evaluation at scale.
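The inversion count behind Kendall tau (item 5) is the classic modified-MergeSort exercise from the decision tree's special cases. A sketch:

```python
def count_inversions(arr):
    """Count pairs (i, j) with i < j and arr[i] > arr[j].

    Merge-sort variant: O(n log n) versus O(n^2) brute force.
    """
    def sort(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_l = sort(a[:mid])
        right, inv_r = sort(a[mid:])
        merged, inv = [], inv_l + inv_r
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # Every remaining left element exceeds right[j]:
                # each forms one inversion with it
                inv += len(left) - i
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv

    return sort(list(arr))[1]
```

Applied to one ranking expressed in the order of the other, this count plugs directly into the Kendall tau formula.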
How do I sort data that does not fit in memory?
External merge sort: (1) Read chunks that fit in memory. (2) Sort each chunk in memory using QuickSort/Timsort. (3) Write sorted chunks to disk. (4) Merge all sorted chunks using K-way merge (min-heap of size K where K = number of chunks). This is O(n log n) total with O(memory) space. The merge step reads sequentially from disk, which is efficient for HDDs and SSDs. In practice, tools like Unix sort, Apache Spark's sortBy(), and pandas' chunked reading handle this automatically. For ML: if your training data is too large to sort in memory, use Spark or Dask, or sort on the database side before loading.
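The K-way merge step (4) maps directly onto heapq.merge, which keeps a heap of size K and yields the global order lazily. A sketch with in-memory lists standing in for sorted runs on disk:

```python
import heapq

# Three already-sorted "chunks", standing in for sorted runs on disk;
# in the external setting each would be a file iterator
chunks = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

# heapq.merge holds one element per chunk, so memory stays O(K)
# even when the chunks themselves are huge streams
merged = list(heapq.merge(*chunks))
assert merged == list(range(1, 10))
```

Because heapq.merge accepts arbitrary iterables, swapping the lists for file-backed generators turns this into the real external merge with no other changes.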
Course Summary
- Lesson 1: Sorting and searching are foundational to ML ranking, retrieval, feature engineering, and evaluation
- Lesson 2: QuickSort (fastest average), MergeSort (stable, guaranteed), HeapSort (O(1) space) — know trade-offs
- Lesson 3: Non-comparison sorts break O(n log n) when data has structure (integers, bounded range, uniform)
- Lesson 4: Binary search patterns: exact match, first/last occurrence, rotated array, matrix search, threshold finding
- Lesson 5: Top-K problems are the most common sorting questions in ML interviews — use heaps for streaming, quickselect for batch
- Lesson 6: Custom sorting with comparators, multi-key sorts, interval merging, and resource scheduling
- Lesson 7 (this): Use the decision tree, cheat sheet, and pattern recognition table for rapid algorithm selection
Lilly Tech Systems