Spark Fundamentals
Master the foundation of every Spark application — RDDs, DataFrames, SparkSQL, transformations, actions, lazy evaluation, and the distributed architecture that powers large-scale data processing.
Spark Architecture
Apache Spark is a distributed computing engine that processes data in parallel across a cluster. Understanding the architecture is essential for the exam.
- Driver: The master process that coordinates the application, creates the SparkContext/SparkSession, and distributes work to executors
- Executors: Worker processes that run on cluster nodes, execute tasks, and store data in memory or disk
- Cluster Manager: Allocates resources (YARN, Mesos, Kubernetes, or Standalone)
- SparkSession: The unified entry point for all Spark functionality (replaces SparkContext in modern Spark)
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("ML Pipeline") \
    .config("spark.sql.shuffle.partitions", 200) \
    .getOrCreate()
RDDs vs. DataFrames
RDD (Resilient Distributed Dataset)
- Low-level distributed collection
- Unstructured: no schema or columns
- Functional API (map, filter, reduce)
- No Catalyst optimizer
- Use only when you need fine-grained control
DataFrame (Preferred for ML)
- High-level structured data with schema
- Columnar: named columns with types
- SQL-like API (select, filter, groupBy)
- Catalyst optimizer for query planning
- Required by Spark ML Pipeline API
Transformations vs. Actions
This distinction is critical for the exam. Spark uses lazy evaluation — transformations build a plan but do not execute. Actions trigger execution.
Transformations (Lazy)
Create a new DataFrame/RDD from an existing one. Nothing is computed until an action is called.
- Narrow: select(), filter(), withColumn(), map() — no shuffle; each partition is processed independently
- Wide: groupBy(), join(), orderBy(), repartition() — require shuffling data across partitions
Actions (Trigger Execution)
Return results or write data. These trigger the execution of all preceding transformations.
- Return results to the driver: show(), collect(), count(), first(), take(n)
- Write data out: write.parquet(), write.csv(), save()
# Transformations (lazy - nothing executes yet)
df_filtered = df.filter(df["age"] > 25)
df_selected = df_filtered.select("name", "age", "salary")
df_grouped = df_selected.groupBy("age").avg("salary")
# Action (triggers execution of the entire plan)
df_grouped.show()
cache() and persist() are transformations, not actions. They mark a DataFrame for caching, but the actual caching happens when the next action is called. The exam tests this distinction.
SparkSQL
SparkSQL allows you to query DataFrames using standard SQL syntax:
# Register DataFrame as a temp view
df.createOrReplaceTempView("customers")
# Query with SQL
result = spark.sql("""
SELECT age, AVG(salary) as avg_salary
FROM customers
WHERE age > 25
GROUP BY age
ORDER BY avg_salary DESC
""")
Caching and Persistence
Caching stores intermediate results in memory to avoid recomputation:
- cache(): Shorthand for persist() with the default storage level (MEMORY_AND_DISK for DataFrames)
- persist(level): Stores with specified storage level (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.)
- unpersist(): Removes from cache
Practice Questions
Question 1
Which of the following operations is an action rather than a transformation?
A) filter()
B) select()
C) count()
D) withColumn()
Answer: C — count() is an action that triggers execution and returns a value (the number of rows). filter(), select(), and withColumn() are all transformations that return new DataFrames without triggering execution.
Question 2
What is the main reason Spark uses lazy evaluation for transformations?
A) To save memory on the driver
B) To optimize the execution plan by analyzing all transformations before running them
C) To make the API easier to use
D) To support Python syntax
Answer: B — Lazy evaluation allows Spark's Catalyst optimizer to analyze the entire chain of transformations and produce an optimized physical execution plan. It can eliminate unnecessary computations, push down predicates, and optimize join strategies.
Question 3
You need to use the same computed DataFrame in two subsequent operations. What is the best way to avoid recomputing it?
A) Create two separate DataFrames
B) Cache the DataFrame using .cache() before the first use
C) Write it to disk and read it back
D) Use an RDD instead
Answer: B — Caching stores the DataFrame in memory so it is not recomputed when used the second time. Creating two DataFrames (A) would duplicate the computation. Writing to disk (C) is slower. Using an RDD (D) is not compatible with Spark ML.
Question 4
Which of the following is a wide transformation?
A) filter()
B) select()
C) groupBy()
D) withColumn()
Answer: C — groupBy() is a wide transformation that requires shuffling data across partitions to group rows with the same key onto the same partition. filter(), select(), and withColumn() are narrow transformations that operate on each partition independently.
Question 5
What is the unified entry point for Spark functionality in modern Spark (2.0 and later)?
A) SparkContext
B) SparkSession
C) SQLContext
D) HiveContext
Answer: B — SparkSession (introduced in Spark 2.0) is the unified entry point that combines SparkContext, SQLContext, and HiveContext. It is used for creating DataFrames, registering tables, executing SQL, and configuring Spark settings.