Free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Actual Exam Questions - Question 2 Discussion
too large to fit entirely in memory.
What is the likely behavior when Spark runs out of memory to store the DataFrame?
It’s C because Spark caches whole partitions and puts the ones that don’t fit in memory on disk, so it stores as much as possible in memory first and the rest on disk, with some overhead for disk reads. A single partition is never split between memory and disk.
Probably C here. Spark caches at partition granularity: when memory is full, it writes entire partitions to disk rather than splitting a partition between memory and disk. In other words, it keeps as many partitions in memory as it can and puts the rest on disk. D sounds tempting, but Spark doesn’t track row-level access frequency for caching; it works at the partition level. B is definitely off since Spark isn’t balancing storage evenly, and A isn’t quite right because Spark doesn’t duplicate data in both memory and disk simultaneously. So C fits with how MEMORY_AND_DISK typically behaves under memory pressure.
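To make the partition-level point concrete, here is a toy model in plain Python. It is only a sketch, not Spark's actual block manager: the class `ToyBlockStore`, its `memory_capacity` parameter, and the partition sizes are all made up for illustration, and it ignores details such as eviction of already-cached blocks.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBlockStore:
    """Hypothetical stand-in for MEMORY_AND_DISK placement (not Spark code)."""

    memory_capacity: int
    memory: dict = field(default_factory=dict)  # partition_id -> size
    disk: dict = field(default_factory=dict)    # partition_id -> size

    def put(self, partition_id: int, size: int) -> None:
        # A partition is placed as a whole: either entirely in memory
        # or entirely on disk, never split across the two.
        used = sum(self.memory.values())
        if used + size <= self.memory_capacity:
            self.memory[partition_id] = size
        else:
            self.disk[partition_id] = size


# Four partitions with made-up sizes and a made-up memory budget of 100 units.
store = ToyBlockStore(memory_capacity=100)
for pid, size in enumerate([40, 40, 40, 15]):
    store.put(pid, size)

print(sorted(store.memory))  # [0, 1, 3]
print(sorted(store.disk))    # [2]
```

Partition 2 does not fit in the remaining memory, so all of it goes to disk, while the smaller partition 3 still fits in memory afterwards. Nothing here depends on how often a row is accessed, which is why D does not match.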
Not B, because Spark doesn’t evenly split storage; it fills memory first and spills partitions to disk only when needed, so balanced storage use isn’t guaranteed.
Maybe D if you consider that Spark tries to optimize what stays in memory, but really it’s more about partitions, not individual rows. Since MEMORY_AND_DISK caches partitions and spills whole partitions, it’s less about frequency and more about capacity. So while D sounds like a smart approach, that’s not exactly how caching works under the hood. C still makes more sense given the partition-based spilling behavior.
It’s C, since Spark caches partitions in memory and spills full partitions to disk when out of memory.
Probably C. Spark caches what fits in memory and then spills the overflow to disk automatically, so processing continues without crashing, just slower due to disk reads. D sounds nice but isn't how MEMORY_AND_DISK works exactly.
C. The key point is that MEMORY_AND_DISK means Spark caches as much as it can in memory, but when it hits the memory limit, it stores the rest on disk automatically. This prevents job failure due to memory overflow, though with some performance hit. Options A and B don’t reflect this dynamic spilling behavior, and D talks about frequency-based storage, which isn’t how MEMORY_AND_DISK works; it simply uses disk for whatever doesn’t fit in memory. So C fits best with how Spark actually manages caching large DataFrames under memory constraints.
I think the answer is C. Spark tries to keep as much of the DataFrame in memory as possible and spills the rest to disk when it runs out of memory, which slows things down but avoids errors. Makes sense since MEMORY_AND_DISK means use memory first, then disk if needed.
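For reference, a minimal PySpark sketch of requesting this storage level. The helper name `persist_memory_and_disk` is hypothetical, and the snippet assumes `pyspark` is installed and that `df` is an existing DataFrame.

```python
def persist_memory_and_disk(df):
    """Persist a PySpark DataFrame with the MEMORY_AND_DISK storage level."""
    # Imported lazily so this sketch can be defined without a Spark installation.
    from pyspark import StorageLevel

    # persist() is lazy: partitions are only materialized by the first action.
    return df.persist(StorageLevel.MEMORY_AND_DISK)
```

Calling an action such as `count()` on the returned DataFrame is what actually fills the cache; partitions that do not fit in memory are then stored on disk instead of causing a failure.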