Free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Actual Exam Questions - Question 8 Discussion

Question No. 8
A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The
cluster has 10 nodes, each with 16 CPUs. Spark UI shows:
Low number of Active Tasks
Many tasks complete in milliseconds
Fewer tasks than available CPUs
Which approach should be used to adjust the partitioning for optimal resource allocation?
Noah L.
2026-02-22

A imo. Since the cluster has 10 nodes with 16 CPUs each, that’s 160 CPUs total. Setting partitions equal to total CPUs ensures each core gets a task, maximizing parallelism without overwhelming the cluster with too many tiny tasks. Options B and C seem arbitrary without tying directly to CPU count. D is good in theory, but partition size can vary depending on file format and compression—just dividing by 128 MB might not perfectly match CPU availability or task length. The key here is matching tasks to cores for better resource use.
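The core count quoted above can be checked with a short Python sketch. It assumes 1 TB is read as 1 TiB (binary units), and the variable names are illustrative rather than anything from the question. It also works out how much input each task would take on if the partition count were set equal to the core count.

```python
# Cluster figures taken from the question; names are illustrative.
NODES = 10
CPUS_PER_NODE = 16

total_cores = NODES * CPUS_PER_NODE

# Assumption: 1 TB is read as 1 TiB, expressed here in MiB.
DATASET_MIB = 1024 * 1024

# Input each task would handle if partitions == total cores.
mib_per_task = DATASET_MIB / total_cores

print(total_cores)   # 160
print(mib_per_task)  # 6553.6
```

Under these assumptions, each of the 160 tasks would be responsible for about 6.4 GiB of input.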

Ali S.
2026-02-18

It’s D because calculating partitions from data size usually avoids too few or too many tasks.

Ravi X.
2026-02-05

B tbh. 200 is Spark's default for spark.sql.shuffle.partitions, and a fixed number like that is usually enough to keep CPUs busy without overloading the scheduler. For some use cases it's simpler than calculating from data size.

Sarah S.
2026-01-30

Probably D here. Splitting the data by a fixed partition size like 128 MB makes more sense than just matching CPUs or nodes because it scales with dataset size. Options A and C clearly give too few partitions given the cluster’s capacity. B feels arbitrary and might not fully utilize resources or could cause overhead if 200 isn’t the right number for 1 TB. D ensures you get enough partitions to keep all cores busy without making tasks too tiny, which fits the problem’s context better.
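The size-based calculation described here can be sketched in a few lines of Python. This is a minimal illustration, not a definitive recipe: the helper name partitions_for is hypothetical, 1 TB is assumed to mean 1 TiB, and the commented repartition call assumes a PySpark DataFrame named df that is not defined in this snippet.

```python
def partitions_for(dataset_bytes: int, target_partition_bytes: int) -> int:
    """Smallest partition count such that no partition exceeds the target size."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-dataset_bytes // target_partition_bytes)

TIB = 1024 ** 4
MIB = 1024 ** 2

num_partitions = partitions_for(1 * TIB, 128 * MIB)
print(num_partitions)  # 8192

# With a PySpark DataFrame `df` (hypothetical here), this could be applied as:
# df = df.repartition(num_partitions)
```

The 128 MB target also happens to match the default value of spark.sql.files.maxPartitionBytes, which governs how file-based sources are split when they are read.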

Ravi I.
2026-01-26

I’m thinking option B might be too arbitrary since 200 partitions might not suit every situation, especially with 1 TB of data. Option C seems way too low—only 10 partitions for the whole dataset would definitely bottleneck CPU usage. Between A and D, setting partitions by dividing the dataset size (option D) feels more data-driven and flexible. It helps avoid both too few and too many partitions based on actual data volume rather than fixed numbers or just CPU count. But I wonder if overhead from too many small tasks could actually slow things down more than expected?
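One rough way to look at the overhead question is to count how many scheduling "waves" the cluster would need. The sketch below assumes 8192 partitions (1 TiB divided into 128 MiB pieces, binary units) and that every core runs exactly one task at a time. Both are simplifying assumptions, and the actual overhead depends on the workload.

```python
import math

TOTAL_CORES = 10 * 16    # nodes * CPUs per node, from the question
NUM_PARTITIONS = 8192    # assumption: 1 TiB / 128 MiB

# Average number of tasks each core would process over the whole job.
tasks_per_core = NUM_PARTITIONS / TOTAL_CORES

# Number of rounds needed if every core runs one task per round.
waves = math.ceil(NUM_PARTITIONS / TOTAL_CORES)

print(tasks_per_core)  # 51.2
print(waves)           # 52
```

Each of those tasks would still read about 128 MiB, so they are not the millisecond-scale tasks described in the question.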

Ravi I.
2026-01-26

Option A seems too limiting since just matching partitions to CPU count might not fully utilize resources for 1 TB data. More partitions help spread load better and avoid idle CPUs.

Zain G.
2026-01-24

What about B? Fixed partitions avoid too many tiny tasks and balance parallelism.

Zain G.
2026-01-19

Tasks finish too quickly and there are fewer of them than CPUs, so we need more partitions. D is solid because it scales with data size, not just cluster resources.

Ali C.
2026-01-17

A/B? A matches partitions to CPU count for parallelism, but B’s fixed 200 could also work if 1 TB / 128 MB is close to that number. Either way, more partitions than CPUs are needed.
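To see how close 200 is to 1 TB / 128 MB, the candidate partition counts can be put side by side. The option text is not shown on this page, so the mapping below is an assumption pieced together from how commenters in this thread describe each option. 1 TB is again read as 1 TiB.

```python
DATASET_MIB = 1024 * 1024  # assumption: 1 TB read as 1 TiB, in MiB

# Partition counts per option, as described by commenters in this thread
# (the option text itself is not shown, so this mapping is an assumption).
candidates = {
    "A (total CPUs)": 10 * 16,
    "B (fixed)": 200,
    "C (node count)": 10,
    "D (1 TiB / 128 MiB)": DATASET_MIB // 128,
}

# Average input per partition for each candidate count.
mib_per_partition = {label: DATASET_MIB / n for label, n in candidates.items()}

for label, n in candidates.items():
    print(f"{label}: {n} partitions, {mib_per_partition[label]:.1f} MiB each")
```

With these figures, the size-based count comes out at 8192, roughly 40 times the fixed value of 200.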

Kevin N.
2026-01-15

It’s A because matching partitions to total CPUs ensures all cores are utilized, avoiding idle resources. Having fewer partitions than CPUs clearly limits parallelism here.

Naveed M.
2026-01-12

Looks like the job is under-partitioned since tasks finish too fast and CPUs are idle. D makes sense—partition by data size (1 TB / 128 MB) to get enough partitions and keep all CPUs busy.
