Free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Actual Exam Questions - Question 1 Discussion
DataFrame A: 128 GB of transactions
DataFrame B: 1 GB user lookup table
Which strategy is correct for broadcasting?
B. The key is avoiding a shuffle of the large DataFrame A, so broadcasting the smaller B is the move. Options that broadcast A don’t make sense given its size.
A vs B? Both options push for broadcasting DataFrame B since it’s smaller, but option B is clearer that this stops the big DataFrame A from shuffling. That’s the key win here, so option B’s explanation feels more on point.
A/B? Both say broadcast DataFrame B since it’s smaller, but option A is vague about "eliminating shuffling itself": it doesn’t specify which DataFrame stops shuffling. Option B explicitly says broadcasting B avoids shuffling the big DataFrame A, which is the heavy operation to avoid. Also, C and D can’t be right since broadcasting the huge 128 GB A is impractical. So between A and B, option B’s explanation about cutting out the shuffle on A is more precise, and it seems correct as long as broadcasting a 1 GB table is allowed in the environment.
A vs B? Both say broadcast B since it’s smaller, but A’s wording about eliminating shuffling itself is a bit vague. B clearly says it stops shuffling the big DataFrame A, which is the main win here.
Actually, C and D can be ruled out since broadcasting the large DataFrame A doesn’t make sense. Between A and B, broadcasting B avoids shuffling the huge A, which is the costly part. So B’s reasoning stands stronger here.
A vs B, both say broadcast B, but B’s reasoning about shuffling A feels clearer.
Broadcasting DataFrame B is best since it's smaller and avoids shuffling DataFrame A, so I'd pick option A.
Option A makes sense: broadcast the smaller DataFrame B to avoid shuffling the big one.
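For anyone who wants to see why broadcasting B leaves A in place, here is a minimal plain-Python sketch of the idea behind a broadcast hash join. It is a toy model, not Spark code: the function name `broadcast_hash_join`, the list-of-partitions layout, and the sample rows are all made up for illustration. In PySpark the equivalent hint is `broadcast()` from `pyspark.sql.functions`, and since the automatic broadcast threshold (`spark.sql.autoBroadcastJoinThreshold`) defaults to 10 MB, a 1 GB table like B would likely need that explicit hint or a raised threshold, plus enough memory to hold the copy.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Toy model: join each partition of the large side against a full copy of the small side."""
    # Build the lookup table once from the small side. In Spark, this is the
    # copy that would be shipped to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined where it already sits: no rows from the
        # large side move between partitions, which is the "no shuffle of A" part.
        joined = [
            {**left, **right}
            for left in partition
            for right in lookup.get(left[key], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical toy data standing in for DataFrame A (transactions, two
# partitions) and DataFrame B (user lookup table).
transactions = [
    [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 5}],
    [{"user_id": 1, "amount": 12}, {"user_id": 3, "amount": 7}],
]
users = [{"user_id": 1, "name": "ana"}, {"user_id": 2, "name": "bo"}]

result = broadcast_hash_join(transactions, users, "user_id")
```

Note that the output keeps the same number of partitions as the large input, and rows with no match in the lookup (user 3 here) drop out, as in an inner join. A shuffle join would instead repartition both sides by key, which for a 128 GB table is the expensive step the comments above are talking about.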