Free Databricks Machine Learning Associate Actual Exam Questions - Question 11 Discussion

Question No. 11
A data scientist has written a feature engineering notebook that utilizes the pandas library. As the
size of the data processed by the notebook increases, the notebook's runtime increases drastically
and processing becomes slow.
Which of the following tools can the data scientist use to spend the least amount of time refactoring
their notebook to scale with big data?
Vikas E.
2026-02-22

C imo, Spark SQL is pretty powerful for big data and integrates well with existing Spark setups, though it might require more refactoring than B. Still, it's a solid option if you want efficient querying without fully switching APIs.

Farhan M.
2026-02-20

Actually, D might not be the right pick since feature stores mainly help with managing and sharing features rather than scaling the computation itself. Between the options, B sounds like the smartest choice to me because it lets you reuse most of your pandas code with minimal changes, while still running on Spark for big data. A and C would definitely need more work to refactor since they have different APIs and require rewriting the code logic. So for quick scaling with minimal refactoring, pandas API on Spark (B) really stands out.
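
To make the "minimal changes" point concrete, here is a rough sketch of what the swap could look like. The function name `engineer_features`, the column names, and the sample data are all hypothetical and not taken from the question. The sketch assumes the notebook sticks to calls that exist in both pandas and the pandas API on Spark; it runs on plain pandas as written, and on a Spark cluster the idea would be to change only the import line.

```python
# Plain pandas import so the sketch runs anywhere pandas is installed.
# Assumption: on a Spark cluster you would swap this single line for
#   import pyspark.pandas as pd
# and leave the rest of the notebook code unchanged.
import pandas as pd


def engineer_features(df):
    """Hypothetical feature step using only calls shared by both APIs."""
    out = df.copy()
    # Row-level feature: line total per order
    out["total"] = out["price"] * out["quantity"]
    # Per-customer aggregate, joined back onto each row
    spend = out.groupby("customer_id")["total"].sum().reset_index()
    spend = spend.rename(columns={"total": "customer_spend"})
    return out.merge(spend, on="customer_id", how="left")


# Made-up sample data for illustration only
orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "price": [10.0, 5.0, 8.0],
        "quantity": [2, 1, 3],
    }
)
features = engineer_features(orders)
print(features)
```

One caveat: the pandas API on Spark does not guarantee row order the way pandas does, so if you compare results between the two, it is probably safer to sort on a key column first.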

Farhan M.
2026-02-16

It’s A since the PySpark DataFrame API is widely used for big data processing and scales better than pandas on large datasets, even though it needs more refactoring than B. It’s a solid choice if you want scalable and efficient handling.

Ryan O.
2026-02-12

B is the best fit here since it lets you keep pandas-like code without full rewrites.

Ryan O.
2026-02-11

Makes sense to pick B here since it’s designed to mimic pandas but runs on Spark, so you don’t have to rewrite everything from scratch. A and C would require a bigger overhaul of the code, which goes against the goal of spending the least time refactoring. D doesn’t really address the scaling issue directly. So B fits best for quick scaling with minimal code changes.

Osama P.
2026-02-10

It’s B because pandas API on Spark lets you keep most of your pandas code intact while scaling out, so you avoid the full rewrite that PySpark (A) or Spark SQL (C) would require. Feature Store (D) doesn’t help much with runtime here.

Osama P.
2026-02-10

A imo, even though PySpark might need more refactoring, it’s the most robust for big data. B’s pandas API on Spark sounds good for minimal changes but can have hidden compatibility issues or slower performance in some cases. Feature Store (D) doesn’t solve runtime scaling directly, and Spark SQL (C) usually means rewriting queries entirely, which isn’t minimal. If the goal is truly the least refactor but still handle big data efficiently, B’s tempting, but if you want solid scaling without surprises, PySpark (A) is safer long-term.

Osama P.
2026-01-26

A/D? A could work but PySpark DataFrame API usually needs a lot of code changes, so not the least time spent refactoring. D is about storing features for reuse and sharing but doesn’t directly help with scaling the existing pandas code. B is probably still best for minimal changes, but if you want a totally different approach, maybe D could help in a broader data pipeline sense.

Usman Y.
2026-01-20

Makes sense to pick B here since it lets you scale without rewriting all your pandas logic, unlike PySpark or Spark SQL which need bigger changes. B fits the "least refactoring" part best.

Peter W.
2026-01-16

Maybe B, since pandas API on Spark lets you use similar pandas code but scales better for big data without too much rewriting.
