Free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Actual Exam Questions - Question 4 Discussion

Question No. 4
An MLOps engineer is building a Pandas UDF that applies a language model that translates English
strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting
the performance of the data pipeline.
The initial code is:
def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())
How can the MLOps engineer change this code to reduce how many times the language model is
loaded?
Brian F.
2026-02-18

It’s D because using an iterator UDF means the model loads once per partition and stays loaded across all batches, cutting down repeated reloads unlike the other options.
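A minimal sketch of that idea, runnable locally without a cluster. The `get_translation_model` body and the `LOAD_COUNT` counter are hypothetical stand-ins added only to make the load frequency visible; the real loader is not shown in the question.

```python
from typing import Callable, Iterator

import pandas as pd

LOAD_COUNT = 0  # illustration only: counts how often the model is loaded


def get_translation_model(target_lang: str) -> Callable[[str], str]:
    """Hypothetical stand-in for the real (expensive) model loader."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda text: f"[{target_lang}] {text}"


def in_spanish_inner(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Loaded once, before the first batch is consumed.
    model = get_translation_model(target_lang="es")
    for batch in batches:
        # The same model object is reused for every batch in the iterator.
        yield batch.apply(model)


# Local check without Spark: three batches go through one model load.
batches = [pd.Series(["hello"]), pd.Series(["good", "morning"]), pd.Series(["bye"])]
results = list(in_spanish_inner(iter(batches)))
```

In Spark the function would be registered the same way as in the question, `sf.pandas_udf(in_spanish_inner, StringType())`; the `Iterator[pd.Series] -> Iterator[pd.Series]` type hints are what select the iterator variant.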

Ravi P.
2026-02-17

It’s D because using an iterator UDF lets you initialize the model once per partition and reuse it across batches instead of reloading it on every single call, which really cuts down the loading overhead compared to the other options.

Farhan J.
2026-02-12

D, since iterator UDFs keep the model loaded across batches, unlike the others.

Bilal O.
2026-02-04

D imo, since iterator UDFs let you load the model once per partition instead of once per batch, cutting down reloads big time. The other options don’t batch process as efficiently.

Bilal O.
2026-02-01

A/B? Changing to a PySpark UDF (A) wouldn’t fix the issue since a plain Python UDF runs once per row, so the model would load even more often. Switching to a Series → Scalar UDF (B) doesn’t really help either because the model would still reload on every call. The key is minimizing model loading frequency, which neither A nor B directly address. So these seem less likely to improve performance compared to options like C or D that can load the model once and reuse it across batches.

Bilal O.
2026-02-01

D imo, because using the iterator UDF means the model gets loaded once per partition, not on every single batch. That’s a huge performance boost compared to the current setup.

Bilal O.
2026-01-30

D imo, because an iterator UDF receives the batches of a partition through one iterator, so you load the model once before the loop instead of once per batch. A and B don’t really help since they still load the model on every call, causing repeated loads. C is closer but mapInPandas works on whole DataFrames and is a bit heavier to maintain compared to the iterator approach in D, which is designed exactly for this use case: expensive one-time setup followed by batch processing.

Bilal O.
2026-01-30

C/D? I see why D is better for loading once per partition, but C also lets you control model loading outside the batch loop. Either way beats loading on every batch like the original.

Mason I.
2026-01-26

C/D? I get why D is popular since iterator UDFs let you load the model once per partition, but mapInPandas (C) also receives an iterator of batches and lets you handle the model loading outside the per-batch loop. Both reduce the overhead compared to loading on every call. The main point is moving from per-call loading to per-partition loading, which both C and D enable. A and B won't help here since they don't address how often the model loads. So between C and D, it’s about which approach fits better with your pipeline setup.
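For comparison, a rough sketch of what the mapInPandas route could look like, again checked locally without Spark. The column names `text` and `text_es`, the stub loader, and the `LOAD_COUNT` counter are assumptions made for illustration and are not part of the question.

```python
from typing import Callable, Iterator

import pandas as pd

LOAD_COUNT = 0  # illustration only: counts how often the model is loaded


def get_translation_model(target_lang: str) -> Callable[[str], str]:
    """Hypothetical stand-in for the real (expensive) model loader."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda text: f"[{target_lang}] {text}"


def translate_partition(frames: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Load once, outside the per-batch loop.
    model = get_translation_model(target_lang="es")
    for frame in frames:
        # "text" is an assumed input column; "text_es" an assumed output column.
        yield pd.DataFrame({"text_es": frame["text"].apply(model)})


# Local check without Spark: two DataFrame batches, one model load.
frames = [pd.DataFrame({"text": ["hello", "world"]}), pd.DataFrame({"text": ["bye"]})]
out = list(translate_partition(iter(frames)))
```

On a real DataFrame this would be applied with something like `df.mapInPandas(translate_partition, schema="text_es string")`. Unlike a pandas UDF it replaces the whole row set rather than producing a single column expression, which is part of why it can feel heavier for this case.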

Mason I.
2026-01-26

It’s D for sure. Using an iterator UDF means the model is loaded once per partition, not once per batch like in the original code. That cuts down the overhead drastically. Plus, it avoids reloading the model multiple times within the same task execution, which is exactly what’s slowing things down here. Options A and C don’t directly address the load frequency, and B still loads on every call, so they’re not as effective.

Naveed I.
2026-01-24

Not B tbh, changing to a Series → Scalar UDF means the model still loads on every call, so no big gain. The real win is loading once per partition, not on every call.

Naveed I.
2026-01-24

Option D, since loading the model once per partition is way better than loading it on every batch.

Ryan Z.
2026-01-15

D imo, the main issue is loading the model every time the function runs. Using an Iterator[Series] → Iterator[Series] UDF allows you to load the model once per partition instead of on every batch, which should cut down on overhead a lot. The other options don’t really solve that repeated loading problem directly. PySpark UDFs could be slower, and changing input/output types won’t fix model instantiation frequency by itself. mapInPandas is nice but doesn’t inherently solve this unless you explicitly load the model outside the batch loop, which isn’t shown here.
