Free Databricks Machine Learning Associate Actual Exam Questions - Question 7 Discussion
UDFs?
Good point on batch processing improving speed. Another way to look at it: standard PySpark UDFs operate row by row, which is slower because of serialization overhead each time. Vectorized pandas UDFs cut down this overhead by working with batches of data as Series, making option B the clear choice. The others either aren’t unique benefits or apply to both types.
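To make the row-by-row vs. batch difference concrete, here is a rough sketch in plain pandas that only mimics how the two kinds of function get invoked. It does not start Spark. In a real job the first function would be registered with `pyspark.sql.functions.udf` and the second with `pyspark.sql.functions.pandas_udf`. The names `plus_one_row` and `plus_one_batch`, the `calls` counter, and the batch size are made up for illustration.

```python
import pandas as pd

# Count how many times each style of function is invoked.
calls = {"row": 0, "batch": 0}


def plus_one_row(x: float) -> float:
    """Row-at-a-time logic: receives one value per call, like a standard Python UDF body."""
    calls["row"] += 1
    return x + 1.0


def plus_one_batch(s: pd.Series) -> pd.Series:
    """Batch logic: receives a whole pandas Series per call, like a Series-to-Series pandas UDF body."""
    calls["batch"] += 1
    return s + 1.0


values = pd.Series([float(i) for i in range(10_000)])
batch_size = 2_500  # hypothetical batch size, chosen only for this illustration

# Row-wise: one Python function call per value.
row_result = pd.Series([plus_one_row(x) for x in values])

# Batch-wise: one Python function call per chunk of values.
batch_result = pd.concat(
    [
        plus_one_batch(values.iloc[start:start + batch_size])
        for start in range(0, len(values), batch_size)
    ]
)

print(calls)
```

Both approaches produce the same values; the batch version just gets there with far fewer Python-level calls, which is the benefit option B describes.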
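C is true but not unique to vectorized UDFs, since a standard Python UDF can also call pandas inside the function body (vectorized UDFs and pandas UDFs are the same thing, so there is no separate "regular pandas UDF"). B stands out because batch processing is what really boosts performance compared to row-by-row handling.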
B imo, since the main speed gain comes from batch processing data at once, unlike standard UDFs that handle one row at a time. D applies to both types, so it’s not really a unique benefit here.
B/D? The batch processing in B definitely speeds things up compared to row-wise UDFs, but D is also true since both vectorized and standard UDFs run on distributed data. B feels more like the main benefit though.
B/C? Vectorized UDFs process data in batches, which speeds things up, and they also support using the pandas API inside the function.