Free Databricks Machine Learning Associate Actual Exam Questions - Question 1 Discussion
predictions in a Spark DataFrame preds_df with the following schema:
prediction DOUBLE
actual DOUBLE
Which of the following code blocks can be used to compute the root mean-squared-error of the
model according to the data in preds_df and assign it to the rmse variable?
A)

B)

C)

D)

Maybe D, it explicitly computes mean and sqrt clearly, looks correct.
Maybe C works better here since it uses built-in Spark functions to calculate squared error and then takes the sqrt over the average, which is the exact RMSE formula. D looks similar but a bit more manual.
B tbh looks off since it misses taking the square root, so it’s actually MSE not RMSE. A also doesn’t aggregate properly. C and D both compute mean squared error, but D’s final sqrt step is clearer and cleaner.
C/D? C uses Spark functions straightforwardly for squared errors and sqrt, but D explicitly calculates mean and then sqrt too. Both look close, but D’s aggregation looks simpler and cleaner to me.
C imo, because it uses the built-in Spark functions to compute squared error and then applies sqrt correctly. A and B miss either the sqrt or proper aggregation, so C fits the RMSE calculation better.
I’m going with option D here. It looks like it computes the squared difference between prediction and actual correctly, then takes the mean and applies sqrt, which is exactly how you calculate RMSE. Option B also seems close but might be missing the sqrt part at the end, so it’s probably MSE instead of RMSE.
B (calculates squared errors and averages correctly)
Maybe D