Free Databricks Machine Learning Associate Actual Exam Questions
Dumps Box (DumpsBox) offers up-to-date practice exam questions for the Machine Learning Associate certification exam, developed and validated by subject-matter experts certified in Databricks Machine Learning Associate. These practice questions are updated regularly: we watch for changes to the Machine Learning Associate syllabus, and when one occurs our team quickly adjusts the questions. This commitment to providing the best-quality exam prep material to certification aspirants is what makes DumpsBox.com the best certification exam prep website. On top of that, our strong yet strictly moderated community feedback keeps the content clean and current. Each question has a community discussion that adds perspective and points to helpful resources for better exam preparation. This also saves students from outdated practice questions or illicit exam dumps that can have adverse effects on their careers. Browse through our Databricks Machine Learning Associate exam questions and pass your exam on the first try.
predictions in a Spark DataFrame preds_df with the following schema:
prediction DOUBLE
actual DOUBLE
Which of the following code blocks can be used to compute the root mean-squared-error of the
model according to the data in preds_df and assign it to the rmse variable?
A)

B)

C)

D)

Maybe D, it explicitly computes mean and sqrt clearly, looks correct.
Maybe C works better here since it uses built-in Spark functions to calculate squared error and then takes the sqrt over the average, which is the exact RMSE formula. D looks similar but a bit more manual.
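For reference, the formula both comments describe (square the errors, average them, take the square root) can be sketched in plain Python. This is not one of the answer options and not Spark code; the sample values are made up, and only the prediction/actual pairing comes from the preds_df schema in the question.

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error of paired predictions and actual values."""
    if len(predictions) != len(actuals) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each error, average the squares, then take the square root.
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical rows mirroring the (prediction, actual) columns of preds_df
preds = [2.0, 4.0, 6.0]
actual = [1.0, 4.0, 8.0]
rmse_value = rmse(preds, actual)
```

Any correct answer option has to perform these same three steps on the two columns, in this order.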
problem using matrix decomposition, but this method does not scale well to large datasets with a
large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression
model for large data?
Makes sense that iterative optimization (C) is used since matrix methods don't scale well.
Maybe C here too, since iterative methods can be distributed easily. D and E are traditional but don’t scale well, and A doesn’t apply because that’s a different model type.
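To see why an iterative method distributes easily, here is a rough plain-Python sketch. It uses ordinary gradient descent as a stand-in (Spark ML's real solvers are more sophisticated), and the lists called partitions are a made-up substitute for data spread across workers. The point is that each partition computes its own partial gradient and only a few numbers have to be combined per iteration.

```python
def partial_gradient(partition, w, b):
    """Sum of squared-error gradients over one partition of (x, y) rows."""
    grad_w = grad_b = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        grad_w += err * x
        grad_b += err
    return grad_w, grad_b, len(partition)

def fit_linear(partitions, lr=0.1, n_iters=1000):
    w = b = 0.0
    for _ in range(n_iters):
        # "Map" step: every partition works independently of the others.
        parts = [partial_gradient(p, w, b) for p in partitions]
        # "Reduce" step: only three small numbers per partition are combined.
        grad_w = sum(p[0] for p in parts)
        grad_b = sum(p[1] for p in parts)
        n = sum(p[2] for p in parts)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical data following y = 3x + 1, split into two "partitions"
partitions = [
    [(0.0, 1.0), (1.0, 4.0)],
    [(2.0, 7.0), (3.0, 10.0)],
]
w, b = fit_linear(partitions)
```

A matrix-decomposition solve, by contrast, needs the whole design matrix (or its full Gram matrix) in one place, which is the scaling problem the question describes.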
training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.
Which of the following describes why?
Maybe D makes the most sense since gradient boosting builds trees one after another, each depending on the errors of the previous trees. Options A and C don’t really fit because gradient boosting isn’t strictly about linear algebra or using all cores for gradient calculation in a way that blocks parallelism. B also seems off because you can actually batch data or use subsets, so it’s not like you need all data at once every iteration. The main hold-up is definitely that iterative dependency between trees.
Parallelizing is tough because each tree depends on the previous one, so D.
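The dependency between trees is easy to see in a toy version. The sketch below is plain Python with one-split stumps on made-up 1-D data, not any library's implementation: each round fits a stump to the residuals left by all previous rounds, so round k cannot begin until round k-1 has finished.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on 1-D inputs, chosen by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    predictions = [0.0] * len(xs)
    mse_history = []
    for _ in range(n_rounds):
        # These residuals depend on every stump fitted so far.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
        mse_history.append(
            sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return predictions, mse_history

# Hypothetical training data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]
_, mse_history = boost(xs, ys)
```

Work inside a single round (for example, evaluating candidate thresholds) can still be parallelized; it is the outer loop over rounds that has to run in order.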
Pandas Function API. They have developed the apply_model function that will look up and load the
correct model for each group, and they want to apply it to each group of DataFrame df.
They have written the following incomplete code block:

Which piece of code can be used to fill in the above blank to complete the task?
Maybe C makes sense because mapInPandas lets you transform each group by applying a function and returning a DataFrame, which fits loading and applying group-specific models smoothly.
Maybe A makes the most sense since applyInPandas is meant for grouping operations like this. The others don’t seem like valid Pandas Function API methods for group-level work.
DataFrame and a test DataFrame for downstream use?
E vs B? B filters data but doesn’t split it randomly. E is made for splitting DataFrames randomly into parts, so it’s the straightforward choice here.
E, the only option that actually splits data randomly, unlike model tuning tools.
to specify a search space for two hyperparameters and let the tuning process randomly select values
for each evaluation.
They attempt to run the following code block, but it does not accomplish the desired task:
Which of the following changes can the data scientist make to accomplish the task?
It’s A because RandomizedSearchCV handles random sampling directly, unlike GridSearchCV.
It’s A because RandomizedSearchCV is designed for random sampling of hyperparameters, unlike GridSearchCV which does exhaustive search. No need to mess with random_state or parameter formats here.
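The difference between the two search strategies can be shown without scikit-learn. This is a plain-Python sketch with a made-up search space and hypothetical helper names, not the code block from the question: grid search enumerates every combination, while random search draws a fixed number of candidates from the same space.

```python
import itertools
import random

# Hypothetical search space for two hyperparameters
search_space = {
    "max_depth": [2, 4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
}

def grid_candidates(space):
    """Exhaustive search: every combination of the listed values."""
    keys = list(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]

def random_candidates(space, n_iter, seed=0):
    """Random search: n_iter independent draws from the same space."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()}
            for _ in range(n_iter)]

grid = grid_candidates(search_space)
sampled = random_candidates(search_space, n_iter=5)
```

Here the grid has 16 candidates regardless of budget, while the random search evaluates exactly as many as n_iter allows.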
UDFs?
Good point on batch processing improving speed. Another way to look at it: standard PySpark UDFs operate row by row, which is slower because of serialization overhead each time. Vectorized pandas UDFs cut down this overhead by working with batches of data as Series, making option B the clear choice. The others either aren’t unique benefits or apply to both types.
C is true but not unique to vectorized UDFs since regular pandas UDFs also use pandas API. B stands out because batch processing is what really boosts performance compared to row-by-row handling.
Learning.
Which of the following steps will the data scientist need to perform outside of their AutoML
experiment?
Maybe D, since EDA is about understanding data before any modeling starts.
D/C? I agree deployment (C) is usually done outside AutoML, but EDA (D) also has to be separate since you need to understand and preprocess your data before feeding it in. AutoML handles tuning and evaluation, so those are inside, but you can’t skip proper exploration beforehand.
single-node model:
They have written the following incomplete code block to use predict to score each record of Spark
DataFrame spark_df:

Which of the following lines of code can be used to complete the code block to successfully complete
the task?
Guessing B makes the most sense since mapInPandas applies a function that processes batches as Pandas DataFrames, which fits the idea of parallelizing inference on a Spark DataFrame.
A makes sense since unpacking columns fits Pandas UDF input style better than passing the whole DataFrame.
selections of hyperparameter values based on previous trials for each iterative model evaluation?
C for sure, it’s the only one that truly adapts based on past evaluations.
C. That method builds a model of the objective function and uses it to pick promising hyperparameters, unlike random or grid search, which don’t learn from previous results.
size of the data processed by the notebook increases, the notebook's runtime is increasing
drastically.
Which of the following tools can the data scientist use to spend the least amount of time refactoring
their notebook to scale with big data?
C imo, Spark SQL is pretty powerful for big data and integrates well with existing Spark setups, though it might require more refactoring than B. Still, it's a solid option if you want efficient querying without fully switching APIs.
Actually, D might not be the right pick since feature stores mainly help with managing and sharing features rather than scaling the computation itself. Between the options, B sounds like the smartest choice to me because it lets you reuse most of your pandas code with minimal changes, while still running on Spark for big data. A and C would definitely need more work to refactor since they have different APIs and require rewriting the code logic. So for quick scaling with minimal refactoring, pandas API on Spark (B) really stands out.
the MLflow Model Registry has passed all tests. As a result, the machine learning engineer wants to
put this model into production by transitioning it to the Production stage in the Model Registry.
From which of the following pages in Databricks Machine Learning can the machine learning
engineer accomplish this task?
C, because the staging and production transitions happen on the specific model version page, not on the general model or experiment pages. This lets you focus on that version’s details and stage status.
I think option D makes more sense because it shows all the versions in one place, so you can easily compare and pick the right model to promote. From the model page, you can manage versions without switching back and forth. It feels more practical for managing production transitions when you want a broader overview.
Pandas Function API. They have developed the train_model function, and they want to apply it to
each group of DataFrame df.
They have written the following incomplete code block:

Which of the following pieces of code can be used to fill in the above blank to complete the task?
Option B works too since mapInPandas applies a function to each partition or group and also returns a DataFrame, so it can handle the train_model function on grouped data.
It’s A, applyInPandas is built for grouped DataFrames and fits perfectly here.
this model, they have performed inference and the predictions and actual label values are in Spark
DataFrame preds_df.
They are using the following code block to evaluate the model:
regression_evaluator.setMetricName("rmse").evaluate(preds_df)
Which of the following changes should the data scientist make to evaluate the RMSE in a way that is
comparable with price?
D imo, exponentiating RMSE doesn’t fix scale issues, but converting predictions back first does.
A vs D? Exponentiating the RMSE itself (A) doesn’t really fix the scale mismatch because RMSE is a single error metric, not predictions. For a meaningful RMSE on price, you want both predictions and labels in the same scale before computing error. So transforming predictions back with exp (D) makes more sense. Just make sure actual labels are also on the original price scale, not logged, otherwise RMSE will still be off.
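A quick numeric check of this point, in plain Python with made-up prices (assuming, as the comments do, that the model was trained on log(price)): exponentiating the log-scale RMSE gives a ratio-like number near 1, while converting predictions and labels back to prices first gives an error in price units.

```python
import math

def rmse(predictions, actuals):
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical log-price predictions and labels
log_preds = [math.log(100.0), math.log(220.0), math.log(310.0)]
log_actuals = [math.log(110.0), math.log(200.0), math.log(300.0)]

# RMSE computed on the log scale, then exponentiated afterwards
rmse_log = rmse(log_preds, log_actuals)
exp_of_rmse = math.exp(rmse_log)

# Predictions and labels converted back to prices before computing RMSE
price_preds = [math.exp(v) for v in log_preds]
price_actuals = [math.exp(v) for v in log_actuals]
rmse_price = rmse(price_preds, price_actuals)
```

With price errors of 10, 20, and 10, rmse_price is about 14.1, whereas exp_of_rmse stays just above 1 and says nothing about error in price units.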
notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further
feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame
API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on
Spark?
Probably A. It directly imports the pandas API on Spark and wraps the Spark DataFrame, so you can use pandas-like syntax without converting everything to pandas, which could cause memory issues. E actually pulls data into a regular pandas DataFrame, which might be too heavy for big data. B looks off since ps.to_pandas isn’t a known method. D won’t work because pandas can’t handle Spark DataFrames natively. C is unrelated since to_sql is for databases, not transitioning between Spark and pandas APIs.
A. This one imports the pandas API on Spark, which is exactly what the question asks for. B looks wrong because ps.to_pandas isn’t a real function, and D just tries converting straight to pandas without the Spark context, which won’t work well. E converts to a real pandas DataFrame, not the pandas API on Spark, so that’s different from what they want. C is unrelated and won’t help here. So A is the only option that actually sets up the dataframe to use pandas API on Spark.