Free Databricks Machine Learning Associate Actual Exam Questions - Question 15 Discussion
notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further
feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame
API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on
Spark?
Probably A. It directly imports the pandas API on Spark and wraps the Spark DataFrame, so you can use pandas-like syntax without converting everything to pandas, which could cause memory issues. E actually pulls data into a regular pandas DataFrame, which might be too heavy for big data. B looks off since ps.to_pandas isn’t a known method. D won’t work because pandas can’t handle Spark DataFrames natively. C is unrelated since to_sql is for databases, not transitioning between Spark and pandas APIs.
A. This one imports the pandas API on Spark, which is exactly what the question asks for. B looks wrong because ps.to_pandas isn’t a real function, and D just passes the Spark DataFrame straight to plain pandas, which doesn’t know how to read it. E converts to a real pandas DataFrame, not the pandas API on Spark, so that’s different from what they want. C is unrelated and won’t help here. So A is the only option that actually sets up the DataFrame to use the pandas API on Spark.
A/D? A gets the pandas API on Spark; D just tries converting directly to pandas, which might fail.
A definitely works to get a pandas-on-Spark DataFrame.
E/A? E looks right since spark_df has to_pandas() to convert to pandas API on Spark. A might work too, but to_pandas is more straightforward for this.
It’s A
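For anyone comparing the two approaches debated above, here is a minimal sketch of the difference between wrapping a Spark DataFrame in the pandas API on Spark and collecting it into plain pandas. The answer options themselves are not reproduced in this thread, so this follows the commenters' description of A and E rather than the exact option text. The helper names (to_pandas_on_spark, to_local_pandas, demo) and the sample columns are made up for illustration, and it assumes PySpark 3.2 or later, where pyspark.pandas is available.

```python
def to_pandas_on_spark(spark_df):
    """Wrap a Spark DataFrame so pandas-style syntax runs on Spark."""
    # Imported lazily so this file still loads where PySpark is absent.
    import pyspark.pandas as ps  # assumption: PySpark 3.2 or later

    # The data stays distributed; nothing is collected to the driver.
    return ps.DataFrame(spark_df)


def to_local_pandas(spark_df):
    """Collect the whole Spark DataFrame into driver memory as plain pandas."""
    # This is the approach the thread warns about for large data.
    return spark_df.toPandas()


def demo():
    """Hypothetical end-to-end run; needs a working local Spark install."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    spark_df = spark.createDataFrame([(1, 2.0), (2, 3.5)], ["id", "value"])

    # pandas-like column assignment, executed by Spark under the hood.
    psdf = to_pandas_on_spark(spark_df)
    psdf["value_doubled"] = psdf["value"] * 2
    print(psdf.head())

    # Plain pandas DataFrame living entirely on the driver.
    print(type(to_local_pandas(spark_df)))

    spark.stop()
```

Calling demo() in a notebook with Spark available should show the pandas-style column assignment working on the wrapped DataFrame, while to_local_pandas returns an ordinary pandas object that no longer scales with the cluster.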