Free Databricks Machine Learning Associate Actual Exam Questions - Question 5 Discussion
DataFrame and a test DataFrame for downstream use?
E vs B? B filters data but doesn’t split it randomly. E is made for splitting DataFrames randomly into parts, so it’s the straightforward choice here.
E, the only option that actually splits data randomly, unlike model tuning tools.
E. randomSplit is the only method that actually splits a DataFrame randomly. The others are for model selection or filtering, so they don’t make sense for this task.
Not B imo. DataFrame.where is for filtering rows based on a condition, so it can’t randomly split data. That just leaves randomSplit (E) as the real choice for random division.
E. randomSplit is the only one designed to split DataFrames randomly; the others are for model tuning.
It’s E for sure. The others focus on model evaluation or filtering, not on creating random subsets of data. randomSplit is built exactly for dividing DataFrames randomly — perfect for training/testing splits.
E vs A/B? The terms TrainValidationSplit and CrossValidator are definitely linked to model tuning, so they don’t actually split the data themselves. DataFrame.where is just for filtering based on conditions, not random splitting. That pretty much leaves randomSplit as the only option that actually takes a DataFrame and splits it randomly into parts you can use separately. So E makes the most sense here.
E imo. The others are related to model validation and tuning, not data splitting. randomSplit is the only one that directly splits DataFrames into random parts.
Maybe E since randomSplit sounds like it’s made for splitting data randomly. The others seem more like model tuning stuff.