Free Databricks Machine Learning Associate Actual Exam Questions - Question 2 Discussion
problem using matrix decomposition, but this method does not scale well to large datasets with a
large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression
model for large data?
Makes sense that iterative optimization (C) is used since matrix methods don't scale well.
Maybe C here too, since iterative methods can be distributed easily. D and E are traditional but don’t scale well, and A doesn’t apply because that’s a different model type.
Makes sense that it’s iterative optimization (C) since matrix decomposition like SVD isn’t great for big data. Logistic regression (A) is unrelated and B is just false. So C for sure.
Totally agree that A and E look off for this context. Logistic regression is a different model altogether, and SVD doesn’t really fit the scalability angle here. The least squares method (D) is more traditional but not great for huge datasets since it doesn’t handle distribution well. So it’s really about using something that iterates to gradually improve the model across partitions. Does anyone know if Spark ML uses any specific algorithm under iterative optimization, like gradient descent or something else?
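On the gradient descent question: as far as I know, Spark ML's LinearRegression exposes a `solver` parameter, and `l-bfgs` is its iterative option (a quasi-Newton method rather than plain gradient descent). The general pattern is the same either way: each partition computes a partial gradient, the partial results are summed, and the weights are updated once per iteration. Here is a minimal sketch of that pattern in plain Python. It is not Spark's implementation. It swaps in ordinary gradient descent for simplicity, uses lists to stand in for partitions, and the names `partition_gradient` and `fit_linear` are made up for illustration.

```python
# Toy illustration only: plain Python lists stand in for Spark partitions,
# and simple gradient descent stands in for Spark's actual optimizer.

def partition_gradient(rows, w, b):
    """Sum the squared-error gradient over one partition's (x, y) rows."""
    gw, gb = 0.0, 0.0
    for x, y in rows:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(rows)


def fit_linear(partitions, lr=0.1, iterations=1000):
    """Fit y = w*x + b by repeatedly combining per-partition gradients."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        # "Map" step: each partition works independently on its own rows.
        partials = [partition_gradient(p, w, b) for p in partitions]
        # "Reduce" step: only the small partial sums are combined.
        gw = sum(p[0] for p in partials)
        gb = sum(p[1] for p in partials)
        n = sum(p[2] for p in partials)
        # Update the shared model using the averaged gradient.
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


# Data generated from y = 2x + 1, split across two "partitions".
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
w, b = fit_linear(partitions)
print(round(w, 4), round(b, 4))
```

The point of the sketch is that only two numbers per partition travel between workers on each iteration, no matter how many rows a partition holds, which is why this style of training distributes more easily than decomposing one large matrix.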
C imo, because iterative optimization fits distributed systems better. A and E are traps: logistic regression is a different model, and SVD isn't what Spark ML relies on to scale linear regression.