Free Google Professional-Machine-Learning-Engineer Actual Exam Questions - Question 6 Discussion
conduct data transformations at scale, but your pipelines are taking over 12 hours to run. To speed
up development and pipeline run time, you want to use a serverless tool and SQL syntax. You have
already moved your raw data into Cloud Storage. How should you build the pipeline on Google Cloud
while meeting the speed and processing requirements?
A/D? Data Fusion’s GUI could speed up development with less coding, but might not handle very large scale as efficiently as BigQuery. Since the data’s in Cloud Storage already, loading into BigQuery (D) seems cleaner for serverless SQL.
Option D, since BigQuery handles massive data natively and removes cluster management hassles.
It’s D for sure. BigQuery is built for large-scale data and serverless SQL transformations, so it fits the need to speed up development and run time without managing clusters. Also, since the data’s already in Cloud Storage, loading into BigQuery is straightforward. Option A sounds tempting but Data Fusion might add complexity if you want pure SQL and faster iteration. B and C don’t really hit the mark because they either require cluster management or rely on Cloud SQL, which isn’t ideal for big data scale. D keeps it simple, fast, and serverless all at once.
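For anyone wondering what "loading into BigQuery is straightforward" might look like, here is a rough sketch of the two steps in option D, written as SQL strings built in Python. The helper functions, bucket, dataset, table, and column names are all made up for illustration, and nothing here talks to Google Cloud. The statements would still have to be run in the BigQuery console or through a client.

```python
def load_statement(table: str, uri: str) -> str:
    """Build a BigQuery LOAD DATA statement for CSV files in Cloud Storage.

    `table` and `uri` are hypothetical placeholders supplied by the caller.
    """
    return (
        f"LOAD DATA INTO {table}\n"
        f"FROM FILES (format = 'CSV', skip_leading_rows = 1, uris = ['{uri}']);"
    )


def transform_statement(source: str, target: str) -> str:
    """Build a CREATE TABLE AS SELECT that stands in for one transformation.

    The columns (customer_id, amount) are assumptions, not from the question.
    """
    return (
        f"CREATE OR REPLACE TABLE {target} AS\n"
        f"SELECT customer_id, SUM(amount) AS total_amount\n"
        f"FROM {source}\n"
        f"GROUP BY customer_id;"
    )


if __name__ == "__main__":
    # Hypothetical names; replace with real ones before running in BigQuery.
    print(load_statement("my_dataset.raw_orders", "gs://my-bucket/raw/orders-*.csv"))
    print(transform_statement("my_dataset.raw_orders", "my_dataset.order_totals"))
```

The point is only that the pipeline shrinks to "load, then SELECT". Whether the real PySpark logic maps this cleanly onto SQL depends on the transformations involved.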
On B tbh: Dataproc still means managing clusters, which goes against the serverless goal. Converting PySpark to SparkSQL and running it on Dataproc might speed things up compared to raw PySpark jobs, but it's probably not the best fit if you want to avoid cluster management completely. A and D seem more serverless-friendly. Plus, B involves some overhead in rewriting and cluster config, so it's not the quickest win here.
I think option A deserves more attention since Data Fusion is serverless and designed for building data pipelines without worrying about cluster management. It also supports SQL transforms through its GUI, which might speed up development compared to rewriting everything in SQL like D suggests. Plus, if the transformations involve complex logic, Data Fusion can handle custom plugins, unlike BigQuery’s SQL-only approach. Does anyone have experience with Data Fusion handling large-scale PySpark-like jobs and whether it really cuts down on runtime that much?
Totally agree that D fits the serverless and SQL requirement best, and BigQuery handles huge datasets efficiently. Plus, unlike B, it avoids managing clusters, so D makes more sense here.
I get why D looks good for serverless and pure SQL, but don’t forget Data Fusion in A is also serverless and built for easier pipeline building without managing clusters. It can speed up development since you don’t rewrite PySpark code into SQL yourself. Plus, Data Fusion can handle scale well and write directly to BigQuery. So, I’d say A is a solid choice if you want less manual SQL conversion and a user-friendly interface, while still getting decent performance without managing infrastructure.
I’m ruling out C because Cloud SQL isn’t really built for big data transformations at scale, even with structured data that’s already in Cloud Storage. It would be slower than BigQuery or other serverless options. D seems solid since BigQuery handles SQL natively and is serverless, but if the PySpark has complex logic, rewriting might be tough. Still, D fits the speed and serverless requirements better than A or B. So I’d go with D, based on the serverless and SQL criteria and assuming the transformations can be expressed in SQL.
I’m thinking option A could work since Data Fusion is serverless and designed for building pipelines with less code, plus it can handle transformations at scale and write to BigQuery. It might not be as fast as BigQuery SQL, but it could speed up development time without needing to rewrite all your PySpark code into SQL. Anyone else see a downside with using Data Fusion here compared to just switching fully to BigQuery for SQL transformations?
Probably D. Moving the data into BigQuery and using SQL there is the most straightforward way to get serverless scalability and faster runtime compared to PySpark on Dataproc. Plus, BigQuery’s native support for SQL fits the requirement to use SQL syntax. The only catch is if your PySpark logic is super complex or uses custom UDFs that don’t map easily to SQL, but the question doesn’t say that explicitly. B still requires managing Dataproc clusters, so it’s not fully serverless and might not speed things up as much.
I ruled out A since Data Fusion isn't fully serverless and can be slower for big data. C seems off because Cloud SQL isn’t designed for large-scale transformations like this. So it’s between B and D, but D’s serverless approach feels more fitting here. Anyone think B might still have a use case?
B/D? B still uses Spark, so it might not solve the long runtime issue, since Dataproc isn’t fully serverless and can have cluster startup overhead. D looks better because BigQuery is fully serverless and optimized for SQL at scale. Plus, loading the data from Cloud Storage into BigQuery seems straightforward. The key challenge is converting the PySpark logic into SQL, but once that's done, BigQuery’s speed and scalability should crush the 12-hour runtime problem. So if you can handle the rewrite effort, D seems like the practical choice for faster pipeline runs and easier maintenance.
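To make the "converting PySpark logic into SQL" step concrete, here is a small sketch of one such rewrite. The PySpark snippet in the comment, the table name, and the columns are all hypothetical, and sqlite3 is used purely as a local stand-in so the query can be run without a cloud project. The SQL itself is plain enough that it should carry over to BigQuery, but that is an assumption to check against the real schema.

```python
import sqlite3

# Hypothetical PySpark transformation being replaced:
#   (df.filter(df.amount > 0)
#      .groupBy("customer_id")
#      .agg(F.sum("amount").alias("total_amount")))
EQUIVALENT_SQL = """
SELECT customer_id, SUM(amount) AS total_amount
FROM raw_orders
WHERE amount > 0
GROUP BY customer_id
ORDER BY customer_id
"""


def run_locally(rows):
    """Run the SQL rewrite against an in-memory SQLite table.

    SQLite is only a stand-in for BigQuery here; `rows` is a list of
    (customer_id, amount) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    result = conn.execute(EQUIVALENT_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("a", 10.0), ("a", 5.0), ("b", 7.5), ("b", -2.0)]
    print(run_locally(sample))  # [('a', 15.0), ('b', 7.5)]
```

Simple filter and aggregate chains like this translate almost line for line. The harder cases are the ones raised elsewhere in the thread, such as custom UDFs or complex procedural logic.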
A. I think A could work since Data Fusion is serverless and has a GUI, which makes development faster. Plus, it supports SQL transformations under the hood. Unlike Dataproc, it doesn’t make you manage clusters, so it fits the speed and simplicity goal. Also, it can easily write to BigQuery for further analysis or ML steps. Running PySpark on Dataproc might still be too heavy and slow, and Cloud SQL isn’t really designed for big data pipelines. So A feels like a solid option to speed things up without losing scalability.
D imo. BigQuery is serverless and optimized for SQL, which makes sense here for speed and simplicity. Converting the PySpark to BigQuery SQL looks like the best fit given the requirements.