Free Google Professional Data Engineer Actual Exam Questions - Question 12 Discussion
Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know
how to store the data that is common to both workloads. What should they do?
C. Keeping data in Avro format on Cloud Storage works well since it’s accessible by both Spark and BigQuery without needing extra infrastructure like HDFS. It’s simpler than managing Dataproc storage.
Option C makes sense because storing data as Avro in Cloud Storage lets both BigQuery and Spark access it without depending on a specific cluster. It’s more flexible than tying data to HDFS on Dataproc.
C/D, but C feels less tied to infrastructure, so it's easier for both BigQuery and Spark to use.
C/D? I’d go with C since storing in Avro on Cloud Storage is more flexible for both Spark and BigQuery without locking data into a specific cluster’s HDFS. Plus, GCS is easier to scale and manage long term.
C/D? Avro in Cloud Storage (C) is great for interoperability since both Spark and BigQuery can read it easily. But using HDFS on Dataproc (D) might give better performance for Spark jobs needing fast local access.
D imo because storing the data in HDFS on a Dataproc cluster means Spark jobs can directly access it without extra data movement. Also, since they can connect BigQuery to Dataproc via connectors, this keeps the data close to both processing frameworks. Options involving BigQuery storage could limit Spark’s flexibility or require complex syncing. GCS with Avro is convenient but might add latency or consistency headaches if data is updated often. So, putting the common data in Dataproc’s HDFS gives a single source that supports both workloads efficiently.
C imo works better since Avro files in Cloud Storage can be read by both Spark and BigQuery without extra conversion steps. It’s more flexible than locking data into BigQuery or HDFS alone.
C. Keeping data encoded as Avro in Cloud Storage makes it easy for both Spark and BigQuery to access without locking you into one system’s format or storage.
Maybe C. Storing common data in GCS as Avro sounds like a good way to keep it accessible for both BigQuery and Spark workloads.
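For anyone wondering what C looks like in practice, here is a minimal sketch, assuming the shared data sits as Avro files under one Cloud Storage prefix. The bucket, project, dataset, and table names are made up for illustration, and the helper functions are hypothetical, not part of any Google library. BigQuery reads the files through an external table, and Spark reads the same files with the Avro data source (which needs the spark-avro package available to the job).

```python
def external_table_ddl(table: str, gcs_uri: str) -> str:
    """Build BigQuery DDL for an external table over Avro files in Cloud Storage."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table}`\n"
        f"OPTIONS (format = 'AVRO', uris = ['{gcs_uri}']);"
    )


def read_avro_in_spark(spark, gcs_uri: str):
    """Load the same files as a Spark DataFrame.

    Assumes `spark` is an existing SparkSession with the spark-avro package
    and a Cloud Storage connector available (as on a Dataproc cluster).
    """
    return spark.read.format("avro").load(gcs_uri)


# Hypothetical names, for illustration only.
SHARED_URI = "gs://example-shared-bucket/common/*.avro"
ddl = external_table_ddl("example_project.example_dataset.common_data", SHARED_URI)
print(ddl)
```

Both sides point at the same `gs://` path, so there is one copy of the data and nothing depends on a particular cluster staying up.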