Question 1

You orchestrate ETL pipelines by using Cloud Composer. One of the tasks in the Apache Airflow
directed acyclic graph (DAG) relies on a third-party service. You want to be notified when the task
does not succeed. What should you do?

Accepted Answer

D

Explanation: The on_failure_callback parameter is a standard feature of Apache Airflow operators. It lets you specify a Python callable (a function) that is executed only when a task instance fails. This is the most direct and appropriate mechanism for triggering a custom action, such as sending a notification, in response to a task failure. By assigning a function with notification logic to this parameter for the specific task, you ensure that you are alerted precisely when that task does not succeed after all configured retries have been exhausted.
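A minimal sketch of such a callback follows. It assumes only that Airflow calls the function with a single context dictionary containing the "task_instance" and "exception" entries. The names notify_on_failure, build_failure_message, SENT_ALERTS, and call_vendor_api are hypothetical, and the in-memory list stands in for a real notification channel such as email or chat.

```python
from typing import Any, Dict, List

# Hypothetical in-memory sink standing in for email, chat, or paging.
SENT_ALERTS: List[str] = []


def build_failure_message(context: Dict[str, Any]) -> str:
    """Format an alert from the context dict passed to the callback."""
    ti = context.get("task_instance")
    task_id = getattr(ti, "task_id", "unknown_task")
    dag_id = getattr(ti, "dag_id", "unknown_dag")
    error = context.get("exception")
    return f"Task {dag_id}.{task_id} failed: {error}"


def notify_on_failure(context: Dict[str, Any]) -> None:
    """Callback taking one context argument, as Airflow callbacks do."""
    SENT_ALERTS.append(build_failure_message(context))


# In a DAG file, the callback would be attached only to the task that
# depends on the third-party service, for example (task name assumed):
#   PythonOperator(task_id="call_vendor_api", python_callable=...,
#                  on_failure_callback=notify_on_failure)
```

Attaching the callback to the single task, rather than setting it in the DAG's default_args, keeps notifications limited to the task that depends on the third-party service.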

Question 2

You want to rebuild your batch pipeline for structured data on Google Cloud. You are using PySpark to
conduct data transformations at scale, but your pipelines are taking over twelve hours to run. To
expedite development and pipeline run time, you want to use a serverless tool and SQL syntax. You
have already moved your raw data into Cloud Storage. How should you build the pipeline on Google
Cloud while meeting speed and processing requirements?

Accepted Answer

C

Explanation: The most effective solution is to adopt a serverless Extract-Load-Transform (ELT) pattern using Cloud Storage and BigQuery. First, load the raw structured data from Cloud Storage directly into a staging table in BigQuery. BigQuery is a fully-managed, serverless data warehouse designed for petabyte-scale analytics. Then, convert the PySpark transformation logic into BigQuery Standard SQL queries. Executing these SQL queries within BigQuery leverages its massively parallel processing engine to perform the transformations far more quickly than the original Spark job. The final results are then written to a new, permanent table in BigQuery. This approach meets all requirements: it is serverless, uses SQL, and significantly expedites both development and execution time.

Question 3

You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference
data from BigQuery. The reference data is small enough to fit in memory on a single worker. The
pipeline should write enriched results to BigQuery for analysis. Which job type and transforms
should this pipeline use?

Accepted Answer

C

Explanation: The pipeline must be a streaming job because it ingests data from Cloud Pub/Sub, which is an unbounded, continuous data source. The pipeline will use PubSubIO to read from Pub/Sub and BigQueryIO to both read the static reference data and write the enriched output. The key requirement is enriching the streaming data with a small, static reference dataset. The most efficient and standard Apache Beam pattern for this scenario is using side-inputs. A side input allows a ParDo transform to access an entire PCollection (the reference data) in-memory while processing each element of the main PCollection (the streaming data), making it ideal for lookups.
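The lookup that a side input enables can be sketched in plain Python. This is an illustration under assumptions, not the pipeline itself: the field names product_id and product_name are hypothetical, and run_locally only simulates what Beam does when it hands the whole reference collection to each call.

```python
from typing import Dict, Iterable, List, Tuple


def enrich(event: Dict[str, str], reference: Dict[str, str]) -> Dict[str, str]:
    """Per-element logic: look the event's key up in the in-memory reference data."""
    enriched = dict(event)  # copy so the input element is not mutated
    enriched["product_name"] = reference.get(event["product_id"], "UNKNOWN")
    return enriched


def run_locally(
    events: Iterable[Dict[str, str]],
    reference_rows: Iterable[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Local stand-in for the pipeline: build the lookup once, apply it per event."""
    reference = dict(reference_rows)
    return [enrich(event, reference) for event in events]


# In an Apache Beam pipeline the same function would receive the reference
# data as a side input, for example:
#   events | beam.Map(enrich, reference=beam.pvalue.AsDict(reference_kv))
# where reference_kv is a PCollection of (key, value) pairs read from BigQuery.
```

Because the reference data fits in memory on one worker, each lookup is a dictionary access rather than a shuffle-based join.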

Question 4

You need to deploy additional dependencies to all nodes of a Cloud Dataproc cluster at startup using an
existing initialization action. Company security policies require that Cloud Dataproc nodes do not
have access to the Internet, so public initialization actions cannot fetch resources. What should you
do?

Accepted Answer

C

Explanation: The primary constraint is that Cloud Dataproc nodes cannot access the public internet, which prevents initialization actions from downloading dependencies from public repositories. The correct and secure approach is to pre-stage all required dependencies in a location accessible from within the Google Cloud private network. A Cloud Storage bucket is the ideal service for this. By copying dependencies to a bucket and ensuring the cluster's subnet has Private Google Access enabled, the initialization action script can use gsutil to securely copy the files from the bucket to the cluster nodes for installation, fully complying with the security policy.

Question 5

A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real
time. This is then loaded into BigQuery. Analysts in your company want to query the tracking data in
BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created
with ingest-date partitioning. Over time, the query processing time has increased. You need to
implement a change that would improve query performance in BigQuery. What should you do?

Accepted Answer

B

Explanation: The analysts are querying geospatial trends in the lifecycle of a package, which means their queries will frequently filter by a specific package-tracking ID. The current table is partitioned by ingest date, which forces BigQuery to scan the entirety of each relevant date partition to find all events for a single package. By implementing clustering on the package-tracking ID column, BigQuery will physically co-locate all data for the same package within each partition. This allows the query engine to prune the blocks it needs to scan, significantly reducing query processing time and cost for these common analytical queries.

Question 6

You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to
the same dataset. You need to keep the costs of data sharing low and ensure that the data is current.
Which solution should you choose?

Accepted Answer

A

Explanation: Creating an authorized view is the most effective solution as it directly meets all the requirements. An authorized view allows you to grant third parties access to the results of a query without giving them access to the underlying source table. This ensures data is always current because the view queries the live data. For cost-effectiveness, the data owner only pays for the storage of the original dataset. The third-party company is billed for the queries they run against the view in their own Google Cloud project, thus keeping costs low for the data owner.

Question 7

How would you query specific partitions in a BigQuery table?

Accepted Answer

C

Explanation: To query specific partitions in a BigQuery ingestion-time partitioned table, you must use a pseudo-column in the WHERE clause. The _PARTITIONTIME pseudo-column contains the UTC timestamp for the start of the partition. By filtering on _PARTITIONTIME (or the simpler _PARTITIONDATE for daily partitions), you instruct BigQuery to scan only the relevant partitions. This process, known as partition pruning, significantly reduces the amount of data scanned, which lowers query costs and improves performance.
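As an illustration, the helper below builds a query whose WHERE clause filters on the _PARTITIONTIME pseudo-column. It is a sketch only: the function name and the table my_project.my_dataset.events are hypothetical, and production code would normally pass the dates as query parameters rather than formatting them into the SQL text.

```python
from datetime import date


def partition_pruned_query(table: str, start: date, end: date) -> str:
    """Build a query restricted to partitions in the half-open range [start, end)."""
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE _PARTITIONTIME >= TIMESTAMP('{start.isoformat()}') "
        f"AND _PARTITIONTIME < TIMESTAMP('{end.isoformat()}')"
    )


# Hypothetical table name, used only for illustration.
query = partition_pruned_query(
    "my_project.my_dataset.events", date(2024, 1, 1), date(2024, 1, 3)
)
```

With this filter, only the partitions for 2024-01-01 and 2024-01-02 are eligible to be scanned.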

Question 8

Which of these statements about BigQuery caching is true?

Accepted Answer

D

Explanation: BigQuery automatically caches the results of queries that have been run previously. When an identical query is submitted again, BigQuery can retrieve the results directly from this temporary cache. This process is significantly faster than re-executing the query. Because no data is processed for a cached query, there is no charge for on-demand analysis when the results are served from the cache. The job statistics will indicate a cacheHit.

Question 9

An organization maintains a Google BigQuery dataset that contains tables with user-level data. They want to expose aggregates of this data to other Google Cloud projects, while still controlling access to the user-level data. Additionally, they need to minimize their overall storage cost and ensure the analysis cost for other projects is assigned to those projects. What should they do?

Accepted Answer

A

Explanation: An authorized view is the ideal solution for this scenario. It is a virtual table defined by a SQL query that can perform aggregations on the source data. By sharing only the view, you grant access to the aggregated results without exposing the underlying user-level data. Since a view is a logical construct and does not store data, it incurs no additional storage costs, satisfying the cost-minimization requirement. When users from other projects query the authorized view, the processing costs are billed to their respective projects, fulfilling the final requirement.

Question 10

You are using Cloud Bigtable to persist and serve stock market data for each of the major indices. To
serve the trading application, you need to access only the most recent stock prices that are streaming
in. How should you design your row key and tables to ensure that you can access the data with the
simplest query?

Accepted Answer

B

Explanation: The optimal design for this time-series use case is to use a single table with a composite row key that facilitates the primary query: retrieving the most recent entry. The row key [index]#[reverse_timestamp] achieves this. By prefixing with the stock index, all data for a given index is grouped together. Using a reverse timestamp (e.g., Long.MAX_VALUE - timestamp) ensures that the most recent data (with the highest original timestamp) has the lowest reverse timestamp value. This places the newest entries at the very beginning of each index's row range, allowing the most recent price to be retrieved with a simple and highly efficient prefix scan limited to one row.
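The row-key construction can be sketched as follows. The function name, the millisecond unit, and the zero-padding width are assumptions added for illustration. Padding matters because Bigtable orders row keys lexicographically, so the reverse timestamp must have a fixed width for string order to match numeric order.

```python
JAVA_LONG_MAX = 2**63 - 1            # same value as Java's Long.MAX_VALUE
KEY_WIDTH = len(str(JAVA_LONG_MAX))  # fixed width (19 digits) for padding


def row_key(index: str, timestamp_ms: int) -> str:
    """Return '<index>#<reverse timestamp>' so that newer rows sort first."""
    reverse_ts = JAVA_LONG_MAX - timestamp_ms
    return f"{index}#{reverse_ts:0{KEY_WIDTH}d}"


# Sorting the keys as strings mimics Bigtable's lexicographic row order.
keys = sorted(row_key("NASDAQ", ts) for ts in (1000, 3000, 2000))
newest_first = keys[0]  # key built from timestamp 3000
```

A read of the prefix "NASDAQ#" limited to one row would then return the entry with the latest timestamp.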

Question 11

You have data stored in BigQuery. The data in the BigQuery dataset must be highly available. You
need to define a storage, backup, and recovery strategy of this data that minimizes cost. How should
you configure the BigQuery table?

Accepted Answer

C

Explanation: This option provides the best balance between the requirements of high availability and cost minimization. A multi-regional dataset in BigQuery automatically replicates data across multiple geographic regions, providing high availability and resilience against a single regional failure. This directly addresses the primary requirement for the data to be "highly available." For recovery, using a point-in-time snapshot (BigQuery's time travel feature) is the most cost-effective method. It allows recovery from logical errors (e.g., accidental deletion or modification) within a 7-day window at no additional storage cost, thus minimizing the overall cost of the backup and recovery strategy.

Question 12

You need ads data to serve AI models and historical data for analytics. Long-tail and outlier data points
need to be identified. You want to cleanse the data in near-real time before running it through AI
models. What should you do?

Accepted Answer

C

Explanation: The core requirements are near-real-time data cleansing and preparing data for both AI models and historical analytics. Dataflow is Google Cloud's fully managed service for stream and batch data processing, making it the ideal choice for this scenario. It allows for programmatic identification of outliers and cleansing of data as it arrives. Using BigQuery as a sink (destination) is a standard and highly effective pattern. The cleansed, real-time data is streamed into BigQuery, where it is immediately available for querying by AI models and for long-term historical analysis.

Question 13

Dataproc clusters contain many configuration files. To update these files, you will need to use the --properties option. The format for the option is: file_prefix:property=_____.

Accepted Answer

B

Explanation: When creating a Dataproc cluster with the gcloud command-line tool, the --properties flag is used to set or override configuration settings in files like core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and others. The correct syntax for this flag follows a standard key-value pair format. The format file_prefix:property=value allows you to specify the configuration file (via its prefix), the specific property within that file, and the value you wish to assign to it.
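A small helper makes the file_prefix:property=value format concrete. The helper name is hypothetical and the two properties are arbitrary examples. The sketch assumes the spark prefix targets spark-defaults.conf and the core prefix targets core-site.xml, and that multiple settings are joined with commas.

```python
from typing import Dict, Tuple


def properties_flag(props: Dict[Tuple[str, str], str]) -> str:
    """Render {(file_prefix, property): value} as a single --properties flag."""
    pairs = [
        f"{prefix}:{name}={value}" for (prefix, name), value in props.items()
    ]
    return "--properties=" + ",".join(pairs)


# Example values chosen for illustration only.
flag = properties_flag(
    {
        ("spark", "spark.executor.memory"): "4g",
        ("core", "io.file.buffer.size"): "65536",
    }
)
```

The resulting string would be appended to a gcloud dataproc clusters create command.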

Question 14

You have a BigQuery table that contains customer data, including sensitive information such as
names and addresses. You need to share the customer data with your data analytics and consumer
support teams securely. The data analytics team needs to access the data of all the customers, but
must not be able to access the sensitive data. The consumer support team needs access to all data columns, but must not be able to access
customers that no longer have active contracts. You enforced these requirements by using an
authorized dataset and policy tags. After implementing these steps, the data analytics team reports
that they still have access to the sensitive columns. You need to ensure that the data analytics team
does not have access to restricted data. What should you do?
Choose 2 answers

Accepted Answer

B, C

Explanation: The problem describes a failure in implementing BigQuery column-level security using policy tags. For this feature to function correctly, two conditions must be met. First, the Data Catalog taxonomy that contains the policy tags must have access control enforcement enabled. Without this, the tags serve only as metadata and do not restrict data access. Second, access to data in a protected column is granted by the Fine-Grained Reader IAM role (roles/datacatalog.categoryFineGrainedReader) on the specific policy tag. If the data analytics team has this role for the sensitive columns' tags, they will be able to view the data. Therefore, to fix the issue, you must ensure access control is enforced on the taxonomy and that the analytics team does not have the Fine-Grained Reader role.

Question 15

The Dataflow SDKs have been recently transitioned into which Apache service?

Accepted Answer

D

Explanation: Google created the Dataflow model and its associated SDKs for unified batch and stream data processing. In 2016, Google donated the Dataflow SDKs and the underlying programming model to the Apache Software Foundation. This contribution became the basis for the open-source project Apache Beam (Batch + strEAM). Apache Beam provides a portable, unified programming model, and Google Cloud Dataflow is one of several "runners" or execution engines that can run Beam pipelines. This allows developers to write data processing pipelines that are not locked into a single execution environment.
