Question 1

In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?

Accepted Answer

D

Explanation: The MERGE INTO command performs upserts, a combination of inserts and updates, from a source table into a target Delta table [1]. It suits scenarios where the target table cannot contain duplicate records, for example when rows must stay unique on a key column. MERGE INTO matches source and target rows on a merge condition and performs different actions depending on whether the rows match: it can update existing target rows with new source values, insert source rows that do not exist in the target table, or delete target rows that do not exist in the source table [1]. The INSERT INTO command appends new rows to an existing table [2]. It does not update or delete existing target rows, so loading the same records twice produces duplicates. INSERT INTO is sufficient in the other scenarios: when the location of the data needs to be changed, such as moving data from one table to another or partitioning it by a certain column [2]; when the target table is an external table whose data is stored in a system like Amazon S3 or Azure Blob Storage [3]; when the source table can be deleted, such as a temporary table or a view [4]; and when the source is not a Delta table, such as a Parquet, CSV, JSON, or Avro file [5]. Reference: [1] MERGE INTO | Databricks on AWS; [2] INSERT INTO | Databricks on AWS; [3] External tables | Databricks on AWS; [4] Temporary views | Databricks on AWS; [5] Data sources | Databricks on AWS
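
The difference between appending and upserting can be sketched in plain Python. This is not Delta Lake code: the `insert_into` and `merge_into` helpers, the `id` key, and the sample rows are hypothetical stand-ins that only mimic the semantics of the two commands.

```python
# Plain-Python stand-in for append vs. upsert semantics (not the Delta Lake API).

def insert_into(target, source):
    """Append every source row, like INSERT INTO: duplicates are possible."""
    return target + source


def merge_into(target, source, key):
    """Upsert, like MERGE INTO with WHEN MATCHED UPDATE / WHEN NOT MATCHED INSERT."""
    merged = {row[key]: row for row in target}  # index target rows by key
    for row in source:
        merged[row[key]] = row                  # update if matched, insert if not
    return list(merged.values())


# Hypothetical sample data: id 2 exists in both tables.
target = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
source = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}]

appended = insert_into(target, source)
upserted = merge_into(target, source, key="id")

print(len(appended))  # 4 rows: id 2 now appears twice
print(len(upserted))  # 3 rows: id 2 updated in place, id 3 inserted
```

With the append, the target ends up holding two rows for id 2; with the upsert, each id appears once, which is the property the accepted answer relies on.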

Question 2

Which of the following data workloads will utilize a Gold table as its source?

Accepted Answer

D

Explanation: A Gold table is a table that contains highly refined and aggregated data that powers analytics, machine learning, and production applications. It represents data that has been transformed into knowledge, rather than just information. A Gold table is typically the final output of a medallion lakehouse architecture, where data flows from Bronze to Silver to Gold tables, with each layer improving the structure and quality of data. A job that queries aggregated data designed to feed into a dashboard is an example of a data workload that will utilize a Gold table as its source, as it requires data that is ready for consumption and analysis. The other options are either data workloads that will use a Bronze or Silver table as their source, or data workloads that will produce a Gold table as their output. Reference: Databricks Documentation - What is the medallion lakehouse architecture?, Databricks Documentation - What is a Medallion Architecture?, K21Academy - Delta Lake Architecture & Azure Databricks Workspace.

Question 3

A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader,
but the engineer has not provided any type inference or schema hints in their pipeline. Upon
reviewing the data, the data engineer has noticed that all of the columns in the target table are of
the string type despite some of the fields only including float or boolean values.
Which of the following describes why Auto Loader inferred all of the columns to be of the string
type?

Accepted Answer

B

Explanation: JSON data is a text-based format that represents data as a collection of name-value pairs. By default, when Auto Loader infers the schema of JSON data, it treats all columns as strings. This is because JSON data can have varying data types for the same column across different files or records, and Auto Loader does not attempt to reconcile these differences. For example, a column named “age” may have integer values in some files, but string values in others. To avoid data loss or errors, Auto Loader infers the column as a string type. However, Auto Loader also provides an option to infer more precise column types based on the sample data. This option is called cloudFiles.inferColumnTypes and it can be set to true or false. When set to true, Auto Loader tries to infer the exact data types of the columns, such as integers, floats, booleans, or nested structures. When set to false, Auto Loader infers all columns as strings. The default value of this option is false. Reference: Configure schema inference and evolution in Auto Loader, Schema inference with auto loader (non-DLT and DLT), Using and Abusing Auto Loader’s Inferred Schema, Explicit path to data or a defined schema required for Auto loader.
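
A minimal sketch of turning on type inference, assuming a Databricks runtime where Auto Loader (the `cloudFiles` source) is available and `spark` is an active session. The function name and the paths passed to it are hypothetical; the option names are the ones discussed above plus `cloudFiles.schemaLocation`, where Auto Loader keeps the inferred schema.

```python
# Sketch only: intended for Databricks, where the "cloudFiles" source exists.
# The source and schema paths are hypothetical placeholders supplied by the caller.

AUTO_LOADER_OPTIONS = {
    "cloudFiles.format": "json",
    # The default is "false", which yields all-string columns for JSON input.
    "cloudFiles.inferColumnTypes": "true",
}


def read_json_with_inferred_types(spark, source_path, schema_path):
    """Return a streaming DataFrame whose column types are inferred from sample data."""
    reader = spark.readStream.format("cloudFiles")
    for name, value in AUTO_LOADER_OPTIONS.items():
        reader = reader.option(name, value)
    # Auto Loader tracks the inferred schema at this location.
    reader = reader.option("cloudFiles.schemaLocation", schema_path)
    return reader.load(source_path)
```

Leaving `cloudFiles.inferColumnTypes` unset reproduces the behavior in the question: every column arrives as a string.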

Question 4

Which two components function in the DB platform architecture’s control plane? (Choose two.)

Accepted Answer

Explanation:

Question 5

Which of the following code blocks will remove the rows where the value in column age is greater than 25 from the existing Delta table my_table and save the updated table?

Accepted Answer

C

Explanation: The DELETE command in Delta Lake removes data that matches a predicate from a Delta table. The statement DELETE FROM my_table WHERE age > 25 deletes every row where the value in column age is greater than 25 from the existing Delta table my_table and saves the updated table. The other options are either incorrect or do not achieve the desired result. Option A only selects the rows that match the predicate and does not delete them. Option B updates the rows that match the predicate and does not delete them. Option D updates the rows that do not match the predicate and does not delete them. Option E deletes the rows that do not match the predicate, which is the opposite of the desired result. Reference: Table deletes, updates, and merges - Delta Lake Documentation
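
The contrast between selecting and deleting rows can be tried locally. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for a Delta table; the sample rows are hypothetical, and only the `DELETE ... WHERE` statement carries over to the question.

```python
import sqlite3

# SQLite is used here only as a local stand-in for a Delta table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("Ann", 22), ("Bob", 25), ("Cy", 31), ("Di", 40)],  # hypothetical rows
)

# SELECT only reads the matching rows; the table is unchanged afterwards.
matching = conn.execute("SELECT name FROM my_table WHERE age > 25").fetchall()

# DELETE removes the matching rows, and the change persists in the table.
conn.execute("DELETE FROM my_table WHERE age > 25")
conn.commit()

remaining = conn.execute("SELECT name, age FROM my_table ORDER BY age").fetchall()
print(remaining)  # [('Ann', 22), ('Bob', 25)]
```

The row with age 25 survives because the predicate is strictly greater than 25.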

Question 6

A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:
https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/DATABRICKS-CERTIFIED-DATA-ENGINEER-ASSOCIATE/page_41_img_1.jpg
Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?

Accepted Answer

E

Explanation: To read from a stream source, the data engineer needs to use spark.readStream instead of spark.read. spark.readStream returns a DataStreamReader object that can be used to specify the details of the input source, such as the format, the schema, the path, and the options. spark.read returns a DataFrameReader, which is only suitable for batch processing, not stream processing. The other changes are not necessary or correct for reading from a stream source. Reference: Structured Streaming Programming Guide, Read a stream, Databricks Data Sources
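
A minimal sketch of the one-word change, assuming `spark` is an active SparkSession. The code block from the question is only available as an image, so this is not a reconstruction of it; the `read_transactions` helper is hypothetical and only shows where `read` and `readStream` differ.

```python
# Sketch only: `spark` is assumed to be an active SparkSession, and
# "transactions" is the table name taken from the question.

def read_transactions(spark, streaming):
    """Read the table as a stream when `streaming` is true, otherwise as a batch."""
    # spark.read       -> DataFrameReader  (batch)
    # spark.readStream -> DataStreamReader (streaming)
    reader = spark.readStream if streaming else spark.read
    return reader.table("transactions")
```

Everything after the reader is chosen stays the same, which is why swapping `spark.read` for `spark.readStream` is the only change required.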

Question 7

Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?

Accepted Answer

D

Explanation: In a medallion architecture, a common data design pattern for lakehouses, data flows from Bronze to Silver to Gold layer tables, with each layer progressively improving the structure and quality of data. Bronze tables store raw data ingested from various sources, while Silver tables apply minimal transformations and cleansing to create an enterprise view of the data. Silver tables can also join and enrich data from different Bronze tables to provide a more complete and consistent view of the data. Therefore, option D is the correct answer, as Silver tables contain a more refined and cleaner view of data than Bronze tables. Option A is incorrect, as it is the opposite of the correct answer. Option B is incorrect, as Silver tables do not necessarily contain aggregates, but can also store detailed records. Option C is incorrect, as Silver tables may contain less data than Bronze tables, depending on the transformations and cleansing applied. Option E is incorrect, as Silver tables may contain more data than Bronze tables, depending on the joins and enrichments applied. Reference: What is a Medallion Architecture?, Transforming Bronze Tables in Silver Tables, What is the medallion lakehouse architecture?

Question 8

A data organization leader is upset about the data analysis team’s reports being different from the
data engineering team’s reports. The leader believes the siloed nature of their organization’s data
engineering and data analysis architectures is to blame.
Which of the following describes how a data lakehouse could alleviate this issue?

Accepted Answer

B

Explanation: A data lakehouse is a data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data [1][2]. By using a data lakehouse, both the data analysis and data engineering teams can access the same data sources and formats, ensuring data consistency and quality across their reports. A data lakehouse also supports schema enforcement and evolution, data validation, and time travel to old table versions, which can help resolve data conflicts and errors [1]. Reference: [1] What is a Data Lakehouse? - Databricks; [2] What is a data lakehouse? | IBM

Question 9

Which of the following benefits is provided by the array functions from Spark SQL?

Accepted Answer

D

Explanation: The array functions from Spark SQL are a subset of the collection functions that operate on array columns [1]. They provide an ability to work with complex, nested data ingested from JSON files or other sources [2]. For example, the explode function can be used to transform an array column into multiple rows, one for each element in the array [3]. The array_contains function can be used to check if a value is present in an array column [4]. The array_join function can be used to concatenate all elements of an array column with a delimiter. These functions can be useful for processing JSON data that may have nested arrays or objects. Reference: [1] Spark SQL, Built-in Functions - Apache Spark; [2] Spark SQL Array Functions Complete List - Spark By Examples; [3] Spark SQL Array Functions - Syntax and Examples - DWgeek.com; [4] Spark SQL, Built-in Functions - Apache Spark; Working with Nested Data Using Higher Order Functions in SQL on Databricks - The Databricks Blog
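
The behavior of the three functions named above can be illustrated with plain-Python analogues. This is not PySpark: the `rows` data and field names are hypothetical, and each comprehension only mirrors what the corresponding Spark SQL function does to an array column.

```python
# Plain-Python analogues of three Spark SQL array functions (not PySpark itself).

# Hypothetical nested records, as they might arrive from a JSON file.
rows = [
    {"order_id": 1, "items": ["pen", "ink"]},
    {"order_id": 2, "items": ["pad"]},
]

# explode(items): one output row per array element.
exploded = [
    {"order_id": r["order_id"], "item": item} for r in rows for item in r["items"]
]

# array_contains(items, 'ink'): true when the value is present in the array.
has_ink = ["ink" in r["items"] for r in rows]

# array_join(items, ', '): concatenate the elements with a delimiter.
joined = [", ".join(r["items"]) for r in rows]

print(len(exploded))  # 3
print(has_ink)        # [True, False]
print(joined)         # ['pen, ink', 'pad']
```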

Question 10

Which of the following describes a scenario in which a data team will want to utilize cluster pools?

Accepted Answer

A

Explanation: Databricks cluster pools are a set of idle, ready-to-use instances that can reduce cluster start and auto-scaling times. This is useful for scenarios where a data team needs to run an automated report as quickly as possible, without waiting for the cluster to launch or scale up. Cluster pools can also help save costs by reusing idle instances across different clusters and avoiding DBU charges for idle instances in the pool. Reference: Best practices: pools | Databricks on AWS, Best practices: pools - Azure Databricks | Microsoft Learn, Best practices: pools | Databricks on Google Cloud

Question 11

A new data engineering team has been assigned to an ELT project. The new data engineering
team will need full privileges on the table sales to fully manage the project.
Which command can be used to grant full permissions on the table to the new data engineering
team?

Accepted Answer

A

Explanation: To grant full privileges on a table such as 'sales' to a group like 'team', the correct SQL command in Databricks is: GRANT ALL PRIVILEGES ON TABLE sales TO team; This command assigns every privilege that applies to the table, such as SELECT and MODIFY (MODIFY covers inserting, updating, and deleting data), to the specified group. This is typically necessary when a team needs full control over a table to manage and manipulate it as part of a project or ongoing maintenance. Reference: Databricks documentation on SQL permissions: SQL Permissions in Databricks

Question 12

A data engineer needs access to a table new_table, but they do not have the correct permissions.
They can ask the table owner for permission, but they do not know who the table owner is.
Which approach can be used to identify the owner of new_table?

Accepted Answer

D

Explanation: To find the owner of a table in Databricks, one can utilize the Data Explorer feature. The Data Explorer provides detailed information about various data objects, including tables. By navigating to the specific table's page in Data Explorer, a data engineer can review the Owner field, which identifies the individual or role that owns the table. This information is crucial for obtaining the necessary permissions or for any administrative actions related to the table. Reference: Databricks documentation on Data Explorer: Using Data Explorer in Databricks

Question 13

Which of the following is stored in the Databricks customer's cloud account?

Accepted Answer

D

Explanation: Of the options, only data is stored in the Databricks customer’s cloud account. Data lives in the customer’s cloud storage service, such as AWS S3 or Azure Data Lake Storage, so the customer keeps full control and ownership of it and can access it directly from their cloud account. Option A is not correct: the Databricks web application is hosted and managed by Databricks on its own cloud infrastructure, and the customer does not install or maintain it but only accesses it through a web browser. Option B is not correct: cluster management metadata, which includes information such as cluster configuration, status, logs, and metrics, is stored and managed by Databricks. The customer can view and manage clusters through the web application but does not have direct access to that metadata. Option C is not correct: repos, the version-controlled repositories that store code and data files for Databricks projects, are stored and managed by Databricks, and the customer creates and manages them through the web application. Option E is not correct: notebooks, the interactive documents that contain code, text, and visualizations for Databricks workflows, are likewise stored and managed by Databricks and are created and managed through the web application. Reference: Databricks Architecture, Databricks Data Sources, Databricks Repos, Databricks Notebooks, Databricks Data Engineer Professional Exam Guide

Question 14

Which of the following describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE
INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables
(DLT) tables using SQL?

Accepted Answer

B

Explanation: A streaming live table or view processes data that has been added only since the last pipeline update. Streaming tables and views are stateful; if the defining query changes, new data will be processed based on the new query and existing data is not recomputed. This is useful when data needs to be processed incrementally, such as when ingesting streaming data sources or performing incremental loads from batch data sources. A live table or view, on the other hand, may be entirely computed when possible to optimize computation resources and time. This is suitable when data needs to be processed in full, such as when performing complex transformations or aggregations that require scanning all the data. Reference: Difference between LIVE TABLE and STREAMING LIVE TABLE, CREATE STREAMING TABLE, Load data using streaming tables in Databricks SQL.
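
The two update styles can be contrasted with a plain-Python sketch. This is not the Delta Live Tables API: the `StreamingTable` and `LiveTable` classes are hypothetical stand-ins, and the source is assumed to be an append-only list.

```python
# Plain-Python stand-in for the two update styles (not the Delta Live Tables API).

class StreamingTable:
    """Processes only rows that arrived since the last update; keeps prior results."""

    def __init__(self, transform):
        self.transform = transform
        self.rows = []
        self.offset = 0  # number of source rows already processed

    def update(self, source):
        new_rows = source[self.offset:]
        self.rows.extend(self.transform(r) for r in new_rows)
        self.offset = len(source)
        return len(new_rows)  # rows processed in this update


class LiveTable:
    """Recomputes its full result from the source on every update."""

    def __init__(self, transform):
        self.transform = transform
        self.rows = []

    def update(self, source):
        self.rows = [self.transform(r) for r in source]
        return len(source)  # rows processed in this update


source = [1, 2, 3]
streaming = StreamingTable(lambda x: x * 10)
live = LiveTable(lambda x: x * 10)
first = (streaming.update(source), live.update(source))

source.append(4)  # new data arrives
second = (streaming.update(source), live.update(source))

# Changing the defining query affects only new rows in the streaming table.
streaming.transform = live.transform = lambda x: x * 100
source.append(5)
streaming.update(source)
live.update(source)

print(first, second)   # (3, 3) (1, 4)
print(streaming.rows)  # [10, 20, 30, 40, 500]
print(live.rows)       # [100, 200, 300, 400, 500]
```

The second update shows the incremental behavior (one row processed instead of four), and the final state shows the statefulness: rows already written by the streaming table are not recomputed when the query changes.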

Question 15

Which of the following must be specified when creating a new Delta Live Tables pipeline?

Accepted Answer

E

Explanation: Option E is the correct answer because it is the only mandatory requirement when creating a new Delta Live Tables pipeline. A pipeline is a data processing workflow that contains materialized views and streaming tables declared in Python or SQL source files. Delta Live Tables infers the dependencies between these tables and ensures updates occur in the correct order. To create a pipeline, you need to specify at least one notebook library to be executed, which contains the Delta Live Tables syntax. You can also specify multiple libraries of different languages within your pipeline. The other options are optional or not applicable for creating a pipeline. Option A is not required, but you can optionally provide a key-value pair configuration to customize the pipeline settings, such as the storage location, the target schema, the notifications, and the pipeline mode. Option B is not applicable, as the DBU/hour cost is determined by the cluster configuration, not the pipeline creation. Option C is not required, but you can optionally specify a storage location for the output data from the pipeline. If you leave it empty, the system uses a default location. Option D is not required, but you can optionally specify a location of a target database for the written data, either in the Hive metastore or the Unity Catalog. Reference: Tutorial: Run your first Delta Live Tables pipeline, What is Delta Live Tables?, Create a pipeline, Pipeline configuration.
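
As a rough illustration of how little is mandatory, the sketch below builds a minimal set of pipeline settings as a Python dictionary. The pipeline name and notebook path are hypothetical placeholders, and the exact settings accepted may vary by Databricks release; the point is only that a notebook library is listed while the optional items are left out.

```python
import json

# Hypothetical minimal pipeline settings; the name and notebook path are placeholders.
pipeline_settings = {
    "name": "example_pipeline",
    "libraries": [
        {"notebook": {"path": "/Users/someone@example.com/dlt_notebook"}},
    ],
    # Optional settings such as "storage", "target", or "configuration" are omitted.
}

print(json.dumps(pipeline_settings, indent=2))
```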

Free Databricks Certified Data Engineer Associate Actual Exam Questions