Question 1

You have: DataFrame A: 128 GB of transactions DataFrame B: 1 GB user lookup table Which strategy is correct for broadcasting?

Accepted Answer

B

Explanation: The most effective join optimization strategy in this scenario is the broadcast hash join. This strategy involves sending the smaller DataFrame (DataFrame B, 1 GB) to every executor node in the cluster. By doing this, each executor has a complete local copy of the user lookup table. The join can then be performed on each partition of the large DataFrame (DataFrame A, 128 GB) locally, without requiring the costly network operation of shuffling the massive 128 GB DataFrame. This significantly improves performance by minimizing data movement across the network.

Question 2

A Spark DataFrame df is cached using the MEMORY_AND_DISK storage level, but the DataFrame is
too large to fit entirely in memory.
What is the likely behavior when Spark runs out of memory to store the DataFrame?

Accepted Answer

C

Explanation: The MEMORY_AND_DISK storage level instructs Spark to first attempt to store the DataFrame's partitions in memory as deserialized objects. If the DataFrame is too large to fit entirely in memory, Spark stores the partitions that fit and spills the remaining partitions to disk. When an action requires a partition stored on disk, Spark reads it from there. This avoids job failure due to insufficient memory but adds overhead, because disk I/O is significantly slower than memory access. The computation can therefore proceed even when the data exceeds available RAM.

Question 3

A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming. They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds. Which code snippet fulfills this requirement?

Accepted Answer

A

Explanation: The trigger(processingTime="5 seconds") configuration sets up a micro-batch processing model in which the Spark engine starts a new batch at a fixed interval. The query checks for new data and starts processing it every 5 seconds. If the previous micro-batch finishes within the interval, the engine waits for the interval to elapse before starting the next one; if it overruns, the next micro-batch starts as soon as the previous one completes, so batches never overlap. This fulfills the requirement for a real-time analytics pipeline that processes data in micro-batches at a fixed 5-second interval.
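The timing rule for a processing-time trigger can be sketched in plain Python. This is a simplified model, not Spark's scheduler: the function name and the batch durations are made up for illustration, and it ignores how Spark aligns triggers to interval boundaries after an overrun.

```python
from typing import List


def batch_start_times(durations: List[float], interval: float = 5.0) -> List[float]:
    """Return the start time of each micro-batch under a simplified trigger model."""
    starts = []
    start = 0.0
    for duration in durations:
        starts.append(start)
        finish = start + duration
        # A fast batch waits for the interval to elapse;
        # a slow batch is followed immediately by the next one.
        start = max(start + interval, finish)
    return starts


# Hypothetical batch durations in seconds: the second batch overruns the interval.
starts = batch_start_times([2.0, 7.0, 1.0])
print(starts)
```

The first batch takes 2 seconds, so the second starts at the 5-second mark. The second takes 7 seconds, so the third starts at 12 seconds, as soon as the second finishes.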

Question 4

An MLOps engineer is building a Pandas UDF that applies a language model that translates English
strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting
the performance of the data pipeline.
The initial code is:
def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)
in_spanish = sf.pandas_udf(in_spanish_inner, StringType())
How can the MLOps engineer change this code to reduce how many times the language model is
loaded?

Accepted Answer

D

Explanation: The Iterator[Series] -> Iterator[Series] Pandas UDF is the ideal pattern for this scenario. This type of UDF is invoked once for each data partition. The function receives an iterator of pandas Series, so the expensive model (returned by get_translation_model) can be loaded just once at the beginning of the function. The code then iterates through the batches of data within the partition, applying the already-loaded model to each batch. This avoids the significant overhead of re-initializing the model for every batch, which is what happens with a standard Series -> Series UDF.
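The load-once pattern can be illustrated without Spark. In this sketch, lists of strings stand in for pandas Series, and get_translation_model is a hypothetical stub that only counts how often it is called. In a real pipeline the function would be annotated Iterator[pd.Series] -> Iterator[pd.Series] and wrapped with pandas_udf.

```python
from typing import Callable, Iterator, List

load_count = 0  # how many times the (stubbed) model has been loaded


def get_translation_model(target_lang: str) -> Callable[[str], str]:
    """Hypothetical stand-in for an expensive model load."""
    global load_count
    load_count += 1
    return lambda text: f"[{target_lang}] {text}"


def in_spanish_inner(batches: Iterator[List[str]]) -> Iterator[List[str]]:
    # Load the model once per call, i.e. once per iterator of batches.
    model = get_translation_model(target_lang="es")
    for batch in batches:
        # Reuse the already-loaded model for every batch.
        yield [model(text) for text in batch]


# Two batches, as one partition might deliver them.
results = list(in_spanish_inner(iter([["hello", "world"], ["good morning"]])))
print(results, load_count)
```

Both batches are processed, yet the load counter ends at 1. A per-batch function would have loaded the model twice.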

Question 5

In the code block below, aggDF contains aggregations on a streaming DataFrame: https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks_DATABRICKS-CERTIFIED-ASSOCIATE-DEVELOPER-FOR-APACHE-SPARK-3.5/page_30_img_1.jpg Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Accepted Answer

A

Explanation: In Apache Spark Structured Streaming, the complete output mode is specifically designed for aggregation queries. When this mode is used, the entire updated Result Table, which contains all the aggregated data from the beginning of the stream, is written to the sink during each trigger interval. This fulfills the requirement of ensuring the entire result table is outputted every time the stream processes new data.

Question 6

A data engineer writes the following code to join two DataFrames df1 and df2:
df1 = spark.read.csv("sales_data.csv")  # ~10 GB
df2 = spark.read.csv("product_data.csv")  # ~8 MB
result = df1.join(df2, df1.product_id == df2.product_id)
https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks_DATABRICKS-CERTIFIED-ASSOCIATE-DEVELOPER-FOR-APACHE-SPARK-3.5/page_38_img_1.jpg
Which join strategy will Spark use?

Accepted Answer

B

Explanation: Apache Spark's Catalyst Optimizer automatically selects the most efficient physical plan for a query. For join operations, when one of the DataFrames is significantly smaller than the other, a Broadcast Hash Join is often the most performant strategy. This avoids a costly and network-intensive shuffle of the larger DataFrame. Spark makes this decision based on the configuration parameter spark.sql.autoBroadcastJoinThreshold, which has a default value of 10 MB. In this scenario, df2 has a size of 8 MB, which is below the default threshold. Therefore, Spark will automatically broadcast df2 to each executor node to be joined with the partitions of df1.
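The threshold comparison can be sketched as a small Python check. This simplifies what the planner does, since Spark compares an estimated size from table statistics rather than the raw file size. The function name is made up for illustration; the 10 MB default and the use of -1 to disable automatic broadcasting are taken from Spark's configuration.

```python
MB = 1024 * 1024

# Default of spark.sql.autoBroadcastJoinThreshold, in bytes (10 MB).
DEFAULT_AUTO_BROADCAST_THRESHOLD = 10 * MB


def is_auto_broadcast_candidate(
    estimated_size_bytes: int,
    threshold_bytes: int = DEFAULT_AUTO_BROADCAST_THRESHOLD,
) -> bool:
    """Simplified check: is a table small enough to be broadcast automatically?"""
    if threshold_bytes < 0:
        # A threshold of -1 disables automatic broadcasting.
        return False
    return 0 <= estimated_size_bytes <= threshold_bytes


small_table = is_auto_broadcast_candidate(8 * MB)     # the ~8 MB product table
large_table = is_auto_broadcast_candidate(1024 * MB)  # a 1 GB lookup table
print(small_table, large_table)
```

Under this model the 8 MB table qualifies. A 1 GB lookup table does not qualify under the default threshold, so it would need a raised threshold or an explicit broadcast hint.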

Question 7

A developer wants to test Spark Connect with an existing Spark application.
What are the two alternative ways the developer can start a local Spark Connect server without
changing their existing application code? (Choose 2 answers)

Accepted Answer

B, C

Explanation: To connect an existing Spark application to a Spark Connect server without code modification, the connection must be configured externally. The two primary methods are a command-line flag when launching the application (such as the PySpark shell) and an environment variable. The --remote "sc://localhost" flag instructs the PySpark shell to operate in Connect mode and connect to the specified server. Similarly, setting the SPARK_REMOTE="sc://localhost" environment variable before launching any Spark application (including a shell or a submitted job) configures the default SparkSession to connect to the specified remote server instead of creating a local Spark session.

Question 8

A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The
cluster has 10 nodes, each with 16 CPUs. Spark UI shows:
Low number of Active Tasks
Many tasks complete in milliseconds
Fewer tasks than available CPUs
Which approach should be used to adjust the partitioning for optimal resource allocation?

Accepted Answer

D

Explanation: The symptoms described (a low number of active tasks and fewer tasks than available CPUs) are classic indicators of under-partitioning. With 160 available CPU cores (10 nodes × 16 CPUs), the cluster is significantly underutilized. The most effective and scalable fix is to derive the number of partitions from the data size. Dividing the total dataset size (1 TB) by a recommended partition size (e.g., 128 MB) ensures that each task processes a manageable chunk of data. This yields about 8,192 partitions, enough to keep all CPU cores busy, which optimizes resource allocation and job performance.
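The arithmetic behind the partition count is easy to check. This sketch uses binary units (1 TB = 1024^4 bytes) and the 128 MB target size from the explanation; the function name is illustrative only.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def target_partition_count(dataset_bytes: int, partition_bytes: int = 128 * MB) -> int:
    """Number of partitions needed so that none exceeds partition_bytes."""
    # Ceiling division without floating point.
    return -(-dataset_bytes // partition_bytes)


total_cores = 10 * 16                      # 10 nodes x 16 CPUs
partitions = target_partition_count(1 * TB)
tasks_per_core = partitions / total_cores  # task waves each core works through

print(total_cores, partitions, tasks_per_core)
```

With 8,192 partitions across 160 cores, each core handles roughly 51 tasks over the course of the job, so no core sits idle.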

Question 9

A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered by an
upstream team on a nightly basis. The data is stored in a directory structure with a base path of
"/path/events/data". The upstream team drops daily data into the underlying subdirectories
following the convention year/month/day.
A few examples of the directory structure are:
[Image not reproduced: example directory listing]

Which of the following code snippets will read all the data within the directory structure?

Accepted Answer

B

Explanation: The Parquet files are stored in a nested directory structure (/year/month/day/). By default, the Spark DataFrameReader (spark.read.parquet) only reads files from the top-level directory provided and does not search subdirectories. To instruct Spark to traverse the entire directory tree and discover all files within the nested subdirectories, the recursiveFileLookup option must be explicitly set to true. This ensures all Parquet files under /path/events/data/ are included in the resulting DataFrame, regardless of their depth in the directory hierarchy.

Question 10

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from
failures or intentional shutdowns by continuing where the pipeline left off.
How can this be achieved?

Accepted Answer

D

Explanation: To ensure a Structured Streaming pipeline can recover from failures and continue from where it left off, you must enable checkpointing. This is done by specifying a path in a fault-tolerant file system (e.g., HDFS, S3, ADLS Gen2) with the checkpointLocation option on the DataStreamWriter (the object returned by writeStream). Spark uses this location to save all progress information for the query, including the range of offsets processed for each batch and the state of any running aggregations. When the query is restarted, it uses this checkpoint data to resume processing from the point where it stopped. Combined with a replayable source and an idempotent sink, this provides end-to-end exactly-once fault tolerance.

Question 11

Given a DataFrame df that has 10 partitions, after running the code: result = df.coalesce(20) How many partitions will the result DataFrame have?

Accepted Answer

A

Explanation: The coalesce(n) transformation in Apache Spark is an optimized method used to decrease the number of partitions. It avoids a full data shuffle by combining existing partitions on the same worker node, making it a "narrow" transformation. A key characteristic of coalesce is that it cannot be used to increase the number of partitions. When the number n provided to coalesce is greater than the current number of partitions, the operation has no effect, and the DataFrame retains its original number of partitions. In this case, the DataFrame df has 10 partitions, and coalesce(20) is called. Since 20 is greater than 10, the number of partitions remains 10.
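The partition-count rule can be written as a one-line model. This mirrors only the resulting count, not how Spark merges partitions, and the function name is made up for illustration.

```python
def partitions_after_coalesce(current: int, requested: int) -> int:
    """Model of the partition count that results from coalesce(requested)."""
    if current < 1 or requested < 1:
        raise ValueError("partition counts must be positive")
    # coalesce only merges partitions, so the count can never grow.
    return min(current, requested)


unchanged = partitions_after_coalesce(10, 20)  # the case in the question
reduced = partitions_after_coalesce(10, 4)     # an ordinary reduction
print(unchanged, reduced)
```

To actually reach 20 partitions, repartition(20) would be needed, at the cost of a full shuffle.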

Question 12

What is the difference between df.cache() and df.persist() in Spark DataFrame?

Accepted Answer

D

Explanation: The cache() method is a parameter-less convenience function that persists a DataFrame using the default storage level, which is MEMORY_AND_DISK. In contrast, the persist() method is more versatile. While calling persist() with no arguments is equivalent to cache(), its primary purpose is to let the developer specify a different storage level. This gives granular control over how the DataFrame is stored, enabling optimizations based on memory availability and performance requirements by choosing levels such as MEMORY_ONLY, DISK_ONLY, or serialized versions.

Question 13

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this: https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks_DATABRICKS-CERTIFIED-ASSOCIATE-DEVELOPER-FOR-APACHE-SPARK-3.5/page_11_img_1.jpg
The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values. Which code fragment meets the requirements?
A) https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks_DATABRICKS-CERTIFIED-ASSOCIATE-DEVELOPER-FOR-APACHE-SPARK-3.5/page_11_img_2.jpg
B) https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks_DATABRICKS-CERTIFIED-ASSOCIATE-DEVELOPER-FOR-APACHE-SPARK-3.5/page_11_img_3.jpg
C) https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks_DATABRICKS-CERTIFIED-ASSOCIATE-DEVELOPER-FOR-APACHE-SPARK-3.5/page_12_img_1.jpg
D) https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks_DATABRICKS-CERTIFIED-ASSOCIATE-DEVELOPER-FOR-APACHE-SPARK-3.5/page_12_img_2.jpg

Accepted Answer

A

Explanation: This solution addresses all requirements. First, select('region', 'region_id') projects the DataFrame with columns in the key-value order required for the final dictionary. Second, sort('region_id') orders the rows by region_id in ascending order (the default), which is necessary to find the smallest values. Third, the take(3) action retrieves only the first 3 rows from the cluster to the driver node as a list of Row objects. Finally, the dict() constructor is applied to this list, creating a Python dictionary with region as the key and region_id as the value.
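The final dict() step works because a PySpark Row is a tuple subclass, so a list of two-field rows behaves like a list of (key, value) pairs. The sketch below reproduces the select, sort, take, dict sequence in plain Python. Ordinary tuples stand in for Row objects, and the sample data is invented.

```python
# Hypothetical (region, region_id) pairs standing in for the Parquet table.
rows = [("West", 4), ("North", 1), ("East", 3), ("South", 2)]

# Equivalent of sort('region_id') followed by take(3).
smallest_three = sorted(rows, key=lambda row: row[1])[:3]

# dict() treats each 2-tuple as a (key, value) pair.
region_to_id = dict(smallest_three)
print(region_to_id)
```

Reversing the column order in the projection would produce a region_id -> region mapping instead, which is why the select order matters.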

Question 14

Given the following code snippet in my_spark_app.py: https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks_DATABRICKS-CERTIFIED-ASSOCIATE-DEVELOPER-FOR-APACHE-SPARK-3.5/page_21_img_1.jpg What is the role of the driver node?

Accepted Answer

A

Explanation: The driver node is the central coordinator for a Spark application. It hosts the main() function of the program and the SparkSession (or SparkContext). Its primary responsibility is to translate the user's code, which consists of transformations and actions on DataFrames or RDDs, into a logical and physical execution plan. This plan is represented as a Directed Acyclic Graph (DAG). The driver then breaks the plan into smaller physical execution units called tasks and works with the cluster manager to schedule and distribute these tasks to the executor processes running on worker nodes for parallel execution. The driver tracks the status of executors and tasks throughout the application's lifecycle.

Question 15

A data scientist is working with a massive dataset that exceeds the memory capacity of a single
machine. The data scientist is considering using Apache Spark™ instead of traditional single-machine
languages like standard Python scripts.
Which two advantages does Apache Spark™ offer over a normal single-machine language in this
scenario? (Choose 2 answers)

Accepted Answer

A, E

Explanation: Apache Spark is a distributed computing system designed for large-scale data processing. Its primary advantage is the ability to partition data and distribute computational tasks across a cluster of machines, enabling horizontal scalability (A). This allows it to process datasets that are far too large to fit into the memory of a single machine. Furthermore, Spark is inherently fault-tolerant. It achieves this by tracking the lineage of transformations used to build a dataset (specifically, a Resilient Distributed Dataset or RDD). If a node fails and a data partition is lost, Spark can automatically recompute that partition on another available node, ensuring the job completes successfully (E).

Free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Actual Exam Questions