Home/databricks/Free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Actual Exam Questions

Free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Actual Exam Questions

The questions for this exam were last updated on January 9, 2026

Dumps Box (DumpsBox) offers up-to-date practice exam questions for Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 certification exam which are developed and validated by Databricks subject domain experts certified in Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 . These practice questions are update regularly as we keep an eye on any recent changes in Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 syllabus, and when there is update our team quickly adjusts the questions. This commitment to providing the best quality exam prep material to certification aspirants is what makes DumpsBox.com the best certification exam prep website. On top of that, our strong, yet strictly moderated, community based feedback keeps the content clean and current. Each question has helpful community discussion that provides it extra perspective and introduces helpful resources for better exam preparation. This also saves students from other outdated practice questions or illicit exam dumps that can have adverse affects on career. Browse through our Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 exam questions and pass your exam on first try.

Question No. 1
You have:
DataFrame A: 128 GB of transactions
DataFrame B: 1 GB user lookup table
Which strategy is correct for broadcasting?
Select one option, then reveal solution.
Top comments
SX
Sohail X.
2026-02-09

B The key is avoiding shuffling the large DataFrame A, so broadcasting the smaller B is the move. Options involving broadcasting A don’t make sense given its size.

0
AY
Arjun Y.
2026-02-03

A vs B? Both push for broadcasting B since it’s smaller, but B is clearer about stopping the big DataFrame A from shuffling. That’s the key win here, so B’s explanation feels more on point.

0
Question No. 2
A Spark DataFrame df is cached using the MEMORY_AND_DISK storage level, but the DataFrame is
too large to fit entirely in memory.
What is the likely behavior when Spark runs out of memory to store the DataFrame?
Select one option, then reveal solution.
Top comments
BL
Bilal L.
2026-02-10

It’s C because Spark caches whole partitions and spills those that don’t fit to disk, so it stores as much as possible in memory first and the rest on disk with some overhead. Partial partition splitting isn’t how it works.

0
FY
Farhan Y.
2026-02-02

Probably C here. Spark caches partitions, and when memory is full, it spills entire partitions to disk rather than splitting a partition between memory and disk. This means it stores as much as it can in memory, then moves the rest to disk. D sounds tempting but Spark doesn’t track row-level frequency for caching, it works at the partition level mostly. B is definitely off since Spark isn’t balancing storage evenly, and A isn’t quite right because Spark doesn’t duplicate data both in memory and disk simultaneously. So C fits with how MEMORY_AND_DISK typically behaves under memory pressure.

0
Question No. 3

44 of 55. A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming. They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds. Which code snippet fulfills this requirement?

Select one option, then reveal solution.
Top comments
SJ
Sohail J.
2026-02-22

A Spark’s processingTime trigger is designed exactly for fixed-interval micro-batches, so A makes the most sense. B’s continuous trigger is for a different streaming mode, not fixed micro-batch intervals.

0
FJ
Farhan J.
2026-02-22

A/D? D lacks a trigger, so it won’t guarantee 5-second intervals. A explicitly sets micro-batch timing with processingTime, which fits the 5-second fixed interval best.

0
Question No. 4
An MLOps engineer is building a Pandas UDF that applies a language model that translates English
strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting
the performance of the data pipeline.
The initial code is:
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 practice exam questions
def in_spanish_inner(df: pd.Series) -> pd.Series:
model = get_translation_model(target_lang='es')
return df.apply(model)
in_spanish = sf.pandas_udf(in_spanish_inner, StringType())
How can the MLOps engineer change this code to reduce how many times the language model is
loaded?
Select one option, then reveal solution.
Top comments
BF
Brian F.
2026-02-18

It’s D because using an iterator UDF means the model loads once per partition and stays loaded across all batches, cutting down repeated reloads unlike the other options.

0
RP
Ravi P.
2026-02-17

It’s D because using an iterator UDF lets you initialize the model once per batch instead of on every single call, which really cuts down the loading overhead compared to the other options.

0
Question No. 5
In the code block below, aggDF contains aggregations on a streaming DataFrame:
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 practice exam questions
Which output mode at line 3 ensures that the entire result table is written to the console during each
trigger execution?
Select one option, then reveal solution.
Top comments
IC
Irfan C.
2026-02-18

A/B? Append only adds new rows, so it can’t output the whole table. Complete mode outputs the full aggregation every trigger, which matches what the question asks. Options C and D don’t exist in standard Spark modes.

0
IC
Irfan C.
2026-02-15

A, complete mode always outputs full aggregation results every trigger run.

0
Question No. 6
A data engineer writes the following code to join two DataFrames df1 and df2:
df1 = spark.read.csv("sales_data.csv") # ~10 GB
df2 = spark.read.csv("product_data.csv") # ~8 MB
result = df1.join(df2, df1.product_id == df2.product_id)
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 practice exam questions
Which join strategy will Spark use?
Select one option, then reveal solution.
Top comments
BA
Bilal A.
2026-02-16

A imo, no AQE means static plan so no broadcast despite small df2 size.

0
BA
Bilal A.
2026-02-15

B. Since df2 is just 8 MB, it’s well under the default broadcast threshold (usually 10 MB), so Spark should automatically pick a broadcast join here without needing hints or AQE. The fact that df1 is huge doesn’t matter as much as this small side table being eligible for broadcast.

0
Question No. 7
A developer wants to test Spark Connect with an existing Spark application.
What are the two alternative ways the developer can start a local Spark Connect server without
changing their existing application code? (Choose 2 answers)
Select all that apply, then reveal solution.
Top comments
RJ
Ryan J.
2026-02-01

B and A, since both --remote options start a local server without code changes.

0
RJ
Ryan J.
2026-01-31

A imo, since --remote "https://localhost" can also trigger a local Spark Connect server for testing. D and E require code changes, so they don’t fit the "without changing code" part.

0
Question No. 8
A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The
cluster has 10 nodes, each with 16 CPUs. Spark UI shows:
Low number of Active Tasks
Many tasks complete in milliseconds
Fewer tasks than available CPUs
Which approach should be used to adjust the partitioning for optimal resource allocation?
Select one option, then reveal solution.
Top comments
NL
Noah L.
2026-02-22

A imo. Since the cluster has 10 nodes with 16 CPUs each, that’s 160 CPUs total. Setting partitions equal to total CPUs ensures each core gets a task, maximizing parallelism without overwhelming the cluster with too many tiny tasks. Options B and C seem arbitrary without tying directly to CPU count. D is good in theory, but partition size can vary depending on file format and compression—just dividing by 128 MB might not perfectly match CPU availability or task length. The key here is matching tasks to cores for better resource use.

0
AS
Ali S.
2026-02-18

It’s D because calculating partitions from data size usually avoids too few or too many tasks.

0
Question No. 9
A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered by an
upstream team on a nightly basis. The data is stored in a directory structure with a base path of
"/path/events/data". The upstream team drops daily data into the underlying subdirectories
following the convention year/month/day.
A few examples of the directory structure are:
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 practice exam questions
Which of the following code snippets will read all the data within the directory structure?
Select one option, then reveal solution.
Top comments
AY
Andre Y.
2026-02-19

C imo, wildcard should catch all year folders without depending on Spark version.

0
AY
Andre Y.
2026-02-15

Option C makes sense too since using a wildcard like /path/events/data/* should grab all first-level directories (the year folders) and Spark will then read all files inside those folders. It doesn’t rely on recursiveFileLookup, so it works regardless of Spark version. A and D just read the base path without recursion, so they probably miss nested files. B is good if the Spark version supports recursiveFileLookup, but since that info’s missing, C feels like a safer bet.

0
Question No. 10
A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from
failures or intentional shutdowns by continuing where the pipeline left off.
How can this be achieved?
Select one option, then reveal solution.
Top comments
RA
Ravi A.
2026-02-22

D, checkpointLocation in writeStream saves offsets and state for recovery.

0
RA
Ravi A.
2026-02-19

I think A can be ruled out since checkpointLocation in readStream doesn’t store progress info needed for recovery. But is there any config beyond checkpointLocation in writeStream for handling metadata or query state?

0
Question No. 11
Given a DataFrame df that has 10 partitions, after running the code:
result = df.coalesce(20)
How many partitions will the result DataFrame have?
Select one option, then reveal solution.
Top comments
ZN
Zain N.
2026-02-22

A imo, coalesce only increases partitions if shuffle=True, which isn’t mentioned here.

0
SH
Shoaib H.
2026-02-18

It’s A, coalesce can’t increase partitions without shuffle, so stays 10 here.

0
Question No. 12
What is the difference between df.cache() and df.persist() in Spark DataFrame?
Select one option, then reveal solution.
Top comments
TA
Tom A.
2026-02-20

A vs D? A is off since cache() can’t set storage level; it just uses a default. D fits better because cache() defaults to MEMORY_AND_DISK and persist() lets you specify other levels. That flexibility is really the key difference here. B and C mix up which method lets you choose storage levels, so not those either.

0
OA
Omar A.
2026-02-05

I’m thinking A can be ruled out since cache() doesn’t let you change the storage level, it just uses MEMORY_AND_DISK by default. Also, B is off because persist() definitely lets you pick levels, not just DISK_ONLY. Does D cover all the bases?

0
Question No. 13

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this: Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 practice exam questions The resulting Python dictionary must contain a mapping of region -> region id containing the smallest 3 region_id values. Which code fragment meets the requirements? A) Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 real exam questions B) Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 actual exam questions C) Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 practice exam questions D) Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 real exam questions The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values. Which code fragment meets the requirements?

Select one option, then reveal solution.
Top comments
RS
Ravi S.
2026-02-18

B tbh, since it selects 'region_id' first then 'region', so when converted to dict it'll map region_id to region, which isn't what we want. A is better for region to region_id mapping after sorting ascending.

0
RS
Ravi S.
2026-02-18

Option B also sorts by region_id ascending and selects columns in the right order for key-value pairs, so it fits the mapping. Just like with A, dict() might need tuples instead of Rows though.

0
Question No. 14
Given the following code snippet in my_spark_app.py:
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 practice exam questions
What is the role of the driver node?
Select one option, then reveal solution.
Top comments
RU
Ryan U.
2026-02-19

D imo, B and C are clearly incorrect since the driver does way more than just UI or local computations. D is also wrong because the driver doesn’t store final results; it coordinates execution, so A fits best.

0
HT
Hassan T.
2026-01-28

It’s A. The driver node is definitely the one that breaks down actions into tasks and schedules them on the workers. B’s just wrong because it’s more than monitoring. C is off since the driver doesn’t hold the data or run all computations itself. D’s not accurate either because the driver doesn’t store results after workers finish; it coordinates the whole thing. So yeah, “orchestrates” pretty much sums up scheduling plus sending tasks out.

0
Question No. 15
23 of 55.
A data scientist is working with a massive dataset that exceeds the memory capacity of a single
machine. The data scientist is considering using Apache Spark™ instead of traditional single-machine
languages like standard Python scripts.
Which two advantages does Apache Spark™ offer over a normal single-machine language in this
scenario? (Choose 2 answers)
Select all that apply, then reveal solution.
Top comments
YU
Yasir U.
2026-02-20

It’s A for sure because Spark’s whole deal is spreading work across multiple machines, handling way bigger data than one computer can. I’d go with E too since normal scripts usually crash if a node fails, but Spark can retry tasks or reroute work so your job keeps running without losing progress. B and D are definitely out—Spark runs on regular hardware and you still write code. C’s not right either because Spark actually tries to keep data in memory to speed things up rather than just relying on disk.

0
BA
Bilal A.
2026-02-19

A/C? Spark uses memory mostly but can spill to disk, so C partially fits.

0