Question 1

The code block displayed below contains an error. The code block is intended to return all columns of
DataFrame transactionsDf except for columns predError, productId, and value. Find the error.
Code block:
transactionsDf.select(~col("predError"), ~col("productId"), ~col("value"))

Accepted Answer

C

Explanation:
Correct code block:
transactionsDf.join(itemsDf, itemsDf.itemId == transactionsDf.productId, "outer")

Notebook: see test 1 (https://flrs.github.io/spark_practice_tests_code/#1/33.html). Databricks import instructions: https://bit.ly/sparkpracticeexams_import_instructions

Question 2

Which of the following code blocks shows the structure of a DataFrame in a tree-like way, containing both column names and types?

Accepted Answer

B

Explanation:
itemsDf.printSchema()
Correct! Here is an example of what itemsDf.printSchema() shows; you can see the tree-like structure containing both column names and types:
root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)

itemsDf.rdd.printSchema()
No, the DataFrame's underlying RDD does not have a printSchema() method.

spark.schema(itemsDf)
Incorrect, there is no spark.schema command.

print(itemsDf.columns)
print(itemsDf.dtypes)
Wrong. While the output of this code block contains both column names and column types, the information is not arranged in a tree-like way.

itemsDf.print.schema()
No, DataFrame does not have a print method.

Notebook: see test 3.

Question 3

Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf,
using a valid schema for the sample of itemsDf shown below?
Sample of itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

Accepted Answer

D

Explanation:
The challenge in this question comes from there being an array variable in the schema. In addition, you should know how to pass a schema to the DataFrameReader that is returned by spark.read.

The correct way to define an array of strings in a schema is through ArrayType(StringType()). A schema can be passed to the DataFrameReader by simply appending schema(structType) to spark.read. Alternatively, you can also define a schema as a string. For example, for the schema of itemsDf, the following string would make sense: itemId integer, attributes array<string>, supplier string.

A thing to keep in mind is that in schema definitions, you always need to instantiate the types, like so: StringType(). Just using StringType does not work in PySpark and will fail.

Another concern with schemas is whether columns should be nullable, that is, allowed to have null values. In the case at hand, this is not a concern, since the question just asks for a "valid" schema. Both non-nullable and nullable column schemas would be valid here, since no null value appears in the DataFrame sample.

More info: Learning Spark, 2nd Edition, Chapter 3
Notebook: see test 3.
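The two ways of describing the schema can be sketched as follows. This is a minimal sketch and not one of the answer options: the names ITEMS_SCHEMA_DDL, build_items_schema, and read_items are illustrative, all columns are assumed to be nullable, and read_items assumes that an active SparkSession is passed in as spark.

```python
# DDL-style schema string for itemsDf. array<string> is the string form of
# ArrayType(StringType()).
ITEMS_SCHEMA_DDL = "itemId integer, attributes array<string>, supplier string"


def build_items_schema():
    """Return the same schema as a StructType, with every type instantiated."""
    # Imported lazily so the DDL constant above is usable without PySpark.
    from pyspark.sql.types import (
        ArrayType,
        IntegerType,
        StringType,
        StructField,
        StructType,
    )

    return StructType([
        # Note the parentheses: IntegerType(), not IntegerType.
        StructField("itemId", IntegerType(), True),
        StructField("attributes", ArrayType(StringType()), True),
        StructField("supplier", StringType(), True),
    ])


def read_items(spark, file_path, schema=ITEMS_SCHEMA_DDL):
    """Read the parquet file at file_path with an explicit schema.

    spark is assumed to be an active SparkSession. schema may be either the
    DDL string or the StructType returned by build_items_schema().
    """
    return spark.read.schema(schema).parquet(file_path)
```

With a running session, read_items(spark, filePath) and read_items(spark, filePath, build_items_schema()) should describe the same three columns.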

Question 4

Which of the following code blocks returns a 2-column DataFrame that shows the distinct values in column productId and the number of rows with that productId in DataFrame transactionsDf?

Accepted Answer

D

Explanation:
transactionsDf.groupBy("productId").count()
Correct. This code block first groups DataFrame transactionsDf by column productId and then counts the rows in each group.

transactionsDf.groupBy("productId").select(count("value"))
Incorrect. You cannot call select on a GroupedData object (the output of a groupBy statement).

transactionsDf.count("productId")
No. DataFrame.count() does not take any arguments.

transactionsDf.count("productId").distinct()
Wrong. Since DataFrame.count() does not take any arguments, this option cannot be right.

transactionsDf.groupBy("productId").agg(col("value").count())
False. A Column object, as returned by col("value"), does not have a count() method. You can see all available methods of the Column object in the Spark documentation listed below.

More info: pyspark.sql.DataFrame.count (PySpark 3.1.2 documentation), pyspark.sql.Column (PySpark 3.1.2 documentation)
Notebook: see test 3.

Question 5

The code block shown below should return the number of columns in the CSV file stored at location
filePath. From the CSV file, only lines should be read that do not start with a # character. Choose
the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
__1__(__2__.__3__.csv(filePath, __4__).__5__)

Accepted Answer

B

Explanation:
transactionsDf.where(transactionsDf.storeId!=25)
Correct. DataFrame.where() is an alias for the DataFrame.filter() method. Using this method, it is straightforward to keep only the rows whose value in column storeId is not 25.

transactionsDf.select(transactionsDf.storeId!=25)
Wrong. The select operator allows you to build DataFrames column-wise, but when used as shown, it does not filter rows.

transactionsDf.filter(transactionsDf.storeId==25)
Incorrect. Although the filter expression works for filtering rows, the == in the filtering condition is inappropriate. It should be != instead.

transactionsDf.drop(transactionsDf.storeId==25)
No. DataFrame.drop() is used to remove specific columns, but not rows, from the DataFrame.

transactionsDf.remove(transactionsDf.storeId==25)
False. There is no DataFrame.remove() operator in PySpark.

More info: pyspark.sql.DataFrame.where (PySpark 3.1.2 documentation)
Notebook: see test 3.

Question 6

The code block displayed below contains an error. The code block should configure Spark to split data
in 20 parts when exchanging data between executors for joins or aggregations. Find the error.
Code block:
spark.conf.set(spark.sql.shuffle.partitions, 20)

Accepted Answer

C

Explanation:
Correct code block:
spark.conf.set("spark.sql.shuffle.partitions", 20)

The code block expresses the option incorrectly.
Correct! The option should be expressed as a string.

The code block sets the wrong option.
No, spark.sql.shuffle.partitions is the correct option for the use case in the question.

The code block sets the incorrect number of parts.
Wrong, the code block correctly states 20 parts.

The code block uses the wrong command for setting an option.
No, in PySpark spark.conf.set() is the correct command for setting an option.

The code block is missing a parameter.
Incorrect, spark.conf.set() takes two parameters.

More info: Configuration - Spark 3.1.2 Documentation

Question 7

The code block displayed below contains an error. The code block is intended to return all columns of
DataFrame transactionsDf except for columns predError, productId, and value. Find the error.
Code block:
transactionsDf.select(~col("predError"), ~col("productId"), ~col("value"))

Accepted Answer

E

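Explanation:
Correct code block:
transactionsDf.drop("predError", "productId", "value")

The ~ operator negates a boolean Column expression; it does not exclude a column from a selection. DataFrame.drop() returns a DataFrame without the named columns.

Notebook: see test 1.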
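As a rough sketch of the two usual ways to leave columns out, the helpers below either call drop() directly or build the list of columns to keep and pass it to select(). The function names and the EXCLUDED constant are illustrative and not part of the question; both DataFrame functions assume they receive a PySpark DataFrame.

```python
# Columns that should not appear in the result (taken from the question).
EXCLUDED = ("predError", "productId", "value")


def remaining_columns(all_columns, excluded=EXCLUDED):
    """Return the column names that are left after removing the excluded ones."""
    return [name for name in all_columns if name not in excluded]


def without_columns_using_drop(transactions_df):
    """Remove the excluded columns with DataFrame.drop()."""
    return transactions_df.drop(*EXCLUDED)


def without_columns_using_select(transactions_df):
    """Same result via select(): name every column that should be kept."""
    return transactions_df.select(remaining_columns(transactions_df.columns))
```

The select() variant makes the point of the question visible: select() needs a positive list of columns to keep, so exclusion has to happen before the call.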

Question 8

Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?

Accepted Answer

E

Explanation:
Tasks run in parallel via slots.
Correct. Given the assumption, an executor has one or more "slots", defined by the equation spark.executor.cores / spark.task.cpus. With the executor's resources divided into slots, each task takes up a slot and multiple tasks can be executed in parallel.

Slot is another name for executor.
No, a slot is part of an executor.

An executor runs on a single core.
No, an executor can occupy multiple cores. This is set by the spark.executor.cores option.

There must be more slots than tasks.
No. Slots just process tasks. One could imagine a scenario where there was just a single slot for multiple tasks, processing one task at a time. Granted, this is the opposite of what Spark should be used for, which is distributed data processing over multiple cores and machines, performing many tasks in parallel.

There must be less executors than tasks.
No, there is no such requirement.

More info: Spark Architecture | Distributed Systems Architecture (https://bit.ly/3x4MZZt)
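The slot equation can be made concrete with a little arithmetic. The sketch below is plain Python and does not touch Spark; the function names are illustrative, and rounding down to whole slots is an assumption made here for the case where the cores do not divide evenly.

```python
def slots_per_executor(executor_cores: int, task_cpus: int = 1) -> int:
    """Number of tasks one executor can run at the same time.

    executor_cores stands for spark.executor.cores and task_cpus for
    spark.task.cpus.
    """
    if executor_cores < 1 or task_cpus < 1:
        raise ValueError("both settings must be positive integers")
    # Assumption: a task needs task_cpus whole cores, so round down.
    return executor_cores // task_cpus


def total_slots(num_executors: int, executor_cores: int, task_cpus: int = 1) -> int:
    """Slots available across all executors of an application."""
    return num_executors * slots_per_executor(executor_cores, task_cpus)


print(slots_per_executor(4))     # 4 slots: one core per task
print(slots_per_executor(4, 2))  # 2 slots: two cores per task
print(total_slots(3, 4, 2))      # 6 slots across three executors
```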

Question 9

Which of the following code blocks returns a single-column DataFrame showing the number of words
in column supplier of DataFrame itemsDf?
Sample of DataFrame itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

Accepted Answer

E

Explanation:
Output of correct code block:
+----------------------------+
|size(split(supplier,  , -1))|
+----------------------------+
|                           3|
|                           1|
|                           3|
+----------------------------+

This question shows a typical use case for the split command: splitting a string into words. An additional difficulty is that you are asked to count the words. Although it is tempting to use the count method here, the size method (as in: size of an array) is actually the correct one to use. Familiarize yourself with the split and the size methods using the documentation listed below.

More info:
Split method: pyspark.sql.functions.split (PySpark 3.1.2 documentation)
Size method: pyspark.sql.functions.size (PySpark 3.1.2 documentation)
Notebook: see test 2.
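A minimal sketch of the split-then-size pattern follows. The answer options themselves are not reproduced in this document, so the function count_supplier_words is an illustration rather than the accepted answer verbatim; it assumes items_df is a PySpark DataFrame with a string column supplier. The plain-Python helper count_words mirrors the per-row computation so the expected numbers can be checked without a Spark session.

```python
def count_supplier_words(items_df):
    """Return a single-column DataFrame with the number of words in supplier."""
    # Imported lazily so the plain-Python helper below works without PySpark.
    from pyspark.sql.functions import size, split

    # split() turns each string into an array of words, size() counts them.
    return items_df.select(size(split("supplier", " ")))


def count_words(text: str) -> int:
    """Plain-Python analogue of size(split(text, " ")) for a single value."""
    return len(text.split(" "))


print(count_words("Sports Company Inc."))  # 3
print(count_words("YetiX"))                # 1
```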

Question 10

Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

Accepted Answer

B

Explanation:
transactionsDf.unpersist()
Correct. The DataFrame.unpersist() command does exactly what the question asks for: it removes all cached parts of the DataFrame from memory and disk.

del transactionsDf
False. While this option can help remove the DataFrame from memory and disk, it does not do so immediately. The reason is that this command just notifies the Python garbage collector that transactionsDf may now be deleted from memory. However, the garbage collector does not do so immediately and, if you wanted it to run immediately, it would need to be specifically triggered to do so. Find more information in the references below.

array_remove(transactionsDf, "*")
Incorrect. The array_remove method from pyspark.sql.functions is used for removing elements from arrays in columns that match a specific condition. Also, the first argument would be a column, and not a DataFrame as shown in the code block.

transactionsDf.persist()
No. This code block does exactly the opposite of what is asked for: it caches (writes) DataFrame transactionsDf to memory and disk. Note that even though you do not pass in a specific storage level here, Spark will use the default storage level (MEMORY_AND_DISK).

transactionsDf.clearCache()
Wrong. Spark's DataFrame does not have a clearCache() method.

More info: pyspark.sql.DataFrame.unpersist (PySpark 3.1.2 documentation); python - How to delete an RDD in PySpark for the purpose of releasing resources? - Stack Overflow
Notebook: see test 3.
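The point about del can be illustrated at the Python level. The sketch below uses a hypothetical stand-in class instead of a real DataFrame and only demonstrates Python name and reference semantics; it makes no claim about when Spark itself releases cached blocks. It assumes CPython, where an object is freed once its last reference disappears.

```python
import gc
import weakref


class CachedThing:
    """Hypothetical stand-in for any Python object, such as a DataFrame handle."""


first_name = CachedThing()
second_name = first_name            # a second reference to the same object
tracker = weakref.ref(first_name)   # lets us observe whether the object is alive

del first_name                      # removes the name, not the object
still_alive_after_del = tracker() is not None

del second_name                     # the last reference is gone now
gc.collect()                        # explicitly trigger a collection as well
alive_after_last_reference = tracker() is not None

print(still_alive_after_del)        # True: another reference kept it alive
print(alive_after_last_reference)   # False on CPython
```

This is why del alone is not a dependable way to release a cached DataFrame: it only unbinds a name, whereas unpersist() explicitly asks Spark to drop the cached data.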

Question 11

Which of the following code blocks returns a DataFrame that matches the multi-column DataFrame itemsDf, except that integer column itemId has been converted into a string column?

Accepted Answer

B

Explanation:
itemsDf.withColumn("itemId", col("itemId").cast("string"))
Correct. You can convert the data type of a column using the cast method of the Column class. Also note that you will have to use the withColumn method on itemsDf for replacing the existing itemId column with the new version that contains strings.

itemsDf.withColumn("itemId", col("itemId").convert("string"))
Incorrect. The Column object that col("itemId") returns does not have a convert method.

itemsDf.withColumn("itemId", convert("itemId", "string"))
Wrong. The pyspark.sql.functions module does not have a convert method. The question is trying to mislead you by using the word "converted". Type conversion is also called "type casting". This may help you remember to look for a cast method instead of a convert method (see correct answer).

itemsDf.select(astype("itemId", "string"))
False. While astype is a method of Column (and an alias of Column.cast), it is not a method of pyspark.sql.functions (which is what the code block implies). In addition, the question asks to return a full DataFrame that matches the multi-column DataFrame itemsDf. Selecting just one column from itemsDf as in the code block would return a single-column DataFrame.

spark.cast(itemsDf, "itemId", "string")
No, the Spark session (called by spark) does not have a cast method. You can find a list of all methods available for the Spark session in the documentation listed below.

More info:
- pyspark.sql.Column.cast (PySpark 3.1.2 documentation)
- pyspark.sql.Column.astype (PySpark 3.1.2 documentation)
- pyspark.sql.SparkSession (PySpark 3.1.2 documentation)
Notebook: see test 3.

Question 12

The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.
A sample of DataFrame itemsDf is below.
https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0/page_16_img_1.jpg
Code block:
itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")

Accepted Answer

D

Explanation:
The correct code block looks like this:
https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0/page_16_img_2.jpg
Then, the first couple of rows of itemsAttributesDf look like this:
https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0/page_17_img_1.jpg

explode() is not a method of DataFrame. explode() should be used inside the select() method instead.
This is correct.

The split() method should be used inside the select() method instead of the explode() method.
No, the split() method is used to split strings into parts. However, column attributes is an array of strings. In this case, the explode() method is appropriate.

Since itemId is the index, it does not need to be an argument to the select() method.
No, itemId still needs to be selected, whether it is used as an index or not.

The explode() method expects a Column object rather than a string.
No, a string works just fine here. That said, there are some valid alternatives to passing in a string:
https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0/page_17_img_2.jpg

The alias() method needs to be called after the select() method.
No.

More info: pyspark.sql.functions.explode (PySpark 3.1.1 documentation, https://bit.ly/2QUZI1J)
Notebook: see test 1, question 22 (https://flrs.github.io/spark_practice_tests_code/#1/22.html). Databricks import instructions: https://bit.ly/sparkpracticeexams_import_instructions
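Since the corrected code block is only available as an image, here is one way it could plausibly look, written as a hedged sketch rather than a transcription of that image. The function name build_items_attributes_df is illustrative, and the column order in select() is a choice made here. The helper explode_rows is a plain-Python analogue that shows what "one row per array element" means.

```python
def build_items_attributes_df(items_df):
    """Return one row per (itemId, attribute) pair.

    items_df is assumed to be a PySpark DataFrame with an integer column
    itemId and an array-of-strings column attributes.
    """
    # Imported lazily so the plain-Python helper below works without PySpark.
    from pyspark.sql.functions import explode

    # explode() comes from pyspark.sql.functions and is used inside select();
    # alias() names the generated column.
    return items_df.select("itemId", explode("attributes").alias("attribute"))


def explode_rows(rows):
    """Plain-Python analogue: emit one output row per array element."""
    return [
        (item_id, attribute)
        for item_id, attributes in rows
        for attribute in attributes
    ]


print(explode_rows([(1, ["blue", "winter", "cozy"]), (2, ["red"])]))
# [(1, 'blue'), (1, 'winter'), (1, 'cozy'), (2, 'red')]
```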

Question 13

Which of the following code blocks creates a new 6-column DataFrame by appending the rows of the 6-column DataFrame yesterdayTransactionsDf to the rows of the 6-column DataFrame todayTransactionsDf, ignoring that both DataFrames have different column names?

Accepted Answer

E

Explanation:
todayTransactionsDf.union(yesterdayTransactionsDf)
Correct. The union command appends rows of yesterdayTransactionsDf to the rows of todayTransactionsDf, ignoring that both DataFrames have different column names. The resulting DataFrame will have the column names of DataFrame todayTransactionsDf.

todayTransactionsDf.unionByName(yesterdayTransactionsDf)
No. unionByName specifically tries to match columns in the two DataFrames by name and only appends values in columns with identical names across the two DataFrames. In the form presented above, the command is a great fit for combining DataFrames that have exactly the same columns, but in a different order. In this case though, the command will fail because the two DataFrames have different columns.

todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)
No. The unionByName command is described in the previous explanation. However, with the allowMissingColumns argument set to True, it is no longer an issue that the two DataFrames have different column names. Any columns that do not have a match in the other DataFrame will be filled with null where there is no value. In the case at hand, the resulting DataFrame will have 7 or more columns though, so this command is not the right answer.

union(todayTransactionsDf, yesterdayTransactionsDf)
No, there is no union method in pyspark.sql.functions.

todayTransactionsDf.concat(yesterdayTransactionsDf)
Wrong, the DataFrame class does not have a concat method.

More info: pyspark.sql.DataFrame.union (PySpark 3.1.2 documentation), pyspark.sql.DataFrame.unionByName (PySpark 3.1.2 documentation)
Notebook: see test 3.

Question 14

The code block shown below should return the number of columns in the CSV file stored at location
filePath. From the CSV file, only lines should be read that do not start with a # character. Choose
the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
__1__(__2__.__3__.csv(filePath, __4__).__5__)

Accepted Answer

E

Explanation:
Correct code block:
len(spark.read.csv(filePath, comment='#').columns)

This is a challenging question with difficulties in an unusual context: the boundary between DataFrame and the DataFrameReader. It is unlikely that a question of this difficulty level appears in the exam. However, solving it helps you get more comfortable with the DataFrameReader, a subject you will likely have to deal with in the exam.

Before dealing with the inner parentheses, it is easier to figure out the outer parentheses, gaps 1 and 5. Given the code block, the object in gap 5 would have to be evaluated by the object in gap 1, returning the number of columns in the read-in CSV. One answer option includes DataFrame in gap 1 and shape[0] in gap 2. Since DataFrame cannot be used to evaluate shape[0], we can discard this answer option.

Other answer options include size in gap 1. size() is not a built-in Python command, so if we use it, it would have to come from somewhere else. pyspark.sql.functions includes a size() method, but this method only returns the length of an array or map stored within a column (documentation linked below). So, using a size() method is not an option here.

This leaves us with two potentially valid answers. We have to pick between gaps 2 and 3 being spark.read or pyspark.DataFrameReader. Looking at the documentation (linked below), the DataFrameReader class is part of the pyspark.sql module, which means that we cannot import it using pyspark.DataFrameReader. Moreover, spark.read makes sense because on Databricks, spark

Question 15

In which order should the code blocks shown below be run in order to create a DataFrame that
shows the mean of column predError of DataFrame transactionsDf per column storeId and productId,
where productId should be either 2 or 3 and the returned DataFrame should be sorted in ascending
order by column storeId, leaving out any nulls in that column?
DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
1. .mean("predError")
2. .groupBy("storeId")
3. .orderBy("storeId")
4. transactionsDf.filter(transactionsDf.storeId.isNotNull())
5. .pivot("productId", [2, 3])

Accepted Answer

D

Explanation:
Correct code block:
transactionsDf.filter(transactionsDf.storeId.isNotNull()).groupBy("storeId").pivot("productId", [2, 3]).mean("predError").orderBy("storeId")

Output of correct code block:
+-------+----+----+
|storeId|   2|   3|
+-------+----+----+
|      2| 6.0|null|
|      3|null|null|
|     25| 3.0| 3.0|
+-------+----+----+

This question is quite convoluted and requires you to think hard about the correct order of operations. The pivot method also makes an appearance, a method that you may not know all that much about (yet).

At the first position in all answers is code block 4, so the question is essentially just about the ordering of the remaining 4 code blocks.

The question states that the returned DataFrame should be sorted by column storeId. So, it makes sense to have code block 3, which includes the orderBy operator, at the very end of the code block. This leaves you with only two answer options.

Now, it is useful to know more about the context of pivot in PySpark. A common pattern is groupBy, pivot, and then another aggregating function, like mean. In the documentation listed below you can see that pivot is a method of pyspark.sql.GroupedData, meaning that before pivoting, you have to use groupBy. The only answer option matching this requirement is the one in which code block 2 (which includes groupBy) is stated before code block 5 (which includes pivot).

More info: pyspark.sql.GroupedData.pivot (PySpark 3.1.2 documentation)
Notebook: see test 3.

Free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Actual Exam Questions