Question 1

Which of the following statements about a refresh schedule is incorrect?

Accepted Answer

C, E

Explanation: Both statements C and E are factually incorrect based on the fundamental architecture and permission model of Databricks SQL. Statement C is incorrect because scheduled queries require compute resources to execute. In Databricks SQL, this compute is provided by a SQL Warehouse (formerly SQL Endpoint). The documentation explicitly states that each scheduled execution runs on a selected SQL warehouse. Statement E is incorrect because configuring a refresh schedule does not require workspace administrator privileges. The required permission is CAN RUN on the query. This allows data analysts and other non-administrator users to manage their own query schedules, which is a core part of the intended workflow.

Question 2

A data analyst has been asked to produce a visualization that shows the flow of users through a website. Which of the following is used for visualizing this type of flow?

Accepted Answer

E

Explanation: A Sankey diagram is the most appropriate visualization for showing the flow of users through a website. This type of chart is specifically designed to illustrate the movement of quantities—in this case, users—from one stage to another (e.g., from the homepage to a product page, then to the checkout page). The width of the connecting bands in a Sankey diagram is proportional to the flow quantity, making it easy to identify the most common user paths and drop-off points.
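
As a rough illustration of the data behind such a chart, the sketch below counts page-to-page transitions, which is the quantity a Sankey diagram would draw as band widths. The session data, page names, and the count_transitions helper are hypothetical and are not part of the exam question.

```python
from collections import Counter

# Hypothetical sessions: each list is the ordered pages one user visited.
sessions = [
    ["home", "product", "checkout"],
    ["home", "product"],
    ["home", "search", "product", "checkout"],
    ["home", "product", "checkout"],
]


def count_transitions(paths):
    """Count how many times users moved from one page to the next."""
    links = Counter()
    for path in paths:
        # Pair each page with the page that follows it in the same session.
        for source, target in zip(path, path[1:]):
            links[(source, target)] += 1
    return links


links = count_transitions(sessions)
for (source, target), users in links.most_common():
    # Each line corresponds to one band; `users` would set its width.
    print(f"{source} -> {target}: {users}")
```

Each (source, target) pair corresponds to one band in the diagram, so a thin band leaving a page relative to the band entering it points to a drop-off.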

Question 3

A data team has been given a series of projects by a consultant that need to be implemented in the Databricks Lakehouse Platform. Which of the following projects should be completed in Databricks SQL?

Accepted Answer

C

Explanation: Databricks SQL is the dedicated workspace within the Databricks Lakehouse Platform for data analysts to execute SQL queries, create visualizations, and build dashboards. A core task for any data analyst is to combine different data sources to create a unified view for analysis. This is fundamentally accomplished using SQL JOIN operations. Therefore, combining two data sources into a single, comprehensive dataset is a project perfectly suited for the Databricks SQL query editor, aligning with its primary purpose of enabling business intelligence and SQL-based analytics.

Question 4

A data analyst is attempting to drop a table my_table. The analyst wants to delete all table metadata and data.
They run the following command:
DROP TABLE IF EXISTS my_table;
While the object no longer appears when they run SHOW TABLES, the data files still exist.
Which of the following describes why the data files still exist and the metadata files were deleted?

Accepted Answer

C

Explanation: The behavior described—where metadata is deleted but the underlying data files persist—is the defining characteristic of dropping an external table in Databricks. When a DROP TABLE command is executed on an external table, Databricks removes the table's definition from the metastore but intentionally leaves the data files in their specified external location untouched. This prevents accidental data loss, as the data might be used by other processes or tables. In contrast, dropping a managed table would delete both the metadata and the data files.

Question 5

In which of the following situations will the mean value and median value of a variable be meaningfully different?

Accepted Answer

E

Explanation: The mean is the arithmetic average of all data points, making it highly sensitive to the magnitude of each value. Extreme outliers, which are values far from the central tendency, will disproportionately pull the mean towards them. In contrast, the median is the middle value of a sorted dataset and is not affected by the specific values of outliers, only their position. This property makes the median a "robust" measure of central tendency. Consequently, in a dataset with significant outliers, which typically makes the distribution skewed, the mean and median will diverge, with the mean being pulled in the direction of the outliers.
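
A minimal sketch of this effect, using made-up numbers and Python's standard statistics module: adding one extreme value moves the mean a long way while the median stays put.

```python
import statistics

# Hypothetical measurements with no outliers.
values = [10, 12, 11, 13, 12]
# The same measurements plus one extreme outlier.
with_outlier = values + [1000]

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"without outlier: mean={mean_before:.1f}, median={median_before}")
print(f"with outlier:    mean={mean_after:.1f}, median={median_after}")
```

In this example the mean jumps from 11.6 to roughly 176, while the median remains 12.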

Question 6

Which of the following benefits of using Databricks SQL is provided by Data Explorer?

Accepted Answer

B

Explanation: Data Explorer is the primary user interface within the Databricks workspace for managing data assets. It allows users to browse through catalogs, schemas, tables, and views to understand the available data (metadata). It includes a "Sample Data" tab for previewing the actual data within tables. Crucially, it also provides a "Permissions" tab where authorized users can view and modify access control lists (ACLs) for these data objects, making it a central hub for both data discovery and governance.

Question 7

A data analyst runs the following command: INSERT INTO stakeholders.suppliers TABLE stakeholders.new_suppliers; What is the result of running this command?

Accepted Answer

C

Explanation: The INSERT INTO command is used to add new rows of data to an existing table. The syntax INSERT INTO target_table TABLE source_table is a valid Databricks SQL command that appends all rows from the source table (stakeholders.new_suppliers) to the target table (stakeholders.suppliers). This operation does not remove or alter the existing data in the target table, nor does it automatically check for or remove duplicate rows. Therefore, the suppliers table will contain its original data plus all the data from the new_suppliers table.
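
The append-without-deduplication behavior can be sketched locally with Python's built-in sqlite3 module. This is only a stand-in: SQLite does not accept the Databricks INSERT INTO ... TABLE form, so the sketch uses the equivalent INSERT INTO ... SELECT * FROM. The column names and rows are hypothetical, and the stakeholders schema is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical columns; the exam question does not show the table schemas.
conn.execute("CREATE TABLE suppliers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE new_suppliers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO suppliers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)
conn.executemany(
    "INSERT INTO new_suppliers VALUES (?, ?)", [(2, "Globex"), (3, "Initech")]
)

# SQLite stand-in for the Databricks statement:
#   INSERT INTO stakeholders.suppliers TABLE stakeholders.new_suppliers;
conn.execute("INSERT INTO suppliers SELECT * FROM new_suppliers")

rows = conn.execute("SELECT id, name FROM suppliers ORDER BY id, name").fetchall()
for row in rows:
    print(row)
```

The row (2, "Globex") appears twice afterwards, which shows that existing rows are kept and duplicates are not removed.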

Question 8

A data analyst has a managed table table_name in database database_name. They would now like to remove the table from the database and all of the data files associated with the table. The rest of the tables in the database must continue to exist.
Which of the following commands can the analyst use to complete the task without producing an error?

Accepted Answer

B

Explanation: The DROP TABLE command is the correct Data Definition Language (DDL) statement in Databricks SQL to remove a table's definition and its data. For a managed table, which is the default type, this command also deletes the underlying data files from their storage location. The syntax DROP TABLE database_name.table_name; uses a two-part identifier to precisely target the specific table within the specified database for removal, leaving all other tables in the database untouched. This directly and correctly accomplishes the task described.

Question 9

A stakeholder has provided a data analyst with a lookup dataset in the form of a 50-row CSV file. The data analyst needs to upload this dataset for use as a table in Databricks SQL.
Which approach should the data analyst use to quickly upload the file into a table for use in Databricks SQL?

Accepted Answer

A

Explanation: The most direct and efficient method for a data analyst to quickly upload a small CSV file and create a table in Databricks SQL is by using the built-in "Add data" UI. This feature is specifically designed for this use case, allowing users to upload files from their local machine directly through the web interface. The UI guides the user through the process, automatically inferring the schema and creating a managed Delta table, which is immediately available for querying in Databricks SQL. This single, streamlined workflow is significantly faster than multi-step alternatives.

Question 10

A data analyst wants the following output:

customer_name    number_of_orders
John Doe         388
Zhang San        234

Which statement will produce this output?

Accepted Answer

A

Explanation: This query produces the desired output by combining three SQL operations and a column alias. First, it uses an INNER JOIN to combine rows from the customers and orders tables based on their shared customer_id. Second, it applies the COUNT(order_id) aggregate function to count the number of orders for each customer. Third, the GROUP BY customer_name clause groups the rows so that COUNT calculates the total orders for each unique customer name. Finally, AS number_of_orders renames the aggregated column to match the required output.
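
The answer option itself is not reproduced here, so the following is only a sketch of a query with the shape the explanation describes, run against SQLite through Python's sqlite3 module. The table contents are hypothetical and use small counts rather than the 388 and 234 in the question, and the ORDER BY clause is an addition to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schemas and rows, inferred from the column names in the explanation.
conn.execute("CREATE TABLE customers (customer_id INTEGER, customer_name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "John Doe"), (2, "Zhang San")]
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(101, 1), (102, 1), (103, 1), (104, 2), (105, 2)],
)

# Join, count per customer, group by name, and alias the count column.
query = """
SELECT c.customer_name, COUNT(o.order_id) AS number_of_orders
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY number_of_orders DESC
"""
result = conn.execute(query).fetchall()
for customer_name, number_of_orders in result:
    print(customer_name, number_of_orders)
```

With this sample data the query returns one row per customer: John Doe with 3 orders and Zhang San with 2.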

Question 11

Which of the following should data analysts consider when working with personally identifiable information (PII) data?

Accepted Answer

E

Explanation: When handling personally identifiable information (PII), a data analyst must adopt a comprehensive approach to compliance and security. This involves adhering to the organization's internal data handling policies and best practices (A). Crucially, they must also comply with legal and regulatory frameworks, which often depend on geography. This includes the laws of the region where the data was collected, such as GDPR for European residents (B), and the laws of the jurisdiction where the analysis is being performed, which may have its own data processing and sovereignty requirements (D). Therefore, all these factors are critical considerations.

Question 12

Which statement describes descriptive statistics?

Accepted Answer

C

Explanation: Descriptive statistics is a branch of statistics focused on summarizing and organizing data to describe its main features. This is achieved by using quantitative measures such as mean, median, mode, standard deviation, and variance, as well as graphical representations like histograms and box plots. The primary goal is to provide a quantitative summary of the sample data, rather than making inferences about the larger population from which the data was drawn.
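
As a small illustration, the measures named above can be computed for a sample with Python's standard statistics module. The sample values are made up, and the sketch only summarizes the sample without inferring anything about a wider population.

```python
import statistics

# Hypothetical sample of nine observations.
sample = [4, 8, 6, 5, 3, 8, 9, 5, 8]

# Each entry summarizes the sample itself.
summary = {
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "mode": statistics.mode(sample),
    "variance": statistics.variance(sample),  # sample variance (n - 1 denominator)
    "stdev": statistics.stdev(sample),        # square root of the sample variance
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```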

Question 13

A data scientist has asked a data analyst to create histograms for every continuous variable in a data
set. The data analyst needs to identify which columns are continuous in the data set.
What describes a continuous variable?

Accepted Answer

C

Explanation: A continuous variable is a type of quantitative variable that can assume an infinite, uncountable number of values within a given range. The defining characteristic is that for any two values, it is always possible to find another value between them. For example, height, weight, and temperature are continuous because a measurement can be refined to greater and greater precision (e.g., 70.1 inches, 70.11 inches, 70.112 inches). Histograms are the appropriate visualization to understand the distribution of such variables by grouping the uncountable values into a finite number of bins.
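
The binning step can be sketched in a few lines of Python. The histogram_counts helper, the height values, and the choice of three equal-width bins are all illustrative assumptions, not something specified by the question.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if v < low or v > high:
            continue  # ignore values outside the chosen range
        # Values equal to `high` are placed in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical heights in inches, measured to varying precision.
heights = [61.2, 64.8, 66.1, 67.5, 68.0, 69.3, 70.1, 70.11, 70.112, 73.4]
counts = histogram_counts(heights, low=60.0, high=75.0, bins=3)
print(counts)
```

Although the heights can take arbitrarily precise values, every one of them lands in one of a small, fixed number of bins, which is what makes the distribution plottable.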

Question 14

In which of the following situations should a data analyst use higher-order functions?

Accepted Answer

C

Explanation: Higher-order functions in Databricks SQL are specifically designed to process complex data types, most notably arrays and maps. They allow a data analyst to apply custom logic, expressed as a lambda function, to each element within an array or map. This enables powerful, efficient, and scalable transformations, filtering, and aggregations on nested data structures directly within SQL queries. Functions like TRANSFORM, FILTER, and EXISTS are prime examples that operate on array data objects, avoiding less performant alternatives like user-defined functions (UDFs) or exploding arrays.
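
The idea of passing a lambda to a function that applies it to every array element can be mimicked in plain Python. The helpers below are illustrative analogues written for this sketch, not Databricks APIs; the comments show the corresponding Databricks SQL expressions, and the prices array is hypothetical.

```python
def transform(array, func):
    """Apply func to every element, like TRANSFORM(array, x -> ...)."""
    return [func(x) for x in array]


def filter_array(array, pred):
    """Keep elements for which pred is true, like FILTER(array, x -> ...)."""
    return [x for x in array if pred(x)]


def exists(array, pred):
    """Return True if pred holds for any element, like EXISTS(array, x -> ...)."""
    return any(pred(x) for x in array)


# Hypothetical array column value.
prices = [5, 12, 20]

doubled = transform(prices, lambda p: p * 2)        # TRANSFORM(prices, p -> p * 2)
expensive = filter_array(prices, lambda p: p > 10)  # FILTER(prices, p -> p > 10)
has_big = exists(prices, lambda p: p > 100)         # EXISTS(prices, p -> p > 100)

print(doubled, expensive, has_big)
```

In Databricks SQL the same logic runs inside the query against an array column, so the array does not have to be exploded into rows first.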

Question 15

Data professionals with varying responsibilities use the Databricks Lakehouse Platform. Which role in the Databricks Lakehouse Platform uses Databricks SQL as its primary service?

Accepted Answer

D

Explanation: Databricks SQL is a service on the Databricks Lakehouse Platform specifically designed for data analysts and business analysts. Its primary purpose is to provide a familiar, intuitive environment for running SQL queries, creating visualizations, and building interactive dashboards to derive business insights. The user interface, including the SQL editor, query history, and visualization tools, is tailored to the typical workflow of an analyst who needs to perform fast, ad-hoc analysis and reporting directly on data in the lakehouse.

Free Databricks Certified Data Analyst Associate Actual Exam Questions