Question 1

You are managing a Kubernetes cluster running AI training jobs using TensorFlow. The jobs require
access to multiple GPUs across different nodes, but inter-node communication seems slow,
impacting performance.
What is a potential networking configuration you would implement to optimize inter-node
communication for distributed training?

Accepted Answer

D

Explanation: InfiniBand is a high-performance computing (HPC) interconnect that provides significantly higher throughput and lower latency than standard Ethernet. Distributed AI training exchanges large gradient tensors between nodes at every step, so network performance is critical. InfiniBand supports Remote Direct Memory Access (RDMA), and with GPUDirect RDMA the network adapter reads and writes GPU memory directly, so GPUs on different nodes exchange data without staging it through the CPU or the kernel network stack. This minimizes communication overhead and latency and keeps GPUs from idling while they wait for data, which directly addresses the bottleneck in multi-node training.
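As a rough illustration of the configuration side, the sketch below assumes the training job uses NCCL for its collectives (the question does not say so) and builds the environment variables commonly used to keep NCCL on the InfiniBand transport. The function name and the "mlx5" adapter prefix are hypothetical; the right adapter name depends on the hardware in the nodes.

```python
def nccl_infiniband_env(hca_prefix="mlx5", debug=True):
    """Return environment variables that steer NCCL toward InfiniBand.

    hca_prefix is an assumed example value, not something taken from
    the question; check the adapter names on the actual nodes.
    """
    env = {
        "NCCL_IB_DISABLE": "0",     # 0 keeps the InfiniBand transport enabled
        "NCCL_IB_HCA": hca_prefix,  # limit NCCL to adapters matching this name
    }
    if debug:
        env["NCCL_DEBUG"] = "INFO"  # log which transport NCCL picks at startup
    return env


if __name__ == "__main__":
    for name, value in sorted(nccl_infiniband_env().items()):
        print(f"{name}={value}")
```

These values would typically be set in the pod spec of each training worker; with NCCL_DEBUG=INFO the startup log indicates whether the InfiniBand transport was actually selected.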

Question 2

If a Magnum IO-enabled application experiences delays during the ETL phase, what troubleshooting step should be taken?

Accepted Answer

D

Explanation: NVIDIA Magnum IO is a suite of technologies designed to eliminate storage and input/output (I/O) bottlenecks. A key component is GPUDirect Storage, which creates a direct data path between storage (such as NVMe SSDs) and GPU memory. This path avoids the bounce buffer in system RAM and the CPU copies that go with it, significantly accelerating data-heavy phases like ETL. If an application experiences delays during ETL, a likely cause is that this direct path is not being used and I/O is falling back to the slower CPU-mediated path. The first troubleshooting step is therefore to verify that GPUDirect Storage is correctly installed, configured, and enabled, so the application can use the high-bandwidth, low-latency path it was designed for.

Question 3

You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:AI.
To automate repetitive administrative tasks and efficiently manage resources across multiple nodes,
which of the following is essential when using the Run:AI Administrator CLI for environments where
automation or scripting is required?

Accepted Answer

C

Explanation: The Run:AI Administrator Command-Line Interface (runai-adm) interacts directly with the Kubernetes API to manage the Run:AI control plane components and resources. To perform cluster-wide administrative tasks such as creating Projects, managing users, or configuring system-wide settings, the CLI requires authenticated and authorized access. This is achieved through a Kubernetes configuration file (kubeconfig) that contains credentials with cluster-admin privileges. Without these administrative rights, the API server will reject the requests, rendering the CLI non-functional for its intended purpose and making automation impossible.

Question 4

A system administrator needs to scale a Kubernetes Job to 4 replicas. What command should be used?

Accepted Answer

C

Explanation: The kubectl scale command is the standard way to manually adjust the number of running pods for a scalable Kubernetes resource. A Job has no replicas field of its own; its concurrency is set by .spec.parallelism, which dictates how many pods can run at the same time. In older kubectl releases, kubectl scale job <job-name> --replicas=4 set that field to 4. Scaling Jobs this way was later deprecated and removed, so on current clusters the same result is reached by changing .spec.parallelism directly, for example with kubectl patch or kubectl edit. Although the question omits the specific job name, option C identifies the command, resource type, and flag the exam expects.
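For clusters where kubectl scale no longer accepts Jobs, the following is a minimal sketch of the alternative: it builds a kubectl patch command that sets .spec.parallelism. The helper name, the job name "train-job", and the namespace are hypothetical placeholders.

```python
import json


def job_parallelism_patch_cmd(job_name, parallelism, namespace="default"):
    """Build a kubectl command that sets a Job's .spec.parallelism."""
    patch = {"spec": {"parallelism": parallelism}}
    return [
        "kubectl", "patch", "job", job_name,
        "--namespace", namespace,
        "--type", "merge",            # JSON merge patch: only listed fields change
        "--patch", json.dumps(patch),
    ]


if __name__ == "__main__":
    # "train-job" is an assumed example name, not taken from the question.
    print(" ".join(job_parallelism_patch_cmd("train-job", 4)))
```

Returning an argument list rather than a single string lets the command be passed to subprocess.run without shell quoting problems around the JSON patch.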

Question 5

A system administrator needs to collect the information below:
GPU behavior monitoring
GPU configuration management
GPU policy oversight
GPU health and diagnostics
GPU accounting and process statistics
NVSwitch configuration and monitoring
What single tool should be used?

Accepted Answer

C

Explanation: NVIDIA Data Center GPU Manager (DCGM) is the single, comprehensive suite of tools designed for managing and monitoring NVIDIA GPUs in large-scale cluster environments. It provides a unified API for all the required tasks listed: active health monitoring and diagnostics, system validation, policy management (e.g., for handling errors), power and clock configuration, detailed accounting statistics for processes, and monitoring of NVSwitch and NVLink interconnects. It is specifically engineered for the enterprise-level oversight described in the question.

Question 6

A cloud engineer is looking to deploy a digital fingerprinting pipeline using NVIDIA Morpheus and the NVIDIA AI Enterprise Virtual Machine Image (VMI). Where would the cloud engineer find the VMI?

Accepted Answer

B

Explanation: The NVIDIA AI Enterprise Virtual Machine Image (VMI) is a pre-configured, optimized image designed for streamlined deployment on major public cloud platforms. To facilitate this, NVIDIA distributes the VMI directly through the official marketplaces of cloud service providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). This allows a cloud engineer to easily find and launch a virtual machine instance with the entire NVIDIA AI Enterprise software stack, including drivers and toolkits, fully installed and ready for immediate use with services like NVIDIA Morpheus.

Question 7

A cloud engineer is looking to provision a virtual machine for machine learning using the NVIDIA
Virtual Machine Image (VMI) and Rapids.
What technology stack will be set up for the development team automatically when the VMI is
deployed?

Accepted Answer

C

Explanation: The NVIDIA Virtual Machine Image (VMI) tailored for RAPIDS is designed to be a turnkey solution for data scientists and ML engineers. When deployed from a cloud service provider (CSP) marketplace, it automatically provisions a complete, ready-to-run environment. This stack includes the foundational components like the Ubuntu Server OS, the necessary NVIDIA Driver for GPU hardware access, and a containerization platform consisting of Docker-CE and the NVIDIA Container Toolkit. It also includes command-line utilities for the CSP and NVIDIA's NGC catalog. Crucially, this specific VMI comes with the RAPIDS libraries pre-installed, enabling the development team to start working immediately without manual setup of the core data science framework.

Question 8

A data scientist is training a deep learning model and notices slower than expected training times.
The data scientist alerts a system administrator to inspect the issue. The system administrator
suspects the disk IO is the issue.
What command should be used?

Accepted Answer

B

Explanation: The system administrator's hypothesis is that a disk I/O bottleneck is causing the performance degradation. The iostat (input/output statistics) command is the standard Linux utility for monitoring I/O device load. It reports CPU statistics as well as I/O statistics for block devices (i.e., disks). With extended output (iostat -x), key metrics such as %util (device utilization), await (average time per I/O request, reported as r_await and w_await in recent sysstat releases), and r/s and w/s (reads and writes per second) let an administrator judge directly whether the storage subsystem is saturated and is starving the data loading pipeline of the deep learning model.
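To show how those metrics might be read programmatically, here is a small sketch that scans iostat -x style text and reports devices whose %util exceeds a threshold. The sample text is made up and abbreviated (real output has more columns), and the 90% threshold is an arbitrary assumption rather than a documented limit.

```python
# Made-up, shortened sample in the shape of `iostat -x` output.
SAMPLE = """\
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.10    0.00    1.20   41.50    0.00   54.20

Device            r/s     w/s     rkB/s     wkB/s  %util
nvme0n1        950.00   20.00 120000.00   1500.00  98.70
sda              1.00    2.00     16.00     40.00   0.40
"""


def busy_devices(iostat_text, threshold=90.0):
    """Return {device: %util} for devices at or above the threshold."""
    header = None
    busy = {}
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            header = None              # a blank line ends the device table
            continue
        if fields[0].startswith("Device"):
            header = fields            # column names vary by sysstat version
            continue
        if header and "%util" in header and len(fields) == len(header):
            util = float(fields[header.index("%util")])
            if util >= threshold:
                busy[fields[0]] = util
    return busy


if __name__ == "__main__":
    print(busy_devices(SAMPLE))
```

Looking up %util by column name rather than by position keeps the parser tolerant of the column differences between sysstat versions.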

Question 9

After completing the installation of a Kubernetes cluster on your NVIDIA DGX systems using BCM, how can you verify that all worker nodes are properly registered and ready?

Accepted Answer

A

Explanation: The kubectl get nodes command is the standard and authoritative method for querying the Kubernetes API server to determine the status of all nodes within the cluster. The output lists each registered node and its current status. A status of "Ready" indicates that the node's kubelet is healthy, has successfully registered with the control plane, and is fully prepared to accept and run pods. This single command directly and efficiently verifies the exact conditions specified in the question for all worker nodes simultaneously.
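For scripted checks, the same information is available as JSON from kubectl get nodes -o json. The sketch below assumes that output has already been parsed into a dictionary and lists the nodes that do not report a Ready condition of "True". The node names in the example are hypothetical.

```python
def not_ready_nodes(node_list):
    """Return names of nodes whose Ready condition is not "True".

    node_list is the parsed JSON from `kubectl get nodes -o json`.
    """
    missing = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            missing.append(node["metadata"]["name"])
    return missing


if __name__ == "__main__":
    # Hand-written example data with assumed node names.
    example = {
        "items": [
            {"metadata": {"name": "dgx-01"},
             "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
            {"metadata": {"name": "dgx-02"},
             "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
        ]
    }
    print(not_ready_nodes(example))
```

An empty result means every registered node is Ready; a node that never registered will not appear at all, so the node count should be checked against the expected number of workers as well.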

Question 10

Your Kubernetes cluster is running a mixture of AI training and inference workloads. You want to
ensure that inference services have higher priority over training jobs during peak resource usage
times.
How would you configure Kubernetes to prioritize inference workloads?

Accepted Answer

D

Explanation: The most effective and native Kubernetes method for favoring one workload type over another during resource contention is PriorityClass. A PriorityClass object maps a priority name to an integer value; higher values indicate higher priority. When a high-priority pod cannot be scheduled due to insufficient resources, the scheduler can preempt (evict) lower-priority pods to make room. ResourceQuotas complement this by capping the aggregate resources each namespace may consume. A quota does not reserve capacity for a namespace, but a cap on the namespaces that run training jobs keeps those lower-priority jobs from consuming all available cluster resources, which leaves headroom for inference. Together, the two mechanisms address both the prioritization requirement and the need to protect inference capacity at peak times.
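The PriorityClass mapping described above can be sketched as two manifests, generated here as JSON (which kubectl apply -f also accepts). The class names and integer values are assumptions chosen for illustration; only their relative order matters.

```python
import json


def priority_class(name, value, description=""):
    """Build a PriorityClass manifest as a plain dictionary."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,            # higher integer means higher priority
        "globalDefault": False,    # pods must opt in via priorityClassName
        "description": description,
    }


# Hypothetical names and values, not taken from the question.
inference = priority_class("inference-high", 100000, "Latency-sensitive inference")
training = priority_class("training-low", 1000, "Preemptible training jobs")

if __name__ == "__main__":
    print(json.dumps([inference, training], indent=2))
```

A pod opts into a class by setting spec.priorityClassName to the class name, so inference pods would reference the higher-valued class and training pods the lower one.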

Question 11

When troubleshooting Slurm job scheduling issues, a common source of problems is jobs getting
stuck in a pending state indefinitely.
Which Slurm command can be used to view detailed information about all pending jobs and identify
the cause of the delay?

Accepted Answer

A

Explanation: The scontrol command is the primary administrative tool for viewing and modifying the state of Slurm jobs, partitions, and nodes. To diagnose jobs stuck in a pending state, scontrol show job provides the most comprehensive output; run without a job ID, it prints a record for every job in the queue. Each record includes a Reason field that states why the scheduler has not yet allocated resources to the job, for example Reason=Resources, Reason=Priority, or Reason=Dependency (squeue shows the same reasons in parentheses). This level of detail is what identifies the specific cause of the scheduling delay, making scontrol the correct tool for the in-depth troubleshooting described in the scenario.
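As an illustration of reading the Reason field, this sketch extracts the reason for each pending job from scontrol show job style text. The sample records are made up and heavily shortened; real records contain many more fields.

```python
import re

# Made-up, shortened records in the shape of `scontrol show job` output.
SAMPLE = """\
JobId=101 JobName=train
   JobState=PENDING Reason=Resources Dependency=(null)

JobId=102 JobName=eval
   JobState=RUNNING Reason=None Dependency=(null)

JobId=103 JobName=post
   JobState=PENDING Reason=Dependency Dependency=afterok:101
"""


def pending_reasons(scontrol_text):
    """Return {job_id: reason} for every job in the PENDING state."""
    reasons = {}
    # Job records are separated by blank lines.
    for record in re.split(r"\n\s*\n", scontrol_text.strip()):
        job_id = re.search(r"\bJobId=(\S+)", record)
        state = re.search(r"\bJobState=(\S+)", record)
        reason = re.search(r"\bReason=(\S+)", record)
        if job_id and state and reason and state.group(1) == "PENDING":
            reasons[job_id.group(1)] = reason.group(1)
    return reasons


if __name__ == "__main__":
    for job, why in pending_reasons(SAMPLE).items():
        print(job, why)
```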

Question 12

A GPU administrator needs to virtualize AI/ML training in an HGX environment. How can the NVIDIA Fabric Manager be used to meet this demand?

Accepted Answer

C

Explanation: The NVIDIA Fabric Manager is a software service specifically designed to initialize, monitor, and manage the high-speed interconnect fabric in complex multi-GPU systems like NVIDIA HGX platforms. This fabric is composed of NVLink and NVSwitch technologies, which provide high-bandwidth, low-latency communication paths between GPUs. For virtualized AI/ML training, which relies on efficient multi-GPU scaling, the Fabric Manager is essential for ensuring the fabric is correctly configured, healthy, and optimized. It is a foundational component for enabling the full performance of the interconnected GPUs in both bare-metal and virtualized environments.

Question 13

You are setting up a Kubernetes cluster on NVIDIA DGX systems using BCM, and you need to initialize the control-plane nodes. What is the most important step to take before initializing these nodes?

Accepted Answer

B

Explanation: Kubernetes schedules pods and enforces their memory requests and limits on the assumption that memory means physical RAM. If swap is enabled, the host OS can move parts of a pod's memory to disk, so the pod's real memory footprint and performance become unpredictable. This can severely degrade performance and stability and undermines Kubernetes' resource management and scheduling decisions. For that reason the kubelet, by default, refuses to start when swap is detected (its failSwapOn setting defaults to true), and the kubeadm init preflight checks flag enabled swap before control-plane initialization proceeds. Disabling swap on the nodes is therefore the critical prerequisite before initializing them.
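A quick way to confirm the prerequisite on a Linux node is to inspect /proc/swaps, which lists active swap areas under a header line. The sketch below parses that format; the sample entry is made up for illustration.

```python
# Made-up sample in the format of /proc/swaps (one header line, then entries).
SAMPLE = """\
Filename                Type        Size      Used  Priority
/dev/sda2               partition   8388604   0     -2
"""


def active_swap_devices(proc_swaps_text):
    """Return the swap areas listed after the header line."""
    lines = proc_swaps_text.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.split()]


def swap_is_disabled(path="/proc/swaps"):
    """True when the node reports no active swap areas (Linux only)."""
    with open(path) as handle:
        return not active_swap_devices(handle.read())


if __name__ == "__main__":
    print(active_swap_devices(SAMPLE))
```

If any device is listed, swapoff -a turns swap off for the running system; the matching entries in /etc/fstab also need to be removed or commented out so swap stays off after a reboot.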

Question 14

A Slurm user needs to submit a batch job script for execution tomorrow. Which command should be used to complete this task?

Accepted Answer

A

Explanation: The sbatch command is the standard utility in Slurm for submitting a batch script for later execution. To control when the job becomes eligible to run, the --begin (or -b) option is used. This option accepts various time formats, including the specific keyword tomorrow, which defers the job's start time until the beginning of the next day (00:00:00). This command combination precisely fulfills the user's requirement to submit a script now that will be scheduled to run on the following day.
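For scripted submission, the command described above can be assembled as an argument list. This is a minimal sketch; the helper name and the script name "train.sh" are hypothetical.

```python
def deferred_sbatch_cmd(script_path, begin="tomorrow"):
    """Build an sbatch command that defers job eligibility until `begin`."""
    return ["sbatch", f"--begin={begin}", script_path]


if __name__ == "__main__":
    # "train.sh" is an assumed example script name.
    print(" ".join(deferred_sbatch_cmd("train.sh")))
```

The job is accepted into the queue immediately and stays pending until the begin time is reached, after which it competes for resources like any other job.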

Question 15

You need to do maintenance on a node. What should you do first?

Accepted Answer

A

Explanation: The correct first step for planned maintenance on a compute node is to place it in the DRAIN state using the scontrol update command. This is the standard, non-disruptive procedure in Slurm. The DRAIN state prevents new jobs from being allocated to the node while allowing any currently running jobs to complete. Once all jobs have finished, the node's state transitions to IDLE+DRAIN, indicating it is safe to take offline for maintenance without interrupting user workloads.
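The drain step can be sketched as follows. Slurm expects a Reason when a node is set to DRAIN, so the helper refuses an empty one. The function names, node name, and reason text are hypothetical.

```python
def drain_node_cmd(node_name, reason):
    """Build the scontrol command that puts a node into the DRAIN state."""
    if not reason.strip():
        raise ValueError("a Reason is required when setting a node to DRAIN")
    return [
        "scontrol", "update",
        f"NodeName={node_name}",
        "State=DRAIN",
        f"Reason={reason}",
    ]


def resume_node_cmd(node_name):
    """Build the scontrol command that returns a node to service."""
    return ["scontrol", "update", f"NodeName={node_name}", "State=RESUME"]


if __name__ == "__main__":
    # "node01" and the reason text are assumed example values.
    print(drain_node_cmd("node01", "planned maintenance"))
    print(resume_node_cmd("node01"))
```

After maintenance, setting the state to RESUME makes the node available to the scheduler again.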

Free NVIDIA NCP-AIO Actual Exam Questions