NCA-GENM Mock Questions

Multiple-choice practice to test concept understanding

48 total questions · 4 sets · 12 per set · chapter-balanced

How to Use

  1. Complete a full set before opening any answers.
  2. Track which chapters produce the most misses and return to those chapter pages.
  3. Repeat the weakest set after review until you can explain each answer in your own words.

Set 1

Set 1 · Question 1 · Chapter 1

What is the main advantage of transfer learning for a multimodal associate-level project?

  1. It reduces adaptation time and compute compared with training from scratch
  2. It removes the need for evaluation entirely
  3. It guarantees zero hallucinations in production
  4. It allows every modality to share the same tokenizer

Correct: It reduces adaptation time and compute compared with training from scratch

Transfer learning starts from useful pretrained representations, which lowers the cost and time needed for downstream adaptation.

Set 1 · Question 2 · Chapter 2

What best describes how a diffusion model generates outputs?

  1. By classifying a final image into predefined labels
  2. By iteratively denoising a noisy sample into structured output
  3. By decoding only the first token of a sequence
  4. By selecting nearest neighbors from a vector database

Correct: By iteratively denoising a noisy sample into structured output

Diffusion systems begin with noise and repeatedly denoise it until a coherent sample is produced.

Set 1 · Question 3 · Chapter 3

In a multimodal system, what does a cross-modal transformer most directly help with?

  1. Compressing model checkpoints for storage
  2. Aligning relationships between different modalities such as text and image
  3. Replacing every encoder with a CNN
  4. Eliminating the need for prompt design

Correct: Aligning relationships between different modalities such as text and image

Cross-modal transformers explicitly model interactions across modalities, such as text attending to image regions.

Set 1 · Question 4 · Chapter 4

Which representation is commonly used before many speech models process audio?

  1. A spectrogram or mel-spectrogram
  2. A confusion matrix
  3. A routing table
  4. A positional hash table

Correct: A spectrogram or mel-spectrogram

Speech pipelines frequently transform waveforms into spectral representations that capture time-frequency structure.
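
As a rough illustration of the waveform-to-spectral step, the sketch below computes a magnitude spectrogram with a naive DFT in plain Python. The function name `spectrogram`, the frame size, and the hop length are assumptions chosen for the example; real pipelines typically use an optimized FFT and often apply a mel filterbank afterwards.

```python
import cmath
import math

def spectrogram(signal, frame_size, hop):
    """Return per-frame magnitude spectra (frames x frequency bins)."""
    # Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    spec = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        # Naive DFT, keeping only the non-negative frequency bins.
        spec.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n, x in enumerate(frame)))
            for k in range(frame_size // 2 + 1)
        ])
    return spec

# Toy input: a pure tone that sits exactly on DFT bin 4 of a 32-sample frame.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(64)]
spec = spectrogram(tone, frame_size=32, hop=16)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of `spec` is one time frame and each column is one frequency bin, which is the time-frequency structure the explanation refers to.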

Set 1 · Question 5 · Chapter 5

What is the key idea behind Vision Transformers (ViTs)?

  1. They convert image patches into tokens for attention-based modeling
  2. They process only grayscale frames
  3. They replace every image with caption text first
  4. They require recurrent layers for global context

Correct: They convert image patches into tokens for attention-based modeling

ViTs split images into patches, embed them as tokens, and use attention to model global relationships.
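
The patch-splitting step can be sketched in a few lines. This is a minimal illustration under stated assumptions: `patchify` is a hypothetical helper, the image is a small single-channel grid, and the learned linear embedding and position embeddings that a ViT applies to each patch are left out.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch tokens, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch into a single token vector.
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

# A 4 x 4 toy image whose pixel values equal their row-major index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)  # four tokens, each of length 4
```

The resulting token sequence is what the attention layers operate on, so every patch can attend to every other patch.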

Set 1 · Question 6 · Chapter 6

What matters most for a believable digital human interaction?

  1. Synchronized reasoning, speech, and avatar behavior
  2. Using the largest possible tokenizer
  3. Serving every model on CPU only
  4. Disabling guardrails to improve fluency

Correct: Synchronized reasoning, speech, and avatar behavior

Digital humans feel coherent when the language, audio, and avatar response stay aligned in timing and intent.

Set 1 · Question 7 · Chapter 7

Why is properly aligned multimodal data important?

  1. Because labels and paired modalities must correspond to the same example
  2. Because it removes the need for train/validation splits
  3. Because it guarantees no missing values
  4. Because it forces every dataset to be balanced

Correct: Because labels and paired modalities must correspond to the same example

Multimodal systems depend on paired signals being correctly matched, or the model learns the wrong cross-modal relationships.
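
One simple pairing check is sketched below. It assumes each example carries a shared ID across modalities; the helper name `find_misaligned` and the ID format are hypothetical. Note that it only detects missing pairs and cannot tell whether a caption actually describes its image.

```python
def find_misaligned(image_ids, caption_ids):
    """Report IDs that appear in one modality but not the other."""
    images, captions = set(image_ids), set(caption_ids)
    return {
        "missing_caption": sorted(images - captions),
        "missing_image": sorted(captions - images),
    }

report = find_misaligned(
    ["img_001", "img_002", "img_003"],
    ["img_001", "img_003", "img_004"],
)
```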

Set 1 · Question 8 · Chapter 8

Why should multimodal evaluation include both automated metrics and human review?

  1. Because multimodal quality includes subjective usefulness that pure scalar metrics can miss
  2. Because human review makes metrics unnecessary
  3. Because automated metrics cannot be computed for language outputs
  4. Because humans always agree on quality

Correct: Because multimodal quality includes subjective usefulness that pure scalar metrics can miss

Automated metrics help at scale, but human review is still needed for relevance, safety, and interaction quality.

Set 1 · Question 9 · Chapter 9

What is a common tradeoff when applying quantization to a multimodal model?

  1. Lower latency and memory use, with possible accuracy impact
  2. Higher latency with guaranteed accuracy gains
  3. More storage cost with smaller batches
  4. Removal of the need for profiling

Correct: Lower latency and memory use, with possible accuracy impact

Quantization often improves efficiency, but it can reduce quality if calibration and validation are weak.
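
To make the tradeoff concrete, here is a minimal sketch of symmetric int8 quantization on a handful of weights. The helper names and the per-tensor scale are assumptions for illustration; production toolchains usually add calibration data and per-channel scales.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
```

Each weight now fits in one byte instead of four, and `max_error` shows the precision given up in exchange, which is why validation after quantization matters.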

Set 1 · Question 10 · Chapter 10

In production deployment, what is the role of observability?

  1. Tracking latency, errors, throughput, and quality signals over time
  2. Replacing authentication and authorization
  3. Eliminating the need for incident response
  4. Guaranteeing no model drift

Correct: Tracking latency, errors, throughput, and quality signals over time

Observability helps operators understand whether the system is healthy and whether outputs remain acceptable in production.

Set 1 · Question 11 · Chapter 11

What is the safest default for high-risk multimodal outputs that could affect people directly?

  1. Require human review before actioning sensitive outputs
  2. Trust the first model answer if confidence is above 50%
  3. Disable moderation to avoid latency
  4. Always expand chain-of-thought for transparency

Correct: Require human review before actioning sensitive outputs

High-risk outputs need layered controls, and human review remains a core safeguard.

Set 1 · Question 12 · Chapter 12

Which NVIDIA component is commonly used to serve optimized inference workloads?

  1. Triton Inference Server
  2. Nsight Systems
  3. NCCL
  4. DCGM Exporter

Correct: Triton Inference Server

Triton is a serving platform built for production inference across model backends and deployment patterns.

Set 2

Set 2 · Question 1 · Chapter 1

Which task is discriminative rather than generative?

  1. Classifying whether an uploaded image contains a defect
  2. Generating a caption for an image
  3. Synthesizing a customer support reply
  4. Creating speech from text

Correct: Classifying whether an uploaded image contains a defect

Classification predicts labels from inputs, which is a discriminative task.

Set 2 · Question 2 · Chapter 2

What do system prompts and conditioning tokens primarily provide?

  1. Steering context for behavior, style, or safety constraints
  2. A replacement for model evaluation
  3. A way to increase GPU memory capacity
  4. A guarantee that outputs are factual

Correct: Steering context for behavior, style, or safety constraints

They provide high-priority control signals, but they do not replace validation, retrieval, or guardrails.

Set 2 · Question 3 · Chapter 3

What is one core challenge in multimodal fusion?

  1. Aligning representations with different structures, rates, and semantics
  2. Making every dataset use the same file extension
  3. Avoiding any use of attention layers
  4. Forcing all models to share one loss function

Correct: Aligning representations with different structures, rates, and semantics

Text, image, and audio signals differ substantially, so fusion requires careful alignment logic.

Set 2 · Question 4 · Chapter 4

Which metric is especially common when evaluating automatic speech recognition?

  1. Word error rate (WER)
  2. Intersection over Union
  3. BLEU only
  4. Top-1 accuracy only

Correct: Word error rate (WER)

WER is a standard ASR metric because it captures insertions, deletions, and substitutions in transcripts.
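
WER can be computed as the word-level edit distance divided by the number of reference words. The sketch below assumes whitespace tokenization and no text normalization; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Levenshtein distance over words, kept to one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        current = [i]
        for j, hyp_word in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Here one of six reference words is deleted, so `wer` is 1/6. Because insertions count as errors, WER can exceed 1.0 for very long hypotheses.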

Set 2 · Question 5 · Chapter 5

What does grounding a text query in image regions help with?

  1. Linking language references to the correct visual evidence
  2. Eliminating all preprocessing
  3. Compressing the model for mobile deployment
  4. Disabling the need for captions

Correct: Linking language references to the correct visual evidence

Grounding improves correctness by tying language outputs to actual image content rather than unsupported guesses.

Set 2 · Question 6 · Chapter 6

Why is low latency especially important for interactive digital humans?

  1. Because conversational lag quickly breaks the illusion of responsiveness
  2. Because low latency automatically improves factuality
  3. Because avatars do not require synchronization
  4. Because users prefer text-only fallback

Correct: Because conversational lag quickly breaks the illusion of responsiveness

Interactive avatars feel unnatural when speech, reasoning, or animation lags behind the conversation flow.

Set 2 · Question 7 · Chapter 7

What is a good reason to keep train, validation, and test sets clearly separated in multimodal data workflows?

  1. To reduce leakage and get a more honest signal of generalization
  2. To make augmentation impossible
  3. To ensure every class has the same number of examples
  4. To remove the need for metadata

Correct: To reduce leakage and get a more honest signal of generalization

Split discipline protects evaluation integrity and reduces false confidence from data leakage.
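
One way to limit leakage is to assign splits by a stable hash of a group ID, so every modality of the same recording lands in the same split. This is a sketch under assumptions: `assign_split`, the `clip_0001` ID format, and the 80/10/10 proportions are all hypothetical.

```python
import hashlib

def assign_split(group_id, val_pct=10, test_pct=10):
    """Deterministically map a group ID to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"

# Video, audio, and transcript files of one clip share a group ID,
# so they always receive the same split.
clip_split = assign_split("clip_0001")
files = {name: clip_split for name in ("clip_0001.mp4", "clip_0001.wav", "clip_0001.txt")}
```

Hashing the group rather than the individual file means a re-run or a newly added modality cannot move part of a clip into a different split.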

Set 2 · Question 8 · Chapter 8

Why should evaluation cover multiple benchmark types instead of just one leaderboard score?

  1. Because multimodal behavior can fail differently across tasks and modalities
  2. Because single metrics are always enough for production
  3. Because leaderboard rank removes the need for internal tests
  4. Because one benchmark can measure business ROI directly

Correct: Because multimodal behavior can fail differently across tasks and modalities

Different tasks expose different failure modes, so broader evaluation provides better operational confidence.

Set 2 · Question 9 · Chapter 9

Which change most directly improves throughput when many requests repeat shared context?

  1. Caching reusable computation or context state
  2. Disabling autoscaling
  3. Increasing label entropy
  4. Removing batching

Correct: Caching reusable computation or context state

Caching avoids recomputing repeated work and can materially improve effective serving throughput.
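
The idea can be sketched with Python's standard `functools.lru_cache`. The function `encode_shared_context` is a hypothetical stand-in for an expensive step such as encoding a shared system prompt; a real serving stack would cache model state rather than word lengths.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive step really runs

@lru_cache(maxsize=128)
def encode_shared_context(context):
    """Hypothetical stand-in for an expensive encoding of shared context."""
    calls["count"] += 1
    return tuple(len(word) for word in context.split())

def answer(context, question):
    state = encode_shared_context(context)  # cache hit after the first call
    return f"{len(state)} context tokens | {question}"

SYSTEM_CONTEXT = "You are a support assistant for a camera product line"
replies = [answer(SYSTEM_CONTEXT, q)
           for q in ("reset steps?", "warranty?", "firmware?")]
```

Three requests are served, but the shared context is encoded only once, so `calls["count"]` stays at 1.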

Set 2 · Question 10 · Chapter 10

Which signal would be most useful in a production dashboard for a multimodal service?

  1. Latency percentiles, error rates, and output-quality indicators
  2. Only total page views
  3. Only CUDA driver version
  4. Only the model parameter count

Correct: Latency percentiles, error rates, and output-quality indicators

Production health depends on both system-level signals and output-quality signals, not on vanity metrics alone.
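
Latency percentiles are worth computing separately from the mean, because a few slow requests can hide behind an average. The sketch below uses the nearest-rank method; the sample latencies are made up for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 95, 110, 105, 900, 100, 98, 102, 130, 101]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
```

With this data the median is 102 ms while the 95th percentile is 900 ms, a gap that the mean alone does not reveal.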

Set 2 · Question 11 · Chapter 11

What is prompt injection in a multimodal application?

  1. A malicious instruction embedded in user content that attempts to override system behavior
  2. A method for compressing weights
  3. A GPU scheduling optimization
  4. A replacement for role-based access control

Correct: A malicious instruction embedded in user content that attempts to override system behavior

Prompt injection is a control attack against the model’s instruction hierarchy and must be mitigated with layered defenses.

Set 2 · Question 12 · Chapter 12

What is a common role of NVIDIA NeMo in generative AI workflows?

  1. Model customization, training, and adaptation workflows
  2. Layer-1 network switching only
  3. Vector database replication only
  4. Replacing model serving systems

Correct: Model customization, training, and adaptation workflows

NeMo is often used for model development, fine-tuning, and related enterprise AI workflows.

Set 3

Set 3 · Question 1 · Chapter 1

What is a common benefit of mixed precision during training?

  1. Lower memory use and faster throughput on supported hardware
  2. Guaranteed elimination of overfitting
  3. Removal of the need for checkpointing
  4. Perfect multimodal alignment by default

Correct: Lower memory use and faster throughput on supported hardware

Mixed precision commonly improves efficiency, though stability and validation still matter.

Set 3 · Question 2 · Chapter 2

When is an encoder-decoder architecture often preferable to a decoder-only model?

  1. When the task depends on structured source-to-target transformation
  2. When the system has no input modality
  3. When retrieval must be disabled
  4. When model weights cannot be updated

Correct: When the task depends on structured source-to-target transformation

Encoder-decoder designs work well when there is a clear source signal that must be encoded and then transformed into a target output.

Set 3 · Question 3 · Chapter 3

What does late fusion mean in a multimodal pipeline?

  1. Combining modality-specific outputs or representations at a later decision stage
  2. Combining every modality before any encoding happens
  3. Serving only one modality at inference time
  4. Deleting metadata after ingestion

Correct: Combining modality-specific outputs or representations at a later decision stage

Late fusion keeps modality processing separate longer and merges signals closer to the final decision stage.
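
A minimal late-fusion sketch: each modality produces its own class probabilities, and they are merged only at the decision stage by a weighted average. The function name, the modality names, and the probabilities are assumptions for illustration; weighted averaging is one of several possible merge rules.

```python
def late_fusion(per_modality_probs, weights=None):
    """Weighted average of class probabilities produced independently per modality."""
    names = list(per_modality_probs)
    weights = weights or {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    classes = per_modality_probs[names[0]].keys()
    return {
        cls: sum(weights[name] * per_modality_probs[name][cls] for name in names) / total
        for cls in classes
    }

# Hypothetical outputs of two separate modality-specific models.
fused = late_fusion({
    "image": {"defect": 0.8, "ok": 0.2},
    "audio": {"defect": 0.4, "ok": 0.6},
})
decision = max(fused, key=fused.get)
```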

Set 3 · Question 4 · Chapter 4

For text-to-speech quality, what does alignment primarily help with?

  1. Mapping generated speech timing to the intended text structure
  2. Reducing GPU clock speed
  3. Replacing phoneme modeling completely
  4. Avoiding all acoustic features

Correct: Mapping generated speech timing to the intended text structure

Good alignment helps generated speech follow the text correctly in timing and sequence.

Set 3 · Question 5 · Chapter 5

Why is OCR often relevant in multimodal vision systems?

  1. Because useful visual meaning may be embedded as text inside images or documents
  2. Because OCR replaces all object detection tasks
  3. Because OCR removes the need for captions
  4. Because OCR guarantees cross-lingual grounding

Correct: Because useful visual meaning may be embedded as text inside images or documents

Many real-world multimodal tasks involve images that contain important text content.

Set 3 · Question 6 · Chapter 6

In an ACE-style assistant, what does orchestration coordinate?

  1. Reasoning, speech, animation, and response timing across components
  2. Only DNS records for the frontend
  3. Only storage lifecycle rules
  4. Only keyboard shortcuts in the UI

Correct: Reasoning, speech, animation, and response timing across components

A digital human experience depends on coordinating several subsystems into one coherent interaction loop.

Set 3 · Question 7 · Chapter 7

Why is metadata valuable in multimodal data engineering?

  1. It helps trace provenance, modality pairing, splits, and filtering logic
  2. It makes evaluation unnecessary
  3. It guarantees balanced datasets
  4. It removes the need for schema versioning

Correct: It helps trace provenance, modality pairing, splits, and filtering logic

Metadata supports governance, reproducibility, and operational reliability across complex datasets.

Set 3 · Question 8 · Chapter 8

Why are traditional language-only metrics insufficient for many multimodal tasks?

  1. Because multimodal quality also depends on grounding, perception, and interaction fidelity
  2. Because language outputs cannot be scored
  3. Because multimodal models do not generate text
  4. Because only humans may evaluate any multimodal task

Correct: Because multimodal quality also depends on grounding, perception, and interaction fidelity

Multimodal systems often need evaluation beyond text overlap, including alignment to visual or audio evidence.

Set 3 · Question 9 · Chapter 9

What is a practical benefit of model distillation?

  1. Creating a smaller model that can preserve much of a larger model’s behavior
  2. Removing the need for retraining
  3. Guaranteeing lower cloud cost with no testing
  4. Eliminating all hallucinations

Correct: Creating a smaller model that can preserve much of a larger model’s behavior

Distillation can improve serving efficiency, but the student model must still be validated for quality.
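
One common way to train the student is a soft-target loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The sketch below shows that formulation only; the logits are made up, and it is an assumption that this particular loss matches what the course has in mind.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

matched = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
mismatched = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. A low loss on training data still does not replace validating the student on held-out tasks.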

Set 3 · Question 10 · Chapter 10

Why is a fallback path useful in a production multimodal system?

  1. It helps degrade gracefully when a model, dependency, or modality path fails
  2. It guarantees higher benchmark accuracy
  3. It removes the need for monitoring
  4. It replaces incident response training

Correct: It helps degrade gracefully when a model, dependency, or modality path fails

Fallbacks preserve service continuity when parts of the pipeline become unavailable or unreliable.

Set 3 · Question 11 · Chapter 11

What is the purpose of red-teaming in responsible AI?

  1. Stress-testing the system for safety, abuse, and failure modes before or during deployment
  2. Increasing GPU utilization only
  3. Replacing all benchmark testing
  4. Encrypting training data at rest

Correct: Stress-testing the system for safety, abuse, and failure modes before or during deployment

Red-teaming intentionally probes for harmful or fragile behaviors so defenses can be improved.

Set 3 · Question 12 · Chapter 12

How does NVIDIA NIM fit into the ecosystem?

  1. It packages inference endpoints and deployment interfaces for model serving workflows
  2. It is a data labeling platform only
  3. It replaces every training pipeline
  4. It is a GPU interconnect standard

Correct: It packages inference endpoints and deployment interfaces for model serving workflows

NIM helps standardize how models are exposed and operated in deployment environments.

Set 4

Set 4 · Question 1 · Chapter 1

What problem does gradient clipping most directly address?

  1. Exploding gradients during optimization
  2. Tokenization mismatch
  3. Slow DNS resolution
  4. Label imbalance in inference

Correct: Exploding gradients during optimization

Gradient clipping limits unstable update magnitudes and helps maintain training stability.
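
Clipping by global norm is one widely used variant: if the combined gradient norm exceeds a threshold, every gradient is rescaled by the same factor so the direction is preserved. The sketch below works on a flat list of floats; the function name and threshold are illustrative.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their L2 norm is at most max_norm; return (clipped, original_norm)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm  # already within bounds, leave untouched
    factor = max_norm / norm
    return [g * factor for g in grads], norm

clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

The original norm here is 5.0, so both components are scaled by 0.2, giving roughly [0.6, 0.8] with a norm of 1.0.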

Set 4 · Question 2 · Chapter 2

What is a typical benefit of few-shot prompting?

  1. Improved output format adherence and task pattern consistency
  2. Permanent weight updates inside the model
  3. Guaranteed safe behavior without moderation
  4. Higher GPU memory capacity

Correct: Improved output format adherence and task pattern consistency

Few-shot examples often give the model a clearer pattern for how to respond.
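
A few-shot prompt is just text assembled before inference, with no weight updates involved. The builder below is a hypothetical sketch; the `Input:`/`Output:` labels and the example task are assumptions, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # End on an open "Output:" so the model continues in the demonstrated pattern.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Label each image caption as INDOOR or OUTDOOR.",
    [("A sofa next to a lamp", "INDOOR"), ("A trail through pine trees", "OUTDOOR")],
    "A kayak on a lake",
)
```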

Set 4 · Question 3 · Chapter 3

Why might retrieval be added to a multimodal application?

  1. To supply relevant external context that improves grounded responses
  2. To avoid using embeddings entirely
  3. To replace all training data
  4. To disable modality fusion

Correct: To supply relevant external context that improves grounded responses

Retrieval can strengthen grounding by bringing in fresh or domain-specific context at inference time.

Set 4 · Question 4 · Chapter 4

What does speaker diarization try to solve?

  1. Determining who spoke when in a multi-speaker audio stream
  2. Improving image segmentation masks
  3. Reducing tokenizer vocabulary size
  4. Selecting the best GPU instance family

Correct: Determining who spoke when in a multi-speaker audio stream

Diarization separates and tracks speakers across an audio stream, which is important in conversations and meetings.

Set 4 · Question 5 · Chapter 5

What does cross-attention from text to image regions help a model do?

  1. Focus language generation on relevant visual evidence
  2. Remove the need for training data
  3. Convert every image into audio features
  4. Disable prompt conditioning

Correct: Focus language generation on relevant visual evidence

Cross-attention helps the model tie language outputs to the appropriate parts of the visual input.
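
The mechanism can be sketched as scaled dot-product attention with one text query attending over image-region keys and values. The vectors below are tiny made-up numbers, and learned projection matrices and multiple heads are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """One text query attends over image-region keys; returns (weights, attended vector)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    attended = [sum(w * value[d] for w, value in zip(weights, values))
                for d in range(len(values[0]))]
    return weights, attended

text_query = [1.0, 0.0]                              # hypothetical text-token query
region_keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]  # three image regions
region_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, attended = cross_attention(text_query, region_keys, region_values)
```

The first region's key points in the same direction as the query, so it receives the largest attention weight and dominates the attended output.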

Set 4 · Question 6 · Chapter 6

Why should persona and safety policy stay aligned in a digital human experience?

  1. Because inconsistent behavior reduces trust and can create unsafe interactions
  2. Because it increases token throughput automatically
  3. Because it removes the need for speech synthesis
  4. Because it guarantees realism regardless of latency

Correct: Because inconsistent behavior reduces trust and can create unsafe interactions

Users notice mismatches between tone, behavior, and policy quickly, so consistency is important for safety and trust.

Set 4 · Question 7 · Chapter 7

Which data-governance concern is especially important for multimodal datasets?

  1. Consent, licensing, and provenance across every modality
  2. Using one file format only
  3. Avoiding all data augmentation
  4. Keeping train and test sets identical

Correct: Consent, licensing, and provenance across every modality

Multimodal data can combine text, audio, images, and metadata, so governance must cover each source clearly.

Set 4 · Question 8 · Chapter 8

What does online monitoring help detect after deployment?

  1. Drift, degradation, or emerging failure patterns in real usage
  2. Only local syntax errors in the source code
  3. Only whether a prompt is grammatically correct
  4. Only GPU hardware serial numbers

Correct: Drift, degradation, or emerging failure patterns in real usage

Production monitoring helps teams spot changes in data, quality, or reliability before they become larger incidents.

Set 4 · Question 9 · Chapter 9

What is the main reason to connect autoscaling to latency and queueing signals?

  1. To maintain service responsiveness under variable demand
  2. To guarantee perfect output accuracy
  3. To remove the need for profiling
  4. To replace batching and caching

Correct: To maintain service responsiveness under variable demand

Autoscaling should respond to real load and latency pressure so user experience remains stable as demand changes.

Set 4 · Question 10 · Chapter 10

What should a production system do when retrieval or a dependent model path fails?

  1. Fail gracefully with fallback behavior and clear operational signals
  2. Silently fabricate unsupported answers
  3. Disable logging to reduce overhead
  4. Delete the request from telemetry

Correct: Fail gracefully with fallback behavior and clear operational signals

Graceful degradation and visibility are safer than silently returning unsupported or misleading outputs.
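
A sketch of that behavior: when retrieval fails, log the failure so operators can see it, and return a clearly marked degraded response instead of an unsupported answer. The function names, the `degraded` flag, and the fallback wording are hypothetical.

```python
import logging

logger = logging.getLogger("multimodal_service")

def answer_with_fallback(query, retrieve, generate):
    """Try the retrieval-backed path; on failure, log it and return a marked fallback."""
    try:
        context = retrieve(query)
    except Exception:
        # Emit an operational signal rather than failing silently.
        logger.exception("retrieval failed; serving degraded response")
        return {"text": "I can't look that up right now.", "degraded": True}
    return {"text": generate(query, context), "degraded": False}

def broken_retrieve(query):
    """Simulates an unavailable dependency."""
    raise TimeoutError("vector store unavailable")

result = answer_with_fallback("What is the warranty?", broken_retrieve,
                              lambda q, ctx: "unused")
```

The `degraded` flag lets downstream code and dashboards count fallback responses, which keeps the failure visible.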

Set 4 · Question 11 · Chapter 11

Why is PII handling part of trustworthy multimodal AI operations?

  1. Because inputs and outputs may contain sensitive data that must be protected and governed
  2. Because PII only affects training, not inference
  3. Because privacy risks vanish when prompts are short
  4. Because multimodal systems cannot process personal data

Correct: Because inputs and outputs may contain sensitive data that must be protected and governed

Trustworthy AI includes privacy controls for both data flowing into the system and content flowing out.
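
As a small illustration of an output-side control, the sketch below masks email addresses and one phone-number format before text is logged or returned. The regular expressions are deliberately simple assumptions and will miss many real-world PII formats; they are not a substitute for a dedicated detection service.

```python
import re

# Illustrative patterns only: simple emails and NNN-NNN-NNNN style phone numbers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text):
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

redacted = redact("Contact jane@example.com or 555-123-4567 for access")
```

The same function can be applied to inputs before they reach the model and to outputs before they leave the system.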

Set 4 · Question 12 · Chapter 12

Which NVIDIA capability is especially relevant when optimizing high-performance inference on NVIDIA GPUs?

  1. TensorRT-based optimization
  2. Spreadsheet pivot tables
  3. BGP unnumbered
  4. Route reflection

Correct: TensorRT-based optimization

TensorRT-style optimization is part of the serving and performance toolbox for accelerating inference workloads.
