NCA-GENM Mock Questions

Multiple-choice practice to test concept understanding

48 total questions · 4 sets · 12 per set · chapter-balanced

How to Use

Complete a full set before opening any answers.
Track which chapters produce the most misses and return to those chapter pages.
Repeat the weakest set after review until you can explain each answer in your own words.

Set 1

Set 1 · Question 1 · Chapter 1

What is the main advantage of transfer learning for a multimodal associate-level project?

It reduces adaptation time and compute compared with training from scratch
It removes the need for evaluation entirely
It guarantees zero hallucinations in production
It allows every modality to share the same tokenizer

Show answer

Correct: It reduces adaptation time and compute compared with training from scratch

Transfer learning starts from useful pretrained representations, which lowers the cost and time needed for downstream adaptation.

Set 1 · Question 2 · Chapter 2

What best describes how a diffusion model generates outputs?

By classifying a final image into predefined labels
By iteratively denoising a noisy sample into structured output
By decoding only the first token of a sequence
By selecting nearest neighbors from a vector database

Show answer

Correct: By iteratively denoising a noisy sample into structured output

Diffusion systems begin with noise and repeatedly denoise it until a coherent sample is produced.

Set 1 · Question 3 · Chapter 3

In a multimodal system, what does a cross-modal transformer most directly help with?

Compressing model checkpoints for storage
Aligning relationships between different modalities such as text and image
Replacing every encoder with a CNN
Eliminating the need for prompt design

Show answer

Correct: Aligning relationships between different modalities such as text and image

Cross-modal transformers explicitly model interactions across modalities, such as text attending to image regions.

Set 1 · Question 4 · Chapter 4

Which representation is commonly used before many speech models process audio?

A spectrogram or mel-spectrogram
A confusion matrix
A routing table
A positional hash table

Show answer

Correct: A spectrogram or mel-spectrogram

Speech pipelines frequently transform waveforms into spectral representations that capture time-frequency structure.
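As a rough illustration of that waveform-to-spectral step, the sketch below computes a magnitude spectrogram with a short-time Fourier transform using only the standard library. The function name `spectrogram`, the Hann window, and the frame and hop sizes are assumptions chosen for the example; real pipelines typically use an FFT library and add mel filtering, which is not shown here.

```python
import cmath
import math

def spectrogram(samples, frame_size, hop):
    """Return a list of frames, each a list of magnitudes per frequency bin."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window reduces spectral leakage at the frame edges.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        # A naive DFT is enough for a sketch; only non-negative frequencies are kept.
        for k in range(frame_size // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames

# Hypothetical test tone: 4 cycles per 32-sample frame, so energy lands in bin 4.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
spec = spectrogram(tone, frame_size=32, hop=16)
peak_bins = [frame.index(max(frame)) for frame in spec]
print(len(spec), "frames, peak bins:", peak_bins)
```

Each row is one time step and each column one frequency bin, which is the time-frequency layout a speech model would consume.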

Set 1 · Question 5 · Chapter 5

What is the key idea behind Vision Transformers (ViTs)?

They convert image patches into tokens for attention-based modeling
They process only grayscale frames
They replace every image with caption text first
They require recurrent layers for global context

Show answer

Correct: They convert image patches into tokens for attention-based modeling

ViTs split images into patches, embed them as tokens, and use attention to model global relationships.
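The patch-splitting step can be sketched in a few lines. This is a minimal illustration, assuming a single-channel image stored as a list of rows and a hypothetical helper named `image_to_patches`; a real ViT would follow this with a learned linear projection and position embeddings, which are omitted.

```python
def image_to_patches(image, patch_size):
    """Split a 2D grid into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row so it can act as one token vector.
            patch = [
                image[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ]
            patches.append(patch)
    return patches

# A 4x4 toy image whose pixel values are just their row-major index.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)
print(len(patches), patches[0])
```

With a 4x4 image and 2x2 patches, the result is four tokens of four values each.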

Set 1 · Question 6 · Chapter 6

What matters most for a believable digital human interaction?

Synchronized reasoning, speech, and avatar behavior
Using the largest possible tokenizer
Serving every model on CPU only
Disabling guardrails to improve fluency

Show answer

Correct: Synchronized reasoning, speech, and avatar behavior

Digital humans feel coherent when the language, audio, and avatar response stay aligned in timing and intent.

Set 1 · Question 7 · Chapter 7

Why is properly aligned multimodal data important?

Because labels and paired modalities must correspond to the same example
Because it removes the need for train/validation splits
Because it guarantees no missing values
Because it forces every dataset to be balanced

Show answer

Correct: Because labels and paired modalities must correspond to the same example

Multimodal systems depend on paired signals being correctly matched, or the model learns the wrong cross-modal relationships.

Set 1 · Question 8 · Chapter 8

Why should multimodal evaluation include both automated metrics and human review?

Because multimodal quality includes subjective usefulness that pure scalar metrics can miss
Because human review makes metrics unnecessary
Because automated metrics cannot be computed for language outputs
Because humans always agree on quality

Show answer

Correct: Because multimodal quality includes subjective usefulness that pure scalar metrics can miss

Automated metrics help at scale, but human review is still needed for relevance, safety, and interaction quality.

Set 1 · Question 9 · Chapter 9

What is a common tradeoff when applying quantization to a multimodal model?

Lower latency and memory use, with possible accuracy impact
Higher latency with guaranteed accuracy gains
More storage cost with smaller batches
Removal of the need for profiling

Show answer

Correct: Lower latency and memory use, with possible accuracy impact

Quantization often improves efficiency, but it can reduce quality if calibration and validation are weak.
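To make the tradeoff concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The function names and the toy weights are assumptions for illustration; production toolchains use calibration data and per-channel scales, which this sketch leaves out.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_error)
```

The small weight `0.003` collapses to code `0`, which is the kind of precision loss that has to be checked against task-level accuracy after quantization.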

Set 1 · Question 10 · Chapter 10

In production deployment, what is the role of observability?

Tracking latency, errors, throughput, and quality signals over time
Replacing authentication and authorization
Eliminating the need for incident response
Guaranteeing no model drift

Show answer

Correct: Tracking latency, errors, throughput, and quality signals over time

Observability helps operators understand whether the system is healthy and whether outputs remain acceptable in production.

Set 1 · Question 11 · Chapter 11

What is the safest default for high-risk multimodal outputs that could affect people directly?

Require human review before acting on sensitive outputs
Trust the first model answer if confidence is above 50%
Disable moderation to avoid latency
Always expand chain-of-thought for transparency

Show answer

Correct: Require human review before acting on sensitive outputs

High-risk outputs need layered controls, and human review remains a core safeguard.

Set 1 · Question 12 · Chapter 12

Which NVIDIA component is commonly used to serve optimized inference workloads?

Triton Inference Server
Nsight Systems
NCCL
DCGM Exporter

Show answer

Correct: Triton Inference Server

Triton is a serving platform built for production inference across model backends and deployment patterns.

Set 2

Set 2 · Question 1 · Chapter 1

Which task is discriminative rather than generative?

Classifying whether an uploaded image contains a defect
Generating a caption for an image
Synthesizing a customer support reply
Creating speech from text

Show answer

Correct: Classifying whether an uploaded image contains a defect

Classification predicts labels from inputs, which is a discriminative task.

Set 2 · Question 2 · Chapter 2

What do system prompts and conditioning tokens primarily provide?

Steering context for behavior, style, or safety constraints
A replacement for model evaluation
A way to increase GPU memory capacity
A guarantee that outputs are factual

Show answer

Correct: Steering context for behavior, style, or safety constraints

They provide high-priority control signals, but they do not replace validation, retrieval, or guardrails.

Set 2 · Question 3 · Chapter 3

What is one core challenge in multimodal fusion?

Aligning representations with different structures, rates, and semantics
Making every dataset use the same file extension
Avoiding any use of attention layers
Forcing all models to share one loss function

Show answer

Correct: Aligning representations with different structures, rates, and semantics

Text, image, and audio signals differ substantially, so fusion requires careful alignment logic.

Set 2 · Question 4 · Chapter 4

Which metric is especially common when evaluating automatic speech recognition?

Word error rate (WER)
Intersection over Union
BLEU only
Top-1 accuracy only

Show answer

Correct: Word error rate (WER)

WER is a standard ASR metric because it captures insertions, deletions, and substitutions in transcripts.
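WER is usually computed as (substitutions + deletions + insertions) divided by the number of reference words, using a word-level edit distance. The sketch below is one straightforward way to compute it; the function name is illustrative, and text normalization (casing, punctuation) is assumed to have been done beforehand.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(wer)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.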

Set 2 · Question 5 · Chapter 5

What does grounding a text query in image regions help with?

Linking language references to the correct visual evidence
Eliminating all preprocessing
Compressing the model for mobile deployment
Disabling the need for captions

Show answer

Correct: Linking language references to the correct visual evidence

Grounding improves correctness by tying language outputs to actual image content rather than unsupported guesses.

Set 2 · Question 6 · Chapter 6

Why is low latency especially important for interactive digital humans?

Because conversational lag quickly breaks the illusion of responsiveness
Because low latency automatically improves factuality
Because avatars do not require synchronization
Because users prefer text-only fallback

Show answer

Correct: Because conversational lag quickly breaks the illusion of responsiveness

Interactive avatars feel unnatural when speech, reasoning, or animation lags behind the conversation flow.

Set 2 · Question 7 · Chapter 7

What is a good reason to keep train, validation, and test sets clearly separated in multimodal data workflows?

To reduce leakage and get a more honest signal of generalization
To make augmentation impossible
To ensure every class has the same number of examples
To remove the need for metadata

Show answer

Correct: To reduce leakage and get a more honest signal of generalization

Split discipline protects evaluation integrity and reduces false confidence from data leakage.
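One common way to enforce this is to assign splits by a stable group key, so that every modality belonging to the same underlying example lands in the same split. The sketch below assumes a hypothetical `group_id` such as a clip or document identifier and uses a hash for deterministic assignment; the fractions are placeholders.

```python
import hashlib

def assign_split(group_id, val_fraction=0.1, test_fraction=0.1):
    """Deterministically map a group key to train, validation, or test."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    # Interpret the first 32 bits of the hash as a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"

# Hypothetical records: the image and audio of one clip share a group_id.
records = [
    {"group_id": "clip_0001", "modality": "image"},
    {"group_id": "clip_0001", "modality": "audio"},
    {"group_id": "clip_0002", "modality": "image"},
]
for record in records:
    record["split"] = assign_split(record["group_id"])
    print(record)
```

Hashing the group key rather than the individual file keeps paired modalities together and gives the same assignment on every run.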

Set 2 · Question 8 · Chapter 8

Why should evaluation cover multiple benchmark types instead of just one leaderboard score?

Because multimodal behavior can fail differently across tasks and modalities
Because single metrics are always enough for production
Because leaderboard rank removes the need for internal tests
Because one benchmark can measure business ROI directly

Show answer

Correct: Because multimodal behavior can fail differently across tasks and modalities

Different tasks expose different failure modes, so broader evaluation provides better operational confidence.

Set 2 · Question 9 · Chapter 9

Which change most directly improves throughput when many requests repeat shared context?

Caching reusable computation or context state
Disabling autoscaling
Increasing label entropy
Removing batching

Show answer

Correct: Caching reusable computation or context state

Caching avoids recomputing repeated work and can materially improve effective serving throughput.
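The idea can be sketched with Python's built-in `functools.lru_cache`. Here `encode_shared_context` is a hypothetical stand-in for an expensive step such as encoding a shared system prompt; real serving stacks cache lower-level state, so treat this only as an illustration of the principle.

```python
from functools import lru_cache

stats = {"encoder_calls": 0}

@lru_cache(maxsize=128)
def encode_shared_context(context):
    """Stand-in for an expensive encoder; counts how often it really runs."""
    stats["encoder_calls"] += 1
    return tuple(len(word) for word in context.split())

def answer(context, question):
    encoded = encode_shared_context(context)  # served from cache on repeats
    return f"[{len(encoded)} context tokens] {question}"

system_context = "You are a support assistant for a camera product line."
answer(system_context, "How do I reset the device?")
answer(system_context, "What does the red light mean?")
print(stats["encoder_calls"], encode_shared_context.cache_info())
```

Two requests share one context, so the encoder runs once and the second request is a cache hit.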

Set 2 · Question 10 · Chapter 10

Which signal would be most useful in a production dashboard for a multimodal service?

Latency percentiles, error rates, and output-quality indicators
Only total page views
Only CUDA driver version
Only the model parameter count

Show answer

Correct: Latency percentiles, error rates, and output-quality indicators

Production health depends on both system-level signals and output-quality signals, not on vanity metrics alone.
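As a small illustration of the system-side signals, the sketch below computes nearest-rank latency percentiles and an error rate from a batch of request records. The sample values and the `percentile` helper are assumptions for the example; a real dashboard would pull these from a metrics backend.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct percent of samples."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request log: (latency in milliseconds, succeeded?)
requests = [
    (120, True), (95, True), (110, True), (480, True), (105, True),
    (99, True), (130, True), (101, True), (98, True), (1020, False),
]
latencies = [latency for latency, _ in requests]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
error_rate = sum(1 for _, ok in requests if not ok) / len(requests)
print(p50, p95, error_rate)
```

The median looks healthy while the 95th percentile exposes the slow tail, which is why percentiles are preferred over averages.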

Set 2 · Question 11 · Chapter 11

What is prompt injection in a multimodal application?

A malicious instruction embedded in user content that attempts to override system behavior
A method for compressing weights
A GPU scheduling optimization
A replacement for role-based access control

Show answer

Correct: A malicious instruction embedded in user content that attempts to override system behavior

Prompt injection is a control attack against the model’s instruction hierarchy and must be mitigated with layered defenses.

Set 2 · Question 12 · Chapter 12

What is a common role of NVIDIA NeMo in generative AI workflows?

Model customization, training, and adaptation workflows
Layer-1 network switching only
Vector database replication only
Replacing model serving systems

Show answer

Correct: Model customization, training, and adaptation workflows

NeMo is often used for model development, fine-tuning, and related enterprise AI workflows.

Set 3

Set 3 · Question 1 · Chapter 1

What is a common benefit of mixed precision during training?

Lower memory use and faster throughput on supported hardware
Guaranteed elimination of overfitting
Removal of the need for checkpointing
Perfect multimodal alignment by default

Show answer

Correct: Lower memory use and faster throughput on supported hardware

Mixed precision commonly improves efficiency, though stability and validation still matter.

Set 3 · Question 2 · Chapter 2

When is an encoder-decoder architecture often preferable to a decoder-only model?

When the task depends on structured source-to-target transformation
When the system has no input modality
When retrieval must be disabled
When model weights cannot be updated

Show answer

Correct: When the task depends on structured source-to-target transformation

Encoder-decoder designs work well when there is a clear source signal that must be encoded and then transformed into a target output.

Set 3 · Question 3 · Chapter 3

What does late fusion mean in a multimodal pipeline?

Combining modality-specific outputs or representations at a later decision stage
Combining every modality before any encoding happens
Serving only one modality at inference time
Deleting metadata after ingestion

Show answer

Correct: Combining modality-specific outputs or representations at a later decision stage

Late fusion keeps modality processing separate longer and merges signals closer to the final decision stage.
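A minimal sketch of decision-level late fusion follows: each modality produces its own class probabilities, and the final step is a weighted average. The modality names, probabilities, and the `late_fusion` helper are illustrative assumptions; learned fusion layers are a common alternative.

```python
def late_fusion(per_modality_probs, weights=None):
    """Weighted average of class probabilities from separate modality models."""
    names = list(per_modality_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    num_classes = len(per_modality_probs[names[0]])
    return [
        sum(weights[name] * per_modality_probs[name][k] for name in names) / total_weight
        for k in range(num_classes)
    ]

# Hypothetical outputs of three independent models for classes [0, 1].
probs = {"image": [0.7, 0.3], "audio": [0.4, 0.6], "text": [0.6, 0.4]}
fused = late_fusion(probs)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

Because each model runs independently, a missing modality can be dropped from the dictionary without changing the rest of the pipeline.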

Set 3 · Question 4 · Chapter 4

For text-to-speech quality, what does alignment primarily help with?

Mapping generated speech timing to the intended text structure
Reducing GPU clock speed
Replacing phoneme modeling completely
Avoiding all acoustic features

Show answer

Correct: Mapping generated speech timing to the intended text structure

Good alignment helps generated speech follow the text correctly in timing and sequence.

Set 3 · Question 5 · Chapter 5

Why is OCR often relevant in multimodal vision systems?

Because useful visual meaning may be embedded as text inside images or documents
Because OCR replaces all object detection tasks
Because OCR removes the need for captions
Because OCR guarantees cross-lingual grounding

Show answer

Correct: Because useful visual meaning may be embedded as text inside images or documents

Many real-world multimodal tasks involve images that contain important text content.

Set 3 · Question 6 · Chapter 6

In an ACE-style assistant, what does orchestration coordinate?

Reasoning, speech, animation, and response timing across components
Only DNS records for the frontend
Only storage lifecycle rules
Only keyboard shortcuts in the UI

Show answer

Correct: Reasoning, speech, animation, and response timing across components

A digital human experience depends on coordinating several subsystems into one coherent interaction loop.

Set 3 · Question 7 · Chapter 7

Why is metadata valuable in multimodal data engineering?

It helps trace provenance, modality pairing, splits, and filtering logic
It makes evaluation unnecessary
It guarantees balanced datasets
It removes the need for schema versioning

Show answer

Correct: It helps trace provenance, modality pairing, splits, and filtering logic

Metadata supports governance, reproducibility, and operational reliability across complex datasets.

Set 3 · Question 8 · Chapter 8

Why are traditional language-only metrics insufficient for many multimodal tasks?

Because multimodal quality also depends on grounding, perception, and interaction fidelity
Because language outputs cannot be scored
Because multimodal models do not generate text
Because only humans may evaluate any multimodal task

Show answer

Correct: Because multimodal quality also depends on grounding, perception, and interaction fidelity

Multimodal systems often need evaluation beyond text overlap, including alignment to visual or audio evidence.

Set 3 · Question 9 · Chapter 9

What is a practical benefit of model distillation?

Creating a smaller model that can preserve much of a larger model’s behavior
Removing the need for retraining
Guaranteeing lower cloud cost with no testing
Eliminating all hallucinations

Show answer

Correct: Creating a smaller model that can preserve much of a larger model’s behavior

Distillation can improve serving efficiency, but the student model must still be validated for quality.

Set 3 · Question 10 · Chapter 10

Why is a fallback path useful in a production multimodal system?

It helps degrade gracefully when a model, dependency, or modality path fails
It guarantees higher benchmark accuracy
It removes the need for monitoring
It replaces incident response training

Show answer

Correct: It helps degrade gracefully when a model, dependency, or modality path fails

Fallbacks preserve service continuity when parts of the pipeline become unavailable or unreliable.

Set 3 · Question 11 · Chapter 11

What is the purpose of red-teaming in responsible AI?

Stress-testing the system for safety, abuse, and failure modes before or during deployment
Increasing GPU utilization only
Replacing all benchmark testing
Encrypting training data at rest

Show answer

Correct: Stress-testing the system for safety, abuse, and failure modes before or during deployment

Red-teaming intentionally probes for harmful or fragile behaviors so defenses can be improved.

Set 3 · Question 12 · Chapter 12

How does NVIDIA NIM fit into the ecosystem?

It packages inference endpoints and deployment interfaces for model serving workflows
It is a data labeling platform only
It replaces every training pipeline
It is a GPU interconnect standard

Show answer

Correct: It packages inference endpoints and deployment interfaces for model serving workflows

NIM helps standardize how models are exposed and operated in deployment environments.

Set 4

Set 4 · Question 1 · Chapter 1

What problem does gradient clipping most directly address?

Exploding gradients during optimization
Tokenization mismatch
Slow DNS resolution
Label imbalance in inference

Show answer

Correct: Exploding gradients during optimization

Gradient clipping limits unstable update magnitudes and helps maintain training stability.
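Clipping by global norm is one widely used variant: if the combined gradient norm exceeds a threshold, every gradient is scaled down by the same factor. The sketch below works on a flat list of floats and is an assumption-level illustration, not a framework API.

```python
import math

def clip_by_global_norm(gradients, max_norm):
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm == 0 or total_norm <= max_norm:
        return list(gradients), total_norm
    scale = max_norm / total_norm
    # Uniform scaling preserves the update direction and only shrinks its size.
    return [g * scale for g in gradients], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Gradients already within the threshold pass through unchanged, so clipping only intervenes on unusually large steps.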

Set 4 · Question 2 · Chapter 2

What is a typical benefit of few-shot prompting?

Improved output format adherence and task pattern consistency
Permanent weight updates inside the model
Guaranteed safe behavior without moderation
Higher GPU memory capacity

Show answer

Correct: Improved output format adherence and task pattern consistency

Few-shot examples often give the model a clearer pattern for how to respond.
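As a sketch of how such a prompt might be assembled, the helper below places a few input/output pairs ahead of the real query. The `Input:`/`Output:` labels, the example texts, and the function name are assumptions; the exact template depends on the model being used.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the final query."""
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # The trailing "Output:" cues the model to continue in the same format.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the image caption as INDOOR or OUTDOOR.",
    [("A dog running on a beach", "OUTDOOR"), ("A desk with two monitors", "INDOOR")],
    "A tent beside a mountain lake",
)
print(prompt)
```

Nothing in the model's weights changes; the examples only shape the context for this one request.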

Set 4 · Question 3 · Chapter 3

Why might retrieval be added to a multimodal application?

To supply relevant external context that improves grounded responses
To avoid using embeddings entirely
To replace all training data
To disable modality fusion

Show answer

Correct: To supply relevant external context that improves grounded responses

Retrieval can strengthen grounding by bringing in fresh or domain-specific context at inference time.
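The core of a retrieval step can be sketched as a similarity search over embeddings. The documents and vectors below are made up for illustration, and `retrieve` is a hypothetical helper; a production system would use a real embedding model and a vector index.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_embedding, documents, k=2):
    """Return the texts of the k documents most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:k]]

# Toy 3-dimensional embeddings, invented for the example.
documents = [
    {"text": "GPU memory limits", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Return policy", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Batch size tuning", "embedding": [0.8, 0.3, 0.1]},
]
context = retrieve([1.0, 0.2, 0.0], documents, k=2)
print(context)
```

The retrieved texts would then be added to the prompt so the model can ground its answer in them.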

Set 4 · Question 4 · Chapter 4

What does speaker diarization try to solve?

Determining who spoke when in a multi-speaker audio stream
Improving image segmentation masks
Reducing tokenizer vocabulary size
Selecting the best GPU instance family

Show answer

Correct: Determining who spoke when in a multi-speaker audio stream

Diarization separates and tracks speakers across an audio stream, which is important in conversations and meetings.

Set 4 · Question 5 · Chapter 5

What does cross-attention from text to image regions help a model do?

Focus language generation on relevant visual evidence
Remove the need for training data
Convert every image into audio features
Disable prompt conditioning

Show answer

Correct: Focus language generation on relevant visual evidence

Cross-attention helps the model tie language outputs to the appropriate parts of the visual input.
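The mechanism can be illustrated with scaled dot-product attention for a single text query over a few image-region vectors. The vectors are toy values and the function names are assumptions; real models use learned projections and many attention heads.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_query, region_keys, region_values):
    """Attend from one text query vector to a set of image regions."""
    scale = math.sqrt(len(text_query))
    scores = [
        sum(q * k for q, k in zip(text_query, key)) / scale
        for key in region_keys
    ]
    weights = softmax(scores)
    dim = len(region_values[0])
    attended = [
        sum(w * value[i] for w, value in zip(weights, region_values))
        for i in range(dim)
    ]
    return weights, attended

# Three hypothetical regions; the first key points the same way as the query.
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, attended = cross_attention([1.0, 0.0], keys, values)
print(weights, attended)
```

The region whose key best matches the text query receives the largest weight, so its value dominates what the language side sees.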

Set 4 · Question 6 · Chapter 6

Why should persona and safety policy stay aligned in a digital human experience?

Because inconsistent behavior reduces trust and can create unsafe interactions
Because it increases token throughput automatically
Because it removes the need for speech synthesis
Because it guarantees realism regardless of latency

Show answer

Correct: Because inconsistent behavior reduces trust and can create unsafe interactions

Users notice mismatches between tone, behavior, and policy quickly, so consistency is important for safety and trust.

Set 4 · Question 7 · Chapter 7

Which data-governance concern is especially important for multimodal datasets?

Consent, licensing, and provenance across every modality
Using one file format only
Avoiding all data augmentation
Keeping train and test sets identical

Show answer

Correct: Consent, licensing, and provenance across every modality

Multimodal data can combine text, audio, images, and metadata, so governance must cover each source clearly.

Set 4 · Question 8 · Chapter 8

What does online monitoring help detect after deployment?

Drift, degradation, or emerging failure patterns in real usage
Only local syntax errors in the source code
Only whether a prompt is grammatically correct
Only GPU hardware serial numbers

Show answer

Correct: Drift, degradation, or emerging failure patterns in real usage

Production monitoring helps teams spot changes in data, quality, or reliability before they become larger incidents.

Set 4 · Question 9 · Chapter 9

What is the main reason to connect autoscaling to latency and queueing signals?

To maintain service responsiveness under variable demand
To guarantee perfect output accuracy
To remove the need for profiling
To replace batching and caching

Show answer

Correct: To maintain service responsiveness under variable demand

Autoscaling should respond to real load and latency pressure so user experience remains stable as demand changes.

Set 4 · Question 10 · Chapter 10

What should a production system do when retrieval or a dependent model path fails?

Fail gracefully with fallback behavior and clear operational signals
Silently fabricate unsupported answers
Disable logging to reduce overhead
Delete the request from telemetry

Show answer

Correct: Fail gracefully with fallback behavior and clear operational signals

Graceful degradation and visibility are safer than silently returning unsupported or misleading outputs.
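A minimal sketch of that pattern: wrap the dependent call, log the failure so operators can see it, and return a clearly marked degraded response instead of inventing an answer. The function names, the response shape, and the broad `except Exception` are assumptions for illustration; production code would catch narrower error types and emit metrics.

```python
import logging

logger = logging.getLogger("multimodal_service")

def answer_with_fallback(question, retrieve, generate):
    """Call retrieval and generation, degrading visibly if retrieval fails."""
    try:
        context = retrieve(question)
    except Exception:
        # Operational signal: the failure is logged rather than hidden.
        logger.warning("retrieval failed; serving degraded response", exc_info=True)
        return {
            "answer": "I can't look that up right now. Please try again shortly.",
            "degraded": True,
        }
    return {"answer": generate(question, context), "degraded": False}

# Hypothetical dependencies used only to exercise both paths.
def failing_retrieve(question):
    raise TimeoutError("vector store unavailable")

def working_retrieve(question):
    return ["Reset by holding the power button for ten seconds."]

def generate(question, context):
    return f"Based on the manual: {context[0]}"

degraded = answer_with_fallback("How do I reset it?", failing_retrieve, generate)
healthy = answer_with_fallback("How do I reset it?", working_retrieve, generate)
print(degraded["degraded"], healthy["degraded"])
```

The `degraded` flag lets downstream components and dashboards distinguish fallback responses from normal ones.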

Set 4 · Question 11 · Chapter 11

Why is PII handling part of trustworthy multimodal AI operations?

Because inputs and outputs may contain sensitive data that must be protected and governed
Because PII only affects training, not inference
Because privacy risks vanish when prompts are short
Because multimodal systems cannot process personal data

Show answer

Correct: Because inputs and outputs may contain sensitive data that must be protected and governed

Trustworthy AI includes privacy controls for both data flowing into the system and content flowing out.

Set 4 · Question 12 · Chapter 12

Which NVIDIA capability is especially relevant when optimizing high-performance inference on NVIDIA GPUs?

TensorRT-based optimization
Spreadsheet pivot tables
BGP unnumbered
Route reflection

Show answer

Correct: TensorRT-based optimization

TensorRT optimizes trained models for inference on NVIDIA GPUs, using techniques such as layer fusion and reduced-precision execution to lower latency and raise throughput.
