

NCA-GENL Mock Questions

Multiple-choice practice to test concept understanding

80 total questions · 4 sets · 20 per set · no duplicates

How to Use

  1. Attempt each set without opening answers first.
  2. Use explanations to identify weak domains and chapter gaps.
  3. Repeat missed questions after reviewing related chapter pages.

Set 1

Set 1 · Question 1 · Chapter 1

What target does self-supervised LLM pretraining usually optimize?

  1. Next-token prediction on unlabeled text
  2. Human preference rankings only
  3. Image segmentation masks
  4. SQL query execution accuracy
Show answer

Correct: Next-token prediction on unlabeled text

LLM pretraining typically uses next-token prediction built from raw text without manual labels.
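A minimal sketch of how next-token targets come from the text itself (the token IDs here are hypothetical, standing in for real tokenizer output):

```python
# Self-supervised next-token targets: shift the token sequence by one.
# No manual labels are needed; the raw text supplies the targets.
token_ids = [12, 7, 99, 4, 31]  # hypothetical IDs from a tokenizer

inputs = token_ids[:-1]   # model sees tokens 0..n-2
targets = token_ids[1:]   # model must predict tokens 1..n-1

print(inputs)   # [12, 7, 99, 4]
print(targets)  # [7, 99, 4, 31]
```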

Set 1 · Question 2 · Chapter 1

In a support assistant, which component is usually discriminative?

  1. Intent classifier that routes tickets
  2. Model generating final answer text
  3. Image generator producing screenshots
  4. Synthetic data creator
Show answer

Correct: Intent classifier that routes tickets

Classification/routing is a discriminative task; answer generation is generative.

Set 1 · Question 3 · Chapter 1

What is a practical benefit of transfer learning for enterprise teams?

  1. Lower adaptation time and compute cost
  2. Guaranteed zero hallucinations
  3. No need for evaluation
  4. No tokenizer required
Show answer

Correct: Lower adaptation time and compute cost

Starting from pretrained weights reduces data, time, and compute needed for downstream tasks.

Set 1 · Question 4 · Chapter 1

Which loss is standard for token prediction tasks?

  1. Mean squared error
  2. Cross-entropy
  3. Hinge loss
  4. Huber loss
Show answer

Correct: Cross-entropy

Cross-entropy aligns naturally with probabilistic next-token prediction.
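For a single next-token step, cross-entropy reduces to the negative log-probability the model assigned to the true token; a small sketch with hypothetical probabilities:

```python
import math

def cross_entropy(probs, true_index):
    """Negative log-likelihood of the true token under the model's distribution."""
    return -math.log(probs[true_index])

# Model distribution over a tiny 3-token vocabulary (hypothetical values).
probs = [0.7, 0.2, 0.1]
loss_confident = cross_entropy(probs, 0)  # true token got high probability
loss_wrong = cross_entropy(probs, 2)      # true token got low probability
print(round(loss_confident, 3))  # 0.357
print(round(loss_wrong, 3))      # 2.303
```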

Set 1 · Question 5 · Chapter 1

Which technique most directly helps stabilize exploding gradients?

  1. Gradient clipping
  2. Vocabulary expansion
  3. Prompt caching
  4. Label smoothing only
Show answer

Correct: Gradient clipping

Gradient clipping limits update magnitude and helps prevent unstable loss spikes.
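A minimal sketch of clipping by global norm, the common variant (gradients shown as a flat list of floats for simplicity):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale gradients down if their global L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [3.0, 4.0]                       # global norm = 5.0
clipped = clip_by_global_norm(grads, 1.0)
print(clipped)                           # scaled so the norm is now 1.0
```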

Set 1 · Question 6 · Chapter 2

Which transformer family is dominant for autoregressive chat generation?

  1. Encoder-only
  2. Decoder-only
  3. Encoder-decoder only
  4. CNN-RNN hybrid
Show answer

Correct: Decoder-only

Decoder-only models predict the next token from prior context and are common for chat.

Set 1 · Question 7 · Chapter 2

Why are positional encodings needed in transformers?

  1. To add token order information
  2. To compress model weights
  3. To remove attention heads
  4. To replace tokenization
Show answer

Correct: To add token order information

Attention alone is permutation-invariant, so position signals are needed for sequence order.
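A sketch of the classic sinusoidal encoding from the original transformer paper, with a hypothetical model width of 8 to keep it readable:

```python
import math

def sinusoidal_position(pos, dim, d_model=8):
    """One element of the sinusoidal positional encoding."""
    angle = pos / (10000 ** (2 * (dim // 2) / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

# Different positions get different encoding vectors, giving attention
# (which is otherwise permutation-invariant) a notion of token order.
pe_pos0 = [sinusoidal_position(0, d) for d in range(8)]
pe_pos1 = [sinusoidal_position(1, d) for d in range(8)]
print(pe_pos0[:2])  # [0.0, 1.0]
```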

Set 1 · Question 8 · Chapter 2

What is the core value of multi-head attention?

  1. It allows multiple relational views in parallel
  2. It removes the need for FFN blocks
  3. It guarantees factual correctness
  4. It makes context windows infinite
Show answer

Correct: It allows multiple relational views in parallel

Different heads can learn complementary relationships such as syntax and long-range dependencies.

Set 1 · Question 9 · Chapter 2

Which statement about context windows is correct?

  1. Larger windows always improve quality
  2. Only user text counts toward token limits
  3. Long context can still degrade quality if retrieval is noisy
  4. Context limits affect training only, not inference
Show answer

Correct: Long context can still degrade quality if retrieval is noisy

Large windows still require relevant, clean context; irrelevant chunks can hurt answer quality.

Set 1 · Question 10 · Chapter 2

What does the KV cache primarily improve during decoding?

  1. Data privacy
  2. Generation latency
  3. Tokenizer quality
  4. Model alignment
Show answer

Correct: Generation latency

Reusing prior key/value states avoids redundant computation and reduces token generation latency.

Set 1 · Question 11 · Chapter 3

Which sequence best describes lifecycle order for many LLM projects?

  1. Fine-tuning -> pretraining -> tokenization
  2. Pretraining -> adaptation (SFT/PEFT) -> deployment
  3. Deployment -> pretraining -> labeling
  4. Prompting -> backpropagation -> retrieval
Show answer

Correct: Pretraining -> adaptation (SFT/PEFT) -> deployment

General capability is built in pretraining, then adapted and deployed for specific tasks.

Set 1 · Question 12 · Chapter 3

Supervised fine-tuning (SFT) depends on which data format?

  1. Labeled input-output examples
  2. Only unlabeled web crawl text
  3. Only telemetry logs
  4. Only embeddings without text
Show answer

Correct: Labeled input-output examples

SFT uses curated labeled examples to shape task behavior and response style.
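SFT datasets are commonly stored as JSON Lines with one input-output pair per line; a sketch with hypothetical records and field names:

```python
import json

# Hypothetical SFT records: each example pairs an input with a target output.
sft_records = [
    {"prompt": "Summarize: GPUs accelerate training.", "response": "GPUs speed up training."},
    {"prompt": "Translate to French: hello", "response": "bonjour"},
]

# One JSON object per line (JSONL), a common storage format for SFT data.
jsonl = "\n".join(json.dumps(r) for r in sft_records)
first = json.loads(jsonl.splitlines()[0])
print(sorted(first.keys()))  # ['prompt', 'response']
```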

Set 1 · Question 13 · Chapter 3

What is the main purpose of instruction tuning?

  1. Increase PCIe bandwidth
  2. Improve instruction following and response formatting
  3. Replace tokenizer training
  4. Disable model regularization
Show answer

Correct: Improve instruction following and response formatting

Instruction tuning improves how reliably the model follows user intent and format constraints.

Set 1 · Question 14 · Chapter 3

What does data parallelism do?

  1. Splits one model layer across GPUs
  2. Replicates model copies and splits batches across workers
  3. Caches retrieval chunks in vector DB
  4. Converts FP16 to INT8 at runtime
Show answer

Correct: Replicates model copies and splits batches across workers

Data parallelism keeps model replicas on workers and distributes input batches.

Set 1 · Question 15 · Chapter 3

When is model parallelism most needed?

  1. When a model does not fit on a single device
  2. When prompts are short
  3. When using only CPU inference
  4. When no checkpointing is required
Show answer

Correct: When a model does not fit on a single device

Model parallelism partitions large models across multiple devices when single-device memory is insufficient.

Set 1 · Question 16 · Chapter 3

What is a common benefit of mixed precision training?

  1. Lower memory use and faster training
  2. Guaranteed best perplexity
  3. Automatic bias removal
  4. No need for optimizer tuning
Show answer

Correct: Lower memory use and faster training

FP16/BF16 usually reduce memory pressure and improve throughput.

Set 1 · Question 17 · Chapter 3

What problem does gradient accumulation solve?

  1. Merges vector databases
  2. Emulates larger effective batch size with limited memory
  3. Adds long-context support
  4. Creates synthetic labels
Show answer

Correct: Emulates larger effective batch size with limited memory

It accumulates gradients over micro-batches before update steps.
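A minimal sketch, with gradients as plain float lists: averaging over micro-batch gradients before the optimizer step reproduces the gradient of one larger batch.

```python
def accumulated_update(micro_batch_grads, accum_steps):
    """Average gradients over micro-batches, then take one optimizer step."""
    assert len(micro_batch_grads) == accum_steps
    total = [0.0] * len(micro_batch_grads[0])
    for grads in micro_batch_grads:          # one forward/backward per micro-batch
        total = [t + g for t, g in zip(total, grads)]
    return [t / accum_steps for t in total]  # equals the big-batch gradient

# Four micro-batches emulate one batch four times larger.
update = accumulated_update([[1.0, 2.0], [3.0, 4.0], [1.0, 0.0], [3.0, 2.0]], 4)
print(update)  # [2.0, 2.0]
```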

Set 1 · Question 18 · Chapter 3

Why is checkpointing essential during long training runs?

  1. It guarantees fairness compliance
  2. It enables recovery and reproducibility
  3. It replaces evaluation metrics
  4. It reduces tokenizer size
Show answer

Correct: It enables recovery and reproducibility

Checkpoints allow restart after failure and provide auditable run artifacts.

Set 1 · Question 19 · Chapter 3

What distinguishes AdamW from classic Adam in practice?

  1. Decoupled weight decay from gradient update
  2. No learning rate required
  3. Only works on CPUs
  4. Cannot train transformers
Show answer

Correct: Decoupled weight decay from gradient update

AdamW applies weight decay in a decoupled way and is widely used in modern training pipelines.
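A scalar sketch of one AdamW step showing the decoupling: weight decay is applied to the weight directly rather than folded into the gradient that feeds the Adam moment estimates (hyperparameter values are illustrative defaults):

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One scalar AdamW step with decoupled weight decay."""
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # Adam update
    w = w - lr * weight_decay * w                  # decoupled decay on w itself
    return w, m, v

w, m, v = adamw_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(w)  # slightly below 1.0 after one update plus decay
```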

Set 1 · Question 20 · Chapter 3

Why is learning-rate warmup commonly used at training start?

  1. To stabilize early optimization before full step sizes
  2. To freeze all model layers permanently
  3. To increase tokenizer vocabulary quickly
  4. To bypass gradient computation
Show answer

Correct: To stabilize early optimization before full step sizes

Warmup prevents unstable early updates by ramping learning rate gradually.
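A sketch of a linear warmup schedule (base rate and step count are hypothetical values):

```python
def lr_with_warmup(step, base_lr=3e-4, warmup_steps=100):
    """Linear warmup: ramp from near 0 to base_lr, then hold."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(lr_with_warmup(0))    # tiny first step
print(lr_with_warmup(49))   # halfway through warmup
print(lr_with_warmup(500))  # full base_lr after warmup
```

In practice the post-warmup phase is usually a decay schedule (cosine or linear) rather than a constant rate.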

Set 2

Set 2 · Question 1 · Chapter 4

When does few-shot prompting usually help most?

  1. When output format consistency is important
  2. When you want to disable system prompts
  3. When training from scratch
  4. When reducing GPU clock speed
Show answer

Correct: When output format consistency is important

Few-shot examples anchor format and behavior for tasks with strict output expectations.

Set 2 · Question 2 · Chapter 4

What is the role of a system prompt in an LLM application?

  1. High-level behavior and policy instruction layer
  2. Vector index storage layer
  3. GPU scheduling layer
  4. Checkpoint serialization format
Show answer

Correct: High-level behavior and policy instruction layer

System prompts define global instructions and boundaries for model behavior.

Set 2 · Question 3 · Chapter 4

What effect does increasing temperature generally have?

  1. More deterministic outputs
  2. More sampling randomness
  3. Lower token count limits
  4. Higher embedding dimensions
Show answer

Correct: More sampling randomness

Higher temperature increases variability in token sampling.
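Temperature divides the logits before the softmax; a sketch with hypothetical logits showing how low temperature sharpens the distribution and high temperature flattens it:

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cool = softmax_with_temperature(logits, 0.5)  # sharper: top token dominates
warm = softmax_with_temperature(logits, 2.0)  # flatter: more randomness
print(round(cool[0], 3), round(warm[0], 3))   # top token far likelier when cool
```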

Set 2 · Question 4 · Chapter 4

What does top-k sampling do?

  1. Selects from top k probable tokens each step
  2. Selects from tokens until cumulative probability p
  3. Always selects the maximum probability token
  4. Selects random tokens from full vocabulary
Show answer

Correct: Selects from top k probable tokens each step

Top-k restricts sampling to a fixed-size candidate set of likely tokens.
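A sketch of the filtering step: keep the k most probable tokens and renormalize before sampling (probabilities are hypothetical):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize their mass."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

probs = [0.5, 0.3, 0.15, 0.05]
candidates = top_k_filter(probs, k=2)
print(candidates)  # only tokens 0 and 1 remain, renormalized to sum to 1
```

Top-p (nucleus) sampling differs in that the candidate set grows or shrinks until cumulative probability reaches p, rather than being a fixed size.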

Set 2 · Question 5 · Chapter 4

What is a common beam search tradeoff?

  1. Higher diversity with lower compute
  2. Potentially better sequence likelihood with higher compute cost
  3. No dependence on logits
  4. Eliminates need for prompt engineering
Show answer

Correct: Potentially better sequence likelihood with higher compute cost

Beam search explores multiple candidate sequences but costs more compute.

Set 2 · Question 6 · Chapter 4

Which setup is most deterministic for repeated runs?

  1. Greedy decoding with fixed prompts
  2. High temperature with top-p
  3. Random seed omitted with sampling
  4. Few-shot with stochastic decoding
Show answer

Correct: Greedy decoding with fixed prompts

Greedy decoding avoids random sampling and is typically repeatable with fixed conditions.

Set 2 · Question 7 · Chapter 4

Which control best reduces prompt injection risk from retrieved documents?

  1. Treat retrieved content as untrusted and enforce policy boundaries
  2. Increase max tokens
  3. Remove system prompts entirely
  4. Disable metadata filters
Show answer

Correct: Treat retrieved content as untrusted and enforce policy boundaries

Injection defense depends on trust boundaries, sanitization, and strict policy enforcement.

Set 2 · Question 8 · Chapter 4

Why use structured output schemas in production?

  1. To improve downstream parsing reliability
  2. To train larger foundation models
  3. To avoid evaluation completely
  4. To disable function calling
Show answer

Correct: To improve downstream parsing reliability

Schema-constrained outputs reduce parser failures and integration errors.

Set 2 · Question 9 · Chapter 4

What is the main purpose of function calling in LLM workflows?

  1. Map model output to explicit tool/action interfaces
  2. Increase parameter count
  3. Compress checkpoints
  4. Replace retrieval systems
Show answer

Correct: Map model output to explicit tool/action interfaces

Function calling turns model intent into structured, callable actions.

Set 2 · Question 10 · Chapter 5

What core problem does RAG solve?

  1. Grounding responses with external knowledge
  2. Replacing tokenization
  3. Eliminating need for inference optimization
  4. Automatically tuning reward models
Show answer

Correct: Grounding responses with external knowledge

RAG connects generation to retrievable evidence for fresher and auditable answers.

Set 2 · Question 11 · Chapter 5

What is an embedding in RAG systems?

  1. A dense vector representing semantic meaning
  2. A compressed checkpoint archive
  3. A decoding temperature schedule
  4. A GPU utilization metric
Show answer

Correct: A dense vector representing semantic meaning

Embeddings map text (or other data) into vectors used for similarity search.
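Similarity search over embeddings typically uses cosine similarity; a sketch with hypothetical 3-dimensional vectors (real embedding models use hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query = [0.9, 0.1, 0.0]
doc_related = [0.8, 0.2, 0.1]
doc_unrelated = [0.0, 0.1, 0.9]
# The semantically closer document scores higher similarity.
print(cosine_similarity(query, doc_related) > cosine_similarity(query, doc_unrelated))  # True
```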

Set 2 · Question 12 · Chapter 5

Why use overlapping chunks when indexing documents?

  1. To preserve context across chunk boundaries
  2. To reduce vocabulary size
  3. To force deterministic decoding
  4. To disable reranking
Show answer

Correct: To preserve context across chunk boundaries

Overlap helps avoid losing important context that spans adjacent segments.
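A minimal character-level chunker sketch (production pipelines usually chunk by tokens or sentences; sizes here are illustrative):

```python
def chunk_text(text, chunk_size=20, overlap=5):
    """Fixed-size chunks with overlap so context spans chunk boundaries."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("Retrieval works best when chunks keep their context.", 20, 5)
# Each chunk repeats the last 5 characters of the previous chunk.
print(chunks[0][-5:] == chunks[1][:5])  # True
```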

Set 2 · Question 13 · Chapter 5

What is the benefit of hybrid search (keyword + semantic)?

  1. Better recall and precision balance
  2. Lower need for embeddings
  3. No need for chunking strategy
  4. Guaranteed zero latency
Show answer

Correct: Better recall and precision balance

Combining lexical and semantic retrieval can improve both coverage and relevance.

Set 2 · Question 14 · Chapter 5

What is reranking used for in a retrieval pipeline?

  1. Improve relevance ordering among initial retrieved candidates
  2. Convert vectors to images
  3. Increase context window limit
  4. Train tokenizer merge rules
Show answer

Correct: Improve relevance ordering among initial retrieved candidates

Reranking refines candidate ordering so better evidence reaches generation.

Set 2 · Question 15 · Chapter 5

Why apply metadata filtering during retrieval?

  1. Limit retrieval to valid source scope or tenant boundaries
  2. Increase model width
  3. Reduce prompt template size
  4. Bypass access control
Show answer

Correct: Limit retrieval to valid source scope or tenant boundaries

Metadata constraints enforce relevance and governance requirements.

Set 2 · Question 16 · Chapter 6

What is PEFT primarily optimizing for?

  1. Lower adaptation cost with smaller trainable parameter sets
  2. Higher tokenization speed only
  3. Replacing evaluation metrics
  4. Removing all adaptation artifacts
Show answer

Correct: Lower adaptation cost with smaller trainable parameter sets

PEFT updates a small subset of parameters to reduce training and storage cost.

Set 2 · Question 17 · Chapter 6

What is the key idea behind LoRA updates?

  1. Low-rank trainable matrices approximate weight deltas
  2. Replace attention with convolutions
  3. Remove residual connections
  4. Use only greedy decoding
Show answer

Correct: Low-rank trainable matrices approximate weight deltas

LoRA learns compact low-rank updates rather than full-matrix modifications.
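A tiny numeric sketch of the LoRA idea: the frozen weight W gets an additive update (alpha/r)·BA, where B and A are small trainable matrices of rank r (all values here are hypothetical):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Frozen 2x2 base weight; LoRA trains only B (2x1) and A (1x2): rank r = 1,
# so 4 trainable values instead of 4 full weights -- the gap grows with size.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.25]]
A = [[0.2, 0.4]]
alpha, r = 2.0, 1

delta = matmul(B, A)  # rank-1 weight delta
W_adapted = [[w + (alpha / r) * d for w, d in zip(wr, dr)]
             for wr, dr in zip(W, delta)]
print(W_adapted)
```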

Set 2 · Question 18 · Chapter 6

What does prompt tuning modify?

  1. Virtual prompt embeddings rather than core model weights
  2. All transformer layers
  3. Vector database indexes
  4. GPU firmware
Show answer

Correct: Virtual prompt embeddings rather than core model weights

Prompt tuning learns input-side embeddings and keeps most model weights frozen.

Set 2 · Question 19 · Chapter 6

What is a practical risk of model merging?

  1. Behavior regressions without strong validation
  2. Guaranteed incompatibility with adapters
  3. Inability to run on GPUs
  4. Automatic policy compliance
Show answer

Correct: Behavior regressions without strong validation

Merged checkpoints can introduce unexpected quality or safety regressions.

Set 2 · Question 20 · Chapter 6

Which factor most influences adaptation method choice?

  1. Quality target, compute budget, and deployment constraints
  2. Color theme of UI dashboard
  3. Number of newsletter subscribers
  4. CPU fan profile
Show answer

Correct: Quality target, compute budget, and deployment constraints

Method selection is a tradeoff between required quality and operational cost/complexity.

Set 3

Set 3 · Question 1 · Chapter 7

What is the reward model output used for in RLHF?

  1. Preference-aligned quality scores for candidate responses
  2. Direct token generation
  3. Embedding similarity only
  4. GPU allocation schedules
Show answer

Correct: Preference-aligned quality scores for candidate responses

The reward model estimates preference quality and guides policy optimization.

Set 3 · Question 2 · Chapter 7

Which step comes before policy optimization in a basic RLHF pipeline?

  1. Train reward model from preference data
  2. Quantize deployment model
  3. Build vector index
  4. Run load test
Show answer

Correct: Train reward model from preference data

Preference data is used to train reward scoring before updating policy behavior.

Set 3 · Question 3 · Chapter 7

Which failure mode is associated with over-optimizing reward models?

  1. Reward hacking
  2. Tokenizer dropout
  3. Context window overflow
  4. Kernel panic
Show answer

Correct: Reward hacking

Models may exploit reward-model shortcuts rather than improving true usefulness.

Set 3 · Question 4 · Chapter 7

What does Constitutional AI emphasize?

  1. Principle-based self-critique and revision
  2. Only supervised classification
  3. Only rule-based chatbot behavior
  4. Removing human evaluation
Show answer

Correct: Principle-based self-critique and revision

Constitutional AI uses explicit principles to critique and refine responses.

Set 3 · Question 5 · Chapter 7

What is the objective of safety alignment?

  1. Reduce unsafe or policy-violating outputs
  2. Maximize context window size
  3. Increase model parameter count
  4. Eliminate all refusals
Show answer

Correct: Reduce unsafe or policy-violating outputs

Safety alignment steers behavior toward safer, policy-consistent outputs.

Set 3 · Question 6 · Chapter 8

Perplexity is primarily used to measure what?

  1. How well a model predicts token sequences
  2. GPU memory temperature
  3. Prompt injection risk
  4. RAG retrieval freshness
Show answer

Correct: How well a model predicts token sequences

Perplexity reflects predictive uncertainty over token sequences.
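Perplexity is the exponential of the average negative log-likelihood over predicted tokens; a sketch with hypothetical per-token probabilities:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood over predicted tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = perplexity([0.9, 0.8, 0.95])  # model predicts tokens well
uncertain = perplexity([0.1, 0.2, 0.05])  # model is frequently surprised
print(confident < uncertain)  # True -> lower perplexity is better
```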

Set 3 · Question 7 · Chapter 8

Which metric family is commonly used in summarization evaluation?

  1. ROUGE
  2. IoU
  3. AUC-ROC
  4. PSNR
Show answer

Correct: ROUGE

ROUGE compares overlap patterns and is a common summarization baseline metric.

Set 3 · Question 8 · Chapter 8

What does tokens/sec indicate in inference monitoring?

  1. Generation throughput speed
  2. Ground-truth factuality
  3. Dataset labeling quality
  4. Prompt template complexity
Show answer

Correct: Generation throughput speed

Tokens/sec measures how fast the system generates output tokens.

Set 3 · Question 9 · Chapter 8

Which pair best captures a common serving tradeoff?

  1. Higher batching can increase throughput but add latency
  2. Higher throughput always lowers latency
  3. Lower latency always lowers cost
  4. Higher accuracy always lowers GPU use
Show answer

Correct: Higher batching can increase throughput but add latency

Batching improves utilization/throughput but queueing can increase response delay.

Set 3 · Question 10 · Chapter 8

What is the purpose of hallucination detection checks?

  1. Find plausible outputs unsupported by evidence
  2. Increase context window length
  3. Compress model checkpoints
  4. Expand tokenizer vocabulary
Show answer

Correct: Find plausible outputs unsupported by evidence

Hallucination checks target unsupported claims that sound confident but are not grounded.

Set 3 · Question 11 · Chapter 8

Robustness testing is designed to evaluate what?

  1. Stability under noisy or shifted inputs
  2. Only average BLEU improvement
  3. Only training duration
  4. Only GPU purchase cost
Show answer

Correct: Stability under noisy or shifted inputs

Robustness evaluates how performance holds under input variation and distribution shift.

Set 3 · Question 12 · Chapter 8

What makes adversarial testing different from routine regression tests?

  1. It intentionally uses difficult or malicious inputs
  2. It ignores safety outcomes
  3. It runs only once per year
  4. It does not require expected behaviors
Show answer

Correct: It intentionally uses difficult or malicious inputs

Adversarial tests intentionally stress weak points in behavior and controls.

Set 3 · Question 13 · Chapter 9

Which practice helps reduce bias risk early in the lifecycle?

  1. Dataset auditing and balanced sampling
  2. Disabling evaluation pipelines
  3. Removing all metadata
  4. Increasing temperature
Show answer

Correct: Dataset auditing and balanced sampling

Data quality and representation checks are foundational bias mitigation controls.

Set 3 · Question 14 · Chapter 9

Where is toxicity detection most useful in an LLM app pipeline?

  1. Both input and output filtering stages
  2. Only before tokenization
  3. Only during model pretraining
  4. Only on infrastructure logs
Show answer

Correct: Both input and output filtering stages

Applying toxicity checks at ingress and egress reduces harmful content risk.

Set 3 · Question 15 · Chapter 9

What are guardrails in this course context?

  1. Policy and rule layers constraining model and tool behavior
  2. GPU chassis rails
  3. Learning-rate decay formulas
  4. Vector embedding norms
Show answer

Correct: Policy and rule layers constraining model and tool behavior

Guardrails enforce operational and policy boundaries around model actions.

Set 3 · Question 16 · Chapter 9

What best describes a jailbreak attempt?

  1. An adversarial prompt trying to bypass safety policy
  2. A failed checkpoint restore
  3. A tokenizer training bug
  4. A cloud autoscaling delay
Show answer

Correct: An adversarial prompt trying to bypass safety policy

Jailbreak prompts attempt to circumvent refusal and safety constraints.

Set 3 · Question 17 · Chapter 9

Which control is central to protecting sensitive user data?

  1. Access controls plus data minimization and redaction
  2. Higher top-k sampling
  3. Longer prompts
  4. Lower beam width
Show answer

Correct: Access controls plus data minimization and redaction

Privacy protection depends on strict data handling controls, not decoding settings.

Set 3 · Question 18 · Chapter 9

What does model governance primarily provide?

  1. Ownership, approvals, audit trails, and change control
  2. Automatic context expansion
  3. Guaranteed low latency
  4. Prompt style personalization
Show answer

Correct: Ownership, approvals, audit trails, and change control

Governance establishes accountability and traceability across model lifecycle decisions.

Set 3 · Question 19 · Chapter 9

Why are compliance requirements not one-size-fits-all?

  1. Requirements vary by region and industry
  2. LLMs remove legal obligations
  3. Only cloud providers define compliance
  4. Compliance applies only to training
Show answer

Correct: Requirements vary by region and industry

Regulatory obligations differ by jurisdiction, sector, and data sensitivity.

Set 3 · Question 20 · Chapter 7

What is the operational value of feedback loops after deployment?

  1. Continuous correction and iterative behavior improvement
  2. Permanent model freeze
  3. Elimination of human review
  4. Removal of monitoring
Show answer

Correct: Continuous correction and iterative behavior improvement

Feedback loops close the gap between observed production behavior and target behavior.

Set 4

Set 4 · Question 1 · Chapter 10

What defines a multimodal model?

  1. A model that can process or generate across multiple data modalities
  2. A model that uses only text
  3. A model that runs only on CPUs
  4. A model trained without embeddings
Show answer

Correct: A model that can process or generate across multiple data modalities

Multimodal systems handle combinations of text, image, audio, or video.

Set 4 · Question 2 · Chapter 10

Which use case is most directly aligned with vision-language models?

  1. Answering questions about an image
  2. Replacing distributed training
  3. Compiling CUDA kernels
  4. Computing BLEU for translation
Show answer

Correct: Answering questions about an image

VLMs jointly reason across visual and textual inputs.

Set 4 · Question 3 · Chapter 10

How do diffusion models generate samples at a high level?

  1. Iterative denoising from noise to structured output
  2. Direct nearest-neighbor retrieval
  3. Rule-based template fill
  4. Single-step deterministic mapping only
Show answer

Correct: Iterative denoising from noise to structured output

Diffusion models repeatedly denoise latent noise into coherent outputs.

Set 4 · Question 4 · Chapter 10

What is the core training setup in GANs?

  1. Generator versus discriminator competition
  2. Teacher-student distillation
  3. Prompt-only adaptation
  4. Reinforcement learning with human feedback
Show answer

Correct: Generator versus discriminator competition

GANs use adversarial training between a generator and discriminator.

Set 4 · Question 5 · Chapter 10

Why are cross-modal embeddings useful?

  1. They enable retrieval across text and media in a shared space
  2. They remove the need for metadata
  3. They eliminate token limits
  4. They guarantee legal compliance
Show answer

Correct: They enable retrieval across text and media in a shared space

Shared embedding spaces connect semantics across different modalities.

Set 4 · Question 6 · Chapter 10

What task does image captioning perform?

  1. Generate descriptive text from visual input
  2. Convert text prompts into videos
  3. Rank retrieval documents
  4. Train optimizer schedules
Show answer

Correct: Generate descriptive text from visual input

Image captioning maps visual content into textual descriptions.

Set 4 · Question 7 · Chapter 11

What is a primary effect of inference quantization?

  1. Lower precision can improve speed and reduce memory use
  2. Automatically improves every accuracy metric
  3. Removes need for serving infrastructure
  4. Disables KV caching
Show answer

Correct: Lower precision can improve speed and reduce memory use

Quantization reduces numeric precision to improve efficiency, with quality tradeoffs to validate.
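A minimal sketch of symmetric int8 quantization, the simplest scheme: floats are mapped to integers in [-127, 127] through a single scale factor, and dequantizing recovers approximate values.

```python
def quantize_int8(values):
    """Symmetric int8 quantization: map floats to [-127, 127] via one scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.02, -0.5, 0.31, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Values come back close to, but not exactly, the originals -- the
# precision/quality tradeoff that quantization requires teams to validate.
print(max(abs(w - r) for w, r in zip(weights, restored)) < scale)  # True
```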

Set 4 · Question 8 · Chapter 11

What is pruning intended to do?

  1. Remove less important parameters to reduce serving cost
  2. Increase prompt length
  3. Add new attention heads
  4. Change tokenization strategy
Show answer

Correct: Remove less important parameters to reduce serving cost

Pruning targets redundant parameters to improve efficiency.

Set 4 · Question 9 · Chapter 11

How does knowledge distillation usually work?

  1. A smaller student model learns behavior from a larger teacher
  2. A tokenizer learns from beam search
  3. A retriever learns from CUDA kernels
  4. A scheduler learns from invoices
Show answer

Correct: A smaller student model learns behavior from a larger teacher

Distillation transfers useful behavior into a smaller, cheaper model.

Set 4 · Question 10 · Chapter 11

What is TensorRT used for in this stack?

  1. Optimizing inference execution on NVIDIA hardware
  2. Creating annotation guidelines
  3. Building legal compliance checklists
  4. Generating synthetic datasets
Show answer

Correct: Optimizing inference execution on NVIDIA hardware

TensorRT compiles and optimizes inference graphs/kernels for GPU performance.

Set 4 · Question 11 · Chapter 11

Which approach typically improves perceived latency for chat users?

  1. Streaming inference with incremental token output
  2. Very large batch inference only
  3. Offline-only generation
  4. Disabling KV cache
Show answer

Correct: Streaming inference with incremental token output

Streaming returns partial output quickly, improving user-perceived responsiveness.

Set 4 · Question 12 · Chapter 11

What does autoscaling address in production inference systems?

  1. Dynamic capacity adjustment based on load
  2. Automatic prompt rewriting
  3. Automatic reward model retraining
  4. Automatic tokenizer merges
Show answer

Correct: Dynamic capacity adjustment based on load

Autoscaling increases or decreases serving resources as demand changes.

Set 4 · Question 13 · Chapter 11

What is NVIDIA Triton Inference Server primarily for?

  1. Serving models in production with backend/runtime integration
  2. Collecting human preference labels
  3. Training foundation models from scratch
  4. Building ETL transformations
Show answer

Correct: Serving models in production with backend/runtime integration

Triton is a production serving platform supporting multiple model backends.

Set 4 · Question 14 · Chapter 11

How is NVIDIA NIM best described at a high level?

  1. Packaged inference microservices for simpler deployment
  2. A BLEU-like evaluation metric
  3. A tokenization standard
  4. A database migration tool
Show answer

Correct: Packaged inference microservices for simpler deployment

NIM packages model serving components to speed practical deployment.

Set 4 · Question 15 · Chapter 11

What does GPU memory management aim to prevent?

  1. Out-of-memory failures and fragmentation-related instability
  2. Any need for observability
  3. Any need for model versioning
  4. Any need for access control
Show answer

Correct: Out-of-memory failures and fragmentation-related instability

Managing allocation and fragmentation is key for stable high-throughput serving.

Set 4 · Question 16 · Chapter 12

In ETL, which step transforms raw source data into usable format?

  1. Transform
  2. Extract
  3. Load
  4. Archive
Show answer

Correct: Transform

Extract pulls data, transform reshapes/cleans it, and load stores processed outputs.

Set 4 · Question 17 · Chapter 12

Why is dataset versioning important for ML/LLM systems?

  1. It enables reproducibility and auditability
  2. It removes need for evaluation
  3. It guarantees fairness
  4. It increases context window size
Show answer

Correct: It enables reproducibility and auditability

Versioned datasets let teams reproduce results and trace changes over time.

Set 4 · Question 18 · Chapter 12

What should experiment tracking capture for each run?

  1. Parameters, metrics, artifacts, and run context
  2. Only final accuracy
  3. Only model file size
  4. Only deployment region
Show answer

Correct: Parameters, metrics, artifacts, and run context

Complete run context is required for repeatability and debugging.

Set 4 · Question 19 · Chapter 12

What does drift detection monitor after deployment?

  1. Shifts in data distributions or model behavior
  2. Only fan speed and power draw
  3. Only source code comments
  4. Only monthly subscription count
Show answer

Correct: Shifts in data distributions or model behavior

Drift checks detect changing conditions that can silently degrade performance.

Set 4 · Question 20 · Chapter 12

What is the purpose of CI/CD for ML in this course context?

  1. Automate validation, packaging, and release of model changes
  2. Replace human governance entirely
  3. Disable monitoring to reduce cost
  4. Use only manual deployment
Show answer

Correct: Automate validation, packaging, and release of model changes

CI/CD for ML standardizes safe, repeatable model release workflows.
