
Courses / Nvidia / NCA-GENL

Chapter 8: Evaluation and Metrics


Chapter 8 of 12 · Productionizing LLM Solutions (22%).

Chapter Content

Exam focus

Primary domain: Productionizing LLM Solutions (22%).

Language Metrics

  • Perplexity
  • BLEU
  • ROUGE
  • F1 score
  • Exact match
  • Human evaluation

Performance Metrics

  • Latency
  • Throughput
  • Tokens/sec
  • GPU utilization
  • Memory footprint

Reliability

  • Hallucination detection
  • Factual accuracy
  • Bias measurement
  • Robustness testing
  • Adversarial testing

Scope Bullet Explanations

  • Perplexity: Measures how well a model predicts token sequences (lower is generally better).
  • BLEU: N-gram overlap metric often used for translation quality.
  • ROUGE: Overlap metric family commonly used for summarization evaluation.
  • F1 score: Harmonic mean of precision and recall for classification/extraction tasks.
  • Exact match: Percentage of predictions that exactly match reference answers.
  • Human evaluation: Manual judging of usefulness, correctness, and quality dimensions.
  • Latency: Time from request to response (or first token/time to completion).
  • Throughput: Work processed per unit time (requests/sec or tokens/sec).
  • Tokens/sec: Generation speed metric for inference performance.
  • GPU utilization: Fraction of GPU compute capacity actively used.
  • Memory footprint: RAM/VRAM consumption of model and runtime pipeline.
  • Hallucination detection: Identifying plausible but unsupported or incorrect outputs.
  • Factual accuracy: Degree to which outputs align with verifiable facts/evidence.
  • Bias measurement: Quantifying systematic differences or unfair behavior across groups.
  • Robustness testing: Evaluating stability under distribution shifts and noisy inputs.
  • Adversarial testing: Stress-testing with intentionally difficult/malicious inputs.

Chapter overview

Evaluation is how you convert “it seems good” into defensible release decisions. This chapter separates quality, performance, and reliability metrics, then shows how to combine them into practical release gates.

Learning objectives

  • Select language-quality metrics based on task type.
  • Measure runtime performance and capacity for serving systems.
  • Assess reliability risks including hallucination, bias, and adversarial robustness.
  • Build blended evaluation pipelines with automated and human review.

8.1 Language quality metrics

Perplexity

Useful for language modeling quality trends, but not sufficient alone for end-user task quality.
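As a minimal sketch (assuming per-token log-probabilities are already available from the model), perplexity is the exponential of the negative mean log-probability per token:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability per token).
    Lower values mean the model assigns higher probability
    to the observed token sequence."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Example: per-token natural-log probabilities from a model
print(perplexity([-1.2, -0.4, -2.1, -0.7]))  # ~3.0
```

Note that perplexities are only comparable across models that share the same tokenizer, since the "per token" denominator differs otherwise.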

BLEU and ROUGE

N-gram overlap metrics used for translation and summarization baselines. They are convenient but can miss semantic quality differences.
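The core mechanism behind BLEU can be sketched as clipped n-gram precision. This illustrative version uses unigrams only; real BLEU combines precisions up to 4-grams with a brevity penalty:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision, the core idea behind BLEU:
    count candidate tokens that also appear in the reference,
    clipping each token's count at its reference count."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(count, ref[tok]) for tok, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

print(unigram_precision("the cat sat", "the cat sat down"))  # 1.0
print(unigram_precision("a a a", "a b"))                     # ~0.33 (clipped)
```

The clipping is what prevents a degenerate candidate like "a a a" from scoring perfectly against a reference containing a single "a". ROUGE inverts the ratio toward recall, which suits summarization.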

F1 and exact match

Strong for extraction and QA tasks where canonical answers exist.
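A minimal sketch of exact match and token-level F1 as used in extractive QA evaluation (whitespace tokenization and lowercasing are simplifying assumptions; real harnesses also strip punctuation and articles):

```python
from collections import Counter

def exact_match(pred, gold):
    """1 if prediction matches the reference exactly (case-insensitive)."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-level F1: harmonic mean of precision and recall
    over tokens shared between prediction and reference."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))         # 1
print(token_f1("in Paris France", "Paris"))  # 0.5
```

F1 gives partial credit for near-miss answers; exact match does not, which makes it stricter but easier to interpret.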

Human evaluation

Required for nuanced criteria such as usefulness, completeness, tone, and policy adherence.

8.2 Performance and systems metrics

Latency

Track both time-to-first-token (perceived responsiveness, especially for streaming UIs) and time-to-last-token (total completion time), and report percentiles rather than means alone.

Throughput

Requests/sec and tokens/sec indicate capacity under load.
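Both throughput figures can be derived from a request log. This sketch assumes sequential serving, so the sum of per-request latencies approximates wall-clock time; under concurrent serving you would divide by elapsed wall-clock time instead:

```python
# Request log entries: (total_latency_seconds, tokens_generated)
log = [(1.2, 240), (0.8, 160), (2.0, 400)]

total_time = sum(latency for latency, _ in log)          # ~wall-clock if sequential
tokens_per_sec = sum(n for _, n in log) / total_time     # aggregate generation speed
requests_per_sec = len(log) / total_time                 # request capacity

print(tokens_per_sec, requests_per_sec)  # 200.0 0.75
```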

GPU utilization and memory footprint

Low utilization may indicate scheduling inefficiency; high memory pressure can lead to instability and tail-latency spikes.

Cost metrics

Cost per request and cost per successful task are key for production sustainability.
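A quick illustration of why cost per successful task matters more than cost per request (the numbers are hypothetical): when quality drops, per-request cost stays flat while per-success cost rises.

```python
def cost_metrics(total_cost_usd, requests, successes):
    """Return (cost per request, cost per successful task).
    The second rises as the success rate falls, even when
    the first is unchanged."""
    return total_cost_usd / requests, total_cost_usd / successes

per_request, per_success = cost_metrics(50.0, 1000, 800)
print(per_request, per_success)  # 0.05 0.0625
```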

8.3 Reliability and risk metrics

Hallucination and factuality

Measure claim-evidence alignment, especially in RAG flows.
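As an illustrative baseline only — production systems typically use NLI models or LLM judges for this — claim-evidence alignment can be approximated by token overlap between a generated claim and the retrieved passages:

```python
def claim_support_score(claim, evidence_passages):
    """Crude claim-evidence alignment heuristic: fraction of claim
    tokens that appear anywhere in the retrieved passages.
    Illustrative baseline only; it misses paraphrase and negation."""
    claim_tokens = set(claim.lower().split())
    evidence_tokens = set(" ".join(evidence_passages).lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)

score = claim_support_score(
    "GPUs accelerate inference",
    ["Modern GPUs accelerate LLM inference workloads."])
print(score)  # 1.0
```

A low score flags claims for review; a high score does not prove faithfulness, since overlap ignores word order and negation ("GPUs do not accelerate inference" would also score highly).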

Bias and fairness

Evaluate performance across user segments and content categories.

Robustness and adversarial testing

Use red-team prompts, malformed input, and policy stress cases.

Safety policy compliance

Track refusal correctness, false refusals, and unsafe pass-through rates.

8.4 Evaluation design principles

  1. Match metric to task objective.
  2. Separate offline and online evaluation.
  3. Use threshold-based release gates.
  4. Monitor post-release drift continuously.
  5. Keep a fixed benchmark set for regression comparison.

8.5 Building a release scorecard

A practical scorecard includes:

  • task quality metrics,
  • system performance metrics,
  • safety and reliability metrics,
  • pass/fail thresholds,
  • rollback triggers.

No single metric should determine release.
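A threshold-based release gate can be sketched as follows; the metric names and limits here are hypothetical placeholders, and every threshold must pass, so no single metric approves a release on its own:

```python
import operator

def release_gate(metrics, thresholds):
    """Pass/fail gate: every metric must satisfy its threshold.
    thresholds maps metric name -> (comparator, limit)."""
    failures = [name for name, (op, limit) in thresholds.items()
                if not op(metrics[name], limit)]
    return (len(failures) == 0, failures)

thresholds = {
    "answer_f1":   (operator.ge, 0.80),  # quality floor
    "p95_latency": (operator.le, 2.5),   # seconds, tail latency ceiling
    "unsafe_rate": (operator.le, 0.01),  # safety ceiling
}

ok, failed = release_gate(
    {"answer_f1": 0.83, "p95_latency": 3.1, "unsafe_rate": 0.004},
    thresholds)
print(ok, failed)  # False ['p95_latency']
```

The same structure extends naturally to green/yellow/red bands by adding a second, stricter threshold set for the "green" tier.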

8.6 Failure modes

  • Optimizing to one metric while user satisfaction declines.
  • Ignoring distribution shift between benchmark and production traffic.
  • No adversarial evaluation before release.
  • Treating average latency as sufficient while p95/p99 regress badly.
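The average-versus-tail trap is easy to demonstrate with synthetic latencies: a handful of slow requests barely move the mean while the p99 tells the real story.

```python
# 97 fast requests and 3 slow outliers (seconds)
latencies = [0.3] * 97 + [4.0, 5.0, 6.0]

mean = sum(latencies) / len(latencies)
p99 = sorted(latencies)[int(0.99 * (len(latencies) - 1))]

print(round(mean, 2), p99)  # 0.44 5.0 -- the mean hides a 10x-slower tail
```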

Chapter summary

Evaluation must be multidimensional. Strong production teams combine offline benchmarks, online telemetry, and human review into a single decision framework.

Mini-lab: production scorecard design

Goal: define a release gate for one assistant workflow.

  1. Choose one real workflow and success outcome.
  2. Select three quality metrics and two system metrics.
  3. Add two safety/reliability checks.
  4. Define green/yellow/red thresholds.
  5. Simulate one regression scenario and rollback decision.

Deliverable in Notion:

  • Release scorecard template with thresholds and escalation policy.

Review questions

  1. Why is perplexity insufficient for production readiness decisions?
  2. When is exact match preferable to ROUGE?
  3. Why track both latency and throughput together?
  4. What does tokens/sec hide if used alone?
  5. How should hallucination be measured in RAG systems?
  6. Why is human evaluation still mandatory?
  7. What is the value of adversarial testing before release?
  8. How do you design meaningful rollback criteria?
  9. Why must benchmark sets remain stable across model versions?
  10. What post-release signals indicate evaluation coverage gaps?

Key terms

Perplexity, BLEU, ROUGE, F1, exact match, latency, throughput, tokens/sec, hallucination, factuality, robustness testing, adversarial testing.

Exam traps

  • Equating benchmark gains with production gains.
  • Ignoring tail-latency and failure rates.
  • Treating safety evaluation as optional if quality metrics are high.
