
Chapter 7: Reinforcement Learning and Alignment


Chapter 7 of 12 · Productionizing LLM Solutions (22%).

Chapter Content

Exam focus

Primary domain: Productionizing LLM Solutions (22%).

  • RLHF
  • Reward modeling
  • Human preference optimization
  • Policy optimization
  • Alignment techniques
  • Safety alignment
  • Model steering
  • Constitutional AI
  • Feedback loops

Scope Bullet Explanations

  • RLHF: Reinforcement learning from human feedback to align outputs with user preferences.
  • Reward modeling: Trains a model to score response quality from preference data.
  • Human preference optimization: Improves policy behavior using human-ranked comparisons.
  • Policy optimization: Updates the model policy to maximize reward signals.
  • Alignment techniques: Methods to steer model behavior toward helpful/safe/consistent outputs.
  • Safety alignment: Targeted controls that reduce harmful, unsafe, or policy-violating outputs.
  • Model steering: Runtime or training-time methods to guide tone, style, or behavior.
  • Constitutional AI: Uses explicit principle sets to self-critique and revise responses.
  • Feedback loops: Continuous monitoring and correction cycles after deployment.

Chapter overview

Alignment ensures models behave in ways users and organizations can trust. RLHF and related methods improve preference quality, but they introduce new failure modes that must be managed through evaluation and governance.

Learning objectives

  • Explain RLHF pipeline components and training flow.
  • Describe reward modeling and policy optimization at a practical level.
  • Compare alignment techniques including safety alignment and Constitutional AI.
  • Build feedback loops to continuously monitor and improve behavior.

7.1 RLHF pipeline fundamentals

A simplified RLHF loop:

  1. Collect model responses for sampled prompts.
  2. Gather human preference rankings.
  3. Train reward model to score response quality.
  4. Optimize policy model against reward.
  5. Re-evaluate behavior and repeat.

RLHF improves behavior quality beyond supervised fine-tuning (SFT), especially for helpfulness and style consistency.
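The five stages above can be sketched as a toy loop. Everything here is an illustrative stand-in, not a real training API: the "human" oracle and the reward model both just prefer longer responses, and "optimization" is a greedy argmax over a fixed candidate set.

```python
import random

random.seed(0)

PROMPTS = ["explain RLHF", "define reward model"]
CANDIDATES = {p: [p + "!", p + " in detail.", p[:4]] for p in PROMPTS}

def sample_policy(prompt):
    # Stage 1: the initial policy samples a response uniformly.
    return random.choice(CANDIDATES[prompt])

def human_prefers(a, b):
    # Stage 2: stand-in preference oracle (prefers longer answers).
    return a if len(a) >= len(b) else b

def train_reward_model(comparisons):
    # Stage 3: a trivial proxy that scores by the feature (length)
    # which predicted the winner in the collected comparisons.
    return lambda resp: float(len(resp))

def optimize_policy(reward):
    # Stage 4: the "optimized" policy picks the highest-reward candidate.
    return lambda prompt: max(CANDIDATES[prompt], key=reward)

# Collect comparisons and run one loop iteration.
comparisons = [(CANDIDATES[p][0], CANDIDATES[p][1],
                human_prefers(CANDIDATES[p][0], CANDIDATES[p][1]))
               for p in PROMPTS]
reward = train_reward_model(comparisons)
policy = optimize_policy(reward)

# Stage 5: re-evaluate — the optimized policy scores at least as high
# as any single sample from the initial policy.
for p in PROMPTS:
    assert reward(policy(p)) >= reward(sample_policy(p))
```

Real pipelines replace each stand-in with an LLM, a trained reward network, and a gradient-based optimizer, but the data flow between the five stages is the same.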

7.2 Reward modeling

Reward models are proxies, not truth. They estimate preferences from labeled comparisons. Labeling rubric quality determines reward model usefulness.

Design considerations:

  • clear criteria (helpful, harmless, honest),
  • rater calibration,
  • disagreement handling,
  • balanced prompt distribution.
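A reward model is typically fit on pairwise comparisons with a Bradley-Terry-style loss: the probability that the preferred response beats the rejected one is a sigmoid of their reward difference. The sketch below trains a one-parameter reward model on a single made-up scalar feature per response; the data and feature are assumptions for illustration, not a real labeling format.

```python
import math

# Each pair: (feature of the preferred response, feature of the rejected one).
# Toy data where the preferred response always has the higher feature value.
preference_pairs = [(0.9, 0.2), (0.8, 0.4), (0.7, 0.1)]

w = 0.0   # reward(response) = w * feature
lr = 0.5
for _ in range(200):
    grad = 0.0
    for good, bad in preference_pairs:
        # Bradley-Terry: P(good preferred) = sigmoid(r_good - r_bad)
        margin = w * (good - bad)
        p = 1.0 / (1.0 + math.exp(-margin))
        # Gradient of the negative log-likelihood w.r.t. w
        grad += (p - 1.0) * (good - bad)
    w -= lr * grad / len(preference_pairs)

# After training, preferred responses score higher than rejected ones.
assert w > 0
```

Note that nothing constrains `w` except the labeled comparisons: if the labeling rubric rewards the wrong feature, the reward model faithfully learns the wrong preference, which is why rubric quality determines reward model usefulness.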

7.3 Policy optimization

Policy optimization updates model behavior toward higher reward. Common risks include:

  • reward hacking,
  • over-optimization on narrow preference patterns,
  • degradation of factual correctness.

Always pair reward improvement with independent truthfulness and safety evaluations.
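A common guard against reward hacking in PPO-style RLHF is a KL penalty that keeps the policy close to a frozen reference model: the objective becomes expected reward minus beta times the KL divergence from the reference. The toy numbers below are assumptions chosen to show the effect; "verbose" responses hack the reward, and the penalty makes full collapse onto them score worse than a moderate shift.

```python
import math

ref_policy = {"concise": 0.5, "verbose": 0.5}   # frozen reference distribution
reward = {"concise": 1.0, "verbose": 3.0}       # "verbose" hacks the reward
beta = 2.0                                       # KL penalty strength

def objective(policy):
    # E[reward] - beta * KL(policy || ref_policy)
    exp_r = sum(policy[a] * reward[a] for a in policy)
    kl = sum(policy[a] * math.log(policy[a] / ref_policy[a])
             for a in policy if policy[a] > 0)
    return exp_r - beta * kl

collapsed = {"concise": 0.0, "verbose": 1.0}  # fully reward-hacked policy
moderate = {"concise": 0.3, "verbose": 0.7}   # stays closer to the reference

# With a strong enough penalty, collapsing onto the highest-reward
# behavior is worse than a moderate shift.
assert objective(moderate) > objective(collapsed)
```

The penalty bounds how far optimization can drift, but it does not check truthfulness, which is why independent evaluations are still required.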

7.4 Alignment techniques beyond RLHF

Safety alignment

Imposes refusal behavior, risk controls, and safe-completion constraints for harmful requests.

Model steering

Controls style, tone, and response boundaries; in some workflows this works without full retraining.

Constitutional AI

Uses explicit principle sets to guide critique and self-revision. Can reduce human-label volume for some alignment tasks.
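The critique-and-revise pattern can be sketched with stand-ins: in a real Constitutional AI setup both the principle check and the revision are LLM calls, whereas here a single made-up principle ("avoid absolute claims") is checked and fixed with string operations.

```python
# Toy constitutional pass; the principle and revision rule are
# illustrative stand-ins for model-generated critique and rewrite.
PRINCIPLES = [
    ("avoid absolute claims", lambda r: "always" not in r),
]

def critique(response):
    # Return the names of principles the response violates.
    return [name for name, check in PRINCIPLES if not check(response)]

def revise(response, violations):
    # Stand-in revision: hedge absolute language.
    if "avoid absolute claims" in violations:
        response = response.replace("always", "often")
    return response

draft = "RLHF always fixes alignment."
violations = critique(draft)
final = revise(draft, violations) if violations else draft
assert critique(final) == []
```

Because the principles generate the critique labels, the loop can substitute for part of the human preference data that pure RLHF would require.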

7.5 Feedback loops and governance

Continuous alignment requires:

  • live incident logging,
  • abuse pattern analysis,
  • periodic rubric updates,
  • retraining or policy tuning triggers,
  • ownership and approval workflows.

Without feedback loops, aligned behavior drifts as user behavior and domain context change.
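A minimal retraining trigger can be built on the incident log: flag alignment rework when the rolling incident rate over recent interactions exceeds a threshold. The class name, window size, and 5% threshold below are assumptions for illustration, not a standard API.

```python
from collections import deque

class IncidentMonitor:
    """Rolling incident-rate trigger over the last `window` interactions."""

    def __init__(self, window=100, threshold=0.05):
        self.events = deque(maxlen=window)  # True = logged incident
        self.threshold = threshold

    def log(self, is_incident: bool):
        self.events.append(is_incident)

    def needs_review(self) -> bool:
        # Trigger alignment rework when the rate exceeds the threshold.
        if not self.events:
            return False
        rate = sum(self.events) / len(self.events)
        return rate > self.threshold

monitor = IncidentMonitor(window=50, threshold=0.05)
for _ in range(47):
    monitor.log(False)
for _ in range(3):
    monitor.log(True)   # 3 incidents in the last 50 → 6% > 5%
assert monitor.needs_review()
```

In practice the trigger would feed the ownership and approval workflow rather than retraining directly, so that each rework cycle is reviewed before deployment.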

7.6 Failure modes

  • Treating reward score as complete quality signal.
  • Using low-consistency raters without calibration.
  • Optimizing for politeness while factual quality drops.
  • Failing to re-test safety after policy updates.

Chapter summary

Alignment is an ongoing program, not a one-time training step. RLHF, safety alignment, and governance controls must work together to maintain reliable behavior.

Mini-lab: preference rubric and scoring

Goal: design a practical alignment scoring setup.

  1. Define rubric dimensions (helpful, harmless, honest, concise).
  2. Create 20 prompt-response pairs for scoring.
  3. Have two independent scoring passes and compare disagreement.
  4. Identify patterns of reward ambiguity.
  5. Propose reward-model feature improvements and policy updates.

Deliverable in Notion:

  • Alignment rubric with disagreement analysis and improvement plan.

Review questions

  1. Why is RLHF often applied after SFT?
  2. What makes reward models vulnerable to proxy mismatch?
  3. What is reward hacking in practice?
  4. Why is factual evaluation needed even when reward score improves?
  5. How does Constitutional AI differ from pure RLHF loops?
  6. Why are feedback loops required post-deployment?
  7. What data quality issues degrade preference optimization?
  8. Which governance controls are needed for alignment changes?
  9. How can model steering complement retraining?
  10. What triggers should initiate alignment rework cycles?

Key terms

RLHF, reward model, preference data, policy optimization, reward hacking, safety alignment, Constitutional AI, model steering, feedback loop.

Exam traps

  • Assuming aligned tone implies aligned safety.
  • Ignoring preference drift after release.
  • Treating reward-model outputs as ground truth.
