Exam focus
Primary domain: Productionizing LLM Solutions (22%). Secondary: Developing LLM Applications (24%).
- Bias in LLMs
- Fairness
- Toxicity detection
- Content filtering
- Guardrails
- Prompt injection attacks
- Jailbreaking
- Data privacy
- Model governance
- Compliance considerations
- Ethical AI principles
Scope Bullet Explanations
- Bias in LLMs: Systematic skew in outputs caused by data, training, or deployment patterns.
- Fairness: Ensuring comparable quality/treatment across different groups and contexts.
- Toxicity detection: Identifying harmful or abusive language in input/output streams.
- Content filtering: Blocking/redacting policy-violating or unsafe content.
- Guardrails: Rule and policy layers constraining model behavior and tool actions.
- Prompt injection attacks: Untrusted content attempting to override instructions and controls.
- Jailbreaking: Attempts to bypass safety policies through adversarial prompt patterns.
- Data privacy: Protection of sensitive information in training, retrieval, and inference.
- Model governance: Ownership, approvals, auditability, and change-control for AI systems.
- Compliance considerations: Legal and regulatory requirements tied to industry/jurisdiction.
- Ethical AI principles: Transparency, accountability, fairness, privacy, and harm reduction in design and operations.
Chapter overview
LLM risk is not only model risk. It includes prompt channels, retrieval channels, tool calls, user interfaces, and governance processes. This chapter provides a layered approach to safety, security, and responsible AI operations.
Learning objectives
- Identify bias, toxicity, injection, jailbreak, privacy, and governance risk categories.
- Design layered controls across model, application, and infrastructure boundaries.
- Apply responsible AI principles with measurable operational practices.
- Build incident-response and policy-improvement loops.
9.1 Risk taxonomy
Bias and fairness
Bias can originate from training data, prompting patterns, retrieval corpus imbalance, or policy application inconsistency.
Toxicity and harmful content
Models can produce unsafe outputs in benign and adversarial contexts. Detection and mitigation need pre- and post-generation controls.
Prompt injection and jailbreaks
Attackers can use user input, documents, or tool output to override intended behavior.
Privacy and data leakage
Risk includes exposing sensitive user data, training data fragments, or cross-tenant information.
9.2 Guardrails and control layers
Input controls
- sanitize and classify input,
- detect malicious instructions,
- enforce trust boundary labels.
Retrieval controls
- scope-based metadata filters,
- source allowlists,
- stale or untrusted document rejection.
Generation controls
- policy-aware decoding constraints,
- content filtering,
- refusal templates for restricted requests.
Tool and action controls
- least-privilege tool permissions,
- argument validation,
- audit logs for tool invocation.
Output controls
- toxicity and policy filtering,
- PII redaction,
- confidence-aware escalation.
9.3 Governance and compliance
Governance requires explicit ownership:
- policy owners,
- model owners,
- incident owners,
- approval workflows for updates. Compliance scope varies by industry, but minimum expectations usually include auditability, traceability, and documented risk controls.
9.4 Responsible AI operating model
Core principles translated into operations:
- fairness: evaluate across cohorts,
- accountability: assign change approvers,
- transparency: document model and prompt behavior,
- privacy: minimize and protect sensitive data,
- safety: test and monitor high-risk failure modes.
9.5 Incident response cycle
- Detect violation or unsafe behavior.
- Triage severity and impacted population.
- Apply immediate mitigation (block, rollback, permission tighten).
- Perform root-cause analysis.
- Update policy, tests, and monitoring.
9.6 Failure modes
- Overreliance on one toxicity classifier.
- No boundary between trusted system instructions and untrusted content.
- Incomplete audit logs for tool actions.
- Governance on paper only, without enforcement workflow.
Chapter summary
Safe LLM deployment requires layered controls and accountable governance. Strong security posture comes from defense in depth, not single-prompt hardening.
Mini-lab: LLM threat model
Goal: produce a risk-control map for one app.
- Draw data flow: user -> prompt -> retrieval -> model -> tool -> output.
- Identify top five threats.
- Map preventive, detective, and corrective controls.
- Define ownership and severity levels.
- Add one incident playbook for prompt injection. Deliverable in Notion:
- Threat matrix with control mapping and response playbook.
Review questions
- Why is prompt injection a system design issue, not just a prompt issue?
- What is the operational difference between guardrails and filters?
- How can metadata filtering support both security and relevance?
- Why are tool permissions critical in LLM risk management?
- What evidence is required for governance auditability?
- How should privacy controls differ for internal and external assistants?
- Why is fairness evaluation a recurring process?
- What triggers immediate model rollback?
- How should incident severity be defined?
- Which layer usually fails first in real prompt-injection incidents?
Key terms
Bias, fairness, toxicity, prompt injection, jailbreak, guardrails, policy enforcement, PII, governance, compliance, incident response.
Exam traps
- Treating safety as a post-processing-only problem.
- Ignoring retrieval and tool channels in threat models.
- Assuming internal deployments carry no compliance risk.