Chapter 4: Audio and Speech Processing
Exam focus
- ASR
- TTS
- Speaker identification
- Speaker diarization
- Voice cloning
- Spectrogram / mel spectrogram / MFCC
- Audio tokenization
- Phoneme modeling
Scope Bullet Explanations
- ASR: Speech-to-text conversion pipeline and quality constraints.
- TTS: Text-to-waveform synthesis with intelligibility and prosody goals.
- Speaker identification/diarization: Who spoke vs when each speaker spoke.
- Voice cloning: Synthesizing speech in a specific person's voice, with identity, consent, and misuse implications.
- Audio representations: Feature extraction options and tradeoffs.
- Audio tokenization/phonemes: Sequence modeling units for speech systems.
Chapter overview
Speech is a core modality in GENM, especially in conversational assistants and avatar pipelines. This chapter focuses on the practical architecture and risk controls expected at associate level.
Assumed foundational awareness
Expected baseline:
- digital signal intuition (time and frequency view),
- sequence modeling basics,
- confidence/error-rate concept.
Learning objectives
- Explain end-to-end ASR and TTS workflows.
- Interpret common speech feature representations.
- Differentiate speaker-centric tasks and evaluation needs.
- Apply safety and consent controls for voice capabilities.
4.1 Speech pipeline fundamentals
ASR pipeline
Typical stages:
- audio capture and cleanup,
- feature extraction,
- acoustic/language decoding,
- post-processing and confidence scoring.
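The last ASR stage can be sketched as a simple routing rule. This is a minimal illustration, assuming the recognizer returns a per-word confidence in [0, 1]; the `Word` type, the `route_transcript` name, and both thresholds are hypothetical and would need tuning on real data.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def route_transcript(words, min_word_conf=0.5, min_mean_conf=0.8):
    """Return 'accept' or 'review' from word-level confidence scores.

    Thresholds are illustrative assumptions, not recommended values.
    """
    if not words:
        return "review"  # nothing recognized: never auto-accept
    mean_conf = sum(w.confidence for w in words) / len(words)
    lowest = min(w.confidence for w in words)
    if lowest < min_word_conf or mean_conf < min_mean_conf:
        return "review"
    return "accept"


decision = route_transcript([Word("turn", 0.97), Word("on", 0.31)])
# decision is "review" because one word falls below min_word_conf
```

Checking both the mean and the minimum keeps a single badly recognized word (often a name or a number) from hiding inside an otherwise confident transcript.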
TTS pipeline
Typical stages:
- text normalization,
- phoneme/prosody modeling,
- acoustic generation,
- vocoding and post-filtering.
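The first TTS stage, text normalization, can be sketched as follows. This is a toy example under stated assumptions: the abbreviation table is hypothetical, only integers below 100 are expanded, and real systems need context to resolve ambiguous cases (for example "Dr." as "doctor" or "drive").

```python
import re

# Hypothetical, deliberately tiny lookup table for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text):
    """Lowercase, expand known abbreviations, and spell out 1-2 digit numbers."""
    out = []
    for tok in text.lower().split():
        if tok in ABBREVIATIONS:
            out.append(ABBREVIATIONS[tok])
        elif re.fullmatch(r"[0-9]{1,2}", tok):
            out.append(number_to_words(int(tok)))
        else:
            out.append(tok)  # left unchanged in this sketch
    return " ".join(out)


print(normalize("Dr. Lee lives at 42 Elm St."))
# prints: doctor lee lives at forty two elm street
```

If this stage is skipped, the later stages receive symbols such as "42" that have no direct pronunciation, which is one reason normalization errors surface as audible TTS defects.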
4.2 Speaker tasks and operational uses
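Speaker identification predicts which known (enrolled) speaker produced an utterance. Diarization segments a recording into speaker turns ("who spoke when"), typically using anonymous labels rather than known identities. They solve different problems, produce different outputs, and should not be conflated.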
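The difference shows up in the output shape of each task. The sketch below assumes that speaker embeddings and per-frame cluster labels are already available from upstream models; the function names, the toy vectors, and the one-second frame length are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(embedding, enrolled):
    """Identification: return the enrolled name with the closest embedding."""
    return max(enrolled, key=lambda name: cosine(embedding, enrolled[name]))


def labels_to_turns(frame_labels, frame_sec=1.0):
    """Diarization output: merge per-frame labels into (start, end, label) turns.

    Labels are anonymous cluster names; no identity is implied.
    """
    turns = []
    for i, label in enumerate(frame_labels):
        if turns and turns[-1][2] == label:
            start = turns[-1][0]
            turns[-1] = (start, (i + 1) * frame_sec, label)  # extend the turn
        else:
            turns.append((i * frame_sec, (i + 1) * frame_sec, label))
    return turns


enrolled = {"alice": [0.9, 0.2], "bob": [0.1, 1.0]}  # toy embeddings
who = identify([1.0, 0.1], enrolled)                 # a single name
turns = labels_to_turns(["A", "A", "B", "A"])        # a timeline of turns
```

Identification yields one label per utterance and is scored against known identities. Diarization yields a timeline, so its quality depends on turn boundaries as well as on label consistency.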
4.3 Feature representations
- Spectrogram: energy across frequency over time.
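- Mel spectrogram: spectrogram whose frequency axis is warped to the mel scale, which approximates human pitch perception (finer resolution at low frequencies, coarser at high frequencies).
- MFCC: compact cepstral coefficients obtained by applying a discrete cosine transform to log mel energies; a common low-dimensional baseline feature, though sensitive to noise and channel mismatch.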
Representation choice affects downstream accuracy, compute, and robustness.
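The three representations build on one another, which a small standard-library sketch can make concrete. It assumes the common 2595 * log10(1 + f/700) mel formula (other variants exist), a naive DFT instead of an FFT, and toy frame sizes; production code would normally rely on an audio library.

```python
import cmath
import math


def hz_to_mel(f_hz):
    """One common mel-scale formula; other variants are in use."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hann-windowed power spectrum of one frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]


def spectrogram(signal, frame_len=64, hop=32):
    """Spectrogram: one power spectrum per overlapping frame."""
    return [power_spectrum(signal[s:s + frame_len])
            for s in range(0, len(signal) - frame_len + 1, hop)]


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters with centers equally spaced on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(max_mel * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(power, bank, n_coeffs):
    """MFCC for one frame: DCT-II of the log mel energies."""
    log_mel = [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
               for row in bank]
    m = len(log_mel)
    return [sum(e * math.cos(math.pi * k * (i + 0.5) / m)
                for i, e in enumerate(log_mel))
            for k in range(n_coeffs)]


sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(128)]
spec = spectrogram(tone)                               # frames x 33 bins
bank = mel_filterbank(10, len(spec[0]), sample_rate)   # 10 mel bands
coeffs = mfcc(spec[0], bank, n_coeffs=5)               # 5 numbers per frame
```

Each step trades detail for compactness: 33 frequency bins per frame become 10 mel bands and then 5 coefficients, which is the accuracy, compute, and robustness tradeoff in miniature.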
4.4 Audio tokenization and phoneme modeling
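Tokenization converts continuous audio into discrete units (for example, codebook indices from a neural codec or clustered self-supervised features) so that transformer-style models can treat speech as a token sequence. Phoneme-aware modeling can improve pronunciation fidelity and help stabilize TTS quality, particularly for rare words and names.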
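The core idea of audio tokenization can be illustrated with nearest-neighbor vector quantization. This is a simplified sketch: the codebook here is hand-written and hypothetical, whereas practical systems learn it from data (for example with clustering or inside a neural codec).

```python
def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every code vector.
        dists = [sum((x - c) ** 2 for x, c in zip(frame, code))
                 for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


# Toy 2-dimensional features; real frames would be much higher-dimensional.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [4.0, 6.0], [1.1, 0.8]]
tokens = quantize(frames, codebook)
# tokens is [0, 1, 2, 1]: a discrete sequence a language-model-style
# architecture can consume.
```

The resulting integer sequence plays the same role for audio that subword IDs play for text.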
4.5 Voice cloning and responsible use
Voice cloning requires explicit consent, provenance tracking, and misuse controls (watermarking, policy gating, audit logging).
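A consent check with audit logging might look like the sketch below. All names (`ConsentRecord`, `CloningGate`, `authorize`) are hypothetical, the in-memory storage is for illustration only, and watermarking is out of scope here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConsentRecord:
    speaker_id: str
    granted: bool
    expires_at: datetime


@dataclass
class CloningGate:
    """Policy gate: deny by default, log every decision."""
    consents: dict = field(default_factory=dict)   # speaker_id -> ConsentRecord
    audit_log: list = field(default_factory=list)

    def authorize(self, speaker_id, requester, now=None):
        now = now or datetime.now(timezone.utc)
        record = self.consents.get(speaker_id)
        allowed = (record is not None
                   and record.granted
                   and record.expires_at > now)
        # Log both approvals and denials so misuse attempts stay visible.
        self.audit_log.append({
            "time": now.isoformat(),
            "speaker_id": speaker_id,
            "requester": requester,
            "allowed": allowed,
        })
        return allowed


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
gate = CloningGate(consents={
    "spk-1": ConsentRecord("spk-1", True,
                           datetime(2025, 1, 1, tzinfo=timezone.utc)),
})
ok = gate.authorize("spk-1", "app-a", now=now)       # consent on file
denied = gate.authorize("spk-2", "app-a", now=now)   # no record: denied
```

The deny-by-default rule means a missing, revoked, or expired consent record blocks synthesis rather than silently allowing it.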
Common failure modes
- Using speaker ID metrics (such as identification accuracy) to evaluate diarization quality, which is normally measured with diarization error rate (DER).
- Ignoring domain noise profile during preprocessing.
- Shipping cloning features without consent workflow.
- Measuring ASR quality only on clean benchmark audio.
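The last failure mode is easy to check by reporting word error rate (WER) per test slice instead of as one aggregate number. The sketch below is a minimal version: the slice names and transcripts are made up, and it skips the text normalization (casing, punctuation) that is usually applied before scoring.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer_by_slice(slices):
    """Compute WER per slice as total errors / total reference words."""
    results = {}
    for name, pairs in slices.items():
        errors = sum(edit_distance(ref.split(), hyp.split())
                     for ref, hyp in pairs)
        total = sum(len(ref.split()) for ref, _ in pairs)
        results[name] = errors / total if total else float("nan")
    return results


# Illustrative (reference, hypothesis) pairs; not real benchmark data.
slices = {
    "clean": [("turn on the light", "turn on the light")],
    "noisy": [("turn on the light", "turn of light")],
}
report = wer_by_slice(slices)
# report is {"clean": 0.0, "noisy": 0.5}
```

An aggregate over both slices would report 0.25 and hide the fact that the noisy condition fails on half of its words.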
Chapter summary
Speech systems combine signal processing, sequence modeling, and strict operational controls. GENM exam questions often focus on pipeline distinctions and risk-aware implementation choices.
Mini-lab: speech quality and safety checklist
- Define one ASR and one TTS use case.
- Pick feature representation and justify choice.
- Define evaluation metrics and test slices.
- Add consent/safety controls for voice personalization.
Deliverable:
- speech pipeline checklist with metrics and governance.
Review questions
- How does diarization differ from speaker identification?
- Why is mel scaling useful in speech features?
- What is a practical role of MFCC in baseline systems?
- Why does text normalization matter in TTS quality?
- What risks make voice cloning high-governance?
- How can noisy environments degrade ASR unexpectedly?
- Why do confidence scores matter in ASR production flow?
- When should audio tokenization be preferred over handcrafted features?
- What is one failure signal of poor speaker turn segmentation?
- Why should speech models be evaluated across accent and channel diversity?
Key terms
ASR, TTS, diarization, mel spectrogram, MFCC, phoneme modeling, voice cloning.
Exam traps
- Treating diarization and identification as interchangeable.
- Ignoring consent and provenance in voice systems.
- Evaluating speech only under clean-lab conditions.