Chapter 1: Core Machine Learning and Deep Learning Foundations
Exam focus
Primary coverage from the GENM blueprint:
- Supervised learning
- Unsupervised learning
- Semi-supervised learning
- Training vs inference
- Feature engineering
- Labeling strategies
- Perceptron and fully connected networks
- CNNs and RNNs
- Transformers, attention, self-attention, cross-attention
- Accuracy, precision, recall, F1, confusion matrix, ROC-AUC
- Overfitting, underfitting, bias-variance tradeoff
Scope bullet explanations
- Supervised learning: Train with labeled input-output pairs and optimize prediction quality on target labels.
- Unsupervised learning: Learn latent structure without explicit labels (clustering, representation learning, density patterns).
- Semi-supervised learning: Combine limited labels with large unlabeled data to improve generalization.
- Training vs inference: Training updates model parameters; inference uses frozen parameters under latency/cost constraints.
- Feature engineering: Transform raw data into stable, informative model inputs.
- Labeling strategies: Define annotation policy, quality controls, and edge-case handling.
- Perceptron/FCN/CNN/RNN/Transformer: Foundational model family progression for modality-specific and sequence tasks.
- Attention family: Mechanisms that learn which tokens/regions/signals matter for current prediction.
- Core metrics: Different metrics expose different failure patterns; no single metric is enough.
- Fit diagnostics: Overfit/underfit and bias-variance analysis guide model and data decisions.
Chapter overview
This chapter builds the baseline reasoning required for all GENM topics. Multimodal systems are not fundamentally separate from ML and DL; they are an extension that combines multiple signal types and introduces additional alignment and evaluation complexity.
Assumed foundational awareness
The exam expects familiarity with:
- basic probability concepts (distribution, conditional probability, confidence),
- linear algebra intuition (vectors, matrix operations, dot product),
- optimization basics (loss, gradient, update step),
- practical train/validation/test separation.
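The optimization bullet above can be made concrete with a minimal gradient-descent loop. This is an illustrative sketch only: the quadratic loss, target value, and learning rate are arbitrary choices.

```python
# Minimal gradient descent on a 1-D quadratic loss L(w) = (w - 3)^2.
# Illustrative only: the target value 3 and learning rate are arbitrary.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # dL/dw

w = 0.0    # initial parameter
lr = 0.1   # learning rate
for _ in range(100):
    w -= lr * grad(w)  # update step: move against the gradient

print(round(w, 4))  # approaches the minimum at w = 3
```

Each update scales the distance to the minimum by a constant factor (here 0.8), which is why the loop converges quickly; real training loops apply the same loss-gradient-update pattern over batches of data.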
Learning objectives
- Differentiate learning paradigms and choose the right framing for a given task.
- Explain how classical and modern neural architectures relate to multimodal systems.
- Select and interpret evaluation metrics for practical model decisions.
- Diagnose common generalization failures and choose mitigation paths.
1.1 Learning paradigms for multimodal AI
Supervised and unsupervised foundations
Supervised learning is ideal when label semantics are reliable and the task target is clear. Unsupervised learning is useful for discovering latent organization and for initialization or retrieval spaces when labels are scarce.
Semi-supervised workflows
Semi-supervised approaches are common in multimodal pipelines because high-quality cross-modal labels are expensive. A practical pattern is:
- pretrain representations on large unlabeled corpora,
- fine-tune with smaller labeled sets,
- iterate with active labeling for hard slices.
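The label-then-iterate idea behind this pattern can be sketched as a single round of pseudo-labeling. This is a toy, pure-Python illustration using a nearest-centroid classifier; the 1-D data and confidence margin are arbitrary choices, not a production recipe.

```python
# Pseudo-labeling sketch: a nearest-centroid classifier adopts labels for
# unlabeled points it is confident about. Data and margin are arbitrary.

labeled = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]  # (x, label)
unlabeled = [0.5, 1.2, 8.7, 9.5, 5.0]                        # no labels

def centroids(pairs):
    sums, counts = {}, {}
    for x, y in pairs:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

# One self-training round: pseudo-label points that are clearly closer
# to one centroid than the other, then grow the labeled set.
cents = centroids(labeled)
margin = 2.0  # only accept confident pseudo-labels
for x in unlabeled:
    dists = sorted((abs(x - c), y) for y, c in cents.items())
    if dists[1][0] - dists[0][0] > margin:  # confident enough?
        labeled.append((x, dists[0][1]))    # adopt the pseudo-label

print(len(labeled))  # 8: four original points + four confident pseudo-labels
```

Note that the ambiguous point at 5.0 is deliberately left unlabeled; in the workflow above, such hard slices are exactly the ones routed to active labeling.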
Training versus inference mindset
Training optimizes quality across epochs and large datasets. Inference optimizes user-facing behavior under constraints (latency, throughput, memory, reliability, safety).
1.2 Neural network evolution and architecture intuition
Perceptron to deep networks
The perceptron establishes the basic trainable-neuron abstraction, but on its own it can only draw linear decision boundaries; stacking fully connected layers with nonlinear activations is what yields general function approximation. CNNs improve spatial modeling through locality and shared kernels. RNNs add sequence recurrence but struggle with long-range dependency stability.
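As a concrete anchor, a perceptron is trained with a simple error-driven update rule. The sketch below learns logical AND in pure Python; the learning rate and epoch count are arbitrary illustrative choices.

```python
# Perceptron learning rule on linearly separable data (logical AND).
# Minimal sketch; learning rate and epoch count are arbitrary.

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.1

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

for _ in range(20):              # a few passes over the data
    for x, y in data:
        err = y - predict(x)     # error-driven update: shift toward target
        w[0] += lr * err * x[0]
        w[1] += lr * err * x[1]
        b += lr * err

print([predict(x) for x, _ in data])  # [0, 0, 0, 1]
```

The same rule fails on non-separable targets like XOR, which is the classic motivation for stacking layers with nonlinear activations.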
Transformer transition
Transformers replaced recurrence with attention, enabling parallel token interaction and better long-context reasoning. For GENM, transformer ideas extend naturally to image patches, audio tokens, and cross-modal conditioning.
Attention, self-attention, cross-attention
- Attention: weighted relevance computation.
- Self-attention: tokens attend within the same sequence/modality.
- Cross-attention: one sequence attends to another (for example text attending to image embeddings).
Cross-attention is central in multimodal alignment.
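The self- versus cross-attention distinction reduces to where Q, K, and V come from. The sketch below implements scaled dot-product attention in pure Python with toy two-dimensional embeddings; the vectors are arbitrary values, not learned representations.

```python
# Scaled dot-product attention, minimal pure-Python sketch.
# Self-attention: Q, K, V all come from the same sequence.
# Cross-attention: Q comes from one sequence (e.g. text), K and V from
# another (e.g. image patch embeddings). Vectors here are toy values.
import math

def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is a weighted average of the value vectors.
        out.append([sum(wt * v[j] for wt, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

text = [[1.0, 0.0], [0.0, 1.0]]    # toy "text" token embeddings
image = [[1.0, 1.0], [0.0, 2.0]]   # toy "image" patch embeddings

self_attn = attention(text, text, text)     # tokens attend within text
cross_attn = attention(text, image, image)  # text queries attend to image
print(cross_attn)
```

Because the softmax weights sum to one, each output row is a convex combination of the value vectors, which is exactly the "weighted relevance" framing above.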
1.3 Feature engineering and labeling strategy
Feature quality
Poor feature design can hide useful signal or amplify noise. In multimodal systems, feature quality also depends on temporal/spatial alignment across modalities.
Labeling policy
Labeling must define:
- class definitions,
- ambiguity rules,
- reviewer consistency protocol,
- escalation for uncertain samples.
A weak labeling policy causes silent quality drift, especially in subjective multimodal tasks.
1.4 Evaluation fundamentals and decision logic
Metric interpretation
- Accuracy: useful for balanced tasks with similar class costs.
- Precision/Recall: critical when false positives and false negatives have different impact.
- F1: balance metric when both precision and recall matter.
- Confusion matrix: class-level error visibility.
- ROC-AUC: threshold-agnostic ranking quality.
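The threshold-based metrics above all derive from the four cells of a binary confusion matrix. The toy labels and predictions below are arbitrary, chosen only to make the arithmetic visible.

```python
# Computing precision, recall, and F1 from a confusion matrix by hand.
# Toy binary labels and predictions; illustrative only.

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of predicted positives, how many are real?
recall = tp / (tp + fn)     # of real positives, how many were found?
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # all approximately 0.8 here
```

On this balanced toy data the metrics agree; under class imbalance or asymmetric error costs they diverge, which is why no single metric is enough.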
Fit diagnostics
- Overfitting: excellent training metrics, poor validation generalization.
- Underfitting: poor training and validation performance.
- Bias-variance tradeoff: balance model complexity with data signal quality.
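These diagnostics can be operationalized as a simple comparison of training and validation scores. The gap and floor thresholds below are arbitrary illustrative choices; real projects tune them per task.

```python
# A simple fit diagnostic heuristic: compare train and validation scores.
# Thresholds (0.10 gap, 0.60 floor) are arbitrary illustrative choices.

def diagnose(train_score, val_score, gap=0.10, floor=0.60):
    if train_score < floor and val_score < floor:
        return "underfitting"   # poor everywhere: add capacity or features
    if train_score - val_score > gap:
        return "overfitting"    # memorizing: regularize or add data
    return "reasonable fit"

print(diagnose(0.99, 0.72))  # overfitting
print(diagnose(0.55, 0.53))  # underfitting
print(diagnose(0.88, 0.86))  # reasonable fit
```

The heuristic is deliberately crude, but it captures the decision logic: the train-validation gap points at variance problems, while a low score on both sides points at bias problems.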
1.5 Exam-oriented decision framework
When asked scenario-based questions:
- Identify task type and modality profile.
- Check whether labels are strong, weak, or absent.
- Pick baseline architecture family.
- Select metrics tied to operational risk.
- Diagnose likely fit failure and mitigation.
Common failure modes
- Choosing metrics that hide business-critical errors.
- Treating training quality as deployment quality.
- Ignoring labeling consistency drift.
- Over-indexing on model scale before fixing data quality.
Chapter summary
Chapter 1 gives you a stable ML/DL decision frame. GENM complexity comes later, but this foundation is what keeps architecture, evaluation, and optimization choices coherent.
Mini-lab: baseline modeling decision memo
Goal: build a one-page decision memo for a multimodal feature.
- Choose a task (captioning, retrieval, VQA, speech assistant).
- Specify learning paradigm and why.
- Pick model family and explain tradeoffs.
- Select 3 evaluation metrics and expected thresholds.
- List one overfit risk and one mitigation.
Deliverable:
- concise architecture + evaluation memo with assumptions.
Review questions
- Why is semi-supervised learning common in multimodal systems?
- How does cross-attention differ functionally from self-attention?
- When is F1 more useful than raw accuracy?
- What does ROC-AUC tell you that a fixed-threshold metric does not?
- How can label policy inconsistency impact validation results?
- Why might a model with higher training accuracy still be worse in production?
- What is one practical sign of underfitting?
- How does feature engineering differ from model architecture tuning?
- Why should confusion matrix analysis be part of release criteria?
- What is a reliable first step when metrics degrade after deployment?
Key terms
Supervised learning, semi-supervised learning, feature engineering, self-attention, cross-attention, F1 score, ROC-AUC, bias-variance tradeoff.
Exam traps
- Using accuracy-only arguments for imbalanced data.
- Assuming larger models solve weak data and label quality.
- Ignoring operational error cost while choosing metrics.