
Chapter 5: Vision and Image Understanding


Chapter 5 of 12 · Vision and Image Understanding.


Exam focus

  • Image classification
  • Object detection
  • Image segmentation
  • Feature extraction
  • CNN feature maps
  • Vision transformers
  • CLIP
  • Contrastive learning
  • Embedding similarity

Scope bullet explanations

  • Classification: Assign one or more labels to an image.
  • Detection: Predict object classes and locations.
  • Segmentation: Predict pixel-level masks for regions or objects.
  • Feature extraction: Build representation vectors for downstream tasks.
  • CNN maps vs ViT tokens: Local hierarchical features versus global attention-based context.
  • CLIP/contrastive learning: Align vision and text in shared semantic space.
  • Embedding similarity: Core mechanism behind image-text retrieval systems.

Chapter overview

This chapter covers core vision concepts that repeatedly appear in multimodal pipelines. GENM questions usually test whether you can select the correct vision primitive and explain representation tradeoffs.

Assumed foundational awareness

Expected baseline:

  • matrix/tensor intuition,
  • train-validation behavior,
  • precision/recall interpretation.

Learning objectives

  • Differentiate common computer vision task types and outputs.
  • Compare CNN and ViT representation behavior.
  • Explain CLIP and contrastive alignment for multimodal retrieval.
  • Use embedding similarity concepts for practical search and ranking.

5.1 Vision task families

Image classification

Outputs label probabilities for the image as a whole. Useful for coarse categorization and routing.
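A minimal sketch of the classification output stage: a pure-Python softmax turns raw class logits into label probabilities. The three-class logits below are invented for illustration, not from any real model.

```python
import math

def softmax(logits):
    """Convert raw class logits into label probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-class logits for one image (e.g. cat, dog, truck)
probs = softmax([2.0, 1.0, -1.0])
top = max(range(len(probs)), key=probs.__getitem__)  # index of the predicted class
```

The probabilities sum to 1, and the arg-max index is what a routing or coarse-categorization step would consume.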

Object detection

Outputs bounding boxes plus class confidence. Useful for localized semantic understanding.
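Detection outputs are typically compared to ground truth via intersection-over-union (IoU) of bounding boxes. A small sketch, assuming boxes in `(x1, y1, x2, y2)` corner format:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

An IoU threshold (commonly 0.5) is then used to decide whether a predicted box counts as a match for a ground-truth object.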

Segmentation

Outputs per-pixel class or instance regions. Useful when boundary precision matters.
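The same overlap idea applies at pixel level: segmentation quality is often scored as IoU between binary masks. A toy sketch, assuming masks flattened to 0/1 lists:

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equal-length flat 0/1 lists."""
    inter = sum(p & t for p, t in zip(pred, target))  # pixels on in both masks
    union = sum(p | t for p, t in zip(pred, target))  # pixels on in either mask
    return inter / union if union else 1.0  # two empty masks agree perfectly
```

This is why segmentation cannot be judged with box-level metrics: the score depends on exact boundary pixels, not just a coarse region.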

5.2 Representation learning in vision

CNN feature maps

CNNs learn local edge-to-pattern hierarchies and are efficient for many spatial tasks.
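A feature map is just the response of a small kernel slid over the image. The toy sketch below computes one "valid" feature map in pure Python (cross-correlation, as deep-learning frameworks implement convolution); the edge kernel and 4x4 image are invented for illustration:

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation producing one feature map (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A vertical-edge kernel responds strongly only at the 0->1 transition
image = [[0, 0, 1, 1]] * 4
edge_kernel = [[-1, 1], [-1, 1]]
fmap = conv2d_valid(image, edge_kernel)
```

The localized response illustrates the CNN inductive bias: each output value depends only on a small spatial neighborhood.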

ViT representations

ViTs tokenize image patches and use attention to model long-range interactions. They often benefit from larger pretraining and can capture broader global context.
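The tokenization step can be sketched without any model: split the image into non-overlapping patches and flatten each one into a token vector (a real ViT would then linearly project these and add position embeddings). The 4x4 single-channel image below is illustrative only:

```python
def patchify(image, patch):
    """Split an HxW image (list of rows) into flattened patch tokens."""
    h, w = len(image), len(image[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([image[i + di][j + dj]
                           for di in range(patch) for dj in range(patch)])
    return tokens

# A 4x4 image with 2x2 patches yields 4 tokens of length 4
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(img, 2)
```

Because attention then connects every token to every other token, a ViT can relate distant patches in a single layer, unlike the local CNN kernel above.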

5.3 CLIP and contrastive learning

Contrastive learning pulls matching image-text pairs together and pushes mismatched pairs apart in a shared embedding space. CLIP-style training enables text-driven image retrieval and zero- and few-shot adaptation.
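The alignment objective can be illustrated with a cosine-similarity matrix: after successful contrastive training, each image embedding should score highest against its paired text embedding (the diagonal). The 2-D embeddings below are hand-picked toy values, not real CLIP outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy aligned embeddings: row i of image_emb pairs with row i of text_emb
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.1, 0.9]]
sim = [[cosine(img, txt) for txt in text_emb] for img in image_emb]
```

Training drives the diagonal entries of `sim` above the off-diagonal ones, which is exactly the property retrieval exploits.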

5.4 Embedding similarity in practice

Key deployment choices:

  • embedding model and dimension,
  • similarity metric (cosine, dot product),
  • indexing strategy,
  • reranking policy,
  • relevance quality monitoring.
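The choices above can be grounded with a minimal brute-force retrieval sketch using cosine similarity (production systems would use an ANN index instead; the 2-D vectors are illustrative only):

```python
import math

def top_k(query, index, k=2):
    """Return indices of the k indexed embeddings most cosine-similar to query."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    ranked = sorted(enumerate(index), key=lambda iv: cos(query, iv[1]),
                    reverse=True)
    return [i for i, _ in ranked[:k]]

# Toy index of three item embeddings; the query sits closest to item 0
index = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
hits = top_k([0.9, 0.1], index, k=2)
```

With unit-normalized embeddings, cosine and dot-product ranking coincide, which is why the metric choice interacts with the embedding model choice.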

5.5 Evaluation and reliability

Vision evaluation should include:

  • class-wise precision/recall,
  • localization quality for detection,
  • mask quality for segmentation,
  • retrieval metrics (top-k precision, recall@k) for CLIP-like systems.
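Recall@k, the standard retrieval metric mentioned above, is simple to compute from a ranked result list. A sketch, assuming item IDs are hashable:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```

Sweeping k (e.g. 1, 5, 10) shows how much of the relevant set the system surfaces as the result budget grows.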

Common failure modes

  • Using classification metrics to judge detection quality.
  • Ignoring domain shift (lighting, camera quality, viewpoint).
  • Treating embedding retrieval as solved without reranking validation.
  • Skipping bias/fairness slices for visual categories.

Chapter summary

Vision systems are foundational building blocks in GENM. Correct task framing and representation selection are often more important than model-size escalation.

Mini-lab: image-text retrieval benchmark

  1. Build a small image-text paired dataset.
  2. Generate embeddings and index vectors.
  3. Run text-to-image and image-to-text queries.
  4. Measure top-k quality and analyze failures.

Deliverable:

  • retrieval scorecard with top failure categories.

Review questions

  1. Why is segmentation not interchangeable with detection?
  2. How do CNN and ViT inductive biases differ?
  3. What makes CLIP effective for cross-modal retrieval?
  4. Why is contrastive negative sampling quality important?
  5. Which metric best surfaces class-specific errors?
  6. How can domain shift break retrieval performance?
  7. Why should reranking be considered after ANN retrieval?
  8. What is one risk of relying only on global image embeddings?
  9. How can visual bias appear in classification outputs?
  10. Why should evaluation include both quality and latency?

Key terms

Image classification, object detection, segmentation, feature map, ViT, CLIP, contrastive learning, embedding similarity.

Exam traps

  • Confusing task outputs across classification/detection/segmentation.
  • Assuming CLIP alignment always transfers without domain validation.
  • Ignoring deployment-time data shift.
