
Data Preparation

Module study guide

Priority 4 of 6 · Domain 3 in exam order

Scope

Exam study content

This module contains expanded study notes, practical drills, and an exam-style question set.

Exam weight
15%
Priority tier
Tier 2
Why this domain
Practical preprocessing foundations used across modeling pipelines.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning changes.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes without reproducible benefit.

Exam Scope Coverage

What this module now covers

This module focuses on data preparation decisions that materially affect model quality and training efficiency: quality checks, cleaning, transformation, feature engineering, and leakage-safe dataset construction.

Track 1: Data quality profiling and schema discipline

Preparation errors are a top cause of downstream modeling failure and unstable results.

  • Validate column types, null patterns, duplicates, and unexpected value ranges first.
  • Define schema expectations early so ingestion and transformations remain auditable.
  • Check row counts and key integrity after each major transformation step.

Drill: Build a reusable data-quality checklist and run it at raw, intermediate, and final dataset stages.
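
The checklist drill can be sketched as a small pandas helper; the `quality_report` name and the metric set below are illustrative, not a prescribed API:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    # Illustrative checklist: run at raw, intermediate, and final stages,
    # then diff the reports between stages to catch silent data loss.
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "null_fraction": df.isna().mean().round(3).to_dict(),
    }

df = pd.DataFrame({"age": [34, None, 34], "country": ["US", "US", None]})
print(quality_report(df))
```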

Track 2: Missing data strategy

Imputation decisions can shift model behavior and metric conclusions.

  • Separate missingness mechanisms from imputation method choice.
  • Simple imputers are often strong baselines when paired with indicator features where appropriate.
  • Fit imputation logic only on training data to avoid leakage.

Drill: Compare median imputation, constant imputation, and row filtering on the same task and record validation impact.
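
A minimal version of this drill uses scikit-learn's `SimpleImputer` on a toy column; in a real run the imputers would be fit on the training split only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column standing in for a training-split feature with missing values.
x = pd.DataFrame({"income": [30.0, np.nan, 50.0, 90.0]})

median_fill = SimpleImputer(strategy="median").fit_transform(x)
constant_fill = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(x)
filtered = x.dropna()  # row filtering shrinks the sample instead of filling it

print("median:", median_fill.ravel())
print("constant:", constant_fill.ravel())
print("rows kept after filtering:", len(filtered))
```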

Track 3: Encoding and data-type handling

Wrong encoding or type assumptions can inflate dimensionality and degrade performance.

  • Use encoding strategies that match cardinality and model class (for example one-hot vs ordinal).
  • Keep categorical and numeric pipelines explicit for repeatability.
  • In GPU workflows, verify cuDF data types and null semantics before heavy transformations.

Drill: Create a column-wise transformation plan and justify each encoding choice with model compatibility.
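
One way to draft the column-wise plan is to key each encoding choice off observed cardinality; the threshold and policy labels here are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "DE", "US", "FR"],   # low cardinality: one-hot is cheap
    "user_id": ["u1", "u2", "u3", "u4"],   # high cardinality: one-hot explodes
})

ONE_HOT_MAX = 3  # illustrative threshold; tune per model class and memory budget
plan = {
    col: "one-hot" if df[col].nunique() <= ONE_HOT_MAX else "ordinal/hashed"
    for col in df.columns
}
print(plan)
```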

Track 4: Feature scaling and transformation

Scale-sensitive models and distance-based methods depend on consistent feature ranges.

  • Standardization and normalization solve different problems; choose by algorithm behavior.
  • Skewed distributions may require log or power transformations before scaling.
  • Fit all scaling parameters on training split only and reuse on validation/test.

Drill: Run a model with unscaled, standardized, and normalized features and explain metric differences.
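
The difference between the two families can be seen directly with scikit-learn's scalers on a toy feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

standardized = StandardScaler().fit_transform(X)  # zero mean, unit variance
normalized = MinMaxScaler().fit_transform(X)      # rescaled into [0, 1]

print("standardized mean:", standardized.mean())
print("normalized range:", normalized.min(), "to", normalized.max())
```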

Track 5: Outliers, imbalance, and robust preparation

Outliers and minority classes can dominate error patterns if untreated.

  • Use robust statistics and plots to detect outlier influence before clipping or transformation.
  • Address class imbalance with stratified splits and, when justified, resampling strategies.
  • Track every outlier or resampling rule so results remain reproducible.

Drill: Document one outlier policy and one imbalance policy, then compare with a no-adjustment baseline.
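
A common robust detection rule is the 1.5 x IQR fence; this sketch flags candidates for review rather than deleting them outright:

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95])  # 95 is the suspicious point

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = values[(values < low) | (values > high)]
print("fences:", low, high)
print("flagged for review:", flagged)
```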

Track 6: I/O-aware preparation pipelines

Preparation throughput can bottleneck end-to-end training and iteration speed.

  • Parquet is typically preferred for columnar projection and predicate pushdown.
  • Configure JSON Lines parsing to match the file-size profile and ingestion path.
  • Use repeatable pipeline orchestration and environment pinning for stable reruns.

Drill: Build one preparation pipeline that ingests raw data, applies transformations, and writes clean Parquet with audit metadata.
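
For the JSON Lines side of this pipeline, pandas can parse the format directly; the in-memory buffer below stands in for a real ingest file:

```python
import io
import pandas as pd

# Two JSON Lines records standing in for a raw ingest file.
raw = io.StringIO('{"age": 34, "country": "US"}\n{"age": 41, "country": "DE"}\n')

df = pd.read_json(raw, lines=True)
print(df.dtypes.to_dict())   # verify inferred types before heavy transformations
print(len(df), "rows ingested")
```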

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.

Split-first preparation discipline

Preparation logic must be fit on training data only; anything else risks leakage and inflated metrics.

  • Perform train/validation/test split before fitting imputers, scalers, and encoders.
  • Reuse train-fitted transformers for validation/test and production inference.
  • Log transformation artifacts and version hashes for reproducibility.
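
The discipline reduces to one rule: statistics come from the training split and are only reused elsewhere. A minimal illustration with a scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [3.0]])  # train mean 2.0, std 1.0
X_valid = np.array([[2.0]])

scaler = StandardScaler().fit(X_train)  # fit on train only
print(scaler.transform(X_valid))        # reuse train statistics, never refit
```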

Feature-type aware transformation design

Preparation quality improves when numeric and categorical paths are explicit and auditable.

  • Use separate pipelines per feature family with documented rationale.
  • Select encodings based on cardinality and model behavior, not habit.
  • Track dtype and null semantics through each transformation stage.

Robustness against outliers and imbalance

Preparation decisions should stabilize model behavior under skew, outliers, and minority-class scarcity.

  • Use robust stats to detect heavy tails before scaling decisions.
  • Prefer stratified splits for classification imbalance.
  • Validate outlier/imbalance treatments with before/after metric impact.
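
The first bullet can be demonstrated in a few lines: a single heavy-tailed value drags the mean far more than the median:

```python
import numpy as np

x = np.array([10.0, 11.0, 12.0, 500.0])  # one heavy-tail value
print("mean:", x.mean())        # pulled toward the tail
print("median:", np.median(x))  # barely moves
```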

Scenario Playbooks

Exam-style scenario explanations

Scenario A: Validation score collapses after production deployment

A model looked strong in validation, but production quality dropped immediately. You need to assess preparation leakage or transformation mismatch.

Architecture Diagram

[Raw Data] -> [Split] -> [Train-fitted Transformers] -> [Model]
                    |
          [Artifact Registry + Inference Reuse]

Response Flow

  1. Audit whether any imputer/scaler/encoder was fit on full dataset.
  2. Compare offline transformer artifacts with online inference pipeline.
  3. Re-run evaluation with strict split-first pipeline and report delta.

Success Signals

  • Preparation artifacts are versioned and reused consistently online/offline.
  • Validation metrics remain stable across reruns with leakage-safe flow.
  • No feature generated from future or target-adjacent information.

Pipeline split-first skeleton

python - <<'PY'
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# X and y are assumed to be loaded already; replace [...] with the real
# preprocessing and model steps for your task.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)
pipe = Pipeline([...])
pipe.fit(X_train, y_train)
print('valid_score=', pipe.score(X_valid, y_valid))
PY

Expected output (example)

valid_score= 0.812

Scenario B: Encoding choices cause training instability and huge memory use

One-hot encoding on high-cardinality fields produced feature explosion and unstable training runtime.

Architecture Diagram

[Raw Categorical Features] -> [Cardinality Audit] -> [Encoding Policy]
                                            |
                              [Memory + Metric Validation]

Response Flow

  1. Run cardinality profiling and identify high-expansion columns.
  2. Swap encoding strategy for high-cardinality fields and compare metrics/runtime.
  3. Document final encoding policy and compatibility with target model family.

Success Signals

  • Feature matrix size stays within memory budget.
  • Validation performance remains stable or improves.
  • Encoding policy is documented per column group.

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute the command with explicit scope (node, interface, GPU set).
  3. Compare output against the expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).

Data quality contract runbook

Validate schema, missingness, and duplicate behavior before transformation.

Null and duplicate audit

python - <<'PY'
import pandas as pd
df = pd.read_parquet('raw.parquet')
print(df.isna().mean().sort_values(ascending=False).head(10))
print('dupes=', df.duplicated().sum())
PY

Expected output (example)

feature_x 0.23
feature_y 0.08
dupes= 1042

Dtype contract check

python - <<'PY'
import pandas as pd

df = pd.read_parquet('raw.parquet')
expected = {'age': 'int64', 'country': 'object', 'target': 'int64'}
for col, dtype in expected.items():
    actual = str(df[col].dtype)
    print(col, dtype, 'OK' if actual == dtype else f'MISMATCH (got {actual})')
PY

Expected output (example)

age int64 OK
country object OK
target int64 OK

  • Run this before fitting any preprocessing pipeline.
  • Persist report artifacts for reproducibility.

Leakage-safe preprocessing runbook

Train and validate with split-safe transformations.

ColumnTransformer + pipeline flow

python - <<'PY'
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# num_cols and cat_cols are assumed to list the numeric/categorical column names.
num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                     ('scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                     ('enc', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])
print(pre)
PY

Expected output (example)

ColumnTransformer(...)

Stratified split check

python - <<'PY'
from sklearn.model_selection import train_test_split

# X and y are assumed to be loaded already.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)
print(y_train.mean(), y_test.mean())
PY

Expected output (example)

0.184 0.185

  • If class proportions diverge strongly, revisit split strategy before training.
  • Keep fitted preprocessors versioned with the model artifact.

Common Problems

Failure patterns and fixes

Hidden leakage through preprocessing fitted on full dataset

Symptoms

  • Validation score is significantly higher than holdout or production score.
  • Imputers/scalers were built before train-validation split.
  • Feature importance appears unrealistically clean.

Likely Cause

Preparation steps consumed information from validation/test distribution during fitting.

Remediation

  • Rebuild pipeline with split-first procedure and train-only fitting.
  • Re-evaluate metrics on untouched validation/test splits.
  • Add leakage checks to experiment review checklist.

Prevention: Gate model promotion on evidence that all transforms were fit only on training data.

Feature explosion from naive categorical encoding

Symptoms

  • Training memory usage spikes after preprocessing.
  • Sparse matrix dimensions are unexpectedly large.
  • Runtime and model variance increase without quality gain.

Likely Cause

High-cardinality categories were one-hot encoded without cardinality control or policy.

Remediation

  • Profile cardinality and redesign encoding strategy for high-cardinality fields.
  • Use model-compatible alternatives where needed (target-safe techniques with leakage controls).
  • Benchmark memory and quality impact before finalizing.

Prevention: Include cardinality thresholds and encoding policy in data contract.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough A: Build a leakage-safe preprocessing pipeline

Construct and validate train-only fitted transforms with reproducible artifacts.

Prerequisites

  • Labeled tabular dataset and feature dictionary.
  • scikit-learn environment ready.
  • Train/validation/test split policy documented.

  1. Run stratified split before any imputation or scaling.

    python - <<'PY'
    from sklearn.model_selection import train_test_split
    # X and y are assumed to be loaded already.
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)
    X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, stratify=y_temp, test_size=0.5, random_state=42)
    print(len(X_train), len(X_valid), len(X_test))
    PY

    Expected: Split sizes and class ratio look consistent across partitions.

  2. Fit preprocessing only on train, then transform validation/test.

    Expected: No fitting step consumes validation/test data.

  3. Persist transformer artifact and attach version metadata.

    Expected: Inference path can load the same artifact without retraining.

Success Criteria

  • Audit confirms zero train-test contamination in transform fitting.
  • Validation metrics are reproducible from saved artifact.
  • Artifact metadata includes data version and feature schema.

Walkthrough B: Outlier and imbalance policy calibration

Compare robust preparation policies under skewed and imbalanced conditions.

Prerequisites

  • Dataset with known outlier behavior and class imbalance.
  • Baseline model and metric suite defined.
  • Notebook template for policy comparisons.

  1. Create baseline with no outlier/imbalance interventions.

    Expected: Reference metrics are captured for comparison.

  2. Apply one outlier policy and one imbalance policy, then retrain.

    Expected: You measure changes in minority-class metrics and overall stability.

  3. Select final policy using reproducible metric evidence.

    Expected: Policy decision includes tradeoff notes and deployment implications.

Success Criteria

  • Selected policy improves target metrics without overfitting signs.
  • Tradeoff rationale is documented in final report.
  • Team can replay experiment with fixed seeds and artifact versions.

Study Sprint

10-day execution plan

Day 1 · Raw data audit and schema contract definition · Output: Data-quality baseline report and schema checks
Day 2 · Missingness analysis and baseline imputation strategies · Output: Missing-data decision matrix by feature group
Day 3 · Categorical and numeric transformation plan · Output: Column transformation map with encoding rationale
Day 4 · Scaling and distribution-shape transformation experiments · Output: Comparison table for scaling choices by model type
Day 5 · Outlier handling and robustness checks · Output: Outlier policy memo with before/after metric impact
Day 6 · Imbalance mitigation and stratified split validation · Output: Split-quality and class-balance checklist
Day 7 · Leakage prevention using pipeline and split-safe transforms · Output: Leakage risk register and mitigations
Day 8 · GPU-oriented ingest and type validation (JSON/Parquet) · Output: I/O tuning note and data-type compatibility log
Day 9 · Assemble full preparation pipeline and dry run · Output: Reusable preprocessing workflow artifact
Day 10 · Revision sprint and timed case drill · Output: Final preparation decision cheatsheet

Hands-on Labs

Practical module work

Each lab includes a collapsed execution sample with representative CLI usage and expected output.

Lab A: Data quality contract

Formalize deterministic quality checks before feature engineering.

  • Implement checks for null thresholds, duplicates, and value-range violations.
  • Fail pipeline early when contract conditions are broken.
  • Record quality metrics for each data version.

Execution Sample (Collapsed)

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Data quality contract runbook)

python - <<'PY'
import pandas as pd
df = pd.read_parquet('raw.parquet')
print(df.isna().mean().sort_values(ascending=False).head(10))
print('dupes=', df.duplicated().sum())
PY

Expected output (example)

feature_x 0.23
feature_y 0.08
dupes= 1042

Lab B: Leakage-safe transformation pipeline

Build split-aware preprocessing flow with explicit train-only fitting.

  • Split data first, then fit imputers/scalers on train only.
  • Apply identical fitted transforms to validation/test.
  • Verify that no target-derived information leaks into features.

Execution Sample (Collapsed)

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Data quality contract runbook)

python - <<'PY'
import pandas as pd

df = pd.read_parquet('raw.parquet')
expected = {'age': 'int64', 'country': 'object', 'target': 'int64'}
for col, dtype in expected.items():
    actual = str(df[col].dtype)
    print(col, dtype, 'OK' if actual == dtype else f'MISMATCH (got {actual})')
PY

Expected output (example)

age int64 OK
country object OK
target int64 OK

Lab C: Imbalance and outlier handling

Compare robust preparation strategies under skewed target distribution.

  • Run stratified split and establish baseline metrics.
  • Apply one imbalance strategy and one outlier strategy.
  • Evaluate whether changes improve minority-class behavior without overfitting.

Execution Sample (Collapsed)

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Leakage-safe preprocessing runbook)

python - <<'PY'
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# num_cols and cat_cols are assumed to list the numeric/categorical column names.
num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                     ('scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                     ('enc', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])
print(pre)
PY

Expected output (example)

ColumnTransformer(...)

Lab D: I/O and format optimization

Improve preprocessing throughput while preserving data fidelity.

  • Benchmark JSON Lines ingestion and Parquet read/write behavior.
  • Compare compression choices for storage and read performance tradeoffs.
  • Document preferred preparation storage format for your workload.

Execution Sample (Collapsed)

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Leakage-safe preprocessing runbook)

python - <<'PY'
from sklearn.model_selection import train_test_split

# X and y are assumed to be loaded already.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)
print(y_train.mean(), y_test.mean())
PY

Expected output (example)

0.184 0.185

Exam Pitfalls

Common failure patterns

  • Fitting scalers or imputers on the full dataset before splitting.
  • Using encoding choices that explode dimensionality without justification.
  • Applying outlier clipping blindly without domain constraints.
  • Skipping stratification for imbalanced classification datasets.
  • Ignoring datatype drift between ingestion and model input stages.
  • Treating data-preparation runtime as irrelevant in iterative ML workflows.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. Why should imputation parameters be fit only on the training split?
  • A. To increase test size
  • B. To avoid leakage from validation/test distributions
  • C. To guarantee perfect accuracy
  • D. To remove categorical features

Answer: B

Using full-dataset statistics leaks future information and overstates model performance.

Q2. Which statement about one-hot encoding is most accurate?
  • A. It is always best for every categorical feature
  • B. It can create high dimensionality for high-cardinality features
  • C. It is only for regression targets
  • D. It removes all preprocessing needs

Answer: B

One-hot works well for moderate cardinality but can become expensive for very large category counts.

Q3. What is a key reason to use stratified splitting in classification?
  • A. It guarantees no outliers
  • B. It preserves class proportions across splits
  • C. It converts labels to numeric form
  • D. It eliminates feature scaling

Answer: B

Stratification keeps class distribution more stable between train, validation, and test.

Q4. When is standardization usually most useful?
  • A. For scale-sensitive models such as distance-based or gradient-based methods
  • B. Only for tree ensembles
  • C. Never in ML pipelines
  • D. Only for image labels

Answer: A

Standardization improves optimization and distance comparability when feature scales differ.

Q5. What is a common risk of dropping all rows with missing values?
  • A. Guaranteed faster and better models
  • B. Biased samples and major data loss
  • C. Automatic leakage prevention
  • D. Better minority-class recall by default

Answer: B

Aggressive row dropping can remove informative data and distort the sample distribution.

Q6. Which statement best describes leakage?
  • A. Any model with high validation accuracy
  • B. Feature engineering that uses information unavailable at inference time
  • C. Data ingestion from Parquet
  • D. Random seed setting

Answer: B

Leakage occurs when training features capture future or target-derived information not available in real use.

Q7. Why is Parquet often preferred for preparation outputs?
  • A. It is row-oriented only
  • B. It supports efficient columnar reads and pushdown-friendly patterns
  • C. It cannot handle compression
  • D. It is incompatible with distributed analytics

Answer: B

Parquet's columnar structure usually improves downstream analytical read performance.

Q8. What is a sound outlier-handling approach?
  • A. Remove every value above mean
  • B. Choose robust rules tied to domain limits and validate impact
  • C. Ignore outliers always
  • D. Replace outliers with class labels

Answer: B

Outlier treatment should be explicit, domain-aware, and validated against model behavior.

Q9. In split-safe pipelines, when should transformations be applied to validation/test?
  • A. Before train split
  • B. Using transformers fitted on training data only
  • C. With separately fitted transformers for each split
  • D. Never apply transformations

Answer: B

Train-fitted transformations prevent leakage and keep evaluation realistic.

Q10. What is the main purpose of a preprocessing pipeline object?
  • A. Increase random variability
  • B. Make transformation steps reproducible and composable
  • C. Replace model evaluation
  • D. Skip feature engineering

Answer: B

Pipelines enforce consistent transformation order and reduce human error across runs.


Objectives

  • 3.1 Perform data cleansing and preprocessing with cuDF and pandas.
  • 3.2 Transform and standardize data for model readiness.
  • 3.3 Apply standardization to ensure feature uniformity where required.
  • 3.4 Generate synthetic data for augmentation using cuDF and RAPIDS.
  • 3.5 Identify and acquire suitable datasets for the task.
  • 3.6 Monitor processing pipelines to recognize bottlenecks.
  • 3.7 Process, organize, and store datasets for downstream use.
