1. Stabilize Before Optimizing
- Verify hardware and management-plane integrity first.
- Confirm firmware/software baseline consistency.
- Only then make performance-tuning decisions.
Training / NCP-ADS
Module study guide
Priority 4 of 6 · Domain 3 in exam order
Scope
This module contains expanded study notes, practical drills, and an exam-style question set.
Exam Scope Coverage
This module focuses on data preparation decisions that materially affect model quality and training efficiency: quality checks, cleaning, transformation, feature engineering, and leakage-safe dataset construction.
Preparation errors are a top cause of downstream modeling failure and unstable results.
Drill: Build a reusable data-quality checklist and run it at raw, intermediate, and final dataset stages.
Imputation decisions can shift model behavior and metric conclusions.
Drill: Compare median imputation, constant imputation, and row filtering on the same task and record validation impact.
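This drill can be sketched with synthetic data; the dataset, model, and missingness rate below are illustrative assumptions, not part of the module:

```python
# Sketch: compare median imputation, constant imputation, and row filtering
# on the same task. Data and model are synthetic placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.1] = np.nan  # inject ~10% missingness

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

for name, imputer in [("median", SimpleImputer(strategy="median")),
                      ("constant", SimpleImputer(strategy="constant", fill_value=0.0))]:
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    pipe.fit(X_tr, y_tr)
    print(name, round(pipe.score(X_va, y_va), 3))

# Row filtering: drop incomplete training rows; validation rows still need
# imputation because filtering cannot be applied at inference time.
mask = ~np.isnan(X_tr).any(axis=1)
pipe = make_pipeline(SimpleImputer(strategy="median"),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_tr[mask], y_tr[mask])
print("row-filter", round(pipe.score(X_va, y_va), 3))
```

Record the validation score for each policy; the comparison, not any single number, is the drill's output.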
Wrong encoding or type assumptions can inflate dimensionality and degrade performance.
Drill: Create a column-wise transformation plan and justify each encoding choice with model compatibility.
Scale-sensitive models and distance-based methods depend on consistent feature ranges.
Drill: Run a model with unscaled, standardized, and normalized features and explain metric differences.
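A minimal sketch of this drill, using kNN because it is distance-based and scale-sensitive; the synthetic data and the exaggerated feature scale are assumptions for illustration:

```python
# Sketch: one model under unscaled, standardized, and normalized features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X[:, 0] *= 1000.0  # exaggerate one feature so it dominates raw distances
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

scores = {}
for name, scaler in [("unscaled", None),
                     ("standardized", StandardScaler()),
                     ("normalized", MinMaxScaler())]:
    steps = ([scaler] if scaler is not None else []) + [KNeighborsClassifier()]
    pipe = make_pipeline(*steps)
    pipe.fit(X_tr, y_tr)
    scores[name] = round(pipe.score(X_va, y_va), 3)
print(scores)
```

Explaining why the unscaled score differs (distance dominated by the large-scale feature) is the point of the drill.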
Outliers and minority classes can dominate error patterns if untreated.
Drill: Document one outlier policy and one imbalance policy, then compare with a no-adjustment baseline.
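As one possible instance of this drill (the IQR clipping rule and class weighting below are illustrative policy choices, fit on training data only):

```python
# Sketch: one outlier policy (train-fitted IQR clipping) and one imbalance
# policy (balanced class weights) versus a no-adjustment baseline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=7)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=7)

# Outlier policy: clip each feature to [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
# with quartiles computed on the training partition only.
q1, q3 = np.percentile(X_tr, [25, 75], axis=0)
lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
X_tr_c, X_va_c = np.clip(X_tr, lo, hi), np.clip(X_va, lo, hi)

for name, Xt, Xv, kw in [("baseline", X_tr, X_va, {}),
                         ("clip+weighted", X_tr_c, X_va_c,
                          {"class_weight": "balanced"})]:
    clf = LogisticRegression(max_iter=1000, **kw).fit(Xt, y_tr)
    print(name, "minority F1 =", round(f1_score(y_va, clf.predict(Xv)), 3))
```

Minority-class F1 is used here because overall accuracy can hide imbalance effects.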
Preparation throughput can bottleneck end-to-end training and iteration speed.
Drill: Build one preparation pipeline that ingests raw data, applies transformations, and writes clean Parquet with audit metadata.
Concept Explanations
Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.
Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.
Preparation logic must be fit on training data only; anything else risks leakage and inflated metrics.
Preparation quality improves when numeric and categorical paths are explicit and auditable.
Preparation decisions should stabilize model behavior under skew, outliers, and minority-class scarcity.
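The train-only fitting rule above can be demonstrated with a small sketch; the deliberately shifted validation distribution is an assumption used to make the leak visible:

```python
# Sketch: statistics fit on the full dataset absorb validation information.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, size=(500, 1))  # training distribution
X_va = rng.normal(3.0, 1.0, size=(500, 1))  # shifted validation distribution

leaky = StandardScaler().fit(np.vstack([X_tr, X_va]))  # WRONG: sees validation rows
safe = StandardScaler().fit(X_tr)                      # RIGHT: train-only statistics

print("leaky mean:", leaky.mean_.round(2))  # pulled toward the validation data
print("safe mean: ", safe.mean_.round(2))
```

The leaky scaler's mean sits between the two distributions, proving its statistics depend on data that would be unavailable in real use.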
Scenario Playbooks
A model looked strong in validation, but production quality dropped immediately. You need to assess preparation leakage or transformation mismatch.
Architecture Diagram
[Raw Data] -> [Split] -> [Train-fitted Transformers] -> [Model]
|
[Artifact Registry + Inference Reuse]
Response Flow
Success Signals
Pipeline split-first skeleton
python - <<'PY'
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# X and y are assumed to be loaded upstream (e.g., from the audited dataset)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
pipe = Pipeline([...])  # fill in train-only fitted preprocessing steps plus the model
pipe.fit(X_train, y_train)
print('valid_score=', pipe.score(X_valid, y_valid))
PY
Expected output (example)
valid_score= 0.812
One-hot encoding on high-cardinality fields produced feature explosion and unstable training runtime.
Architecture Diagram
[Raw Categorical Features] -> [Cardinality Audit] -> [Encoding Policy]
|
[Memory + Metric Validation]
Response Flow
Success Signals
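The cardinality audit step from the diagram can be sketched as below; the columns and the policy threshold are illustrative assumptions:

```python
# Sketch: measure unique counts per categorical column before choosing an
# encoding, so high-cardinality fields never default to one-hot.
import pandas as pd

df = pd.DataFrame({"country": ["US", "DE", "FR", "US"],
                   "user_id": ["u1", "u2", "u3", "u4"]})
THRESHOLD = 3  # illustrative cutoff; above it, avoid one-hot encoding

policies = {}
for col in df.select_dtypes(include="object"):
    n = df[col].nunique()
    policies[col] = "one-hot" if n <= THRESHOLD else "hash/frequency"
    print(col, n, policies[col])
```

A success signal here is that every categorical column has a recorded cardinality and an explicit encoding decision before any transformer is fit.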
CLI and Commands
Validate schema, missingness, and duplicate behavior before transformation.
Null and duplicate audit
python - <<'PY'
import pandas as pd
df = pd.read_parquet('raw.parquet')
print(df.isna().mean().sort_values(ascending=False).head(10))
print('dupes=', df.duplicated().sum())
PY
Expected output (example)
feature_x 0.23
feature_y 0.08
dupes= 1042
Dtype contract check
python - <<'PY'
import pandas as pd
df = pd.read_parquet('raw.parquet')
expected = {'age': 'int64', 'country': 'object', 'target': 'int64'}
for c, t in expected.items():
    status = 'OK' if str(df[c].dtype) == t else f'expected {t}'
    print(c, df[c].dtype, status)
PY
Expected output (example)
age int64 OK
country object OK
target int64 OK
Train and validate with split-safe transformations.
ColumnTransformer + pipeline flow
python - <<'PY'
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('enc', OneHotEncoder(handle_unknown='ignore'))])
# num_cols and cat_cols are assumed to be defined upstream (lists of column names)
pre = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])
print(pre)
PY
Expected output (example)
ColumnTransformer(...)
Stratified split check
python - <<'PY'
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,stratify=y,test_size=0.2,random_state=42)
print(y_train.mean(), y_test.mean())
PY
Expected output (example)
0.184 0.185
Common Problems
Symptoms
Likely Cause
Preparation steps consumed information from validation/test distribution during fitting.
Remediation
Prevention: Gate model promotion on evidence that all transforms were fit only on training data.
Symptoms
Likely Cause
High-cardinality categories were one-hot encoded without cardinality control or policy.
Remediation
Prevention: Include cardinality thresholds and encoding policy in data contract.
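One common alternative policy for such columns is train-fitted frequency encoding; the column and data below are illustrative:

```python
# Sketch: frequency encoding as a one-hot alternative for high cardinality.
# The mapping is fit on training data only; unseen categories map to 0.
import pandas as pd

train = pd.DataFrame({"city": ["a", "a", "b", "c", "a", "b"]})
valid = pd.DataFrame({"city": ["b", "d"]})  # "d" is unseen in training

freq = train["city"].value_counts(normalize=True)  # fit on train only
train["city_freq"] = train["city"].map(freq)
valid["city_freq"] = valid["city"].map(freq).fillna(0.0)
print(valid)
```

This keeps dimensionality at one column per feature regardless of cardinality, at the cost of collapsing distinct rare categories onto similar values.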
Lab Walkthroughs
Construct and validate train-only fitted transforms with reproducible artifacts.
Prerequisites
Run stratified split before any imputation or scaling.
python - <<'PY'
from sklearn.model_selection import train_test_split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, stratify=y_temp, test_size=0.5, random_state=42)
print(len(X_train), len(X_valid), len(X_test))
PY
Expected: Split sizes and class ratios look consistent across partitions.
Fit preprocessing only on train, then transform validation/test.
Expected: No fitting step consumes validation/test data.
Persist transformer artifact and attach version metadata.
Expected: Inference path can load the same artifact without retraining.
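A minimal sketch of this persistence step, assuming joblib; the file path and metadata keys are illustrative:

```python
# Sketch: persist a train-fitted transformer with version metadata so the
# inference path reuses the exact artifact without refitting.
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler().fit(X_train)  # fit on training data only

artifact = {"transformer": scaler,
            "meta": {"artifact_version": "v1",
                     "fitted_on": "train",
                     "n_rows": len(X_train)}}
joblib.dump(artifact, "preprocessor.joblib")

loaded = joblib.load("preprocessor.joblib")
print(loaded["meta"])
print(loaded["transformer"].transform([[2.0]]))  # same statistics, no retraining
```

Bundling the metadata with the transformer in one artifact means version checks can gate loading before any inference runs.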
Success Criteria
Compare robust preparation policies under skewed and imbalanced conditions.
Prerequisites
Create baseline with no outlier/imbalance interventions.
Expected: Reference metrics are captured for comparison.
Apply one outlier policy and one imbalance policy, then retrain.
Expected: You measure changes in minority-class metrics and overall stability.
Select final policy using reproducible metric evidence.
Expected: Policy decision includes tradeoff notes and deployment implications.
Success Criteria
Study Sprint
| Day | Focus | Output |
|---|---|---|
| 1 | Raw data audit and schema contract definition. | Data-quality baseline report and schema checks. |
| 2 | Missingness analysis and baseline imputation strategies. | Missing-data decision matrix by feature group. |
| 3 | Categorical and numeric transformation plan. | Column transformation map with encoding rationale. |
| 4 | Scaling and distribution-shape transformation experiments. | Comparison table for scaling choices by model type. |
| 5 | Outlier handling and robustness checks. | Outlier policy memo with before/after metric impact. |
| 6 | Imbalance mitigation and stratified split validation. | Split-quality and class-balance checklist. |
| 7 | Leakage prevention using pipeline and split-safe transforms. | Leakage risk register and mitigations. |
| 8 | GPU-oriented ingest and type validation (JSON/Parquet). | I/O tuning note and data-type compatibility log. |
| 9 | Assemble full preparation pipeline and dry run. | Reusable preprocessing workflow artifact. |
| 10 | Revision sprint and timed case drill. | Final preparation decision cheatsheet. |
Hands-on Labs
Each lab includes a collapsed execution sample with representative CLI usage and expected output.
Formalize deterministic quality checks before feature engineering.
Sample Command (Data quality contract runbook)
python - <<'PY'
import pandas as pd
df = pd.read_parquet('raw.parquet')
print(df.isna().mean().sort_values(ascending=False).head(10))
print('dupes=', df.duplicated().sum())
PY
Expected output (example)
feature_x 0.23
feature_y 0.08
dupes= 1042
Build split-aware preprocessing flow with explicit train-only fitting.
Sample Command (Data quality contract runbook)
python - <<'PY'
import pandas as pd
df = pd.read_parquet('raw.parquet')
expected = {'age': 'int64', 'country': 'object', 'target': 'int64'}
for c, t in expected.items():
    status = 'OK' if str(df[c].dtype) == t else f'expected {t}'
    print(c, df[c].dtype, status)
PY
Expected output (example)
age int64 OK
country object OK
target int64 OK
Compare robust preparation strategies under skewed target distribution.
Sample Command (Leakage-safe preprocessing runbook)
python - <<'PY'
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('enc', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)])
print(pre)
PY
Expected output (example)
ColumnTransformer(...)
Improve preprocessing throughput while preserving data fidelity.
Sample Command (Leakage-safe preprocessing runbook)
python - <<'PY'
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,stratify=y,test_size=0.2,random_state=42)
print(y_train.mean(), y_test.mean())
PY
Expected output (example)
0.184 0.185
Exam Pitfalls
Practice Set
Attempt each question first, then open the answer and explanation.
Answer: B
Using full-dataset statistics leaks future information and overstates model performance.
Answer: B
One-hot works well for moderate cardinality but can become expensive for very large category counts.
Answer: B
Stratification keeps class distribution more stable between train, validation, and test.
Answer: A
Standardization improves optimization and distance comparability when feature scales differ.
Answer: B
Aggressive row dropping can remove informative data and distort the sample distribution.
Answer: B
Leakage occurs when training features capture future or target-derived information not available in real use.
Answer: B
Parquet's columnar structure usually improves downstream analytical read performance.
Answer: B
Outlier treatment should be explicit, domain-aware, and validated against model behavior.
Answer: B
Train-fitted transformations prevent leakage and keep evaluation realistic.
Answer: B
Pipelines enforce consistent transformation order and reduce human error across runs.