
Data Analysis

Module study guide

Priority 6 of 6 · Domain 1 in exam order

Scope

Exam study content

This module contains expanded study notes, practical drills, and an exam-style question set.

Exam weight: 15%
Priority tier: Tier 3
Why this domain: EDA + anomaly + graphs (less technical depth)

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes without reproducible benefit.

Exam Scope Coverage

What this module now covers

This module is aligned to Domain 1 scope: exploratory data analysis, summary statistics, probability distributions, hypothesis testing, correlation/covariance, and model evaluation metrics.

Track 1: Exploratory data analysis workflow

EDA questions test whether you can quickly profile data quality, distribution shape, and data issues before modeling.

  • Start with schema, missingness, cardinality, and duplicate checks before deeper analysis.
  • Use univariate and bivariate analysis to surface skew, outliers, and class imbalance.
  • In GPU workflows, use cuDF and cuxfilter to keep early exploration interactive on larger datasets.

Drill: Take one medium-size tabular dataset and produce a one-page EDA summary with missing-data, distribution, and relationship sections.

Track 2: Summary statistics and robust interpretation

The exam checks whether you can choose statistics that match data shape and noise profile.

  • Mean and standard deviation are sensitive to outliers; median and IQR are often safer for heavy-tailed data.
  • Percentiles communicate spread and tails better than single central tendency values.
  • Segment-level summaries can expose patterns that global averages hide.

Drill: For the same feature, compare mean/std vs median/IQR and explain which pair you trust and why.
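
A minimal sketch of this drill on a hypothetical heavy-tailed sample (synthetic data, numpy only; the values are invented for illustration):

```python
import numpy as np

# Hypothetical heavy-tailed feature: a tight bulk of values plus a few extreme outliers.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 995), [5000.0] * 5])

mean, std = x.mean(), x.std()
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

print(f"mean={mean:.1f} std={std:.1f}")      # dragged far from the bulk by 5 outliers
print(f"median={median:.1f} iqr={iqr:.1f}")  # barely affected
```

Here the mean/std pair misrepresents the typical value, while median/IQR stays close to the bulk of the data, which is exactly the justification the drill asks you to write down.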

Track 3: Probability distributions and sampling intuition

Distribution assumptions drive the validity of tests, confidence statements, and many modeling decisions.

  • Different features can follow normal, skewed, count, or multimodal patterns; verify instead of assuming.
  • Sampling variation decreases with sample size, but bias in collection remains a separate risk.
  • Distinguish population parameters from sample estimates and state uncertainty explicitly.

Drill: Assess two features, state plausible distributions, and justify with visual and summary evidence.
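
One way to verify shape rather than assume it, sketched with scipy on two synthetic features (feature names and parameters are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_like = rng.normal(0, 1, 2000)   # roughly symmetric feature
skewed = rng.exponential(1.0, 2000)    # right-skewed, count-like feature

for name, x in [("normal_like", normal_like), ("skewed", skewed)]:
    skew = stats.skew(x)
    _, p = stats.normaltest(x)         # D'Agostino-Pearson normality test
    print(f"{name}: skew={skew:.2f} normality_p={p:.4f}")
```

A skew estimate plus a formal normality check, alongside a histogram, is usually enough visual-plus-summary evidence for this drill.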

Track 4: Hypothesis testing decisions

You may be asked to pick appropriate tests and correctly interpret p-values and decision thresholds.

  • State null and alternative hypotheses clearly before running any test.
  • Choose tests based on data type, independence assumptions, and variance behavior (for example Welch vs equal-variance t-test).
  • A p-value is not effect size; always pair significance with practical magnitude.

Drill: Run one two-sample test and one contingency-table test, then interpret statistical and practical significance separately.
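
The drill above can be sketched with scipy; the group samples and contingency counts below are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Two-sample comparison with unequal variances -> Welch's t-test is the safer default.
a = rng.normal(10.0, 1.0, 500)
b = rng.normal(10.5, 3.0, 500)
t_stat, p_t = stats.ttest_ind(a, b, equal_var=False)
print(f"welch p={p_t:.4f} mean_delta={b.mean() - a.mean():.2f}")  # significance AND magnitude

# Hypothetical 2x2 contingency table: group vs outcome counts.
table = np.array([[120, 80],
                  [90, 110]])
chi2, p_c, dof, _ = stats.chi2_contingency(table)
print(f"chi2={chi2:.2f} p={p_c:.4f} dof={dof}")
```

Reporting `mean_delta` next to the p-value is what keeps statistical and practical significance separate, as the drill requires.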

Track 5: Correlation and covariance interpretation

Correlation questions are common and frequently mixed with causality traps.

  • Covariance scale depends on feature units; correlation normalizes covariance for comparability.
  • Pearson captures linear relationships; low Pearson does not prove no relationship.
  • Correlation does not establish causation and can be confounded by hidden variables.

Drill: Compute covariance and correlation matrices and list one misleading interpretation you intentionally avoid.
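
A quick demonstration of the units point in the first bullet, using synthetic linearly related features:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 1000)
y = 2 * x + rng.normal(0, 1, 1000)  # y depends linearly on x plus noise

x_m = x * 1000  # the same feature expressed in different units

# Covariance scales with the units; correlation is invariant to them.
print("cov  (original):", np.cov(x, y)[0, 1].round(3))
print("cov  (rescaled):", np.cov(x_m, y)[0, 1].round(3))      # 1000x larger
print("corr (original):", np.corrcoef(x, y)[0, 1].round(3))
print("corr (rescaled):", np.corrcoef(x_m, y)[0, 1].round(3))  # unchanged
```

This is why correlation matrices are comparable across features while covariance matrices are not.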

Track 6: Evaluation metrics by problem type

Metric selection errors cause wrong model conclusions even when pipelines run correctly.

  • For imbalanced classification, rely on precision/recall/F1 and PR-AUC rather than accuracy alone.
  • ROC-AUC measures ranking quality, while thresholded metrics reflect deployed operating points.
  • For regression, pair absolute and squared-error metrics to capture both average and large-error behavior.

Drill: Given one imbalanced classification output, choose a metric set and justify tradeoffs for a production-like threshold.
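
For the regression bullet, a small sketch of the MAE/RMSE pairing on invented predictions with one large miss:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical regression outputs: four small errors and one large miss on the last row.
y_true = np.array([10.0, 12.0, 11.0, 13.0, 50.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 20.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"mae={mae:.2f} rmse={rmse:.2f}")  # → mae=6.80 rmse=13.45
```

The single 30-unit miss roughly doubles RMSE relative to MAE, which is the "large-error behavior" the bullet refers to.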

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.

EDA decision ladder for exam cases

Use a fixed order when reading any dataset: integrity checks, distribution checks, relationship checks, then metric-aligned conclusions.

  • Start with schema, nulls, duplicates, and impossible ranges before charts.
  • Compare global summary with segment-level summary to avoid hidden subgroup effects.
  • Document assumptions before selecting tests or model metrics.

Hypothesis testing selection flow

Select tests using data type and assumptions first, then interpret p-value and effect size separately.

  • Numeric vs numeric two-group comparisons often start with t-test variants or nonparametric alternatives.
  • Categorical association checks commonly use chi-square style tests.
  • Significance decision without effect size or confidence context is incomplete.

Metric selection under class imbalance

In imbalanced classification, threshold-aware metrics and error-cost framing are more reliable than accuracy.

  • Precision/recall/F1 and PR-AUC usually carry more signal than raw accuracy.
  • ROC-AUC is ranking quality; operating threshold still needs business-cost reasoning.
  • Always state threshold, confusion matrix, and expected false-positive/false-negative impact.

Scenario Playbooks

Exam-style scenario explanations

Scenario A: High validation accuracy but poor production alert quality

A binary anomaly model shows strong validation accuracy, but production users report many missed anomalies. You need to re-evaluate analysis and metric strategy.

Architecture Diagram

[Raw Events] -> [EDA + Label Audit] -> [Train/Validation Split]
                                  |
                         [Threshold + Metrics]
                                  |
                         [Production Alerts]

Response Flow

  1. Audit class ratio and confirm whether validation split was stratified.
  2. Recompute confusion matrix at multiple thresholds and examine recall impact.
  3. Prioritize precision/recall tradeoff with documented business costs for misses.

Success Signals

  • Model review report includes threshold-specific confusion matrices.
  • Chosen threshold maps explicitly to false-negative tolerance.
  • Production alert quality improves on monitored sample window.

Threshold sweep for confusion matrix

python - <<'PY'
import numpy as np
from sklearn.metrics import confusion_matrix
# Placeholder arrays: substitute your own saved validation outputs here.
y_true = np.load('y_true.npy')
y_score = np.load('y_score.npy')
for t in [0.2, 0.4, 0.6, 0.8]:
    y_pred = (y_score >= t).astype(int)
    print(t, confusion_matrix(y_true, y_pred).ravel())
PY

Expected output (example)

0.2 [tn fp fn tp]
0.4 [tn fp fn tp]
...

Scenario B: Correlation appears strong but deployment decision is risky

A feature shows high correlation with target in one slice. You must decide whether it is stable enough for production decisions.

Architecture Diagram

[Feature Table] -> [Correlation Matrix] -> [Segment Stability Check]
                                       -> [Confounder Review]
                                       -> [Decision Memo]

Response Flow

  1. Compute correlation by segment/time window, not only globally.
  2. Check whether relationship persists after controlling for obvious confounders.
  3. Record why correlation is used as signal and not as causal proof.

Success Signals

  • Segment-level stability summary is included in final recommendation.
  • Potential confounders are explicitly listed and tested.
  • Decision memo avoids causal language unless justified.
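
The segment-stability check in the response flow can be sketched on synthetic data (the segment names and effect sizes below are illustrative, not from the scenario):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "segment": rng.choice(["A", "B", "C"], 3000),
    "feature": rng.normal(0, 1, 3000),
})
# The relationship exists only inside segment A.
df["target"] = np.where(df.segment == "A", 0.9 * df.feature, 0.0) + rng.normal(0, 1, 3000)

print("global corr:", round(df.feature.corr(df.target), 3))
corrs = {seg: round(g.feature.corr(g.target), 3) for seg, g in df.groupby("segment")}
print("by segment:", corrs)
```

The global figure blends a strong segment with two null ones, which is exactly the instability a decision memo should surface before any deployment call.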

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute command with explicit scope (node, interface, GPU set).
  3. Compare output against expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).

EDA integrity quick pass

Run a fast quality and distribution baseline before selecting any statistical test.

Schema/null/duplicate profile

python - <<'PY'
import pandas as pd
df = pd.read_parquet('dataset.parquet')
print(df.shape)
print(df.isna().mean().sort_values(ascending=False).head(10))
print('duplicates=', df.duplicated().sum())
PY

Expected output (example)

(1200000, 42)
feature_a    0.182
feature_b    0.104
duplicates= 3912

Distribution and tail check

python - <<'PY'
import pandas as pd
df = pd.read_parquet('dataset.parquet')
print(df['amount'].describe(percentiles=[0.5,0.9,0.99]))
PY

Expected output (example)

count ...
50% 42.1
90% 381.3
99% 3210.7
  • Run this before hypothesis testing so assumptions are evidence-based.
  • Save the output snapshot for exam-style justification.

Classification metric decision runbook

Compare threshold-dependent and threshold-independent metrics for imbalanced labels.

PR-AUC and ROC-AUC check

python - <<'PY'
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
# Placeholder arrays: substitute your own saved validation outputs here.
y_true = np.load('y_true.npy')
y_score = np.load('y_score.npy')
print('pr_auc=', average_precision_score(y_true, y_score))
print('roc_auc=', roc_auc_score(y_true, y_score))
PY

Expected output (example)

pr_auc= 0.41
roc_auc= 0.89

Thresholded precision/recall/F1

python - <<'PY'
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
# Placeholder arrays: substitute your own saved validation outputs here.
y_true = np.load('y_true.npy')
y_score = np.load('y_score.npy')
for t in [0.3, 0.5, 0.7]:
    y_pred = (y_score >= t).astype(int)
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
    print(t, round(p, 3), round(r, 3), round(f, 3))
PY

Expected output (example)

0.3 0.22 0.81 0.346
0.5 0.37 0.58 0.452
0.7 0.61 0.34 0.437
  • Use threshold table when the exam asks for deployment recommendation.
  • Tie final threshold to error cost asymmetry.

Common Problems

Failure patterns and fixes

Inflated validation score from leakage-prone split logic

Symptoms

  • Validation metrics drop sharply after first production run.
  • Feature engineering used global statistics before splitting.
  • Repeated IDs appear across train and validation sets.

Likely Cause

Split discipline was applied late, allowing future or target-adjacent information into training features.

Remediation

  • Rebuild pipeline with split-first policy and train-only fitting for all transforms.
  • Recompute validation metrics with group-aware or time-aware split if required.
  • Add leakage checks to every experiment report.

Prevention: Require a split and leakage checklist before approving model comparisons.
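
The split-first policy can be enforced mechanically with a scikit-learn pipeline, so transform statistics never see validation rows. A sketch on synthetic data (the dataset and model choice are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Split FIRST; every downstream transform is then fit on training rows only.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)  # scaler statistics computed from X_tr only, not the full dataset
print("validation accuracy:", round(pipe.score(X_va, y_va), 3))
```

Fitting the scaler inside the pipeline is what prevents the "global statistics before splitting" failure listed in the symptoms.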

Accuracy looks strong while minority-class recall is unacceptable

Symptoms

  • High accuracy with low detection rate for the positive class.
  • Confusion matrix shows large false-negative count.
  • PR-AUC is low despite decent ROC-AUC.

Likely Cause

Metric strategy over-relied on accuracy and default threshold, which masked minority-class failures.

Remediation

  • Switch to precision/recall/F1 and PR-AUC as primary evaluation metrics.
  • Tune decision threshold against business cost model.
  • Validate with stratified or class-aware cross-validation.

Prevention: Define primary metric family by class balance before any model training begins.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough A: Hypothesis testing with assumption checks

Select and defend the right statistical test for two-group comparison.

Prerequisites

  • Prepared dataset with target split into two groups.
  • Python environment with scipy installed.
  • Document template for assumption and decision logging.
  1. Profile both groups for shape and variance behavior.

    python - <<'PY'
    import pandas as pd
    from scipy import stats
    df=pd.read_parquet('dataset.parquet')
    a=df[df.grp==0]['metric']
    b=df[df.grp==1]['metric']
    print(stats.describe(a))
    print(stats.describe(b))
    PY

    Expected: You identify skew/outlier behavior and decide if parametric assumptions are acceptable.

  2. Run chosen test and record p-value plus effect size.

    python - <<'PY'
    import pandas as pd
    from scipy import stats
    # Recompute the groups here; heredocs do not share state with the previous step.
    df = pd.read_parquet('dataset.parquet')
    a = df[df.grp == 0]['metric']
    b = df[df.grp == 1]['metric']
    stat, p = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    print('p=', p)
    print('mean_delta=', b.mean() - a.mean())
    PY

    Expected: Decision report includes both significance and practical magnitude.

  3. Write final decision statement with assumptions and caveats.

    Expected: Statement clearly separates statistical evidence from operational recommendation.

Success Criteria

  • Test choice matches data type and assumption profile.
  • Effect-size context appears in the final answer.
  • Result can be reproduced by rerunning saved commands.

Walkthrough B: Threshold calibration for imbalanced classification

Choose a deployment threshold using confusion-matrix tradeoffs.

Prerequisites

  • Predicted probabilities on validation set.
  • Business guidance for false-negative and false-positive cost.
  • scikit-learn metrics utilities available.
  1. Sweep thresholds and capture precision, recall, and F1.

    python - <<'PY'
    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support
    # Placeholder arrays: substitute your own saved validation outputs here.
    y_true = np.load('y_true.npy')
    y_score = np.load('y_score.npy')
    for t in [x / 10 for x in range(1, 10)]:
        y_pred = (y_score >= t).astype(int)
        p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
        print(t, p, r, f)
    PY

    Expected: You obtain a threshold table that reveals tradeoff knees.

  2. Pick candidate threshold and verify confusion matrix impact.

    python - <<'PY'
    import numpy as np
    from sklearn.metrics import confusion_matrix
    # Placeholder arrays: substitute your own saved validation outputs here.
    y_true = np.load('y_true.npy')
    y_score = np.load('y_score.npy')
    t = 0.4
    y_pred = (y_score >= t).astype(int)
    print(confusion_matrix(y_true, y_pred))
    PY

    Expected: Chosen threshold aligns with acceptable miss rate and alert load.

  3. Publish threshold recommendation with rationale.

    Expected: Recommendation includes metric evidence plus cost-based reasoning.

Success Criteria

  • Recommendation references confusion matrix, not accuracy alone.
  • Threshold rationale ties directly to business error costs.
  • Team can rerun commands and reproduce reported numbers.

Study Sprint

10-day execution plan

Day 1: Dataset inventory and schema-quality audit (types, nulls, duplicates). Output: EDA setup notebook and baseline data-quality report.
Day 2: Univariate analysis with robust and classical statistics. Output: Feature summary sheet with mean/median/IQR comparisons.
Day 3: Bivariate analysis and relationship mapping. Output: Correlation map plus caveats and confounder notes.
Day 4: Distribution checks and sampling assumptions. Output: Distribution-assumption log for key variables.
Day 5: Hypothesis testing practice (parametric and nonparametric choices). Output: Decision table covering test selection, p-values, and effect-size interpretation.
Day 6: Classification metric deep dive on imbalanced labels. Output: Metric dashboard with confusion-matrix and PR/ROC analysis.
Day 7: Regression metric and residual diagnostics. Output: Residual-analysis memo and metric tradeoff summary.
Day 8: Anomaly and threshold calibration exercise. Output: Threshold strategy note with false-positive cost assumptions.
Day 9: Timed mini-case combining EDA and evaluation decisions. Output: End-to-end case writeup with defensible conclusions.
Day 10: Final revision and weak-area drills. Output: Exam-day cheat sheet for statistics and metrics decisions.

Hands-on Labs

Practical module work

Each lab includes a collapsed execution sample with representative CLI usage and expected output.

Lab A: Fast EDA triage

Produce a repeatable EDA checklist for tabular datasets under time pressure.

  • Profile nulls, duplicates, type consistency, and high-cardinality columns.
  • Create compact visuals for distribution shape and outlier risk.
  • Document top three data-quality risks before any modeling.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (EDA integrity quick pass)

python - <<'PY'
import pandas as pd
df = pd.read_parquet('dataset.parquet')
print(df.shape)
print(df.isna().mean().sort_values(ascending=False).head(10))
print('duplicates=', df.duplicated().sum())
PY

Expected output (example)

(1200000, 42)
feature_a    0.182
feature_b    0.104
duplicates= 3912

Lab B: Test-selection decision lab

Practice selecting and interpreting hypothesis tests correctly.

  • Write hypotheses in plain language before computing p-values.
  • Run at least one two-sample mean comparison and one categorical independence test.
  • Report both significance decision and practical significance.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Distribution and tail check)

python - <<'PY'
import pandas as pd
df = pd.read_parquet('dataset.parquet')
print(df['amount'].describe(percentiles=[0.5,0.9,0.99]))
PY

Expected output (example)

count ...
50% 42.1
90% 381.3
99% 3210.7

Lab C: Correlation discipline lab

Avoid common interpretation errors in covariance/correlation analysis.

  • Compute covariance and correlation matrices for selected features.
  • Identify at least one spurious or confounded relationship candidate.
  • Propose one validation step to test robustness of observed relationships.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Covariance and correlation matrices)

python - <<'PY'
import pandas as pd
df = pd.read_parquet('dataset.parquet')
num = df.select_dtypes('number')
print(num.cov().round(3))
print(num.corr().round(3))
PY

Expected output (example)

[covariance matrix in original feature units]
[correlation matrix with entries between -1 and 1]

Lab D: Metric strategy lab

Align evaluation metrics to business and deployment constraints.

  • Build confusion matrix and derive precision, recall, and F1 at multiple thresholds.
  • Compare ROC-AUC with PR-AUC for an imbalanced scenario.
  • Write a threshold recommendation tied to false-positive and false-negative costs.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Classification metric decision runbook)

python - <<'PY'
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
# Placeholder arrays: substitute your own saved validation outputs here.
y_true = np.load('y_true.npy')
y_score = np.load('y_score.npy')
for t in [0.3, 0.5, 0.7]:
    y_pred = (y_score >= t).astype(int)
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
    print(t, round(p, 3), round(r, 3), round(f, 3))
PY

Expected output (example)

0.3 0.22 0.81 0.346
0.5 0.37 0.58 0.452
0.7 0.61 0.34 0.437

Exam Pitfalls

Common failure patterns

  • Using accuracy as the primary metric in imbalanced classification tasks.
  • Treating statistically significant p-values as proof of practical impact.
  • Assuming normality without checking distribution behavior.
  • Interpreting correlation as causal evidence.
  • Using mean-only summaries on heavily skewed or outlier-heavy features.
  • Reporting a single model score without confidence context or threshold rationale.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. Which metric is usually more informative than accuracy for highly imbalanced binary classification?
  • A. Precision/recall family metrics
  • B. Only training loss
  • C. Feature count
  • D. R-squared

Answer: A

Precision and recall focus on minority-class performance and error tradeoffs that accuracy can hide.

Q2. What does covariance provide compared with correlation?
  • A. A unitless standardized relationship only
  • B. Relationship scaled in original feature units
  • C. Guaranteed causality
  • D. Class labels

Answer: B

Covariance reflects joint variation in original units, while correlation normalizes for comparability.

Q3. A p-value below alpha mainly supports which conclusion?
  • A. Reject the null hypothesis under the chosen test assumptions
  • B. The effect is always practically large
  • C. The model is production-ready
  • D. Causality is proven

Answer: A

Statistical significance is a decision under assumptions; it does not guarantee large real-world effect.

Q4. Why is median often preferred over mean for heavy-tailed distributions?
  • A. Median is more sensitive to extreme outliers
  • B. Median is robust to extreme values
  • C. Mean cannot be computed
  • D. Median implies normality

Answer: B

Median is less affected by extreme values and often better reflects central tendency in skewed data.

Q5. Which statement about ROC-AUC is most accurate?
  • A. It is a threshold-dependent confusion-matrix metric
  • B. It summarizes ranking quality across thresholds
  • C. It replaces calibration analysis completely
  • D. It applies only to regression

Answer: B

ROC-AUC evaluates separability over threshold ranges, not one fixed decision threshold.

Q6. What is the best first step before selecting a hypothesis test?
  • A. Choose any test and interpret later
  • B. Define hypotheses and check data assumptions
  • C. Force normality by dropping random rows
  • D. Skip exploratory analysis

Answer: B

Clear hypotheses and assumption checks prevent invalid test selection and interpretation.

Q7. Which statement correctly reflects correlation analysis?
  • A. Zero correlation always means no relationship
  • B. High correlation alone proves causal direction
  • C. Correlation quantifies association, not causation
  • D. Correlation can be computed only for categorical data

Answer: C

Correlation measures association strength and direction, but causal claims require additional evidence.

Q8. For regression evaluation, why combine MAE and RMSE?
  • A. They are identical metrics
  • B. MAE shows average absolute error while RMSE emphasizes larger errors
  • C. RMSE is classification-only
  • D. Neither depends on predictions

Answer: B

Using both reveals average error and sensitivity to large misses.

Q9. In exam scenarios, what is a common mistake in EDA reporting?
  • A. Listing assumptions and limitations
  • B. Showing only one aggregate summary and ignoring segments
  • C. Comparing multiple summaries
  • D. Checking missingness early

Answer: B

Single aggregate summaries can mask subgroup patterns and lead to weak conclusions.

Q10. When model threshold is business-critical, what should you do?
  • A. Use default 0.5 without review
  • B. Tune threshold using error-cost tradeoffs and validation behavior
  • C. Ignore false negatives
  • D. Replace metrics with training speed

Answer: B

Operational thresholds should reflect business cost asymmetry and validated performance.


Objectives

  • 1.1 Detect anomalies in time-series datasets.
  • 1.2 Conduct time-series analysis.
  • 1.3 Create and analyze graph data using GPU-accelerated tools such as cuGraph.
  • 1.4 Identify when dataset scale requires accelerated or distributed methods.
  • 1.5 Perform exploratory data analysis (EDA).
  • 1.6 Visualize time-series data effectively.
