
NCP-ADS Detailed Study Notes

Goal: make you decision-accurate under exam pressure. NCP-ADS is primarily a systems and performance reasoning exam for accelerated data workflows, not a math exam. These notes are a blueprint-aligned reasoning guide for those decisions.

6/6 blueprint domains covered · with expanded context and failure signatures

0. Foundation: GPU-Accelerated Thinking

In exam scenarios, ask these four questions first: where compute occurs, where data lives, what the dominant bottleneck is, and what scaling limit appears first.

ASCII diagram: memory and data movement path

[Disk/Obj Storage]
      |
      v
[CPU RAM] --(PCIe/NVLink transfer)--> [GPU VRAM]
      |                                   |
      |                              [SM / Registers / Shared Mem]
      +---- ETL fallback or serialization ----+

Rule: Data movement is often more expensive than arithmetic.
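A quick back-of-envelope calculation makes this rule concrete. The bandwidth and throughput figures below are illustrative assumptions (roughly PCIe Gen4 effective bandwidth and a modest sustained GPU throughput), not measurements:

```python
# Back-of-envelope: moving data often costs more than computing on it.
# Bandwidth/throughput numbers are illustrative assumptions.

def transfer_seconds(bytes_moved: float, bandwidth_gbps: float) -> float:
    """Time to move data over a link at the given bandwidth (GB/s)."""
    return bytes_moved / (bandwidth_gbps * 1e9)

def compute_seconds(flops: float, throughput_tflops: float) -> float:
    """Time to execute the given FLOPs at a sustained throughput (TFLOP/s)."""
    return flops / (throughput_tflops * 1e12)

n = 1_000_000_000                      # one billion float32 values (~4 GB)
pcie = transfer_seconds(4 * n, 25.0)   # assumed ~25 GB/s effective PCIe
add = compute_seconds(1 * n, 10.0)     # one FLOP per element at 10 TFLOP/s

print(f"transfer: {pcie:.3f}s  compute: {add:.4f}s  ratio: {pcie/add:.0f}x")
```

Even with generous assumptions, a single elementwise pass is orders of magnitude cheaper than the transfer that feeds it, which is why keeping data resident on the GPU dominates so many exam answers.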

1. Data Manipulation and Software Literacy (20%)

The most NVIDIA-specific domain: cuDF, Dask-cuDF, partitioning, shuffle behavior, and RMM stability.

1.1 cuDF vs pandas

  • WHAT: pandas is a CPU DataFrame library; cuDF is a GPU DataFrame library with a pandas-like API.
  • WHY: column-parallel transforms, filters, joins, and groupby operations map well to GPU throughput.
  • WHEN: use cuDF when data fits VRAM (or can be chunked), operations are vectorizable, and throughput is the goal.
  • FAILURES: heavy row-wise UDFs, branch divergence, tiny datasets where launch overhead dominates.

1.2 Dask-cuDF scale-out

  • WHAT: partitioned cuDF datasets scheduled across workers/GPUs.
  • WHY: one GPU VRAM is limited; distributed partitioning enables bigger datasets.
  • WHEN: dataset exceeds single-GPU memory or SLA needs multi-GPU ETL throughput.
  • FAILURES: too many tiny partitions (scheduler overhead), 4→8 GPU non-scaling (communication/shuffle bound), memory spikes on join/groupby (shuffle explosion).

1.3 Partition strategy and shuffle

Partition size drives parallelism, memory pressure, and orchestration cost. Too small means overhead. Too large means OOM risk. Goldilocks partitions keep utilization and memory stable.
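That trade-off can be reduced to simple arithmetic: divide the dataset by a target partition size, then round the count up to a multiple of the GPU count so no worker idles. The 2 GB target below is an assumed illustration, not an official recommendation:

```python
import math

def plan_partitions(dataset_bytes: int, target_partition_bytes: int,
                    n_gpus: int) -> int:
    """Pick a partition count near the target size, rounded up to a
    multiple of the GPU count so every worker stays busy."""
    raw = math.ceil(dataset_bytes / target_partition_bytes)
    return math.ceil(raw / n_gpus) * n_gpus

# Illustrative: 200 GB dataset, ~2 GB target partitions, 8 GPUs.
nparts = plan_partitions(200 * 2**30, 2 * 2**30, 8)
print(nparts)  # 104
```

Shrinking the target pushes toward scheduler overhead; growing it pushes toward OOM, exactly the failure pair named above.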

ASCII diagram: shuffle amplification

Before shuffle:
GPU0:[A A A] GPU1:[B B B] GPU2:[C C C] GPU3:[D D D]

After key-based shuffle:
GPU0:[A B C D] GPU1:[A B C D] GPU2:[A B C D] GPU3:[A B C D]

Cost drivers:
- all-to-all communication
- buffer materialization
- memory spikes + retries
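A rough estimate of the all-to-all traffic follows from the diagram: with keys spread evenly, each GPU keeps about 1/N of its rows and ships the rest. This is a simplified model that ignores skew and compression:

```python
def shuffle_bytes_moved(total_bytes: float, n_gpus: int) -> float:
    """Approximate network traffic for an even all-to-all shuffle:
    each GPU keeps ~1/N of its rows and sends the other (N-1)/N."""
    return total_bytes * (n_gpus - 1) / n_gpus

moved = shuffle_bytes_moved(64e9, 4)  # 64 GB dataset across 4 GPUs
print(moved / 1e9)  # 48 GB crosses the interconnect
```

Note the fraction grows toward 100% as GPU count rises, which is one reason shuffle-heavy joins resist scaling from 4 to 8 GPUs.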

1.4 RMM and fragmentation

Intermittent OOM with visible free VRAM is a classic fragmentation pattern. RMM pooling reduces allocator churn, improves repeatability, and stabilizes multi-stage pipelines.
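A toy free-list model (not RMM internals) shows why "OOM with free VRAM" happens: total free memory exceeds the request, yet no single contiguous gap can satisfy it. A pool allocator such as RMM's sidesteps this by sub-allocating from one large upfront reservation:

```python
# Toy model of fragmentation, not a real allocator.
free_regions = [256, 128, 256, 64]   # MB of free gaps between live buffers
request = 512                        # MB needed by the next allocation

total_free = sum(free_regions)       # 704 MB "free"
largest_gap = max(free_regions)      # 256 MB contiguous at best

print(total_free >= request)   # True  -> monitoring shows enough memory
print(largest_gap >= request)  # False -> the allocation still fails
```

This is the signature to recognize on the exam: intermittent failures, free memory reported, and stability restored by enabling pooling rather than by shrinking the workload.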

2. Machine Learning (20%)

Focus on memory model, parallelism choice, communication scaling, and profiling interpretation.

2.1 Training memory model

  • Parameters
  • Gradients (roughly parameter scale)
  • Optimizer state (for Adam often ~2x parameters)
  • Activations (scales with batch size/depth)

OOM before first step usually indicates model/optimizer footprint. OOM during steps often indicates activation or batch pressure.
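The bullet list above turns into a simple estimator. The 1-billion-parameter FP32 example is illustrative; activations are left as an input because they depend on batch size and architecture:

```python
def training_memory_gb(n_params: float, bytes_per_param: int = 4,
                       adam: bool = True, activations_gb: float = 0.0) -> float:
    """Rough training footprint: weights + gradients + optimizer state
    (+ activations). Adam keeps ~two extra states per parameter."""
    weights = n_params * bytes_per_param
    grads = n_params * bytes_per_param                  # ~parameter scale
    opt = 2 * n_params * bytes_per_param if adam else 0  # Adam m and v
    return (weights + grads + opt) / 2**30 + activations_gb

# Illustrative: 1B-parameter model in FP32 with Adam, before activations.
static_gb = training_memory_gb(1e9)
print(round(static_gb, 1))  # ~14.9 GB before a single activation
```

If that static figure already exceeds VRAM, the run fails before step one; if it fits but training still OOMs mid-step, suspect activations and batch size.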

2.2 Parallelism and AllReduce

  • Data parallel: model replicated; gradient sync through collectives.
  • Model parallel: model split when single GPU memory is insufficient.
  • Pipeline parallel: staged execution across GPUs with microbatches.

Scaling collapse beyond N GPUs usually indicates communication overhead. Interconnect quality (NVLink, InfiniBand/RDMA) becomes the key variable.
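For intuition, the per-GPU traffic of the common ring AllReduce is roughly 2(N-1)/N times the gradient size, a standard approximation that ignores latency terms:

```python
def ring_allreduce_bytes_per_gpu(grad_bytes: float, n_gpus: int) -> float:
    """Per-GPU traffic for ring AllReduce: 2*(N-1)/N times the gradient
    size. Bandwidth cost plateaus as N grows, but latency terms and a
    slow fabric still erode scaling."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes

for n in (2, 4, 8):  # 4 GB of gradients, illustrative
    print(n, ring_allreduce_bytes_per_gpu(4e9, n) / 1e9, "GB")
```

The bandwidth term plateauing (4, 6, 7 GB here) is why interconnect quality, not GPU count, becomes the decisive variable once scaling stalls.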

2.3 Mixed precision and profiling

  • AMP improves throughput and often lowers memory footprint.
  • Accuracy regressions after FP16 often need loss scaling or stability tuning.
  • Profilers are for bottleneck classification, not just screenshots.
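The usual fix for FP16 accuracy regressions is dynamic loss scaling. The sketch below captures the control logic; the class name, growth factors, and interval are illustrative, not any specific framework's API:

```python
# Sketch of dynamic loss scaling (illustrative names and constants).
class LossScaler:
    def __init__(self, scale=2.0**16, growth=2.0, backoff=0.5,
                 growth_interval=2000):
        self.scale = scale
        self.growth = growth
        self.backoff = backoff
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:                # overflow: skip step, shrink scale
            self.scale *= self.backoff
            self._good_steps = 0
        else:                        # stable: grow the scale periodically
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                self.scale *= self.growth

scaler = LossScaler()
scaler.update(found_inf=True)
print(scaler.scale)  # 32768.0 after backing off once
```

The scale multiplies the loss so small FP16 gradients stay above the underflow threshold, then gradients are divided back down before the optimizer step.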

3. Data Analysis (15%)

Easy scoring domain if you keep concepts practical: EDA purpose, time-series framing, anomaly categories, and graph analytics fit.

  • Statistical thresholding and density-based outlier detection patterns.
  • Time-series anomaly framing with seasonality and spikes.
  • Graph analytics when data is naturally nodes + edges.
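The simplest statistical-thresholding pattern from the first bullet is a z-score cutoff, sketched here in plain Python (the threshold and data are illustrative):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the
    mean -- the simplest statistical-thresholding detector."""
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mu) / sd > threshold]

series = [10, 11, 9, 10, 12, 10, 11, 95]  # one obvious spike
print(zscore_outliers(series, threshold=2.0))  # [95]
```

Density-based methods replace the global mean/std with local neighborhood statistics, which is the right follow-up when anomalies hide inside multi-modal data.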

4. Data Preparation (15%)

Most common trap: leakage and transform order.

  1. Split first.
  2. Fit transforms on train only.
  3. Apply train-derived transforms to train/test.
  • Imbalance mitigation: stratification, resampling, class weights.
  • Storage choice: Parquet for columnar analytics and GPU workflows.
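The split-first ordering in steps 1-3 can be sketched with a standardizer; fitting it on the full dataset instead would leak test statistics into training:

```python
import statistics

def fit_standardizer(train):
    """Fit mean/std on TRAIN ONLY; fitting on all data is leakage."""
    return statistics.fmean(train), statistics.stdev(train)

def apply_standardizer(values, mu, sd):
    return [(x - mu) / sd for x in values]

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train, test = data[:4], data[4:]              # 1) split first
mu, sd = fit_standardizer(train)              # 2) fit on train only
train_z = apply_standardizer(train, mu, sd)   # 3) apply the train-derived
test_z = apply_standardizer(test, mu, sd)     #    transform to both splits
print(mu, round(sd, 4))
```

Note the test values standardize to scores outside the train range; that is correct behavior, not a bug, because the transform must never see test statistics.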

5. GPU and Cloud Computing (20%)

  • Driver/runtime compatibility is the first check for CUDA container errors.
  • GPU visibility failures usually indicate container runtime configuration gaps.
  • MIG is for isolation/QoS and multi-tenant inference predictability.
  • Instance selection starts with VRAM fit, then interconnect, then cost/performance.

6. MLOps (10%)

Triton reasoning patterns appear frequently.

  • Dynamic batching: higher throughput, potential latency increase.
  • Cache repeated inference requests to reduce compute and p99 latency.
  • Inference memory model excludes gradients/optimizer state; focus on weights + buffers.
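The dynamic-batching trade-off can be modeled with two terms: batching amortizes fixed per-launch overhead (throughput up), but the last request in a batch also waits in the queue (tail latency up). This is a toy model, not Triton's actual scheduler, and all timings are assumed:

```python
def dynamic_batch_stats(batch_size: int, per_item_ms: float,
                        batch_overhead_ms: float, max_queue_wait_ms: float):
    """Toy dynamic-batching model: fixed overhead is amortized across
    the batch, while queue wait adds to worst-case latency."""
    service_ms = batch_overhead_ms + batch_size * per_item_ms
    throughput = batch_size / (service_ms / 1000)   # requests/sec
    worst_latency = max_queue_wait_ms + service_ms  # last arrival in batch
    return throughput, worst_latency

print(dynamic_batch_stats(1, 2.0, 8.0, 0.0))   # (100.0, 10.0)
print(dynamic_batch_stats(8, 2.0, 8.0, 5.0))   # (~333.3, 29.0)
```

Throughput more than triples while worst-case latency nearly triples too, which is exactly the exam pattern: pick dynamic batching for throughput SLAs, cap queue delay for latency SLAs.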

7. Benchmarking (Cross-Cutting)

  • Control model version, precision, batch, concurrency, data, and software stack.
  • Exclude warm-up iterations from comparisons.
  • Choose metric by goal: throughput vs p50/p90/p99 latency.
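Both rules combine into a small reporting helper: drop the warm-up iterations first, then compute percentiles by nearest rank over the steady-state samples (the sample data is illustrative):

```python
def latency_report(samples_ms, warmup=10):
    """Drop warm-up iterations, then report p50/p99 by nearest rank."""
    steady = sorted(samples_ms[warmup:])
    def pct(p):
        idx = max(0, -(-p * len(steady) // 100) - 1)  # ceil(p*n/100) - 1
        return steady[idx]
    return pct(50), pct(99)

# The first iterations include JIT/cache warm-up and would skew results.
samples = [50, 40, 30] + [10] * 95 + [12, 20]
p50, p99 = latency_report(samples, warmup=3)
print(p50, p99)  # 10 20
```

Had the warm-up samples been included, the tail percentiles would report compilation cost rather than steady-state serving behavior, which is the comparison mistake the bullet warns against.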

8. Master Decision Framework (Exam Cheat Logic)

ASCII decision tree

1) Identify workload
   -> ETL/DataFrame  -> cuDF/Dask/shuffle/partitions/RMM
   -> Training       -> batch/activations/parallelism/comms
   -> Inference      -> Triton/batching/cache/latency

2) Classify bottleneck
   -> GPU high, CPU low            => compute-bound
   -> GPU low, CPU high            => pipeline-bound
   -> scale stalls with more GPUs  => communication-bound
   -> join/groupby memory spikes   => shuffle-bound
   -> intermittent OOM + free VRAM => fragmentation

3) Apply optimization
   -> compute-bound       => AMP, larger batch, faster GPU, kernel efficiency
   -> pipeline-bound      => prefetch, caching, storage, worker tuning
   -> communication-bound => better fabric/interconnect, adjust parallelism
   -> shuffle-bound       => repartition, key design, pre-aggregation
   -> fragmentation       => RMM pooling and allocator stabilization
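The classification step of the tree above can be written as a priority-ordered function. This is a sketch of the decision order for exam reasoning, not a real diagnostic tool, and the utilization thresholds are assumptions:

```python
def classify_bottleneck(gpu_util, cpu_util, scales_with_gpus,
                        shuffle_spikes, oom_with_free_vram):
    """Map observed symptoms to the bottleneck classes in the decision
    tree, checking the most specific signatures first."""
    if oom_with_free_vram:
        return "fragmentation"
    if shuffle_spikes:
        return "shuffle-bound"
    if not scales_with_gpus:
        return "communication-bound"
    if gpu_util < 0.3 and cpu_util > 0.8:
        return "pipeline-bound"
    return "compute-bound"

print(classify_bottleneck(0.95, 0.2, True, False, False))  # compute-bound
print(classify_bottleneck(0.2, 0.9, True, False, False))   # pipeline-bound
```

Checking the distinctive signatures (fragmentation, shuffle spikes) before the generic utilization split mirrors how the exam scenarios are usually disambiguated.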

Blueprint Coverage Checklist

Every objective from the current NCP-ADS blueprint metadata is listed below and mapped to this guide + module deep dives.

Data Analysis (15%)

  • Detect anomalies in time-series datasets.
  • Conduct time-series analysis.
  • Create and analyze graph data using GPU-accelerated tools such as cuGraph.
  • Identify when dataset scale requires accelerated or distributed methods.
  • Perform exploratory data analysis (EDA).
  • Visualize time-series data effectively.

Data Manipulation and Software Literacy (20%)

  • Design and implement accelerated ETL (extract, transform, load) workflows.
  • Implement caching strategies to reduce shuffle overhead.
  • Use distributed data processing frameworks for large-scale datasets.
  • Implement Dask-based data parallelism for multi-GPU scaling.
  • Profile deep learning workloads with tools such as DLProf.
  • Choose optimal data processing libraries for varying dataset sizes and workloads.

Data Preparation (15%)

  • Perform data cleansing and preprocessing with cuDF and pandas.
  • Transform and standardize data for model readiness.
  • Apply standardization to ensure feature uniformity where required.
  • Generate synthetic data for augmentation using cuDF and RAPIDS.
  • Identify and acquire suitable datasets for the task.
  • Monitor processing pipelines to recognize bottlenecks.
  • Process, organize, and store datasets for downstream use.

GPU and Cloud Computing (20%)

  • Analyze graph data with GPU-accelerated tools such as cuGraph.
  • Optimize data science performance through GPU acceleration.
  • Describe, follow, and execute CRISP-DM process steps.
  • Use dependency management tools such as Docker and Conda to handle versioning conflicts.
  • Determine optimal data type choices for feature columns.
  • Design and implement benchmarks to compare framework performance.

Machine Learning (20%)

  • Perform feature engineering for model development.
  • Identify when data scale or workload profile requires acceleration.
  • Run rapid experiments to balance accuracy and inference performance.
  • Optimize machine learning hyperparameters.
  • Train models and compare single-GPU versus multi-GPU strategies.
  • Apply GPU memory optimization techniques such as batching and mixed precision.
  • Determine optimal data type choices for each feature.
  • Assess and verify dataset memory footprint.
  • Compare required memory against available device memory.
  • Benchmark and optimize GPU-accelerated workflows.
  • Deploy and monitor models in production.