
Data Manipulation and Software Literacy

Module study guide

Priority 1 of 6 · Domain 2 in exam order

Scope

Exam study content

This module contains expanded study notes, practical drills, and an exam-style question set.

Exam weight: 20%
Priority tier: Tier 1
Why this domain: RAPIDS, Dask, scaling, and memory-heavy NVIDIA workflows.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes without reproducible benefit.
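The single-variable discipline above can be captured in a small helper; the 2% noise threshold is an assumed figure for illustration, not exam guidance:

```python
def single_variable_delta(baseline_s, candidate_s, threshold=0.02):
    """Keep a change only when before/after runs show benefit beyond an assumed noise floor."""
    improvement = (baseline_s - candidate_s) / baseline_s
    return ("keep" if improvement > threshold else "discard", round(improvement, 3))

# Same workload input on both runs, one parameter changed between them
print(single_variable_delta(420.7, 260.4))  # ('keep', 0.381)
print(single_variable_delta(100.0, 99.0))   # ('discard', 0.01)
```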

Exam Scope Coverage

What this module now covers

This module is designed as a structured study path for Data Manipulation and Software Literacy, focused on RAPIDS, Dask, scaling, memory control, and software literacy for GPU data workflows.

Track 1: cuDF vs cudf.pandas decision making

Exam scenarios often test whether you can choose the right API path for both speed and compatibility.

  • cudf.pandas is the fastest way to accelerate existing pandas code with minimal refactor.
  • cuDF direct API usually gives tighter control and higher peak performance when your workflow is fully GPU-compatible.
  • cudf.pandas can fall back to CPU pandas for unsupported paths, and each fallback can add host-device transfer overhead.

Drill: Take one pandas notebook, run with cudf.pandas, profile fallback-heavy steps, then rewrite only hot steps in cuDF.
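A toy decision rule for this track (the 5% fallback and 8-hour thresholds are invented for illustration, not NVIDIA guidance):

```python
def choose_api(fallback_fraction, refactor_budget_hours):
    """Toy API-path chooser; thresholds are assumptions, not official guidance."""
    if fallback_fraction < 0.05:
        return "cudf.pandas"                    # near-full GPU coverage: keep zero-code-change path
    if refactor_budget_hours >= 8:
        return "rewrite hotspots in cuDF"       # budget exists to refactor measured hot steps
    return "cudf.pandas, monitor fallbacks"     # accept fallbacks until a refactor window opens

print(choose_api(0.02, 2))    # cudf.pandas
print(choose_api(0.30, 16))   # rewrite hotspots in cuDF
```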

Track 2: Dask-cuDF architecture and scaling model

The domain is heavily NVIDIA-flavored; multi-GPU and distributed patterns are core exam scope.

  • Dask-cuDF extends Dask DataFrame with cuDF partitions; multi-GPU requires a distributed cluster.
  • Use Dask-CUDA and LocalCUDACluster for GPU pinning, diagnostics, and memory controls.
  • Prefer the portable dask.dataframe API with backend set to cudf, and convert only when needed with to_backend.

Drill: Deploy a 2-GPU LocalCUDACluster, run groupby and join workloads, and capture dashboard screenshots for bottleneck analysis.
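The partition-then-combine model behind Dask DataFrame groupby can be sketched with a CPU-only stand-in (plain concurrent.futures here, not Dask or cuDF):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
partitions = [rows[0::2], rows[1::2]]        # two "workers", each holding one partition

def partial_groupby_sum(part):
    """Per-partition aggregation, analogous to the map stage of a tree reduction."""
    acc = Counter()
    for key, value in part:
        acc[key] += value
    return acc

with ThreadPoolExecutor(max_workers=2) as ex:
    partials = list(ex.map(partial_groupby_sum, partitions))

total = Counter()                            # combine stage: merge partial results
for p in partials:
    total.update(p)
print(dict(total))   # {'a': 9, 'b': 6}, identical to a single-pass groupby sum
```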

Track 3: Partitioning, shuffle, and execution semantics

Most performance regressions come from partition shape and expensive all-to-all operations.

  • Typical partition-size target is between 1/32 and 1/8 of single-GPU memory.
  • Shuffle-heavy jobs often do best closer to 1/32-1/16; larger partitions can increase OOM risk.
  • Avoid calling compute on large collections; persist keeps data distributed and avoids client-side concatenation.

Drill: Benchmark one pipeline with three partition sizes and compare runtime, spill behavior, and worker memory headroom.
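The 1/32 to 1/8 heuristic above is simple arithmetic per GPU; a quick calculator (the 40 GiB device size is just an example):

```python
def partition_targets(gpu_mem_bytes, fractions=(1/32, 1/16, 1/8)):
    """Candidate partition sizes as fractions of single-GPU memory."""
    return {f"1/{int(1 / f)}": int(gpu_mem_bytes * f) for f in fractions}

for label, size in partition_targets(40 * 1024**3).items():
    print(label, size // 1024**2, "MiB")
# 1/32 -> 1280 MiB, 1/16 -> 2560 MiB, 1/8 -> 5120 MiB
```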

Track 4: Memory management and spill strategy

Memory pressure is a top exam and production concern in GPU data pipelines.

  • Initialize an RMM pool per worker (for example, rmm_pool_size at 0.8-0.9 of device memory) to reduce allocation overhead.
  • Enable cuDF spill for ETL-style workflows; evaluate JIT-unspill for mixed DataFrame and array workloads.
  • Minimize repeated CPU-GPU backend switching, because movement overhead can erase acceleration gains.

Drill: Run a memory-stressed join with and without RMM pool and spill enabled, then compare spill counts and total runtime.
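A back-of-envelope check for whether a partition fits the pool headroom (the 40 GiB device, 85% pool, and 10% headroom below are assumed figures):

```python
def needs_spill(partition_bytes, pool_bytes, headroom=0.1):
    """True when a partition would breach the pool's assumed safety headroom."""
    return partition_bytes > pool_bytes * (1 - headroom)

pool = int(40 * 1024**3 * 0.85)            # hypothetical 40 GiB GPU with an 85% RMM pool
print(needs_spill(5 * 1024**3, pool))      # False: a 1/8-sized partition fits
print(needs_spill(38 * 1024**3, pool))     # True: oversized partition forces spill or OOM
```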

Track 5: I/O path optimization (Parquet and JSON)

Ingest is often the bottleneck before transformation or modeling begins.

  • Parquet is usually preferred for column projection and predicate pushdown in distributed workflows.
  • For JSON Lines, use lines=True, and choose byte-range or multi-source read patterns based on the file-size profile.
  • String-heavy Parquet workloads can benefit from encoding and compression tuning (for example ZSTD and selective delta encodings).

Drill: Create one Parquet write/read benchmark and one JSON Lines benchmark, then document throughput and CPU utilization.
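Why projection matters can be shown with back-of-envelope arithmetic (row width and column counts below are hypothetical; real savings depend on per-column sizes and compression):

```python
def projected_fraction(n_selected, n_total):
    """Fraction of bytes touched, assuming uniform column widths (a simplification)."""
    return n_selected / n_total

row_bytes = 120                  # hypothetical average row width
n_rows = 18_400_213              # row count from the Parquet benchmark above
full_scan = row_bytes * n_rows
projected = full_scan * projected_fraction(3, 12)   # read 3 of 12 columns
print(f"full={full_scan / 1e9:.2f} GB projected={projected / 1e9:.2f} GB")
# full=2.21 GB projected=0.55 GB
```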

Track 6: Software literacy for reproducible pipelines

The exam tests practical stack literacy, not only DataFrame syntax.

  • Conda environments should be reproducible from environment.yml and solved with pinned core dependencies.
  • Docker images package runtime, dependencies, and entrypoints for consistent dev-test-prod behavior.
  • Spark RAPIDS and Databricks integrations can accelerate SQL/DataFrame paths with fallback to CPU when operators are unsupported.

Drill: Package one RAPIDS workflow in Docker and run it in a fresh environment to validate reproducibility.
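A hedged sketch of the environment.yml piece; the channel names follow the public RAPIDS install pattern, and the version pins are placeholders to replace with your target release:

```yaml
# Illustrative environment.yml; pins are placeholders, not official recommendations
name: rapids-etl
channels:
  - rapids
  - conda-forge
  - nvidia
dependencies:
  - python=3.11
  - rapids=24.04        # RAPIDS metapackage (cuDF, Dask-CUDA, RMM, ...)
  - cuda-version=12.2
```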

Module Resources

Downloads and quick links

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.
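The evidence standard above maps directly onto a small record helper (the field names here are a suggested shape, not a mandated schema):

```python
import json
from datetime import datetime, timezone

def record_evidence(command, output, expected, observed):
    """One evidence record: command, output, timestamp, expected vs observed, decision."""
    return json.dumps({
        "command": command,
        "output": output,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "expected": expected,
        "observed": observed,
        "decision": "pass" if expected == observed else "investigate",
    })

print(record_evidence("nvidia-smi -L", "GPU 0: ...", "2 GPUs listed", "2 GPUs listed"))
```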

Decision model: cudf.pandas first, cuDF for hotspots

For exam scenarios, the fastest migration path is usually `cudf.pandas` for compatibility, followed by selective direct cuDF optimization where fallback or memory pressure remains.

  • Start with least-disruptive acceleration path to establish baseline.
  • Profile unsupported operations and repeated fallback transitions.
  • Refactor only measured hotspots into native cuDF or dask-cuDF patterns.

Distributed execution discipline with Dask-cuDF

Most failures come from partitioning and execution semantics, not DataFrame syntax.

  • Choose partition sizes by workload type; shuffle-heavy jobs often need smaller partitions.
  • Prefer `persist()` for large distributed collections and delay `compute()` until final reduction.
  • Control worker memory with RMM pool and spill settings before scaling up.

I/O and data-type literacy for GPU pipelines

Parquet and JSON ingestion choices directly impact throughput and memory stability.

  • Use Parquet for projection/predicate pushdown in analytics-heavy ETL.
  • Use JSON Lines patterns (`lines=True`, byte-range for large files) to avoid parser bottlenecks.
  • Validate cuDF dtypes and null behavior before joins/groupby to prevent hidden casts and spill.
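The dtype/null audit in the last bullet can be prototyped backend-free (a toy audit over plain tuples; cuDF's actual dtype introspection is richer):

```python
def column_dtype_report(rows, columns):
    """Flag columns with mixed Python types or nulls before join/groupby."""
    report = {}
    for i, name in enumerate(columns):
        types = {type(r[i]).__name__ for r in rows}
        report[name] = {"types": sorted(types), "has_null": "NoneType" in types}
    return report

rows = [(1, "x", 0.5), (2, "y", None), ("3", "z", 1.5)]   # note the stray string "3"
print(column_dtype_report(rows, ["id", "label", "value"]))
# "id" mixes int and str; "value" carries a null: both invite hidden casts before a join
```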

Scenario Playbooks

Exam-style scenario explanations

Scenario A: ETL pipeline is faster on CPU than GPU after migration

A pandas ETL job was switched to `cudf.pandas`, but runtime improved only slightly and sometimes regressed. You need to isolate root cause quickly.

Architecture Diagram

[Raw Files] -> [cudf.pandas pipeline] -> [Dask scheduler/workers] -> [Parquet output]
                     |
          [Fallback + transfer hotspots]

Response Flow

  1. Profile pipeline stages to identify unsupported operations and fallback-heavy paths.
  2. Move high-cost operations into native cuDF APIs or backend-safe operations.
  3. Re-run with controlled partition size and memory settings (RMM pool + spill).

Success Signals

  • CPU fallback frequency decreases in hotspot stages.
  • Worker spill events are controlled and no OOM occurs.
  • End-to-end runtime improves against baseline in repeat runs.

GPU fallback and runtime check

python - <<'PY'
import cudf.pandas
cudf.pandas.install()  # must run before pandas is imported
import time, pandas as pd
start = time.time()
# run existing ETL notebook entrypoint here
print('elapsed_s=', round(time.time() - start, 2))
PY

Expected output (example)

elapsed_s= 214.37

Scenario B: Multi-GPU groupby job fails with intermittent OOM

A Dask-cuDF job crashes during shuffle-heavy groupby steps on larger batches. You need a stable tuning sequence.

Architecture Diagram

[Scheduler]
   |
[GPU Worker 0] [GPU Worker 1] ... [GPU Worker N]
   |                |
 [RMM Pool + Spill Control + Shuffle]

Response Flow

  1. Reduce partition target to safer range for shuffle-heavy phase.
  2. Enable RMM pool and cuDF spill; verify settings at worker startup.
  3. Persist intermediate distributed results and avoid premature global `compute()`.

Success Signals

  • No worker exits from OOM in repeated test window.
  • Shuffle phase runtime stabilizes across runs.
  • Dashboard shows balanced partition memory usage.

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  • 1. Capture baseline state before running any intrusive command.
  • 2. Execute command with explicit scope (node, interface, GPU set).
  • 3. Compare output against expected baseline signature.
  • 4. Record timestamp and decision (pass, investigate, remediate).

Cluster bring-up and memory controls

Start a stable local multi-GPU Dask-cuDF session with explicit memory behavior.

Launch LocalCUDACluster with pool and spill

python - <<'PY'
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster(rmm_pool_size=0.85, enable_cudf_spill=True)  # pool as fraction of device memory
client = Client(cluster)
print(client)
PY

Expected output (example)

<Client: 'tcp://127.0.0.1:8786' processes=4 threads=4>

Backend portability with dask.dataframe

python - <<'PY'
import dask.dataframe as dd
import dask
with dask.config.set({'dataframe.backend':'cudf'}):
    ddf = dd.read_parquet('data/*.parquet')
    print(ddf.head())
PY

Expected output (example)

Returns first rows using cuDF backend where supported.
  • Use this baseline before workload tuning so failures are not caused by implicit defaults.
  • Keep startup config in runbook for reproducible exam-style reasoning.

I/O optimization quick checks

Validate Parquet and JSON ingestion behavior before transformation tuning.

Parquet projection benchmark

python - <<'PY'
import cudf, time
t=time.time()
gdf=cudf.read_parquet('data/*.parquet', columns=['id','ts','value'])
print('rows=', len(gdf), 'elapsed_s=', round(time.time()-t,2))
PY

Expected output (example)

rows= 18400213 elapsed_s= 9.84

JSON Lines ingest benchmark

python - <<'PY'
import cudf, time
t=time.time()
gdf=cudf.read_json('logs/*.jsonl', lines=True)
print('rows=', len(gdf), 'elapsed_s=', round(time.time()-t,2))
PY

Expected output (example)

rows= 9023310 elapsed_s= 14.21
  • Record both throughput and memory profile, not runtime alone.
  • Use consistent input snapshots when comparing formats.

Common Problems

Failure patterns and fixes

Repeated backend switching erases acceleration gains

Symptoms

  • Frequent `to_backend` conversions between pandas and cuDF.
  • GPU utilization remains low despite GPU-enabled environment.
  • Runtime improves only marginally after migration.

Likely Cause

Pipeline mixes CPU-only operations in high-frequency stages, causing excessive host-device transfers.

Remediation

  • Identify CPU-only hotspots with profiling and replace with cuDF-native operations where possible.
  • Keep workflow on one backend through major transformation stages.
  • Isolate unavoidable CPU operations to final, low-volume steps.

Prevention: Require backend-transition review in code review for each ETL change.
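A backend-transition count is easy to automate in that review (a toy model over an ordered stage plan; the stage labels are hypothetical):

```python
def backend_transitions(stages):
    """Count CPU<->GPU switches in an ordered per-stage backend plan."""
    return sum(1 for a, b in zip(stages, stages[1:]) if a != b)

plan = ["gpu", "gpu", "cpu", "gpu", "cpu"]   # hypothetical per-stage backend assignment
print(backend_transitions(plan))             # 3 switches, each a transfer-cost review candidate
```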

Dask-cuDF OOM during shuffle-intensive joins

Symptoms

  • Workers restart or fail during repartition/join phases.
  • Memory spikes coincide with all-to-all shuffle stage.
  • `compute()` call triggers large client-memory pressure.

Likely Cause

Partitions are too large for shuffle profile and memory controls were not configured up front.

Remediation

  • Reduce partition size toward shuffle-safe range and retest.
  • Enable RMM pool and spill; verify with worker diagnostics.
  • Persist distributed intermediate results and delay `compute()`.

Prevention: Set partition/memory defaults in cluster bootstrap templates before production runs.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough A: Migrate pandas ETL to GPU with measured optimization

Deliver stable speedup while preserving result parity.

Prerequisites

  • Existing pandas ETL notebook and fixed sample dataset.
  • Environment with RAPIDS and Dask-CUDA installed.
  • Baseline runtime and output checksums.
  1. Run baseline pandas workflow and capture runtime plus output checksum.

    python - <<'PY'
    # run pandas ETL baseline
    print('baseline_elapsed_s= 420.7')
    print('checksum= 9df7...')
    PY

    Expected: You have baseline performance and parity marker.

  2. Enable cudf.pandas and rerun unchanged workflow.

    python - <<'PY'
    import cudf.pandas
    cudf.pandas.install()  # enable acceleration before the ETL imports pandas
    # run ETL entrypoint
    print('gpu_elapsed_s= 260.4')
    PY

    Expected: Initial acceleration is measured and fallback hotspots are identified.

  3. Refactor top hotspot into direct cuDF operation and compare again.

    Expected: Runtime improves further with parity preserved against checksum.

Success Criteria

  • Output parity check passes against baseline checksum.
  • Runtime improvement is reproducible across two runs.
  • Optimization notes document exactly which hotspot was rewritten.

Walkthrough B: Stabilize multi-GPU shuffle workflow

Tune partition and memory settings to avoid OOM in join/groupby workload.

Prerequisites

  • Two or more GPUs available.
  • Representative shuffle-heavy dataset.
  • Dask dashboard access.
  1. Start LocalCUDACluster with explicit memory settings.

    python - <<'PY'
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    cluster = LocalCUDACluster(rmm_pool_size=0.8, enable_cudf_spill=True)  # pool as fraction of device memory
    client=Client(cluster)
    print('cluster_ready')
    PY

    Expected: Cluster initializes with predictable memory behavior.

  2. Run workload at three partition targets and record runtime/spill.

    Expected: You identify a stable partition range with no worker failures.

  3. Publish final runbook defaults for this workload class.

    Expected: Team has a reusable partition and memory baseline.

Success Criteria

  • No OOM in two consecutive validation runs.
  • Documented partition choice includes rationale and evidence.
  • Dashboard screenshots/logs are attached to runbook.

Study Sprint

10-day execution plan

Day 1 · Focus: Set up a reproducible environment (Conda + Docker baseline) and validate GPU availability. · Output: Working environment.yml, Dockerfile notes, and a smoke-test notebook.
Day 2 · Focus: Port the pandas baseline to cudf.pandas and identify CPU fallback hotspots. · Output: Fallback hotspot log and first-pass speedup measurements.
Day 3 · Focus: Rewrite hotspot sections using the direct cuDF API. · Output: Before/after timing and code diff with rationale.
Day 4 · Focus: Build a single-node multi-GPU Dask-cuDF setup with LocalCUDACluster. · Output: Cluster config template and dashboard capture.
Day 5 · Focus: Partition-size tuning and shuffle stress test. · Output: Tuning table with chosen target partition size.
Day 6 · Focus: Memory engineering (RMM pool, spill, and persist strategy). · Output: Memory playbook and safe-defaults checklist.
Day 7 · Focus: I/O (Parquet read/write options, JSON Lines handling). · Output: I/O benchmark summary and file-format decision rubric.
Day 8 · Focus: Distributed workflow case study (Dask, Spark RAPIDS, or Databricks pattern). · Output: Architecture diagram and fallback notes.
Day 9 · Focus: Run the full mock workflow under exam-like time pressure. · Output: End-to-end runbook and failure-recovery notes.
Day 10 · Focus: Revision and targeted weak-area drills. · Output: Final cheat sheet and confidence checklist.

Hands-on Labs

Practical module work

Each lab includes an execution sample with representative CLI usage and expected output.

Lab A: Zero-code-change acceleration baseline

Measure uplift from pandas to cudf.pandas and isolate fallback penalties.

  • Run baseline pandas job with fixed input size.
  • Enable cudf.pandas and rerun unchanged code.
  • List operations that still execute on CPU and estimate their share of total runtime.
Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Cluster bring-up and memory controls)

python - <<'PY'
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster(rmm_pool_size=0.85, enable_cudf_spill=True)  # pool as fraction of device memory
client = Client(cluster)
print(client)
PY

Expected output (example)

<Client: 'tcp://127.0.0.1:8786' processes=4 threads=4>

Lab B: Dask-cuDF scaling and partition tuning

Find stable partition strategy for join and groupby workload.

  • Deploy LocalCUDACluster with at least two workers.
  • Test partition targets at 1/32, 1/16, and 1/8 memory fractions.
  • Record runtime, spill behavior, and OOM incidents.
Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Backend portability with dask.dataframe)

python - <<'PY'
import dask.dataframe as dd
import dask
with dask.config.set({'dataframe.backend':'cudf'}):
    ddf = dd.read_parquet('data/*.parquet')
    print(ddf.head())
PY

Expected output (example)

Returns first rows using cuDF backend where supported.

Lab C: Memory stability under pressure

Compare default allocator vs RMM pool and spill controls.

  • Run memory-heavy workflow with default settings.
  • Enable rmm_pool_size and cudf spill, then rerun.
  • Document runtime delta and stability improvements.
Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (I/O optimization quick checks)

python - <<'PY'
import cudf, time
t=time.time()
gdf=cudf.read_parquet('data/*.parquet', columns=['id','ts','value'])
print('rows=', len(gdf), 'elapsed_s=', round(time.time()-t,2))
PY

Expected output (example)

rows= 18400213 elapsed_s= 9.84

Lab D: I/O throughput optimization

Improve ingest throughput for Parquet and JSON workloads.

  • Benchmark read_parquet with representative file sizes.
  • Benchmark read_json for JSON Lines with mixed file-size distribution.
  • Write recommendations for file format and encoding choices.
Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (I/O optimization quick checks)

python - <<'PY'
import cudf, time
t=time.time()
gdf=cudf.read_json('logs/*.jsonl', lines=True)
print('rows=', len(gdf), 'elapsed_s=', round(time.time()-t,2))
PY

Expected output (example)

rows= 9023310 elapsed_s= 14.21

Exam Pitfalls

Common failure patterns

  • Using compute too early and forcing large data into single-GPU client memory.
  • Overusing sort_values or set_index when global ordering is not required.
  • Repeated to_backend conversions that trigger avoidable CPU-GPU transfer.
  • Ignoring partition skew and then blaming GPUs for join or groupby slowdown.
  • Treating cudf.pandas as always-GPU without checking fallback behavior.
  • Shipping non-reproducible environments with unpinned dependencies.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. You inherited a large pandas ETL notebook and need quick acceleration with minimal refactor. What is the best first move?
  • A. Rewrite all logic directly in cuDF before testing
  • B. Enable cudf.pandas, benchmark, then optimize fallback hotspots
  • C. Convert everything to Spark SQL immediately
  • D. Start with random partition tuning

Answer: B

cudf.pandas gives fastest time-to-value. Then you optimize only the expensive fallback paths with direct cuDF.

Q2. What is the most accurate statement about multi-GPU execution for Dask-cuDF?
  • A. Dask-cuDF alone automatically creates multi-GPU execution
  • B. Multi-GPU requires a distributed cluster, typically with Dask-CUDA
  • C. You must always use explicit dask_cudf API, never dask.dataframe
  • D. Multi-GPU only works for CSV input

Answer: B

Dask-cuDF registers the backend, but multi-GPU needs distributed cluster deployment.

Q3. For shuffle-heavy workloads, which partition-size starting range is generally recommended?
  • A. 1/2 to 1 of GPU memory
  • B. 1/32 to 1/16 of GPU memory
  • C. Exactly 1/8 for all workloads
  • D. One partition per file, no matter size

Answer: B

Smaller partitions reduce OOM risk and help shuffle-intensive workflows remain stable.

Q4. Why is calling ddf.compute() on large collections risky?
  • A. It disables GPU execution permanently
  • B. It concatenates results into local memory and can exceed single-GPU capacity
  • C. It forces Parquet output
  • D. It prevents dashboard diagnostics

Answer: B

compute materializes output on the client, which can trigger memory pressure or OOM.

Q5. Which file format is usually preferred for Dask-cuDF analytical pipelines and why?
  • A. JSON, because all fields are always typed
  • B. CSV, because projection pushdown is stronger
  • C. Parquet, because columnar layout supports projection and pushdown optimizations
  • D. Any format gives same optimization behavior

Answer: C

Parquet is columnar and usually offers the best optimization surface for distributed analytics.

Q6. What is a key benefit of initializing RMM pools on workers?
  • A. It removes need for all memory diagnostics
  • B. It guarantees no spilling under any workload
  • C. It reduces allocation overhead and improves memory behavior
  • D. It automatically repartitions data

Answer: C

RMM pooling improves allocation efficiency and can stabilize memory-heavy pipelines.

Q7. In cudf.pandas mode, what is the main performance danger when many operations are unsupported on GPU?
  • A. Kernel launch overhead only
  • B. CPU fallback plus repeated CPU-GPU transfers
  • C. Loss of DataFrame semantics
  • D. Inability to read Parquet

Answer: B

Fallback is functional, but frequent host-device movement can significantly hurt performance.

Q8. For JSON Lines ingestion in cuDF, which approach helps with very large files?
  • A. Parse entire file on CPU first
  • B. Use byte-range reads and adjacent ranges to avoid missed or duplicated rows
  • C. Convert JSON to images
  • D. Disable lines mode

Answer: B

Byte-range support is designed for large JSON Lines workloads while preserving row integrity.

Q9. What does GPUDirect Storage primarily optimize in cuDF ingest workflows?
  • A. It improves SQL optimizer rules
  • B. It enables direct storage-to-GPU transfer and reduces CPU bounce-buffer overhead
  • C. It replaces Parquet compression
  • D. It removes need for NVMe storage

Answer: B

GDS reduces CPU-mediated copies and can increase ingest throughput on supported systems.

Q10. Which practice best supports software literacy and reproducibility for this domain?
  • A. Install packages ad hoc in shared base environment
  • B. Use environment.yml plus containerized runtime for consistent dev-test-prod behavior
  • C. Avoid version pinning entirely
  • D. Depend on local machine state

Answer: B

Reproducible environments and containerized execution are core operational skills in accelerated data pipelines.

Primary References

Curated from your NCP-ADS vault and current official documentation.

Objectives

  • 2.1 Design and implement accelerated ETL (extract, transform, load) workflows.
  • 2.2 Implement caching strategies to reduce shuffle overhead.
  • 2.3 Use distributed data processing frameworks for large-scale datasets.
  • 2.4 Implement Dask-based data parallelism for multi-GPU scaling.
  • 2.5 Profile deep learning workloads with tools such as DLProf.
  • 2.6 Choose optimal data processing libraries for varying dataset sizes and workloads.
