Introduction to CUDA programming and CUDA programming model
  • Introduction to parallel programming
  • Parallel computing and supercomputing
  • Some background on parallel computing
  • How to install CUDA toolkit and first look at CUDA program
  • Basic elements of CUDA program
  • Organization of threads in a CUDA program - threadIdx
  • Organization of threads in a CUDA program - blockIdx, blockDim, gridDim
  • Programming exercise 1
  • Unique index calculation using threadIdx, blockIdx, and blockDim
  • Unique index calculation for 2D grid 1
  • Unique index calculation for 2D grid 2
  • Memory transfer between host and device
  • Programming exercise 2
  • Sum array example with validity check
  • Sum array example with error handling
  • Sum array example with timing
  • Extend sum array implementation to sum up 3 arrays
  • Device properties
  • Summary
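The unique-index calculations listed above are plain integer arithmetic, so they can be sketched host-side in C. The names below deliberately mirror CUDA's built-in variables (threadIdx.x, blockIdx.x, blockDim.x, gridDim.x), but here they are ordinary function parameters, not the real device built-ins:

```c
#include <assert.h>

/* Host-side sketch of the unique-index arithmetic a CUDA kernel would use.
   Parameter names mirror CUDA's built-ins, but these are plain ints here. */

/* 1D grid of 1D blocks: one unique global index per thread */
static int global_index_1d(int threadIdx_x, int blockIdx_x, int blockDim_x) {
    return blockIdx_x * blockDim_x + threadIdx_x;
}

/* 2D grid of 1D blocks: flatten the 2D block coordinate first,
   then offset by the thread's position within its block */
static int global_index_2d_grid(int threadIdx_x,
                                int blockIdx_x, int blockIdx_y,
                                int blockDim_x, int gridDim_x) {
    int block_offset = (blockIdx_y * gridDim_x + blockIdx_x) * blockDim_x;
    return block_offset + threadIdx_x;
}
```

Inside a real kernel the same formulas would read directly from the built-ins, e.g. `int gid = blockIdx.x * blockDim.x + threadIdx.x;`.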
CUDA Execution model
  • Understand the device better
  • All about warps
  • Warp divergence
  • Resource partitioning and latency hiding 1
  • Resource partitioning and latency hiding 2
  • Occupancy
  • Profile driven optimization with nvprof
  • Parallel reduction as a synchronization example
  • Parallel reduction as a warp divergence example
  • Parallel reduction with loop unrolling
  • Parallel reduction with warp unrolling
  • Reduction with complete unrolling
  • Performance comparison of reduction kernels
  • CUDA Dynamic parallelism
  • Reduction with dynamic parallelism
  • Summary
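The reduction variants listed above all share one tree-based pattern: each element accumulates a partner `stride` positions away, then the stride halves. A minimal host-side sketch of that pattern in C (on the GPU, the inner loop would be one thread per element with a `__syncthreads()` between stride steps):

```c
#include <assert.h>

/* Host-side sketch of the tree-based parallel reduction pattern.
   Each outer iteration mirrors one kernel step: element i accumulates
   element i + stride, then the stride halves.
   Assumes n is a power of two, as the simplest reduction kernels do. */
static int reduce_sum(int *data, int n) {
    for (int stride = n / 2; stride > 0; stride /= 2) {
        for (int i = 0; i < stride; i++) {
            data[i] += data[i + stride];
        }
    }
    return data[0];  /* final sum collects at index 0 */
}
```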
CUDA memory model
  • CUDA memory model
  • Different memory types in CUDA
  • Memory management and pinned memory
  • Zero copy memory
  • Unified memory
  • Global memory access patterns
  • Global memory writes
  • AoS vs SoA
  • Matrix transpose
  • Matrix transpose with unrolling
  • Matrix transpose with diagonal coordinate system
  • Summary
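The AoS-vs-SoA comparison above comes down to memory stride: with an array of structures, consecutive threads reading the same field stride through memory, while a structure of arrays makes those reads contiguous, which is what coalesced global-memory access wants. A hedged C sketch of the two layouts (the struct names are illustrative, not from the course):

```c
#include <assert.h>

#define N 4

struct PointAoS  { float x; float y; };        /* array of structures */
struct PointsSoA { float x[N]; float y[N]; };  /* structure of arrays */

/* Byte distance between the .x fields of element i and element i+1 */
static unsigned long aos_x_stride(void) { return sizeof(struct PointAoS); }
static unsigned long soa_x_stride(void) { return sizeof(float); }
```

With AoS, thread i reading `points[i].x` skips over every `.y`; with SoA, threads read a dense run of floats and the hardware can coalesce the transaction.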
CUDA Shared memory and constant memory
  • Introduction to CUDA shared memory
  • Shared memory access modes and memory banks
  • Row-major and column-major access to shared memory
  • Static and Dynamic shared memory
  • Shared memory padding
  • Parallel reduction with shared memory
  • Synchronization in CUDA
  • Matrix transpose with shared memory
  • CUDA constant memory
  • Matrix transpose with Shared memory padding
  • CUDA warp shuffle instructions
  • Parallel reduction with warp shuffle instructions
  • Summary
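The shared-memory padding trick above is pure index arithmetic, so it can be checked host-side. Assuming 32 banks of 4-byte words and a 32x32 tile (typical for current GPUs): without padding, every element of one column maps to the same bank, so a column read is a 32-way conflict; padding each row by one element spreads the column across all 32 banks:

```c
#include <assert.h>

#define BANKS 32
#define TILE  32

/* Which shared-memory bank a tile element lands in, for a row of
   row_width 4-byte words. row_width == TILE models an unpadded tile;
   row_width == TILE + 1 models one element of padding per row. */
static int bank_of(int row, int col, int row_width) {
    return (row * row_width + col) % BANKS;
}
```

In a kernel the padding is just the declaration `__shared__ float tile[TILE][TILE + 1];` instead of `[TILE][TILE]`.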
CUDA Streams
  • Introduction to CUDA streams and events
  • How to use CUDA asynchronous functions
  • How to use CUDA streams
  • Overlapping memory transfer and kernel execution
  • Stream synchronization and blocking behaviour of the NULL stream
  • Explicit and implicit synchronization
  • CUDA events and timing with CUDA events
  • Creating inter-stream dependencies with events
Performance Tuning with CUDA instruction-level primitives
  • Introduction to different types of instructions in CUDA
  • Floating point operations
  • Standard and intrinsic functions
  • Atomic functions
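A concrete taste of why the floating-point discussion above matters, sketched host-side in C: accumulating the same series in `float` and in `double` can disagree noticeably, because `float` carries only about 7 decimal digits of precision and rounding error compounds across many additions. The same trade-off shows up on the device when choosing single vs double precision or fast intrinsic variants:

```c
#include <assert.h>

/* Accumulate 0.1 n times in single and in double precision.
   Neither 0.1f nor 0.1 is exactly representable in binary, and the
   float accumulator also loses low bits as the running sum grows. */
static float  sum_float(int n)  { float  s = 0.0f; for (int i = 0; i < n; i++) s += 0.1f; return s; }
static double sum_double(int n) { double s = 0.0;  for (int i = 0; i < n; i++) s += 0.1;  return s; }
```

Over a million terms the double result stays within a tiny fraction of 100000, while the float result drifts by hundreds.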
Parallel Patterns and Applications
  • Scan algorithm introduction
  • Simple parallel scan
  • Work efficient parallel exclusive scan
  • Work efficient parallel inclusive scan
  • Parallel scan for large data sets
  • Parallel Compact algorithm
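The work-efficient exclusive scan listed above is the classic two-phase (Blelloch-style) algorithm: an up-sweep that builds partial sums in place, then a down-sweep that clears the root and pushes sums back down. A host-side C sketch of both phases, assuming n is a power of two as the in-block GPU version typically does (on the GPU each inner loop becomes one thread per active pair, separated by synchronization):

```c
#include <assert.h>

/* In-place work-efficient exclusive scan (prefix sum), host-side sketch. */
static void exclusive_scan(int *data, int n) {
    /* up-sweep (reduce): build partial sums in place */
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            data[i] += data[i - stride];
        }
    }
    /* down-sweep: clear the root, then swap-and-add back down the tree */
    data[n - 1] = 0;
    for (int stride = n / 2; stride > 0; stride /= 2) {
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            int t = data[i - stride];
            data[i - stride] = data[i];
            data[i] += t;
        }
    }
}
```

The exclusive result is also the building block for stream compaction: scanning a 0/1 predicate array yields each surviving element's output position.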
Bonus: Introduction to Image processing with CUDA
  • Introduction part 1
  • Introduction part 2
  • Digital image processing
  • Digital image fundamentals: Human perception
  • Digital image fundamentals: Image formation
  • OpenCV installation
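A first image-processing operation of the kind this bonus section maps onto the GPU is per-pixel grayscale conversion, which is embarrassingly parallel: one CUDA thread per pixel, no inter-thread communication. A host-side C sketch using the classic ITU-R BT.601 luma weights (a common convention, and the one OpenCV's RGB-to-gray conversion uses; a kernel body would compute exactly this for its own pixel index):

```c
#include <assert.h>

/* Weighted grayscale conversion: green dominates because human vision
   is most sensitive to it (BT.601 luma coefficients). */
static unsigned char to_gray(unsigned char r, unsigned char g, unsigned char b) {
    return (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
}
```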