- Very very important
- Introduction to parallel programming
- Parallel computing and supercomputing
- Let's investigate some background.
- How to install the CUDA toolkit and a first look at a CUDA program
- Basic elements of a CUDA program
- Organization of threads in a CUDA program - threadIdx
- Organization of threads in a CUDA program - blockIdx, blockDim, gridDim
- Programming exercise 1
- Unique index calculation using threadIdx, blockIdx and blockDim
- Unique index calculation for 2D grid 1
- Unique index calculation for 2D grid 2
- Memory transfer between host and device
- Programming exercise 2
- Sum array example with validity check
- Sum array example with error handling
- Sum array example with timing
- Extend sum array implementation to sum up 3 arrays
- Device properties
- Summary
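
To make the first module concrete, here is a minimal sketch (not the course's exact code) that ties together the unique global index built from threadIdx, blockIdx and blockDim, host-device memory transfer, and the sum-array example with a validity check; the array size and launch configuration are illustrative assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread handles one element; its unique global index comes from
// blockIdx, blockDim and threadIdx.
__global__ void sum_array(const int *a, const int *b, int *c, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        c[gid] = a[gid] + b[gid];
}

int main()
{
    const int n = 1 << 20;                       // illustrative array size
    const size_t bytes = n * sizeof(int);

    int *h_a = (int *)malloc(bytes), *h_b = (int *)malloc(bytes), *h_c = (int *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Host-to-device memory transfer.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 block(256);                             // assumed block size
    dim3 grid((n + block.x - 1) / block.x);
    sum_array<<<grid, block>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    // Device-to-host transfer followed by a simple validity check on the host.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        if (h_c[i] != h_a[i] + h_b[i]) { printf("mismatch at %d\n", i); return 1; }
    printf("sum array OK\n");

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```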
- Understand the device better
- All about warps
- Warp divergence
- Resource partitioning and latency hiding 1
- Resource partitioning and latency hiding 2
- Occupancy
- Profile driven optimization with nvprof
- Parallel reduction as a synchronization example
- Parallel reduction as a warp divergence example
- Parallel reduction with loop unrolling
- Parallel reduction with warp unrolling
- Reduction with complete unrolling
- Performance comparison of reduction kernels
- CUDA Dynamic parallelism
- Reduction with dynamic parallelism
- Summary
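
As a sketch of the reduction thread running through this module, here is a naive interleaved-pair reduction kernel: the `tid % (2 * stride)` test is the divergence-prone pattern the warp-divergence lectures rework, and `__syncthreads()` is the synchronization point between passes. The sizes are illustrative and the input length is assumed to be an exact multiple of the block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive interleaved-pair reduction: each block reduces its own chunk of the
// input in place and writes one partial sum. Assumes n is an exact multiple
// of blockDim.x; the input array is overwritten.
__global__ void reduce_interleaved(int *input, int *partial)
{
    int tid = threadIdx.x;
    int *block_data = input + blockIdx.x * blockDim.x;

    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if (tid % (2 * stride) == 0)    // divergence-prone: active threads scatter across warps
            block_data[tid] += block_data[tid + stride];
        __syncthreads();                // keep every pass in step across the block
    }

    if (tid == 0)
        partial[blockIdx.x] = block_data[0];
}

int main()
{
    const int n = 1 << 22, block = 128;          // illustrative sizes
    const int grid = n / block;

    int *h_in = new int[n];
    int *h_partial = new int[grid];
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in, *d_partial;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, grid * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    reduce_interleaved<<<grid, block>>>(d_in, d_partial);
    cudaMemcpy(h_partial, d_partial, grid * sizeof(int), cudaMemcpyDeviceToHost);

    long long sum = 0;                           // finish the reduction on the host
    for (int i = 0; i < grid; ++i) sum += h_partial[i];
    printf("sum = %lld (expected %d)\n", sum, n);

    cudaFree(d_in); cudaFree(d_partial);
    delete[] h_in; delete[] h_partial;
    return 0;
}
```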
- CUDA memory model
- Different memory types in CUDA
- Memory management and pinned memory
- Zero copy memory
- Unified memory
- Global memory access patterns
- Global memory writes
- AOS vs SOA
- Matrix transpose
- Matrix transpose with unrolling
- Matrix transpose with diagonal coordinate system
- Summary
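
A small sketch of the access-pattern point behind the transpose lectures: the naive kernel below reads rows (coalesced) but writes columns (strided), which is the behaviour the unrolled, diagonal and shared-memory variants improve on. Matrix dimensions and block shape are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive matrix transpose: loads are coalesced along rows of the input, but the
// stores walk down columns of the output with a stride of ny elements.
__global__ void transpose_naive(const float *in, float *out, int nx, int ny)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;   // column in the input
    int iy = blockIdx.y * blockDim.y + threadIdx.y;   // row in the input
    if (ix < nx && iy < ny)
        out[ix * ny + iy] = in[iy * nx + ix];
}

int main()
{
    const int nx = 1024, ny = 1024;                   // illustrative dimensions
    const size_t bytes = nx * ny * sizeof(float);

    float *h_in = new float[nx * ny];
    float *h_out = new float[nx * ny];
    for (int i = 0; i < nx * ny; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(32, 32);                               // assumed 2D block shape
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    transpose_naive<<<grid, block>>>(d_in, d_out, nx, ny);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    // Spot check: output (row 2, col 3) should equal input (row 3, col 2).
    printf("%.0f == %.0f\n", h_out[2 * ny + 3], h_in[3 * nx + 2]);

    cudaFree(d_in); cudaFree(d_out);
    delete[] h_in; delete[] h_out;
    return 0;
}
```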
- Introduction to CUDA shared memory
- Shared memory access modes and memory banks
- Row-major and column-major access to shared memory
- Static and Dynamic shared memory
- Shared memory padding
- Parallel reduction with shared memory
- Synchronization in CUDA
- Matrix transpose with shared memory
- CUDA constant memory
- Matrix transpose with Shared memory padding
- CUDA warp shuffle instructions
- Parallel reduction with warp shuffle instructions
- Summary
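
As a sketch combining two techniques from this module, the block reduction below does the within-warp work with warp shuffle instructions and uses static shared memory only to hand the per-warp results to the first warp; the block size, data size and all-ones test input are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Block reduction: each warp reduces its 32 values with __shfl_down_sync, the
// per-warp results land in static shared memory, and the first warp reduces those.
__global__ void reduce_smem_shfl(const int *input, int *partial, int n)
{
    __shared__ int warp_sums[32];                 // one slot per warp (static shared memory)

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int val = (gid < n) ? input[gid] : 0;

    // Reduce within the warp using shuffle instructions (no shared memory needed here).
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;
    if (lane == 0)
        warp_sums[warp] = val;
    __syncthreads();

    // The first warp reduces the per-warp partial sums.
    int num_warps = blockDim.x / warpSize;
    if (warp == 0)
    {
        val = (lane < num_warps) ? warp_sums[lane] : 0;
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        if (lane == 0)
            partial[blockIdx.x] = val;
    }
}

int main()
{
    const int n = 1 << 20, block = 128;           // illustrative sizes
    const int grid = (n + block - 1) / block;

    int *h_in = new int[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in, *d_partial;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, grid * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    reduce_smem_shfl<<<grid, block>>>(d_in, d_partial, n);

    int *h_partial = new int[grid];
    cudaMemcpy(h_partial, d_partial, grid * sizeof(int), cudaMemcpyDeviceToHost);
    long long sum = 0;
    for (int i = 0; i < grid; ++i) sum += h_partial[i];
    printf("sum = %lld (expected %d)\n", sum, n);

    cudaFree(d_in); cudaFree(d_partial);
    delete[] h_in; delete[] h_partial;
    return 0;
}
```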
- Introduction to CUDA streams and events
- How to use CUDA asynchronous functions
- How to use CUDA streams
- Overlapping memory transfer and kernel execution
- Stream synchronization and blocking behaviour of the NULL stream
- Explicit and implicit synchronization
- CUDA events and timing with CUDA events
- Creating inter-stream dependencies with events
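
A compact sketch of the streams material: pinned host memory (cudaMallocHost), cudaMemcpyAsync plus kernel launches spread across several non-NULL streams so copies and computation can overlap, and a pair of CUDA events timing the whole batch. The number of streams, the data size and the scale kernel are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        data[gid] *= 2.0f;
}

int main()
{
    const int n = 1 << 22, num_streams = 4;       // illustrative sizes
    const int chunk = n / num_streams;
    const size_t bytes = n * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost(&h_data, bytes);               // pinned memory: needed for truly async copies
    cudaMalloc(&d_data, bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[num_streams];
    for (int i = 0; i < num_streams; ++i)
        cudaStreamCreate(&streams[i]);

    cudaEvent_t start, stop;                      // events used for timing
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    // Each stream copies in its chunk, runs the kernel on it, and copies it back,
    // so transfers in one stream can overlap kernel execution in another.
    for (int i = 0; i < num_streams; ++i)
    {
        int offset = i * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %.3f ms, h_data[0] = %.1f\n", ms, h_data[0]);

    for (int i = 0; i < num_streams; ++i)
        cudaStreamDestroy(streams[i]);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```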
- Introduction to different types of instructions in CUDA
- Floating point operations
- Standard and intrinsic functions
- Atomic functions
- Scan algorithm introduction
- Simple parallel scan
- Work efficient parallel exclusive scan
- Work efficient parallel inclusive scan
- Parallel scan for large data sets
- Parallel Compact algorithm
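
For the scan lectures, here is a sketch of the simple (Hillis-Steele) inclusive scan for a single block, using double-buffered dynamic shared memory; the work-efficient and large-data-set versions in the lectures build on the same idea. The block size of 256 and the all-ones input are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Single-block Hillis-Steele inclusive scan: each pass adds the value `offset`
// positions to the left, doubling `offset` until it covers the block. The
// double buffer in dynamic shared memory avoids read-after-write hazards.
__global__ void inclusive_scan_block(const int *in, int *out, int n)
{
    extern __shared__ int temp[];          // 2 * n ints, size set at launch
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout * n + tid] = in[tid];
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;                   // swap the double buffers
        pin = 1 - pout;
        if (tid >= offset)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
        else
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }
    out[tid] = temp[pout * n + tid];
}

int main()
{
    const int n = 256;                     // one block; illustrative size
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    inclusive_scan_block<<<1, n, 2 * n * sizeof(int)>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("out[0] = %d, out[%d] = %d (expected 1 and %d)\n", h_out[0], n - 1, h_out[n - 1], n);

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```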
- Introduction part 1
- Introduction part 2
- Digital image processing
- Digital image fundamentals: Human perception
- Digital image fundamentals: Image formation
- OpenCV installation
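
Once OpenCV is installed, a tiny host-only sanity check like the sketch below (the image path is a placeholder, not part of the course material) confirms the build links and can read, convert and write an image.

```cpp
#include <cstdio>
#include <opencv2/opencv.hpp>

int main()
{
    // "input.jpg" is a placeholder path; any readable image will do.
    cv::Mat image = cv::imread("input.jpg", cv::IMREAD_COLOR);
    if (image.empty())
    {
        std::printf("could not read image\n");
        return 1;
    }

    // Basic properties: width, height and channel count.
    std::printf("size: %d x %d, channels: %d\n", image.cols, image.rows, image.channels());

    // Convert to grayscale (OpenCV loads colour images in BGR order) and save.
    cv::Mat gray;
    cv::cvtColor(image, gray, cv::COLOR_BGR2GRAY);
    cv::imwrite("gray.jpg", gray);
    return 0;
}
```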