FLOPS and Roofline Models; Tiling/Blocking and Data Reuse; Overlap Compute and Memory: Compilers, Multithreading, Bulk/Vector data access, Software Pipelining, Software Specialization, Asynchronous data transfers; Block Scheduling; Overlap GPU and CPU
GPU Weak Consistency PTX Model: Compiler and hardware memory fences, Axioms and litmus tests, Memory ordering and visibility; Cache Coherence Problem; Hardware Cache Coherence: Snooping and Directory Protocols, Implementations and Optimizations
Lecture 9: Exploiting Parallelism with Specialization - PDF
Data Supply Challenge; Complex Compute and Memory Instructions: Asynchronous memory operations, Matrix-Multiply Accumulate (MMA/Tensor Core); Throughput Optimizations: Software pipelining, Asynchronous data transfer, Caches, Offloading memory accesses, Decoupled compute and memory, Register reallocations, Cooperative cache hierarchy
Lecture 10: Specialization to Accelerate Compute and Communication - PDF
Ray Tracing Acceleration in GPUs, Tensor Cores beyond Matrix Multiply; Message Passing Model: Explicit memory transfer for efficiency; Accelerating Communication: Hardware send and receive instructions, Overlap compute and data
Data Movement Challenges: Energy and Latency, Data parallel primitives and communication; Shared vs. Private Cache Hierarchy Tradeoffs, Non-Uniform Cache Access (NUCA); Cache Interference; Caches and Coherence