Graph Input-Aware Matrix Multiplication for Pruned Graph Neural Network Acceleration

Abstract

Graph-based neural networks are powerful tools for analyzing intricate non-Euclidean data structures to solve complex real-world problems bounded by latency and throughput constraints. However, their computations operate on ultra-sparse, unstructured matrices, creating load-balancing and data-locality challenges for sparse matrix-matrix multiplication (SpMM) on massively parallel processors. To overcome these challenges, researchers have turned to model pruning techniques that compress the hidden dimensions of the dense matrices without compromising model accuracy. However, exploiting this sparsity for performance becomes difficult when the compressed dimensions cannot utilize the full vector-level parallelism of single instruction, multiple thread (SIMT) processors. We propose virtualizing the vector unit into virtual SIMT slots onto which the low-dimensional SpMM computations are mapped, improving parallel hardware utilization. The key challenge is finding enough parallel work to occupy the exposed virtual slots. A static slot mapping strategy exposes parallelism by trading off load balancing, memory coalescing, and data locality when distributing work among the slots. The key insight is that work efficiency and parallelism are highly dependent on the input graph, so static schemes cannot achieve optimal performance. We therefore propose an input-aware slot mapping strategy that adapts to the input graph's characteristics at SIMT granularity to expose maximum parallelism at runtime. Evaluation on an NVIDIA H100 GPU shows improved performance scaling with the proposed input-aware slot mapping heuristic compared to the static baseline.
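To make the slot-virtualization idea concrete, the following is a minimal CUDA sketch, not the paper's implementation: it assumes CSR storage for the sparse matrix, a compile-time pruned feature dimension of 8, and a simple static slot-to-row mapping. The kernel name `spmm_virtual_slots` and all parameters are illustrative.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: a warp's 32 lanes are split into virtual SIMT
// "slots" of SLOT_WIDTH lanes each. With a pruned feature dimension of
// SLOT_WIDTH, every slot computes one output row of C = A * B, where A
// is a sparse adjacency matrix in CSR form and B is a dense (pruned)
// feature matrix. This version uses a *static* slot-to-row mapping.
constexpr int SLOT_WIDTH     = 8;               // pruned hidden dim (assumed)
constexpr int SLOTS_PER_WARP = 32 / SLOT_WIDTH; // 4 virtual slots per warp

__global__ void spmm_virtual_slots(
    const int*   __restrict__ row_ptr,  // CSR row pointers, size n_rows+1
    const int*   __restrict__ col_idx,  // CSR column indices
    const float* __restrict__ vals,     // CSR nonzero values
    const float* __restrict__ B,        // dense matrix, n_cols x SLOT_WIDTH
    float*       __restrict__ C,        // output, n_rows x SLOT_WIDTH
    int n_rows)
{
    int lane    = threadIdx.x % 32;
    int slot    = lane / SLOT_WIDTH;    // which virtual slot in the warp
    int slot_ln = lane % SLOT_WIDTH;    // lane (feature index) within slot
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int row     = warp_id * SLOTS_PER_WARP + slot;  // static slot->row map
    if (row >= n_rows) return;

    float acc = 0.0f;
    // Each slot walks its row's nonzeros; the slot's lanes span the
    // feature dimension, so each lane accumulates one output element
    // and loads of B are contiguous within a slot.
    for (int e = row_ptr[row]; e < row_ptr[row + 1]; ++e)
        acc += vals[e] * B[col_idx[e] * SLOT_WIDTH + slot_ln];
    C[row * SLOT_WIDTH + slot_ln] = acc;
}

// Launch (illustrative): one warp covers SLOTS_PER_WARP rows.
//   int warps = (n_rows + SLOTS_PER_WARP - 1) / SLOTS_PER_WARP;
//   spmm_virtual_slots<<<(warps * 32 + 255) / 256, 256>>>(
//       row_ptr, col_idx, vals, B, C, n_rows);
```

Note how the static mapping above assigns consecutive rows to consecutive slots regardless of row degree, so slots in the same warp stall on each other when degrees are skewed; the paper's input-aware heuristic instead adapts the mapping to the input graph's characteristics at runtime, which this sketch does not attempt.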

Publication
Proceedings of the 39th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2025)