Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning
Time: Thursday, July 14th, 4:30pm - 4:50pm PDT
Location: 3002, Level 3
Event Type: Research Manuscript
Topic: AI/ML Design: System and Platform
Description: Weight pruning can reduce the storage and computation cost of DNNs, but it struggles to deliver practical inference speedup. Tensor cores boost GPU throughput on dense computation, yet exploiting them for sparse DNNs is challenging because sparse kernels offer poor data-reuse opportunities. Existing pruning approaches fail to balance the competing demands of accuracy and performance. In this work, we propose a novel sparsity pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to fully utilize tensor cores while minimizing the constraints placed on the weight structure. Evaluations show that Shfl-BW accelerates Transformer inference by up to 4.83x on an NVIDIA T4 GPU and achieves a state-of-the-art accuracy-speedup tradeoff.
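To illustrate the general idea of block-wise pruning with a shuffle step, the following is a minimal NumPy sketch. It is an assumption-laden illustration, not the paper's implementation: the function name `shfl_bw_prune`, the block shape, the L1-norm block-scoring rule, and the use of a random column permutation are all hypothetical choices. The shuffle relaxes the rigid layout of plain block sparsity, while the surviving dense blocks are the kind of tiles that map well onto tensor cores.

```python
import numpy as np

def shfl_bw_prune(W, block_shape=(4, 8), keep_ratio=0.25, seed=0):
    """Sketch of shuffled block-wise pruning (hypothetical helper).

    Columns of W are shuffled by a random permutation, the shuffled
    matrix is tiled into blocks, and only the blocks with the largest
    L1 norms are kept; everything else is zeroed.
    """
    rows, cols = W.shape
    br, bc = block_shape
    assert rows % br == 0 and cols % bc == 0
    # Shuffle columns so the sparsity pattern is not tied to the
    # original column layout (the "Shfl" step).
    perm = np.random.default_rng(seed).permutation(cols)
    Ws = W[:, perm]
    # Score each (br x bc) block by its L1 norm.
    blocks = Ws.reshape(rows // br, br, cols // bc, bc)
    scores = np.abs(blocks).sum(axis=(1, 3))
    # Keep the top-scoring fraction of blocks, zero the rest.
    k = max(1, int(keep_ratio * scores.size))
    thresh = np.partition(scores.ravel(), -k)[-k]
    mask = (scores >= thresh)[:, None, :, None]
    pruned = (blocks * mask).reshape(rows, cols)
    # Undo the shuffle so the result aligns with the original matrix.
    inv = np.argsort(perm)
    return pruned[:, inv], perm

W = np.random.default_rng(1).standard_normal((16, 32)).astype(np.float32)
Wp, perm = shfl_bw_prune(W, keep_ratio=0.25)
print(np.count_nonzero(Wp) / Wp.size)  # roughly keep_ratio of the weights survive
```

At inference time, the kept blocks would be gathered (using the stored permutation) into small dense tiles and fed to tensor-core matrix-multiply instructions, which is how a pattern like this can turn sparsity into real speedup.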