A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining
Time: Thursday, July 14th, 3:30pm - 3:50pm PDT
Location: 3002, Level 3
Event Type: Research Manuscript
AI/ML Design: System and Platform
Description: Attention-based models have recently gained popularity and achieve outstanding performance on many Natural Language Processing (NLP) tasks. However, they suffer from quadratic computational complexity and a heavy memory footprint. To deploy attention-based models efficiently on FPGA, we propose a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator reduces the complexity of attention-based models from quadratic to linear and alleviates main-memory traffic. The proposed length-aware scheduling algorithm dynamically allocates hardware resources to fill pipeline slots and eliminate bubbles when processing NLP inputs of varying lengths.
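The abstract does not specify the sparsity pattern of the proposed operator, but one common way to bring attention cost from quadratic down to linear is a fixed local window, where each query attends only to nearby keys. The sketch below is an illustrative stand-in under that assumption (the `window` parameter and `local_sparse_attention` name are hypothetical, not from the paper):

```python
import numpy as np

def local_sparse_attention(Q, K, V, window=4):
    """Windowed sparse attention: each query attends only to keys within
    +/- `window` positions, so the cost is O(n * window * d) rather than
    the O(n^2 * d) of dense attention.
    Illustrative assumption only; the paper's actual operator may differ.
    """
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)      # local score vector
        weights = np.exp(scores - scores.max())       # stable softmax
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi]                   # weighted sum of values
    return out
```

When `window` covers the whole sequence, this reduces to ordinary dense attention, which makes the linear-cost restriction easy to check against a dense reference.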
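The length-aware scheduling idea, filling pipeline slots so that short sequences do not leave hardware idle, can be approximated in software by greedy bin packing of variable-length inputs into fixed-capacity slots. This sketch (first-fit-decreasing; the function name and slot model are assumptions, not the paper's algorithm) shows the intent:

```python
def pack_sequences(lengths, slot_capacity):
    """Greedily pack variable-length sequences into pipeline slots of
    fixed capacity (first-fit-decreasing), reducing idle "bubble" cycles.
    Returns a list of slots, each a list of sequence indices.
    Illustrative only; the paper's dynamic scheduler may differ.
    """
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    slots, free = [], []          # free[s] = remaining capacity of slot s
    for i in order:
        need = lengths[i]
        for s, room in enumerate(free):
            if room >= need:      # reuse the first slot with enough room
                slots[s].append(i)
                free[s] -= need
                break
        else:                     # no slot fits: open a new one
            slots.append([i])
            free.append(slot_capacity - need)
    return slots
```

For example, lengths `[7, 5, 4, 3, 1]` with capacity 8 pack into three slots instead of the five a one-sequence-per-slot schedule would use.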