File Download
Links for fulltext (May Require Subscription)
- Publisher Website: 10.1145/3714416
- Scopus: eid_2-s2.0-105003445308

Citations:
- Scopus: 0
Article: TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture
| Title | TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture |
|---|---|
| Authors | Wu, Jiajun; Song, Mo; Zhao, Jingmin; Gao, Yizhao; Li, Jia; So, Hayden Kwok Hay |
| Keywords | FPGA; mixed integer-floating-point inference; non-linear arithmetic operations; SIMD; systolic array; transformable arithmetic architecture; Transformer acceleration |
| Issue Date | 14-Mar-2025 |
| Publisher | Association for Computing Machinery (ACM) |
| Citation | ACM Transactions on Reconfigurable Technology and Systems, 2025, v. 18, n. 1, p. 1-31 |
| Abstract | Modern transformer-based deep neural networks present unique technical challenges for effective acceleration in real-world applications. Apart from the vast number of linear operations required by their size, modern transformer models are increasingly reliant on precise non-linear computations, which make traditional low-bitwidth quantization methods and fixed-dataflow matrix accelerators ineffective for end-to-end acceleration. To address the need to accelerate both linear and non-linear operations in a unified and programmable framework, this article introduces TATAA. TATAA employs 8-bit integer (int8) arithmetic for quantized linear layer operations through post-training quantization, while it relies on bfloat16 floating-point arithmetic to approximate the non-linear layers of a transformer model. The TATAA hardware features a transformable arithmetic architecture that supports both formats at runtime with minimal overhead, enabling it to switch between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations. An end-to-end compiler is presented to enable flexible mapping from emerging transformer models to the proposed hardware. Experimental results indicate that our mixed-precision design incurs only a 0.14% to 1.16% accuracy drop compared with the pre-trained single-precision transformer models across a range of vision, language, and generative text applications. Our prototype implementation on the Alveo U280 FPGA currently achieves 2,935.2 GOPS throughput on linear layers and a maximum of 189.5 GFLOPS for non-linear operations, outperforming related works in end-to-end throughput and DSP efficiency, while achieving higher power efficiency than a modern NVIDIA RTX 4090 GPU. |
| Persistent Identifier | http://hdl.handle.net/10722/366101 |
| ISSN | 1936-7406 |
| Journal Metrics | 2023 Impact Factor: 3.1; 2023 SCImago Journal Rankings: 0.802 |
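
To make the mixed-precision scheme described in the abstract above concrete, the following is a minimal software sketch, not the authors' implementation: it assumes symmetric per-tensor post-training quantization and uses PyTorch purely for illustration. Linear layers run as int8 matrix multiplies accumulated in int32, while a non-linear operation (softmax here) is evaluated in bfloat16, mirroring the int8/bfloat16 division of labor the paper describes.

```python
# Illustrative sketch of the int8 (linear) / bfloat16 (non-linear) split.
# Function names and the quantization scheme are assumptions, not TATAA code.
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor post-training quantization to int8."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_linear(x_q, x_scale, w_q, w_scale):
    """int8 x int8 matmul accumulated in int32, then dequantized to float."""
    acc = torch.matmul(x_q.to(torch.int32), w_q.to(torch.int32).T)
    return acc.to(torch.float32) * (x_scale * w_scale)

def bf16_softmax(x: torch.Tensor) -> torch.Tensor:
    """Non-linear layer evaluated in bfloat16, following the split in the abstract."""
    return torch.softmax(x.to(torch.bfloat16), dim=-1).to(torch.float32)

# Toy attention-score computation: int8 for the linear part, bfloat16 for softmax.
x = torch.randn(4, 64)           # activations
w = torch.randn(64, 64)          # pre-trained weights
x_q, x_s = quantize_int8(x)
w_q, w_s = quantize_int8(w)
scores = int8_linear(x_q, x_s, w_q, w_s)   # linear layer in int8
probs = bf16_softmax(scores)               # non-linear layer in bfloat16
```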
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Wu, Jiajun | - |
| dc.contributor.author | Song, Mo | - |
| dc.contributor.author | Zhao, Jingmin | - |
| dc.contributor.author | Gao, Yizhao | - |
| dc.contributor.author | Li, Jia | - |
| dc.contributor.author | So, Hayden Kwok Hay | - |
| dc.date.accessioned | 2025-11-15T00:35:32Z | - |
| dc.date.available | 2025-11-15T00:35:32Z | - |
| dc.date.issued | 2025-03-14 | - |
| dc.identifier.citation | ACM Transactions on Reconfigurable Technology and Systems, 2025, v. 18, n. 1, p. 1-31 | - |
| dc.identifier.issn | 1936-7406 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/366101 | - |
| dc.description.abstract | Modern transformer-based deep neural networks present unique technical challenges for effective acceleration in real-world applications. Apart from the vast number of linear operations required by their size, modern transformer models are increasingly reliant on precise non-linear computations, which make traditional low-bitwidth quantization methods and fixed-dataflow matrix accelerators ineffective for end-to-end acceleration. To address the need to accelerate both linear and non-linear operations in a unified and programmable framework, this article introduces TATAA. TATAA employs 8-bit integer (int8) arithmetic for quantized linear layer operations through post-training quantization, while it relies on bfloat16 floating-point arithmetic to approximate the non-linear layers of a transformer model. The TATAA hardware features a transformable arithmetic architecture that supports both formats at runtime with minimal overhead, enabling it to switch between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations. An end-to-end compiler is presented to enable flexible mapping from emerging transformer models to the proposed hardware. Experimental results indicate that our mixed-precision design incurs only a 0.14% to 1.16% accuracy drop compared with the pre-trained single-precision transformer models across a range of vision, language, and generative text applications. Our prototype implementation on the Alveo U280 FPGA currently achieves 2,935.2 GOPS throughput on linear layers and a maximum of 189.5 GFLOPS for non-linear operations, outperforming related works in end-to-end throughput and DSP efficiency, while achieving higher power efficiency than a modern NVIDIA RTX 4090 GPU. | - |
| dc.language | eng | - |
| dc.publisher | Association for Computing Machinery (ACM) | - |
| dc.relation.ispartof | ACM Transactions on Reconfigurable Technology and Systems | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | FPGA | - |
| dc.subject | mixed integer-floating-point inference | - |
| dc.subject | non-linear arithmetic operations | - |
| dc.subject | SIMD | - |
| dc.subject | systolic array | - |
| dc.subject | transformable arithmetic architecture | - |
| dc.subject | Transformer acceleration | - |
| dc.title | TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture | - |
| dc.type | Article | - |
| dc.description.nature | published_or_final_version | - |
| dc.identifier.doi | 10.1145/3714416 | - |
| dc.identifier.scopus | eid_2-s2.0-105003445308 | - |
| dc.identifier.volume | 18 | - |
| dc.identifier.issue | 1 | - |
| dc.identifier.spage | 1 | - |
| dc.identifier.epage | 31 | - |
| dc.identifier.eissn | 1936-7414 | - |
| dc.identifier.issnl | 1936-7406 | - |
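
The "transformable arithmetic architecture" named in this record reuses one set of processing elements in two runtime modes: a systolic-array mode for int8 matrix multiplication and a SIMD mode for vectorized bfloat16 operations. The sketch below is only a software analogy of that idea, not the hardware design; the class and method names (`TransformablePE`, `matmul_int8`, `simd_bf16`) are invented for illustration.

```python
# Conceptual analogy of the two runtime modes; names are hypothetical.
import torch

class TransformablePE:
    """Illustrative processing-element array with two runtime modes."""

    def matmul_int8(self, a_q: torch.Tensor, b_q: torch.Tensor) -> torch.Tensor:
        # Systolic-array mode: int8 operands, int32 accumulation.
        return torch.matmul(a_q.to(torch.int32), b_q.to(torch.int32))

    def simd_bf16(self, op, *vecs: torch.Tensor) -> torch.Tensor:
        # SIMD mode: vectorized bfloat16 arithmetic for non-linear layers.
        return op(*(v.to(torch.bfloat16) for v in vecs))

pe = TransformablePE()
a = torch.randint(-127, 128, (8, 16), dtype=torch.int8)
b = torch.randint(-127, 128, (16, 8), dtype=torch.int8)
acc = pe.matmul_int8(a, b)                        # mode 1: int8 systolic matmul
act = pe.simd_bf16(torch.nn.functional.gelu,      # mode 2: bf16 vector op
                   torch.randn(8, 8))
```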
