
Article: TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture

Title: TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture
Authors: Wu, Jiajun; Song, Mo; Zhao, Jingmin; Gao, Yizhao; Li, Jia; So, Hayden Kwok Hay
Keywords: FPGA
mixed integer-floating-point inference
non-linear arithmetic operations
SIMD
systolic array
transformable arithmetic architecture
Transformer acceleration
Issue Date: 14-Mar-2025
Publisher: Association for Computing Machinery (ACM)
Citation: ACM Transactions on Reconfigurable Technology and Systems, 2025, v. 18, n. 1, p. 1-31
Abstract

Modern transformer-based deep neural networks present unique technical challenges for effective acceleration in real-world applications. Apart from the vast number of linear operations required by their size, modern transformer models increasingly rely on precise non-linear computations, which make traditional low-bitwidth quantization methods and fixed-dataflow matrix accelerators ineffective for end-to-end acceleration. To address the need to accelerate both linear and non-linear operations in a unified and programmable framework, this article introduces TATAA. TATAA employs 8-bit integer (int8) arithmetic for quantized linear layer operations through post-training quantization, while relying on bfloat16 floating-point arithmetic to approximate the non-linear layers of a transformer model. The TATAA hardware features a transformable arithmetic architecture that supports both formats at runtime with minimal overhead, enabling it to switch between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations. An end-to-end compiler is presented to enable flexible mapping from emerging transformer models to the proposed hardware. Experimental results indicate that our mixed-precision design incurs only a 0.14% to 1.16% accuracy drop compared with pre-trained single-precision transformer models across a range of vision, language, and generative text applications. Our prototype implementation on the Alveo U280 FPGA currently achieves 2,935.2 GOPS of throughput on linear layers and a maximum of 189.5 GFLOPS for non-linear operations, outperforming related works in end-to-end throughput and DSP efficiency, while achieving higher power efficiency than a modern NVIDIA RTX 4090 GPU.
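To make the scheme concrete, the NumPy sketch below illustrates the mixed int8/bfloat16 flow described in the abstract: linear layers use symmetric post-training quantization and an int8 matrix multiply with int32 accumulation (playing the role of the systolic array mode), while a non-linear step (here, softmax) is evaluated with bfloat16-rounded intermediates (playing the role of the SIMD mode). This is an illustrative software approximation only, not the TATAA hardware, compiler, or quantization recipe; the function names (quantize_int8, to_bfloat16, linear_int8, softmax_bf16) are hypothetical.

import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor post-training quantization to int8.
    scale = float(np.max(np.abs(x))) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def to_bfloat16(x):
    # Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def linear_int8(x_q, x_scale, w_q, w_scale):
    # int8 x int8 matrix multiply with int32 accumulation (systolic-array-style
    # GEMM path), followed by dequantization back to floating point.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    return acc.astype(np.float32) * (x_scale * w_scale)

def softmax_bf16(x):
    # Row-wise softmax with bfloat16-rounded intermediates, standing in for
    # the vectorized (SIMD-mode) non-linear path.
    x = to_bfloat16(x - x.max(axis=-1, keepdims=True))
    e = to_bfloat16(np.exp(x))
    return to_bfloat16(e / e.sum(axis=-1, keepdims=True))

# Toy example: a quantized linear projection followed by a bf16 softmax.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 64)).astype(np.float32)
x_q, x_s = quantize_int8(x)
w_q, w_s = quantize_int8(w)
scores = linear_int8(x_q, x_s, w_q, w_s)   # int8 GEMM path
probs = softmax_bf16(scores)               # bfloat16 non-linear path
print(probs.sum(axis=-1))                  # each row sums to ~1.0

The bfloat16 behaviour is emulated here by truncating float32 mantissas, which matches the format's 8-bit exponent and 7-bit mantissa but ignores rounding-mode details.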


Persistent Identifier: http://hdl.handle.net/10722/366101
ISSN: 1936-7406
2023 Impact Factor: 3.1
2023 SCImago Journal Rankings: 0.802

 

DC Field | Value | Language
dc.contributor.author | Wu, Jiajun | -
dc.contributor.author | Song, Mo | -
dc.contributor.author | Zhao, Jingmin | -
dc.contributor.author | Gao, Yizhao | -
dc.contributor.author | Li, Jia | -
dc.contributor.author | So, Hayden Kwok Hay | -
dc.date.accessioned | 2025-11-15T00:35:32Z | -
dc.date.available | 2025-11-15T00:35:32Z | -
dc.date.issued | 2025-03-14 | -
dc.identifier.citation | ACM Transactions on Reconfigurable Technology and Systems, 2025, v. 18, n. 1, p. 1-31 | -
dc.identifier.issn | 1936-7406 | -
dc.identifier.uri | http://hdl.handle.net/10722/366101 | -
dc.description.abstract | Modern transformer-based deep neural networks present unique technical challenges for effective acceleration in real-world applications. Apart from the vast number of linear operations required by their size, modern transformer models increasingly rely on precise non-linear computations, which make traditional low-bitwidth quantization methods and fixed-dataflow matrix accelerators ineffective for end-to-end acceleration. To address the need to accelerate both linear and non-linear operations in a unified and programmable framework, this article introduces TATAA. TATAA employs 8-bit integer (int8) arithmetic for quantized linear layer operations through post-training quantization, while relying on bfloat16 floating-point arithmetic to approximate the non-linear layers of a transformer model. The TATAA hardware features a transformable arithmetic architecture that supports both formats at runtime with minimal overhead, enabling it to switch between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations. An end-to-end compiler is presented to enable flexible mapping from emerging transformer models to the proposed hardware. Experimental results indicate that our mixed-precision design incurs only a 0.14% to 1.16% accuracy drop compared with pre-trained single-precision transformer models across a range of vision, language, and generative text applications. Our prototype implementation on the Alveo U280 FPGA currently achieves 2,935.2 GOPS of throughput on linear layers and a maximum of 189.5 GFLOPS for non-linear operations, outperforming related works in end-to-end throughput and DSP efficiency, while achieving higher power efficiency than a modern NVIDIA RTX 4090 GPU. | -
dc.language | eng | -
dc.publisher | Association for Computing Machinery (ACM) | -
dc.relation.ispartof | ACM Transactions on Reconfigurable Technology and Systems | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject | FPGA | -
dc.subject | mixed integer-floating-point inference | -
dc.subject | non-linear arithmetic operations | -
dc.subject | SIMD | -
dc.subject | systolic array | -
dc.subject | transformable arithmetic architecture | -
dc.subject | Transformer acceleration | -
dc.title | TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture | -
dc.type | Article | -
dc.description.nature | published_or_final_version | -
dc.identifier.doi | 10.1145/3714416 | -
dc.identifier.scopus | eid_2-s2.0-105003445308 | -
dc.identifier.volume | 18 | -
dc.identifier.issue | 1 | -
dc.identifier.spage | 1 | -
dc.identifier.epage | 31 | -
dc.identifier.eissn | 1936-7414 | -
dc.identifier.issnl | 1936-7406 | -
