
Article: TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture

Title: TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture
Authors: Wu, Jiajun; Song, Mo; Zhao, Jingmin; Gao, Yizhao; Li, Jia; So, Hayden Kwok Hay
Keywords: FPGA
mixed integer-floating-point inference
non-linear arithmetic operations
SIMD
systolic array
transformable arithmetic architecture
Transformer acceleration
Issue Date: 14-Mar-2025
Publisher: Association for Computing Machinery (ACM)
Citation: ACM Transactions on Reconfigurable Technology and Systems, 2025, v. 18, n. 1, p. 1-31
Abstract

Modern transformer-based deep neural networks present unique technical challenges for effective acceleration in real-world applications. Apart from the vast number of linear operations required by their size, modern transformer models increasingly rely on precise non-linear computations, which make traditional low-bitwidth quantization methods and fixed-dataflow matrix accelerators ineffective for end-to-end acceleration. To address the need to accelerate both linear and non-linear operations in a unified and programmable framework, this article introduces TATAA. TATAA employs 8-bit integer (int8) arithmetic for quantized linear layer operations through post-training quantization, while relying on bfloat16 floating-point arithmetic to approximate the non-linear layers of a transformer model. The TATAA hardware features a transformable arithmetic architecture that supports both formats at runtime with minimal overhead, enabling it to switch between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations. An end-to-end compiler is presented to enable flexible mapping from emerging transformer models to the proposed hardware. Experimental results indicate that our mixed-precision design incurs only a 0.14% to 1.16% accuracy drop compared with pre-trained single-precision transformer models across a range of vision, language, and generative text applications. Our prototype implementation on the Alveo U280 FPGA currently achieves 2,935.2 GOPS of throughput on linear layers and a maximum of 189.5 GFLOPS for non-linear operations, outperforming related works in end-to-end throughput and DSP efficiency, while achieving higher power efficiency than a modern NVIDIA RTX 4090 GPU.
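To make the scheme concrete, the NumPy sketch below illustrates the mixed int8/bfloat16 flow described in the abstract: linear layers use symmetric post-training quantization and an int8 matrix multiply with int32 accumulation (playing the role of the systolic array mode), while a non-linear step (here, softmax) is evaluated with bfloat16-rounded intermediates (playing the role of the SIMD mode). This is an illustrative software approximation only, not the TATAA hardware, compiler, or quantization recipe; the function names (quantize_int8, to_bfloat16, linear_int8, softmax_bf16) are hypothetical.

import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor post-training quantization to int8.
    scale = float(np.max(np.abs(x))) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def to_bfloat16(x):
    # Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def linear_int8(x_q, x_scale, w_q, w_scale):
    # int8 x int8 matrix multiply with int32 accumulation (systolic-array-style
    # GEMM path), followed by dequantization back to floating point.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    return acc.astype(np.float32) * (x_scale * w_scale)

def softmax_bf16(x):
    # Row-wise softmax with bfloat16-rounded intermediates, standing in for
    # the vectorized (SIMD-mode) non-linear path.
    x = to_bfloat16(x - x.max(axis=-1, keepdims=True))
    e = to_bfloat16(np.exp(x))
    return to_bfloat16(e / e.sum(axis=-1, keepdims=True))

# Toy example: a quantized linear projection followed by a bf16 softmax.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 64)).astype(np.float32)
x_q, x_s = quantize_int8(x)
w_q, w_s = quantize_int8(w)
scores = linear_int8(x_q, x_s, w_q, w_s)   # int8 GEMM path
probs = softmax_bf16(scores)               # bfloat16 non-linear path
print(probs.sum(axis=-1))                  # each row sums to ~1.0

The bfloat16 behaviour is emulated here by truncating float32 mantissas, which matches the format's 8-bit exponent and 7-bit mantissa but ignores rounding-mode details.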


Persistent Identifier: http://hdl.handle.net/10722/366101
ISSN: 1936-7406
2023 Impact Factor: 3.1
2023 SCImago Journal Rankings: 0.802

 

DC Field | Value | Language
dc.contributor.author | Wu, Jiajun | -
dc.contributor.author | Song, Mo | -
dc.contributor.author | Zhao, Jingmin | -
dc.contributor.author | Gao, Yizhao | -
dc.contributor.author | Li, Jia | -
dc.contributor.author | So, Hayden Kwok Hay | -
dc.date.accessioned | 2025-11-15T00:35:32Z | -
dc.date.available | 2025-11-15T00:35:32Z | -
dc.date.issued | 2025-03-14 | -
dc.identifier.citation | ACM Transactions on Reconfigurable Technology and Systems, 2025, v. 18, n. 1, p. 1-31 | -
dc.identifier.issn | 1936-7406 | -
dc.identifier.uri | http://hdl.handle.net/10722/366101 | -
dc.description.abstract | Modern transformer-based deep neural networks present unique technical challenges for effective acceleration in real-world applications. Apart from the vast number of linear operations required by their size, modern transformer models increasingly rely on precise non-linear computations, which make traditional low-bitwidth quantization methods and fixed-dataflow matrix accelerators ineffective for end-to-end acceleration. To address the need to accelerate both linear and non-linear operations in a unified and programmable framework, this article introduces TATAA. TATAA employs 8-bit integer (int8) arithmetic for quantized linear layer operations through post-training quantization, while relying on bfloat16 floating-point arithmetic to approximate the non-linear layers of a transformer model. The TATAA hardware features a transformable arithmetic architecture that supports both formats at runtime with minimal overhead, enabling it to switch between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations. An end-to-end compiler is presented to enable flexible mapping from emerging transformer models to the proposed hardware. Experimental results indicate that our mixed-precision design incurs only a 0.14% to 1.16% accuracy drop compared with pre-trained single-precision transformer models across a range of vision, language, and generative text applications. Our prototype implementation on the Alveo U280 FPGA currently achieves 2,935.2 GOPS of throughput on linear layers and a maximum of 189.5 GFLOPS for non-linear operations, outperforming related works in end-to-end throughput and DSP efficiency, while achieving higher power efficiency than a modern NVIDIA RTX 4090 GPU. | -
dc.language | eng | -
dc.publisher | Association for Computing Machinery (ACM) | -
dc.relation.ispartof | ACM Transactions on Reconfigurable Technology and Systems | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject | FPGA | -
dc.subject | mixed integer-floating-point inference | -
dc.subject | non-linear arithmetic operations | -
dc.subject | SIMD | -
dc.subject | systolic array | -
dc.subject | transformable arithmetic architecture | -
dc.subject | Transformer acceleration | -
dc.title | TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture | -
dc.type | Article | -
dc.description.nature | published_or_final_version | -
dc.identifier.doi | 10.1145/3714416 | -
dc.identifier.scopus | eid_2-s2.0-105003445308 | -
dc.identifier.volume | 18 | -
dc.identifier.issue | 1 | -
dc.identifier.spage | 1 | -
dc.identifier.epage | 31 | -
dc.identifier.eissn | 1936-7414 | -
dc.identifier.issnl | 1936-7406 | -
