A Low-Cost Floating-Point Dot-Product-Dual-Accumulate Architecture for HPC-Enabled AI

Cited by: 2

Authors:

Tan, Hongbing [1]

Huang, Libo [1]

Zheng, Zhong [1]

Guo, Hui [1]

Yang, Qianmin [1]

Shen, Li [1]

Chen, Gang [2]

Xiao, Liquan [1]

Xiao, Nong

Affiliations:

[1] National University of Defense Technology, College of Computer Science and Technology, Changsha 410073, People's Republic of China

[2] Sun Yat-sen University, School of Data and Computer Science, Guangzhou 510006, People's Republic of China

Source:

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS | 2024 / Vol. 43 / Issue 02

Keywords:

Dot-product-dual-accumulate (DPDAC); fused multiply-add; high-performance computing (HPC)-enabled artificial intelligence (AI); mixed-precision; numerical precision conversion; transprecision computing; FUSED-MULTIPLY-ADD; PERFORMANCE;

DOI:

10.1109/TCAD.2023.3316994

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

The dot product $\sum_{i=1}^{N} A_i \times B_i$ is one of the most frequently used operations in a wide variety of high-performance computing (HPC) and artificial intelligence (AI) applications. However, for large-scale algorithms such as GEMM and FFT, independent additions are needed to accumulate the results of length-limited dot products into the final result, which increases latency and overhead. Hence, we propose a dot-product-dual-accumulate (DPDAC) architecture capable of performing $\sum_{i=1}^{N} A_i \times B_i + \sum_{j=1}^{M} C_j$, with $N = 1, 2, 4$ and $M = 1, 2$, on a wide range of formats. The proposed architecture supports both single-path and dual-path execution. The single path is designed for performing DP FMA or DPDAC on lower-precision formats, while the dual path supports parallel operations for single-precision (SP) addition and a 2-term SP or TF32 dot product or a 4-term HP or BF16 dot product. Moreover, the proposed architecture supports numerical precision conversion, allowing numbers to be converted to higher or lower formats. The proposed DPDAC has been demonstrated to significantly reduce overhead compared with discrete designs that use multiple single-mode FP units to achieve the same functionality. Furthermore, compared with state-of-the-art multiple-precision designs, the proposed architecture supports a wider range of formats and a greater variety of operations at lower cost.
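
As a rough illustration of the operation described in the abstract, the following Python sketch models $\sum_{i=1}^{N} A_i \times B_i + \sum_{j=1}^{M} C_j$ purely in software. The function names `dpdac` and `long_dot` are hypothetical and do not come from the paper. The sketch uses ordinary Python floats, so it does not model the hardware datapath, the supported formats (DP, SP, TF32, HP, BF16), or their rounding behavior. It only shows how folding the running sum into the accumulate operand can avoid a separate addition after each length-limited dot product.

```python
from typing import Sequence


def dpdac(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Software model of sum(a[i] * b[i]) + sum(c[j]).

    Assumption: N = len(a) = len(b) is 1, 2, or 4 and M = len(c) is 1 or 2,
    matching the term counts listed in the abstract.
    """
    if len(a) != len(b) or len(a) not in (1, 2, 4):
        raise ValueError("dot-product length N must be 1, 2, or 4")
    if len(c) not in (1, 2):
        raise ValueError("accumulate count M must be 1 or 2")
    return sum(x * y for x, y in zip(a, b)) + sum(c)


def long_dot(a: Sequence[float], b: Sequence[float], n: int = 4) -> float:
    """Build a long dot product from n-term DPDAC steps.

    The running sum is passed as the accumulate operand of each step,
    so no separate addition is issued between chunks.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    acc = 0.0
    for k in range(0, len(a), n):
        xs = list(a[k:k + n])
        ys = list(b[k:k + n])
        pad = n - len(xs)          # zero-pad a short final chunk
        xs += [0.0] * pad
        ys += [0.0] * pad
        acc = dpdac(xs, ys, [acc])
    return acc
```

For example, `dpdac([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])` evaluates a 2-term dot product with two accumulate operands in a single call, and `long_dot` chains 4-term steps over vectors of arbitrary length.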

Pages: 681-693

Page count: 13

13 items in total

[1] Dual-Path Architecture of Floating-Point Dot Product Computation
Yao Tao
An Jianfeng
Gao Deyuan
Fan Xiaoya
2011 INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND NETWORK TECHNOLOGY (ICCSNT), VOLS 1-4, 2012, : 2272 - 2276
[2] A Low-Cost Floating-Point FMA Unit Supporting Package Operations for HPC-AI Applications
Tan, Hongbing
Zhang, Jing
He, Xiaowei
Huang, Libo
Wang, Yongwen
Xiao, Liquan
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2024, 71 (07) : 3488 - 3492
[3] Exact Dot Product Accumulate Operators for 8-bit Floating-Point Deep Learning
Desrentes, Oregane
de Dinechin, Benoit Dupont
Le Maire, Julien
2023 26TH EUROMICRO CONFERENCE ON DIGITAL SYSTEM DESIGN, DSD 2023, 2023, : 642 - 649
[4] Low-Cost High-Precision Architecture for Arbitrary Floating-Point Nth Root Computation
Hong, Wanyuan
Chen, Hui
Quan, Lianghua
Fu, Yuxiang
Li, Li
2023 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS, 2023,
[5] A Low-Cost High Radix Floating-Point Square-Root Circuit
Yang, Yuheng
Yuan, Qing
Liu, Jian
ELECTRONICS, 2021, 10 (16)
[6] Low-Cost Concurrent Error Detection for Floating-Point Unit (FPU) Controllers
Maniatakos, Michail
Kudva, Prabhakar
Fleischer, Bruce M.
Makris, Yiorgos
IEEE TRANSACTIONS ON COMPUTERS, 2013, 62 (07) : 1376 - 1388
[7] An FPGA-based low-cost VLIW floating-point processor for CNC applications
Dong, Jingchuan
Wang, Taiyong
Li, Bo
Liu, Zhe
Yu, Zhigiang
MICROPROCESSORS AND MICROSYSTEMS, 2017, 50 : 14 - 25
[8] LOCOFloat: A Low-Cost Floating-Point Format for FPGAs: Application to HIL Simulators
Sanchez, Alberto
de Castro, Angel
Sofia Martinez-Garcia, Maria
Garrido, Javier
ELECTRONICS, 2020, 9 (01)
[9] A Low Complexity Floating-Point Complex Multiplier with a Three-term Dot-Product Unit
Yun, Sangho
Sobelman, Gerald E.
Zhou, Xiaofang
2014 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, COMMUNICATIONS AND COMPUTING (ICSPCC), 2014, : 549 - 552
[10] Low-Cost Binary128 Floating-Point FMA Unit Design with SIMD Support
Huang, Libo
Ma, Sheng
Shen, Li
Wang, Zhiying
Xiao, Nong
IEEE TRANSACTIONS ON COMPUTERS, 2012, 61 (05) : 745 - 751