Peta-Scale Embedded Photonics Architecture for Distributed Deep Learning Applications

Cited by: 4
Authors
Wu, Zhenguo [1 ]
Dai, Liang Yuan [1 ]
Novick, Asher [1 ]
Glick, Madeleine [1 ]
Zhu, Ziyi [1 ]
Rumley, Sebastien [2 ]
Michelogiannakis, George [3 ]
Shalf, John [3 ]
Bergman, Keren [4 ]
Affiliations
[1] Columbia University, Department of Electrical Engineering, New York, NY 10027, USA
[2] University of Applied Sciences and Arts Western Switzerland, Electrical Engineering, CH-2800 Delémont, Switzerland
[3] Lawrence Berkeley National Laboratory, Computer Science, Berkeley, CA 94720, USA
[4] Columbia University, Department of Electrical Engineering, New York, NY 10027, USA
Keywords
Distributed deep learning; collective communication; silicon photonics; optical interconnect
DOI
10.1109/JLT.2023.3276588
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronics and Communication Technology]
Discipline codes
0808; 0809
Abstract
As Deep Learning (DL) models grow larger and more complex, training jobs are increasingly distributed across multiple Computing Units (CUs) such as GPUs and TPUs. Each CU processes a sub-part of the model and synchronizes results with the others, and communication among the CUs has emerged as a key bottleneck in training. In this work, we present SiPAC, a Silicon Photonic Accelerated Compute cluster. SiPAC accelerates distributed DL training through two co-designed components: a photonic physical layer and a novel collective algorithm. The physical layer exploits embedded photonics to bring peta-scale I/O directly to the CUs of a DL-optimized cluster and uses resonator-based optical wavelength selectivity to realize hardware multicasting. The collective algorithm builds on this hardware multicasting primitive. Together, the two components expedite a variety of collective communications commonly employed in DL training and can drastically ease the communication bottleneck. We demonstrate the feasibility of the SiPAC architecture through 1) an optical testbed experiment in which an array of comb laser wavelengths is shuffled by a cascaded ring switch, with each ring selecting and forwarding multiple wavelengths to increase the effective communication bandwidth, thereby demonstrating the hardware multicasting primitive, and 2) a four-GPU testbed running a realistic DL workload that achieves a 22% system-level performance improvement relative to a similarly sized leaf-spine topology. Large-scale simulations show that SiPAC achieves a 1.4x to 5.9x communication time reduction compared to state-of-the-art compute clusters for representative collective communications.
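Note: The abstract's key mechanism is a hardware multicast primitive that lets one CU deliver its data to many peers in a single step, collapsing the step count of collectives such as allgather. The short Python model below is a back-of-envelope illustration of that effect under a standard alpha-beta cost model; it is not the paper's simulator, and all function names and parameter values are our assumptions.

# Illustrative alpha-beta cost model (our assumption, not the paper's simulator).
# Compares an allgather over N compute units when links only unicast (classic
# ring allgather, N-1 steps) versus when the fabric supports hardware multicast
# (one step: every unit broadcasts its shard to all peers simultaneously).

def ring_allgather_time(n_units: int, shard_bytes: int,
                        alpha: float, beta: float) -> float:
    """Ring allgather: N-1 steps, each forwarding one shard to the next unit."""
    return (n_units - 1) * (alpha + shard_bytes * beta)

def multicast_allgather_time(n_units: int, shard_bytes: int,
                             alpha: float, beta: float) -> float:
    """Idealized multicast allgather: a single step, assuming full fan-out
    and no contention on the optical fabric (n_units kept for symmetry)."""
    return alpha + shard_bytes * beta

if __name__ == "__main__":
    n, shard = 8, 64 * 2**20           # 8 CUs, 64 MiB gradient shard each
    alpha, beta = 5e-6, 1 / 100e9      # 5 us per-step latency, 100 GB/s links
    t_ring = ring_allgather_time(n, shard, alpha, beta)
    t_mcast = multicast_allgather_time(n, shard, alpha, beta)
    print(f"ring allgather:      {t_ring * 1e3:.2f} ms")
    print(f"multicast allgather: {t_mcast * 1e3:.2f} ms "
          f"({t_ring / t_mcast:.1f}x faster)")

Under these toy parameters the single-step multicast allgather is roughly 7x faster than the ring variant; the 1.4x to 5.9x range reported in the abstract covers full collectives on realistic clusters, where contention and algorithmic overheads reduce such idealized gains.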
Pages: 3737-3749
Page count: 13
Related Papers
Showing 10 of 50
  • [1] Memory and Interconnect Optimizations for Peta-Scale Deep Learning Systems
    Venkataramani, Swagath
    Srinivasan, Vijayalakshmi
    Choi, Jungwook
    Heidelberger, Philip
    Chang, Leland
    Gopalakrishnan, Kailash
2019 IEEE 26TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS (HIPC), 2019: 225-234
  • [2] Parallel computing in biomedical research and the search for peta-scale biomedical applications
    Stewart, CA
    Hart, D
    Sheppard, RW
    Li, H
    Cruise, R
    Moskvin, V
    Papiez, L
PARALLEL COMPUTING: SOFTWARE TECHNOLOGY, ALGORITHMS, ARCHITECTURES AND APPLICATIONS, 2004, 13: 719-726
  • [3] A Deep Learning Convolution Architecture for Simple Embedded Applications
    Kim, Chan
    Cho, Yong Cheol Peter
    Kwon, Youngsu
2017 IEEE 7TH INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS - BERLIN (ICCE-BERLIN), 2017: 74-78
  • [4] CineGrid Exchange: A workflow-based peta-scale distributed storage platform on a high-speed network
    Liu, Shaofeng
    Schulze, Jurgen P.
    Herr, Laurin
    Weekley, Jeffrey D.
    Zhu, Bing
    Osdol, Natalie V.
    Plepys, Dana
    Wan, Mike
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF GRID COMPUTING AND ESCIENCE, 2011, 27 (07): 966-976
  • [5] Architecture and applications for a distributed embedded firewall
    Payne, C
    Markham, T
17TH ANNUAL COMPUTER SECURITY APPLICATIONS CONFERENCE, PROCEEDINGS, 2001: 329-336
  • [6] A Framework Architecture for Student Learning in Distributed Embedded Systems
    Honig, William L.
    Laufer, Konstantin
    Thiruvathukal, George K.
2015 10TH IEEE INTERNATIONAL SYMPOSIUM ON INDUSTRIAL EMBEDDED SYSTEMS (SIES), 2015: 148-151
  • [7] A Study on Deep Learning Architecture and Their Applications
    Ghimire, Samip
    Ghimire, Sarala
    Subedi, Santosh
2019 INTERNATIONAL CONFERENCE ON POWER ELECTRONICS, CONTROL AND AUTOMATION (ICPECA-2019), 2019: 430-435
  • [8] Learning Distributed Representations and Deep Embedded Clustering of Texts
    Wang, Shuang
    Beheshti, Amin
    Wang, Yufei
    Lu, Jianchao
    Sheng, Quan Z.
    Elbourn, Stephen
    Alinejad-Rokny, Hamid
    ALGORITHMS, 2023, 16 (03)
  • [9] A Real-Time Container Architecture for Dependable Distributed Embedded Applications
    Telschig, Kilian
    Schoenberger, Andreas
    Knapp, Alexander
2018 IEEE 14TH INTERNATIONAL CONFERENCE ON AUTOMATION SCIENCE AND ENGINEERING (CASE), 2018: 1367-1374
  • [10] On the Feasibility of Hybrid Electrical/Optical Switch Architecture for Large-Scale Training of Distributed Deep Learning
    Thao Nguyen Truong
    Takano, Ryousei
PROCEEDINGS OF 2019 IEEE/ACM WORKSHOP ON PHOTONICS-OPTICS TECHNOLOGY ORIENTED NETWORKING, INFORMATION AND COMPUTING SYSTEMS (PHOTONICS2019), 2019: 7-14