Peta-Scale Embedded Photonics Architecture for Distributed Deep Learning Applications

Cited by: 4
Authors
Wu, Zhenguo [1 ]
Dai, Liang Yuan [1 ]
Novick, Asher [1 ]
Glick, Madeleine [1 ]
Zhu, Ziyi [1 ]
Rumley, Sebastien [2 ]
Michelogiannakis, George [3 ]
Shalf, John [3 ]
Bergman, Keren [4 ]
Affiliations
[1] Columbia University, Department of Electrical Engineering, New York, NY 10027, USA
[2] University of Applied Sciences and Arts Western Switzerland, Electrical Engineering, CH-2800 Delémont, Switzerland
[3] Lawrence Berkeley National Laboratory, Computer Science, Berkeley, CA 94720, USA
[4] Columbia University, Department of Electrical Engineering, New York, NY 10027, USA
Keywords
Distributed deep learning; collective communication; silicon photonics; optical interconnect
DOI
10.1109/JLT.2023.3276588
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronics and Communication Technology]
Discipline codes
0808; 0809
Abstract
As Deep Learning (DL) models grow larger and more complex, training jobs are increasingly distributed across multiple Computing Units (CUs) such as GPUs and TPUs. Each CU processes a sub-part of the model and synchronizes results with the others, and communication among the CUs has emerged as a key bottleneck in training. In this work, we present SiPAC, a Silicon Photonic Accelerated Compute cluster. SiPAC accelerates distributed DL training through two co-designed components: a photonic physical layer and a novel collective algorithm. The physical layer exploits embedded photonics to bring peta-scale I/O directly to the CUs of a DL-optimized cluster and uses resonator-based optical wavelength selectivity to realize hardware multicasting. The collective algorithm builds on this hardware multicasting primitive. Together, the two components expedite a variety of collective communications commonly employed in DL training and can drastically ease the communication bottleneck. We demonstrate the feasibility of the SiPAC architecture through 1) an optical testbed experiment in which an array of comb laser wavelengths is shuffled by a cascaded ring switch, with each ring selecting and forwarding multiple wavelengths to increase the effective communication bandwidth, thereby demonstrating the hardware multicasting primitive, and 2) a four-GPU testbed running a realistic DL workload that achieves a 22% system-level performance improvement relative to a similarly sized leaf-spine topology. Large-scale simulations show that SiPAC achieves a 1.4x to 5.9x communication time reduction compared to state-of-the-art compute clusters for representative collective communications.
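Note: The abstract's key mechanism is a hardware multicast primitive that lets one CU deliver its data to many peers in a single step, collapsing the step count of collectives such as allgather. The short Python model below is a back-of-envelope illustration of that effect under a standard alpha-beta cost model; it is not the paper's simulator, and all function names and parameter values are our assumptions.

# Illustrative alpha-beta cost model (our assumption, not the paper's simulator).
# Compares an allgather over N compute units when links only unicast (classic
# ring allgather, N-1 steps) versus when the fabric supports hardware multicast
# (one step: every unit broadcasts its shard to all peers simultaneously).

def ring_allgather_time(n_units: int, shard_bytes: int,
                        alpha: float, beta: float) -> float:
    """Ring allgather: N-1 steps, each forwarding one shard to the next unit."""
    return (n_units - 1) * (alpha + shard_bytes * beta)

def multicast_allgather_time(n_units: int, shard_bytes: int,
                             alpha: float, beta: float) -> float:
    """Idealized multicast allgather: a single step, assuming full fan-out
    and no contention on the optical fabric (n_units kept for symmetry)."""
    return alpha + shard_bytes * beta

if __name__ == "__main__":
    n, shard = 8, 64 * 2**20           # 8 CUs, 64 MiB gradient shard each
    alpha, beta = 5e-6, 1 / 100e9      # 5 us per-step latency, 100 GB/s links
    t_ring = ring_allgather_time(n, shard, alpha, beta)
    t_mcast = multicast_allgather_time(n, shard, alpha, beta)
    print(f"ring allgather:      {t_ring * 1e3:.2f} ms")
    print(f"multicast allgather: {t_mcast * 1e3:.2f} ms "
          f"({t_ring / t_mcast:.1f}x faster)")

Under these toy parameters the single-step multicast allgather is roughly 7x faster than the ring variant; the 1.4x to 5.9x range reported in the abstract covers full collectives on realistic clusters, where contention and algorithmic overheads reduce such idealized gains.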
Pages: 3737-3749
Page count: 13
Related Papers
Showing 10 of 50
  • [1] Memory and Interconnect Optimizations for Peta-Scale Deep Learning Systems
    Venkataramani, Swagath
    Srinivasan, Vijayalakshmi
    Choi, Jungwook
    Heidelberger, Philip
    Chang, Leland
    Gopalakrishnan, Kailash
2019 IEEE 26TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS (HIPC), 2019: 225-234
  • [2] Parallel computing in biomedical research and the search for peta-scale biomedical applications
    Stewart, CA
    Hart, D
    Sheppard, RW
    Li, H
    Cruise, R
    Moskvin, V
    Papiez, L
PARALLEL COMPUTING: SOFTWARE TECHNOLOGY, ALGORITHMS, ARCHITECTURES AND APPLICATIONS, 2004, 13: 719-726
  • [3] A Deep Learning Convolution Architecture for Simple Embedded Applications
    Kim, Chan
    Cho, Yong Cheol Peter
    Kwon, Youngsu
2017 IEEE 7TH INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS - BERLIN (ICCE-BERLIN), 2017: 74-78
  • [4] CineGrid Exchange: A workflow-based peta-scale distributed storage platform on a high-speed network
    Liu, Shaofeng
    Schulze, Jurgen P.
    Herr, Laurin
    Weekley, Jeffrey D.
    Zhu, Bing
    Osdol, Natalie V.
    Plepys, Dana
    Wan, Mike
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF GRID COMPUTING AND ESCIENCE, 2011, 27 (07): 966-976
  • [5] Architecture and applications for a distributed embedded firewall
    Payne, C
    Markham, T
17TH ANNUAL COMPUTER SECURITY APPLICATIONS CONFERENCE, PROCEEDINGS, 2001: 329-336
  • [6] A Framework Architecture for Student Learning in Distributed Embedded Systems
    Honig, William L.
    Laufer, Konstantin
    Thiruvathukal, George K.
2015 10TH IEEE INTERNATIONAL SYMPOSIUM ON INDUSTRIAL EMBEDDED SYSTEMS (SIES), 2015: 148-151
  • [7] A Study on Deep Learning Architecture and Their Applications
    Ghimire, Samip
    Ghimire, Sarala
    Subedi, Santosh
2019 INTERNATIONAL CONFERENCE ON POWER ELECTRONICS, CONTROL AND AUTOMATION (ICPECA-2019), 2019: 430-435
  • [8] Learning Distributed Representations and Deep Embedded Clustering of Texts
    Wang, Shuang
    Beheshti, Amin
    Wang, Yufei
    Lu, Jianchao
    Sheng, Quan Z.
    Elbourn, Stephen
    Alinejad-Rokny, Hamid
    ALGORITHMS, 2023, 16 (03)
  • [9] A Real-Time Container Architecture for Dependable Distributed Embedded Applications
    Telschig, Kilian
    Schoenberger, Andreas
    Knapp, Alexander
2018 IEEE 14TH INTERNATIONAL CONFERENCE ON AUTOMATION SCIENCE AND ENGINEERING (CASE), 2018: 1367-1374
  • [10] On the Feasibility of Hybrid Electrical/Optical Switch Architecture for Large-Scale Training of Distributed Deep Learning
    Thao Nguyen Truong
    Takano, Ryousei
PROCEEDINGS OF 2019 IEEE/ACM WORKSHOP ON PHOTONICS-OPTICS TECHNOLOGY ORIENTED NETWORKING, INFORMATION AND COMPUTING SYSTEMS (PHOTONICS2019), 2019: 7-14