General Implementation of 1-D FFT on the Sunway 26010 Processor

Authors
Zhao Y.-W. [1,4]
Ao Y.-L. [2]
Yang C. [1,2]
Liu F.-F. [1,3,4]
Yin W.-W. [5]
Lin R.-F. [5]
Affiliations
[1] Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing
[2] School of Mathematical Sciences, Peking University, Beijing
[3] State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing
[4] University of Chinese Academy of Sciences, Beijing
[5] National Research Center of Parallel Computer Engineering and Technology, Beijing
Source
Ruan Jian Xue Bao/Journal of Software | 2020 / Vol. 31 / No. 10
Funding
National Key Research and Development Program of China
Keywords
1-D FFT; Cooley-Tukey; Multi-core parallel; Sunway 26010 processor; Two-layer decomposition
DOI
10.13328/j.cnki.jos.005848
Abstract
A two-layer decomposition 1-D FFT multi-core parallel algorithm is proposed, tailored to the characteristics of the Sunway 26010 processor. It is based on the iterative Stockham FFT framework and the Cooley-Tukey FFT algorithm, and it decomposes a large-scale FFT into a series of small-scale FFTs. Performance is improved through careful task partitioning, register communication, double buffering, and SIMD vectorization. In tests, the algorithm achieves an average speedup of 44.53x and a maximum speedup of 56.33x over the FFTW 3.3.4 library running on a single MPE, with a maximum bandwidth utilization of 83.45%. © Copyright 2020, Institute of Software, the Chinese Academy of Sciences. All rights reserved.
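The decomposition of a large FFT into small FFTs mentioned in the abstract rests on the Cooley-Tukey factorization. The following is a minimal single-core sketch of one such factorization step for N = n1 * n2, not the paper's two-layer multi-core algorithm; the function names `dft` and `cooley_tukey_split` are hypothetical, and a naive DFT stands in for the optimized small-scale kernels.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, standing in for a small-scale FFT kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def cooley_tukey_split(x, n1, n2):
    """One Cooley-Tukey step: a length n1*n2 DFT built from small DFTs.

    Input index  j = j1*n2 + j2, output index k = k1 + n1*k2.
    """
    n = n1 * n2
    assert len(x) == n

    # Step 1: n2 DFTs of length n1 over the stride-n2 subsequences.
    cols = [dft(x[j2::n2]) for j2 in range(n2)]  # cols[j2][k1]

    # Step 2: multiply by the twiddle factors exp(-2*pi*i*j2*k1/n).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)

    # Step 3: n1 DFTs of length n2, scattered to the output positions.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j2][k1] for j2 in range(n2)])  # row[k2]
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out
```

Applying the same split again inside each small DFT would give a second layer of decomposition; how the paper maps the layers onto cores and local memory is not specified in this record.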
Pages: 3184-3196
Page count: 12