An efficient SIMD parallel memory structure for radix-2 FFT computation

Authors
Chen H.-Y. [1]
Yang C. [1]
Liu S. [1]
Liu Z. [1]
Affiliations
[1] College of Computer, National University of Defense Technology, Changsha, 410073, Hunan
Source
Chen, Hai-Yan (hychen608@163.com) | 2016 / Chinese Institute of Electronics / 44
Keywords
Access conflict; Data shuffle; FFT; Low-order interleave; Parallel memory; SIMD
DOI
10.3969/j.issn.0372-2112.2016.02.001
Abstract
As more execution units are integrated into digital signal processors (DSPs) with single instruction, multiple data (SIMD) extensions, the flexibility and bandwidth efficiency of parallel memory access significantly affect overall practical performance. Based on a detailed analysis of the memory access problems of the radix-2 fast Fourier transform (FFT) algorithm on a general SIMD DSP, this paper uses XOR logic on a subset of the address bits to translate memory access addresses, achieving conflict-free parallel SIMD memory accesses for FFT computation. Several memory access instructions with special shuffle modes are then proposed, which completely eliminate the extra shuffle instructions that the radix-2 FFT algorithm otherwise requires on a SIMD architecture. Finally, the vector memory (VM) of the 16-way SIMD DSP YHFT-Matrix2 is optimized with these methods. Test results show that the optimized VM achieves fully pipelined, conflict-free memory accesses and 100% utilization of parallel memory access bandwidth, at the cost of an 18% increase in area. Compared with the design before optimization, radix-2 FFTs of various sizes achieve speedups ranging from 1.32 to 2.66. © 2016, Chinese Institute of Electronics. All rights reserved.
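The abstract does not spell out which address bits are XORed, so the following is only a minimal sketch of the general idea, not the paper's actual translation logic. It assumes 16 single-ported banks (matching the 16-way SIMD width) and compares plain low-order interleaving with a hypothetical mapping that XOR-folds every 4-bit group of the address into the bank index. The function names `bank_low_order`, `bank_xor_fold`, and `max_conflicts` are illustrative. For the power-of-two strides that radix-2 butterfly stages produce, low-order interleaving sends several lanes to the same bank, while the folded mapping spreads the 16 lanes over 16 distinct banks.

```python
LANES = 16      # assumed: one memory bank per SIMD lane
BANK_BITS = 4   # log2(LANES)


def bank_low_order(addr):
    """Low-order interleave: the bank is the low 4 address bits."""
    return addr % LANES


def bank_xor_fold(addr):
    """Hypothetical XOR mapping: fold all 4-bit address groups together.

    This is an illustrative choice; the paper uses only part of the
    address bits and its exact selection is not given in the abstract.
    """
    bank = 0
    while addr:
        bank ^= addr & (LANES - 1)
        addr >>= BANK_BITS
    return bank


def max_conflicts(bank_fn, base, stride):
    """Largest number of lanes hitting one bank in a single SIMD access.

    Lane i accesses address base + i * stride; a result of 1 means the
    access is conflict-free.
    """
    counts = [0] * LANES
    for lane in range(LANES):
        counts[bank_fn(base + lane * stride)] += 1
    return max(counts)


if __name__ == "__main__":
    # Strides seen across the stages of a 1024-point radix-2 FFT.
    for k in range(10):
        stride = 1 << k
        print(stride,
              max_conflicts(bank_low_order, 0, stride),
              max_conflicts(bank_xor_fold, 0, stride))
```

The fold works here because, for a stride of 2^k and a base smaller than the stride, the lane index occupies four consecutive address bits, and folding maps those bits onto the four bank-index bits as a rotation, which is a permutation of the 16 lanes. A hardware design would normally XOR only a few selected bits to keep the logic small, which is presumably what "parts of the address bit" refers to.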
Pages: 241-246
Number of pages: 5