Citation: ZHAN Y M, HU X, GUO Y. Two-dimensional FFT parallel implementation and optimization on FT-X DSP platform[J]. Microelectronics & Computer, 2023, 40(2): 71-78. doi: 10.19304/J.ISSN1000-7180.2022.0360

Two-dimensional FFT parallel implementation and optimization on FT-X DSP platform

Abstract: The two-dimensional FFT is a typical image-processing algorithm, widely used in image filtering, fast convolution, target tracking, and other fields. To meet the real-time processing requirements of high-resolution images, a multi-core parallel implementation of the 2D FFT is proposed for the self-developed FT-X many-core DSP. Based on the many-core programming model, task initialization is completed through multi-core task deployment and address space remapping, and data-parallel processing across 24 cores is achieved with a speedup of 19.8×. On this basis, an implicit transpose scheme based on DMA strided transfer is proposed, which uses matrix address allocation to overcome the stride-length limit of strided transfers for large matrices. Experimental results show that, at a data scale of 8K×8K, the scheme saves 91% and 65% of the transpose time compared with direct transpose and instruction-based implicit transpose, respectively. In addition, a multi-core load imbalance that arises in one special case is identified and resolved, reducing the gap in execution time between cores from 64% to 12% and the overall execution time by 26%.
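The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, assuming the standard row-column decomposition of the 2D FFT. The column pass fetches each column with a strided read instead of running a separate transpose pass, which is the role the abstract assigns to DMA strided transfer. The names `fft`, `strided_read`, and `fft2_flat` are hypothetical, and nothing here models the FT-X hardware, its DMA engine, or the 24-core task split.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def strided_read(buf, start, stride, count):
    """Software stand-in (assumption) for a strided DMA read:
    gather `count` elements spaced `stride` apart, beginning at `start`."""
    return [buf[start + i * stride] for i in range(count)]


def fft2_flat(buf, rows, cols):
    """2D FFT of a row-major flat buffer; returns a row-major flat result."""
    # Pass 1: FFT every row. Rows are contiguous in a row-major buffer.
    stage = [0j] * (rows * cols)
    for r in range(rows):
        stage[r * cols:(r + 1) * cols] = fft(buf[r * cols:(r + 1) * cols])

    # Pass 2: FFT every column. Each column is gathered with a strided read
    # (stride = cols), so no explicit transpose of the matrix is performed.
    out = [0j] * (rows * cols)
    for c in range(cols):
        col = fft(strided_read(stage, c, cols, rows))
        for r in range(rows):
            out[r * cols + c] = col[r]
    return out
```

In this sketch the rows of pass 1 and the columns of pass 2 are independent of one another, which is what makes a data-parallel split across cores possible; how the paper actually partitions the work is not stated in the abstract.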

     
