DONG G,HU K K,YANG H B,et al. A general-purpose accelerator for convolutional neural networks[J]. Microelectronics & Computer,2023,40(5):97-103. doi: 10.19304/J.ISSN1000-7180.2022.0518


A general-purpose accelerator for convolutional neural networks


     

    Abstract: A general-purpose CNN (Convolutional Neural Network) accelerator is proposed to address shortcomings of current AI-specific accelerators, such as complex design and memory bottlenecks. Its RISC (Reduced Instruction Set Computer) instruction set supports efficient mapping of different types of convolutional neural networks onto the hardware accelerator. The general convolution module is a reconfigurable 3D systolic array composed of multiple basic operation units, supporting two-dimensional convolutions of different sizes; the scale of the array can be configured as needed, suiting different parallel acceleration requirements. To ease the memory bottleneck and further improve the accelerator's computing power, a multi-level cache structure is introduced into the input module to enable high-speed reading of off-chip data, and a multi-level data accumulation structure based on a ping-pong architecture is designed in the output module to buffer and output convolution results at high speed. The proposed architecture is implemented on an FPGA chip, and experimental results show that it achieves performance comparable to state-of-the-art accelerators with fewer computing resources and lower power consumption, while being more versatile.
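The output-stationary multiply-accumulate dataflow of a systolic array computing a 2D convolution, as described in the abstract, can be sketched in software. This is an illustrative model only: the function name `conv2d_systolic` and the loop structure are assumptions for exposition, not the paper's RTL design.

```python
# Illustrative sketch only: models the MAC dataflow of an output-stationary
# systolic array performing a 2D convolution. Each output position plays the
# role of one processing element (PE) that owns one accumulator.

def conv2d_systolic(image, kernel):
    """Accumulate partial sums per PE as kernel taps stream past."""
    H, W = len(image), len(image[0])
    K = len(kernel)
    out_h, out_w = H - K + 1, W - K + 1
    # One accumulator per PE (output-stationary dataflow).
    acc = [[0.0] * out_w for _ in range(out_h)]
    # Broadcast one kernel tap per "cycle"; every PE does one MAC per cycle.
    for ki in range(K):
        for kj in range(K):
            w = kernel[ki][kj]
            for r in range(out_h):
                for c in range(out_w):
                    acc[r][c] += w * image[r + ki][c + kj]
    return acc

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]  # 2x2 diagonal tap
print(conv2d_systolic(image, kernel))  # [[6.0, 8.0], [12.0, 14.0]]
```

In hardware, the two inner loops run fully in parallel across the PE grid, so the cycle count scales with the kernel size rather than the output size.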

     
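The ping-pong accumulation structure mentioned in the abstract alternates two buffers so that one collects partial sums while the other's finished results are drained out. The sketch below is an assumed software analogy (class and method names are hypothetical), not the paper's hardware implementation.

```python
# Illustrative sketch only: double-buffered ("ping-pong") accumulation.
# While the active buffer receives partial sums from the compute array,
# the other buffer's completed results can be streamed to memory.

class PingPongAccumulator:
    def __init__(self, size):
        self.buffers = [[0.0] * size, [0.0] * size]
        self.active = 0  # index of the buffer currently accumulating

    def accumulate(self, partial_sums):
        buf = self.buffers[self.active]
        for i, v in enumerate(partial_sums):
            buf[i] += v

    def swap_and_drain(self):
        """Hand off the active buffer for output, clear it, and switch
        accumulation to the other buffer."""
        done = list(self.buffers[self.active])
        self.buffers[self.active] = [0.0] * len(done)
        self.active ^= 1
        return done

acc = PingPongAccumulator(4)
acc.accumulate([1, 2, 3, 4])
acc.accumulate([1, 1, 1, 1])
print(acc.swap_and_drain())  # [2.0, 3.0, 4.0, 5.0]
```

In hardware the "drain" happens concurrently with the next pass's accumulation, which is what hides the output latency behind computation.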
