JIAO Feng, MA Yao, BI Siying, MA Zhong. Design of instruction control system for neural network accelerator[J]. Microelectronics & Computer, 2022, 39(8): 78-85. DOI: 10.19304/J.ISSN1000-7180.2021.1344


Design of instruction control system for neural network accelerator

Abstract: Deep neural networks are increasingly used for intelligent image and speech processing, but their many operator and parameter types and their compute- and memory-intensive nature restrict their use in embedded scenarios such as aerospace and mobile intelligent terminals. To address this problem, this paper proposes decoupling the input data stream for efficient pipelined parallel processing and designs an instruction control system for a neural network accelerator. The input data of different operators are partitioned by loop tiling and mapped to instruction group configurations; multiple state machines then cooperate to carry out three-stage distribution control of the instruction information, realizing a four-stage parallel pipeline of instruction parsing, data input, computation, and data output that fully exploits data reuse opportunities within each tile, reducing memory access bandwidth and the pipeline idle rate. The design was deployed on a ZCU102 development board, and tests show that it supports a variety of common neural network layer types and a wide range of parameter configurations. At a frequency of 200 MHz, the peak compute performance is 800 GOPS; running the VGG16 network model, the measured performance is 489.4 GOPS at a power consumption of 4.42 W, for an energy efficiency of 113.3 GOPS/W, which is superior to the comparable neural network accelerators, CPUs, and GPUs surveyed. The experimental results show that decomposing the data stream and using instruction scheduling to achieve an efficient parallel pipeline addresses the two major challenges of generality and energy efficiency, and an instruction control system designed with this method can provide a solution for deploying neural network accelerators on embedded platforms.
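The pipelining scheme described in the abstract can be illustrated with a small model. The C sketch below is not the paper's RTL or instruction format: the InstrGroup fields, the four stage names, and the per-stage cycle counts are hypothetical assumptions chosen only to show why overlapping the four stages (instruction parsing, data input, computation, data output) across loop tiles lowers the pipeline idle rate. In steady state, throughput is bounded by the slowest stage rather than by the sum of all four.

/* Minimal sketch (not the paper's hardware): models the four-stage
 * pipeline over loop tiles described in the abstract. All field names,
 * tile sizes, and per-stage latencies are hypothetical assumptions. */
#include <stdio.h>

/* One instruction group: the configuration for a single loop tile. */
typedef struct {
    int tile_h, tile_w, tile_c;   /* tile extents (hypothetical fields) */
    int in_addr, out_addr;        /* on-chip buffer base addresses      */
} InstrGroup;

/* Assumed per-stage latencies, in cycles, for one tile. */
enum { PARSE = 2, LOAD = 8, COMPUTE = 10, STORE = 6 };

/* Without pipelining, every tile pays the full sum of the four stages. */
static long cycles_sequential(int n_tiles) {
    return (long)n_tiles * (PARSE + LOAD + COMPUTE + STORE);
}

/* With a 4-deep pipeline, after the first tile fills the pipeline, one
 * tile completes per 'slowest stage' interval. */
static long cycles_pipelined(int n_tiles) {
    int slowest = COMPUTE; /* max of the four stage latencies above */
    return (PARSE + LOAD + COMPUTE + STORE) + (long)(n_tiles - 1) * slowest;
}

int main(void) {
    InstrGroup g = { .tile_h = 14, .tile_w = 14, .tile_c = 64,
                     .in_addr = 0x0000, .out_addr = 0x8000 };
    int n_tiles = 256; /* e.g. one layer split into 256 tiles */
    printf("tile %dx%dx%d, %d tiles\n", g.tile_h, g.tile_w, g.tile_c, n_tiles);
    printf("sequential: %ld cycles\n", cycles_sequential(n_tiles));
    printf("pipelined : %ld cycles\n", cycles_pipelined(n_tiles));
    return 0;
}

With these illustrative numbers the pipelined schedule needs 2 576 cycles versus 6 656 sequentially; the same reasoning explains why keeping the data-input and data-output stages overlapped with computation, and reusing data within each tile, reduces both idle cycles and off-chip memory bandwidth.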

     
