SHI Siyu, WEI Jizeng. Design of hardware accelerator based on MobileNet-SSD for object detection algorithm[J]. Microelectronics & Computer, 2022, 39(6): 99-107. DOI: 10.19304/J.ISSN1000-7180.2021.1352


Design of hardware accelerator based on MobileNet-SSD for object detection algorithm


Abstract: With the rapid development of artificial intelligence, modern convolutional neural networks have achieved great success in image recognition and classification tasks. However, as complex neural network models continue to evolve toward deeper structures, they cannot maintain both high performance and high accuracy when deployed on mobile devices with limited area and power budgets. To address this problem, a MobileNet-SSD object detection hardware accelerator based on a software-hardware co-design approach is proposed for the FPGA platform. First, pruning and quantization algorithms are used to compress the original MobileNet-SSD model: the pruning step is a convolution-kernel pruning algorithm targeting the parameter redundancy of point-wise convolution layers, while the quantization step uniformly converts the floating-point numbers in the trained network model into fixed-point numbers for convolution computation. Then, a configurable convolution acceleration array is designed, which uses a loop-tiling strategy to achieve multi-granularity parallelism across network layers of different scales. On this basis, a line-buffer optimization mechanism for the input buffer is further designed, which combines Direct Memory Access (DMA) and streaming data interfaces to alleviate the data-transfer bottleneck. Experiments show that the performance-per-watt of the proposed object detection system is 79× that of a CPU and 1.9× that of a GPU, and that it achieves higher accuracy and better performance than object detection systems proposed in previous work.
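The quantization step described in the abstract (converting trained floating-point weights into fixed-point numbers for convolution) can be sketched as follows. This is a minimal illustration, not the paper's actual scheme: the 16-bit word width, 8 fractional bits, and symmetric saturation used here are assumptions for the example.

```python
def quantize_to_fixed_point(weights, total_bits=16, frac_bits=8):
    """Convert floats to signed fixed-point integers (Q8.8 by default).

    Each value is scaled by 2**frac_bits, rounded, and saturated to the
    representable signed range -- a common generic scheme; the paper's
    exact bit widths are not specified here.
    """
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))          # most negative representable code
    hi = (1 << (total_bits - 1)) - 1       # most positive representable code
    return [max(lo, min(hi, round(w * scale))) for w in weights]

def dequantize(fixed_vals, frac_bits=8):
    """Map fixed-point integer codes back to approximate float values."""
    scale = 1 << frac_bits
    return [v / scale for v in fixed_vals]

weights = [0.50, -1.25, 0.0039]
codes = quantize_to_fixed_point(weights)
print(codes)               # [128, -320, 1]
print(dequantize(codes))   # [0.5, -1.25, 0.00390625]
```

On hardware, the convolution multiply-accumulate then operates entirely on these integer codes, with a single shift by the fractional bit count recovering the scale of the result.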
