GAN Zhiying, XU Dawen. Research on OCR model compression scheme for AIoT chips[J]. Microelectronics & Computer, 2022, 39(11): 110-117. DOI: 10.19304/J.ISSN1000-7180.2022.0241


Research on OCR model compression scheme for AIoT chips


     

Abstract: Deep learning-based OCR models typically consist of a CNN combined with an RNN/LSTM. They are computationally intensive and carry a large number of weight parameters, so inference on edge devices requires substantial computational resources to meet performance targets. General-purpose processors such as CPUs and GPUs cannot satisfy both the processing-speed and power requirements, and they are costly. With the spread of deep learning, neural processing units (NPUs) have become common in embedded and edge devices, offering the high-throughput compute needed for the matrix operations that dominate neural networks. Taking a CRNN-based OCR model as an example, this paper presents a compression scheme for AIoT chips: pruning and quantization are applied to reduce parameter redundancy and computational overhead while still yielding a compressed model with high accuracy and robustness, so that the model can be deployed on an NPU. Experimental results show that, after quantizing the pruned and fine-tuned model at 78% sparsity, accuracy drops by no more than 3% and the model size shrinks from 15.87 MB to 3.13 MB. Deployed on the NPU, the compressed model achieves latency speedups of 28.87x over the CPU implementation and 6.1x over the GPU implementation.
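
The abstract summarizes the compression pipeline: magnitude pruning of the CRNN to 78% sparsity, fine-tuning, then quantization before NPU deployment. As a rough illustration of that pipeline (not the authors' implementation), the PyTorch sketch below prunes the convolutional and fully connected weights of a toy CRNN by L1 magnitude and then applies post-training dynamic INT8 quantization to the LSTM and linear layers. The paper does not specify a framework; the TinyCRNN architecture, its layer sizes, and the choice of dynamic quantization are illustrative assumptions, and the fine-tuning step between pruning and quantization is omitted.

    # Minimal sketch (not the paper's code): magnitude pruning followed by
    # post-training quantization of a CRNN-style model in PyTorch.
    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    class TinyCRNN(nn.Module):
        """Toy CRNN: CNN feature extractor -> BiLSTM -> per-timestep classifier."""
        def __init__(self, num_classes=37):          # e.g. 36 characters + CTC blank (assumption)
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(512, num_classes)

        def forward(self, x):                        # x: (N, 1, 32, W) text-line image
            f = self.cnn(x)                          # (N, 128, 8, W/4)
            f = f.permute(0, 3, 1, 2).flatten(2)     # (N, W/4, 128*8): width becomes the time axis
            seq, _ = self.rnn(f)                     # (N, W/4, 512)
            return self.fc(seq)                      # per-timestep class logits (CTC-style)

    model = TinyCRNN().eval()

    # 1) Unstructured L1-magnitude pruning of conv and linear weights to ~78% sparsity.
    #    (The paper fine-tunes the pruned model afterwards; training is omitted here.)
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            prune.l1_unstructured(m, name="weight", amount=0.78)
            prune.remove(m, "weight")                # bake the sparsity mask into the weight tensor

    # 2) Post-training dynamic INT8 quantization of the LSTM and Linear layers.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.LSTM, nn.Linear}, dtype=torch.qint8
    )

    dummy = torch.randn(1, 1, 32, 100)               # a 32-pixel-high text-line image
    print(quantized(dummy).shape)                    # torch.Size([1, 25, 37])

The sketch stops at the framework level; an actual deployment would convert the pruned and quantized model with the target AIoT chip's NPU toolchain, which is where the reported latency gains over CPU and GPU are measured.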

     

