Mixed-trace-based simulation model and evaluation for AI processors
Abstract: In recent years, tightly coupled AI processors have received extensive attention in resource-constrained edge applications. However, early design-space exploration of the pipeline coupling between the main processor and the coprocessor must contend with shared hardware resources, complex and diverse data-path structures, and the heterogeneity of on-chip main-processor and coprocessor computation, which makes simulation and evaluation modeling of AI processors challenging. Targeting the structural characteristics of tightly coupled AI processors, this paper abstracts the hardware structure into a software simulation-model framework: by analyzing the basic hardware resources of the main processor and coprocessor, it decomposes the different instruction-controlled data paths and designs an AI-processor simulation model. The main processor and the AI coprocessor are handled by trace-driven simulation and analytical modeling, respectively; hybrid trace records with timestamps are introduced to collect component-access statistics and, combined with an analytical performance-evaluation algorithm, the performance of the AI processor is evaluated. Experimental results show that the hybrid-trace-based model and evaluation can effectively reproduce the actual execution results of AI computation and estimate key hardware performance parameters, including latency, energy, and power.
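The hybrid-trace idea in the abstract, a timestamped record of component accesses for the trace-driven main processor combined with analytical evaluation, can be sketched as follows. All names here (`TraceEvent`, the component labels) are illustrative assumptions, not the paper's actual implementation:

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical hybrid trace record: main-processor events come from
# trace-driven simulation, coprocessor events from an analytical model.
@dataclass
class TraceEvent:
    timestamp: int   # cycle at which the component is accessed
    component: str   # e.g. "ifetch", "alu", "l1_cache", "npu_mac_array"
    kind: str        # "compute" or "memory"

def component_access_counts(trace):
    """Aggregate per-component access statistics from the timestamped trace."""
    return Counter(e.component for e in trace)

def latency_cycles(trace):
    """The last timestamp in the trace bounds the simulated latency."""
    return max(e.timestamp for e in trace)

# A tiny hand-written trace mixing main-processor and coprocessor events.
trace = [
    TraceEvent(0, "ifetch", "memory"),
    TraceEvent(2, "alu", "compute"),
    TraceEvent(5, "l1_cache", "memory"),
    TraceEvent(9, "npu_mac_array", "compute"),
]
counts = component_access_counts(trace)
```

The access counts would feed an energy model (accesses × per-access energy), while the final timestamp feeds the latency estimate.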
Key words:
- RISC-V
- AI processor
- trace
- simulation model
- performance evaluation
Table 1. Hardware architecture mapping parameters

| Component | Configuration parameters |
| --- | --- |
| Main processor | Scalar/superscalar processor pipeline, including register configuration, pipeline structure, etc. |
| | Memory subsystem, including L1 cache and L2 cache configuration and replacement policy |
| | On-chip interconnect, including load-queue entries, store-queue entries, number of in-flight instructions, coherence protocol, DRAM type, etc. |
| Coprocessor | Compute parallelism |
| | Execution blocks |
| | Compute mode, compute size, mapping scheme |
| | Storage capacity |
| | Bandwidth |
Table 2. Network parameter table for the benchmark

| Network | Parameters | Operations |
| --- | --- | --- |
| yolov2-tiny | 11 M | 2,669 M |
| AlexNet | 60 M | 724 M |
| VGG16 | 138 M | 15 M |
| ResNet18 | 25 M | 1,942 M |
| ResNet50 | 25 M | 4,140 M |
Table 3. Network configuration parameters

| Parameter | Meaning |
| --- | --- |
| Layer information | Total number of layers, current layer type, current layer index |
| Input information | Input size, number of inputs |
| Output information | Output size, number of outputs |
| Kernel information | Kernel size, number of kernels, stride |
| Padding | Padding width |
| Normalization | Whether the current layer requires batch normalization (BN) |
| Input data bit width | Bit width of the input data |
| Pooling | Whether pooling is applied in the current layer |
Table 4. Experiment constraint parameter settings

| Parameter | Meaning | Experiment 1 | Experiment 2 |
| --- | --- | --- | --- |
| Freq | Clock frequency | 100 MHz | 100 MHz |
| Para | Parallelism | 64 | 128 |
| Bandwidth_total | Maximum bandwidth | 64 Gb/s | 64 Gb/s |
| DSP_total | Total number of on-chip DSPs | 2020 | 2020 |
| BRAM_total | On-chip memory capacity used | 26.5 Mb | 26.5 Mb |
Table 5. Evaluation model accuracy comparison

| Metric | Estimated | Measured (hardware) | Error |
| --- | --- | --- | --- |
| Latency | 333.99 ms | 362.8 ms | 7.94% |
| Energy | 2.33 J | 2.225 J | 4.94% |
| Power | 6.99 W | 6.147 W | 13.71% |
| Power efficiency | 2.63 GOPS/W | 2.76 GOPS/W | 4.6% |
Table 6. Hardware structure evaluation results for the ASIC

| Metric | VGG16 | AlexNet | ResNet18 | ResNet50 |
| --- | --- | --- | --- | --- |
| Workload (GOP) | 30.76 | 1.45 | 3.6 | 7.72 |
| Throughput (GOPS) | 131.81 | 76.04 | 113.82 | 92.51 |
| Latency (ms) | 233.37 | 19.07 | 31.63 | 83.45 |
| Power (W) | 0.411 | 0.308 | 0.419 | 0.223 |
| Energy (J) | 0.096 | 0.006 | 0.013 | 0.019 |
| Energy efficiency (GOPS/J) | 1374.2 | 12945.4 | 8587.9 | 4971.1 |
| Power efficiency (GOPS/W) | 240.9 | 246.9 | 271.11 | 415.12 |
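The derived rows of Table 6 follow from throughput = workload / latency, energy = power × latency, and the two efficiency ratios; a short sketch reproducing that arithmetic for the AlexNet column:

```python
def derived_metrics(ops_gop, latency_ms, power_w):
    """Throughput, energy, and efficiency ratios as tabulated in Table 6."""
    throughput_gops = ops_gop / (latency_ms / 1e3)
    energy_j = power_w * latency_ms / 1e3
    return {
        "throughput_gops": throughput_gops,          # GOP / s
        "energy_j": energy_j,                        # J
        "gops_per_j": throughput_gops / energy_j,    # energy efficiency
        "gops_per_w": throughput_gops / power_w,     # power efficiency
    }

# AlexNet column of Table 6: 1.45 GOP, 19.07 ms, 0.308 W.
m = derived_metrics(1.45, 19.07, 0.308)
```

Note the energy-efficiency ratio reproduces the tabulated 12945.4 GOPS/J only when energy is computed as power × latency before rounding; the rounded 0.006 J entry would give a coarser value.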