[1] XIE Zhihao, LI Guogang. Heterogeneous CNN Accelerator Based on Hardware-Software Co-Design[J]. Journal of Huaqiao University (Natural Science), 2025, (2): 209-216. [doi:10.11830/ISSN.1000-5013.202409017]

Heterogeneous CNN Accelerator Based on Hardware-Software Co-Design

Journal of Huaqiao University (Natural Science) [ISSN: 1000-5013 / CN: 35-1079/N]

Issue:
2025, No. 2
Pages:
209-216
Publication date:
2025-03-20

Article Info

Title:
Heterogeneous CNN Accelerator Based on Hardware-Software Co-Design
Article ID:
1000-5013(2025)02-0209-08
Author(s):
XIE Zhihao (1,2), LI Guogang (1,2)
1. School of Information Science and Engineering, Huaqiao University, Xiamen 361021, China; 2. Xiamen Key Laboratory of ASIC and Power Semiconductor System, Huaqiao University, Xiamen 361021, China
Keywords:
field programmable gate array (FPGA); hardware acceleration; hardware-software co-design; high-level synthesis
CLC number:
TP391.4
DOI:
10.11830/ISSN.1000-5013.202409017
Document code:
A
Abstract:
To address the challenge of deploying convolutional neural networks (CNNs) efficiently, a heterogeneous CNN accelerator based on hardware-software co-design is proposed and validated on the YOLOv4-tiny model. The heterogeneous system combines an advanced reduced instruction set machine (ARM) processor with a field programmable gate array (FPGA). Through high-level synthesis (HLS), the computational units that can execute in parallel are mapped to register transfer level (RTL) intellectual property (IP) cores on the FPGA. The ARM processor coordinates the system and schedules the IP cores to accelerate forward inference. The results show that the accelerator operates at 130 MHz with a power consumption of 2.809 W, an inference time of 511 ms, and a throughput of 13.40 GOPS. Compared with desktop graphics processing units (GPUs), central processing units (CPUs), and mainstream embedded AI acceleration platforms, the design achieves a favorable balance between inference speed and power consumption while significantly improving key performance indicators. The accelerator performs well in edge computing scenarios and meets the requirements of practical deployment.
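The four figures reported in the abstract (130 MHz, 2.809 W, 511 ms, 13.40 GOPS) can be combined into a few derived quantities that make cross-platform comparison easier. The sketch below is illustrative arithmetic only: the function name `derived_metrics` and the dictionary keys are hypothetical, the 511 ms figure is assumed to be the latency of a single forward pass, and none of the derived values are stated in the abstract itself.

```python
# Figures quoted in the abstract.
CLOCK_MHZ = 130.0         # accelerator clock frequency
POWER_W = 2.809           # reported power consumption
LATENCY_S = 0.511         # assumption: 511 ms is the latency of one inference
THROUGHPUT_GOPS = 13.40   # reported throughput


def derived_metrics(clock_mhz, power_w, latency_s, throughput_gops):
    """Combine the reported figures into derived quantities (illustrative only)."""
    return {
        # Inferences per second if inferences run back to back.
        "frames_per_second": 1.0 / latency_s,
        # Energy drawn during one inference: power times time.
        "energy_per_inference_j": power_w * latency_s,
        # Throughput normalized by power, a common efficiency measure.
        "efficiency_gops_per_w": throughput_gops / power_w,
        # Operations implied for one inference: throughput times latency.
        "ops_per_inference_gop": throughput_gops * latency_s,
        # Average operations completed per clock cycle.
        "ops_per_cycle": throughput_gops * 1e9 / (clock_mhz * 1e6),
    }


metrics = derived_metrics(CLOCK_MHZ, POWER_W, LATENCY_S, THROUGHPUT_GOPS)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the design works out to roughly 4.77 GOPS/W and about 1.44 J per inference. These values are computed from the reported figures and should be read as rough derived numbers, not as results from the paper.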


Memo
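Received: 2024-09-25
Corresponding author: LI Guogang (1973-), male, associate professor, Ph.D.; research interests: circuit systems and neural networks. E-mail: lgg@hqu.edu.cn.
Funding: National Natural Science Foundation of China (61370007); Fujian Provincial University Industry-University Cooperation Project (2023H6013)
https://hdxb.hqu.edu.cn/
Last Update: 2025-03-20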