Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure

IF 1.5 | Zone 3, Computer Science | JCR Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | ACM Transactions on Architecture and Code Optimization | Pub Date: 2023-06-18 | DOI: 10.1145/3605149
Jiazhi Jiang, Zijiang Huang, Dan-E Huang, Jiangsu Du, Lin Chen, Ziguan Chen, Yutong Lu
Citations: 0

Abstract

The tremendous success of convolutional neural networks (CNNs) has made them ubiquitous in many fields of human endeavor. Many applications, such as biomedical analysis and scientific data analysis, involve analyzing volumetric data, which creates a huge demand for 3D-CNNs. Although accelerators such as GPUs may provide higher throughput for deep learning applications, they may not be available in all scenarios. CPUs, especially many-core CPUs with a non-uniform memory access (NUMA) architecture, remain an attractive choice for deep learning inference in many scenarios. In this article, we propose a distributed inference solution for 3D-CNNs that targets the emerging ARM many-core CPU platform. A hierarchical partition approach is proposed to accelerate 3D-CNN inference by exploiting the memory and cache characteristics of ARM many-core CPUs. Building on this hierarchical model partition approach, further optimization techniques, such as NUMA-aware thread scheduling and optimization of 3D-img2row convolution, are designed to exploit the potential of ARM many-core CPUs for 3D-CNNs. We evaluate the proposed inference solution with several classic 3D-CNNs: C3D, 3D-resnet34, 3D-resnet50, 3D-vgg11, and P3D. Our experimental results show that the solution boosts the performance of 3D-CNN inference and achieves much better scalability, with a negligible fluctuation in accuracy. When applied to the ACL library, it outperforms the naive ACL implementation by 11× to 50× on an ARM many-core processor. When applied to the NCNN library, it outperforms the naive NCNN implementation by 5.2× to 14.2× on an ARM many-core processor.
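
To make two of the ideas named in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a single-channel input volume, stride 1, and no padding; lowers a 3D convolution to a row matrix (the general img2row idea); and then splits the output channels into independent groups as one simple form of model partitioning. The function names (im2row_3d, conv3d_rows, conv3d_partitioned) and the choice to partition by output channel are assumptions made for illustration; the abstract does not specify the paper's actual partition scheme or kernel layout.

```python
from itertools import product


def im2row_3d(x, k):
    """Lower a single-channel volume x[D][H][W] into rows, one per output voxel.

    Each row holds the k*k*k input values covered by one kernel position
    (stride 1, no padding), so the convolution becomes a matrix product.
    """
    d, h, w = len(x), len(x[0]), len(x[0][0])
    rows = []
    for z, y, c in product(range(d - k + 1), range(h - k + 1), range(w - k + 1)):
        rows.append([x[z + dz][y + dy][c + dx]
                     for dz, dy, dx in product(range(k), repeat=3)])
    return rows


def conv3d_rows(rows, filters):
    """Multiply the row matrix by flattened filters: out[voxel][out_channel]."""
    return [[sum(a * b for a, b in zip(row, f)) for f in filters] for row in rows]


def conv3d_partitioned(rows, filters, parts):
    """Hypothetical partition: split output channels into `parts` contiguous
    groups, compute each group independently (e.g. one group per NUMA node),
    then concatenate the per-voxel results in the original channel order."""
    n = len(filters)
    bounds = [n * i // parts for i in range(parts + 1)]
    chunks = [conv3d_rows(rows, filters[bounds[i]:bounds[i + 1]])
              for i in range(parts)]
    return [sum((chunk[v] for chunk in chunks), []) for v in range(len(rows))]


if __name__ == "__main__":
    # 3x3x3 volume with values 0..26 and four flattened 2x2x2 filters.
    volume = [[[9 * z + 3 * y + c for c in range(3)] for y in range(3)]
              for z in range(3)]
    filters = [[1] * 8, [1, 0, 0, 0, 0, 0, 0, 0], [0] * 7 + [1], [2] * 8]
    rows = im2row_3d(volume, 2)
    full = conv3d_rows(rows, filters)
    split = conv3d_partitioned(rows, filters, 2)
    print(full == split)
```

Because each group of output channels reads the same rows but writes a disjoint slice of the output, the groups need no synchronization until the final concatenation. That independence is what makes this kind of split a natural fit for per-node work assignment on a NUMA machine, although real implementations would operate on contiguous buffers rather than nested lists.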
Source journal
ACM Transactions on Architecture and Code Optimization (Engineering & Technology, Computer Science: Theory & Methods)
CiteScore: 3.60
Self-citation rate: 6.20%
Articles published: 78
Review time: 6-12 weeks
Journal description: ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.