
2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid): Latest Publications

Layercake: Efficient Inference Serving with Cloud and Mobile Resources
Samuel S. Ogden, Tian Guo
Many mobile applications now integrate deep learning models into their core functionality. These functionalities have diverse latency requirements while demanding high-accuracy results. Currently, mobile applications statically decide to use either in-cloud inference, relying on a fast and consistent network, or on-device execution, relying on sufficient local resources. However, neither mobile networks nor computation resources deliver consistent performance in practice. Consequently, when inference execution decisions are not made dynamically, mobile inference often experiences variable performance or struggles to meet performance goals. In this paper, we introduce Layercake, a deep-learning inference framework that dynamically selects the best model and location for executing inferences. Layercake accomplishes this by tracking model state and availability, both locally and remotely, as well as network bandwidth, allowing for accurate estimates of model response time. By doing so, Layercake achieves latency targets in up to 96.4% of cases, an improvement of 16.7% over similar systems, while cutting the cost of cloud-based resources by over 68.33% relative to in-cloud inference.
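To make the selection logic concrete, here is a minimal sketch of the idea as the abstract describes it: estimate each candidate's response time from tracked model state and measured bandwidth, then pick the most accurate candidate predicted to meet the latency target. All class names, fields, and numbers are our own illustration, not Layercake's actual code.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str            # e.g. "mobilenet@device" or "resnet50@cloud"
    location: str        # "device" or "cloud"
    model_loaded: bool   # already resident in memory?
    load_time_s: float   # cost to load the model if not resident
    compute_time_s: float
    payload_mb: float    # input shipped over the network (0 for on-device)
    accuracy: float

def estimate_latency(c: Candidate, bandwidth_mbps: float) -> float:
    """Response-time estimate = optional model load + transfer + compute."""
    load = 0.0 if c.model_loaded else c.load_time_s
    transfer = 0.0 if c.location == "device" else c.payload_mb * 8 / bandwidth_mbps
    return load + transfer + c.compute_time_s

def pick(candidates, slo_s: float, bandwidth_mbps: float) -> Candidate:
    """Most accurate candidate predicted to meet the SLO, else the fastest."""
    feasible = [c for c in candidates if estimate_latency(c, bandwidth_mbps) <= slo_s]
    if feasible:
        return max(feasible, key=lambda c: c.accuracy)
    return min(candidates, key=lambda c: estimate_latency(c, bandwidth_mbps))
```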
Citations: 0
Mixed Precision Based Parallel Optimization of Tensor Mathematical Operations on a New-generation Sunway Processor
Shuwei Fan, Yao Liu, Juliang Su, Xianyou Wu, Qiong Jiang
As an important component of high-performance computing (HPC) applications, tensor mathematical operations have a broad and significant impact on application performance. However, given the unique heterogeneous architecture and software environment of the new-generation Sunway processors, it is challenging to fully exploit the processor's computing capacity for tensor mathematical operations. Existing research has not fully considered the computational characteristics of tensor mathematical operations or the hardware features of the new-generation Sunway processor. In this paper, we propose an optimization method for tensor mathematical operations on the new-generation Sunway processor. First, we propose an optimization method for elementary functions that implements high-performance vector elementary functions with variable precision. Second, we propose a mixed-precision optimization method that evaluates expressions with variable precision according to users' accuracy requirements. Finally, we propose a multi-level parallel optimization method that realizes asynchronous parallelism between the master core and the slave cores. Experimental results show that, compared with the native implementation, the optimized tensor mathematical operations achieve an average speedup of 112.19× on 64 cores, which exceeds the theoretical speedup.
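The second contribution, precision selection driven by user accuracy requirements, can be sketched as follows. This is only an illustration of the concept in NumPy (the paper targets Sunway vector units, not NumPy), and checking against a float64 reference is a stand-in: a real system would consult precomputed error models rather than evaluate twice.

```python
import numpy as np

def eval_expr(x: np.ndarray, rel_tol: float) -> np.ndarray:
    """Evaluate exp(x)*sin(x) in the cheapest precision whose relative
    error stays within the user's tolerance. For illustration the error
    is measured against a float64 reference; production code would use
    precomputed error bounds instead of evaluating twice."""
    ref = np.exp(x.astype(np.float64)) * np.sin(x.astype(np.float64))
    for dtype in (np.float16, np.float32, np.float64):
        xt = x.astype(dtype)
        out = np.exp(xt) * np.sin(xt)
        err = np.max(np.abs(out.astype(np.float64) - ref) / (np.abs(ref) + 1e-300))
        if err <= rel_tol:
            return out   # cheapest dtype meeting the requirement
    return ref

# e.g. eval_expr(np.linspace(0.1, 1.0, 8), rel_tol=1e-3) typically selects float32
```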
Citations: 0
KalpaVriksh: Efficient and Cost-effective GUI Application Hosting using Singleton Snapshots
Sumaiya Shaikh, Saurabh Kumar, Debadatta Mishra
Hosting popular GUI applications in separate virtual machines (VMs) in a cloud can provide strong intra-application isolation and enhance the security of end-user devices. In this context, micro-VMs are a very good fit, with different applications hosted in different micro-VMs in the cloud. However, one challenge for the cloud service provider is launching an application quickly when a client requests it. As many existing research works show, techniques like VM snapshots can be used to improve application launch time. In this paper, we argue that GUI applications differ from snapshot-optimized cloud services such as FaaS because GUI applications are stateful and require specialized techniques for snapshot management. To manage application snapshots in a memory-efficient manner, the proposed KalpaVriksh framework maintains a single snapshot from which multiple GUI applications are launched for different end users. Furthermore, the unified snapshot framework does not impact application launch time, thanks to intelligent snapshot creation procedures. Experimental analysis shows that the KalpaVriksh snapshot techniques, besides being memory-efficient, reach the farthest feasible point of snapshot capture (i.e., the first external communication) during application execution 4.9x faster than a normal application launch.
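The core idea, one shared snapshot serving many application launches, can be sketched roughly as below. Everything here is a stand-in: a real implementation would capture and resume micro-VM state (e.g., with copy-on-write memory), not Python dictionaries.

```python
import copy, time

class SingletonSnapshotHost:
    """One base snapshot, captured once up to the farthest safe point
    (the first external communication), shared by every launch."""

    def __init__(self, capture_fn):
        # capture_fn would warm up and checkpoint a guest; here it just
        # returns a dict standing in for the snapshot image.
        self.base = capture_fn()

    def launch(self, app: str, user: str) -> dict:
        # Stand-in for a copy-on-write restore of the single snapshot.
        vm = copy.deepcopy(self.base)
        vm.update(app=app, user=user, started_at=time.time())
        return vm   # a real system would resume a micro-VM here

# Usage with a hypothetical capture function:
host = SingletonSnapshotHost(lambda: {"kernel": "warm", "gui_stack": "loaded"})
vm1 = host.launch("editor", "alice")
vm2 = host.launch("browser", "bob")   # same base snapshot, two apps/users
```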
Citations: 0
HyQ: Hybrid I/O Queue Architecture for NVMe over Fabrics to Enable High-Performance Hardware Offloading
Yiquan Chen, Jinlong Chen, Yijing Wang, Yi Chen, Zhengxu Jin, Jiexiong Xu, Guoju Fang, Wenhai Lin, Chengkun Wei, Wenzhi Chen
NVMe over Fabrics (NVMe-oF) has been widely adopted as a remote storage protocol in cloud computing. The existing NVMe-oF software stack consumes a large number of CPU resources. Emerging devices, such as SmartNICs and DPUs, support hardware offloading of NVMe-oF to free these valuable CPU cores. However, NVMe-oF offloading capacity is always constrained by the limited hardware resources of these devices. Additionally, through thorough evaluations, we found that NVMe-oF inevitably suffers severe performance degradation on complex application I/O patterns when using hardware offloading. It is challenging to achieve high performance and fully utilize NVMe-oF offloading simultaneously. In this paper, we propose HyQ, a novel hybrid I/O queue architecture for NVMe-oF that achieves high performance while retaining the advantages of hardware offloading. HyQ allows hardware-offloaded and software (non-offloaded) queues to coexist, enabling dynamic dispatching of I/O requests to appropriate processing queues according to user-defined I/O scheduling policies. Additionally, HyQ provides a request scheduling framework to support customized schedulers that select appropriate queues for I/O requests. In our evaluation, HyQ achieves up to 1.91x IOPS and 8.36x bandwidth improvement over the original hardware offloading scheme.
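A rough sketch of the hybrid dispatch idea: requests flow through a pluggable policy that picks either the hardware-offloaded or the software queue set. The queue representation and the example policy are our assumptions, not HyQ's implementation.

```python
from dataclasses import dataclass

@dataclass
class IoRequest:
    size_kb: int
    op: str  # "read" or "write"

class HybridDispatcher:
    """Route each request to a hardware-offloaded or software queue,
    according to a user-supplied policy function."""

    def __init__(self, hw_queues, sw_queues, policy):
        self.hw_queues = hw_queues   # lists standing in for NVMe submission queues
        self.sw_queues = sw_queues
        self.policy = policy         # policy(req) -> True to offload

    def submit(self, req: IoRequest):
        queues = self.hw_queues if self.policy(req) else self.sw_queues
        q = min(queues, key=len)     # least-loaded queue of the chosen kind
        q.append(req)
        return q

# Hypothetical policy: small I/O takes the offloaded fast path, large or
# complex patterns stay on the software path that tolerates them better.
dispatcher = HybridDispatcher([[], []], [[], []], lambda r: r.size_kb <= 128)
dispatcher.submit(IoRequest(4, "read"))      # -> hardware queue
dispatcher.submit(IoRequest(1024, "write"))  # -> software queue
```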
Citations: 0
A Deep Learning Pipeline Parallel Optimization Method
Tiantian Lv, Lu Wu, Zhigang Zhao, Chunxiao Wang, Chuantao Li
In recent years, with the continuous development of artificial intelligence, deep learning algorithms have become increasingly complex, and the scale of model training keeps growing. The artificial intelligence platform in our computing network operating system project likewise involves large-scale model training. However, as data sets and models grow, traditional single-card training becomes very slow and its accuracy converges too slowly to meet computational needs. This has motivated pipeline-parallel systems such as GPipe and PipeDream. In this paper, an efficient pipeline-parallel training optimization method is proposed. In our approach, multiple computing nodes process small batches of data in parallel in a pipelined manner. Our work covers two main aspects. First, we design a weight buffering strategy that limits the number of weight versions generated to preserve model accuracy, together with a tensor compression mechanism that improves the transmission rate. Second, we propose a prefix-sum partition algorithm that balances the pipeline's partitioning and saves the memory of computing resources. Compared with several popular pipeline-parallel frameworks, the proposed method achieves roughly twice the training speed while saving about 30%-40% of memory usage.
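The prefix-sum partition idea lends itself to a compact sketch: given per-layer costs, compute prefix sums and cut where the cumulative cost first reaches each equal share. The paper's exact algorithm may differ; this shows only the underlying technique, with hypothetical costs.

```python
from bisect import bisect_left
from itertools import accumulate

def partition_stages(layer_costs, num_stages):
    """Split layers into contiguous pipeline stages of near-equal total
    cost: cut where the prefix sum first reaches each equal share."""
    prefix = list(accumulate(layer_costs))       # prefix[i] = cost of layers 0..i
    total, bounds, start = prefix[-1], [], 0
    for s in range(1, num_stages):
        target = total * s / num_stages          # ideal cumulative cost after stage s
        cut = bisect_left(prefix, target, lo=start)
        cut = min(cut, len(layer_costs) - (num_stages - s))  # keep >=1 layer per stage
        bounds.append(cut + 1)
        start = cut + 1
    splits = [0] + bounds + [len(layer_costs)]
    return [list(range(splits[i], splits[i + 1])) for i in range(num_stages)]

print(partition_stages([4, 1, 1, 2, 6, 2], 3))  # -> [[0, 1, 2], [3, 4], [5]]
```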
Citations: 0
EMPI: Enhanced Message Passing Interface in Modern C++
Majid Salimi Beni, Luigi Crisci, Biagio Cosenza
Message Passing Interface (MPI) is a well-known standard for programming distributed and HPC systems. While the community has continuously improved MPI to address the requirements of next-generation architectures and applications, its interface has not substantially evolved. In fact, MPI only provides interfaces for C and Fortran and does not support recent features of modern C++. Moreover, MPI programs are error-prone and subject to various syntactic and semantic errors. This paper introduces EMPI, an Enhanced Message Passing Interface based on modern C++, which maps directly onto the OpenMPI implementation and exploits modern C++ for safe and efficient distributed programming. EMPI proposes novel C++ RAII-based semantics and constant specialization to prevent error-prone code patterns such as parameter mismatches, and to reduce the overhead of handling multiple objects and the per-invocation time. Consequently, EMPI programs are safer: six out of nine well-known MPI error patterns cannot occur when EMPI semantics are used correctly. Experimental results on five microbenchmarks and two applications on a large-scale cluster with up to 1024 processes show that EMPI's performance is very close to native MPI and considerably faster than the MPL C++ interface.
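EMPI itself is a C++ interface, but the class of errors it targets is easy to see from any higher-level binding. Below is a sketch using mpi4py (our analogy, not EMPI's API): the buffer-style calls spell out datatype and count by hand, exactly the parameter-mismatch pattern EMPI's typed, RAII-managed objects rule out at compile time.

```python
# Run with: mpirun -n 2 python empi_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Buffer-style call: the C-like interface where datatype and count are
# the programmer's responsibility, so a sender/receiver mismatch is a
# silent semantic error.
buf = np.zeros(4, dtype=np.float64)
if rank == 0:
    buf[:] = np.arange(4)
    comm.Send([buf, MPI.DOUBLE], dest=1, tag=0)
elif rank == 1:
    comm.Recv([buf, MPI.DOUBLE], source=0, tag=0)  # type/count must match by hand

# High-level call: the object carries its own type and size -- the kind
# of guarantee EMPI encodes statically in C++ types.
if rank == 0:
    comm.send({"payload": [1, 2, 3]}, dest=1, tag=1)
elif rank == 1:
    obj = comm.recv(source=0, tag=1)
```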
Citations: 0
hsSpMV: A Heterogeneous and SPM-aggregated SpMV for SW26010-Pro many-core processor
J. Pan, Lei Xiao, Min Tian, Li Wang, Chaochao Yang, Renjiang Chen, Zenghui Ren, Anjun Liu, Guanghui Zhu
Sparse matrix-vector multiplication (SpMV) is a critical performance bottleneck for numerical simulation and artificial intelligence training. The new-generation Sunway supercomputer is China's advanced exascale supercomputer, and its SW26010-Pro many-core processor is a competitive candidate thanks to its attractive computational power for both numerical simulation and artificial intelligence training. In this paper, we propose a heterogeneous and SPM-aggregated SpMV kernel designed specifically for the SW26010-Pro many-core processor. To fully exploit the computational power of the SW26010-Pro and balance the load of each core group (CG) during computation, we employ an asynchronous computation workflow and propose an SPM-aggregated strategy and a vector adaptive mapping algorithm. In addition, we propose a two-level data partitioning scheme to achieve computational load balance. To improve memory access efficiency, we access memory directly via the DMA controller, replacing discrete memory accesses. With these optimizations, we achieve a 77.16x speedup over the original implementation. Our experimental results show that hsSpMV yields an average speedup of up to 3.82× over the SpMV kernel of the state-of-the-art Sunway math library xMath2.0.
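For readers unfamiliar with SpMV load balancing, here is a sketch of the two ingredients the abstract leans on: the CSR kernel itself and a partition that balances nonzeros across workers. It is a plain Python illustration of the general technique, not the SW26010-Pro kernel.

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """Plain CSR sparse matrix-vector product, y = A @ x."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        lo, hi = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[lo:hi], x[indices[lo:hi]])
    return y

def rows_per_worker(indptr, workers):
    """Assign contiguous row blocks so each worker gets a near-equal share
    of nonzeros; the prefix sums of nnz already live in indptr."""
    nnz = indptr[-1]
    cuts = [int(np.searchsorted(indptr, nnz * k / workers)) for k in range(1, workers)]
    bounds = [0] + cuts + [len(indptr) - 1]
    return [(bounds[i], bounds[i + 1]) for i in range(workers)]

# 2x3 example: A = [[1, 0, 2], [0, 3, 0]] in CSR form
indptr, indices, data = np.array([0, 2, 3]), np.array([0, 2, 1]), np.array([1., 2., 3.])
print(spmv_csr(indptr, indices, data, np.array([1., 1., 1.])))  # -> [3. 3.]
```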
Citations: 0
A Cloud-Fog Architecture for Video Analytics on Large Scale Camera Networks Using Semantic Scene Analysis
Kunal Jain, Kishan Sairam Adapa, Kunwar Grover, R. Sarvadevabhatla, Suresh Purini
This paper proposes a scalable distributed video analytics framework that can process thousands of video streams from sources such as CCTV cameras using semantic scene analysis. The main idea is to deploy deep learning pipelines on the fog nodes and generate semantic scene description records (SDRs) of the video feeds from the associated CCTV cameras. These SDRs, rather than video frames, are transmitted to the cloud, saving network bandwidth. Using the SDRs stored in the cloud database, we can answer many complex queries and perform rich video analytics with extremely low latency, without scanning and processing the video streams again for each query. The software architecture on the fog nodes allows new deep learning pipelines to be integrated dynamically into the existing system, thereby supporting novel analytics and queries. We demonstrate the effectiveness of the system by proposing a novel distributed algorithm for real-time vehicle pursuit. The proposed algorithm asks multiple spatio-temporal queries in an adaptive fashion to reduce query processing time and is robust to inaccuracies in the deployed deep learning pipelines and to camera failures.
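A sketch of why SDRs make queries cheap: once per-segment records exist, a spatio-temporal query reduces to filtering structured data, with no video decoding or inference. The record schema and query shape below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SDR:
    """Hypothetical semantic scene description record for one video segment."""
    camera_id: str
    t_start: float
    t_end: float
    objects: list  # e.g. [{"cls": "car", "color": "red"}]

def query(sdrs, cls, cameras, t0, t1):
    """Answer a spatio-temporal query from stored SDRs alone --
    no video frame is re-decoded or re-inferenced."""
    return [(r.camera_id, r.t_start, o)
            for r in sdrs
            if r.camera_id in cameras and r.t_start < t1 and r.t_end > t0
            for o in r.objects if o["cls"] == cls]

# e.g. query(records, "car", {"cam-7", "cam-9"}, t0=100.0, t1=160.0)
```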
Citations: 0
An Empirical Study of Container Image Configurations and Their Impact on Start Times
Martin Straesser, A. Bauer, Robert Leppich, N. Herbst, K. Chard, I. Foster, Samuel Kounev
A core selling point of application containers is their fast start times compared to other virtualization approaches like virtual machines. Predictable and fast container start times are crucial for improving and guaranteeing the performance of containerized cloud, serverless, and edge applications. While previous work has investigated container starts, there remains a lack of understanding of how start times may vary across container configurations. We address this shortcoming by presenting and analyzing a dataset of approximately 200,000 open-source Docker Hub images featuring different image configurations (e.g., image size and exposed ports). Leveraging this dataset, we investigate the start times of containers in two environments and identify the most influential features. Our experiments show that container start times can vary between hundreds of milliseconds and tens of seconds in the same environment. Moreover, we conclude that no single dominant configuration feature determines a container's start time, and hardware and software parameters must be considered together for an accurate assessment.
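A minimal way to reproduce this kind of measurement yourself (a sketch only; the image name and readiness criterion are placeholders, and studies like this one time finer-grained phases than a single wall-clock span):

```python
import subprocess, time

def container_start_time(image: str) -> float:
    """Wall-clock time from `docker run` until the container reports Running."""
    t0 = time.monotonic()
    cid = subprocess.check_output(
        ["docker", "run", "--rm", "-d", image, "sleep", "30"],
        text=True).strip()
    while True:
        state = subprocess.check_output(
            ["docker", "inspect", "-f", "{{.State.Running}}", cid],
            text=True).strip()
        if state == "true":
            break
        time.sleep(0.01)
    elapsed = time.monotonic() - t0
    subprocess.run(["docker", "rm", "-f", cid], capture_output=True)  # clean up
    return elapsed

# e.g. print(container_start_time("alpine:3.18"))
```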
Citations: 1
Scheduling DNN Inferencing on Edge and Cloud for Personalized UAV Fleets
Suman Raj, Harshil Gupta, Yogesh L. Simmhan
Drone fleets with onboard cameras, coupled with DNN inferencing models, can support diverse applications, from infrastructure monitoring to package delivery. Here, we propose using one or more "buddy" drones to help Visually Impaired People (VIPs) lead an active lifestyle. Video inferencing tasks from such drones are used to navigate the drone and alert the VIP to threats, and hence have strict execution deadlines. They can execute either on an accelerated edge, such as an Nvidia Jetson linked to the drone, or on a cloud INFerencing-as-a-Service (INFaaS). However, making this decision is challenging given the latency and cost trade-offs and the network variability of outdoor environments. We propose a deadline-driven heuristic to schedule a stream of diverse DNN inferencing tasks executing over video segments generated by multiple drones linked to an edge, with the option to execute on the cloud. We use strategies like task dropping, work stealing and migration, and dynamic adaptation to cloud variability to fully utilize the captive edge, with intelligent offloading to the cloud, maximizing the utility and the number of tasks completed. We evaluate our strategies using a setup that emulates a fleet of > 50 drones in city conditions supporting > 25 VIPs, with real DNN models executing on drone video streams using Jetson Nano edges and AWS Lambda cloud functions. Our strategy exhibits a task completion rate of up to 91%, up to 2.5× higher utility than the baselines, and 68% higher utility under network variability.
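The deadline-driven heuristic can be sketched as an earliest-deadline-first loop with a three-way choice per task: edge, cloud, or drop. The latency estimators and the one-shot slot model below are our simplification of the strategies the abstract lists, not the paper's scheduler.

```python
import heapq, time

def schedule(tasks, edge_latency, cloud_latency, edge_slots):
    """Earliest-deadline-first sketch: place a task on the captive edge if
    a slot is free and the deadline holds there, else offload to the cloud
    if that still meets the deadline, else drop it."""
    heap = [(t["deadline"], i, t) for i, t in enumerate(tasks)]
    heapq.heapify(heap)                      # order tasks by deadline
    done, dropped, used = [], [], 0
    now = time.monotonic()
    while heap:
        _, _, t = heapq.heappop(heap)
        slack = t["deadline"] - now
        if used < edge_slots and edge_latency(t) <= slack:
            used += 1                        # run on the edge accelerator
            done.append((t["id"], "edge"))
        elif cloud_latency(t) <= slack:
            done.append((t["id"], "cloud"))  # offload to INFaaS
        else:
            dropped.append(t["id"])          # cannot meet the deadline anywhere
    return done, dropped

# e.g. schedule([{"id": 1, "deadline": time.monotonic() + 0.2}],
#               lambda t: 0.05, lambda t: 0.15, edge_slots=4)
```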
Citations: 2