
2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid): Latest Publications

Balancing Computation and Communication in Distributed Sparse Matrix-Vector Multiplication
Hongli Mi, Xiangrui Yu, Xiaosong Yu, Shuangyuan Wu, Weifeng Liu
Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in a number of scientific and engineering problems. When the sparse matrices processed are large enough, distributed memory systems should be used to accelerate SpMV. At present, optimization techniques for distributed SpMV mainly focus on reordering through graph or hypergraph partitioning. However, although reordering generally reduces communication volume, load balancing challenges in both computation and communication on distributed platforms remain poorly addressed. In this paper, we propose two strategies to optimize SpMV on distributed clusters: (1) resizing the row blocks on the nodes to balance the amount of computation, and (2) adjusting the column count of the diagonal blocks to balance tasks and reduce communication among compute nodes. The experimental results show that compared with the classic distributed SpMV implementation and its variant reordered with graph partitioning, our algorithm achieves on average 77.20x and 5.18x (up to 460.52x and 27.50x) speedups, respectively. Also, our method brings on average a 19.56x (up to 48.49x) speedup over a recently proposed hybrid distributed SpMV algorithm. In addition, our algorithm achieves clearly better scalability than these existing distributed SpMV methods.
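As a rough illustration of strategy (1), the following sketch greedily resizes contiguous row blocks so that each node receives a roughly equal share of nonzeros, the unit of SpMV work; the function name and the greedy heuristic are our own, not the paper's:

```python
import numpy as np
from scipy.sparse import random as sparse_random

def balanced_row_blocks(A, num_nodes):
    """Greedily split rows into num_nodes contiguous blocks with
    roughly equal nonzero counts (the per-node SpMV workload)."""
    nnz_per_row = np.diff(A.indptr)          # CSR: nonzeros in each row
    target = A.nnz / num_nodes
    boundaries, acc = [0], 0
    for row, nnz in enumerate(nnz_per_row):
        acc += nnz
        if acc >= target and len(boundaries) < num_nodes:
            boundaries.append(row + 1)
            acc = 0
    if boundaries[-1] != A.shape[0]:
        boundaries.append(A.shape[0])
    return list(zip(boundaries[:-1], boundaries[1:]))

A = sparse_random(1000, 1000, density=0.01, format="csr")
print(balanced_row_blocks(A, 4))             # per-node (start_row, end_row)
```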
Citations: 2
HDFL: A Heterogeneity and Client Dropout-Aware Federated Learning Framework
Syed Zawad, A. Anwar, Yi Zhou, N. Baracaldo, Feng Yan
Cross-device Federated Learning (FL) enables training machine learning (ML) models on private data that is heterogeneously distributed over many IoT end devices without violating privacy requirements. Clients typically vary significantly in data quality, hardware resources, and stability, which results in challenges such as increased training times, higher resource costs, sub-par model performance, and biased training. Existing works tend to address each of these challenges in isolation, but overlook how they impact each other holistically. We perform a first-of-its-kind characterization study that empirically demonstrates how these properties interact to affect important performance metrics such as model error, fairness, resource cost, and training time. Based on these observations, we propose HDFL, to our knowledge the first framework that comprehensively considers the aforementioned challenges of practical FL systems. We implement HDFL on a real distributed system and evaluate it on multiple benchmark datasets, showing that HDFL achieves a better Pareto frontier than both state-of-the-practice and state-of-the-art systems, with up to 4-10% better model accuracy, 33% improved good-intent fairness, 63% lower cost, and 17% faster training time.
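The abstract does not expose HDFL's internals; as a hedged toy of one ingredient such a framework needs, here is a dropout-aware FedAvg round that renormalizes aggregation weights over the clients that actually survived the round (the names and the weighting scheme are assumptions, not HDFL's design):

```python
import numpy as np

def aggregate_round(updates, sizes, dropped):
    """FedAvg over surviving clients only; weights are renormalized so
    that dropped clients do not silently bias the global update."""
    survivors = [i for i in range(len(updates)) if i not in dropped]
    total = sum(sizes[i] for i in survivors)
    return sum(updates[i] * (sizes[i] / total) for i in survivors)

updates = [np.full(3, v) for v in (1.0, 2.0, 3.0)]  # toy per-client updates
print(aggregate_round(updates, sizes=[100, 50, 200], dropped={1}))
```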
Citations: 0
PrivFlow: Secure and Privacy Preserving Serverless Workflows on Cloud
Surabhi Garg, Meena Singh Dilip Thakur, R. A, L. Maddali, Vigneswaran Ramachandran
The recent advancement of serverless computing and the widespread deployment of its applications prompt the need to protect serverless workflows against cloud vulnerabilities and threats. We propose PrivFlow, a workflow-centric, privacy-preserving framework that protects the information flow in serverless computing applications under semi-honest (S-PrivFlow) and malicious (M-PrivFlow) adversarial settings. An authenticated data structure is used to store the valid workflows encoded in the proposed format. Workflow validation is performed in a privacy-preserving manner that leaks no sensitive information to any unauthorized user. We focus on the two most prevalent attacks on serverless cloud platforms, namely Denial-of-Wallet and wrong function invocation attacks, and demonstrate that PrivFlow mitigates both. Further, we evaluate PrivFlow on the popular benchmark application Hello Retail and on a customized scaled application. Although comparison with state-of-the-art approaches shows a runtime latency of 1.6x for S-PrivFlow and 8x for M-PrivFlow, PrivFlow provides strong security and privacy guarantees. PrivFlow acts as a wrapper around the application, requiring no changes to the source code.
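The paper's workflow encoding is not given here; as a minimal sketch of checking a workflow against a committed set of valid encodings (a toy stand-in for the authenticated data structure, with the arrow encoding and function names purely hypothetical):

```python
import hashlib

def encode(workflow):
    """Encode a workflow as its ordered sequence of function names."""
    return "->".join(workflow).encode()

def commit(valid_workflows):
    """Digest over the sorted leaf hashes of all valid workflows,
    a simplified stand-in for an authenticated data structure."""
    leaves = {hashlib.sha256(encode(w)).hexdigest() for w in valid_workflows}
    root = hashlib.sha256("".join(sorted(leaves)).encode()).hexdigest()
    return root, leaves

root, leaves = commit([["auth", "checkout", "pay"], ["auth", "browse"]])
probe = hashlib.sha256(encode(["auth", "pay"])).hexdigest()
print("valid" if probe in leaves else "rejected: wrong function invocation")
```

A real deployment would verify membership proofs against the root rather than holding the full leaf set.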
Citations: 0
Overcoming Noisy Labels in Federated Learning Through Local Self-Guiding
Daokuan Bai, Shanshan Wang, Wenyue Wang, Hua Wang, Chuan Zhao, Peng Yuan, Zhenxiang Chen
Federated Learning (FL) is a privacy-preserving machine learning paradigm that enables clients such as Internet of Things (IoT) devices and smartphones to jointly train a high-performance global model. However, in real-world FL deployments, careful human annotation of labels is expensive and time-consuming, so the presence of incorrect (noisy) labels in the clients' local training data is inevitable, which degrades the performance of the global model. To tackle this problem, we propose a simple but effective method, Local Self-Guiding (LSG), that lets clients guide themselves during training in the presence of noisy labels. Specifically, LSG keeps the model from memorizing noisy labels by enhancing the confidence of model predictions. Meanwhile, it utilizes knowledge from local historical models that have not yet fit the noisy patterns to extract potential ground-truth labels for samples. To keep this knowledge without storing models, LSG records the exponential moving average (EMA) of the model's output logits at different local training epochs as self-ensemble logits on clients' devices, incurring negligible computation and storage overhead. Logit-based knowledge distillation is then conducted to guide the local training. Experiments on MNIST, Fashion-MNIST, CIFAR-10, and ImageNet-100 with multiple noise levels, as well as on an unbalanced noisy dataset, Clothing1M, demonstrate the resistance of LSG to noisy labels. The code of LSG is available at https://github.com/DaokuanBai/LSG-Main
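The EMA self-ensemble and distillation steps are concrete enough to sketch; below is our reading in PyTorch, with the loss weighting, temperature, and decay rate assumed rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def lsg_loss(logits, targets, ema_logits, alpha=0.9, lam=0.5, tau=2.0):
    """Cross-entropy plus distillation from the EMA self-ensemble logits.
    ema_logits is the per-sample running average kept on the client."""
    ema_logits.mul_(alpha).add_(logits.detach(), alpha=1 - alpha)
    ce = F.cross_entropy(logits, targets)
    kd = F.kl_div(F.log_softmax(logits / tau, dim=1),
                  F.softmax(ema_logits / tau, dim=1),
                  reduction="batchmean") * tau * tau
    return ce + lam * kd

logits = torch.randn(8, 10, requires_grad=True)   # one batch, 10 classes
ema = torch.zeros(8, 10)                          # stored across epochs
lsg_loss(logits, torch.randint(0, 10, (8,)), ema).backward()
```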
Citations: 0
WiDual: User Identified Gesture Recognition Using Commercial WiFi
Miaoling Dai, Chenhong Cao, Tong Liu, Meijia Su, Yufeng Li, Jiangtao Li
WiFi-based human gesture recognition has recently enjoyed increasing popularity in Internet of Things (IoT) scenarios. Simultaneously recognizing user identities and user gestures is of great importance for enhancing system security and user quality of experience (QoE). State-of-the-art approaches that perform both tasks suffer from increased latency or degraded accuracy in cross-domain scenarios. In this paper, we present WiDual, a dual-task system that achieves accurate cross-domain gesture recognition and user identification over WiFi in real time. The basic idea of WiDual is to use an attention mechanism to adaptively explore cross-domain features worth attending to for the dual tasks. WiDual employs a CSI (Channel State Information) visualization method that transforms WiFi signals into images for further feature extraction and model training. In this way, WiDual mitigates the loss of useful information and the excessive delays caused by extracting handcrafted features directly from the WiFi signal. Furthermore, WiDual utilizes a collaboration module that combines gesture features and user identity features to enhance dual-task recognition performance. We implement WiDual and evaluate it extensively on a public dataset comprising 6 gestures performed by 6 users across domains. Results show that WiDual outperforms state-of-the-art approaches, with 26% and 8% improvements in the accuracy of cross-domain user identification and gesture recognition, respectively.
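The architecture is only outlined in the abstract; a tiny hedged sketch of shared attention pooling feeding two task heads might look as follows (all dimensions and layer choices are our assumptions):

```python
import torch
import torch.nn as nn

class DualHead(nn.Module):
    """Attention pooling over per-frame CSI-image features, followed by
    separate gesture and identity heads (a toy stand-in for WiDual)."""
    def __init__(self, dim=64, gestures=6, users=6):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # attention score per time step
        self.gesture_head = nn.Linear(dim, gestures)
        self.user_head = nn.Linear(dim, users)

    def forward(self, feats):                 # feats: (batch, time, dim)
        attn = torch.softmax(self.score(feats), dim=1)
        pooled = (attn * feats).sum(dim=1)    # attention-weighted average
        return self.gesture_head(pooled), self.user_head(pooled)

g, u = DualHead()(torch.randn(2, 50, 64))
print(g.shape, u.shape)                       # (2, 6) and (2, 6)
```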
Citations: 0
ScaMP: Scalable Meta-Parallelism for Deep Learning Search
Quentin G. Anthony, Lang Xu, A. Shafi, H. Subramoni, Dhabaleswar K. Panda
Deep Learning (DL) models are growing exponentially and require increasingly powerful High Performance Computing (HPC) systems to train them. Achieving state-of-the-art results requires carefully tuning the DL model architecture and training settings, a time-consuming process commonly relegated to distributed search frameworks and trial-and-error. However, search frameworks do not provide a flexible parallelism scheme within and among DL frameworks for modern out-of-core DL models. In this paper, we propose Scalable Meta-Parallelism for Deep Learning Search (ScaMP): a distributed Hyperparameter Optimization (HPO) and Neural Architecture Search (NAS) framework that supports out-of-core models with flexible parallelism schemes. ScaMP is integrated into the modern DL ecosystem and enables both efficient parallel training of concurrent candidate architectures and aggregate device memory saturation via a powerful load balancing engine. ScaMP estimates the memory requirements of each candidate architecture and automatically applies the appropriate model-parallel degree and maximum supported batch size for the given candidate. Further, HPO and NAS with ScaMP are highly customizable via flexible configuration options. We evaluate the benefits of our designs on synthetic training benchmarks and in training a state-of-the-art vision transformer model. We select transformers as a candidate DL model type and demonstrate a 29% improvement in end-to-end HPO time on 32 V100 GPUs on the Lassen and ThetaGPU HPC systems. Further, we demonstrate a reduction in the proportion of NAS time spent in communication from 28% to 15%. Finally, we thoroughly verify the correctness of ScaMP by training a state-of-the-art SwinIR model.
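A hedged sketch of the memory-driven planning step: estimate a candidate's static footprint per model-parallel degree and pick the smallest degree, and largest batch size, that fit device memory. The 4x optimizer-state factor and the linear activation model are assumptions, not ScaMP's actual estimator:

```python
def plan_candidate(param_bytes, act_bytes_per_sample, device_mem, max_degree=8):
    """Return (model_parallel_degree, max_batch_size) for one candidate,
    assuming weights plus optimizer state (~4x params) shard evenly."""
    for degree in range(1, max_degree + 1):
        static = 4 * param_bytes / degree
        free = device_mem - static
        if free <= 0:
            continue                          # does not fit at this degree
        batch = int(free // act_bytes_per_sample)
        if batch >= 1:
            return degree, batch
    raise ValueError("candidate does not fit even at max_degree")

print(plan_candidate(param_bytes=10e9, act_bytes_per_sample=0.5e9,
                     device_mem=16e9))        # -> (3, 5) under these numbers
```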
Citations: 0
CUDAsap: Statically-Determined Execution Statistics as Alternative to Execution-Based Profiling
Yannick Emonds, Lorenz Braun, H. Fröning
Today a variety of different GPU types exists, raising questions regarding high-level tasks such as provisioning and scheduling. To predict execution time on different GPU types accurately, we propose a method to obtain execution statistics based on compile-time static code analysis, in which the control flow graph for the code's basic blocks is determined. This graph is represented as an adjacency matrix and used in a system of linear equations to calculate the basic block execution frequencies. Kernel execution itself is not necessary for this analysis. We analyze the proposed method for five different benchmark suites, showing that 76 out of 79 evaluated kernels can be analyzed with an average error of 0.4 %, primarily due to different LLVM versions, with an average prediction time of 203.96 ms. Furthermore, repetitive kernels make memoization effective, and the underlying analysis is largely independent of problem size.
Citations: 0
COUNSEL: Cloud Resource Configuration Management using Deep Reinforcement Learning
Adithya Hegde, Sameer G. Kulkarni, Abhinandan S. Prasad
Internet Clouds are essentially service factories that offer various networked services through different service models, viz., Infrastructure, Platform, Software, and Functions as a Service. Meeting the desired service level objectives (SLOs) while ensuring efficient resource utilization requires significant efforts to provision the associated cloud resources correctly and on time. Therefore, one of the critical issues for any cloud service provider is resource configuration management. On one end, i.e., from the cloud operator's perspective, resource management affects overall resource utilization and efficiency. In contrast, from the cloud user/customer perspective, resource configuration affects the performance, cost, and offered SLOs. However, the state-of-the-art solutions for finding the configurations are limited to a single component or handle static workloads. Further, these solutions are computationally expensive and introduce profiling overhead, limiting scalability. Therefore, we propose COUNSEL, a deep reinforcement learning-based framework to handle the dynamic workloads and efficiently manage the configurations of an arbitrary multi-component service. We evaluate COUNSEL with three initial policies: over-provisioning, under-provisioning, and expert provisioning. In all the cases, COUNSEL eliminates the profiling overhead and achieves the average reward between 20 - 60% without violating the SLOs and budget constraints. Moreover, the inference time of COUNSEL has a constant time complexity.
Citations: 0
RoUD: Scalable RDMA over UD in Lossy Data Center Networks
Zhiqiang He, Yuxin Chen, Bei Hua
Remote direct memory access (RDMA) has been widely deployed in data centers because it offers lower latency and higher throughput than the kernel TCP/IP stack. However, RDMA still faces scalability problems, including connection scalability and network scalability issues. In this paper, we present RoUD, a userspace network stack that leverages the unreliable datagram (UD) transport mode of RDMA to improve connection scalability. RoUD also eliminates the dependency on PFC in data center networks, thereby enhancing network scalability. RoUD implements three performance optimizations in the userspace network stack and introduces two types of flow control to prevent packet loss on the host while maintaining high performance. We built a prototype of RoUD based on the standard InfiniBand Verbs library. Evaluation results on a testbed with 100 Gbps RNICs show that in the case of large-scale connections its throughput is 1.4x better than the widely used reliable connection (RC) transport.
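The abstract does not specify the two flow-control mechanisms; one plausible shape is sender-side credit-based flow control, sketched below as a toy model, where the receiver returns credits as it reposts receive buffers so its UD queue cannot overflow (entirely illustrative):

```python
class CreditSender:
    """Transmit only while credits remain; excess packets wait in a
    backlog instead of being dropped by an overrun receiver queue."""
    def __init__(self, credits):
        self.credits = credits
        self.backlog = []

    def send(self, pkt, transmit):
        if self.credits > 0:
            self.credits -= 1
            transmit(pkt)
        else:
            self.backlog.append(pkt)

    def on_credit_return(self, n, transmit):
        self.credits += n                     # receiver reposted n buffers
        while self.credits > 0 and self.backlog:
            self.send(self.backlog.pop(0), transmit)

s = CreditSender(credits=2)
for i in range(4):
    s.send(i, transmit=print)                 # prints 0, 1; queues 2, 3
s.on_credit_return(2, transmit=print)         # prints 2, 3
```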
Citations: 0
Congestion Minimization using Fog-deployed DRL-Agent Feedback enabled Traffic Light Cooperative Framework
Anuj Sachan, Nisha Singh Chauhan, Neetesh Kumar
Congestion at signalized intersections can be alleviated by improving the performance of traffic signal control systems. In this context, Deep Reinforcement Learning (DRL) methods are gaining increasing attention for collaborative traffic signal control in vehicular networks to improve traffic flow. However, existing collaborative methods, built on top of the traditional client-server architecture, fail to account for the influence of neighbouring intersections' traffic when operating at a particular junction. To address this, a Fog-integrated DRL-based Smart Traffic Light Controller (STLC) cooperative framework is proposed, using TCP/IP-based communication among the Fog node, Road Side Cameras (RSCs), and STLCs at the edge. The significant contributions of this work are: (1) a Fog node-integrated DRL agent that minimizes average waiting time and queue length at the intersection by generating the Cycle Phase Duration (CPD) for the STLC via appropriate coordination among neighboring intersections; (2) a max-pressure based algorithm for the STLC at the edge that uses the Fog-generated CPD as feedback to reduce congestion at the intersection; (3) a performance analysis of the proposed framework on Indian city OpenStreetMaps using the Simulation of Urban MObility (SUMO) simulator with varying vehicle arrival rates. The results demonstrate the effectiveness of the method over comparable state-of-the-art methods.
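Contribution (2)'s max-pressure rule is standard enough to sketch: actuate the phase whose enabled movements maximize the total queue difference between incoming and outgoing lanes (the lane and phase layout below is assumed):

```python
def max_pressure_phase(phases, queue):
    """phases: phase -> list of (in_lane, out_lane) movements it enables;
    queue: lane -> queue length. Returns the phase with maximum pressure."""
    def pressure(phase):
        return sum(queue[i] - queue[o] for i, o in phases[phase])
    return max(phases, key=pressure)

phases = {"NS": [("n_in", "s_out"), ("s_in", "n_out")],
          "EW": [("e_in", "w_out"), ("w_in", "e_out")]}
queue = {"n_in": 12, "s_in": 8, "e_in": 3, "w_in": 5,
         "n_out": 2, "s_out": 1, "e_out": 0, "w_out": 4}
print(max_pressure_phase(phases, queue))      # "NS": pressure 17 vs 4
```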
Citations: 0