IEEE Transactions on Parallel and Distributed Systems: Latest Articles

SC-CGRA: An Energy-Efficient CGRA Using Stochastic Computing
IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-03 | DOI: 10.1109/TPDS.2024.3453310
Di Mou;Bo Wang;Dajiang Liu
Stochastic Computing (SC) offers a promising computing paradigm for low-power and cost-effective applications, with the added advantage of high error tolerance. In parallel, Coarse-Grained Reconfigurable Arrays (CGRAs) have proven to be a highly promising platform for domain-specific applications because they combine energy efficiency with flexibility. Intuitively, introducing SC into CGRAs would reinforce the strengths of both paradigms. However, existing SC-based architectures often suffer from inherent computation errors, and the stochastic number generators (SNGs) used in SC incur exponentially growing latency, which is unacceptable in a CGRA. In this work, we propose an SC-based CGRA that replaces the exact multiplication of a traditional CGRA with SC-based multiplication. To improve the accuracy of SC and shorten SNG latency, we introduce leading zero shifting and comparator truncation while keeping the bitstream length fixed. In addition, exploiting the flexible interconnections among processing elements (PEs), we propose a quality scaling strategy that combines neighboring PEs to achieve high-accuracy operations without switching costs such as those of power gating. Compared with the state-of-the-art approximate-computing CGRA design, the proposed CGRA achieves, on average, a 65.3% reduction in output error, a 21.2% reduction in energy consumption, and a 28.37% area saving.
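As a rough illustration of the SC-based multiplication mentioned in the abstract, the sketch below shows the textbook unipolar scheme: each operand in [0, 1] is encoded as a random bitstream whose fraction of ones equals its value, and a bitwise AND of two independent streams approximates the product. This is not the paper's design. Leading zero shifting, comparator truncation, and quality scaling are not modeled, and the function names, the software random number generator, and the default stream length are assumptions made for illustration.

```python
import random


def to_bitstream(p, length, rng):
    """Encode a value p in [0, 1] as a unipolar stochastic bitstream."""
    # Each bit is 1 with probability p (a software stand-in for an SNG).
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(a, b, length=1024, seed=0):
    """Approximate a * b by AND-ing two independent bitstreams."""
    rng = random.Random(seed)  # hypothetical seed, fixed for repeatability
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    # The AND gate outputs 1 only when both input bits are 1.
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / length
```

The estimate is noisy, and its error shrinks only as the stream gets longer. That trade-off between accuracy and stream length is the latency concern the abstract raises.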
Citations: 0
Towards Efficient Graph Processing in Geo-Distributed Data Centers
IF 5.3 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-03 | DOI: 10.1109/tpds.2024.3453872
Feng Yao, Qian Tao, Shengyuan Lin, Yanfeng Zhang, Wenyuan Yu, Shufeng Gong, Qiange Wang, Ge Yu, Jingren Zhou
Citations: 0
CODE$^{+}$: Fast and Accurate Inference for Compact Distributed IoT Data Collection
IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-03 | DOI: 10.1109/TPDS.2024.3453607
Huali Lu;Feng Lyu;Ju Ren;Huaqing Wu;Conghao Zhou;Zhongyuan Liu;Yaoxue Zhang;Xuemin Shen
In distributed IoT data systems, full-size data collection is impractical because of energy constraints and large system scales. Our previous work investigated the advantages of integrating matrix sampling and inference for compact distributed IoT data collection, minimizing the data collection cost while guaranteeing the data benefits. This paper advances that approach by enabling fast and accurate inference for distributed IoT data systems that are sensitive to computation time, training stability, and inference accuracy. In particular, we propose CODE$^{+}$, i.e., Compact Distributed IOT Data CollEction Plus, which features a cluster-based sampling module and a Convolutional Neural Network (CNN)-Transformer Autoencoders-based inference module to reduce cost and guarantee the data benefits. The sampling component employs a cluster-based matrix sampling approach in which data clustering is conducted first and a two-step sampling is then performed according to the number of clusters and the clustering errors. The inference component integrates a CNN-Transformer Autoencoders-based matrix inference model to estimate the full-size spatio-temporal data matrix; the model consists of a CNN-Transformer encoder that extracts the underlying features from the sampled data matrix and a lightweight decoder that maps the learned latent features back to the original full-size data matrix. We implement CODE$^{+}$ on three operational large-scale IoT systems and one synthetic Gaussian-distribution dataset, and extensive experiments demonstrate its efficiency and robustness. With a 20% sampling ratio, CODE$^{+}$ achieves an average data reconstruction accuracy of 94% across the four datasets, outperforming our previous version (87%) and the state-of-the-art baseline (71%).
Citations: 0
Exploring the Design Space of Distributed Parallel Sparse Matrix-Multiple Vector Multiplication
IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-30 | DOI: 10.1109/TPDS.2024.3452478
Hua Huang;Edmond Chow
We consider the distributed memory parallel multiplication of a sparse matrix by a dense matrix (SpMM). The dense matrix is often a collection of dense vectors. Standard implementations will multiply the sparse matrix by multiple dense vectors at the same time, to exploit the computational efficiencies therein. But such approaches generally utilize the same sparse matrix partitioning as if multiplying by a single vector. This article explores the design space of parallelizing SpMM and shows that a coarser grain partitioning of the matrix combined with a column-wise partitioning of the block of vectors can often require less communication volume and achieve higher SpMM performance. An algorithm is presented that chooses a process grid geometry for a given number of processes to optimize the performance of parallel SpMM. The algorithm can augment existing graph partitioners by utilizing the additional concurrency available when multiplying by multiple dense vectors to further reduce communication.
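To make the partitioning idea concrete, the sketch below multiplies a sparse matrix by a block of dense vectors on a hypothetical `pr x pc` process grid: the matrix rows are split into `pr` blocks and the vector columns into `pc` blocks. It also counts communication with a toy cost model that is an assumption of this example, not the article's model: a process is taken to own the rows of the dense block that match its matrix row block, and every other row it touches counts as fetched. The function names and the dictionary-based sparse format are illustrative only, and the grid-selection algorithm from the article is not reproduced.

```python
def block_ranges(n, parts):
    """Split range(n) into `parts` contiguous, nearly equal blocks."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for k in range(parts):
        stop = start + base + (1 if k < extra else 0)
        ranges.append(range(start, stop))
        start = stop
    return ranges


def spmm_on_grid(A, B, pr, pc):
    """Compute C = A @ B on a simulated pr x pc grid.

    A is a square sparse matrix stored as {(row, col): value};
    B is a dense block of vectors stored as a list of rows.
    Returns (C, volume), where volume is the toy count of fetched B entries.
    """
    m, n = len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    volume = 0
    for rows in block_ranges(m, pr):
        # Nonzeros of A that belong to this row block.
        local = [(i, j, v) for (i, j), v in A.items() if i in rows]
        # Rows of B needed by this block but assumed to live elsewhere.
        remote_rows = {j for _, j, _ in local if j not in rows}
        for cols in block_ranges(n, pc):
            # Each process fetches the remote rows, but only its own columns.
            volume += len(remote_rows) * len(cols)
            for i, j, v in local:
                for c in cols:
                    C[i][c] += v * B[j][c]
    return C, volume
```

Calling `spmm_on_grid` with the same process count arranged as `(4, 1)` and as `(2, 2)` returns the same product, while the toy volume can differ because coarser row blocks keep more of the needed rows local.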
Citations: 0
Beyond Belady to Attain a Seemingly Unattainable Byte Miss Ratio for Content Delivery Networks
IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-30 | DOI: 10.1109/TPDS.2024.3452096
Peng Wang;Hong Jiang;Yu Liu;Zhelong Zhao;Ke Zhou;Zhihai Huang
Reducing the byte miss ratio (BMR) in Content Delivery Network (CDN) caches helps providers save on traffic costs. When objects or files of different sizes are evicted from CDN caches, pursuing an optimal object miss ratio (OMR) by approximating Belady is no longer sufficient to ensure an optimal BMR. Our experimental observations suggest that there are multiple request sequence windows in which a replacement policy that prioritizes the eviction of large objects, and ultimately evicts the object with the longest reuse distance, lowers the BMR without increasing the OMR. To accurately capture those windows, we monitor the changes in OMR and BMR using a deep reinforcement learning (RL) model and then apply a BMR-friendly replacement algorithm within them. Based on this policy, we propose a Belady and Size Eviction (LRU-BaSE) algorithm that reduces BMR while maintaining OMR. To make LRU-BaSE efficient and practical, we address the feedback delay problem of RL with a two-pronged approach. On the one hand, we shorten the LRU-BaSE decision region based on the observation that the rear section of the cache queue contains most of the eviction candidates. On the other hand, the request distribution on CDNs makes it feasible to divide the learning region into multiple sub-regions, each of which is learned in less time and with greater accuracy. In real CDN systems, LRU-BaSE outperforms LRU by reducing “backing to OS” traffic and access latency by 30.05% and 17.07%, respectively, on average. In simulator tests, LRU-BaSE outperforms state-of-the-art cache replacement policies. On average, LRU-BaSE's BMR is 0.63% and 0.33% lower than that of Belady and Practical Flow-based Offline Optimal (PFOO), respectively. In addition, compared with Learning Relaxed Belady (LRB), LRU-BaSE yields relatively stable performance under workload drift.
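The difference between the two metrics in the abstract can be seen with a small simulation. The sketch below replays a request trace through a plain, size-aware LRU cache and reports both the object miss ratio and the byte miss ratio. It is only a baseline illustration: LRU-BaSE, its RL model, and Belady are not implemented, and the function name, the trace format, and the rule that objects larger than the cache are never admitted are assumptions of this example.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity):
    """Replay (object_id, size_bytes) requests through a byte-limited LRU cache.

    Returns (omr, bmr): the fraction of requests that miss and the
    fraction of requested bytes that miss.
    """
    cache = OrderedDict()  # object_id -> size, least recently used first
    used = 0
    missed_objects = missed_bytes = total_bytes = 0
    for obj, size in trace:
        total_bytes += size
        if obj in cache:
            cache.move_to_end(obj)  # a hit refreshes recency
            continue
        missed_objects += 1
        missed_bytes += size
        if size > capacity:
            continue  # assumed policy: never admit oversized objects
        while used + size > capacity:
            _, evicted_size = cache.popitem(last=False)  # evict the LRU object
            used -= evicted_size
        cache[obj] = size
        used += size
    return missed_objects / len(trace), missed_bytes / total_bytes
```

Because a miss on a large object costs many more bytes than a miss on a small one, the two ratios can diverge on the same trace, which is why a policy tuned for OMR does not automatically give a good BMR.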
Citations: 0
BIRD+: Design of a Lightweight Communication Compressor for Resource-Constrained Distribution Learning Platforms
IF 5.3 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-21 | DOI: 10.1109/tpds.2024.3447221
Donglei Wu, Weihao Yang, Xiangyu Zou, Hao Feng, Dingwen Tao, Shiyi Li, Wen Xia, Binxing Fang
Citations: 0
Privacy-Preserving Data Selection for Horizontal and Vertical Federated Learning
IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-19 | DOI: 10.1109/TPDS.2024.3439709
Lan Zhang;Anran Li;Hongyi Peng;Feng Han;Fan Huang;Xiang-Yang Li
Federated learning (FL) enables distributed participants to collaboratively train a machine learning model without accessing their local data. In FL systems, the selection of training samples has a significant impact on model performance; e.g., selecting participants whose datasets contain low-quality samples and features results in low-accuracy, unstable models. In this work, we address the problem of selecting a collection of high-quality training samples for a given FL task under a monetary budget. We propose a holistic design that efficiently selects high-quality samples while preserving the privacy of the participants' local data and the server's label set. For horizontal federated learning (HFL), we propose an efficient hierarchical sample selection mechanism that selects relevant clients and their samples before training. It uses the determinantal point process (DPP) to select statistically homogeneous and content-diverse clients and samples. In addition, we propose a private set intersection (PSI) based scheme to filter relevant features for the target VFL task. Finally, during training, an error-aware, importance-based selection is proposed to dynamically select important clients and samples and thereby accelerate model convergence. We verify the merits of the proposed solution with extensive experiments on a real AIoT system with 50 clients. The experimental results show that our solution achieves accurate and efficient selection of high-quality data, and consequently an FL model with faster convergence and higher accuracy.
Citations: 0
Paired Many-to-Many 2-Disjoint Path Covers in Meshes
IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-16 | DOI: 10.1109/TPDS.2024.3445283
Fatemeh Keshavarz-Kohjerdi
In the paired many-to-many $k$-disjoint path cover ($k$-DPC) problem, given a set of $k$ pairs of vertices $(s_{i},t_{i})$, $1 \leqslant i \leqslant k$, of a graph $G$, we want to find $k$ simple vertex-disjoint paths whose end-vertices are these $k$ pairs, such that each vertex of $G$ is covered by a path. This is a well-known problem in parallel processing and a generalization of the well-known Hamiltonian $(s,t)$-path problem, which corresponds to 1-DPC. In this paper, we consider the paired many-to-many 2-disjoint path cover problem (2-DPC) in meshes (rectangular grids). We give the necessary conditions for the existence of such covers and present a linear-time algorithm to compute them. Although the paired many-to-many $k$-disjoint path cover problem is well known in parallel processing, our motivation for studying it is its application to solving the Hamiltonian path problem in solid grid graphs. We consider the case where the pairs of vertices are on the outer face of the graph.
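The definition above translates directly into a checker. The sketch below verifies whether a proposed set of paths is a paired many-to-many disjoint path cover of a `rows x cols` mesh: each path must join its assigned pair, move only between grid neighbors, share no vertex with any other path, and together the paths must visit every vertex. It only validates a candidate solution; the paper's linear-time construction is not reproduced, and the function name and the `(row, col)` vertex encoding are assumptions of this example.

```python
def is_paired_dpc(rows, cols, paths, pairs):
    """Check that `paths` is a paired disjoint path cover of a rows x cols mesh.

    paths[i] is a list of (row, col) vertices; pairs[i] is its (s_i, t_i).
    """
    if len(paths) != len(pairs):
        return False
    seen = set()
    for path, (s, t) in zip(paths, pairs):
        # Each path must start at s_i and end at t_i.
        if not path or path[0] != s or path[-1] != t:
            return False
        for r, c in path:
            # Vertices must lie in the mesh and be used at most once overall.
            if not (0 <= r < rows and 0 <= c < cols) or (r, c) in seen:
                return False
            seen.add((r, c))
        # Consecutive vertices must be adjacent in the grid.
        for (r1, c1), (r2, c2) in zip(path, path[1:]):
            if abs(r1 - r2) + abs(c1 - c2) != 1:
                return False
    # Cover condition: every mesh vertex lies on some path.
    return len(seen) == rows * cols
```

With `k = 1` the same check reduces to verifying a Hamiltonian $(s,t)$-path, matching the special case mentioned in the abstract.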
Citations: 0
Logical Synchrony and the Bittide Mechanism
IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-16 | DOI: 10.1109/TPDS.2024.3444739
Sanjay Lall;Călin Caşcaval;Martin Izzard;Tammo Spalink
We introduce logical synchrony, a framework that allows distributed computing to be coordinated as tightly as in synchronous systems without the distribution of a global clock or any reference to universal time. We develop a model of events called a logical synchrony network, in which nodes correspond to processors and every node has an associated local clock which generates the events. We construct a measure of logical latency and develop its properties. A further model, called a multiclock network, is then analyzed and shown to be a refinement of the logical synchrony network. We present the bittide mechanism as an instantiation of multiclock networks, and discuss the clock control mechanism that ensures that buffers do not overflow or underflow. Finally we give conditions under which a logical synchrony network has an equivalent synchronous realization.
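The abstract does not say how the clock control mechanism works, so the following is only a toy sketch of one plausible scheme: each node nudges its clock frequency in proportion to how far its receive buffers sit from a midpoint, speeding up when they run full and slowing down when they run empty. The proportional rule, the zero-latency fluid model, the function name, and all parameter values are assumptions of this example and are not taken from the paper.

```python
def simulate_buffers(free_freqs, links, gain=0.01, steps=5000, midpoint=32.0):
    """Fluid-model sketch of buffer-driven clock control.

    free_freqs[i] is node i's uncorrected frequency; links is a list of
    (receiver, sender) pairs. Returns (frequencies, buffer occupancies).
    """
    n = len(free_freqs)
    # occ[(i, j)] is the occupancy of the buffer at node i fed by node j.
    occ = {link: midpoint for link in links}
    freqs = list(free_freqs)
    for _ in range(steps):
        # Assumed proportional control: a node runs faster when its
        # buffers are above the midpoint and slower when they are below.
        freqs = [
            free_freqs[i]
            + gain * sum(occ[(a, b)] - midpoint for (a, b) in links if a == i)
            for i in range(n)
        ]
        # A buffer fills at the sender's rate and drains at the receiver's.
        for i, j in links:
            occ[(i, j)] += freqs[j] - freqs[i]
    return freqs, occ
```

In this toy model, two nodes with slightly different free-running frequencies settle on a common frequency while their buffers stay near the midpoint. This is the qualitative behavior the abstract describes as avoiding overflow and underflow.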
Citations: 0
FlexRaft: Exploiting Flexible Erasure Coding for Minimum-Cost Consensus and Fast Recovery
IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-14 | DOI: 10.1109/TPDS.2024.3443424
Mi Zhang;Qihan Kang;Patrick P. C. Lee
Consensus protocols like Paxos and Raft provide data consistency and fault tolerance for distributed services. Log replication in these protocols can be supported by erasure coding, which incurs lower redundancy than full-copy replication and significantly saves network and storage costs for overall performance improvements. However, existing consensus protocols with erasure coding cannot achieve the minimum network and storage costs during log replication. We propose FlexRaft, which dynamically varies the coding scheme used in Raft based on the server status to always achieve the theoretically minimum redundancy ratio, while maintaining the same liveness as in Raft. To address the issue of an inconsistent coding scheme between the leader and its followers, we specify the prerequisite of overwriting a log entry and also allow the leader and its followers to exactly track the coding scheme being used. We further extend FlexRaft into FlexRaft+, which provides a different storage layout to vary the coding scheme through a novel technique called re-encoding-free replication, so as to enable fast server recovery. We prove that both FlexRaft and FlexRaft+ maintain Raft safety. We implement a prototype of FlexRaft and FlexRaft+, atop which we build a distributed key-value store to show its efficacy. Experiments on Alibaba Cloud show that FlexRaft achieves the theoretically minimum network and storage costs in practice, and reduces the commit latency by 44.51% and 19.37% compared with state-of-the-art CRaft and HRaft, respectively. FlexRaft+ further reduces the commit latency when the coding scheme is being varied and improves the server recovery performance.
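To make the redundancy argument concrete, the sketch below compares the redundancy ratio of a (k, m) erasure code, which stores k + m fragments for every k fragments of data, with that of full-copy replication. The selection rule in `choose_scheme` is a hypothetical policy written for illustration, not the paper's stated algorithm. It assumes one fragment is placed on each live server and that F further failures must be tolerated, which forces m ≥ F and hence k ≤ alive − F. All function names are illustrative.

```python
from fractions import Fraction
from typing import Tuple


def redundancy_ratio(k: int, m: int) -> Fraction:
    """Bytes stored or sent per byte of original data for a (k, m) code."""
    if k < 1 or m < 0:
        raise ValueError("need k >= 1 and m >= 0")
    return Fraction(k + m, k)


def choose_scheme(alive: int, f: int) -> Tuple[int, int]:
    """Hypothetical policy: pick the largest k that still tolerates f failures.

    With one fragment per live server (k + m == alive) and m >= f,
    the largest admissible k is alive - f.
    """
    if alive <= f:
        raise ValueError("not enough live servers to tolerate f failures")
    return alive - f, f


if __name__ == "__main__":
    f = 2  # tolerated failures in a cluster of 2f + 1 = 5 servers
    for alive in (5, 4, 3):
        k, m = choose_scheme(alive, f)
        print(f"alive={alive}: (k={k}, m={m}), ratio={redundancy_ratio(k, m)}")
    # Full-copy replication to all 5 servers corresponds to k=1, m=4.
    print(f"full replication: ratio={redundancy_ratio(1, 4)}")
```

Under these assumptions the ratio rises as servers fail (5/3, then 2, then 3 for a five-server cluster) but never exceeds the cost of full-copy replication to every server, which is the intuition behind varying the coding scheme with server status.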
Citations: 0