
IEEE Transactions on Parallel and Distributed Systems: Latest Articles

CODE$^{+}$: Fast and Accurate Inference for Compact Distributed IoT Data Collection
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-09-03 DOI: 10.1109/TPDS.2024.3453607
Huali Lu;Feng Lyu;Ju Ren;Huaqing Wu;Conghao Zhou;Zhongyuan Liu;Yaoxue Zhang;Xuemin Shen
In distributed IoT data systems, full-size data collection is impractical due to energy constraints and large system scales. Our previous work investigated the advantages of integrating matrix sampling and inference for compact distributed IoT data collection, to minimize the data collection cost while guaranteeing the data benefits. This paper further advances the technology by enabling fast and accurate inference for distributed IoT data systems that are sensitive to computation time, training stability, and inference accuracy. In particular, we propose CODE$^{+}$, i.e., Compact Distributed IOT Data CollEction Plus, which features a cluster-based sampling module and a Convolutional Neural Network (CNN)-Transformer autoencoder-based inference module, to reduce cost and guarantee the data benefits. The sampling component employs a cluster-based matrix sampling approach, in which data clustering is conducted first and a two-step sampling is then performed according to the number of clusters and the clustering errors. The inference component integrates a CNN-Transformer autoencoder-based matrix inference model to estimate the full-size spatio-temporal data matrix. The model consists of a CNN-Transformer encoder that extracts the underlying features from the sampled data matrix and a lightweight decoder that maps the learned latent features back to the original full-size data matrix. We implement CODE$^{+}$ on three operational large-scale IoT systems and one synthetic Gaussian distribution dataset, and provide extensive experiments to demonstrate its efficiency and robustness. With a 20% sampling ratio, CODE$^{+}$ achieves an average data reconstruction accuracy of 94% across the four datasets, outperforming our previous version (87%) and the state-of-the-art baseline (71%).
Citations: 0
Exploring the Design Space of Distributed Parallel Sparse Matrix-Multiple Vector Multiplication
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-30 DOI: 10.1109/TPDS.2024.3452478
Hua Huang;Edmond Chow
We consider the distributed memory parallel multiplication of a sparse matrix by a dense matrix (SpMM). The dense matrix is often a collection of dense vectors. Standard implementations will multiply the sparse matrix by multiple dense vectors at the same time, to exploit the computational efficiencies therein. But such approaches generally utilize the same sparse matrix partitioning as if multiplying by a single vector. This article explores the design space of parallelizing SpMM and shows that a coarser grain partitioning of the matrix combined with a column-wise partitioning of the block of vectors can often require less communication volume and achieve higher SpMM performance. An algorithm is presented that chooses a process grid geometry for a given number of processes to optimize the performance of parallel SpMM. The algorithm can augment existing graph partitioners by utilizing the additional concurrency available when multiplying by multiple dense vectors to further reduce communication.
Citations: 0
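To make the communication argument in the SpMM abstract above concrete, the following sketch counts how many entries of the dense block X must be fetched from other processes for different process grid shapes. It is a simplified model, not the algorithm from the article: the function names, the contiguous equal-size row blocks, and the cost model (only X traffic is counted, while replication of the sparse matrix across column groups and load balance are ignored) are all assumptions made for illustration.

```python
def remote_rows(nonzeros, n, parts):
    """Total number of distinct remote rows of X needed, summed over row blocks.

    The n rows of A (and of X) are split into `parts` contiguous blocks of
    equal size; a nonzero A[i][j] forces the owner of row i to fetch row j
    of X whenever j lies in another block.
    """
    block = n // parts  # assumes parts divides n
    needed = [set() for _ in range(parts)]
    for i, j in nonzeros:
        if j // block != i // block:
            needed[i // block].add(j)
    return sum(len(rows) for rows in needed)


def comm_volume(nonzeros, n, n_vecs, pr, pc):
    """Entries of X moved for a pr x pc grid: pr row blocks of A, pc column blocks of X."""
    cols_per_group = n_vecs // pc  # assumes pc divides n_vecs
    # Each of the pc column groups fetches the same remote rows,
    # but only for its own cols_per_group columns of X.
    return remote_rows(nonzeros, n, pr) * cols_per_group * pc


# 8 x 8 tridiagonal sparsity pattern, 4 dense vectors, 4 processes.
n, n_vecs = 8, 4
tridiag = [(i, j) for i in range(n) for j in (i - 1, i, i + 1) if 0 <= j < n]
volumes = {(pr, pc): comm_volume(tridiag, n, n_vecs, pr, pc)
           for pr, pc in [(4, 1), (2, 2), (1, 4)]}
print(volumes)
```

Under this model the coarser 2 x 2 grid moves fewer entries of X than the 4 x 1 row partitioning, because coarser row blocks keep more of the needed rows local. The 1 x 4 grid moves none, but only because every process would then hold the whole sparse matrix, which is a cost this sketch deliberately leaves out.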
Beyond Belady to Attain a Seemingly Unattainable Byte Miss Ratio for Content Delivery Networks
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-30 DOI: 10.1109/TPDS.2024.3452096
Peng Wang;Hong Jiang;Yu Liu;Zhelong Zhao;Ke Zhou;Zhihai Huang
Reducing the byte miss ratio (BMR) in Content Delivery Network (CDN) caches can help providers save on traffic costs. When evicting objects or files of different sizes in CDN caches, pursuing an optimal object miss ratio (OMR) by approximating Belady is no longer sufficient to ensure an optimal BMR. Our experimental observations suggest that there are multiple request sequence windows in which a replacement policy that prioritizes the eviction of large objects, and ultimately evicts the object with the longest reuse distance, lowers the BMR without increasing the OMR. To accurately capture those windows, we monitor the changes in OMR and BMR using a deep reinforcement learning (RL) model and then apply a BMR-friendly replacement algorithm within these windows. Based on this policy, we propose a Belady and Size Eviction (LRU-BaSE) algorithm that reduces BMR while maintaining OMR. To make LRU-BaSE efficient and practical, we address the feedback delay problem of RL with a two-pronged approach. On the one hand, we shorten the LRU-BaSE decision region based on the observation that the rear section of the cache queue contains most of the eviction candidates. On the other hand, the request distribution on CDNs makes it feasible to divide the learning region into multiple sub-regions, each of which is learned in less time and with higher accuracy. In real CDN systems, LRU-BaSE outperforms LRU by reducing “backing to OS” traffic and access latency by 30.05% and 17.07%, respectively, on average. In simulator tests, LRU-BaSE outperforms state-of-the-art cache replacement policies. On average, LRU-BaSE's BMR is 0.63% and 0.33% lower than that of Belady and Practical Flow-based Offline Optimal (PFOO), respectively. In addition, compared to Learning Relaxed Belady (LRB), LRU-BaSE yields relatively stable performance under workload drift.
Citations: 0
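The distinction between object miss ratio and byte miss ratio in the abstract above can be illustrated with a small simulation. The sketch below measures both ratios for a plain size-aware LRU cache; it is not LRU-BaSE, and the function name, the trace, and the capacity are hypothetical values chosen for illustration.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity):
    """Replay (key, size) requests through an LRU cache of `capacity` bytes.

    Returns (object_miss_ratio, byte_miss_ratio).
    """
    cache = OrderedDict()  # key -> size, least recently used first
    used = 0
    miss_objs = miss_bytes = total_bytes = 0
    for key, size in trace:
        total_bytes += size
        if key in cache:
            cache.move_to_end(key)  # hit: mark as most recently used
            continue
        miss_objs += 1
        miss_bytes += size
        if size > capacity:
            continue  # object can never fit, so do not admit it
        while used + size > capacity:
            _, evicted_size = cache.popitem(last=False)  # evict LRU object
            used -= evicted_size
        cache[key] = size
        used += size
    return miss_objs / len(trace), miss_bytes / total_bytes


# Hypothetical trace: one large object "b" and two small ones.
trace = [("a", 1), ("b", 4), ("a", 1), ("c", 1), ("a", 1), ("b", 4)]
omr, bmr = simulate_lru(trace, capacity=5)
print(f"OMR = {omr:.3f}, BMR = {bmr:.3f}")
```

In this trace four of six requests miss, but those misses account for ten of the twelve requested bytes, so the BMR is noticeably higher than the OMR. This is the kind of gap that makes object size relevant to the eviction decision.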
BIRD+: Design of a Lightweight Communication Compressor for Resource-Constrained Distribution Learning Platforms
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-21 DOI: 10.1109/TPDS.2024.3447221
Donglei Wu;Weihao Yang;Xiangyu Zou;Hao Feng;Dingwen Tao;Shiyi Li;Wen Xia;Binxing Fang
The Top-K sparsification-based compression framework has been extensively explored for reducing communication costs in distributed learning. However, we identify several issues with existing Top-K sparsification-based compression methods: (i) the limited compressibility of the Top-K parameters' indexes critically restricts the overall communication compression ratio; (ii) several time-consuming compression operations significantly offset the benefits of communication compression; (iii) the use of error feedback techniques to maintain model quality results in a high memory footprint. To solve these issues, we propose BIRD, a lightweight tensor-wise Bi-Random sampling strategy with an expectation invariance property. Specifically, BIRD applies a tensor-wise index sharing mechanism that reduces the index proportion by allowing multiple tensor elements to share a single index, thus improving the overall compression ratio. Additionally, BIRD replaces the time-consuming Top-K sorting with a faster Bi-Random sampling strategy based on the aforementioned index sharing mechanism, significantly reducing compression overheads. Moreover, BIRD builds an expectation invariance property into the Bi-Random sampling to ensure an approximately unbiased representation of the $L_1$-norm of the sampled tensors, effectively maintaining the model quality without incurring extra memory costs. We further optimize BIRD into BIRD+ by introducing uniform distribution-based sampling and Gamma correction into the tensor-wise sampling process, achieving a more flexible adjustment of the sparsity with better convergence performance. Experimental evaluations across multiple conventional distributed learning tasks demonstrate that, compared to state-of-the-art approaches, BIRD+ achieves communication compression ratios up to 36.2$\times$ higher and computation throughput up to 149.6$\times$ higher while maintaining the model quality without incurring extra memory costs.
Citations: 0
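Issue (i) in the abstract above, the cost of transmitting indexes alongside Top-K values, can be seen with a small calculation. The sketch below implements plain Top-K sparsification (the baseline, not BIRD or BIRD+) and compares the compression ratio with and without index overhead. The function names, the 32-bit value and index widths, and the sample gradient are assumptions made for illustration.

```python
def top_k_sparsify(grad, k):
    """Keep the k entries of largest magnitude; return (indexes, values)."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    idx = sorted(order[:k])  # send indexes in ascending order
    return idx, [grad[i] for i in idx]


def compression_ratio(n, k, value_bits=32, index_bits=32):
    """Dense size divided by sparse size when each kept value carries one index."""
    return (n * value_bits) / (k * (value_bits + index_bits))


grad = [0.1, -2.0, 0.05, 1.5, -0.3, 0.0, 0.7, -0.02]
idx, vals = top_k_sparsify(grad, k=2)
print(idx, vals)

with_indexes = compression_ratio(len(grad), 2)                   # values + indexes
values_only = compression_ratio(len(grad), 2, index_bits=0)      # ideal, no indexes
print(with_indexes, values_only)
```

With equal value and index widths, the indexes halve the achievable ratio. This is why a scheme in which several elements share one index, as the abstract describes, can raise the overall compression ratio.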
Fair Coflow Scheduling via Controlled Slowdown
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-20 DOI: 10.1109/TPDS.2024.3446188
Francesco De Pellegrini;Vaibhav Kumar Gupta;Rachid El Azouzi;Serigne Gueye;Cedric Richier;Jeremie Leguay
The average coflow completion time (CCT) is the standard performance metric in coflow scheduling. However, standard CCT minimization may introduce unfairness between the data transfer phases of different computing jobs. Thus, while progress guarantees have been introduced in the literature to mitigate this fairness issue, the trade-off between fairness and efficiency of data transfer is hard to control. This paper introduces a fairness framework for coflow scheduling based on the concept of slowdown, i.e., the performance loss of a coflow compared to isolation. By controlling the slowdown, it is possible to enforce a target coflow progress while minimizing the average CCT. In the proposed framework, the minimum slowdown for a batch of coflows can be determined in polynomial time. By showing the equivalence with Gaussian elimination, slowdown constraints are introduced into the primal-dual iterations of the CoFair algorithm. The algorithm extends the class of $\sigma$-order schedulers to solve the fair coflow scheduling problem in polynomial time. It provides a 4-approximation of the average CCT w.r.t. an optimal scheduler. Extensive numerical results demonstrate that this approach can trade off average CCT for slowdown more efficiently than existing state-of-the-art schedulers.
Citations: 0
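The slowdown notion in the abstract above can be made concrete with a short calculation. The sketch below assumes the common big-switch model, in which a coflow running alone finishes in the time needed by its most loaded port, and it takes slowdown to be the ratio of scheduled CCT to isolated CCT. Both the ratio form and the hand-computed completion times are assumptions for illustration; this is not the CoFair algorithm.

```python
def isolation_cct(coflow, bandwidth=1.0):
    """CCT of a coflow running alone: bottleneck port load divided by port bandwidth.

    coflow: list of (ingress_port, egress_port, volume) flows.
    """
    load = {}
    for src, dst, vol in coflow:
        load[("in", src)] = load.get(("in", src), 0.0) + vol
        load[("out", dst)] = load.get(("out", dst), 0.0) + vol
    return max(load.values()) / bandwidth


def slowdown(cct_scheduled, cct_isolated):
    """Ratio >= 1; a value of 1.0 means the coflow was not delayed by others."""
    return cct_scheduled / cct_isolated


# Two hypothetical coflows sharing ingress port 0.
a = [(0, 1, 4.0), (0, 2, 2.0)]
b = [(0, 1, 2.0)]
iso_a, iso_b = isolation_cct(a), isolation_cct(b)

# Completion times worked out by hand for strict priority on port 0:
# serving b first gives CCTs (a=8, b=2); serving a first gives (a=6, b=8).
b_first = {"a": slowdown(8.0, iso_a), "b": slowdown(2.0, iso_b)}
a_first = {"a": slowdown(6.0, iso_a), "b": slowdown(8.0, iso_b)}
print(b_first, a_first)
```

Serving the large coflow first leaves it undelayed but slows the small one by a factor of four, whereas the opposite order keeps the worst slowdown near 1.33. Bounding this worst-case ratio is the kind of control a slowdown constraint expresses.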
Privacy-Preserving Data Selection for Horizontal and Vertical Federated Learning
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-19 DOI: 10.1109/TPDS.2024.3439709
Lan Zhang;Anran Li;Hongyi Peng;Feng Han;Fan Huang;Xiang-Yang Li
Federated learning (FL) enables distributed participants to collaboratively train a machine learning model without accessing their local data. In FL systems, the selection of training samples has a significant impact on model performance; e.g., selecting participants whose datasets have low-quality samples and features would result in low-accuracy, unstable models. In this work, we aim to solve the problem of selecting a collection of high-quality training samples for a given FL task under a monetary budget. We propose a holistic design to efficiently select high-quality samples while preserving the privacy of the participants’ local data and the server’s label set. We propose an efficient hierarchical sample selection mechanism to select relevant clients and their samples before training for horizontal federated learning (HFL). It uses the determinantal point process (DPP) to select statistically homogeneous and content-diverse clients and samples. In addition, we propose a private set intersection (PSI) based scheme to filter relevant features for the target vertical federated learning (VFL) task. Finally, during training, an erroneous-aware importance-based selection is proposed to dynamically select important clients and samples to accelerate model convergence. We verify the merits of our proposed solution with extensive experiments on a real AIoT system with 50 clients. The experimental results validate that our solution achieves accurate and efficient selection of high-quality data and, consequently, an FL model with faster convergence and higher accuracy.
Citations: 0
Paired Many-to-Many 2-Disjoint Path Covers in Meshes
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-16 DOI: 10.1109/TPDS.2024.3445283
Fatemeh Keshavarz-Kohjerdi
In the paired many-to-many $k$-disjoint path cover ($k$-DPC) problem, given a set of $k$ pairs of vertices $(s_{i},t_{i})$, $1 \leqslant i \leqslant k$, of a graph $G$, we want to find $k$ simple vertex-disjoint paths whose end-vertices are these $k$ pairs, such that each vertex of $G$ is covered by a path. This is a well-known problem in parallel processing and a generalization of the well-known Hamiltonian $(s,t)$-path problem, which is equivalent to 1-DPC. In this paper, we consider the paired many-to-many 2-disjoint path cover problem (2-DPC) in meshes (rectangular grids). We give the necessary conditions for the existence of such covers and present a linear-time algorithm to compute them. Although the paired many-to-many $k$-disjoint path cover problem is well known in parallel processing, our motivation for studying it is its application to solving the Hamiltonian path problem in solid grid graphs. We consider the case where the pairs of vertices are on the outer face of the graph.
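The definition of a paired $k$-DPC above translates directly into a checker. The sketch below verifies a proposed cover of a rows x cols mesh; it does not construct covers and is not the linear-time algorithm from the paper. The function names and the 2 x 3 example are hypothetical.

```python
def is_grid_path(path, rows, cols):
    """True if every vertex is inside the mesh and consecutive vertices are adjacent."""
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols):
            return False
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))


def is_paired_dpc(paths, pairs, rows, cols):
    """Check that paths[i] joins pairs[i], paths are vertex-disjoint, and all vertices are covered."""
    if len(paths) != len(pairs):
        return False
    seen = set()
    for path, (s, t) in zip(paths, pairs):
        if not path or path[0] != s or path[-1] != t:
            return False
        if not is_grid_path(path, rows, cols):
            return False
        for v in path:
            if v in seen:  # repeated vertex: path not simple or not disjoint
                return False
            seen.add(v)
    return len(seen) == rows * cols


# 2 x 3 mesh with terminal pairs on the outer face.
pairs = [((0, 0), (1, 0)), ((0, 1), (1, 1))]
cover = [[(0, 0), (1, 0)],
         [(0, 1), (0, 2), (1, 2), (1, 1)]]
not_a_cover = [[(0, 0), (1, 0)],
               [(0, 1), (1, 1)]]  # leaves (0, 2) and (1, 2) uncovered
print(is_paired_dpc(cover, pairs, 2, 3), is_paired_dpc(not_a_cover, pairs, 2, 3))
```

The second candidate joins the right end-vertices with disjoint paths but misses two vertices, which shows why the covering requirement makes the problem harder than finding disjoint paths alone.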
Citations: 0
Logical Synchrony and the Bittide Mechanism
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-16 DOI: 10.1109/TPDS.2024.3444739
Sanjay Lall;Călin Caşcaval;Martin Izzard;Tammo Spalink
We introduce logical synchrony, a framework that allows distributed computing to be coordinated as tightly as in synchronous systems without the distribution of a global clock or any reference to universal time. We develop a model of events called a logical synchrony network, in which nodes correspond to processors and every node has an associated local clock that generates the events. We construct a measure of logical latency and develop its properties. A further model, called a multiclock network, is then analyzed and shown to be a refinement of the logical synchrony network. We present the bittide mechanism as an instantiation of multiclock networks, and discuss the clock control mechanism that ensures that buffers neither overflow nor underflow. Finally, we give conditions under which a logical synchrony network has an equivalent synchronous realization.
Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10638228
Citations: 0
FlexRaft: Exploiting Flexible Erasure Coding for Minimum-Cost Consensus and Fast Recovery
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-14 DOI: 10.1109/TPDS.2024.3443424
Mi Zhang;Qihan Kang;Patrick P. C. Lee
Consensus protocols like Paxos and Raft provide data consistency and fault tolerance for distributed services. Log replication in these protocols can be supported by erasure coding, which incurs lower redundancy than full-copy replication and significantly saves network and storage costs for overall performance improvements. However, existing consensus protocols with erasure coding cannot achieve the minimum network and storage costs during log replication. We propose FlexRaft, which dynamically varies the coding scheme used in Raft based on the server status to always achieve the theoretically minimum redundancy ratio, while maintaining the same liveness as in Raft. To address the issue of an inconsistent coding scheme between the leader and its followers, we specify the prerequisite of overwriting a log entry and also allow the leader and its followers to exactly track the coding scheme being used. We further extend FlexRaft into FlexRaft+, which provides a different storage layout to vary the coding scheme through a novel technique called re-encoding-free replication, so as to enable fast server recovery. We prove that both FlexRaft and FlexRaft+ maintain Raft safety. We implement a prototype of FlexRaft and FlexRaft+, atop which we build a distributed key-value store to show its efficacy. Experiments on Alibaba Cloud show that FlexRaft achieves the theoretically minimum network and storage costs in practice, and reduces the commit latency by 44.51% and 19.37% compared with state-of-the-art CRaft and HRaft, respectively. FlexRaft+ further reduces the commit latency when the coding scheme is being varied and improves the server recovery performance.
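The redundancy argument can be illustrated with a little arithmetic. In a $(k, m)$ erasure code an entry is split into $k$ data fragments plus $m$ parity fragments, so the bytes sent and stored are $(k+m)/k$ times the entry size, whereas full-copy replication to the same servers costs $k+m$ times. The Python sketch below computes that ratio. The `pick_scheme` policy, which keeps `f` parity fragments and uses the remaining live servers for data, is a hypothetical illustration of varying the code with server status, not the rule defined in the paper.

```python
from fractions import Fraction


def redundancy_ratio(k, m):
    """Total bytes stored / original entry size for a (k, m) erasure code."""
    if k < 1 or m < 0:
        raise ValueError("need k >= 1 data fragments and m >= 0 parity fragments")
    return Fraction(k + m, k)


def pick_scheme(live_servers, f):
    """Hypothetical policy (an assumption, not FlexRaft's actual rule):
    reserve f fragments as parity and use the other live servers for data,
    falling back to k = 1 (full copies) when too few servers are alive."""
    if live_servers < 1 or f < 0:
        raise ValueError("need at least one live server and f >= 0")
    k = max(1, live_servers - f)
    return k, live_servers - k


# Five live servers, tolerate f = 2 failures: a (3, 2) code.
healthy = pick_scheme(5, 2)
healthy_ratio = redundancy_ratio(*healthy)

# Only three servers alive: the sketch degrades to full copies, (1, 2).
degraded = pick_scheme(3, 2)
degraded_ratio = redundancy_ratio(*degraded)
```

Under these assumptions the healthy cluster stores 5/3 of the entry size in total, compared with 5 for full-copy replication on the same five servers.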
Citations: 0
SSRAID: A Stripe-Queued and Stripe-Threaded Merging I/O Strategy to Improve Write Performance of Serial Interface SSD RAID
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-14 DOI: 10.1109/TPDS.2024.3443083
Peixuan Li;Ping Xie;Qiang Cao
RAID (Redundant Array of Independent Disks) has been widely used to enhance the read and write performance of existing storage systems. Existing software RAID implementations do not fully utilize the write performance of serial-interface SSDs (solid-state drives). The most popular software RAID currently is Linux Multiple-Disks (MD), and the latest software RAID is StRAID. We observe that both of these software RAID methods lead to thread contention in multi-threaded mode, especially when applied to serial-interface SSDs. Multiple threads writing to the same address can limit write performance. In this paper, we propose a stripe-queued and stripe-threaded merging I/O strategy. First, SSRAID segregates write requests across different stripes using a set of stripe-queues and stripe-threads to prevent interference between them. As a result, write thread contention in SSRAID is eliminated, allowing stripe-threads to maintain the highest efficiency of parallelism. Second, SSRAID can merge write requests from the same stripe-queue multiple times through the stripe-thread, effectively reducing the number of additional write I/Os. Finally, SSRAID presents a stage buffer based on data merging. During partial stripe-writes, write-induced read I/Os on the SSD are transformed into direct accesses to the stage buffer, effectively reducing write-induced read I/Os. Compared to StRAID, SSRAID improves average sequential write throughput by 86% and reduces average sequential write latency by 61% in the optimal case.
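The idea of segregating writes by stripe can be sketched in a few lines. The Python example below is a toy illustration under stated assumptions, not SSRAID's implementation: it assumes a fixed chunk size, that no request straddles a stripe boundary, and that a stripe is assigned to a queue by `stripe_id % num_queues`. The names `stripe_of` and `dispatch` are hypothetical.

```python
from collections import defaultdict


def stripe_of(offset, chunk_size, data_disks):
    """Stripe index of a byte offset; one stripe holds chunk_size * data_disks bytes."""
    return offset // (chunk_size * data_disks)


def dispatch(requests, chunk_size, data_disks, num_queues):
    """Group (offset, length) write requests by stripe, then by queue.

    Assumption: the queue is chosen as stripe_id % num_queues, so all writes
    to one stripe land in the same queue and can be merged by the single
    thread that serves it, without contending with other threads.
    """
    queues = defaultdict(lambda: defaultdict(list))
    for offset, length in requests:
        sid = stripe_of(offset, chunk_size, data_disks)
        queues[sid % num_queues][sid].append((offset, length))
    # Convert to plain dicts so the result is easy to inspect.
    return {q: dict(stripes) for q, stripes in queues.items()}


# Chunk size 4, two data disks: each stripe spans 8 bytes.
requests = [(0, 4), (4, 4), (8, 4), (16, 8)]
plan = dispatch(requests, chunk_size=4, data_disks=2, num_queues=2)
```

In this sketch the two writes to stripe 0 end up in the same list, so one worker could merge them into a single full-stripe write instead of issuing two partial-stripe writes.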
Citations: 0