Pub Date: 2024-02-20 | DOI: 10.1016/j.jpdc.2024.104866
Mohammed M. Alani
With the promise of higher throughput and better response times, 6G networks are a significant enabler for the evolution of smart cities. The rapidly growing reliance on connected devices within the smart-city context encourages malicious actors to target these devices to achieve various malicious goals. In this paper, we present a novel defense technique that creates a cloud-based virtualized honeypot/twin designed to receive malicious traffic through an edge-based, machine learning-enabled detection system. The proposed system performs early identification of malicious traffic at a software-defined network-enabled edge routing point to divert that traffic away from the 6G-enabled smart-city endpoints. Testing of the proposed system showed an accuracy exceeding 99.8%, with an F1 score of 0.9984.
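The gating logic described in this abstract (detect at the edge, divert suspect flows to the honeypot twin) can be sketched as follows; the feature names, weights, and threshold are illustrative assumptions, not the paper's actual model:

```python
# Sketch of the edge-side gating idea: an ML detector scores each flow at the
# SDN edge, and flows scored as malicious are diverted to the cloud honeypot
# twin instead of the real smart-city endpoint.

def classify_flow(features, threshold=0.5):
    """Toy stand-in for the trained detector: returns True if 'malicious'.

    The weighted sum over two assumed features is purely illustrative; a real
    deployment would call a trained model's prediction method instead.
    """
    score = 0.7 * features["syn_rate"] + 0.3 * features["payload_entropy"]
    return score >= threshold

def route_flow(features):
    """Divert suspected-malicious flows to the honeypot twin."""
    return "honeypot" if classify_flow(features) else "endpoint"
```

In an SDN setting, the `route_flow` decision would be installed as a flow rule at the edge routing point rather than evaluated per packet.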
{"title":"HoneyTwin: Securing smart cities with machine learning-enabled SDN edge and cloud-based honeypots","authors":"Mohammed M. Alani","doi":"10.1016/j.jpdc.2024.104866","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104866","url":null,"abstract":"<div><p>With the promise of higher throughput, and better response times, 6G networks provide a significant enabler for smart cities to evolve. The rapidly-growing reliance on connected devices within the smart city context encourages malicious actors to target these devices to achieve various malicious goals. In this paper, we present a novel defense technique that creates a cloud-based virtualized honeypot/twin that is designed to receive malicious traffic through edge-based machine learning-enabled detection system. The proposed system performs early identification of malicious traffic in a software defined network-enabled edge routing point to divert that traffic away from the 6G-enabled smart city endpoints. Testing of the proposed system showed an accuracy exceeding 99.8%, with an <span><math><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span> score of 0.9984.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139942060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-18 | DOI: 10.1016/j.jpdc.2024.104867
Wenjie Tang, Yiping Yao, Lizhen Ou, Kai Chen
Publish–subscribe communication is a fundamental service for message passing between decoupled applications in distributed simulation. Because it can introduce abundant unnecessary data transfer, interest-matching services are needed to filter irrelevant message traffic. Frequent matching demands during simulation execution make interest matching a bottleneck as the simulation scale increases. Contemporary algorithms built for serial processing inadequately leverage multicore processor-based parallel resources, and existing parallel algorithmic improvements are insufficient for large-scale simulations. Therefore, we propose a hierarchical sort-based parallel algorithm for dynamic interest matching that embeds all update and subscription regions into two full binary trees, thereby transforming the region-matching task into one of node matching. It exploits the association between adjacent nodes and the hierarchical relation between parent-child nodes to eliminate redundant operations, and achieves incremental parallel matching that compares only changed regions. We analyze the time and space complexity of this process. The new algorithm performs better and is more scalable than state-of-the-art algorithms.
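As an illustration of sort-based interest matching, the following sketch solves a simplified one-dimensional version of the problem by sorting region endpoints and sweeping; the paper's tree embedding, incremental updates, and parallelization are not reproduced here:

```python
def sort_based_matches(updates, subs):
    """Return index pairs (u, s) of overlapping update/subscription intervals.

    Simplified 1-D sort-based interest matching: sort all endpoints once, then
    sweep, keeping 'active' sets of currently open intervals. Each region is a
    closed interval (lo, hi); at equal coordinates, starts sort before ends so
    touching intervals count as overlapping.
    """
    events = []  # (coordinate, is_end, kind, index)
    for i, (lo, hi) in enumerate(updates):
        events.append((lo, 0, "u", i))
        events.append((hi, 1, "u", i))
    for j, (lo, hi) in enumerate(subs):
        events.append((lo, 0, "s", j))
        events.append((hi, 1, "s", j))
    events.sort()

    active_u, active_s, pairs = set(), set(), set()
    for _, is_end, kind, idx in events:
        if is_end:
            (active_u if kind == "u" else active_s).discard(idx)
        elif kind == "u":
            pairs.update((idx, s) for s in active_s)  # new update vs open subs
            active_u.add(idx)
        else:
            pairs.update((u, idx) for u in active_u)  # new sub vs open updates
            active_s.add(idx)
    return pairs
```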
{"title":"Hierarchical sort-based parallel algorithm for dynamic interest matching","authors":"Wenjie Tang, Yiping Yao, Lizhen Ou, Kai Chen","doi":"10.1016/j.jpdc.2024.104867","DOIUrl":"10.1016/j.jpdc.2024.104867","url":null,"abstract":"<div><p>Publish–subscribe communication is a fundamental service used for message-passing between decoupled applications in distributed simulation. When abundant unnecessary data transfer is introduced, interest-matching services are needed to filter irrelevant message traffic. Frequent demands during simulation execution makes interest matching a bottleneck with increased simulation scale. Contemporary algorithms built for serial processing inadequately leverage multicore processor-based parallel resources. Parallel algorithmic improvements are insufficient for large-scale simulations. Therefore, we propose a hierarchical sort-based parallel algorithm for dynamic interest matching that embeds all update and subscription regions into two full binary trees, thereby transferring the region-matching task to one of node-matching. It utilizes the association between adjacent nodes and the hierarchical relation between parent‒child nodes to eliminate redundant operations, and achieves incremental parallel matching that only compares changed regions. We analyze the time and space complexity of this process. 
The new algorithm performs better and is more scalable than state-of-the-art algorithms.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139923545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-15 | DOI: 10.1016/j.jpdc.2024.104863
Anne Benoit , Thomas Herault , Lucas Perotin , Yves Robert , Frédéric Vivien
This work revisits I/O bandwidth-sharing strategies for HPC applications. When several applications post concurrent I/O operations, well-known approaches include serializing these operations or fair-sharing the bandwidth across them (FairShare). Another recent approach, I/O-Sets, assigns priorities to the applications, which are classified into different sets based upon the average length of their iterations. We introduce several new bandwidth-sharing strategies, some of them simple greedy algorithms and some more complicated to implement, and we compare them with existing ones. Our new strategies do not rely on any a priori knowledge of the behavior of the applications, such as the length of work phases, the volume of I/O operations, or some expected periodicity. We introduce a rigorous framework, namely steady-state windows, which enables us to derive bounds on the competitive ratio of all bandwidth-sharing strategies for three different objectives: minimum yield, platform utilization, and global efficiency. To the best of our knowledge, this work is the first to provide a quantitative assessment of the online competitiveness of any bandwidth-sharing strategy. This theory-oriented assessment is complemented by a comprehensive set of simulations, based upon both synthetic and realistic traces. The main conclusion is that two of our simple, low-complexity greedy strategies significantly outperform serialization, FairShare, and I/O-Sets, and we recommend that the I/O community implement them for further assessment.
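To make the serialization-versus-fair-sharing trade-off concrete, a minimal single-link model (an illustrative assumption, not the paper's steady-state-windows framework) can compare completion times under the two classic strategies:

```python
def serialized_completions(volumes, bandwidth):
    """Completion times when concurrent I/O ops run one after another (FCFS)."""
    t, out = 0.0, []
    for v in volumes:
        t += v / bandwidth
        out.append(t)
    return out

def fairshare_completions(volumes, bandwidth):
    """Completion times when the link bandwidth is split equally among
    the operations still in progress (processor-sharing / water-filling)."""
    remaining = sorted((v, i) for i, v in enumerate(volumes))
    done_at = [0.0] * len(volumes)
    t, prev, n = 0.0, 0.0, len(remaining)
    for k, (v, i) in enumerate(remaining):
        share = bandwidth / (n - k)      # ops still running share equally
        t += (v - prev) / share          # time until next-smallest op finishes
        done_at[i] = t
        prev = v
    return done_at
```

Both strategies finish all operations at the same makespan (total volume over bandwidth), but they distribute the individual completion times very differently, which is exactly what yield-style objectives measure.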
{"title":"Revisiting I/O bandwidth-sharing strategies for HPC applications","authors":"Anne Benoit , Thomas Herault , Lucas Perotin , Yves Robert , Frédéric Vivien","doi":"10.1016/j.jpdc.2024.104863","DOIUrl":"10.1016/j.jpdc.2024.104863","url":null,"abstract":"<div><p>This work revisits I/O bandwidth-sharing strategies for HPC applications. When several applications post concurrent I/O operations, well-known approaches include serializing these operations (<figure><img></figure>) or fair-sharing the bandwidth across them (<span>FairShare</span>). Another recent approach, I/O-Sets, assigns priorities to the applications, which are classified into different sets based upon the average length of their iterations. We introduce several new bandwidth-sharing strategies, some of them simple greedy algorithms, and some of them more complicated to implement, and we compare them with existing ones. Our new strategies do not rely on any a-priori knowledge of the behavior of the applications, such as the length of work phases, the volume of I/O operations, or some expected periodicity. We introduce a rigorous framework, namely <em>steady-state windows</em>, which enables to derive bounds on the competitive ratio of all bandwidth-sharing strategies for three different objectives: minimum yield, platform utilization, and global efficiency. To the best of our knowledge, this work is the first to provide a quantitative assessment of the online competitiveness of any bandwidth-sharing strategy. This theory-oriented assessment is complemented by a comprehensive set of simulations, based upon both synthetic and realistic traces. 
The main conclusion is that two of our simple and low-complexity greedy strategies significantly outperform <figure><img></figure>, <span>FairShare</span> and I/O-Sets, and we recommend that the I/O community would implement them for further assessment.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139878546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-12 | DOI: 10.1016/S0743-7315(24)00023-6
{"title":"Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)","authors":"","doi":"10.1016/S0743-7315(24)00023-6","DOIUrl":"https://doi.org/10.1016/S0743-7315(24)00023-6","url":null,"abstract":"","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524000236/pdfft?md5=8661326c859cab793505056ef1edee51&pid=1-s2.0-S0743731524000236-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139726370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-08 | DOI: 10.1016/j.jpdc.2024.104855
Ricardo Quislant, Eladio Gutierrez, Oscar Plata
Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, and economics. Matrix Profile, a state-of-the-art algorithm for time series analysis, finds the most similar and dissimilar subsequences in a time series in deterministic time, and it is exact. Matrix Profile has low arithmetic intensity and operates on large amounts of time series data, which can be an issue in terms of memory requirements. On the other hand, Hardware Transactional Memory (HTM) is an optimistic synchronization method that executes transactions speculatively in parallel while tracking memory accesses to detect and resolve conflicts.
This work evaluates one of the best implementations of Matrix Profile, exploring multiple multiprocessor variants and proposing new implementations that consider a variety of synchronization methods (HTM, locks, barriers) as well as algorithm organizations. We analyze these variants using real datasets, both short and large, in terms of speedup and memory requirements, the latter being a major issue when dealing with very large time series. The experimental evaluation shows that our proposals achieve up to 100× speedup over the sequential algorithm for 128 threads, and up to 3× over the baseline, while keeping memory requirements low and even independent of the number of threads.
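A naive definition-level sketch of the Matrix Profile (plain Euclidean distance, O(n²m) time, no z-normalization, unlike production implementations such as STOMP or SCAMP) clarifies what the evaluated parallel variants compute:

```python
import math

def matrix_profile(series, m):
    """Naive matrix profile with plain Euclidean distance.

    For every length-m subsequence, record the distance to its nearest
    non-overlapping neighbour. An exclusion zone skips trivial matches
    (subsequences that overlap the query itself).
    """
    n = len(series) - m + 1
    profile = [math.inf] * n
    for i in range(n):
        for j in range(n):
            if abs(i - j) < m:          # exclusion zone: skip trivial matches
                continue
            d = math.dist(series[i:i + m], series[j:j + m])
            profile[i] = min(profile[i], d)
    return profile
```

A repeated pattern shows up as a near-zero profile value at both of its occurrences, which is how motifs are discovered; the profile's maxima flag discords (anomalies).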
{"title":"Exploring multiprocessor approaches to time series analysis","authors":"Ricardo Quislant, Eladio Gutierrez, Oscar Plata","doi":"10.1016/j.jpdc.2024.104855","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104855","url":null,"abstract":"<div><p>Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, economics, etc. <em>Matrix Profile</em>, a state-of-the-art algorithm to perform time series analysis, finds out the most similar and dissimilar subsequences in a time series in deterministic time and it is exact. Matrix Profile has low arithmetic intensity and it operates on large amounts of time series data, which can be an issue in terms of memory requirements. On the other hand, Hardware Transactional Memory (HTM) is an alternative optimistic synchronization method that executes transactions speculatively in parallel while keeping track of memory accesses to detect and resolve conflicts.</p><p>This work evaluates one of the best implementations of Matrix Profile exploring multiple multiprocessor variants and proposing new implementations that consider a variety of synchronization methods (HTM, locks, barriers), as well as algorithm organizations. We analyze these variants using real datasets, both short and large, in terms of speedup and memory requirements, the latter being a major issue when dealing with very large time series. 
The experimental evaluation shows that our proposals can achieve up to 100× speedup over the sequential algorithm for 128 threads, and up to 3× over the baseline, while keeping memory requirements low and even independent of the number of threads.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524000194/pdfft?md5=a25b14cc13a327c9c4b6c5f9abde8126&pid=1-s2.0-S0743731524000194-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139732906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-07 | DOI: 10.1016/j.jpdc.2024.104854
Qiliang Li , Min Lyu , Liangliang Xu , Yinlong Xu
The RAID2.0 architecture, which uses dozens or even hundreds of disks, is widely adopted for large-capacity data storage. However, limited resources such as memory and CPU force RAID2.0 to execute batch recovery for disk failures. Traditional random data placement and recovery schemes result in highly skewed I/O access within a batch, which slows down recovery. To address this issue, we propose DR-RAID, an efficient reconstruction scheme that balances local rebuilding workloads across all surviving disks within a batch. We dynamically select a batch of tasks with almost balanced read loads and make intra-batch adjustments for tasks with multiple options for reading source chunks. Furthermore, we use a bipartite graph model to achieve a uniform distribution of write loads. DR-RAID can be applied with homogeneous or heterogeneous disk rebuilding bandwidth. Experimental results demonstrate that in offline rebuilding, DR-RAID enhances the rebuilding throughput by up to 61.90% compared to the random data placement scheme. With varied rebuilding bandwidth, the improvement can reach up to 65.00%.
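The intra-batch load-balancing idea can be illustrated with a simple greedy sketch; the task structure and greedy rule below are assumptions for illustration, not DR-RAID's bipartite-graph formulation:

```python
def balance_rebuild_reads(tasks, disks):
    """Greedily assign rebuild-read tasks to surviving disks.

    Each task is (task_id, read_cost, candidate_disks): a chunk that can be
    read from any of several source replicas, which is the freedom a balanced
    scheme exploits. Largest-cost tasks are placed first, each on its
    least-loaded candidate disk, so read loads stay nearly even.
    """
    load = {d: 0 for d in disks}
    assignment = {}
    for task_id, cost, candidates in sorted(tasks, key=lambda t: -t[1]):
        disk = min(candidates, key=lambda d: load[d])
        assignment[task_id] = disk
        load[disk] += cost
    return assignment, load
```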
{"title":"Fast recovery for large disk enclosures based on RAID2.0: Algorithms and evaluation","authors":"Qiliang Li , Min Lyu , Liangliang Xu , Yinlong Xu","doi":"10.1016/j.jpdc.2024.104854","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104854","url":null,"abstract":"<div><p>The RAID2.0 architecture, which uses dozens or even hundreds of disks, is widely adopted for large-capacity data storage. However, limited resources like memory and CPU cause RAID2.0 to execute batch recovery for disk failures. The traditional random data placement and recovery schemes result in highly skewed I/O access within a batch, which slows down the recovery speed. To address this issue, we propose DR-RAID, an efficient reconstruction scheme that balances local rebuilding workloads across all surviving disks within a batch. We dynamically select a batch of tasks with almost balanced read loads and make intra-batch adjustments for tasks with multiple solutions of reading source chunks. Furthermore, we use a bipartite graph model to achieve a uniform distribution of write loads. DR-RAID can be applied with homogeneous or heterogeneous disk rebuilding bandwidth. Experimental results demonstrate that in offline rebuilding, DR-RAID enhances the rebuilding throughput by up to 61.90% compared to the random data placement scheme. 
With varied rebuilding bandwidth, the improvement can reach up to 65.00%.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139732543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-05 | DOI: 10.1016/j.jpdc.2024.104853
B. Naresh Kumar Reddy , Aruru Sai Kumar
Adaptive routing is effective in maintaining high processor performance in a multiprocessor system-on-chip, steering packets over minimal or non-minimal alternate routes to avoid congestion. However, many systems cannot deal with the fact that sending packets over an alternative path, rather than the shorter fixed-priority route, can result in packets arriving at the destination node out of order. This can occur if packets belonging to the same communication flow are adaptively routed through different paths. In real-world network systems, there are strategies and algorithms to handle out-of-order packets efficiently without requiring infinite memory. Techniques such as buffering, sliding windows, and sequence-number management are used to reorder packets while respecting the practical constraints of available memory and processing power; the specific method depends on the network protocol and the requirements of the application. In this paper, we propose a novel technique that improves the performance of multiprocessor systems-on-chip by implementing adaptive routing based on the Bat algorithm. The framework employs a five-stage pipelined router that receives and forwards each packet in the best direction in an adaptive mode. The Bat algorithm is used to enhance performance by optimizing the route over which packets are transmitted to the destination. Tests were carried out on various NoC sizes (6×6 and 8×8) under multimedia benchmarks, compared with other related algorithms, and implemented on a Kintex-7 FPGA board. The simulation results illustrate that the proposed algorithm reduces delay and improves throughput over traditional adaptive algorithms.
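A minimal sketch of the standard Bat algorithm (here minimizing a generic objective; all parameter values are illustrative defaults, not tuned for NoC routing) shows the optimization loop such a framework builds on:

```python
import random

def bat_minimize(f, dim, iters=200, n_bats=10, fmin=0.0, fmax=2.0,
                 loudness=0.9, pulse=0.5, seed=1):
    """Minimal Bat-algorithm sketch minimizing f over [-5, 5]^dim.

    Standard ingredients only: frequency-tuned velocities pulling bats toward
    the global best, a small random walk around the best bat, and
    loudness-gated acceptance of improving candidates.
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [f(p) for p in pos]
    best = min(pos, key=f)[:]
    for _ in range(iters):
        for i in range(n_bats):
            freq = fmin + (fmax - fmin) * rng.random()
            vel[i] = [v + (x - b) * freq for v, x, b in zip(vel[i], pos[i], best)]
            cand = [x + v for x, v in zip(pos[i], vel[i])]
            if rng.random() > pulse:            # local walk around current best
                cand = [b + 0.01 * rng.gauss(0, 1) for b in best]
            fc = f(cand)
            if fc < fit[i] and rng.random() < loudness:
                pos[i], fit[i] = cand, fc
            if fc < f(best):
                best = cand[:]
    return best, f(best)
```

In a routing context, each candidate position would encode a route choice and `f` would score it by delay and congestion; here a plain numeric objective keeps the sketch self-contained.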
{"title":"Evaluating the effectiveness of Bat optimization in an adaptive and energy-efficient network-on-chip routing framework","authors":"B. Naresh Kumar Reddy , Aruru Sai Kumar","doi":"10.1016/j.jpdc.2024.104853","DOIUrl":"10.1016/j.jpdc.2024.104853","url":null,"abstract":"<div><p>Adaptive routing is effective in maintaining higher processor performance and avoids packets over minimal or non-minimal alternate routes without congestion for a multiprocessor system on chip. However, many systems cannot deal with the fact that sending packets over an alternative path rather than the shorter, fixed-priority route can result in packets arriving at the destination node out of order. This can occur if packets belonging to the same communication flow are adaptively routed through a different path. In real-world network systems, there are strategies and algorithms to efficiently handle out-of-order packets without requiring infinite memory. Techniques like buffering, sliding windows, and sequence number management are used to reorder packets while considering the practical constraints of available memory and processing power. The specific method used depends on the network protocol and the requirements of the application. In the proposed technique, a novel technique aimed at improving the performance of multiprocessor systems on chip by implementing adaptive routing based on the Bat algorithm. The framework employs 5 stage pipeline router, that completely gained and forward a packet at the perfect direction in an adaptive mode. Bat algorithm is used to enhance the performance, which can optimize route to transmit packets at the destination. A test was carried out on various NoC sizes (6 X 6 and 8 X 8) under multimedia benchmarks, compared with other related algorithms and implemented on Kintex-7 FPGA board. 
The outcomes of the simulation illustrate that the proposed algorithm reduces delay and improves the throughput over the other traditional adaptive algorithms.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139688940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-05 | DOI: 10.1016/j.jpdc.2024.104852
Barun Gorain, Partha Sarathi Mandal, Kaushik Mondal, Supantha Pandit
In the dispersion problem, a set of k co-located mobile robots must relocate themselves in distinct nodes of an unknown network. The network is modeled as an anonymous graph G = (V, E), where the graph's nodes are not labeled. The edges incident to a node v with degree d are labeled with port numbers in the range {0, 1, …, d−1} at v. The robots have unique IDs in the range [0, L], where L ≥ k, and are initially placed at a source node s. The task of the dispersion was traditionally achieved based on the assumption of two types of communication abilities: (a) when some robots are at the same node, they can communicate by exchanging messages between them, and (b) any two robots in the network can exchange messages between them. This paper investigates whether this communication ability among co-located robots is absolutely necessary to achieve dispersion. We establish that even in the absence of the ability of communication, the task of the dispersion by a set of mobile robots can be achieved in a much weaker model, where a robot at a node v has access to the following very restricted information at the beginning of any round: (1) am I alone at v? (2) did the number of robots at v increase or decrease compared to the previous round?
We propose a deterministic distributed algorithm that achieves the dispersion on any given graph G = (V, E) in time O(k log L + k² log Δ), where Δ is the maximum degree of a node in G. Further, each robot uses O(log L + log Δ) additional memory, i.e., memory other than the memory required to store its ID. We also prove that the task of the dispersion cannot be achieved by a set of mobile robots with o(log L + log Δ) additional memory.
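For contrast with the paper's communication-free model, the classical DFS-style dispersion, in which fully coordinating co-located robots traverse the graph together and settle one robot per newly visited node, can be sketched as:

```python
def dfs_dispersion(adj, source, k):
    """Classical DFS-style dispersion, shown only for contrast.

    All robots move as one group along a depth-first traversal from the
    source; the smallest remaining ID settles at each newly visited node.
    This assumes co-located robots can coordinate fully, which is exactly
    the communication ability the paper shows is unnecessary.
    `adj` maps node -> list of neighbours (ports in local order).
    """
    robots = list(range(k))        # IDs 0..k-1, initially all at `source`
    settled = {}                   # node -> robot ID settled there
    stack, visited = [source], set()
    while stack and robots:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        settled[node] = robots.pop(0)   # lowest remaining ID settles here
        for nb in reversed(adj[node]):
            if nb not in visited:
                stack.append(nb)
    return settled
```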
{"title":"Collaborative dispersion by silent robots","authors":"Barun Gorain , Partha Sarathi Mandal , Kaushik Mondal , Supantha Pandit","doi":"10.1016/j.jpdc.2024.104852","DOIUrl":"10.1016/j.jpdc.2024.104852","url":null,"abstract":"<div><p>In the dispersion problem, a set of <em>k</em> co-located mobile robots must relocate themselves in distinct nodes of an unknown network. The network is modeled as an anonymous graph <span><math><mi>G</mi><mo>=</mo><mo>(</mo><mi>V</mi><mo>,</mo><mi>E</mi><mo>)</mo></math></span>, where the graph's nodes are not labeled. The edges incident to a node <em>v</em> with degree <em>d</em> are labeled with port numbers in the range <span><math><mo>{</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo>,</mo><mo>…</mo><mo>,</mo><mi>d</mi><mo>−</mo><mn>1</mn><mo>}</mo></math></span> at <em>v</em>. The robots have unique IDs in the range <span><math><mo>[</mo><mn>0</mn><mo>,</mo><mi>L</mi><mo>]</mo></math></span>, where <span><math><mi>L</mi><mo>≥</mo><mi>k</mi></math></span>, and are initially placed at a source node <em>s</em>. The task of the dispersion was traditionally achieved based on the assumption of two types of communication abilities: (a) when some robots are at the same node, they can communicate by exchanging messages between them, and (b) any two robots in the network can exchange messages between them. This paper investigates whether this communication ability among co-located robots is absolutely necessary to achieve dispersion. We establish that even in the absence of the ability of communication, the task of the dispersion by a set of mobile robots can be achieved in a much weaker model, where a robot at a node <em>v</em> has access to following very restricted information at the beginning of any round: (1) am I alone at <em>v</em>? 
(2) did the number of robots at <em>v</em> increase or decrease compared to the previous round?</p><p>We propose a deterministic distributed algorithm that achieves the dispersion on any given graph <span><math><mi>G</mi><mo>=</mo><mo>(</mo><mi>V</mi><mo>,</mo><mi>E</mi><mo>)</mo></math></span> in time <span><math><mi>O</mi><mrow><mo>(</mo><mi>k</mi><mi>log</mi><mo></mo><mi>L</mi><mo>+</mo><msup><mrow><mi>k</mi></mrow><mrow><mn>2</mn></mrow></msup><mi>log</mi><mo></mo><mi>Δ</mi><mo>)</mo></mrow></math></span>, where Δ is the maximum degree of a node in <em>G</em>. Further, each robot uses <span><math><mi>O</mi><mo>(</mo><mi>log</mi><mo></mo><mi>L</mi><mo>+</mo><mi>log</mi><mo></mo><mi>Δ</mi><mo>)</mo></math></span> additional memory, i.e., memory other than the memory required to store its id. We also prove that the task of the dispersion cannot be achieved by a set of mobile robots with <span><math><mi>o</mi><mo>(</mo><mi>log</mi><mo></mo><mi>L</mi><mo>+</mo><mi>log</mi><mo></mo><mi>Δ</mi><mo>)</mo></math></span> additional memory.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139688985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-01 | DOI: 10.1016/j.jpdc.2024.104851
Lizeth Patricia Aguirre Sanchez, Yao Shen, Minyi Guo
In recent decades, the exponential growth of applications has intensified traffic demands, posing challenges in ensuring optimal user experiences within modern networks. Traditional congestion avoidance and control mechanisms embedded in conventional routing struggle to adapt promptly to new-generation networks. Current routing approaches risk adverse outcomes such as (1) scalability constraints, (2) high convergence times, and (3) congestion due to inadequate real-time traffic prioritization. To address these issues, this paper introduces DQS, a QoS-driven routing optimization approach in Software-Defined Networking (SDN) that uses Deep Reinforcement Learning (DRL) to optimize routing and enhance QoS efficiency. DQS optimizes routing decisions by intelligently distributing traffic, guided by a multi-objective-function-driven DRL agent that considers both link and queue metrics. Despite network complexity, DQS sustains scalability while significantly reducing convergence times. Results from a Docker-based OpenFlow prototype highlight a substantial 20-30% reduction in end-to-end delay compared to baseline methods.
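The reward shape of a multi-objective routing agent can be illustrated with a toy tabular Q-learning sketch over candidate paths; the action space, costs, and hyperparameters below are assumptions, far simpler than DQS itself:

```python
import random

def train_path_selector(path_costs, episodes=500, alpha=0.2, eps=0.2, seed=0):
    """Toy stand-in for a DRL routing agent: tabular Q-learning over paths.

    Each action is a candidate path; the reward is the negative of a
    multi-objective cost (imagined here as delay plus queue occupancy,
    pre-combined into `path_costs`). Epsilon-greedy exploration with a
    single-step (bandit-style) Q update.
    """
    rng = random.Random(seed)
    q = [0.0] * len(path_costs)
    for _ in range(episodes):
        a = (rng.randrange(len(q)) if rng.random() < eps
             else max(range(len(q)), key=q.__getitem__))
        reward = -path_costs[a]          # lower combined cost -> higher reward
        q[a] += alpha * (reward - q[a])  # move Q-value toward observed reward
    return q
```

After training, the greedy policy (argmax over Q-values) selects the path with the lowest combined link/queue cost.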
{"title":"DQS: A QoS-driven routing optimization approach in SDN using deep reinforcement learning","authors":"Lizeth Patricia Aguirre Sanchez, Yao Shen, Minyi Guo","doi":"10.1016/j.jpdc.2024.104851","DOIUrl":"10.1016/j.jpdc.2024.104851","url":null,"abstract":"<div><p>In recent decades, the exponential growth of applications has intensified traffic demands, posing challenges in ensuring optimal user experiences within modern networks. Traditional congestion avoidance and control mechanisms embedded in conventional routing struggle to promptly adapt to new-generation networks. Current routing approaches risk-averse outcomes such as (1) scalability constraints, (2) high convergence times, and (3) congestion due to inadequate real-time traffic prioritization. To address these issues, this paper introduces a QoS-Driven Routing Optimization in Software-Defined Networking (SDN) using Deep Reinforcement Learning (DRL) to optimize routing and enhance QoS efficiency. Employing DRL, the proposed DQS optimizes routing decisions by intelligently distributing traffic, guided by a multi-objective function-driven DRL agent that considers both link and queue metrics. Despite the complexity of the network, DQS sustains scalability while significantly reducing convergence times. 
Through a Docker-based Openflow prototype, results highlight a substantial 20-30% reduction in end-to-end delay compared to baseline methods.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139665146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-01DOI: 10.1016/j.jpdc.2024.104850
Zengpeng Li, Huiqun Yu, Guisheng Fan, Jiayin Zhang, Jin Xu
The rapid development of Deep Neural Networks (DNNs) lays solid foundations for Internet of Things systems. However, mobile devices with limited processing capacity and short battery life face difficulties in executing complex DNNs. To satisfy different Quality of Service requirements, a feasible solution is offloading DNN layers to edge nodes and the cloud. The energy-efficient offloading problem for DNN-based applications with deadline and budget constraints in the edge-cloud environment is still an open and challenging issue. To this end, this paper proposes a Hybrid Chaotic Evolutionary Algorithm (HCEA) incorporating diversification and intensification strategies and a DVFS-enabled version of it (HCEA-DVFS). The Archimedes Optimization Algorithm-based diversification strategy exploits global and local guiding information to improve population diversity during the updating process and employs the Metropolis acceptance rule of Simulated Annealing to avoid premature convergence. The Genetic Algorithm-based chaotic intensification strategy is designed to enhance the local search capability of HCEA. Moreover, the Dynamic Voltage Frequency Scaling-enabled adjustment strategies can be embedded into HCEA to further reduce energy consumption by resetting frequency levels and reallocating DNN layers. Experimental results over four DNN-based applications demonstrate that HCEA-DVFS reduces energy consumption under different deadlines, budgets, and workloads by an average of 7.93, 9.68, 11.02, 11.84, and 19.38 percent compared with HCEA, PSO-GA, MCEA, AOA, and Greedy, respectively.
{"title":"Energy-efficient offloading for DNN-based applications in edge-cloud computing: A hybrid chaotic evolutionary approach","authors":"Zengpeng Li, Huiqun Yu, Guisheng Fan, Jiayin Zhang, Jin Xu","doi":"10.1016/j.jpdc.2024.104850","DOIUrl":"10.1016/j.jpdc.2024.104850","url":null,"abstract":"<div><p><span><span><span>The rapid development of Deep Neural Networks (DNNs) lays solid foundations for </span>Internet of Things<span> systems. However, mobile devices with limited processing capacity and short battery life face difficulties in executing complex DNNs. To satisfy different Quality of Service requirements, a feasible solution is offloading DNN layers to edge nodes and the cloud. The energy-efficient offloading problem for DNN-based applications with deadline and budget constraints in the edge-cloud environment is still an open and challenging issue. To this end, this paper proposes a Hybrid Chaotic </span></span>Evolutionary Algorithm<span> (HCEA) incorporating diversification and intensification strategies and a DVFS-enabled version of it (HCEA-DVFS). The Archimedes Optimization Algorithm-based diversification strategy exploits global and local guiding information to improve population diversity during the updating process and employs the Metropolis acceptance rule of Simulated Annealing to avoid premature convergence. The Genetic Algorithm-based chaotic intensification strategy is designed to enhance the local search capability of HCEA. Moreover, the </span></span>Dynamic Voltage Frequency Scaling-enabled adjustment strategies can be embedded into HCEA to further reduce energy consumption by resetting frequency levels and reallocating DNN layers.
Experimental results over four DNN-based applications demonstrate that HCEA-DVFS reduces energy consumption under different deadlines, budgets, and workloads by an average of 7.93, 9.68, 11.02, 11.84, and 19.38 percent compared with HCEA, PSO-GA, MCEA, AOA, and Greedy, respectively.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139664784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
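The abstract above mentions DVFS-enabled adjustment strategies that cut energy by resetting frequency levels. A common way to see why this helps is the CMOS approximation that dynamic energy for a fixed workload grows roughly quadratically with clock frequency, so the lowest frequency level that still meets a layer's deadline minimizes energy. The sketch below illustrates only that single adjustment step; the constant `k`, the frequency levels, and the function name are hypothetical and not taken from the paper.

```python
def pick_frequency(cycles, deadline, freq_levels, k=1e-27):
    """Choose the lowest frequency level that finishes `cycles` of work
    within `deadline` seconds, returning (frequency, energy estimate).

    Uses the rough CMOS model E = k * f^2 * cycles, so slower levels
    that still meet the deadline cost less energy. Purely illustrative.
    """
    for f in sorted(freq_levels):
        if cycles / f <= deadline:          # execution time at level f
            return f, k * f ** 2 * cycles
    f = max(freq_levels)                    # deadline infeasible: run at max
    return f, k * f ** 2 * cycles

# Relaxing the deadline lets the scheduler drop a DNN layer to a lower
# frequency level, shrinking the energy estimate for the same workload.
levels = [1e9, 2e9, 3e9]                    # 1, 2, 3 GHz
f_tight, e_tight = pick_frequency(2e9, deadline=1.0, freq_levels=levels)
f_loose, e_loose = pick_frequency(2e9, deadline=2.0, freq_levels=levels)
```

HCEA-DVFS as described also reallocates DNN layers between devices, edge, and cloud; this sketch covers only the per-layer frequency reset under a fixed placement.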