Reliability assessment for k-ary n-cubes with faulty edges
Pub Date: 2024-03-27 | DOI: 10.1016/j.jpdc.2024.104886
Si-Yu Li , Xiang-Jun Li , Meijie Ma
The g-restricted edge connectivity is an important measurement to assess the reliability of networks. The g-restricted edge connectivity of a connected graph G is the minimum size of a set of edges in G, if it exists, whose deletion separates G and leaves every vertex in the remaining components with at least g neighbors. The k-ary n-cube is an extension of the hypercube network and has many desirable properties. It has been used to build the architecture of the supercomputer Fugaku. This paper establishes that for g ≤ n, the g-restricted edge connectivity of 3-ary n-cubes is 3^⌊g/2⌋(1 + (g mod 2))(2n − g), and the g-restricted edge connectivity of k-ary n-cubes with k ≥ 4 is 2^g(2n − g). These results imply that in Q_n^3 with at most 3^⌊g/2⌋(1 + (g mod 2))(2n − g) − 1 faulty edges, or in Q_n^k (k ≥ 4) with at most 2^g(2n − g) − 1 faulty edges, if each vertex is incident with at least g fault-free edges, then the remaining network is connected.
{"title":"Reliability assessment for k-ary n-cubes with faulty edges","authors":"Si-Yu Li , Xiang-Jun Li , Meijie Ma","doi":"10.1016/j.jpdc.2024.104886","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104886","url":null,"abstract":"<div><p>The <em>g</em>-restricted edge connectivity is an important measurement to assess the reliability of networks. The <em>g</em>-restricted edge connectivity of a connected graph <em>G</em> is the minimum size of a set of edges in <em>G</em>, if it exists, whose deletion separates <em>G</em> and leaves every vertex in the remaining components with at least <em>g</em> neighbors. The <em>k</em>-ary <em>n</em>-cube is an extension of the hypercube network and has many desirable properties. It has been used to build the architecture of the Supercomputer Fugaku. This paper establishes that for <span><math><mi>g</mi><mo>≤</mo><mi>n</mi></math></span>, the <em>g</em>-restricted edge connectivity of 3-ary <em>n</em>-cubes is <span><math><msup><mrow><mn>3</mn></mrow><mrow><mo>⌊</mo><mi>g</mi><mo>/</mo><mn>2</mn><mo>⌋</mo></mrow></msup><mo>(</mo><mn>1</mn><mo>+</mo><mo>(</mo><mi>g</mi><mrow><mspace></mspace><mtext>mod</mtext><mspace></mspace></mrow><mn>2</mn><mo>)</mo><mo>)</mo><mo>(</mo><mn>2</mn><mi>n</mi><mo>−</mo><mi>g</mi><mo>)</mo></math></span>, and the <em>g</em>-restricted edge connectivity of <em>k</em>-ary <em>n</em>-cubes with <span><math><mi>k</mi><mo>≥</mo><mn>4</mn></math></span> is <span><math><msup><mrow><mn>2</mn></mrow><mrow><mi>g</mi></mrow></msup><mo>(</mo><mn>2</mn><mi>n</mi><mo>−</mo><mi>g</mi><mo>)</mo></math></span>. These results imply that in <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mn>3</mn></mrow></msubsup></math></span> with at most <span><math><msup><mrow><mn>3</mn></mrow><mrow><mo>⌊</mo><mi>g</mi><mo>/</mo><mn>2</mn><mo>⌋</mo></mrow></msup><mo>(</mo><mn>1</mn><mo>+</mo><mo>(</mo><mi>g</mi><mrow><mspace></mspace><mtext>mod</mtext><mspace></mspace></mrow><mn>2</mn><mo>)</mo><mo>)</mo><mo>(</mo><mn>2</mn><mi>n</mi><mo>−</mo><mi>g</mi><mo>)</mo><mo>−</mo><mn>1</mn></math></span> faulty edges, or <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup><mo>(</mo><mi>k</mi><mo>≥</mo><mn>4</mn><mo>)</mo></math></span> with at most <span><math><msup><mrow><mn>2</mn></mrow><mrow><mi>g</mi></mrow></msup><mo>(</mo><mn>2</mn><mi>n</mi><mo>−</mo><mi>g</mi><mo>)</mo><mo>−</mo><mn>1</mn></math></span> faulty edges, if each vertex is incident with at least <em>g</em> fault-free edges, then the remaining network is connected.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"190 ","pages":"Article 104886"},"PeriodicalIF":3.8,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140321045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Paired 2-disjoint path covers of k-ary n-cubes under the partitioned edge fault model
Pub Date: 2024-03-27 | DOI: 10.1016/j.jpdc.2024.104887
Hongbin Zhuang , Xiao-Yan Li , Jou-Ming Chang , Ximeng Liu
The k-ary n-cube Q_n^k serves as an indispensable interconnection network in the design of data center networks, networks-on-chip, and parallel computing systems, since it possesses numerous attractive properties. In these parallel architectures, the paired (or unpaired) many-to-many m-disjoint path cover (m-DPC) plays a significant role in message transmission. Nevertheless, the construction of m-DPCs is severely obstructed by large-scale edge faults arising from the rapid growth of system scale. In this paper, we investigate the existence of paired 2-DPCs in Q_n^k under the partitioned edge fault (PEF) model, a novel fault model for enhancing a network's fault tolerance with respect to the path embedding problem. We exploit this model to evaluate the edge fault tolerance of Q_n^k when a paired 2-DPC is embedded into it. Compared to other known works, our results help Q_n^k achieve large-scale edge fault tolerance.
{"title":"Paired 2-disjoint path covers of k-ary n-cubes under the partitioned edge fault model","authors":"Hongbin Zhuang , Xiao-Yan Li , Jou-Ming Chang , Ximeng Liu","doi":"10.1016/j.jpdc.2024.104887","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104887","url":null,"abstract":"<div><p>The <em>k</em>-ary <em>n</em>-cube <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup></math></span> serves as an indispensable interconnection network in the design of data center networks, network-on-chips, and parallel computing systems since it possesses numerous attractive properties. In these parallel architectures, the paired (or unpaired) many-to-many <em>m</em>-disjoint path cover (<em>m</em>-DPC) plays a significant role in message transmission. Nevertheless, the construction of <em>m</em>-DPC is severely obstructed by large-scale edge faults due to the rapid growth of the system scale. In this paper, we investigate the existence of paired 2-DPC in <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup></math></span> under the partitioned edge fault (PEF) model, which is a novel fault model for enhancing the networks' fault-tolerance related to path embedding problem. We exploit this model to evaluate the edge fault-tolerance of <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup></math></span> when a paired 2-DPC is embedded into <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup></math></span>. Compared to the other known works, our results can help <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup></math></span> to achieve large-scale edge fault-tolerance.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"190 ","pages":"Article 104887"},"PeriodicalIF":3.8,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140344268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A distributed learning based on robust diffusion SGD over adaptive networks with noisy output data
Pub Date: 2024-03-26 | DOI: 10.1016/j.jpdc.2024.104883
Fatemeh Barani , Abdorreza Savadi , Hadi Sadoghi Yazdi
Outliers and noise are unavoidable factors that can severely degrade the performance of distributed learning algorithms. Developing a robust algorithm is vital in applications such as system identification and stock market forecasting, in which noise on the desired signals may strongly divert the solutions. In this paper, we propose a Robust Diffusion Stochastic Gradient Descent (RDSGD) algorithm based on the pseudo-Huber loss function, which can significantly suppress the effect of Gaussian and non-Gaussian noise on estimation performance in adaptive networks. The performance and convergence behavior of RDSGD are assessed in the presence of α-stable and mixed-Gaussian noise in stationary and non-stationary environments. Simulation results show that the proposed algorithm achieves both a higher convergence rate and lower steady-state misadjustment than conventional diffusion algorithms and several robust algorithms.
{"title":"A distributed learning based on robust diffusion SGD over adaptive networks with noisy output data","authors":"Fatemeh Barani , Abdorreza Savadi , Hadi Sadoghi Yazdi","doi":"10.1016/j.jpdc.2024.104883","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104883","url":null,"abstract":"<div><p>Outliers and noises are unavoidable factors that cause performance of the distributed learning algorithms to be severely reduced. Developing a robust algorithm is vital in applications such as system identification and forecasting stock market, in which noise on the desired signals may intensely divert the solutions. In this paper, we propose a Robust Diffusion Stochastic Gradient Descent (RDSGD) algorithm based on the pseudo-Huber loss function which can significantly suppress the effect of Gaussian and non-Gaussian noises on estimation performances in the adaptive networks. Performance and convergence behavior of RDSGD are assessed in presence of the <em>α</em>-stable and Mixed-Gaussian noises in the stationary and non-stationary environments. Simulation results show that the proposed algorithm can achieve both higher convergence rate and lower steady-state misadjustment than the conventional diffusion algorithms and several robust algorithms.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"190 ","pages":"Article 104883"},"PeriodicalIF":3.8,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140328744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel framework for generic Spark workload characterization and similar pattern recognition using machine learning
Pub Date: 2024-03-26 | DOI: 10.1016/j.jpdc.2024.104881
Mariano Garralda-Barrio, Carlos Eiras-Franco, Verónica Bolón-Canedo
Comprehensive workload characterization plays a pivotal role in understanding Spark applications, as it enables the analysis of diverse aspects and behaviors. This understanding is indispensable for devising downstream tuning objectives, such as performance improvement. To address this pivotal issue, our work introduces a novel and scalable framework for generic Spark workload characterization, complemented by consistent geometric measurements. The presented approach builds robust workload descriptors by profiling only quantitative metrics at the application task level, in a non-intrusive manner. We extend our framework to downstream workload pattern recognition by incorporating unsupervised machine learning techniques: clustering algorithms and feature selection. These techniques significantly improve the process of grouping similar workloads without relying on predefined labels. We effectively recognize 24 representative Spark workloads from diverse domains, including SQL, machine learning, web search, graph, and micro-benchmarks, available in HiBench. Our framework achieves an F-Measure of up to 90.9% and a Normalized Mutual Information of up to 94.5% in similar workload pattern recognition, significantly outperforming an established workload characterization approach from the literature in a comparative analysis.
{"title":"A novel framework for generic Spark workload characterization and similar pattern recognition using machine learning","authors":"Mariano Garralda-Barrio, Carlos Eiras-Franco, Verónica Bolón-Canedo","doi":"10.1016/j.jpdc.2024.104881","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104881","url":null,"abstract":"<div><p>Comprehensive workload characterization plays a pivotal role in comprehending Spark applications, as it enables the analysis of diverse aspects and behaviors. This understanding is indispensable for devising downstream tuning objectives, such as performance improvement. To address this pivotal issue, our work introduces a novel and scalable framework for generic Spark workload characterization, complemented by consistent geometric measurements. The presented approach aims to build robust workload descriptors by profiling only quantitative metrics at the application task-level, in a non-intrusive manner. We expand our framework for downstream workload pattern recognition by incorporating unsupervised machine learning techniques: clustering algorithms and feature selection. These techniques significantly improve the process of grouping similar workloads without relying on predefined labels. We effectively recognize 24 representative Spark workloads from diverse domains, including SQL, machine learning, web search, graph, and micro-benchmarks, available in HiBench. Our framework achieves a high accuracy F-Measure score of up to 90.9% and a Normalized Mutual Information of up to 94.5% in similar workload pattern recognition. These scores significantly outperform the results obtained in a comparative analysis with an established workload characterization approach in the literature.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"189 ","pages":"Article 104881"},"PeriodicalIF":3.8,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524000455/pdfft?md5=f38d6d7d46cfa72abd25c2f3150c7112&pid=1-s2.0-S0743731524000455-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140309738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud-edge-end workflow scheduling with multiple privacy levels
Pub Date: 2024-03-25 | DOI: 10.1016/j.jpdc.2024.104882
Shuang Wang , Zian Yuan , Xiaodong Zhang , Jiawen Wu , Yamin Wang
The cloud-edge-end architecture satisfies the execution requirements of various workflow applications. However, owing to the diversity of resources, the complex hierarchical structure, and users' differing privacy requirements, determining how to lease suitable cloud-edge-end resources, schedule workflow tasks with multiple privacy levels, and optimize leasing costs is currently one of the key challenges in cloud computing. In this paper, we address the scheduling optimization problem for workflow applications containing tasks with multiple privacy levels. To tackle this problem, we propose a heuristic privacy-preserving workflow scheduling algorithm (PWHSA) designed to minimize rental costs. The algorithm comprises time parameter estimation, task sub-deadline division, scheduling sequence generation, task scheduling, and task adjustment, with candidate strategies developed for each component; the candidate strategies in each step are statistically calibrated across a comprehensive set of workflow instances. We compare the proposed algorithm with modified classical algorithms that target similar problems. The experimental results demonstrate that PWHSA outperforms the comparison algorithms while maintaining acceptable execution times.
{"title":"Cloud-edge-end workflow scheduling with multiple privacy levels","authors":"Shuang Wang , Zian Yuan , Xiaodong Zhang , Jiawen Wu , Yamin Wang","doi":"10.1016/j.jpdc.2024.104882","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104882","url":null,"abstract":"<div><p>The cloud-edge-end architecture satisfies the execution requirements of various workflow applications. However, owing to the diversity of resources, the complex hierarchical structure, and different privacy requirements for users, determining how to lease suitable cloud-edge-end resources, schedule multi-privacy-level workflow tasks, and optimize leasing costs is currently one of the key challenges in cloud computing. In this paper, we address the scheduling optimization problem of workflow applications containing tasks with multiple privacy levels. To tackle this problem, we propose a heuristic privacy-preserving workflow scheduling algorithm (PWHSA) designed to minimize rental costs which includes time parameter estimation, task sub-deadline division, scheduling sequence generation, task scheduling, and task adjustment, with candidate strategies developed for each component. These candidate strategies in each step undergo statistical calibration across a comprehensive set of workflow instances. We compare the proposed algorithm with modified classical algorithms that target similar problems. The experimental results demonstrate that the PWHSA algorithm outperforms the comparison algorithms while maintaining acceptable execution times.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"189 ","pages":"Article 104882"},"PeriodicalIF":3.8,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140309231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SCIPIS: Scalable and concurrent persistent indexing and search in high-end computing systems
Pub Date: 2024-03-25 | DOI: 10.1016/j.jpdc.2024.104878
Alexandru Iulian Orhean , Anna Giannakou , Lavanya Ramakrishnan , Kyle Chard , Boris Glavic , Ioan Raicu
While it is now routine to search for data on a personal computer or discover data online, there is no equivalent method for discovering data on the large parallel and distributed file systems commonly deployed on HPC systems. In contrast to web search, which deals with a large number of relatively small files, HPC applications also need efficient indexing of large files. We propose SCIPIS, an indexing and search framework that exploits the properties of modern high-end computing systems, with many-core architectures, multiple NUMA nodes, and multiple NVMe storage devices. SCIPIS supports building and searching persistent TF-IDF indexes, and can deliver orders of magnitude better performance than state-of-the-art approaches. We achieve scalable, high-performance indexing by decomposing the indexing process into separate components that can be optimized independently, by building disk-friendly data structures in memory that can be persisted with long sequential writes, and by avoiding communication between indexing threads that collaboratively build an index over a collection of large files. We evaluated SCIPIS with three types of datasets (logs, scientific data, and metadata) on systems with up to 192 cores, 768 GiB of RAM, 8 NUMA nodes, and up to 16 NVMe drives, and achieved up to 29x faster indexing than Apache Lucene while maintaining similar search latency.
{"title":"SCIPIS: Scalable and concurrent persistent indexing and search in high-end computing systems","authors":"Alexandru Iulian Orhean , Anna Giannakou , Lavanya Ramakrishnan , Kyle Chard , Boris Glavic , Ioan Raicu","doi":"10.1016/j.jpdc.2024.104878","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104878","url":null,"abstract":"<div><p>While it is now routine to search for data on a personal computer or discover data online, there is no such equivalent method for discovering data on large parallel and distributed file systems commonly deployed on HPC systems. In contrast to web search, which has to deal with a larger number of relatively small files, in HPC applications there is a need to also support efficient indexing of large files. We propose SCIPIS, an indexing and search framework, that can exploit the properties of modern high-end computing systems, with many-core architectures, multiple NUMA nodes and multiple NVMe storage devices. SCIPIS supports building and searching TFIDF persistent indexes, and can deliver orders of magnitude better performance than state-of-the-art approaches. We achieve scalability and performance of indexing by decomposing the indexing process into separate components that can be optimized independently, by building disk-friendly data structures in-memory that can be persisted in long sequential writes, and by avoiding communication between indexing threads that collaboratively build an index over a collection of large files. We evaluated SCIPIS with three types of datasets (logs, scientific data, and metadata), on systems with configurations up to 192-cores, 768 GiB of RAM, 8 NUMA nodes, and up to 16 NVMe drives, and achieved up to 29x better indexing while maintaining similar search latency when compared to Apache Lucene.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"189 ","pages":"Article 104878"},"PeriodicalIF":3.8,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140321203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning-driven hybrid scaling for multi-type services in cloud
Pub Date: 2024-03-24 | DOI: 10.1016/j.jpdc.2024.104880
Haitao Zhang, Tongyu Guo, Wei Tian, Huadong Ma
In order to deal with the fast-changing requirements of container-based services in clouds, auto-scaling is an essential mechanism for adapting the number of provisioned resources to variable service workloads. However, the latest auto-scaling approaches lack comprehensive consideration of variable workloads and of hybrid auto-scaling for multi-type services. First, proactive approaches based on historical data are widely used to handle complex and variable workloads in advance. Their decision-making accuracy depends on the prediction algorithm, which is affected by anomalies, missing values, and errors in the historical workload data, and unexpected workloads cannot be handled. Second, trigger-based reactive approaches are seriously affected by workload fluctuation, which causes frequent invalid scaling of service resources. Besides, because scaling takes time, different scaling actions have different completion delays. Third, the latest approaches also ignore the different scaling times of hybrid scaling for multi-type services, including stateful and stateless services. In particular, when stateful services are scaled horizontally, the neglected long scaling time causes untimely supply and withdrawal of resources. Consequently, all three issues can lead to degraded Quality of Service (QoS) and inefficient utilization of resources. This paper proposes a new hybrid auto-scaling approach for multi-type services that resolves the impact of service scaling time on decision making. We combine a proactive scaling strategy with a reactive anomaly detection and correction mechanism. For proactive decisions, an ensemble learning model with a structurally improved deep network is designed to predict future workload. Based on the predicted results and the scaling times of different types of services, auto-scaling decisions are made by a Deep Reinforcement Learning (DRL) model with a heterogeneous action space that integrates horizontal and vertical scaling actions. Meanwhile, the anomaly detection and correction mechanism detects and handles workload fluctuation and unexpected workloads. We evaluate our approach against three different proactive and reactive auto-scaling approaches in a cloud environment, and the experimental results show that the proposed approach achieves better scaling behavior than state-of-the-art approaches.
{"title":"Learning-driven hybrid scaling for multi-type services in cloud","authors":"Haitao Zhang, Tongyu Guo, Wei Tian, Huadong Ma","doi":"10.1016/j.jpdc.2024.104880","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104880","url":null,"abstract":"<div><p>In order to deal with the fast changing requirements of container based services in clouds, auto-scaling is used as an essential mechanism for adapting the number of provisioned resources with the variable service workloads. However, the latest auto-scaling approaches lack the comprehensive consideration of variable workloads and hybrid auto-scaling for multi-type services. Firstly, the historical data based proactive approaches are widely used to handle complex and variable workloads in advance. The decision-making accuracy of proactive approaches depends on the prediction algorithm, which is affected by the anomalies, missing values and errors in the historical workload data, and the unexpected workload cannot be handled. Secondly, the trigger based reactive approaches are seriously affected by workload fluctuation which causes the frequent invalid scaling of service resources. Besides, due to the existence of scaling time, there are different completion delays of different scaling actions. Thirdly, the latest approaches also ignore the different scaling time of hybrid scaling for multi-type services including stateful services and stateless services. Especially, when the stateful services are scaled horizontally, the neglected long scaling time causes the untimely supply and withdrawal of resources. Consequently, all three issues above can lead to the degradation of Quality of Services (QoS) and the inefficient utilization of resources. This paper proposes a new hybrid auto-scaling approach for multi-type services to resolve the impact of service scaling time on decision making. We combine the proactive scaling strategy with the reactive anomaly detection and correction mechanism. For making a proactive decision, the ensemble learning model with the structure improved deep network is designed to predict the future workload. On the basis of the predicted results and the scaling time of different types of services, the auto-scaling decisions are made by a Deep Reinforcement Learning (DRL) model with heterogeneous action space, which integrates horizontal and vertical scaling actions. Meanwhile, with the anomaly detection and correction mechanism, the workload fluctuation and unexpected workload can be detected and handled. We evaluate our approach against three different proactive and reactive auto-scaling approaches in the cloud environment, and the experimental results show the proposed approach can achieve the better scaling behavior compared to state-of-the-art approaches.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"189 ","pages":"Article 104880"},"PeriodicalIF":3.8,"publicationDate":"2024-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140295918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A characterization of soft-error sensitivity in data-parallel and model-parallel distributed deep learning
Pub Date: 2024-03-21 | DOI: 10.1016/j.jpdc.2024.104879
Elvis Rojas , Diego Pérez , Esteban Meneses
The latest advances in artificial intelligence deep learning models are unprecedented. A wide spectrum of application areas is now thriving thanks to available massive training datasets and gigantic, complex neural network models. Those two characteristics demand outstanding computing power that only advanced computing platforms can provide. Therefore, distributed deep learning has become a necessity for capitalizing on the potential of cutting-edge artificial intelligence. Two basic schemes have emerged in distributed learning. First, the data-parallel approach, which divides the training dataset across multiple computing nodes. Second, the model-parallel approach, which splits the layers of a model across several computing nodes. Each scheme has its upsides and downsides, particularly when running on large machines that are susceptible to soft errors. Those errors occur as a consequence of several factors involved in the manufacturing process of the electronic components of current supercomputers. On many occasions, those errors manifest as bit flips that do not cause the whole system to crash, but generate wrong numerical results in computations. To study the effect of soft errors on different approaches to distributed learning, we leverage checkpoint alteration, a technique that injects bit flips into checkpoint files. It allows researchers to understand the effect of soft errors on applications that produce checkpoint files in HDF5 format. This paper uses the popular PyTorch deep learning framework on two distributed-learning platforms: one for data-parallel training and one for model-parallel training. We use well-known deep learning models with popular training datasets to provide a picture of how soft errors challenge the training phase of a deep learning model.
{"title":"A characterization of soft-error sensitivity in data-parallel and model-parallel distributed deep learning","authors":"Elvis Rojas , Diego Pérez , Esteban Meneses","doi":"10.1016/j.jpdc.2024.104879","DOIUrl":"10.1016/j.jpdc.2024.104879","url":null,"abstract":"<div><p>The latest advances in artificial intelligence deep learning models are unprecedented. A wide spectrum of application areas is now thriving thanks to available massive training datasets and gigantic complex neural network models. Those two characteristics demand outstanding computing power that only advanced computing platforms can provide. Therefore, distributed deep learning has become a necessity in capitalizing on the potential of cutting-edge artificial intelligence. Two basic schemes have emerged in distributed learning. First, the data-parallel approach, which aims at dividing the training dataset into multiple computing nodes. Second, the model-parallel approach, which splits layers of a model into several computing nodes. Each scheme has its upsides and downsides, particularly when running on large machines that are susceptible to soft errors. Those errors occur as a consequence of several factors involved in the manufacturing process of current electronic components of supercomputers. On many occasions, those errors are expressed as bit flips that do not cause the whole system to crash, but generate wrong numerical results in computations. To study the effect of soft error on different approaches for distributed learning, we leverage checkpoint alteration, a technique that injects bit flips on checkpoint files. It allows researchers to understand the effect of soft errors on applications that produce checkpoint files in HDF5 format. This paper uses the popular deep learning PyTorch tool on two distributed-learning platforms: one for data-parallel training and one for model-parallel training. We use well-known deep learning models with popular training datasets to provide a picture of how soft errors challenge the training phase of a deep learning model.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"190 ","pages":"Article 104879"},"PeriodicalIF":3.8,"publicationDate":"2024-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140282552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Public cloud object storage auditing: Design, implementation, and analysis
Pub Date: 2024-03-09 | DOI: 10.1016/j.jpdc.2024.104870
Fei Chen , Fengming Meng , Zhipeng Li , Li Li , Tao Xiang
Cloud storage auditing is a technique that enables a user to remotely check the integrity of outsourced data in cloud storage. Although researchers have proposed various protocols for cloud storage auditing, the proposed schemes are theoretical in nature and do not fit existing mainstream cloud storage service practices. To bridge this gap, this paper proposes a cloud storage auditing system that works with current mainstream cloud object storage services. We design the proposed system on top of existing proof of data possession (PDP) schemes and make them practical and usable in the real world. Specifically, we propose an architecture that separates the compute and storage functionalities of a storage auditing scheme. Because cloud object storage only provides read and write interfaces, we leverage a cloud virtual machine to implement the user-defined computations that a PDP scheme requires. We store the authentication tags of the outsourced data as an independent object so that existing popular cloud storage applications, e.g., online file previewing, continue to work. We also present a cost model to analyze the economic cost of a cloud storage auditing scheme. The cost model allows a user to balance security, efficiency, and economic cost by tuning various system parameters. We implemented and open-sourced the proposed system over a mainstream cloud object storage service. Experimental analysis shows that the proposed system is efficient and promising for production use. Specifically, for 40 GB of data, the proposed system incurs only 1.66% additional storage cost, 3796 bytes of communication cost, a maximum auditing time of 2.9 seconds, and a monetary cost of 0.9 CNY per audit.
{"title":"Public cloud object storage auditing: Design, implementation, and analysis","authors":"Fei Chen , Fengming Meng , Zhipeng Li , Li Li , Tao Xiang","doi":"10.1016/j.jpdc.2024.104870","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104870","url":null,"abstract":"<div><p>Cloud storage auditing is a technique that enables a user to remotely check the integrity of the outsourced data in the cloud storage. Although researchers have proposed various protocols for cloud storage auditing, the proposed schemes are theoretical in nature, which are not fit for existing mainstream cloud storage service practices. To bridge this gap, this paper proposes a cloud storage auditing system that works for current mainstream cloud <em>object storage</em> services. We design the proposed system over existing proof of data possession (PDP) schemes and make them practical as well as usable in the real world. Specifically, we propose an architecture that separates the compute and storage functionalities of a storage auditing scheme. Because cloud object storage only provides <span>read</span> and <span>write</span> interfaces, we leverage a cloud virtual machine to implement the user-defined computations that are needed in a PDP scheme. We store the authentication tags of the outsourced data as an independent object to allow existing popular cloud storage applications, e.g., file online previewing. We also present a cost model to analyze the economic cost of a cloud storage auditing scheme. The cost model allows a user to balance security, efficiency, and economic cost by tuning various system parameters. We implemented, open-sourced the proposed system over a mainstream cloud object storage service. Experimental analysis shows that the proposed system is pretty efficient and promising for a production environment usage. Specifically, for a 40 GB sized data, the proposed system only incurs 1.66% additional storage cost, 3796 bytes communication cost, 2.9 seconds maximum auditing time cost, and 0.9 CNY per auditing monetary cost.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"189 ","pages":"Article 104870"},"PeriodicalIF":3.8,"publicationDate":"2024-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140122500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}