Journal of Parallel and Distributed Computing: Latest Articles

Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)
IF 3.8 Zone 3 (Computer Science) Q1 Mathematics Pub Date: 2024-04-12 DOI: 10.1016/S0743-7315(24)00058-3
Citations: 0
Exploiting inherent elasticity of serverless in algorithms with unbalanced and irregular workloads
IF 3.8 Zone 3 (Computer Science) Q1 Mathematics Pub Date: 2024-04-10 DOI: 10.1016/j.jpdc.2024.104891
Gerard Finol, Gerard París, Pedro García-López, Marc Sánchez-Artigas

The Function-as-a-Service (FaaS) execution model in serverless computing has been successful in running large-scale computations such as MapReduce, linear algebra, and machine learning. However, little attention has been given to executing highly dynamic parallel applications with unbalanced and irregular workloads. These algorithms are difficult to execute with good parallel efficiency because the required computing resources are hard to provision in time, which leads to resource over- and under-provisioning in clusters of static size. We propose that the elasticity and fine-grained pay-as-you-go billing of the FaaS model can be a key enabler for effectively running these algorithms in the cloud. We use a simple serverless executor pool abstraction and evaluate it using three algorithms with unbalanced and irregular workloads. Results show that their serverless implementations can outperform a static Spark cluster of large virtual machines by up to 55% at the same cost, and can even outperform a single large virtual machine running locally.
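
To make the idea of an executor pool for irregular workloads concrete, here is a minimal sketch. It is not the authors' implementation: a local `ThreadPoolExecutor` stands in for a pool of serverless functions, and the task `expand`, which spawns a data-dependent number of children, is a hypothetical example of an unbalanced workload. Work is submitted as it is discovered, so the amount of parallelism follows the shape of the computation.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def expand(node):
    """Hypothetical irregular task: visit one node and return its children.

    A node is (remaining_depth, value); the number of children (0 to 3)
    depends on the value, so the resulting tree is unbalanced.
    """
    depth, value = node
    if depth == 0:
        return 1, []
    n_children = value % 4
    children = [(depth - 1, value * 31 + i + 1) for i in range(n_children)]
    return 1, children


def count_nodes(root, max_workers=8):
    """Count tree nodes, submitting new tasks to the pool as they appear."""
    total = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(expand, root)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                visited, children = future.result()
                total += visited
                for child in children:
                    pending.add(pool.submit(expand, child))
    return total


def count_nodes_sequential(root):
    """Reference implementation used to check the pooled version."""
    stack, total = [root], 0
    while stack:
        visited, children = expand(stack.pop())
        total += visited
        stack.extend(children)
    return total
```

In a real FaaS setting each `submit` would map to a function invocation that is billed only while it runs, which is the property the abstract argues for; the thread pool here only mimics the programming interface.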

Citations: 0
Optimizing DNN training with pipeline model parallelism for enhanced performance in embedded systems
IF 3.8 Zone 3 (Computer Science) Q1 Mathematics Pub Date: 2024-04-06 DOI: 10.1016/j.jpdc.2024.104890
Md Al Maruf, Akramul Azim, Nitin Auluck, Mansi Sahi

Deep Neural Networks (DNNs) have gained widespread popularity across application domains due to their strong performance. Despite the prevalence of massively parallel multi-core processor architectures, adopting large DNN models in embedded systems remains challenging, as most embedded applications are designed with single-core processors in mind. This limits DNN adoption in embedded systems because model parallelization and workload partitioning are leveraged inefficiently. Prior solutions attempt to address these challenges using data and model parallelism. However, they fall short in finding optimal DNN model partitions and distributing them efficiently to achieve improved performance.

This paper proposes a DNN model parallelism framework to accelerate model training by finding the optimal number of model partitions and resource provisions. The proposed framework combines data and model parallelism techniques to optimize the parallel processing of DNNs for embedded applications. In addition, it implements the pipeline execution of the partitioned models and integrates a task controller to manage the computing resources. The experimental results for image object detection demonstrate the applicability of our proposed framework in estimating the latest execution time and reducing overall model training time by almost 44.87% compared to the baseline AlexNet convolutional neural network (CNN) model.
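
The trade-off behind choosing a number of model partitions can be illustrated with a small timing model. This is a sketch under stated assumptions, not the framework from the paper: it models only the forward pass, each stage handles one micro-batch at a time, and `best_partition_count` uses a hypothetical cost model in which the work splits evenly and every stage adds a fixed overhead.

```python
def pipeline_makespan(stage_times, n_microbatches):
    """Finish time of the last micro-batch on the last stage.

    stage_times[s] is the time stage s needs per micro-batch. A micro-batch
    can start on a stage once it has left the previous stage and the stage
    itself is free. Requires n_microbatches >= 1.
    """
    finish_prev_stage = [0.0] * n_microbatches
    for t in stage_times:
        finish = []
        stage_free_at = 0.0
        for m in range(n_microbatches):
            start = max(finish_prev_stage[m], stage_free_at)
            stage_free_at = start + t
            finish.append(stage_free_at)
        finish_prev_stage = finish
    return finish_prev_stage[-1]


def best_partition_count(total_work, n_microbatches, max_stages, overhead_per_stage):
    """Stage count with the smallest makespan under the assumed cost model."""
    def cost(n_stages):
        per_stage = total_work / n_stages + overhead_per_stage
        return pipeline_makespan([per_stage] * n_stages, n_microbatches)

    return min(range(1, max_stages + 1), key=cost)
```

With no per-stage overhead, more partitions always help in this model; once an overhead is added, the best count moves to a smaller value, which is why the number of partitions has to be searched for rather than maximized.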

Citations: 0
A two-dimensional time-aware cloud service recommendation approach with enhanced similarity and trust
IF 3.8 Zone 3 (Computer Science) Q1 Mathematics Pub Date: 2024-04-05 DOI: 10.1016/j.jpdc.2024.104889
Chunhua Tang, Shuangyao Zhao, Binbin Chen, Xiaonong Lu, Qiang Zhang

Collaborative Filtering (CF) is one of the most successful techniques for quality-of-service (QoS) prediction and cloud service recommendation. However, individual QoS values are time-sensitive and fluctuate, causing the QoS predicted by CF to deviate from the actual values. In addition, existing CF approaches ignore inauthentic QoS values given by untrustworthy users. To address these problems, we develop a two-dimensional time-aware and trust-aware service recommendation approach (TaTruSR). First, considering both the timeliness and the fluctuation of service QoS, an integrative method that incorporates time weight (time dimension) and temporal certainty (QoS dimension) is proposed to determine the contribution of co-invoked services. Time weight is computed by a personalized logistic decay function to measure QoS changes by weighting the length of the time interval, while temporal certainty is defined by entropy to capture the degree of QoS fluctuation over a period of time. Second, a set of the most similar and trusted neighbors can be identified from the time-aware similarity model and trust model. In these models, the direct similarity and local trust are calculated based on the QoS ratings and the contribution of co-invoked services to improve the prediction accuracy and eliminate unreliable QoS. The indirect similarity and global trust are estimated based on user relationship networks to alleviate the data sparsity problem. Finally, missing QoS prediction and reliable service recommendation for the active user can be achieved based on enhanced similarity and trust. A case study and experimental evaluation on real-world datasets demonstrate the practicality and accuracy of the proposed approach.
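
The two ingredients of the contribution weight can be sketched as follows. The abstract does not give the formulas, so everything here is an assumption made for illustration: the logistic form and its `midpoint` and `steepness` parameters, the equal-width binning, and the choice to combine the two factors by multiplication.

```python
import math
from collections import Counter


def logistic_time_weight(delta_t, midpoint=30.0, steepness=0.2):
    """Weight in (0, 1) that decays as the time gap delta_t grows.

    midpoint and steepness are hypothetical per-user parameters.
    """
    return 1.0 / (1.0 + math.exp(steepness * (delta_t - midpoint)))


def temporal_certainty(qos_values, n_bins=5):
    """1 minus the normalized Shannon entropy of binned QoS observations.

    Returns 1.0 when the QoS never changes and values near 0.0 when the
    observations are spread evenly over all bins.
    """
    lo, hi = min(qos_values), max(qos_values)
    if hi == lo:
        return 1.0
    width = (hi - lo) / n_bins
    bins = Counter(min(int((v - lo) / width), n_bins - 1) for v in qos_values)
    n = len(qos_values)
    entropy = -sum((c / n) * math.log(c / n) for c in bins.values())
    return 1.0 - entropy / math.log(n_bins)


def contribution(delta_t, qos_values):
    """Assumed combination: recent and stable observations count the most."""
    return logistic_time_weight(delta_t) * temporal_certainty(qos_values)
```

A recent invocation of a service with steady response times gets a weight close to 1, while an old invocation or a service whose QoS jumps between bins is discounted before it enters any similarity computation.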

Citations: 0
Parameter identification method of a reaction-diffusion network information propagation system based on optimization theory
IF 3.8 Zone 3 (Computer Science) Q1 Mathematics Pub Date: 2024-04-03 DOI: 10.1016/j.jpdc.2024.104888
Yi Ding, Linhe Zhu

Rumors now spread rapidly on the Internet. This paper first establishes a reaction-diffusion system with an Allee effect to describe the rumor spreading process and derives the necessary conditions for the emergence of Turing bifurcation. Next, a parameter identification approach based on optimal control theory is presented. Finally, the impact of the magnitude of certain parameters in the objective function on parameter identification is examined through numerous parameter identifications in continuous space and on various complex networks. Additionally, the convergence rates and error magnitudes of different parameter identification algorithms are studied across different spatial structures.

Citations: 0
Fast knowledge graph completion using graphics processing units
IF 3.8 Zone 3 (Computer Science) Q1 Mathematics Pub Date: 2024-03-28 DOI: 10.1016/j.jpdc.2024.104885
Chun-Hee Lee, Dong-oh Kang, Hwa Jeon Song

Knowledge graphs can be used in many areas related to data semantics, such as question-answering systems and knowledge-based systems. However, currently constructed knowledge graphs are incomplete and need to be supplemented with additional relations. This task is called knowledge graph completion. To add new relations to an existing knowledge graph by using knowledge graph embedding models, we have to evaluate N×N×R vector operations, where N is the number of entities and R is the number of relation types, which is very costly.

In this paper, we provide an efficient knowledge graph completion framework on GPUs that derives new relations from knowledge graph embedding vectors. In the proposed framework, we first define what it means for a model to be "transformable to a metric space" and then provide a method to transform the knowledge graph completion problem into a similarity join problem for any model that is transformable to a metric space. After that, to process the similarity join problem efficiently, we derive formulas using the properties of a metric space. Based on these formulas, we develop a fast knowledge graph completion algorithm. Finally, we show experimentally that our framework can efficiently process the knowledge graph completion problem.
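
The reduction from completion to a similarity join can be sketched on the CPU as follows. Several things here are assumptions rather than the paper's method: a translation-style embedding in which a triple (h, r, t) is plausible when the distance between h + r and t is at most `eps`, Euclidean distance as the metric, and a single-pivot filter based on the triangle inequality standing in for the formulas the authors derive. The abstract names neither the embedding model nor the pruning rule.

```python
import math
from itertools import product


def complete_brute_force(entities, relations, eps):
    """Score all N x N x R candidate triples and keep those within eps."""
    found = set()
    for (h, hv), (r, rv), (t, tv) in product(
        entities.items(), relations.items(), entities.items()
    ):
        if h == t:
            continue
        translated = [a + b for a, b in zip(hv, rv)]
        if math.dist(translated, tv) <= eps:
            found.add((h, r, t))
    return found


def complete_with_pivot(entities, relations, eps, pivot):
    """Same result as a similarity join between {h + r} and {t}.

    For any metric d, |d(q, p) - d(t, p)| <= d(q, t). If the left side
    exceeds eps, the pair cannot match and the full distance is skipped.
    """
    pivot_dist = {t: math.dist(tv, pivot) for t, tv in entities.items()}
    found = set()
    for r, rv in relations.items():
        for h, hv in entities.items():
            query = [a + b for a, b in zip(hv, rv)]
            query_dist = math.dist(query, pivot)
            for t, tv in entities.items():
                if t == h or abs(query_dist - pivot_dist[t]) > eps:
                    continue  # pruned by the triangle inequality
                if math.dist(query, tv) <= eps:
                    found.add((h, r, t))
    return found


# Toy data with hypothetical two-dimensional embeddings.
ENTITIES = {
    "paris": (0.0, 0.0),
    "france": (1.0, 0.0),
    "berlin": (5.0, 5.0),
    "germany": (6.0, 5.0),
}
RELATIONS = {"capital_of": (1.0, 0.0)}
```

Both functions return the same set of triples on this toy data; the pivot version computes one distance per entity up front and then skips most of the pairwise distance evaluations, which is the kind of saving that metric-space properties make possible.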

Citations: 0
A novel HPL-AI approach for FP16-only accelerator and its instantiation on Kunpeng+Ascend AI-specific platform
IF 3.8 Zone 3 (Computer Science) Q1 Mathematics Pub Date: 2024-03-27 DOI: 10.1016/j.jpdc.2024.104884
Zijian Cao, Qiao Sun, Wenhao Yang, Changcheng Song, Zhe Wang, Huiyuan Li

HPL-AI, also known as HPL-MxP, is a new benchmark program used to evaluate the upper-bound performance of AI-related tasks on a specific computing cluster. It solves a large system of linear equations in FP64, preconditioned by a complete LU factorization in lower precision. In this paper, we propose a new HPL-AI approach that relies on the factorization of the coefficient matrix in mixed precision: FP32 diagonals and FP16 off-diagonals. Without compromising the quality of the resultant LU preconditioner, the proposed approach uses only the FP16 dense matrix multiplication primitive on the accelerator, maximizing the FP16 throughput. Numerical analysis and experiments validate our approach, ensuring avoidance of numerical underflow or overflow during factorization. We implement the proposed approach on Kunpeng+Ascend clusters, a novel AI-specific platform with exceedingly high FP16 peak performance. By applying various optimization techniques, including 2D lookahead, an HCCL-based communication pipeline, and SYCL-based task overlapping, we achieve 975 TFlops on a single node and nearly 100 PFlops on a cluster of 128 nodes, with a weak scalability of 79.8%.
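
The general pattern of factorizing in low precision and recovering FP64 accuracy afterwards can be sketched in pure Python. This is a simplified illustration and not the approach of the paper: all LU entries are rounded to FP16 (the paper keeps FP32 diagonals), there is no pivoting (so the matrix is assumed to be diagonally dominant), and plain iterative refinement is used as the correction loop. The helper names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float (FP64) to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def lu_low_precision(a):
    """Doolittle LU without pivoting; every stored entry is rounded to FP16."""
    n = len(a)
    lu = [[to_fp16(v) for v in row] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] = to_fp16(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_fp16(lu[i][j] - lu[i][k] * lu[k][j])
    return lu


def lu_solve(lu, b):
    """Solve L U x = b, with L unit lower triangular, in FP64 arithmetic."""
    n = len(b)
    x = list(b)
    for i in range(n):  # forward substitution
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):  # back substitution
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def residual(a, x, b):
    """FP64 residual b - A x."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]


def solve_refined(a, b, iters=10):
    """Use the FP16 factors as a preconditioner and refine in FP64."""
    lu = lu_low_precision(a)
    x = lu_solve(lu, b)
    for _ in range(iters):
        correction = lu_solve(lu, residual(a, x, b))
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x
```

The first solve with the FP16 factors is only as accurate as half precision allows; each refinement step reuses the same cheap factors and shrinks the FP64 residual further, which is why the expensive factorization can run entirely on low-precision hardware.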

Citations: 0
Reliability assessment for k-ary n-cubes with faulty edges
IF 3.8 Zone 3 (Computer Science) Q1 Mathematics Pub Date: 2024-03-27 DOI: 10.1016/j.jpdc.2024.104886
Si-Yu Li, Xiang-Jun Li, Meijie Ma

The g-restricted edge connectivity is an important measurement to assess the reliability of networks. The g-restricted edge connectivity of a connected graph G is the minimum size of a set of edges in G, if it exists, whose deletion separates G and leaves every vertex in the remaining components with at least g neighbors. The k-ary n-cube is an extension of the hypercube network and has many desirable properties. It has been used to build the architecture of the Supercomputer Fugaku. This paper establishes that for $g \le n$, the g-restricted edge connectivity of 3-ary n-cubes is $3^{\lfloor g/2 \rfloor}(1+(g \bmod 2))(2n-g)$, and the g-restricted edge connectivity of k-ary n-cubes with $k \ge 4$ is $2^{g}(2n-g)$. These results imply that in $Q_n^3$ with at most $3^{\lfloor g/2 \rfloor}(1+(g \bmod 2))(2n-g)-1$ faulty edges, or $Q_n^k$ ($k \ge 4$) with at most $2^{g}(2n-g)-1$ faulty edges, if each vertex is incident with at least g fault-free edges, then the remaining network is connected.
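
The two closed forms stated in this abstract are easy to evaluate for concrete networks. The sketch below only transcribes them; the function names are hypothetical, and the lower bound g ≥ 0 is an assumption, since the abstract only states g ≤ n.

```python
def restricted_edge_connectivity(k, n, g):
    """g-restricted edge connectivity of the k-ary n-cube, per the abstract.

    k == 3:  3^floor(g/2) * (1 + (g mod 2)) * (2n - g)
    k >= 4:  2^g * (2n - g)
    """
    if k < 3:
        raise ValueError("the stated results cover k = 3 and k >= 4 only")
    if not 0 <= g <= n:
        raise ValueError("the stated results require g <= n (g >= 0 assumed)")
    if k == 3:
        return 3 ** (g // 2) * (1 + g % 2) * (2 * n - g)
    return 2 ** g * (2 * n - g)


def max_tolerated_faulty_edges(k, n, g):
    """Largest number of faulty edges for which connectivity is guaranteed,
    provided every vertex keeps at least g fault-free incident edges."""
    return restricted_edge_connectivity(k, n, g) - 1
```

For example, for a 4-ary 3-cube with g = 2 the formula gives 2^2 * (6 - 2) = 16, so up to 15 faulty edges are tolerated under the stated condition.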

Citations: 0
Paired 2-disjoint path covers of k-ary n-cubes under the partitioned edge fault model
IF 3.8 Zone 3 (Computer Science) Q1 Mathematics Pub Date: 2024-03-27 DOI: 10.1016/j.jpdc.2024.104887
Hongbin Zhuang, Xiao-Yan Li, Jou-Ming Chang, Ximeng Liu

The k-ary n-cube $Q_n^k$ serves as an indispensable interconnection network in the design of data center networks, networks-on-chip, and parallel computing systems, since it possesses numerous attractive properties. In these parallel architectures, the paired (or unpaired) many-to-many m-disjoint path cover (m-DPC) plays a significant role in message transmission. Nevertheless, the construction of an m-DPC is severely obstructed by large-scale edge faults due to the rapid growth of the system scale. In this paper, we investigate the existence of a paired 2-DPC in $Q_n^k$ under the partitioned edge fault (PEF) model, which is a novel fault model for enhancing a network's fault tolerance with respect to path embedding problems. We exploit this model to evaluate the edge fault tolerance of $Q_n^k$ when a paired 2-DPC is embedded into $Q_n^k$. Compared with other known works, our results can help $Q_n^k$ achieve large-scale edge fault tolerance.
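
For readers unfamiliar with the topology, the following sketch builds the k-ary n-cube from its standard definition, which the abstract does not spell out: vertices are n-tuples over {0, ..., k-1}, and two vertices are adjacent when they differ by ±1 modulo k in exactly one coordinate. The connectivity check with a set of faulty edges is a plain breadth-first search added for illustration; it does not implement the PEF model or the 2-DPC construction.

```python
from collections import deque


def neighbors(v, k):
    """Neighbors of vertex v (a tuple) in the k-ary n-cube, for k >= 3."""
    out = []
    for i, x in enumerate(v):
        for step in (1, -1):
            out.append(v[:i] + ((x + step) % k,) + v[i + 1:])
    return out


def is_connected(k, n, faulty_edges=frozenset()):
    """Check whether the cube stays connected after removing faulty edges.

    faulty_edges is a set of frozenset({u, w}) pairs of adjacent vertices.
    """
    start = (0,) * n
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in neighbors(u, k):
            if w in seen or frozenset((u, w)) in faulty_edges:
                continue
            seen.add(w)
            queue.append(w)
    return len(seen) == k ** n
```

Each vertex has 2n neighbors when k ≥ 3, so removing all 2n edges at one vertex is the cheapest way to disconnect the network; fault models such as the one studied here restrict where faults may fall so that far more faulty edges can be tolerated.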

{"title":"Paired 2-disjoint path covers of k-ary n-cubes under the partitioned edge fault model","authors":"Hongbin Zhuang ,&nbsp;Xiao-Yan Li ,&nbsp;Jou-Ming Chang ,&nbsp;Ximeng Liu","doi":"10.1016/j.jpdc.2024.104887","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104887","url":null,"abstract":"<div><p>The <em>k</em>-ary <em>n</em>-cube <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup></math></span> serves as an indispensable interconnection network in the design of data center networks, network-on-chips, and parallel computing systems since it possesses numerous attractive properties. In these parallel architectures, the paired (or unpaired) many-to-many <em>m</em>-disjoint path cover (<em>m</em>-DPC) plays a significant role in message transmission. Nevertheless, the construction of <em>m</em>-DPC is severely obstructed by large-scale edge faults due to the rapid growth of the system scale. In this paper, we investigate the existence of paired 2-DPC in <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup></math></span> under the partitioned edge fault (PEF) model, which is a novel fault model for enhancing the networks' fault-tolerance related to path embedding problem. We exploit this model to evaluate the edge fault-tolerance of <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup></math></span> when a paired 2-DPC is embedded into <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup></math></span>. 
Compared to the other known works, our results can help <span><math><msubsup><mrow><mi>Q</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi>k</mi></mrow></msubsup></math></span> to achieve large-scale edge fault-tolerance.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140344268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A distributed learning based on robust diffusion SGD over adaptive networks with noisy output data
IF 3.8 Zone 3 Computer Science Q1 Mathematics Pub Date: 2024-03-26 DOI: 10.1016/j.jpdc.2024.104883
Fatemeh Barani, Abdorreza Savadi, Hadi Sadoghi Yazdi

Outliers and noise are unavoidable factors that can severely degrade the performance of distributed learning algorithms. Developing a robust algorithm is vital in applications such as system identification and stock market forecasting, in which noise on the desired signals may strongly bias the solutions. In this paper, we propose a Robust Diffusion Stochastic Gradient Descent (RDSGD) algorithm based on the pseudo-Huber loss function, which can significantly suppress the effect of Gaussian and non-Gaussian noise on estimation performance in adaptive networks. The performance and convergence behavior of RDSGD are assessed in the presence of α-stable and mixed-Gaussian noise in stationary and non-stationary environments. Simulation results show that the proposed algorithm achieves both a higher convergence rate and lower steady-state misadjustment than conventional diffusion algorithms and several robust algorithms.
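
The role of the pseudo-Huber loss can be illustrated with a small sketch. This is not the RDSGD algorithm from the paper, whose exact update rule the abstract does not give. It assumes the usual pseudo-Huber form $L_\delta(e) = \delta^2(\sqrt{1 + (e/\delta)^2} - 1)$ and a generic adapt-then-combine diffusion round; `atc_step`, `combine`, `mu`, and `delta` are hypothetical names chosen for illustration.

```python
import math

def pseudo_huber(e, delta=1.0):
    """Pseudo-Huber loss: about e**2 / 2 for small |e|, about delta * |e| for large |e|."""
    return delta ** 2 * (math.sqrt(1.0 + (e / delta) ** 2) - 1.0)

def pseudo_huber_grad(e, delta=1.0):
    """Derivative of the loss with respect to e; its magnitude never exceeds delta."""
    return e / math.sqrt(1.0 + (e / delta) ** 2)

def atc_step(weights, samples, combine, mu=0.05, delta=1.0):
    """One adapt-then-combine round over all nodes (assumed generic scheme).

    weights[k]    : current estimate at node k (list of floats)
    samples[k]    : (u, d), the regressor and noisy desired output seen by node k
    combine[l][k] : weight node k gives to node l's intermediate estimate;
                    each column is assumed to sum to 1
    """
    # Adapt: local stochastic-gradient step on the pseudo-Huber loss.
    psi = []
    for w, (u, d) in zip(weights, samples):
        e = d - sum(ui * wi for ui, wi in zip(u, w))
        g = pseudo_huber_grad(e, delta)
        psi.append([wi + mu * g * ui for wi, ui in zip(w, u)])
    # Combine: weighted average of the neighbors' intermediate estimates.
    n, dim = len(weights), len(weights[0])
    return [[sum(combine[l][k] * psi[l][j] for l in range(n)) for j in range(dim)]
            for k in range(n)]

# Two isolated nodes (identity combination); node 1 sees a large outlier.
identity = [[1.0, 0.0], [0.0, 1.0]]
start = [[0.0], [0.0]]
data = [([1.0], 1.0), ([1.0], 1000.0)]
updated = atc_step(start, data, identity, mu=0.1, delta=1.0)
```

Because the gradient is bounded by `delta`, the outlier at node 1 moves its estimate by at most `mu * delta` times the regressor, whereas a squared-error update would scale with the full error of 1000.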

Citations: 0