Journal of Parallel and Distributed Computing: Latest Articles, Page 8

Efficient topology reconfiguration for NoC-based multiprocessors: A greedy-memetic algorithm

IF 3.8 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-04-21 DOI: 10.1016/j.jpdc.2024.104904

Junyan Qian, Chuanfang Zhang, Zheng Wu, Hao Ding, Long Li

In multi-core processor systems, the Network-on-Chip (NoC) serves as a vital communication infrastructure. To ensure chip reliability during potential failures, this paper proposes a two-level topology reconfiguration algorithm with core-level redundancy technology. First, a heuristic topology reconfiguration method based on a greedy strategy performs local replacement of faulty processing elements (PEs) and generates an initial logical topology with shorter interconnection paths between PEs. Then, an intelligent optimization method based on a memetic algorithm optimizes the generated initial topology for better communication performance. The experimental results demonstrate that, compared to the current state-of-the-art algorithm, the proposed algorithm achieves average improvements of 13.92% and 30.83% across topologies of various sizes in terms of distance factor (DF) and congestion factor (CF), which represent communication delay and traffic balance, respectively. The proposed algorithm significantly enhances the communication performance of the target topology, mitigating communication latency and potential congestion problems.



CUDA acceleration of MI-based feature selection methods


Pub Date : 2024-04-18 DOI: 10.1016/j.jpdc.2024.104901

Bieito Beceiro, Jorge González-Domínguez, Laura Morán-Fernández, Verónica Bolón-Canedo, Juan Touriño

Feature selection algorithms are now essential in machine learning because they remove irrelevant and redundant information, reducing the dimensionality of the data and improving the quality of subsequent analyses. The problem with current feature selection approaches is that they are computationally expensive when processing large datasets. This work presents parallel implementations for Nvidia GPUs of three widely used feature selection methods based on the Mutual Information (MI) metric: mRMR, JMI and DISR. The publicly available code includes not only CUDA implementations of the general methods, but also an adaptation of them to low-precision fixed-point arithmetic in order to further increase their performance on GPUs. The experimental evaluation was carried out on two modern Nvidia GPUs (Turing T4 and Ampere A100) with highly satisfactory results, achieving speedups of up to 283x when compared to state-of-the-art C implementations.
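As a point of reference for the MI metric that mRMR, JMI and DISR build on, the following is a minimal sequential sketch in Python of the mutual information between two discrete features. It is not the authors' CUDA code; the function name `mutual_information` and the toy inputs are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from two equal-length sequences of discrete values."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)           # marginal counts of X
    count_y = Counter(ys)           # marginal counts of Y
    count_xy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: a feature identical to the label carries 1 bit,
# a feature independent of the label carries 0 bits.
label = [0, 1, 0, 1]
print(mutual_information([0, 1, 0, 1], label))
print(mutual_information([0, 0, 1, 1], label))
```

Each feature's score depends only on counts over the samples, which is the kind of independent, data-parallel work that maps naturally onto GPU threads.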



Efficient and lightweight in-memory computing architecture for hardware security


Pub Date : 2024-04-16 DOI: 10.1016/j.jpdc.2024.104898

Hala Ajmi, Fakhreddine Zayer, Amira Hadj Fredj, Hamdi Belgacem, Baker Mohammad, Naoufel Werghi, Jorge Dias

This paper introduces an innovative solution for improving the efficiency and speed of cryptographic algorithms based on the Advanced Encryption Standard (AES). The approach leverages in-memory computing (IMC) and applies to a broad spectrum of IoT applications, including robotic autonomous vehicles and various other scenarios. To achieve this goal, memristor (MR) designs are proposed to emulate the arithmetic operations required for different phases of the AES algorithm, enabling efficient in-memory processing. The key contributions of this work include: 1) the development of a 4-bit MR state element for implementing different arithmetic operations in an AES hardware prototype; 2) the creation of a pipelined AES design for massive parallelism and MR integration compatibility; and 3) the hardware implementation of the AES-IMC-based architecture using the MR emulator. The results show that AES-IMC outperforms existing architectures in terms of throughput and energy efficiency. Compared to conventional AES hardware, AES-IMC achieves a 30% power enhancement with comparable throughput. Additionally, when compared to state-of-the-art AES-based NVM engines, AES-IMC demonstrates comparable power dissipation and a 62% increase in throughput. The IMC architecture enables cost-effective real-time deployment of AES, leading to high-performance computing. By leveraging in-memory computing, this system provides improved computational efficiency and faster processing speeds, making it a promising solution for a wide range of applications in the field of autonomous driving and robotics. The potential benefits of this system include improved safety and security of unmanned devices, as well as enhanced performance and cost-effectiveness in a variety of computing environments.



Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)


Pub Date : 2024-04-12 DOI: 10.1016/S0743-7315(24)00058-3


Exploiting inherent elasticity of serverless in algorithms with unbalanced and irregular workloads


Pub Date : 2024-04-10 DOI: 10.1016/j.jpdc.2024.104891

Gerard Finol, Gerard París, Pedro García-López, Marc Sánchez-Artigas

The Function-as-a-Service (FaaS) execution model in serverless computing has been successful in running large-scale computations such as MapReduce, linear algebra, and machine learning. However, little attention has been given to executing highly dynamic parallel applications with unbalanced and irregular workloads. These algorithms are difficult to execute with good parallel efficiency because the required computing resources are hard to provision in time, which leads to resource over- and under-provisioning in clusters of static size. We propose that the elasticity and fine-grained “pay-as-you-go” model of FaaS can be a key enabler for effectively running these algorithms in the cloud. We use a simple serverless executor pool abstraction and evaluate it using three algorithms with unbalanced and irregular workloads. Results show that their serverless implementations can outperform a static Spark cluster of large virtual machines by up to 55% at the same cost, and can even outperform a single large virtual machine running locally.
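To make the executor pool idea above concrete, here is a minimal Python sketch of an irregular workload driven through a pool. It uses a local `ThreadPoolExecutor` purely as a stand-in for a serverless executor pool; the helper `run_irregular`, the `expand` callback and the toy tree are assumptions for illustration and are not taken from the paper.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_irregular(root, expand, max_workers=4):
    """Process a task tree whose shape is only discovered at run time.

    `expand(task)` returns the list of child tasks spawned by `task`.
    Returns the total number of tasks processed.
    """
    processed = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(expand, root)}
        while pending:
            # Wake up as soon as any task finishes so new work is submitted early.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                processed += 1
                for child in future.result():
                    pending.add(pool.submit(expand, child))
    return processed


def expand(n):
    """Hypothetical unbalanced tree: node n spawns children 0 .. n-1."""
    return list(range(n))


print(run_irregular(5, expand))  # the tree rooted at 5 has 2**5 = 32 nodes
```

In this toy tree the subtree under child 4 is sixteen times larger than the one under child 0, so the amount of pending work swings widely over time. A fixed `max_workers` is exactly the static sizing that the abstract argues against; in a FaaS setting each `submit` would correspond to a function invocation and the number of concurrent workers would follow the size of `pending`.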



Optimizing DNN training with pipeline model parallelism for enhanced performance in embedded systems


Pub Date : 2024-04-06 DOI: 10.1016/j.jpdc.2024.104890

Md Al Maruf, Akramul Azim, Nitin Auluck, Mansi Sahi

Deep Neural Networks (DNNs) have gained widespread popularity across application domains due to their strong performance. Despite the prevalence of massively parallel multi-core processor architectures, adopting large DNN models in embedded systems remains challenging, as most embedded applications are designed with single-core processors in mind. This limits DNN adoption in embedded systems because model parallelization and workload partitioning are leveraged inefficiently. Prior solutions attempt to address these challenges using data and model parallelism. However, they fall short in finding optimal DNN model partitions and distributing them efficiently to achieve improved performance.

This paper proposes a DNN model parallelism framework to accelerate model training by finding the optimal number of model partitions and resource provisions. The proposed framework combines data and model parallelism techniques to optimize the parallel processing of DNNs for embedded applications. In addition, it implements the pipeline execution of the partitioned models and integrates a task controller to manage the computing resources. The experimental results for image object detection demonstrate the applicability of our proposed framework in estimating the latest execution time and reducing overall model training time by almost 44.87% compared to the baseline AlexNet convolutional neural network (CNN) model.



A two-dimensional time-aware cloud service recommendation approach with enhanced similarity and trust


Pub Date : 2024-04-05 DOI: 10.1016/j.jpdc.2024.104889

Chunhua Tang, Shuangyao Zhao, Binbin Chen, Xiaonong Lu, Qiang Zhang

Collaborative Filtering (CF) is one of the most successful techniques for quality-of-service (QoS) prediction and cloud service recommendation. However, individual QoS values are time-sensitive and fluctuate, causing the QoS predicted by CF to deviate from the actual values. In addition, existing CF approaches ignore inauthentic QoS values given by untrustworthy users. To address these problems, we develop a two-dimensional time-aware and trust-aware service recommendation approach (TaTruSR). First, considering both the timeliness and the fluctuation of service QoS, an integrative method incorporating time weight (time dimension) and temporal certainty (QoS dimension) is proposed to determine the contribution of co-invoked services. Time weight is computed by a personalized logistic decay function that measures QoS changes by weighting the length of the time interval, while temporal certainty is defined by entropy to capture the degree of QoS fluctuation over a period of time. Second, a set of the most similar and trusted neighbors is identified using the time-aware similarity model and trust model. In these models, the direct similarity and local trust are calculated based on the QoS ratings and the contribution of co-invoked services to improve the prediction accuracy and eliminate unreliable QoS. The indirect similarity and global trust are estimated based on user relationship networks to alleviate the data sparsity problem. Finally, missing QoS prediction and reliable service recommendation for the active user are achieved based on the enhanced similarity and trust. A case study and experimental evaluation on real-world datasets demonstrate the practicality and accuracy of the proposed approach.
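The abstract names the two ingredients (a logistic decay for time weight and an entropy-based temporal certainty) without giving formulas. The Python sketch below shows one plausible shape for each; the exact functional forms, the parameter `alpha`, the binning scheme and both function names are assumptions for illustration and should not be read as the TaTruSR definitions.

```python
from collections import Counter
from math import exp, log2


def time_weight(interval, alpha=0.1):
    """Hypothetical logistic decay: 1.0 at interval 0, approaching 0 for long intervals."""
    if interval < 0:
        raise ValueError("interval must be non-negative")
    x = alpha * interval
    if x > 700:  # exp() would overflow; the weight is effectively zero here
        return 0.0
    return 2.0 / (1.0 + exp(x))


def temporal_certainty(qos_values, num_bins=4):
    """Hypothetical certainty in [0, 1]: 1 minus the normalized entropy of binned QoS values."""
    if not qos_values or num_bins < 2:
        raise ValueError("need at least one QoS value and at least two bins")
    low, high = min(qos_values), max(qos_values)
    if high == low:
        return 1.0  # no fluctuation at all
    width = (high - low) / num_bins
    # Assign each value to a bin; the maximum value falls into the last bin.
    bins = Counter(min(int((v - low) / width), num_bins - 1) for v in qos_values)
    n = len(qos_values)
    entropy = -sum((c / n) * log2(c / n) for c in bins.values())
    return 1.0 - entropy / log2(num_bins)


# Hypothetical response-time histories of one service over the same period.
print(time_weight(0), time_weight(10), time_weight(50))
print(temporal_certainty([0.5, 0.5, 0.5, 0.5]))  # stable QoS
print(temporal_certainty([1.0, 2.0, 3.0, 4.0]))  # strongly fluctuating QoS
```

Under these assumptions, a co-invoked service contributes most when its record is recent (high time weight) and its QoS has been stable (high certainty), which matches the role the abstract assigns to the two dimensions.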



Parameter identification method of a reaction-diffusion network information propagation system based on optimization theory


Pub Date : 2024-04-03 DOI: 10.1016/j.jpdc.2024.104888

Yi Ding, Linhe Zhu

Rumors now spread rapidly on the Internet. First, this paper establishes a reaction-diffusion system with the Allee effect to describe the rumor spreading process and derives the necessary conditions for the emergence of Turing bifurcation. Next, a parameter identification approach based on optimal control theory is presented. Finally, the impact of the magnitude of certain parameters in the objective function on parameter identification is examined through numerous parameter identification experiments in continuous space and on various complex networks. Additionally, the convergence rates and error magnitudes of different algorithms for parameter identification are studied across different spatial structures.



Fast knowledge graph completion using graphics processing units


Pub Date : 2024-03-28 DOI: 10.1016/j.jpdc.2024.104885

Chun-Hee Lee, Dong-oh Kang, Hwa Jeon Song

Knowledge graphs can be used in many areas related to data semantics, such as question-answering systems and knowledge-based systems. However, currently constructed knowledge graphs need to be complemented with additional relations to provide better knowledge; this task is called knowledge graph completion. To add new relations to an existing knowledge graph using knowledge graph embedding models, we have to evaluate $N \times N \times R$ vector operations, where $N$ is the number of entities and $R$ is the number of relation types, which is very costly.

In this paper, we provide an efficient knowledge graph completion framework on GPUs that derives new relations from knowledge graph embedding vectors. In the proposed framework, we first define what it means for a model to be transformable to a metric space and then provide a method that, for such a model, transforms the knowledge graph completion problem into a similarity join problem. Then, to process the similarity join efficiently, we derive formulas using the properties of a metric space. Based on these formulas, we develop a fast knowledge graph completion algorithm. Finally, we show experimentally that our framework can efficiently process the knowledge graph completion problem.
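The following Python sketch illustrates how a metric-space property can cut down the number of distance evaluations in a similarity join. It assumes a translation-style embedding in which a candidate triple (h, r, t) is accepted when the Euclidean distance between h + r and t is at most a threshold `eps`, and it prunes with a single pivot and the triangle inequality. The scoring rule, the pivot scheme and the name `similarity_join` are assumptions for illustration; the paper's own formulas are not given in the abstract and may differ.

```python
from math import dist


def similarity_join(queries, targets, eps, pivot):
    """Return ((i, j) pairs with dist(queries[i], targets[j]) <= eps, distance evaluations)."""
    # Distances from every target to the pivot are computed once and reused.
    target_to_pivot = [dist(t, pivot) for t in targets]
    pairs = []
    evaluated = 0
    for i, q in enumerate(queries):
        q_to_pivot = dist(q, pivot)
        for j, t in enumerate(targets):
            # Triangle inequality: |d(q, p) - d(t, p)| <= d(q, t),
            # so a gap larger than eps rules the pair out without computing d(q, t).
            if abs(q_to_pivot - target_to_pivot[j]) > eps:
                continue
            evaluated += 1
            if dist(q, t) <= eps:
                pairs.append((i, j))
    return pairs, evaluated


# Hypothetical 2-D embeddings: one head entity, one relation, three tail candidates.
head = (0.5, 0.0)
relation = (0.4, 0.0)
tails = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
query = tuple(h + r for h, r in zip(head, relation))  # h + r
print(similarity_join([query], tails, eps=0.2, pivot=(0.0, 0.0)))
```

In this toy run only one of the three candidate tails needs a full distance computation. The pruning test is a cheap scalar comparison per pair, which is the kind of check that can be evaluated in bulk on a GPU before the more expensive vector operations.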


Journal of Parallel and Distributed Computing, vol. 190, Article 104885. Published 2024-03-28. https://doi.org/10.1016/j.jpdc.2024.104885

Citations: 0

A novel HPL-AI approach for FP16-only accelerator and its instantiation on Kunpeng+Ascend AI-specific platform

IF 3.8 · Zone 3 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-03-27 DOI: 10.1016/j.jpdc.2024.104884

Zijian Cao, Qiao Sun, Wenhao Yang, Changcheng Song, Zhe Wang, Huiyuan Li

HPL-AI, also known as HPL-MxP, is a new benchmark program used to evaluate the upper-bound performance of AI-related tasks on a specific computing cluster. It solves a large linear equation system in FP64, preconditioned by a complete LU factorization in lower precision. In this paper, we propose a new HPL-AI approach that relies on factorizing the coefficient matrix in mixed precision: FP32 diagonals and FP16 off-diagonals. Without compromising the quality of the resulting LU preconditioner, the proposed approach uses only the FP16 dense matrix multiplication primitive on the accelerator, maximizing FP16 throughput. Numerical analysis and experiments validate the approach and confirm that numerical underflow and overflow are avoided during factorization. We implement the proposed approach on Kunpeng+Ascend clusters, a novel AI-specific platform with exceedingly high FP16 peak performance. By applying various optimization techniques, including 2D lookahead, an HCCL-based communication pipeline, and SYCL-based task overlapping, we achieve 975 TFlops on a single node and nearly 100 PFlops on a cluster of 128 nodes, with a weak scalability of 79.8%.
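
The idea of solving in FP64 with a low-precision LU factorization as the preconditioner can be sketched in a few lines of plain Python. This is only an illustration under several assumptions, not the authors' method. The factors are stored with an FP32 diagonal and FP16 off-diagonals by rounding through the standard `struct` module, while all arithmetic still runs in FP64. No pivoting is used, so the test matrix is chosen diagonally dominant. Plain iterative refinement stands in for the outer solver, since the abstract does not say which one the authors use. All function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round to IEEE binary16; struct raises OverflowError if x does not fit."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round to IEEE binary32."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_low_precision(a):
    """Unpivoted LU with rounded factors: FP32 diagonal, FP16 off-diagonal."""
    n = len(a)
    lu = [row[:] for row in a]
    for k in range(n):
        lu[k][k] = to_fp32(lu[k][k])
        for j in range(k + 1, n):
            lu[k][j] = to_fp16(lu[k][j])  # row k of U is final at step k
        for i in range(k + 1, n):
            lu[i][k] = to_fp16(lu[i][k] / lu[k][k])  # column k of L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]  # trailing update
    return lu


def lu_solve(lu, b):
    """Solve L U x = b with unit-lower L and upper U packed in one matrix."""
    n = len(b)
    x = b[:]
    for i in range(n):  # forward substitution
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):  # back substitution, in place
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def refine(a, b, tol=1e-12, max_iter=50):
    """Iterative refinement: FP64 residuals, low-precision LU for corrections."""
    n = len(b)
    lu = lu_low_precision(a)
    x = lu_solve(lu, b)
    for it in range(max_iter):
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol:
            return x, it
        d = lu_solve(lu, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x, max_iter


if __name__ == "__main__":
    a = [[4.1, 0.3, 0.7], [0.3, 5.2, 0.9], [0.7, 0.9, 6.3]]
    b = [1.0, 2.0, 3.0]
    x, iters = refine(a, b)
    print("solution:", x, "refinement steps:", iters)
```

The overflow check is implicit in this sketch: `struct.pack("e", ...)` raises `OverflowError` for values outside the FP16 range, which is the failure mode the abstract says the factorization must avoid.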

Journal of Parallel and Distributed Computing, vol. 190, Article 104884. Published 2024-03-27. https://doi.org/10.1016/j.jpdc.2024.104884

Citations: 0