Journal of Parallel and Distributed Computing最新文献

英文中文

Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues) 封面 1 - 完整扉页（常规期刊）/特刊扉页（特刊）

IF 3.8 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-03-09 DOI: 10.1016/S0743-7315(24)00038-8

引用次数: 0

Dataflow optimization with layer-wise design variables estimation method for enflame CNN accelerators 针对enflame CNN加速器的数据流优化与分层设计变量估算方法

IF 3.8 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-03-06 DOI: 10.1016/j.jpdc.2024.104869

Tian Chen , Yu-an Tan , Zheng Zhang , Nan Luo , Bin Li , Yuanzhang Li

As convolution layers have been proved to be the most time-consuming operation in convolutional neural network (CNN) algorithms, many efficient CNN accelerators have been designed to boost the performance of convolution operations. Previous works on CNN acceleration usually use fixed design variables for diverse convolutional layers, which would lead to inefficient data movements and low utilization of computing resource. We tackle this issue by proposing a flexible dataflow optimization method with design variables estimation for different layers. The optimization method first narrows the design space by the priori constraints, and then enumerates all legal solutions to select the optimal design variables. We demonstrate the effectiveness of the proposed optimization method by implementing representative CNN models (VGG-16, ResNet-18 and MobileNet V1) on Enflame Technology's programmable CNN accelerator, General Computing Unit (GCU). The results indicate that our optimization can significantly enhance the throughput of the convolution layers in ResNet, VGG and MobileNet on GCU, with improvement of up to 1.84×. Furthermore, it achieves up to 2.08× of GCU utilization specifically for the convolution layers of ResNet on GCU.

卷积层被证明是卷积神经网络（CNN）算法中最耗时的操作，因此人们设计了许多高效的 CNN 加速器来提高卷积操作的性能。以往关于 CNN 加速的研究通常使用固定的设计变量来设计不同的卷积层，这将导致数据移动效率低下和计算资源利用率低。针对这一问题，我们提出了一种灵活的数据流优化方法，对不同层的设计变量进行估算。该优化方法首先根据先验约束条件缩小设计空间，然后枚举所有合法解决方案，选出最优设计变量。我们在恩福莱姆科技公司的可编程 CNN 加速器通用计算单元（GCU）上实现了具有代表性的 CNN 模型（VGG-16、ResNet-18 和 MobileNet V1），证明了所提出的优化方法的有效性。结果表明，我们的优化方法可以在 GCU 上显著提高 ResNet、VGG 和 MobileNet 卷积层的吞吐量，最高可提高 1.84 倍。此外，特别是 ResNet 的卷积层在 GCU 上的 GCU 利用率提高了 2.08 倍。

{"title":"Dataflow optimization with layer-wise design variables estimation method for enflame CNN accelerators","authors":"Tian Chen , Yu-an Tan , Zheng Zhang , Nan Luo , Bin Li , Yuanzhang Li","doi":"10.1016/j.jpdc.2024.104869","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104869","url":null,"abstract":"<div><p>As convolution layers have been proved to be the most time-consuming operation in convolutional neural network (CNN) algorithms, many efficient CNN accelerators have been designed to boost the performance of convolution operations. Previous works on CNN acceleration usually use fixed design variables for diverse convolutional layers, which would lead to inefficient data movements and low utilization of computing resource. We tackle this issue by proposing a flexible dataflow optimization method with design variables estimation for different layers. The optimization method first narrows the design space by the priori constraints, and then enumerates all legal solutions to select the optimal design variables. We demonstrate the effectiveness of the proposed optimization method by implementing representative CNN models (VGG-16, ResNet-18 and MobileNet V1) on Enflame Technology's programmable CNN accelerator, General Computing Unit (GCU). The results indicate that our optimization can significantly enhance the throughput of the convolution layers in ResNet, VGG and MobileNet on GCU, with improvement of up to 1.84×. Furthermore, it achieves up to 2.08× of GCU utilization specifically for the convolution layers of ResNet on GCU.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"189 ","pages":"Article 104869"},"PeriodicalIF":3.8,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140067279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Adaptive patch grid strategy for parallel protein folding using atomic burials with NAMD 利用 NAMD 原子埋藏技术实现并行蛋白质折叠的自适应补丁网格策略

IF 3.8 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-03-04 DOI: 10.1016/j.jpdc.2024.104868

Emerson A. Macedo, Alba C.M.A. Melo

The definition of protein structures is an important research topic in molecular biology currently, since there is a direct relationship between the function of the protein in the organism and the 3D geometric configuration it adopts. The transformations that occur in the protein structure from the 1D configuration to the 3D form are called protein folding. Ab initio protein folding methods use physical forces to model the interactions among the atoms that compose the protein. In order to accelerate those methods, parallel tools such as NAMD were proposed. In this paper, we propose two contributions for parallel protein folding simulations: (a) adaptive patch grid (APG) and (b) the addition of atomic burials (AB) to the traditional forces used in the simulation. With APG, we are able to adapt the simulation box (patch grid) to the current shape of the protein during the folding process. AB forces relate the 3D protein structure to its geometric center and are adequate for modeling globular proteins. Thus, adding AB to the forces used in parallel protein folding potentially increases the quality of the result for this class of proteins. APG and AB were implemented in NAMD and tested in supercomputer environments. Our results show that, with APG, we are able to reduce the execution time of the folding simulation of protein 4LNZ (5,714 atoms, 15 million time steps) from 12 hours and 36 minutes to 11 hours and 8 minutes, using 16 nodes (256 CPU cores). We also show that our APG+AB strategy was successfully used in a realistic protein folding simulation (1.7 billion time steps).

蛋白质结构的定义是当前分子生物学的一个重要研究课题，因为蛋白质在生物体内的功能与它所采用的三维几何构型有直接关系。蛋白质结构从一维构型到三维形式的转变称为蛋白质折叠。蛋白质折叠方法使用物理力来模拟组成蛋白质的原子之间的相互作用。为了加速这些方法，人们提出了 NAMD 等并行工具。在本文中，我们提出了并行蛋白质折叠模拟的两个贡献：(a) 自适应补丁网格 (APG) 和 (b) 在模拟中使用的传统力之外添加原子埋藏 (AB)。有了 APG，我们就能在折叠过程中根据蛋白质的当前形状调整模拟框（补丁网格）。AB 力将三维蛋白质结构与其几何中心相关联，适用于球状蛋白质建模。因此，将 AB 力添加到并行蛋白质折叠中可能会提高这类蛋白质的结果质量。在 NAMD 中实现了 APG 和 AB，并在超级计算机环境中进行了测试。结果表明，使用 APG，我们能够将 4LNZ 蛋白质（5714 个原子，1500 万个时间步）折叠模拟的执行时间从 12 小时 36 分钟减少到 11 小时 8 分钟，使用 16 个节点（256 个 CPU 内核）。我们还展示了 APG+AB 策略在实际蛋白质折叠模拟（17 亿时间步）中的成功应用。

{"title":"Adaptive patch grid strategy for parallel protein folding using atomic burials with NAMD","authors":"Emerson A. Macedo, Alba C.M.A. Melo","doi":"10.1016/j.jpdc.2024.104868","DOIUrl":"10.1016/j.jpdc.2024.104868","url":null,"abstract":"<div><p>The definition of protein structures is an important research topic in molecular biology currently, since there is a direct relationship between the function of the protein in the organism and the 3D geometric configuration it adopts. The transformations that occur in the protein structure from the 1D configuration to the 3D form are called protein folding. <em>Ab initio</em> protein folding methods use physical forces to model the interactions among the atoms that compose the protein. In order to accelerate those methods, parallel tools such as NAMD were proposed. In this paper, we propose two contributions for parallel protein folding simulations: (a) adaptive patch grid (APG) and (b) the addition of atomic burials (AB) to the traditional forces used in the simulation. With APG, we are able to adapt the simulation box (patch grid) to the current shape of the protein during the folding process. AB forces relate the 3D protein structure to its geometric center and are adequate for modeling globular proteins. Thus, adding AB to the forces used in parallel protein folding potentially increases the quality of the result for this class of proteins. APG and AB were implemented in NAMD and tested in supercomputer environments. Our results show that, with APG, we are able to reduce the execution time of the folding simulation of protein 4LNZ (5,714 atoms, 15 million time steps) from 12 hours and 36 minutes to 11 hours and 8 minutes, using 16 nodes (256 CPU cores). We also show that our APG+AB strategy was successfully used in a realistic protein folding simulation (1.7 billion time steps).</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"189 ","pages":"Article 104868"},"PeriodicalIF":3.8,"publicationDate":"2024-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140054484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

HoneyTwin: Securing smart cities with machine learning-enabled SDN edge and cloud-based honeypots HoneyTwin：利用支持机器学习的 SDN 边缘和基于云的蜜罐确保智慧城市安全

IF 3.8 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-02-20 DOI: 10.1016/j.jpdc.2024.104866

Mohammed M. Alani

With the promise of higher throughput, and better response times, 6G networks provide a significant enabler for smart cities to evolve. The rapidly-growing reliance on connected devices within the smart city context encourages malicious actors to target these devices to achieve various malicious goals. In this paper, we present a novel defense technique that creates a cloud-based virtualized honeypot/twin that is designed to receive malicious traffic through edge-based machine learning-enabled detection system. The proposed system performs early identification of malicious traffic in a software defined network-enabled edge routing point to divert that traffic away from the 6G-enabled smart city endpoints. Testing of the proposed system showed an accuracy exceeding 99.8%, with an $F_{1}$ score of 0.9984.

6G 网络有望实现更高的吞吐量和更短的响应时间，为智慧城市的发展提供了重要的推动力。在智慧城市背景下，人们对联网设备的依赖性迅速增加，这促使恶意行为者瞄准这些设备，以实现各种恶意目标。在本文中，我们提出了一种新颖的防御技术，它创建了一个基于云的虚拟化蜜罐/双核，旨在通过基于边缘机器学习的检测系统接收恶意流量。所提出的系统可在软件定义网络支持的边缘路由点中对恶意流量进行早期识别，从而将这些流量从支持 6G 的智慧城市终端分流出去。对拟议系统的测试表明，其准确率超过 99.8%，F1 得分为 0.9984。

引用次数: 0

Hierarchical sort-based parallel algorithm for dynamic interest matching 基于分层排序的动态兴趣匹配并行算法

IF 3.8 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-02-18 DOI: 10.1016/j.jpdc.2024.104867

Wenjie Tang, Yiping Yao, Lizhen Ou, Kai Chen

Publish–subscribe communication is a fundamental service used for message-passing between decoupled applications in distributed simulation. When abundant unnecessary data transfer is introduced, interest-matching services are needed to filter irrelevant message traffic. Frequent demands during simulation execution makes interest matching a bottleneck with increased simulation scale. Contemporary algorithms built for serial processing inadequately leverage multicore processor-based parallel resources. Parallel algorithmic improvements are insufficient for large-scale simulations. Therefore, we propose a hierarchical sort-based parallel algorithm for dynamic interest matching that embeds all update and subscription regions into two full binary trees, thereby transferring the region-matching task to one of node-matching. It utilizes the association between adjacent nodes and the hierarchical relation between parent‒child nodes to eliminate redundant operations, and achieves incremental parallel matching that only compares changed regions. We analyze the time and space complexity of this process. The new algorithm performs better and is more scalable than state-of-the-art algorithms.

发布-订阅通信是分布式仿真中解耦应用程序之间消息传递的基本服务。当引入大量不必要的数据传输时，就需要兴趣匹配服务来过滤不相关的消息流量。随着仿真规模的扩大，仿真执行过程中的频繁需求使得兴趣匹配成为一个瓶颈。为串行处理而构建的当代算法无法充分利用基于多核处理器的并行资源。并行算法的改进不足以应对大规模仿真。因此，我们提出了一种基于分层排序的动态兴趣匹配并行算法，它将所有更新和订阅区域嵌入两棵完整的二叉树中，从而将区域匹配任务转移到节点匹配任务中。它利用相邻节点之间的关联和父子节点之间的层级关系来消除冗余操作，并实现只比较变化区域的增量并行匹配。我们分析了这一过程的时间和空间复杂性。与最先进的算法相比，新算法性能更好，扩展性更强。

{"title":"Hierarchical sort-based parallel algorithm for dynamic interest matching","authors":"Wenjie Tang, Yiping Yao, Lizhen Ou, Kai Chen","doi":"10.1016/j.jpdc.2024.104867","DOIUrl":"10.1016/j.jpdc.2024.104867","url":null,"abstract":"<div><p>Publish–subscribe communication is a fundamental service used for message-passing between decoupled applications in distributed simulation. When abundant unnecessary data transfer is introduced, interest-matching services are needed to filter irrelevant message traffic. Frequent demands during simulation execution makes interest matching a bottleneck with increased simulation scale. Contemporary algorithms built for serial processing inadequately leverage multicore processor-based parallel resources. Parallel algorithmic improvements are insufficient for large-scale simulations. Therefore, we propose a hierarchical sort-based parallel algorithm for dynamic interest matching that embeds all update and subscription regions into two full binary trees, thereby transferring the region-matching task to one of node-matching. It utilizes the association between adjacent nodes and the hierarchical relation between parent‒child nodes to eliminate redundant operations, and achieves incremental parallel matching that only compares changed regions. We analyze the time and space complexity of this process. The new algorithm performs better and is more scalable than state-of-the-art algorithms.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"188 ","pages":"Article 104867"},"PeriodicalIF":3.8,"publicationDate":"2024-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139923545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Revisiting I/O bandwidth-sharing strategies for HPC applications 重新审视高性能计算应用的 I/O 带宽共享策略

IF 3.8 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-02-15 DOI: 10.1016/j.jpdc.2024.104863

Anne Benoit , Thomas Herault , Lucas Perotin , Yves Robert , Frédéric Vivien

This work revisits I/O bandwidth-sharing strategies for HPC applications. When several applications post concurrent I/O operations, well-known approaches include serializing these operations (

) or fair-sharing the bandwidth across them (FairShare). Another recent approach, I/O-Sets, assigns priorities to the applications, which are classified into different sets based upon the average length of their iterations. We introduce several new bandwidth-sharing strategies, some of them simple greedy algorithms, and some of them more complicated to implement, and we compare them with existing ones. Our new strategies do not rely on any a-priori knowledge of the behavior of the applications, such as the length of work phases, the volume of I/O operations, or some expected periodicity. We introduce a rigorous framework, namely steady-state windows, which enables to derive bounds on the competitive ratio of all bandwidth-sharing strategies for three different objectives: minimum yield, platform utilization, and global efficiency. To the best of our knowledge, this work is the first to provide a quantitative assessment of the online competitiveness of any bandwidth-sharing strategy. This theory-oriented assessment is complemented by a comprehensive set of simulations, based upon both synthetic and realistic traces. The main conclusion is that two of our simple and low-complexity greedy strategies significantly outperform

, FairShare and I/O-Sets, and we recommend that the I/O community would implement them for further assessment.

这项工作重新探讨了高性能计算应用的 I/O 带宽共享策略。当多个应用程序同时进行 I/O 操作时，众所周知的方法包括将这些操作序列化（）或在它们之间公平共享带宽（FairShare）。另一种最新方法是 I/O 集，它为应用程序分配优先级，并根据其迭代的平均长度将其分为不同的集。我们引入了几种新的带宽共享策略，其中一些是简单的贪婪算法，另一些则实施起来较为复杂，我们还将它们与现有策略进行了比较。我们的新策略不依赖于对应用程序行为的任何先验知识，例如工作阶段的长度、I/O 操作量或某些预期周期。我们引入了一个严格的框架，即稳态窗口，它可以推导出所有带宽共享策略在三个不同目标下的竞争比率界限：最小产量、平台利用率和全局效率。据我们所知，这是首次对任何带宽共享策略的在线竞争力进行定量评估。这一以理论为导向的评估还辅以一套基于合成和现实轨迹的综合模拟。主要结论是，我们的两种简单、低复杂度的贪婪策略明显优于 FairShare 和 I/O-Sets。

{"title":"Revisiting I/O bandwidth-sharing strategies for HPC applications","authors":"Anne Benoit , Thomas Herault , Lucas Perotin , Yves Robert , Frédéric Vivien","doi":"10.1016/j.jpdc.2024.104863","DOIUrl":"10.1016/j.jpdc.2024.104863","url":null,"abstract":"<div><p>This work revisits I/O bandwidth-sharing strategies for HPC applications. When several applications post concurrent I/O operations, well-known approaches include serializing these operations (<figure><img></figure>) or fair-sharing the bandwidth across them (<span>FairShare</span>). Another recent approach, I/O-Sets, assigns priorities to the applications, which are classified into different sets based upon the average length of their iterations. We introduce several new bandwidth-sharing strategies, some of them simple greedy algorithms, and some of them more complicated to implement, and we compare them with existing ones. Our new strategies do not rely on any a-priori knowledge of the behavior of the applications, such as the length of work phases, the volume of I/O operations, or some expected periodicity. We introduce a rigorous framework, namely <em>steady-state windows</em>, which enables to derive bounds on the competitive ratio of all bandwidth-sharing strategies for three different objectives: minimum yield, platform utilization, and global efficiency. To the best of our knowledge, this work is the first to provide a quantitative assessment of the online competitiveness of any bandwidth-sharing strategy. This theory-oriented assessment is complemented by a comprehensive set of simulations, based upon both synthetic and realistic traces. The main conclusion is that two of our simple and low-complexity greedy strategies significantly outperform <figure><img></figure>, <span>FairShare</span> and I/O-Sets, and we recommend that the I/O community would implement them for further assessment.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"188 ","pages":"Article 104863"},"PeriodicalIF":3.8,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139878546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues) 封面 1 - 完整扉页（常规期刊）/特刊扉页（特刊）

IF 3.8 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-02-12 DOI: 10.1016/S0743-7315(24)00023-6

引用次数: 0

Exploring multiprocessor approaches to time series analysis 探索时间序列分析的多处理器方法

IF 3.8 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-02-08 DOI: 10.1016/j.jpdc.2024.104855

Ricardo Quislant, Eladio Gutierrez, Oscar Plata

Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, economics, etc. Matrix Profile, a state-of-the-art algorithm to perform time series analysis, finds out the most similar and dissimilar subsequences in a time series in deterministic time and it is exact. Matrix Profile has low arithmetic intensity and it operates on large amounts of time series data, which can be an issue in terms of memory requirements. On the other hand, Hardware Transactional Memory (HTM) is an alternative optimistic synchronization method that executes transactions speculatively in parallel while keeping track of memory accesses to detect and resolve conflicts.

This work evaluates one of the best implementations of Matrix Profile exploring multiple multiprocessor variants and proposing new implementations that consider a variety of synchronization methods (HTM, locks, barriers), as well as algorithm organizations. We analyze these variants using real datasets, both short and large, in terms of speedup and memory requirements, the latter being a major issue when dealing with very large time series. The experimental evaluation shows that our proposals can achieve up to 100× speedup over the sequential algorithm for 128 threads, and up to 3× over the baseline, while keeping memory requirements low and even independent of the number of threads.

时间序列分析是提取和预测流行病学、基因组学、神经科学、环境科学、经济学等不同领域事件的关键技术。Matrix Profile 是一种最先进的时间序列分析算法，它能在确定的时间内找出时间序列中最相似和最不相似的子序列，而且是精确的。Matrix Profile 的运算强度较低，可处理大量的时间序列数据，这可能是内存需求方面的一个问题。另一方面，硬件事务内存（HTM）是另一种乐观的同步方法，它以并行方式推测性地执行事务，同时跟踪内存访问以检测和解决冲突。这项工作评估了 Matrix Profile 的最佳实现之一，探索了多个多处理器变体，并提出了考虑各种同步方法（HTM、锁、障碍）以及算法组织的新实现。我们使用真实的短期和长期数据集分析了这些变体的速度提升和内存需求，后者是处理超大时间序列时的一个主要问题。实验评估表明，在 128 个线程的情况下，我们的建议比顺序算法的速度提高了 100 倍，比基准算法的速度提高了 3 倍，同时保持了较低的内存需求，甚至与线程数无关。

{"title":"Exploring multiprocessor approaches to time series analysis","authors":"Ricardo Quislant, Eladio Gutierrez, Oscar Plata","doi":"10.1016/j.jpdc.2024.104855","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104855","url":null,"abstract":"<div><p>Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, economics, etc. <em>Matrix Profile</em>, a state-of-the-art algorithm to perform time series analysis, finds out the most similar and dissimilar subsequences in a time series in deterministic time and it is exact. Matrix Profile has low arithmetic intensity and it operates on large amounts of time series data, which can be an issue in terms of memory requirements. On the other hand, Hardware Transactional Memory (HTM) is an alternative optimistic synchronization method that executes transactions speculatively in parallel while keeping track of memory accesses to detect and resolve conflicts.</p><p>This work evaluates one of the best implementations of Matrix Profile exploring multiple multiprocessor variants and proposing new implementations that consider a variety of synchronization methods (HTM, locks, barriers), as well as algorithm organizations. We analyze these variants using real datasets, both short and large, in terms of speedup and memory requirements, the latter being a major issue when dealing with very large time series. The experimental evaluation shows that our proposals can achieve up to 100× speedup over the sequential algorithm for 128 threads, and up to 3× over the baseline, while keeping memory requirements low and even independent of the number of threads.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"188 ","pages":"Article 104855"},"PeriodicalIF":3.8,"publicationDate":"2024-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524000194/pdfft?md5=a25b14cc13a327c9c4b6c5f9abde8126&pid=1-s2.0-S0743731524000194-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139732906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Fast recovery for large disk enclosures based on RAID2.0: Algorithms and evaluation 基于 RAID2.0 的大型磁盘阵列的快速恢复：算法与评估

IF 3.8 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-02-07 DOI: 10.1016/j.jpdc.2024.104854

Qiliang Li , Min Lyu , Liangliang Xu , Yinlong Xu

The RAID2.0 architecture, which uses dozens or even hundreds of disks, is widely adopted for large-capacity data storage. However, limited resources like memory and CPU cause RAID2.0 to execute batch recovery for disk failures. The traditional random data placement and recovery schemes result in highly skewed I/O access within a batch, which slows down the recovery speed. To address this issue, we propose DR-RAID, an efficient reconstruction scheme that balances local rebuilding workloads across all surviving disks within a batch. We dynamically select a batch of tasks with almost balanced read loads and make intra-batch adjustments for tasks with multiple solutions of reading source chunks. Furthermore, we use a bipartite graph model to achieve a uniform distribution of write loads. DR-RAID can be applied with homogeneous or heterogeneous disk rebuilding bandwidth. Experimental results demonstrate that in offline rebuilding, DR-RAID enhances the rebuilding throughput by up to 61.90% compared to the random data placement scheme. With varied rebuilding bandwidth, the improvement can reach up to 65.00%.

RAID2.0 架构使用数十个甚至数百个磁盘，被广泛用于大容量数据存储。然而，由于内存和 CPU 等资源有限，RAID2.0 只能对磁盘故障执行批量恢复。传统的随机数据放置和恢复方案会导致批次内的 I/O 访问高度倾斜，从而降低恢复速度。为了解决这个问题，我们提出了 DR-RAID，这是一种高效的重建方案，可以平衡批次内所有存活磁盘的本地重建工作量。我们动态地选择一批读取负载基本平衡的任务，并对有多种读取源块解决方案的任务进行批内调整。此外，我们还使用双链图模型来实现写入负载的均匀分布。DR-RAID 可应用于同质或异质磁盘重建带宽。实验结果表明，在离线重建中，与随机数据放置方案相比，DR-RAID 提高了高达 61.90% 的重建吞吐量。在重建带宽不同的情况下，提高幅度可达 65.00%。

{"title":"Fast recovery for large disk enclosures based on RAID2.0: Algorithms and evaluation","authors":"Qiliang Li , Min Lyu , Liangliang Xu , Yinlong Xu","doi":"10.1016/j.jpdc.2024.104854","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104854","url":null,"abstract":"<div><p>The RAID2.0 architecture, which uses dozens or even hundreds of disks, is widely adopted for large-capacity data storage. However, limited resources like memory and CPU cause RAID2.0 to execute batch recovery for disk failures. The traditional random data placement and recovery schemes result in highly skewed I/O access within a batch, which slows down the recovery speed. To address this issue, we propose DR-RAID, an efficient reconstruction scheme that balances local rebuilding workloads across all surviving disks within a batch. We dynamically select a batch of tasks with almost balanced read loads and make intra-batch adjustments for tasks with multiple solutions of reading source chunks. Furthermore, we use a bipartite graph model to achieve a uniform distribution of write loads. DR-RAID can be applied with homogeneous or heterogeneous disk rebuilding bandwidth. Experimental results demonstrate that in offline rebuilding, DR-RAID enhances the rebuilding throughput by up to 61.90% compared to the random data placement scheme. With varied rebuilding bandwidth, the improvement can reach up to 65.00%.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"188 ","pages":"Article 104854"},"PeriodicalIF":3.8,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139732543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Evaluating the effectiveness of Bat optimization in an adaptive and energy-efficient network-on-chip routing framework 评估自适应高能效片上网络路由框架中蝙蝠优化的效果

IF 3.8 3区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Parallel and Distributed Computing

Pub Date : 2024-02-05 DOI: 10.1016/j.jpdc.2024.104853

B. Naresh Kumar Reddy , Aruru Sai Kumar

Adaptive routing is effective in maintaining higher processor performance and avoids packets over minimal or non-minimal alternate routes without congestion for a multiprocessor system on chip. However, many systems cannot deal with the fact that sending packets over an alternative path rather than the shorter, fixed-priority route can result in packets arriving at the destination node out of order. This can occur if packets belonging to the same communication flow are adaptively routed through a different path. In real-world network systems, there are strategies and algorithms to efficiently handle out-of-order packets without requiring infinite memory. Techniques like buffering, sliding windows, and sequence number management are used to reorder packets while considering the practical constraints of available memory and processing power. The specific method used depends on the network protocol and the requirements of the application. In the proposed technique, a novel technique aimed at improving the performance of multiprocessor systems on chip by implementing adaptive routing based on the Bat algorithm. The framework employs 5 stage pipeline router, that completely gained and forward a packet at the perfect direction in an adaptive mode. Bat algorithm is used to enhance the performance, which can optimize route to transmit packets at the destination. A test was carried out on various NoC sizes (6 X 6 and 8 X 8) under multimedia benchmarks, compared with other related algorithms and implemented on Kintex-7 FPGA board. The outcomes of the simulation illustrate that the proposed algorithm reduces delay and improves the throughput over the other traditional adaptive algorithms.

对于芯片上的多处理器系统而言，自适应路由选择可有效保持较高的处理器性能，并避免数据包通过最小或非最小的备用路径而造成拥塞。然而，许多系统无法处理这样一个事实，即通过替代路径而不是更短、固定优先级的路径发送数据包，会导致数据包不按顺序到达目的地节点。如果属于同一通信流的数据包通过不同路径自适应路由，就会出现这种情况。在现实世界的网络系统中，有一些策略和算法可以在不需要无限内存的情况下有效处理失序数据包。缓冲、滑动窗口和序列号管理等技术可用于重新排序数据包，同时考虑可用内存和处理能力的实际限制。具体采用哪种方法取决于网络协议和应用程序的要求。在所提出的技术中，一种新型技术旨在通过实施基于 Bat 算法的自适应路由来提高芯片上多处理器系统的性能。该框架采用 5 级流水线路由器，以自适应模式在最佳方向完全获取和转发数据包。Bat 算法用于提高性能，可以优化路由，将数据包传送到目的地。在多媒体基准下对不同尺寸（6 X 6 和 8 X 8）的 NoC 进行了测试，与其他相关算法进行了比较，并在 Kintex-7 FPGA 板上进行了实现。仿真结果表明，与其他传统自适应算法相比，建议的算法减少了延迟，提高了吞吐量。

{"title":"Evaluating the effectiveness of Bat optimization in an adaptive and energy-efficient network-on-chip routing framework","authors":"B. Naresh Kumar Reddy , Aruru Sai Kumar","doi":"10.1016/j.jpdc.2024.104853","DOIUrl":"10.1016/j.jpdc.2024.104853","url":null,"abstract":"<div><p>Adaptive routing is effective in maintaining higher processor performance and avoids packets over minimal or non-minimal alternate routes without congestion for a multiprocessor system on chip. However, many systems cannot deal with the fact that sending packets over an alternative path rather than the shorter, fixed-priority route can result in packets arriving at the destination node out of order. This can occur if packets belonging to the same communication flow are adaptively routed through a different path. In real-world network systems, there are strategies and algorithms to efficiently handle out-of-order packets without requiring infinite memory. Techniques like buffering, sliding windows, and sequence number management are used to reorder packets while considering the practical constraints of available memory and processing power. The specific method used depends on the network protocol and the requirements of the application. In the proposed technique, a novel technique aimed at improving the performance of multiprocessor systems on chip by implementing adaptive routing based on the Bat algorithm. The framework employs 5 stage pipeline router, that completely gained and forward a packet at the perfect direction in an adaptive mode. Bat algorithm is used to enhance the performance, which can optimize route to transmit packets at the destination. A test was carried out on various NoC sizes (6 X 6 and 8 X 8) under multimedia benchmarks, compared with other related algorithms and implemented on Kintex-7 FPGA board. The outcomes of the simulation illustrate that the proposed algorithm reduces delay and improves the throughput over the other traditional adaptive algorithms.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"188 ","pages":"Article 104853"},"PeriodicalIF":3.8,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139688940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

首页上一页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Journal of Parallel and Distributed Computing

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀