
Parallel Computing: Latest Publications

Towards analysis and refinement of auto-tuning spaces
IF 2.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-17 | DOI: 10.1016/j.parco.2026.103185
Jiří Filipovič, Suren Harutyunyan Gevorgyan, Eduardo César, Anna Sikora
Source code-level auto-tuning enables applications to adapt their implementation to maintain peak performance under varying execution environments (i.e., hardware, input, or application settings). However, the performance of the auto-tuned code is inherently tied to the design of the tuning space (the space of possible changes to the code). An ideal tuning space must include configurations diverse enough to ensure high performance across all targeted environments while simultaneously eliminating redundant or inefficient regions that slow the tuning space search process. Traditional research has focused primarily on identifying optimization opportunities in the code and on efficient tuning space search. However, no rigorous methodology or tool supports the analysis and refinement of tuning spaces, that is, adding configurations that perform well in an unseen environment or removing configurations that perform poorly in any realistic environment.
In this short communication, we argue that hardware performance counters should be used to analyze tuning spaces, and that such an analysis would allow programmers to refine the tuning spaces by adding configurations that unlock additional performance in unseen environments and removing those unlikely to produce efficient code in any realistic environment. While our primary goal is to introduce this research question and foster discussion, we also present a preliminary methodology for tuning-space analysis. We validate our approach through a case study using a GPU implementation of an N-body simulation. Our results demonstrate that the proposed analysis can detect the weaknesses of a tuning space: based on its outcomes, we refined the tuning space, improving the average configuration performance by 3.3× and the best-performing configuration by 2–18%.
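To make the tuning-space idea concrete, the following is a minimal, purely illustrative sketch (not the paper's methodology): it enumerates a toy tuning space for a GPU kernel and prunes configurations whose profiled counters suggest they cannot be efficient. The parameter names, counter values, and thresholds are hypothetical.

```python
# Illustrative only: enumerate a toy GPU tuning space and prune configurations
# whose (hypothetical) profiled hardware counters indicate poor efficiency.
from itertools import product

def enumerate_space():
    # Hypothetical tuning parameters for an N-body-like kernel.
    block_sizes = [64, 128, 256, 512]
    unroll_factors = [1, 2, 4, 8]
    use_shared_memory = [False, True]
    for bs, uf, sm in product(block_sizes, unroll_factors, use_shared_memory):
        yield {"block_size": bs, "unroll": uf, "shared_mem": sm}

def profile(config):
    # Placeholder for a real measurement that would return hardware
    # performance counters (e.g., occupancy, DRAM bytes per flop).
    occupancy = min(1.0, 1024 / (config["block_size"] * config["unroll"]))
    bytes_per_flop = 0.5 if config["shared_mem"] else 4.0
    return {"occupancy": occupancy, "bytes_per_flop": bytes_per_flop}

def refine(space, min_occupancy=0.25, max_bytes_per_flop=2.0):
    # Keep only configurations whose counters suggest they could be
    # efficient in some realistic environment.
    kept = []
    for cfg in space:
        counters = profile(cfg)
        if counters["occupancy"] >= min_occupancy and counters["bytes_per_flop"] <= max_bytes_per_flop:
            kept.append(cfg)
    return kept

if __name__ == "__main__":
    space = list(enumerate_space())
    refined = refine(space)
    print(f"original space: {len(space)} configurations, refined: {len(refined)}")
```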
Citations: 0
Analysis of the impact of NUMA node configuration on the performance of offloading computations to GPUs
IF 2.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-19 | DOI: 10.1016/j.parco.2025.103182
Sergey Malkovsky, Aleksei Sorokin, Sergey Korolev
The article presents the results of several studies assessing the impact of the configuration of NUMA (Non-Uniform Memory Access) nodes on the performance of GPU-accelerated applications in a hybrid computing system with shared memory. Using Crossroads/N9 DGEMM (NVBLAS library) as a model application, performance in various NUMA modes with one or more GPUs was analyzed, and the throughput of the memory subsystem and of the data-transfer channels between host memory and the graphics processors was measured. The impact of coprocessor distribution across NUMA nodes on the efficiency of the model application was also examined.
Results showed that the configuration of NUMA nodes can have a significant impact on the performance of applications that offload calculations to graphics coprocessors in a hybrid computing system with shared memory, and that this impact can manifest in different ways. For example, using a single NUMA node for the entire computing system is the worst option in terms of memory bandwidth, yet it provided the highest bandwidth for communication between host memory and the coprocessors when data was actively transferred to several accelerators. Thus, this mode achieves maximum performance for calculations on multiple GPUs that actively exchange data through host memory, while other modes are advantageous in different situations. Overall, to achieve maximum performance during active data transfer, the coprocessors should belong to a single NUMA node. These results will help develop approaches to configuring hybrid computing systems on processors with a chiplet layout and will help improve the performance of software that offloads calculations to Ampere-architecture graphics accelerators, such as the NVIDIA A800 and NVIDIA A100, which are widely used in the high-performance computing industry.
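A minimal sketch of the kind of measurement such a study relies on, assuming CuPy and a CUDA-capable GPU are available (this is not the authors' benchmark code): it times host-to-device copies, and in a NUMA experiment the same script would simply be launched under different bindings, e.g. with numactl.

```python
# Minimal sketch (assumes CuPy and a CUDA GPU): measure host-to-device copy
# bandwidth. In a NUMA study the same script would be launched under different
# bindings, e.g. `numactl --cpunodebind=0 --membind=0 python bench.py`.
import numpy as np
import cupy as cp

def h2d_bandwidth_gbps(nbytes=256 * 1024 * 1024, repeats=10):
    host = np.random.rand(nbytes // 8)          # pageable host buffer
    start, stop = cp.cuda.Event(), cp.cuda.Event()
    best = 0.0
    for _ in range(repeats):
        start.record()
        dev = cp.asarray(host)                  # host -> device copy
        stop.record()
        stop.synchronize()
        ms = cp.cuda.get_elapsed_time(start, stop)
        best = max(best, host.nbytes / (ms * 1e-3) / 1e9)
        del dev
    return best

if __name__ == "__main__":
    print(f"peak host->device bandwidth: {h2d_bandwidth_gbps():.1f} GB/s")
```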
Citations: 0
A case study in hardware specialization for Monte Carlo cross-section lookup
IF 2.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-16 | DOI: 10.1016/j.parco.2025.103168
Kazutomo Yoshii, John R. Tramm, Bryce Allen, Tomohiro Ueno, Kentaro Sano, Andrew Siegel, Pete Beckman
Hardware specialization is a promising direction in the post-Moore era, particularly for high-performance computing (HPC). In this work, we present a lightweight prototyping example of hardware specialization using open-source tools. Focusing on the Monte Carlo cross-section lookup kernel, a computation with low resource utilization on general-purpose architectures, we implement a custom hardware pipeline in Chisel and generate Verilog for resource usage estimation. We explore hardware optimization techniques that trade off throughput and resource usage, and show that, as SRAM scaling stalls and memory dominates chip area, using additional logic, even in brute-force forms, can lead to better overall efficiency. Our estimation demonstrates a significant performance gain over general-purpose CPUs. While this is a case study, the methodology provides a practical path for quick feasibility studies in hardware specialization.
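For readers unfamiliar with the kernel, the following is an illustrative software version of a cross-section lookup in the spirit of XSBench-style benchmarks (a binary search over an energy grid followed by per-nuclide interpolation); it is not the authors' Chisel pipeline, and all data here is synthetic.

```python
# Illustrative software version of a cross-section lookup: binary-search a
# sorted energy grid, then linearly interpolate per-nuclide cross sections.
# This mirrors the kind of kernel specialized in hardware, not the Chisel design.
import bisect
import random

def build_grid(n_points=1000, n_nuclides=8, seed=0):
    rng = random.Random(seed)
    energies = sorted(rng.random() for _ in range(n_points))
    xs = [[rng.random() for _ in range(n_nuclides)] for _ in range(n_points)]
    return energies, xs

def lookup(energies, xs, energy):
    # Index of the grid point at or below `energy` (clamped to a valid range).
    i = max(0, min(bisect.bisect_right(energies, energy) - 1, len(energies) - 2))
    e0, e1 = energies[i], energies[i + 1]
    f = (energy - e0) / (e1 - e0) if e1 > e0 else 0.0
    # Interpolate every nuclide's cross section at this energy.
    return [lo + f * (hi - lo) for lo, hi in zip(xs[i], xs[i + 1])]

if __name__ == "__main__":
    energies, xs = build_grid()
    print(lookup(energies, xs, 0.5)[:3])
```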
Citations: 0
PROAD: Boosting Caffe Training via improving LevelDB I/O performance with Parallel Read, Out-of-Order Optimization, and Adaptive Design
IF 2.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-05 | DOI: 10.1016/j.parco.2025.103181
Yubiao Pan, Ailing Tian, Huizhen Zhang
Caffe, one of the most popular deep learning frameworks, trains models by reading training data from the storage engine, LevelDB, and feeding it into the computation engine. This paper analyzes the challenges faced by data reading in Caffe Training: (1) Fetch, Parse, and Transform—the three steps of reading each image—are serial, and each image is read sequentially; (2) Frequent disk I/O—each image read triggers an I/O operation—significantly increases the data reading time; (3) Caffe calls LevelDB’s range query method to read training data, but this leads to unnecessary pointer comparison operations, wasting CPU resources; (4) Since LevelDB reads training data in key order during range queries, the fixed order of training data across epochs may cause overfitting and lower the model’s test accuracy.
Based on these challenges, this paper proposes Parallel Read, Out-of-Order Optimization, and Adaptive Design strategies to design a new I/O layer, PROAD, for Caffe that systematically reconstructs and optimizes LevelDB’s original data reading mechanism, thus improving LevelDB I/O performance for Caffe Training. The Parallel Read method pipelines the Fetch, Parse, and Transform steps and accelerates reading via large block reads; Out-of-Order Optimization discards the range scan feature of LevelDB, allowing Caffe to read training data in a random manner during training, avoiding the original key comparison overhead and providing a boost to model accuracy; while the Adaptive Design method supports efficient reading of training data with different resolutions. Based on these designs, this paper implements PROAD and deploys it in Caffe for performance evaluation. Experimental results show that Caffe with PROAD significantly improves data reading performance during training, especially for high-resolution datasets, where data reading time in Caffe with PROAD is reduced by 14%–42% compared to Caffe with LevelDB and 6%–34% compared to Caffe with LMDB. Furthermore, Caffe with PROAD improves model test accuracy due to the Out-of-Order Optimization strategy, while consuming relatively reasonable memory resources.
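A minimal sketch of the pipelining and out-of-order ideas only (not PROAD itself): a background thread fetches records in random order into a bounded queue while the consumer parses and transforms them, so the three steps overlap. The record source here is a hypothetical placeholder rather than LevelDB.

```python
# Sketch of the pipelining idea: overlap Fetch, Parse and Transform with a
# bounded queue, and visit records in random order instead of key order.
# The on-disk format and LevelDB integration are omitted (hypothetical records).
import queue
import random
import threading

def fetch(record_ids, out_q):
    order = record_ids[:]
    random.shuffle(order)                 # out-of-order read instead of key order
    for rid in order:
        raw = f"raw-bytes-of-{rid}"       # placeholder for a large block read
        out_q.put(raw)
    out_q.put(None)                       # end-of-stream marker

def parse(raw):
    return {"id": raw.rsplit("-", 1)[-1], "pixels": [0] * 4}

def transform(sample):
    sample["pixels"] = [p / 255.0 for p in sample["pixels"]]
    return sample

def reader(record_ids):
    q = queue.Queue(maxsize=64)
    threading.Thread(target=fetch, args=(record_ids, q), daemon=True).start()
    while (raw := q.get()) is not None:
        yield transform(parse(raw))       # parse/transform overlap the next fetch

if __name__ == "__main__":
    for sample in reader(list(range(5))):
        print(sample)
```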
Citations: 0
Cache partitioning for sparse matrix–vector multiplication on the A64FX
IF 2.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-04 | DOI: 10.1016/j.parco.2025.103169
Sergej Breiter, James D. Trotter, Karl Fürlinger
One of the novel features of the Fujitsu A64FX CPU is the sector cache. This feature enables hardware-supported partitioning of the L1 and L2 caches and allows the programmer to control which partition data is placed in. This paper performs an in-depth study of applying the sector cache to sparse matrix-vector multiplication (SpMV) in the Compressed Sparse Row (CSR) format using a collection of 490 sparse matrices. A performance model based on reuse analysis is used to better understand when and how the sector cache leads to improved cache reuse and to predict cache behavior. Without cache partitioning, the model predicts the number of L2 cache misses within an error of 2%. With the sector cache enabled, depending on the configuration, the model predicts the number of L2 cache misses within 2–3% for sequential SpMV and within 4–18% for parallel SpMV with 48 threads. Further experiments show the effect of various sector cache configurations on performance. A median speedup of about 1.05× is achieved, whereas the maximum speedup is about 1.6×.
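For reference, this is the CSR SpMV kernel being studied, written as a plain Python sketch; the sector-cache placement itself is an A64FX hardware/compiler feature and is not expressible here.

```python
# CSR sparse matrix-vector multiplication, the kernel studied above.
# Only the access pattern being analyzed is shown; cache partitioning is not.
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x[col_idx[k]] is the irregular access that cache partitioning
            # tries to keep resident in the reserved cache partition.
            y[row] += values[k] * x[col_idx[k]]
    return y

if __name__ == "__main__":
    # 3x3 example: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
    values = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
    col_idx = np.array([0, 2, 1, 0, 2])
    row_ptr = np.array([0, 2, 3, 5])
    print(spmv_csr(values, col_idx, row_ptr, np.array([1.0, 1.0, 1.0])))
```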
Citations: 0
Benchmark of classical disk array and software-defined storage on near-identical hardware
IF 2.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-03 | DOI: 10.1016/j.parco.2025.103166
Tomas Vondra, David Sebek
This article presents a comparative analysis of two storage approaches: a SAN disk array, exemplified by an HPE 3PAR device, and a software-defined storage cluster built with the Ceph software. The objective of this comparison is to ascertain whether a software-defined storage cluster built from commodity servers can achieve performance comparable to a SAN disk array with a similar hardware configuration. The configuration used identical numbers of components with matching speeds, capacities, and hardware generation from the same manufacturer. By relaxing some requirements on the software-defined storage, we were able to benchmark all RAID levels against corresponding replication and erasure-code settings. The results revealed that the 3PAR performed 31 times better than Ceph for 4 KiB data-block writes. Conversely, the Ceph cluster surpassed the 3PAR by a factor of 1.4 for 16 MiB large-block reads. The differences are explained in the text based on the theory of operation of the two types of storage. We propose criteria for choosing the correct type of technology for individual use cases.
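As a rough illustration of the two access patterns compared above, the sketch below times synced 4 KiB writes and 16 MiB reads against an ordinary file; a real storage benchmark would instead drive the raw block device with a tool such as fio and defeat all caches.

```python
# Toy timing of the two access patterns compared above (4 KiB writes vs
# 16 MiB reads) against an ordinary file; not a substitute for fio on a raw device.
import os
import time

PATH = "bench.tmp"

def time_small_writes(n=1024, block=4 * 1024):
    buf = os.urandom(block)
    t0 = time.perf_counter()
    with open(PATH, "wb") as f:
        for _ in range(n):
            f.write(buf)
            f.flush()
            os.fsync(f.fileno())          # force each 4 KiB block to storage
    return n * block / (time.perf_counter() - t0) / 1e6   # MB/s

def time_large_reads(block=16 * 1024 * 1024):
    size = os.path.getsize(PATH)
    t0 = time.perf_counter()
    with open(PATH, "rb") as f:
        while f.read(block):
            pass
    return size / (time.perf_counter() - t0) / 1e6         # MB/s

if __name__ == "__main__":
    print(f"4 KiB synced writes: {time_small_writes():.1f} MB/s")
    print(f"16 MiB reads:        {time_large_reads():.1f} MB/s")
    os.remove(PATH)
```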
Citations: 0
Machine learning-driven fault-tolerant core mapping in Network-on-Chip architectures for advanced computing networks
IF 2.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-01 | DOI: 10.1016/j.parco.2025.103167
Challa Muralikrishna Yadav, B. Naresh Kumar Reddy
The utilization of machine learning (ML) in architectural design shows great potential, especially in addressing the challenges posed by complex design spaces where traditional approaches may fall short. Network-on-chip (NoC) architecture has emerged as an efficient solution for on-chip communication among processors. However, with increasing device scaling and component density, the likelihood of processor failures also rises, making fault-tolerant design a critical aspect of chip development to ensure system reliability. In this paper, we present a novel ML framework for fault-tolerant core mapping that effectively overcomes issues encountered in previous methodologies, such as re-transmission and re-mapping. The proposed framework intelligently learns optimal core-mapping strategies and effectively addresses fault-tolerance concerns in NoCs with diverse application core graphs. The approach begins with efficient NoC mapping and scheduling as the primary step. In the event of any faults during this process, an error detection and correction mechanism is applied within the NoC itself, eliminating the need for time-consuming re-transmissions. Furthermore, if faults persist even after error correction, the tasks assigned to the failed core are seamlessly migrated to a designated spare core, ensuring continuous system operation. Comparisons with conventional methods demonstrate considerable improvements in processor speed-up and energy efficiency, as well as reductions in re-transmission, latency, and dynamic power consumption. Hardware results on an FPGA board indicate enhanced performance, reduced area, and lower power consumption compared to related algorithms. The proposed technique showcases significant advancements in fault-tolerant core mapping for NoCs, thereby enhancing overall chip reliability and performance.
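A minimal sketch of the spare-core migration step only, with hypothetical task and core names (the ML-driven mapping itself is not shown): tasks mapped to a faulty core are moved to a designated spare so execution can continue.

```python
# Minimal sketch of the migration step only (not the ML mapping itself):
# tasks are mapped to cores of a mesh NoC, and when a core is reported faulty
# its tasks are moved to a designated spare core so execution can continue.
class NoCMapping:
    def __init__(self, cores, spare):
        self.task_to_core = {}
        self.cores = set(cores)
        self.spare = spare

    def map_task(self, task, core):
        if core not in self.cores:
            raise ValueError(f"unknown core {core}")
        self.task_to_core[task] = core

    def handle_fault(self, failed_core):
        # Migrate every task on the failed core to the spare core.
        moved = [t for t, c in self.task_to_core.items() if c == failed_core]
        for task in moved:
            self.task_to_core[task] = self.spare
        self.cores.discard(failed_core)
        return moved

if __name__ == "__main__":
    m = NoCMapping(cores=[(x, y) for x in range(3) for y in range(3)], spare=(2, 2))
    m.map_task("fft", (0, 0))
    m.map_task("filter", (0, 0))
    m.map_task("encode", (1, 1))
    print("migrated:", m.handle_fault((0, 0)), "->", m.spare)
```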
Citations: 0
Butterfly factorization for vision transformers on multi-IPU systems
IF 2.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-27 | DOI: 10.1016/j.parco.2025.103165
S.-Kazem Shekofteh, Daniel Bogacz, Christian Alles, Holger Fröning
Recent advances in machine learning have led to increasingly large and complex models, placing significant demands on computation and memory. Techniques such as Butterfly factorization have emerged to reduce model parameters and memory footprints while preserving accuracy. Specialized hardware accelerators, such as Graphcore’s Intelligence Processing Units (IPUs), are designed to address these challenges through massive parallelism and efficient on-chip memory utilization. In this paper, we extend our analysis of Butterfly structures for efficient utilization on single and multiple IPUs, comparing their performance with GPUs. These structures drastically reduce the number of parameters and memory footprint while preserving model accuracy. Experimental results on the Graphcore GC200 IPU chip, compared with an NVIDIA A30 GPU, demonstrate a 98.5% compression ratio, with speedups of 1.6× and 1.3× for Butterfly and Pixelated Butterfly structures, respectively. Extending our evaluation to Vision Transformer (ViT) models, we compare Multi-GPU and Multi-IPU systems on the M2000 machine: Multi-GPU reaches a maximum accuracy of 84.51% with a training time of 401.44 min, whereas Multi-IPU attains a higher maximum accuracy of 88.92% with a training time of 694.03 min. These results demonstrate that Butterfly factorization enables substantial compression of ViT layers (up to 97.17%) while improving model accuracy. The findings highlight the promise of IPU machines as a suitable platform for large-scale machine learning model training, especially when coupled with sparsification methods like Butterfly factorization, thanks to their efficient support for model parallelism.
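To illustrate why Butterfly factorization compresses so aggressively, the sketch below builds a random butterfly matrix as a product of log2(n) sparse factors and compares its parameter count (2n·log2 n) with a dense n×n layer (n²); it is a generic construction, not the paper's ViT integration.

```python
# Generic butterfly factorization sketch: a dense-looking n x n matrix is the
# product of log2(n) sparse factors, each with only 2n nonzero parameters.
import numpy as np

def random_butterfly_factor(n, stride, rng):
    # Each index pair (i, i + stride) is mixed by its own dense 2x2 block;
    # everything else in the factor is zero.
    B = np.zeros((n, n))
    for i in range(n):
        if (i // stride) % 2 == 0:
            j = i + stride
            a, b, c, d = rng.standard_normal(4)
            B[i, i], B[i, j] = a, b
            B[j, i], B[j, j] = c, d
    return B

def random_butterfly_matrix(n, rng=None):
    # Product of log2(n) factors; 4 * (n/2) = 2n nonzeros per factor, so
    # 2n * log2(n) parameters instead of n^2 for a dense matrix.
    if rng is None:
        rng = np.random.default_rng(0)
    W = np.eye(n)
    stride = 1
    while stride < n:
        W = random_butterfly_factor(n, stride, rng) @ W
        stride *= 2
    return W

if __name__ == "__main__":
    n = 256
    _ = random_butterfly_matrix(n)
    dense, butterfly = n * n, 2 * n * int(np.log2(n))
    print(f"dense params: {dense}, butterfly params: {butterfly} "
          f"({100 * (1 - butterfly / dense):.1f}% fewer)")
```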
Citations: 0
LSHDP: Locally sharded heterogeneous data parallel for distributed deep learning
IF 2.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-01 | DOI: 10.1016/j.parco.2025.103164
Motahhare Mirzaei, Mehrdad Ashtiani, Mohammad Javad Pirhadi, Sauleh Eetemadi
Pre-trained models such as GPT-3 and Llama 3.1, together with transformer architectures, are now recognized as large AI models and have gained significant importance. To accelerate the training of these models, distributed training has become a fundamental approach. This method enables model training to be executed across multiple GPUs, which is particularly essential for models that require more data and training time. Despite past advancements, achieving optimal utilization of GPU capacity remains a major challenge, especially in academic environments that often feature heterogeneous infrastructures and limited bandwidth between nodes, which do not match the assumptions of existing methods. In previous methods, the node with the lowest computational power is the bottleneck, leading to computational slowdowns and increased waiting times for the other nodes. This study addresses the issue by adjusting batch sizes to minimize node waiting times. This approach improves the efficiency of node utilization without reducing the convergence speed. Moreover, to address GPU memory limitations, existing methods often rely on high-speed inter-node communication. This reliance increases training time in scenarios with low network bandwidth (e.g., 1 Gb/s). This research mitigates the challenge with the LSDP (Locally Sharded Data Parallel) method, which leverages CPU memory instead of inter-node communication. Finally, by combining these two strategies, the LSHDP (Locally Sharded Heterogeneous Data Parallel) solution is introduced, which is suitable for heterogeneous infrastructures with low inter-node communication speeds. Experiments demonstrate that this method outperforms previous approaches in such environments, achieving speed improvements of 35.39% and 52.57% over data-parallel and Fully Sharded Data Parallel (FSDP) training, respectively.
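A minimal sketch of the batch-size balancing idea under the assumption that per-node throughputs have already been measured (node names and numbers are made up): each worker's share of a fixed global batch is made proportional to its speed so all workers finish a step at roughly the same time.

```python
# Minimal sketch of the load-balancing idea: split a fixed global batch across
# heterogeneous workers in proportion to their measured throughput
# (samples/second), so slow nodes no longer set the pace for everyone.
def proportional_batch_sizes(global_batch, throughputs):
    total = sum(throughputs.values())
    sizes = {w: int(global_batch * t / total) for w, t in throughputs.items()}
    # Hand out any remainder (from rounding down) to the fastest workers.
    leftover = global_batch - sum(sizes.values())
    for w in sorted(throughputs, key=throughputs.get, reverse=True)[:leftover]:
        sizes[w] += 1
    return sizes

if __name__ == "__main__":
    measured = {"node-a100": 900.0, "node-v100": 550.0, "node-2080ti": 300.0}
    print(proportional_batch_sizes(global_batch=512, throughputs=measured))
```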
Citations: 0
A sleek lock-free hash map in an ERA of safe memory reclamation methods
IF 2.1 | CAS Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-01 | DOI: 10.1016/j.parco.2025.103162
Pedro Moreno, Miguel Areias, Ricardo Rocha
Lock-free data structures have become increasingly significant due to their algorithmic advantages on multi-core cache-based architectures. Safe Memory Reclamation (SMR) is a technique used in concurrent programming to ensure that memory can be safely reclaimed without causing data corruption, dangling pointers, or access to freed memory. The ERA theorem states that any SMR method for concurrent data structures can provide at most two of the three main desirable properties: Ease of use, Robustness, and Applicability. This fundamental trade-off influences the design of efficient lock-free data structures at an early stage. This work redesigns a previous lock-free hash map to fully exploit the properties of the ERA theorem and to leverage the characteristics of multi-core cache-based architectures by minimizing the number of cache misses, a significant bottleneck in multi-core environments. Experimental results show that our design outperforms the previous one, which was already quite competitive compared with the concurrent hash map of Intel's TBB library.
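Since Python has no lock-free atomics, the following is only a schematic of one SMR flavor, epoch-based reclamation, to show the retire/advance/free bookkeeping that such designs build on; it is not the paper's hash map or its chosen SMR method.

```python
# Schematic of epoch-based reclamation (EBR): readers announce the epoch they
# entered, writers retire removed nodes tagged with the current epoch, and a
# node is freed only once no reader can still be inside an epoch that might
# have seen it. A lock stands in for atomics; this is bookkeeping only.
import threading

class EpochReclaimer:
    def __init__(self, num_threads):
        self.global_epoch = 0
        self.local_epoch = [None] * num_threads   # None = not in a critical section
        self.retired = []                         # (epoch, node) pairs
        self.lock = threading.Lock()

    def enter(self, tid):
        with self.lock:
            self.local_epoch[tid] = self.global_epoch

    def exit(self, tid):
        with self.lock:
            self.local_epoch[tid] = None

    def retire(self, node):
        with self.lock:
            self.retired.append((self.global_epoch, node))

    def try_reclaim(self):
        with self.lock:
            self.global_epoch += 1
            active = [e for e in self.local_epoch if e is not None]
            safe_before = min(active) if active else self.global_epoch
            freed = [n for e, n in self.retired if e < safe_before]
            self.retired = [(e, n) for e, n in self.retired if e >= safe_before]
        return freed

if __name__ == "__main__":
    r = EpochReclaimer(num_threads=2)
    r.enter(0)
    r.retire("old-bucket-array")
    print("freed immediately:", r.try_reclaim())   # [] because thread 0 may still see it
    r.exit(0)
    print("freed after exit:", r.try_reclaim())    # ['old-bucket-array']
```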
Citations: 0