
Latest Publications in Parallel Computing

NPDP benchmark suite for the evaluation of the effectiveness of automatic optimizing compilers
IF 1.4 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103016
Marek Palkowski, Wlodzimierz Bielecki

The paper presents a benchmark suite of ten non-serial polyadic dynamic programming (NPDP) kernels, which are designed to test the efficiency of tiled code generated by polyhedral optimization compilers. These kernels are mainly derived from bioinformatics algorithms, which pose a significant challenge for automatic loop nest tiling transformations. The paper describes the algorithms implemented by the examined kernels and unifies them as loop nests written in C. The purpose is to reconsider the execution and monitoring of codes typically used in past and current publications. To carry out experiments with the introduced benchmarks, we applied two source-to-source compilers, PLuTo and TRACO, to generate cache-efficient code and analyzed its performance on four multi-core machines. We discuss the limitations of well-known tiling approaches and outline future tiling strategies for generating effective tiled code for the introduced benchmarks by means of optimizing compilers.
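The bioinformatics NPDP recurrences the suite unifies share a characteristic shape in which S[i][j] depends on S[i][k] + S[k+1][j] for all intermediate k. A minimal C sketch of such a loop nest (a simplified Nussinov-style RNA-folding recurrence, written for illustration — not one of the suite's actual kernels):

```c
#include <string.h>

/* Simplified Nussinov-style NPDP recurrence over a sequence of length n:
 *   S[i][j] = max( S[i+1][j-1] + pair(i,j),
 *                  max over i <= k < j of S[i][k] + S[k+1][j] )
 * The S[i][k] + S[k+1][j] term makes the dependence pattern non-serial
 * and polyadic, which is what defeats straightforward rectangular tiling. */
static int max2(int a, int b) { return a > b ? a : b; }

void npdp_kernel(int n, const char *seq, int S[n][n]) {
    memset(S, 0, sizeof(int) * n * n);
    for (int i = n - 2; i >= 0; i--) {
        for (int j = i + 1; j < n; j++) {
            int pair = (seq[i] + seq[j] == 'A' + 'U') ||
                       (seq[i] + seq[j] == 'C' + 'G');
            S[i][j] = S[i + 1][j - 1] + pair;
            for (int k = i; k < j; k++)           /* polyadic term */
                S[i][j] = max2(S[i][j], S[i][k] + S[k + 1][j]);
        }
    }
}
```

The triangular iteration space plus the k-dependent reads across both triangle halves are exactly the features that force polyhedral compilers into non-rectangular tiling schemes.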

Citations: 1
A parallel non-convex approximation framework for risk parity portfolio design
IF 1.4 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.102999
Yidong Chen, Chen Li, Yonghong Hu, Zhonghua Lu

In this paper, we propose a parallel non-convex approximation framework (NCAQ) for optimization problems whose objective is to minimize a convex function plus a sum of non-convex functions. Based on the structure of the objective function, our framework transforms the non-convex constraints into a logarithmic barrier function and approximates the non-convex problem by a parallel quadratic approximation scheme, which allows the original problem to be solved by accelerated inexact gradient descent in a parallel environment. Moreover, we give a detailed convergence analysis of the proposed framework. The numerical experiments show that our framework outperforms state-of-the-art approaches in terms of accuracy and computation time on high-dimensional non-convex Rosenbrock test functions and on risk parity problems. In particular, we implement the proposed framework on CUDA, showing a more than 25x speed-up and removing the computational bottleneck of non-convex risk-parity portfolio design. Finally, we construct a high-dimensional risk parity portfolio that consistently outperforms the equal-weight portfolio when applied to Chinese stock markets.

Citations: 0
An optimal scheduling algorithm considering the transactions worst-case delay for multi-channel hyperledger fabric network
IF 1.4 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103041
Ou Wu, Shanshan Li, He Zhang, Liwen Liu, Haoming Li, Yanze Wang, Ziyi Zhang
Citations: 0
A survey of software techniques to emulate heterogeneous memory systems in high-performance computing
IF 1.4 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103023
Clément Foyer, Brice Goglin, Andrès Rubio Proaño

Heterogeneous memory will be involved in several upcoming platforms on the way to exascale. Combining technologies such as HBM, DRAM, and/or NVDIMM makes it possible to address the needs of different applications in terms of bandwidth, latency, or capacity. New memory interconnects such as CXL bring easy ways to attach these technologies to the processors.

High-performance computing developers must prepare their runtimes and applications for these architectures, even before they are actually available. Hence, we survey software solutions for emulating them. First, we list many ways to modify the performance of platforms so that developers may test their code under different memory performance profiles. This is required to identify kernels and data buffers that are sensitive to memory performance.

Then, we present several techniques for exposing fake heterogeneous memory information to the software stack. This is useful for adapting runtimes and applications to heterogeneous memory so that different kinds of memory are detected at runtime and so that buffers are allocated in the appropriate one.

Citations: 1
A lightweight semi-centralized strategy for the massive parallelization of branching algorithms
IF 1.4 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103024
Andres Pastrana-Cruz, Manuel Lafond

Several NP-hard problems are solved exactly using exponential-time branching strategies, whether branch-and-bound algorithms or bounded search trees in fixed-parameter algorithms. The number of tractable instances that can be handled by sequential algorithms is usually small, whereas massive parallelization has been shown to significantly increase the space of instances that can be solved exactly. However, previous centralized approaches require too much communication to be efficient, whereas decentralized approaches are more efficient but have difficulty keeping track of the global state of the exploration.

In this work, we propose to revisit the centralized paradigm while avoiding previous bottlenecks. In our strategy, the center has lightweight responsibilities, requires only a few bits for every communication, but is still able to keep track of the progress of every worker. In particular, the center never holds any task but is able to guarantee that a process with no work always receives the highest priority task globally.

Our strategy was implemented in a generic C++ library called GemPBA, which allows a programmer to convert a sequential branching algorithm into a parallel version by changing only a few lines of code. An experimental case study on the vertex cover problem demonstrates that some of the toughest instances from the DIMACS challenge graphs that would take months to solve sequentially can be handled within two hours with our approach.
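The branching scheme behind the vertex-cover case study can be stated sequentially in a few lines: pick an uncovered edge (u, v) and recurse on the two choices "u in the cover" and "v in the cover". A minimal sequential C sketch of that recursion (my own illustration with a bitmask vertex set, not GemPBA code):

```c
#include <limits.h>

/* Minimum vertex cover by branching: every edge (u,v) forces u or v into
 * the cover.  `covered` is a bitmask of chosen vertices, `depth` the
 * current cover size, `best` the best cover size found so far (the bound). */
int vc_branch(const int (*edges)[2], int m, unsigned covered, int depth, int best) {
    if (depth >= best) return best;              /* bound: cannot improve */
    int e = -1;
    for (int i = 0; i < m; i++)                  /* find an uncovered edge */
        if (!((covered >> edges[i][0]) & 1u) &&
            !((covered >> edges[i][1]) & 1u)) { e = i; break; }
    if (e < 0) return depth;                     /* every edge is covered */
    /* branch 1: take one endpoint; branch 2: take the other */
    int b = vc_branch(edges, m, covered | (1u << edges[e][0]), depth + 1, best);
    if (b < best) best = b;
    b = vc_branch(edges, m, covered | (1u << edges[e][1]), depth + 1, best);
    return b < best ? b : best;
}
```

A framework like GemPBA would run the two recursive branches as stealable tasks and share the incumbent `best` globally; the sketch above is the sequential kernel such a framework wraps.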

Citations: 0
Lifeline-based load balancing schemes for Asynchronous Many-Task runtimes in clusters
IF 1.4 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103020
Lukas Reitz, Kai Hardenbicker, Tobias Werner, Claudia Fohry

A popular approach to program scalable irregular applications is Asynchronous Many-Task (AMT) Programming. Here, programs define tasks according to task models such as dynamic independent tasks (DIT) or nested fork-join (NFJ). We consider cluster AMTs, in which a runtime system maps the tasks to worker threads in multiple processes.

Dynamic load balancing can then be achieved via cooperative work stealing, coordinated work stealing, or work sharing. A well-performing cooperative work stealing variant is the lifeline scheme. While previous implementations of this scheme are restricted to single-worker processes, a recent hybrid extension combines it with intra-process work sharing between multiple workers. The hybrid scheme, which was proposed for both DIT and NFJ, comes at the price of higher complexity.

This paper investigates whether this complexity is indispensable for multi-worker processes by contrasting the hybrid scheme with a novel pure work stealing extension of the lifeline scheme to multiple workers. We independently implemented the extension for DIT and NFJ. In experiments based on four benchmarks, we observed the pure scheme to be on a par with, or even outperform, the hybrid one by up to 18% for DIT and up to 5% for NFJ.

Building on this main result, we studied a modification of the pure scheme, which prefers local over global victims, and more heavily loaded over less loaded ones. The modification improves the performance of the pure scheme by up to 15%. Finally, we explored whether the lifeline scheme can profit from a change to coordinated work stealing. We developed a coordinated multi-worker implementation for DIT and observed a performance improvement over the cooperative scheme by up to 17%.

Citations: 0
A heterogeneous processing-in-memory approach to accelerate quantum chemistry simulation
IF 1.4 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103017
Zeshi Liu, Zhen Xie, Wenqian Dong, Mengting Yuan, Haihang You, Dong Li

The "memory wall" is an architectural property introducing high memory access latency that can limit application performance, and this wall becomes even taller in the context of big data. Although GPU-based systems can achieve high performance, it is difficult to improve their utilization because of the memory wall. Intensive data exchange and computation remain a challenge for applications with a massive memory footprint. Quantum-mechanics-based ab initio calculations, which leverage high-performance computing to investigate multi-electron systems, have been widely used in computational chemistry. However, ab initio calculations are labor-intensive and can easily consume hundreds of gigabytes of memory. Previous efforts on heterogeneous acceleration via GPU and CPU suffer from high-latency off-device memory access. In this paper, we introduce heterogeneous processing-in-memory (PIM) to mitigate the overhead of data movement between CPUs and GPUs, and deeply analyze two of the most memory-intensive parts of quantum chemistry codes: the FFT and the time-consuming loops. Specifically, we exploit runtime systems and programming models to improve hardware utilization and simplify programming effort by moving computation close to the data and eliminating hardware idling. We take a widely used package, QUANTUM ESPRESSO (opEn-Source Package for Research in Electronic Structure, Simulation, and Optimization), for our experiments, and our results show that our design provides up to 4.09x and 2.60x performance improvement and 71% and 88% energy reduction over CPU and GPU (NVIDIA P100), respectively.

Citations: 1
New YARN sharing GPU based on graphics memory granularity scheduling
IF 1.4 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-07-01 | DOI: 10.1016/j.parco.2023.103038
Jinliang Shi, Dewu Chen, Jiabi Liang, Lin Li, Yue-ying Lin, Jianjiang Li
Citations: 0
Big data BPMN workflow resource optimization in the cloud
IF 1.4 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-06-01 | DOI: 10.1016/j.parco.2023.103025
S. Simić, Nikola Tanković, D. Etinger
Citations: 0
GPU acceleration of Levenshtein distance computation between long strings
IF 1.4 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-04-01 | DOI: 10.2139/ssrn.4244720
David Castells-Rufas
Citations: 2