
ACM Transactions on Parallel Computing: Latest Articles

Introduction to the Special Issue for SPAA’21
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-12-14 DOI: 10.1145/3630608
Y. Azar, Julian Shun
Citations: 0
A Conflict-Resilient Lock-Free Linearizable Calendar Queue
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-12-06 DOI: 10.1145/3635163
Romolo Marotta, Mauro Ianni, Alessandro Pellegrini, F. Quaglia
In the last two decades, great attention has been devoted to the design of non-blocking and linearizable data structures, which enable exploiting the scaled-up degree of parallelism in off-the-shelf shared-memory multi-core machines. In this context, priority queues are highly challenging. Indeed, concurrent attempts to extract the highest-priority item are prone to create detrimental thread conflicts that lead to abort/retry of the operations. In this article, we present the first priority queue that jointly provides: i) lock-freedom and linearizability; ii) conflict resiliency against concurrent extractions; iii) adaptiveness to different contention profiles; and iv) amortized constant-time access for both insertions and extractions. Beyond presenting our solution, we also provide proof of its correctness based on an assertional approach. Also, we present an experimental study on a 64-CPU machine, showing that our proposal provides performance improvements over state-of-the-art non-blocking priority queues.
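For readers unfamiliar with the underlying data structure, the following is a minimal sequential sketch of a calendar queue, the bucketed priority queue that the title refers to. It is not the lock-free, conflict-resilient design from the article: there is no concurrency control and no bucket resizing, and the class name, parameters, and fallback search are assumptions made for illustration only.

```python
class CalendarQueue:
    """Sequential calendar queue sketch (hypothetical names, no resizing)."""

    def __init__(self, num_buckets=8, bucket_width=1.0):
        self.width = bucket_width
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0
        self.current = 0  # virtual bucket ("day") where scanning starts

    def insert(self, timestamp, item):
        virtual = int(timestamp // self.width)
        bucket = self.buckets[virtual % len(self.buckets)]
        bucket.append((timestamp, item))
        bucket.sort(key=lambda entry: entry[0])  # keep each bucket ordered
        self.size += 1
        if virtual < self.current:
            self.current = virtual  # never scan past a pending event

    def extract_min(self):
        if self.size == 0:
            raise IndexError("extract_min from an empty calendar queue")
        n = len(self.buckets)
        # Scan one full "year" of days starting from the current one.
        for step in range(n):
            virtual = self.current + step
            bucket = self.buckets[virtual % n]
            if bucket and int(bucket[0][0] // self.width) == virtual:
                self.current = virtual
                self.size -= 1
                return bucket.pop(0)
        # Nothing due this year: jump directly to the earliest event.
        best = min((b for b in self.buckets if b), key=lambda b: b[0][0])
        self.current = int(best[0][0] // self.width)
        self.size -= 1
        return best.pop(0)
```

When events are spread evenly over the buckets, most extractions find their item within a few bucket visits, which is the intuition behind the constant-time behavior that calendar queues aim for. This sketch does not establish the amortized bound claimed in the article.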
Citations: 0
HPS Cholesky: Hierarchical Parallelized Supernodal Cholesky with Adaptive Parameters
Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-10-26 DOI: 10.1145/3630051
Shengle Lin, Wangdong Yang, Yikun Hu, Qinyun Cai, Minlu Dai, Haotian Wang, Kenli Li
Sparse supernodal Cholesky on multi-NUMAs is challenging due to the supernode relaxation and load balancing. In this work, we propose a novel approach to improve the performance of sparse Cholesky by combining deep learning with a relaxation parameter and a hierarchical parallelization strategy with NUMA affinity. Specifically, our relaxed supernodal algorithm utilizes a well-trained GCN model to adaptively adjust relaxation parameters based on the sparse matrix’s structure, achieving a proper balance between task-level parallelism and dense computational granularity. Additionally, the hierarchical parallelization maps supernodal tasks to the local NUMA parallel queue and updates contribution blocks in pipeline mode. Furthermore, the stream scheduling with NUMA affinity can further enhance the efficiency of memory access during the numerical factorization. The experimental results show that HPS Cholesky can outperform state-of-the-art libraries, such as Eigen LL^T, CHOLMOD, PaStiX and SuiteSparse, on 79.78%, 79.60%, 82.09% and 74.47% of 1128 datasets, respectively. It achieves an average speedup of 1.41x over the current optimal relaxation algorithm. Moreover, 70.83% of matrices have surpassed MKL sparse Cholesky on Xeon Gold 6248.
Citations: 0
Improved Online Scheduling of Moldable Task Graphs under Common Speedup Models
Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-10-26 DOI: 10.1145/3630052
Lucas Perotin, Hongyang Sun
We consider the online scheduling problem of moldable task graphs on multiprocessor systems for minimizing the overall completion time (or makespan). Moldable job scheduling has been widely studied in the literature, in particular when tasks have dependencies (i.e., task graphs) or when tasks are released on-the-fly (i.e., online). However, few studies have focused on both (i.e., online scheduling of moldable task graphs). In this paper, we design a new online scheduling algorithm for this problem and derive constant competitive ratios under several common yet realistic speedup models (i.e., roofline, communication, Amdahl, and a general combination). These results improve the ones we have shown in the preliminary version of the paper. We also prove, for each speedup model, a lower bound on the competitiveness of any online list scheduling algorithm that allocates processors to a task based only on the task’s parameters and not on its position in the graph. This lower bound matches exactly the competitive ratio of our algorithm for the roofline, communication and Amdahl’s model, and is close to the ratio for the general model. Finally, we provide a lower bound on the competitive ratio of any deterministic online algorithm for the arbitrary speedup model, which is not constant but depends on the number of tasks in the longest path of the graph.
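The speedup models named in the abstract can be made concrete with a short sketch. The formulas below are the forms commonly used for these models in the moldable-scheduling literature and are an assumption here, since the abstract does not spell them out: a task with work `w` on `p` processors takes `w / min(p, p_max) + d + c * (p - 1)`, where `p_max` caps the usable parallelism (roofline), `d` is a sequential part (Amdahl), and `c` is a per-processor communication overhead. The helper `fastest_allocation` is hypothetical and is not the allocation rule of the article's algorithm.

```python
def exec_time(p, w, p_max=None, d=0.0, c=0.0):
    """Execution time of a moldable task on p processors (assumed general form)."""
    if p < 1:
        raise ValueError("a task needs at least one processor")
    usable = p if p_max is None else min(p, p_max)
    return w / usable + d + c * (p - 1)


def roofline(p, w, p_max):
    # Linear speedup up to p_max processors, flat afterwards.
    return exec_time(p, w, p_max=p_max)


def communication(p, w, c):
    # Perfectly parallel work plus an overhead growing with p.
    return exec_time(p, w, c=c)


def amdahl(p, w, d):
    # Parallel work w plus an inherently sequential part d.
    return exec_time(p, w, d=d)


def fastest_allocation(time_fn, total_processors):
    """Hypothetical helper: processor count minimizing a task's own time."""
    return min(range(1, total_processors + 1), key=lambda p: (time_fn(p), p))
```

Under the communication model, adding processors eventually slows a task down: with `w = 100` and `c = 1`, the minimum over 32 processors is reached at `p = 10`. An online algorithm also has to weigh this local optimum against the processors left for other ready tasks.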
Citations: 0
Checkpointing strategies to tolerate non-memoryless failures on HPC platforms
Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-09-22 DOI: 10.1145/3624560
Anne Benoit, Lucas Perotin, Yves Robert, Frédéric Vivien
This paper studies checkpointing strategies for parallel applications subject to failures. The optimal strategy to minimize total execution time, or makespan, is well known when failure inter-arrival times (IATs) obey an Exponential distribution, but it is unknown for non-memoryless failure distributions. We explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy achieves an asymptotically optimal makespan, thereby establishing the first optimality result for arbitrary failure distributions. Through extensive simulations, we show that the new strategy is always at least as good as the Young/Daly strategy for various failure distributions. For distributions with a high infant mortality (such as LogNormal with shape parameter k = 2.51 or Weibull with shape parameter 0.5), the execution time is divided by a factor of 1.9 on average, and by up to a factor of 4.2 for recently deployed platforms.
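As background for the Young/Daly baseline mentioned above, here is a small sketch of the classical first-order formula: with mean time between failures `mtbf` and checkpoint cost `checkpoint_cost`, the checkpointing period is approximately `sqrt(2 * mtbf * checkpoint_cost)`. The function names and the first-order waste estimate are illustrative assumptions, and the sketch does not implement the new strategy proposed in the article.

```python
import math


def young_daly_period(mtbf, checkpoint_cost):
    """Classical first-order checkpointing period, sqrt(2 * mu * C)."""
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def first_order_waste(period, mtbf, checkpoint_cost):
    """Approximate fraction of time lost: checkpoint overhead plus
    expected re-execution after a failure (first-order estimate)."""
    return checkpoint_cost / period + period / (2.0 * mtbf)
```

For example, with `mtbf = 7200` seconds and `checkpoint_cost = 100` seconds, the period is 1200 seconds of work between checkpoints. Halving or doubling that period increases the first-order waste estimate.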
Citations: 0
Distributed Graph Coloring Made Easy
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-08-17 DOI: 10.1145/3605896
Yannic Maus
In this paper, we present a deterministic CONGEST algorithm to compute an O(kΔ)-vertex coloring in O(Δ/k) + log* n rounds, where Δ is the maximum degree of the network graph and k ≥ 1 can be freely chosen. The algorithm is extremely simple: Each node locally computes a sequence of colors and then it tries colors from the sequence in batches of size k. Our algorithm subsumes many important results in the history of distributed graph coloring as special cases, including Linial’s color reduction [Linial, FOCS’87], the celebrated locally iterative algorithm from [Barenboim, Elkin, Goldenberg, PODC’18], and various algorithms to compute defective and arbdefective colorings. Our algorithm can smoothly scale between several of these previous results and also simplifies the state-of-the-art (Δ + 1)-coloring algorithm. At the cost of losing some of the algorithm’s simplicity, we also provide an O(kΔ)-coloring algorithm in O(√(Δ/k)) + log* n rounds. We also provide improved deterministic algorithms for ruling sets, and, additionally, we provide a tight characterization for 1-round color reduction algorithms.
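The "try colors in batches of size k" idea can be illustrated with a centralized simulation. The sketch below is not the algorithm of the article and does not achieve its color or round bounds. The color sequences (points of a line `a*x + b` over a prime field, derived from the node ID) and the function names are assumptions chosen so that two distinct nodes collide in at most one sequence position. This guarantees that every node finds a free color within `ceil(q / k)` rounds when the prime `q` exceeds the maximum degree.

```python
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def batch_coloring(adj, k):
    """Simulate synchronous rounds; nodes must be labeled 0..n-1.

    adj maps each node to a list of neighbors (symmetric).
    Returns (colors, rounds_used); colors are pairs (position, value).
    """
    n = len(adj)
    max_degree = max((len(nb) for nb in adj.values()), default=0)
    q = max_degree + 1
    while not is_prime(q) or q * q < n:
        q += 1  # need q > max degree and q*q >= n distinct lines

    def color_at(node, x):
        a, b = divmod(node, q)  # node ID encodes the line a*x + b
        return (x, (a * x + b) % q)

    colors, rounds = {}, 0
    for start in range(0, q, k):
        uncolored = [v for v in adj if v not in colors]
        if not uncolored:
            break
        rounds += 1
        # Every uncolored node announces its next batch of k colors.
        batch = {v: [color_at(v, x) for x in range(start, min(start + k, q))]
                 for v in uncolored}
        for v in uncolored:
            blocked = set()
            for u in adj[v]:
                if u in batch:
                    blocked.update(batch[u])   # tried by a neighbor now
                elif u in colors:
                    blocked.add(colors[u])     # fixed in an earlier round
            free = [c for c in batch[v] if c not in blocked]
            if free:
                colors[v] = free[0]
    return colors, rounds
```

Larger batches trade more colors tried per round for fewer rounds, which mirrors the tradeoff between k and the round complexity described in the abstract.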
Citations: 0
A Fast Algorithm for Aperiodic Linear Stencil Computation using Fast Fourier Transforms
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-07-24 DOI: 10.1145/3606338
Zafar Ahmad, R. Chowdhury, Rathish Das, P. Ganapathi, Aaron Gregory, Yimin Zhu
Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly 10^7 cells for around 10^5 timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3× to 8.5× faster for aperiodic stencil problems. Code Repository: https://github.com/TEAlab/FFTStencils
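The core idea of evolving many timesteps at once can be sketched for the simplest case, a one-dimensional linear stencil with periodic boundaries. One stencil step is a circular convolution, so T steps amount to multiplying the grid's Fourier transform by the T-th power of the kernel's transform. The sketch below makes several simplifying assumptions: it uses a naive O(N^2) DFT from the standard library as a stand-in for a real FFT, the function names are hypothetical, and it does not cover the aperiodic algorithm from the article or the code in the linked repository.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for f in range(n):
        acc = 0j
        for m, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * cmath.pi * f * m / n)
        out.append(acc / n if inverse else acc)
    return out


def evolve_direct(grid, stencil, timesteps):
    """Reference: apply the stencil one timestep at a time, Theta(N*T) work."""
    n = len(grid)
    for _ in range(timesteps):
        grid = [sum(w * grid[(i + off) % n] for off, w in stencil.items())
                for i in range(n)]
    return grid


def evolve_fourier(grid, stencil, timesteps):
    """Jump straight to the final grid via the convolution theorem."""
    n = len(grid)
    kernel = [0.0] * n
    for off, w in stencil.items():
        kernel[(-off) % n] += w  # circular-convolution kernel of one step
    grid_hat = dft(grid)
    kernel_hat = dft(kernel)
    final_hat = [g * k ** timesteps for g, k in zip(grid_hat, kernel_hat)]
    return [v.real for v in dft(final_hat, inverse=True)]
```

With the stencil `{-1: 0.25, 0: 0.5, 1: 0.25}` (a simple diffusion-like average), both functions agree up to floating-point error. The work of the Fourier version does not grow linearly with the number of timesteps, because the time dependence reduces to a per-frequency power.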
Citations: 0
The Computational Complexity of Feasibility Analysis for Conditional DAG Tasks
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-07-05 DOI: 10.1145/3606342
Sanjoy Baruah, A. Marchetti-Spaccamela
The Conditional DAG (CDAG) task model is used for modeling multiprocessor real-time systems containing conditional expressions for which outcomes are not known prior to their evaluation. Feasibility analysis for CDAG tasks upon multiprocessor platforms is shown to be complete for the complexity class PSPACE; assuming NP ≠ PSPACE, this result rules out the use of Integer Linear Programming solvers for solving this problem efficiently. It is further shown that there can be no pseudo-polynomial time algorithm that solves this problem unless P = PSPACE.
Citations: 0
Algorithms for Right-Sizing Heterogeneous Data Centers
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-05-10 DOI: 10.1145/3595286
S. Albers, Jens Quedenfeld
Power consumption is a dominant and still growing cost factor in data centers. In time periods with low load, the energy consumption can be reduced by powering down unused servers. We resort to a model introduced by Lin, Wierman, Andrew and Thereska [23, 24] that considers data centers with identical machines, and generalize it to heterogeneous data centers with d different server types. The operating cost of a server depends on its load and is modeled by an increasing, convex function for each server type. In contrast to earlier work, we consider the discrete setting, where the number of active servers must be integral. Thereby, we seek truly feasible solutions. For homogeneous data centers (d = 1), both the offline and the online problem were solved optimally in [3, 4]. In this paper, we study heterogeneous data centers with general time-dependent operating cost functions. We develop an online algorithm based on a work function approach which achieves a competitive ratio of 2d + 1 + ϵ for any ϵ > 0. For time-independent operating cost functions, the competitive ratio can be reduced to 2d + 1. There is a lower bound of 2d shown in [5], so our algorithm is nearly optimal. For the offline version, we give a graph-based (1 + ϵ)-approximation algorithm. Additionally, our offline algorithm is able to handle time-variable data-center sizes.
Citations: 0
Non-Clairvoyant Scheduling with Predictions
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-05-02 DOI: 10.1145/3593969
Sungjin Im, Ravi Kumar, Mahshid Montazer Qaem, Manish Purohit
In the single-machine non-clairvoyant scheduling problem, the goal is to minimize the total completion time of jobs whose processing times are unknown a priori. We revisit this well-studied problem and consider the question of how to effectively use (possibly erroneous) predictions of the processing times. We study this question from ground zero by first asking what constitutes a good prediction; we then propose a new measure to gauge prediction quality and design scheduling algorithms with strong guarantees under this measure. Our approach to derive a prediction error measure based on natural desiderata could find applications for other online problems.
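To make the objective concrete, the sketch below compares two textbook baselines on a single machine: running jobs in order of predicted processing time, and round robin (equal processor sharing), which needs no predictions. Both baselines and all function names are assumptions for illustration. The sketch implements neither the prediction-error measure nor the algorithms proposed in the article.

```python
def total_completion_time(processing, order):
    """Sum of completion times when jobs run to completion in the given order."""
    clock = 0.0
    total = 0.0
    for job in order:
        clock += processing[job]
        total += clock
    return total


def shortest_predicted_first(predicted):
    """Job order by increasing predicted processing time (ties by index)."""
    return sorted(range(len(predicted)), key=lambda job: (predicted[job], job))


def round_robin_total(processing):
    """Sum of completion times under equal sharing among unfinished jobs.

    The i-th smallest job (0-indexed) finishes once all smaller jobs are done
    and the remaining n - 1 - i larger jobs have each received as much
    service as it has.
    """
    sizes = sorted(processing)
    n = len(sizes)
    total = 0.0
    prefix = 0.0
    for i, size in enumerate(sizes):
        prefix += size
        total += prefix + (n - 1 - i) * size
    return total
```

On true processing times `[3, 1, 2]`, perfect predictions give a total completion time of 10, round robin gives 14, and the misleading predictions `[1, 3, 2]` also give 14. This is the kind of tension between trusting predictions and hedging against their errors that the abstract describes.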
Citations: 0