
Latest Publications: ACM Transactions on Parallel Computing

Sixteen Heuristics for Joint Optimization of Performance, Energy, and Temperature in Allocating Tasks to Multi-Cores
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2016-08-08 DOI: 10.1145/2948973
Hafiz Fahad Sheikh, I. Ahmad
Three-way joint optimization of performance (P), energy (E), and temperature (T) in scheduling parallel tasks to multiple cores poses a challenge that is staggering in its computational complexity. The goal of the PET optimized scheduling (PETOS) problem is to minimize three quantities: the completion time of a task graph, the total energy consumption, and the peak temperature of the system. Algorithms based on conventional multi-objective optimization techniques can be designed for solving the PETOS problem. But their execution times are exceedingly high and hence their applicability is restricted merely to problems of modest size. Exacerbating the problem is the solution space that is typically a Pareto front since no single solution can be strictly best along all three objectives. Thus, not only is the absolute quality of the solutions important but “the spread of the solutions” along each objective and the distribution of solutions within the generated tradeoff front are also desired. A natural alternative is to design efficient heuristic algorithms that can generate good solutions as well as good spreads -- note that most of the prior work in energy-efficient task allocation is predominantly single- or dual-objective oriented. Given a directed acyclic graph (DAG) representing a parallel program, a heuristic encompasses policies as to what tasks should go to what cores and at what frequency should that core operate. Various policies, such as greedy, iterative, and probabilistic, can be employed. However, the choice and usage of these policies can influence a heuristic towards a particular objective and can also profoundly impact its performance. This article proposes 16 heuristics that utilize various methods for task-to-core allocation and frequency selection. This article also presents a methodical classification scheme which not only categorizes the proposed heuristics but can also accommodate additional heuristics. 
Extensive simulation experiments compare these algorithms while shedding light on their strengths and tradeoffs.
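The abstract does not reproduce the heuristics themselves, but the Pareto-front notion it relies on is easy to illustrate. The sketch below computes the non-dominated set of candidate schedules, each scored by the three PETOS objectives; the `pareto_front` helper and all numbers are invented for illustration:

```python
def pareto_front(solutions):
    """Return the non-dominated subset of (makespan, energy, peak_temp) triples.

    A solution dominates another if it is no worse in every objective and
    strictly better in at least one (all three objectives are minimized).
    """
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t != s)]

# Hypothetical schedules: (completion time, energy, peak temperature)
candidates = [(10.0, 50.0, 70.0), (12.0, 40.0, 75.0), (11.0, 55.0, 72.0)]
front = pareto_front(candidates)  # the third candidate is dominated by the first
```

Because no single point on the front is strictly best, the spread of the surviving solutions along each objective is itself a quality measure, as the abstract notes.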
Citations: 16
Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2016-03-17 DOI: 10.1145/3015144
Grey Ballard, Alex Druinsky, Nicholas Knight, O. Schwartz
We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation. We show that identifying a communication-optimal algorithm for particular input matrices is equivalent to solving a hypergraph partitioning problem. Our approach is nonzero structure dependent, meaning that we seek the best algorithm for the given input matrices. In addition to our three-dimensional fine-grained model, we also propose coarse-grained one-dimensional and two-dimensional models that correspond to simpler SpGEMM algorithms. We explore the relations between our models theoretically, and we study their performance experimentally in the context of three applications that use SpGEMM as a key computation. For each application, we find that at least one coarse-grained model is as communication efficient as the fine-grained model. We also observe that different applications have affinities for different algorithms. Our results demonstrate that hypergraphs are an accurate model for reasoning about the communication costs of SpGEMM as well as a practical tool for exploring the SpGEMM algorithm design space.
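As a rough illustration of the fine-grained model described above, each scalar multiplication A[i,k]·B[k,j] can be treated as a hypergraph vertex, with three families of nets grouping the multiplications that touch the same entry of A, of B, or of the output C. The construction below is a simplified sketch of this idea, not the paper's actual formulation:

```python
from collections import defaultdict

def fine_grained_hypergraph(A, B):
    """Build a toy fine-grained hypergraph for C = A * B.

    A and B are sets of (row, col) nonzero coordinates. Each scalar
    multiplication A[i,k] * B[k,j] becomes a vertex (i, k, j); nets
    collect the vertices that share an entry of A, B, or C. Partitioning
    these nets is what governs communication volume in the model.
    """
    vertices = [(i, k, j) for (i, k) in A for (k2, j) in B if k == k2]
    nets = defaultdict(set)
    for (i, k, j) in vertices:
        nets[('A', i, k)].add((i, k, j))   # vertices reading the same A entry
        nets[('B', k, j)].add((i, k, j))   # vertices reading the same B entry
        nets[('C', i, j)].add((i, k, j))   # vertices contributing to the same C entry
    return vertices, dict(nets)

# Hypothetical 2x2 nonzero structures
A = {(0, 0), (0, 1), (1, 1)}
B = {(0, 0), (1, 0), (1, 1)}
verts, nets = fine_grained_hypergraph(A, B)
```

The coarse-grained one- and two-dimensional models the abstract mentions can be seen as constraining how these vertices may be grouped before partitioning.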
Citations: 46
Time-Warp: Efficient Abort Reduction in Transactional Memory
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2015-07-08 DOI: 10.1145/2775435
Nuno Diegues, P. Romano
The multicore revolution that took place one decade ago has turned parallel programming into a major concern for the mainstream software development industry. In this context, Transactional Memory (TM) has emerged as a simpler and more attractive alternative to lock-based synchronization, whose complexity and error-proneness are widely recognized. The notion of permissiveness in TM translates to only aborting a transaction when it cannot be accepted in any history that guarantees a target correctness criterion. This theoretically powerful property is often neglected by state-of-the-art TMs because it imposes considerable algorithmic costs. Instead, these TMs opt to maximize their implementation’s efficiency by aborting transactions under overly conservative conditions. As a result, they risk rejecting a significant number of safe executions. In this article, we seek to identify a sweet spot between permissiveness and efficiency by introducing the Time-Warp Multiversion (TWM) algorithm. TWM is based on the key idea of allowing an update transaction that has performed stale reads (i.e., missed the writes of concurrently committed transactions) to be serialized by “committing it in the past,” which we call a time-warp commit. At its core, TWM uses a novel, lightweight validation mechanism with little computational overhead. TWM also guarantees that read-only transactions can never be aborted. Further, TWM guarantees Virtual World Consistency, a safety property that is deemed particularly relevant in the context of TM. We demonstrate the practicality of this approach through an extensive experimental study: we compare TWM with five other TMs, representative of typical alternative design choices, and on a wide variety of benchmarks. This study shows an average performance improvement across all considered workloads and TMs of 65% in high concurrency scenarios, with gains extending up to 9× with the most favorable benchmarks.
These results are a consequence of TWM’s ability to achieve drastic reduction of aborts in scenarios of nonminimal contention, while introducing little overhead (approximately 10%) in worst-case, synthetically designed scenarios (i.e., no contention or contention patterns that cannot be optimized using TWM).
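A time-warp commit presupposes a multiversioned store in which a transaction reads the latest version at or before its snapshot timestamp. The toy register below sketches only that substrate; the class, timestamps, and values are illustrative, and TWM's actual validation mechanism is described in the article, not reproduced here:

```python
import bisect

class MVRegister:
    """Toy multiversioned register: a read at snapshot timestamp ts returns
    the latest value committed at or before ts. A time-warp commit would
    insert an update transaction's writes at a timestamp in the past,
    i.e., before concurrently committed writes it failed to observe."""

    def __init__(self, initial=0):
        self.timestamps = [0]
        self.values = [initial]

    def commit(self, ts, value):
        # Keep versions sorted by commit timestamp.
        i = bisect.bisect_right(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)

    def read(self, snapshot_ts):
        i = bisect.bisect_right(self.timestamps, snapshot_ts) - 1
        return self.values[i]

r = MVRegister()
r.commit(10, 'a')
r.commit(20, 'b')
# A transaction with snapshot 15 still observes 'a', even after 'b' commits.
```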
Citations: 6
Supporting Time-Based QoS Requirements in Software Transactional Memory
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2015-07-08 DOI: 10.1145/2779621
Walther Maldonado, P. Marlier, P. Felber, J. Lawall, Gilles Muller, E. Rivière
Software transactional memory (STM) is an optimistic concurrency control mechanism that simplifies parallel programming. However, there has been little interest in its applicability to reactive applications in which there is a required response time for certain operations. We propose supporting such applications by allowing programmers to associate time with atomic blocks in the form of deadlines and quality-of-service (QoS) requirements. Based on statistics of past executions, we adjust the execution mode of transactions by decreasing the level of optimism as the deadline approaches. In the presence of concurrent deadlines, we propose different conflict resolution policies. Execution mode switching mechanisms allow the meeting of multiple deadlines in a consistent manner, with potential QoS degradations being split fairly among several threads as contention increases, and avoiding starvation. Our implementation consists of extensions to an STM runtime that allow gathering statistics and switching execution modes. We also propose novel contention managers adapted to transactional workloads subject to deadlines. The experimental evaluation shows that our approaches significantly improve the likelihood of a transaction meeting its deadline and QoS requirement, even in cases where progress is hampered by conflicts and other concurrent transactions with deadlines.
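The "decrease the level of optimism as the deadline approaches" policy can be sketched as a mode selector driven by the remaining slack and statistics of past attempts. The modes, thresholds, and parameter names below are invented for illustration and are not the paper's actual policy:

```python
def choose_mode(time_left, mean_attempt, p99_attempt):
    """Pick an execution mode for an atomic block from its remaining slack.

    time_left: time remaining until the deadline.
    mean_attempt / p99_attempt: mean and 99th-percentile durations of past
    attempts of this atomic block (the kind of statistics the runtime gathers).
    """
    if time_left > 5 * mean_attempt:
        return 'optimistic'    # plenty of slack: plain speculative execution
    if time_left > p99_attempt:
        return 'visible-read'  # less optimism: conflicts detected eagerly
    return 'irrevocable'       # last chance: cannot abort, guaranteed to finish
```

A contention manager adapted to deadlines, as the abstract describes, would additionally arbitrate between transactions whose slack has run out at the same time.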
Citations: 2
TRADE: Precise Dynamic Race Detection for Scalable Transactional Memory Systems
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2015-07-08 DOI: 10.1145/2786021
Gokcen Kestor, O. Unsal, A. Cristal, S. Tasiran
Like other multithreaded programs, transactional memory (TM) programs are prone to race conditions. Previous work focuses on extending existing definitions of data race for lock-based applications to TM applications, which requires all transactions to be totally ordered “as if” serialized by a global lock. This approach poses implementation constraints on the STM that severely limit TM applications’ performance. This article shows that forcing total ordering among all running transactions, while sufficient, is not necessary. We introduce an alternative data race definition, relaxed transactional data race, that requires ordering of only conflicting transactions. The advantages of our relaxed definition are twofold: First, unlike the previous definition, this definition can be applied to a wide range of TMs, including those that do not enforce transaction total ordering. Second, within a single execution, it exposes a higher number of data races, which considerably reduces debugging time. Based on this definition, we propose a novel and precise race detection tool for C/C++ TM applications (TRADE), which detects data races by tracking happens-before edges among conflicting transactions. Our experiments reveal that TRADE precisely detects data races for STAMP applications running on modern STMs with overhead comparable to state-of-the-art race detectors for lock-based applications. Our experiments also show that in a single run, TRADE identifies several races not discovered by 10 separate runs of a race detection tool based on the previous data race definition.
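Happens-before tracking of the kind the abstract describes is commonly implemented with vector clocks: two conflicting accesses race exactly when their clocks are concurrent. The sketch below illustrates that test only; it is a generic vector-clock check under the relaxed definition's spirit, not TRADE's implementation:

```python
def happens_before(a, b):
    """True iff vector clock a precedes b (pointwise <=, and a != b)."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def is_race(a, b):
    """Two *conflicting* accesses race iff their clocks are concurrent:
    neither transaction's commit happens-before the other's start."""
    return not happens_before(a, b) and not happens_before(b, a)

# Hypothetical clocks for three transactions on two threads
t1, t2, t3 = (1, 0), (2, 1), (0, 1)
# t1 precedes t2 (no race); t1 and t3 are concurrent (a race if they conflict)
```

Under the relaxed definition, this check is applied only to pairs of conflicting transactions, which is why no total order over all transactions is needed.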
Citations: 3
Remote Memory Access Programming in MPI-3
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2015-07-08 DOI: 10.1145/2780584
T. Hoefler, James Dinan, R. Thakur, Brian W. Barrett, P. Balaji, W. Gropp, K. Underwood
The Message Passing Interface (MPI) 3.0 standard, introduced in September 2012, includes a significant update to the one-sided communication interface, also known as remote memory access (RMA). In particular, the interface has been extended to better support popular one-sided and global-address-space parallel programming models, to provide better access to hardware performance features, and to enable new data-access modes. We present the new RMA interface and specify formal axiomatic models for data consistency and access semantics. Such models can help users reason about details of the semantics that are hard to extract from the English prose in the standard. They also foster the development of tools and compilers, enabling them to automatically analyze, optimize, and debug RMA programs.
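A central point of the RMA semantics is that a put is not guaranteed to be complete at the target until a synchronization or flush. The in-process toy below models only that completion rule; it is a conceptual sketch, not MPI (real code would use `MPI_Win_create`, `MPI_Put`, and `MPI_Win_flush`):

```python
class ToyWindow:
    """In-process toy of RMA completion semantics: a put() is buffered and
    only becomes visible in the target memory after flush(). This mirrors
    the rule that one-sided operations complete at synchronization points,
    not at the moment they are issued."""

    def __init__(self, size):
        self.memory = [0] * size   # the exposed window memory
        self.pending = []          # issued but uncompleted puts

    def put(self, index, value):
        self.pending.append((index, value))  # not yet visible at the target

    def flush(self):
        for index, value in self.pending:
            self.memory[index] = value
        self.pending.clear()

w = ToyWindow(4)
w.put(2, 7)
before = w.memory[2]  # still the old value: the put has not completed
w.flush()             # now the put is complete at the target
```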
Citations: 94
Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2014-11-16 DOI: 10.1145/2897189
A. Benoit, Aurélien Cavelan, Y. Robert, Hongyang Sun
In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bicriteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via Dynamic Voltage and Frequency Scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
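The classical Young/Daly formula that the article extends gives, to first order, the checkpointing period W = sqrt(2·mu·C) for fail-stop errors with MTBF mu and checkpoint cost C. The sketch below computes it and checks numerically that it minimizes the first-order waste C/W + W/(2·mu); the verification-augmented periods for silent errors are the article's contribution and are not reproduced here:

```python
import math

def young_daly_period(mtbf, checkpoint_cost):
    """First-order optimal checkpointing period W = sqrt(2 * mu * C)
    for fail-stop errors (Young/Daly)."""
    return math.sqrt(2 * mtbf * checkpoint_cost)

def waste(period, mtbf, checkpoint_cost):
    """First-order waste fraction: checkpointing overhead per period,
    plus expected re-execution after a failure (half a period on average)."""
    return checkpoint_cost / period + period / (2 * mtbf)

# Hypothetical platform: MTBF of one hour, 60 s to write a checkpoint
mu, C = 3600.0, 60.0
w_opt = young_daly_period(mu, C)   # about 657 s between checkpoints
```

Adding a verification of cost V before each checkpoint, as the article does, changes both the per-period overhead and the rework term, hence a different optimal period.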
引用次数: 34
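As a concrete illustration of the first-order reasoning in this abstract, the classical Young/Daly period for fail-stop errors is W* = sqrt(2μC) for checkpoint cost C and mean time between failures μ. A hedged sketch of an extension with a verification step before each checkpoint is shown below; the combined-error formula (with error rates `lambda_f`, `lambda_s`, our notation) is only our reading of the abstract's description, not the paper's exact expression:

```python
import math

def young_daly(mu_f: float, c: float) -> float:
    """Classical Young/Daly first-order optimal checkpoint period for
    fail-stop errors only: W* = sqrt(2 * mu * C)."""
    return math.sqrt(2.0 * mu_f * c)

def period_with_verification(lambda_f: float, lambda_s: float,
                             v: float, c: float) -> float:
    """First-order period when a verification (cost V) precedes each
    checkpoint (cost C). A fail-stop error loses W/2 work on average,
    while a silent error (caught at the verification) loses the whole
    period W, suggesting W* = sqrt((V + C) / (lambda_f/2 + lambda_s)).
    The exact constants are an assumption, not taken from the paper."""
    return math.sqrt((v + c) / (lambda_f / 2.0 + lambda_s))

# With no silent errors and no verification cost, the extended formula
# collapses back to Young/Daly, as a sanity check:
w1 = young_daly(3600.0, 60.0)
w2 = period_with_verification(1.0 / 3600.0, 0.0, 0.0, 60.0)
```

With lambda_s = 0 and V = 0 the two functions agree, which is the minimal consistency one should expect from any such extension.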
Parallel Scheduling of Task Trees with Limited Memory
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2014-10-01 DOI: 10.1145/2779052
Lionel Eyraud-Dubois, L. Marchal, O. Sinnen, F. Vivien
This article investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data can only be removed from memory after the completion of the task that uses it as an input data. Such trees arise in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor, the objective of the tree traversal is to minimize the required memory. This problem was well studied, and optimal polynomial algorithms were proposed. Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective to minimize the time needed to traverse the tree—that is, to minimize the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results even for unit weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given memory limit. The different heuristics are evaluated in an extensive experimental evaluation using realistic trees.
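The memory model described above can be made concrete with a small sketch: edge weights are data sizes, executing a task requires all of its children's outputs plus its own output to be resident, and the children's outputs are freed once the task completes. The fixed left-to-right postorder below is just one traversal order among many; choosing the order that minimizes the peak is the problem the article studies.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    out_size: int                      # size of the data sent to the parent
    children: List["Task"] = field(default_factory=list)

def peak_memory(root: Task) -> int:
    """Peak memory of one fixed (left-to-right postorder) sequential
    traversal: a task needs all children outputs plus its own output in
    memory, and children outputs are freed when it completes."""
    resident = 0                       # total size of data currently held
    peak = 0

    def visit(t: Task) -> None:
        nonlocal resident, peak
        for c in t.children:
            visit(c)
        need = resident + t.out_size   # memory footprint while t executes
        peak = max(peak, need)
        # free the children's outputs, keep t's own output resident
        resident = need - sum(c.out_size for c in t.children)

    visit(root)
    return peak

# A root producing 2 units from two children producing 5 and 3 units
# peaks at 5 + 3 + 2 = 10 units when the root executes.
tree = Task(2, [Task(5), Task(3)])
```

With one processor the objective is to pick the order minimizing this peak; with several processors, makespan enters the trade-off, which is what makes the parallel problem much harder.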
Citations: 31
Simple Parallel and Distributed Algorithms for Spectral Graph Sparsification
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2014-02-16 DOI: 10.1145/2948062
I. Koutis, S. Xu
We describe simple algorithms for spectral graph sparsification, based on iterative computations of weighted spanners and sampling. Leveraging the algorithms of Baswana and Sen for computing spanners, we obtain the first distributed spectral sparsification algorithm in the CONGEST model. We also obtain a parallel algorithm with improved work and time guarantees, as well as other natural distributed implementations. Combining this algorithm with the parallel framework of Peng and Spielman for solving symmetric diagonally dominant linear systems, we get a parallel solver that is significantly more efficient in terms of the total work.
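A hedged sketch of one sampling round in this keep-the-spanner / sample-the-rest style is shown below. A union-find spanning forest stands in for the Baswana–Sen spanner — a deliberate simplification that keeps the code self-contained, not the structure the paper actually uses — and off-spanner edges survive with probability p, with weights rescaled by 1/p so the graph is preserved in expectation.

```python
import random

def spanning_forest(edges):
    """Union-find spanning forest over the edge set -- a stand-in for
    the Baswana-Sen spanner of the actual algorithm (simplification)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    kept = set()
    for (u, v) in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            kept.add((u, v))
    return kept

def sparsify_round(edges, p=0.25, rng=random):
    """One round of keep-spanner / sample-the-rest: spanner edges are
    kept as-is; every other edge survives with probability p and has
    its weight scaled by 1/p (unbiased in expectation)."""
    kept = spanning_forest(edges)
    out = {}
    for (u, v), w in edges.items():
        if (u, v) in kept:
            out[(u, v)] = w
        elif rng.random() < p:
            out[(u, v)] = w / p
    return out

# A 6-node path plus three chords: the forest retains the path, and
# each chord is either dropped or kept with weight 4.0.
random.seed(0)
edges = {(i, i + 1): 1.0 for i in range(5)}
edges.update({(0, 2): 1.0, (0, 3): 1.0, (1, 4): 1.0})
sparse = sparsify_round(edges)
```

Because the forest (like a real spanner) spans every component, connectivity is never lost, and repeating such rounds drives the edge count down geometrically — the spectral guarantees of the paper, of course, rely on the true spanner-based sampling, not on this toy stand-in.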
Citations: 44