Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures: Latest Articles

Automatic HBM Management: Models and Algorithms

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-07-11 DOI: 10.1145/3490148.3538570

Daniel DeLayo, Kenny Zhang, Kunal Agrawal, M. A. Bender, Jonathan W. Berry, Rathish Das, Benjamin Moseley, C. Phillips

Some past and future supercomputer nodes incorporate High-Bandwidth Memory (HBM). Compared to standard DRAM, HBM has similar latency, higher bandwidth, and lower capacity. In this paper, we evaluate algorithms for managing High-Bandwidth Memory automatically. Previous work suggests that, in the worst case, performance is extremely sensitive to the policy for managing the channel to DRAM. Prior theory shows that a priority-based scheme (where there is a static strict priority order among p threads for channel access) is O(1)-competitive, but FIFO is not, and in the worst case is Ω(p)-competitive. Following this theoretical guidance would be a disruptive change for vendors, who currently use FIFO variants in their DRAM controller hardware. Our goal is to determine theoretically and empirically whether we can justify recommending investment in priority-based DRAM controller hardware. In order to experiment with DRAM channel protocols, we chose a theoretical model, validated it against real hardware, and implemented a basic simulator. We corroborated the previous theoretical results for the model, conducted a parameter sweep while running our simulator on address traces from memory-bandwidth-bound codes (GNU sort and TACO sparse matrix-vector product), and designed better channel-access algorithms.


Citations: 2

Massively Parallel Algorithms for b-Matching

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-07-11 DOI: 10.1145/3490148.3538589

M. Ghaffari, C. Grunau, Slobodan Mitrovic

This paper presents an O(log log d̄)-round massively parallel algorithm for (1 + ε)-approximation of maximum weighted b-matchings, using near-linear memory per machine. Here d̄ denotes the average degree in the graph and ε is an arbitrarily small positive constant. Recall that b-matching is the natural and well-studied generalization of the matching problem where different vertices are allowed to have different numbers of incident edges in the matching.
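
To make the b-matching constraint concrete, here is a minimal sequential sketch in Python. It is not the paper's massively parallel algorithm: `is_b_matching` only checks the degree constraint described above, and `greedy_b_matching` is a simple heaviest-edge-first baseline added for illustration. Both function names and the edge format `(u, v, weight)` are assumptions of this sketch.

```python
from collections import Counter


def is_b_matching(matching, b):
    """Check that every vertex v is incident to at most b[v] chosen edges."""
    load = Counter()
    for u, v, _weight in matching:
        load[u] += 1
        load[v] += 1
    return all(load[v] <= b.get(v, 0) for v in load)


def greedy_b_matching(edges, b):
    """Illustrative baseline: scan edges by decreasing weight and keep an
    edge whenever both endpoints still have spare capacity."""
    remaining = dict(b)
    chosen = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if u != v and remaining.get(u, 0) > 0 and remaining.get(v, 0) > 0:
            chosen.append((u, v, w))
            remaining[u] -= 1
            remaining[v] -= 1
    return chosen


edges = [("a", "b", 5), ("a", "c", 4), ("a", "d", 3), ("b", "c", 2)]
b = {"a": 2, "b": 1, "c": 1, "d": 1}  # vertex "a" may use two edges
m = greedy_b_matching(edges, b)
print(m, sum(w for _, _, w in m))
```

Setting every capacity to 1 recovers ordinary matching, which is why b-matching is described as its generalization.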

Citations: 2

Parallel Batch-Dynamic Algorithms for k-Core Decomposition and Related Graph Problems

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-07-11 DOI: 10.1145/3490148.3538569

Quanquan C. Liu, Jessica Shi, Shangdi Yu, Laxman Dhulipala, Julian Shun

Maintaining a k-core decomposition quickly in a dynamic graph has important applications in network analysis. The main challenge for designing efficient exact algorithms is that a single update to the graph can cause significant global changes. Our paper focuses on approximation algorithms with small approximation factors that are much more efficient than what exact algorithms can obtain.
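
As a point of reference for the claim that one update can cause global changes, the following is a minimal static peeling sketch in Python. It is the textbook sequential procedure, not the paper's parallel batch-dynamic algorithm, and the function name `core_numbers` and the edge-list input format are assumptions of this sketch.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {vertex: core number} for an undirected simple graph.

    Repeatedly removes a vertex of minimum remaining degree; the core
    number of a vertex is the largest minimum degree seen up to its removal.
    This simple version takes quadratic time and is meant for small inputs.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=lambda x: degree[x])
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


path = [(0, 1), (1, 2), (2, 3)]
print(core_numbers(path))             # every vertex has core number 1
print(core_numbers(path + [(3, 0)]))  # one added edge: every vertex moves to 2
```

In the example, inserting the single edge (3, 0) closes the path into a cycle and changes the core number of every vertex, which is the kind of global effect that makes exact dynamic maintenance hard.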

Citations: 2

PREP-UC: A Practical Replicated Persistent Universal Construction

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-07-11 DOI: 10.1145/3490148.3538568

Gaetano Coccimiglio, Trevor Brown, Srivatsan Ravi

The process of designing and implementing correct concurrent data structures is non-trivial and often error-prone. The recent commercial availability of non-volatile memory has prompted many researchers to also consider designing concurrent data structures that persist shared state, allowing the data structure to be recovered following a power failure. These so-called persistent concurrent data structures further complicate the process of achieving correct and efficient implementations. Universal constructions (UCs), which produce a concurrent object given a sequential object, have been studied extensively in the space of volatile shared memory as a means of more easily implementing correct concurrent data structures. In contrast, there are only a handful of persistent universal constructions (PUCs), which, beyond producing a concurrent object from a sequential object, guarantee that the object can be recovered following a crash. Existing PUCs satisfy the correctness condition of durable linearizability, which requires that operations are persisted before they complete. Satisfying the weaker correctness condition of buffered durable linearizability allows for improved performance at the cost of failing to recover some completed operations following a crash. In this work we design and implement both a buffered durable linearizable and a durable linearizable PUC based on the node replication UC. We demonstrate that we can achieve significantly better performance when satisfying buffered durable linearizability, while also restricting the maximum number of operations that can be lost after a crash.


Citations: 0

Online Parallel Paging with Optimal Makespan

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-07-11 DOI: 10.1145/3490148.3538577

Kunal Agrawal, M. A. Bender, Rathish Das, William Kuszmaul, E. Peserico, Michele Scquizzato

The classical paging problem can be described as follows: given a cache that can hold up to k pages (or blocks) and a sequence of requests to pages, how should we manage the cache so as to maximize performance or, in other words, complete the sequence as quickly as possible? Whereas this sequential paging problem has been well understood for decades, the parallel version, where the cache is shared among p processors each issuing its own sequence of page requests, has been much more resistant. In this problem we are given p request sequences R_1, R_2, ..., R_p, each of which accesses a disjoint set of pages, and we ask the question: how should the paging algorithm manage the cache to optimize the completion time of all sequences (i.e., the makespan)? As for the classical sequential problem, the goal is to design an online paging algorithm that achieves an optimal competitive ratio, using O(1) resource augmentation.
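
The problem setup can be illustrated with a small Python sketch. It is not the paper's algorithm: it uses plain LRU for the sequential case and a fixed (static) split of the cache for the parallel case, purely to show what "completion time" and "makespan" measure. The cost model (a hit takes 1 time unit, a miss takes `miss_cost` units) and all function names are assumptions of this sketch.

```python
from collections import OrderedDict


def lru_cost(requests, k, miss_cost=10):
    """Serve one request sequence with an LRU cache of k pages.

    Returns (completion_time, misses) under the assumed cost model.
    """
    if k < 1:
        raise ValueError("cache must hold at least one page")
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = misses = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)
            hits += 1
        else:
            misses += 1
            if len(cache) >= k:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return hits + misses * miss_cost, misses


def static_partition_makespan(sequences, shares, miss_cost=10):
    """Makespan when processor i keeps a private LRU cache of shares[i] pages."""
    return max(lru_cost(seq, k, miss_cost)[0] for seq, k in zip(sequences, shares))


sequences = [[1, 2, 1, 2], [3, 3, 3, 3]]
print(static_partition_makespan(sequences, [2, 1]))
print(static_partition_makespan(sequences, [1, 2]))
```

The two calls split the same three cache pages differently between two processors and produce different makespans, which hints at why deciding how to share the cache online is the hard part of the parallel problem.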

Citations: 3

Approximate Dynamic Balanced Graph Partitioning

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-07-11 DOI: 10.1145/3490148.3538563

Harald Räcke, Stefan Schmid, R. Zabrodin

Networked systems are increasingly flexible and reconfigurable. This enables demand-aware infrastructures whose resources can be adjusted according to the traffic pattern they currently serve. This paper revisits the dynamic balanced graph partitioning problem, a generalization of the classic balanced graph partitioning problem. We are given a set P of n = kℓ processes which communicate over time according to a given request sequence σ. The processes are assigned to ℓ servers (each of capacity k), and a scheduler can change this assignment dynamically to reduce communication costs, at cost α per node move. Avin et al. showed an Ω(k) lower bound on the competitive ratio of any deterministic online algorithm, even in a model with resource augmentation, and presented an O(k log k)-competitive online algorithm. We study the offline version of this problem where σ is known to the algorithm. Our main contribution is a polynomial-time algorithm which provides an O(log n)-approximation with resource augmentation. Our algorithm relies on an integer linear program formulation in a metric space with spreading constraints. We relax the formulation to a linear program and employ Bartal's clustering algorithm in a novel way to round it.
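
The objective in this problem can be sketched as a small cost evaluator in Python. It does not implement the paper's LP-based approximation; it only scores a given offline schedule. The unit cost for a request between processes on different servers, the `moves` format (a map from request index to the migrations performed just before that request), and the function name are assumptions of this sketch.

```python
from collections import Counter


def schedule_cost(initial, requests, moves, k, alpha):
    """Total cost of serving `requests` under a given migration schedule.

    initial:  {process: server} starting assignment
    requests: list of (u, v) communicating pairs, in order
    moves:    {request_index: [(process, new_server), ...]}
    k:        server capacity; alpha: cost per process move
    """
    assign = dict(initial)
    total = 0
    for t, (u, v) in enumerate(requests):
        for process, server in moves.get(t, []):
            if assign[process] != server:
                assign[process] = server
                total += alpha
        if max(Counter(assign.values()).values()) > k:
            raise ValueError(f"capacity exceeded before request {t}")
        if assign[u] != assign[v]:
            total += 1  # assumed unit cost for inter-server communication
    return total


initial = {"a": 0, "b": 0, "c": 1, "d": 1}
requests = [("a", "c")] * 5
print(schedule_cost(initial, requests, {}, k=2, alpha=1))
print(schedule_cost(initial, requests, {0: [("b", 1), ("c", 0)]}, k=2, alpha=1))
```

In the example, paying for two moves up front to co-locate "a" and "c" is cheaper than paying for five remote requests, which is the trade-off an offline algorithm can exploit because it knows σ in advance.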


Citations: 2

Balancing Flow Time and Energy Consumption

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-06-03 DOI: 10.1145/3490148.3538582

Sami Davies, S. Khuller, Shi-Han Zhang

In this paper, we study the following batch scheduling model: find a schedule that minimizes total flow time for n uniform-length jobs, with release times and deadlines, where the machine is only actively processing jobs in at most k synchronized batches of size at most B. Prior work on such batch scheduling models has considered only feasibility with no regard to the flow time of the schedule. However, algorithms that minimize the cost from the scheduler's perspective, such as ones that minimize the active time of the processor, can result in schedules where the total flow time is arbitrarily high [15]. Such schedules are not valuable from the perspective of the client. In response, our work provides dynamic programs which minimize flow time subject to active time constraints. Our main contribution focuses on jobs with agreeable deadlines; for such job instances, we introduce dynamic programs that achieve runtimes of O(B · k · n) for unit jobs and O(B · k · n^5) for uniform-length jobs. These results improve upon our modification of a different, classical dynamic programming approach by Baptiste. While the modified DP works when deadlines are non-agreeable, this solution is more expensive, with runtime O(B · k^2 · n^7) [7].
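
The tension between active time and flow time can be seen with a small schedule evaluator in Python. It is not one of the paper's dynamic programs; it only validates a given schedule and sums flow time, taking a job's flow time to be its completion time minus its release time. The input formats and the function name are assumptions of this sketch.

```python
def total_flow_time(jobs, batches, B, k, length=1):
    """Validate a batch schedule and return its total flow time.

    jobs:    {job: (release, deadline)}
    batches: list of (start_time, [jobs run together in that batch])
    B:       maximum batch size; k: maximum number of batches
    length:  common processing time of every job
    """
    if len(batches) > k:
        raise ValueError("more than k batches")
    scheduled = set()
    total = 0
    for start, members in batches:
        if len(members) > B:
            raise ValueError("batch larger than B")
        finish = start + length
        for job in members:
            release, deadline = jobs[job]
            if job in scheduled or start < release or finish > deadline:
                raise ValueError(f"job {job} is scheduled infeasibly")
            scheduled.add(job)
            total += finish - release
    if scheduled != set(jobs):
        raise ValueError("some job is not scheduled")
    return total


jobs = {1: (0, 4), 2: (0, 4), 3: (2, 4)}
print(total_flow_time(jobs, [(0, [1, 2]), (2, [3])], B=2, k=2))
print(total_flow_time(jobs, [(2, [1, 2, 3])], B=3, k=1))
```

The second schedule uses a single batch, so the machine is active for less time, but jobs 1 and 2 wait for job 3 to be released and the total flow time grows; this is the effect the abstract describes for active-time-only objectives.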


Citations: 1

Fully Polynomial-Time Distributed Computation in Low-Treewidth Graphs

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-05-30 DOI: 10.1145/3490148.3538590

Taisuke Izumi, Naoki Kitamura, Takamasa Naruse, Gregory Schwartzman

We consider global problems, i.e. problems that take at least diameter time, even when the bandwidth is not restricted. We show that all problems considered admit efficient solutions in low-treewidth graphs.

Citations: 4

Adaptive Massively Parallel Algorithms for Cut Problems

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-05-27 DOI: 10.1145/3490148.3538576

M. Hajiaghayi, Marina Knittel, J. Olkowski, Hamed Saleh

We study the Weighted Min Cut problem in the Adaptive Massively Parallel Computation (AMPC) model. In 2019, Behnezhad et al. [3] introduced the AMPC model as an extension of the Massively Parallel Computation (MPC) model. In the past decade, research on highly scalable algorithms has had significant impact on many massive systems. The MPC model, introduced in 2010 by Karloff et al. [16], which is an abstraction of famous practical frameworks such as MapReduce, Hadoop, Flume, and Spark, has been at the forefront of this research. While great strides have been taken to create highly efficient MPC algorithms for a range of problems, recent progress has been limited by the 1-vs-2 Cycle Conjecture [20], which postulates that the simple problem of distinguishing between one and two cycles requires Ω(log n) MPC rounds. In the AMPC model, each machine has adaptive read access to a distributed hash table even when communication is restricted (i.e., in the middle of a round). While remaining practical [4], this gives algorithms the power to bypass limitations like the 1-vs-2 Cycle Conjecture.
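
For readers unfamiliar with the 1-vs-2 cycle problem mentioned above, here is a sequential Python sketch of what is being decided: given a graph in which every vertex has exactly two neighbors, is it one cycle or two? Sequentially this is a simple walk; the conjecture concerns how many MPC rounds are needed, which this sketch does not model. The helper `ring` and the input format are assumptions of this sketch.

```python
def ring(vertices):
    """Build {v: (previous, next)} for a single cycle through `vertices`."""
    n = len(vertices)
    return {vertices[i]: (vertices[i - 1], vertices[(i + 1) % n]) for i in range(n)}


def count_cycles(neighbors):
    """Count cycles in a graph where every vertex has exactly two neighbors."""
    seen = set()
    cycles = 0
    for start in neighbors:
        if start in seen:
            continue
        cycles += 1
        prev, cur = None, start
        while cur not in seen:
            seen.add(cur)
            a, b = neighbors[cur]
            nxt = b if a == prev else a  # keep walking away from where we came
            prev, cur = cur, nxt
    return cycles


one = ring([0, 1, 2, 3, 4, 5])
two = {**ring([0, 1, 2]), **ring([3, 4, 5])}
print(count_cycles(one), count_cycles(two))
```

Each step of the walk depends on the result of the previous lookup. Adaptive reads of a shared table within a round, as the AMPC model allows, are what let a machine follow such a chain of lookups without waiting for a new round.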

Citations: 2

Brief Announcement: Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-05-26 DOI: 10.1145/3490148.3538552

Hussam Al Daas, Grey Ballard, L. Grigori, Suraj Kumar, Kathryn Rouse

Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored constant factors or not obtained the tightest possible values. The main result of this work is establishing memory-independent communication lower bounds with tight constants for parallel matrix multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the matrix aspect ratios and the number of processors.

Citations: 3
