Daniel DeLayo, Kenny Zhang, Kunal Agrawal, M. A. Bender, Jonathan W. Berry, Rathish Das, Benjamin Moseley, C. Phillips
Some past and future supercomputer nodes incorporate High- Bandwidth Memory (HBM). Compared to standard DRAM, HBM has similar latency, higher bandwidth and lower capacity. In this paper, we evaluate algorithms for managing High- Bandwidth Memory automatically. Previous work suggests that, in the worst case, performance is extremely sensitive to the policy for managing the channel to DRAM. Prior theory shows that a priority-based scheme (where there is a static strict priority-order among p threads for channel access) is O(1)-competitive, but FIFO is not, and in the worst case is Ω(p) competitive. Following this theoretical guidance would be a disruptive change for vendors, who currently use FIFO variants in their DRAMcontroller hardware. Our goal is to determine theoretically and empirically whether we can justify recommending investment in priority-based DRAM controller hardware. In order to experiment with DRAM channel protocols, we chose a theoretical model, validated it against real hardware, and implemented a basic simulator. We corroborated the previous theoretical results for the model, conducted a parameter sweep while running our simulator on address traces from memory bandwidth-bound codes (GNU sort and TACO sparse matrix-vector product), and designed better channel-access algorithms.
{"title":"Automatic HBM Management: Models and Algorithms","authors":"Daniel DeLayo, Kenny Zhang, Kunal Agrawal, M. A. Bender, Jonathan W. Berry, Rathish Das, Benjamin Moseley, C. Phillips","doi":"10.1145/3490148.3538570","DOIUrl":"https://doi.org/10.1145/3490148.3538570","url":null,"abstract":"Some past and future supercomputer nodes incorporate High- Bandwidth Memory (HBM). Compared to standard DRAM, HBM has similar latency, higher bandwidth and lower capacity. In this paper, we evaluate algorithms for managing High- Bandwidth Memory automatically. Previous work suggests that, in the worst case, performance is extremely sensitive to the policy for managing the channel to DRAM. Prior theory shows that a priority-based scheme (where there is a static strict priority-order among p threads for channel access) is O(1)-competitive, but FIFO is not, and in the worst case is Ω(p) competitive. Following this theoretical guidance would be a disruptive change for vendors, who currently use FIFO variants in their DRAMcontroller hardware. Our goal is to determine theoretically and empirically whether we can justify recommending investment in priority-based DRAM controller hardware. In order to experiment with DRAM channel protocols, we chose a theoretical model, validated it against real hardware, and implemented a basic simulator. We corroborated the previous theoretical results for the model, conducted a parameter sweep while running our simulator on address traces from memory bandwidth-bound codes (GNU sort and TACO sparse matrix-vector product), and designed better channel-access algorithms.","PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114934422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents an O(log log đ) round massively parallel algorithm for 1 + ε approximation of maximum weighted b-matchings, using near-linear memory per machine. Here đ denotes the average degree in the graph and ε is an arbitrarily small positive constant. Recall that b-matching is the natural and well-studied generalization of the matching problem where different vertices are allowed to have different numbers of incident edges in the matching.
{"title":"Massively Parallel Algorithms for b-Matching","authors":"M. Ghaffari, C. Grunau, Slobodan Mitrovic","doi":"10.1145/3490148.3538589","DOIUrl":"https://doi.org/10.1145/3490148.3538589","url":null,"abstract":"This paper presents an O(log log đ) round massively parallel algorithm for 1 + ε approximation of maximum weighted b-matchings, using near-linear memory per machine. Here đ denotes the average degree in the graph and ε is an arbitrarily small positive constant. Recall that b-matching is the natural and well-studied generalization of the matching problem where different vertices are allowed to have different numbers of incident edges in the matching.","PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129738266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maintaining a k-core decomposition quickly in a dynamic graph has important applications in network analysis. The main challenge for designing efficient exact algorithms is that a single update to the graph can cause significant global changes. Our paper focuses on approximation algorithms with small approximation factors that are much more efficient than what exact algorithms can obtain.
{"title":"Parallel Batch-Dynamic Algorithms for k-Core Decomposition and Related Graph Problems","authors":"Quanquan C. Liu, Jessica Shi, Shangdi Yu, Laxman Dhulipala, Julian Shun","doi":"10.1145/3490148.3538569","DOIUrl":"https://doi.org/10.1145/3490148.3538569","url":null,"abstract":"Maintaining a k-core decomposition quickly in a dynamic graph has important applications in network analysis. The main challenge for designing efficient exact algorithms is that a single update to the graph can cause significant global changes. Our paper focuses on approximation algorithms with small approximation factors that are much more efficient than what exact algorithms can obtain.","PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115042970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The process of designing and implementing correct concurrent data structures is non-trivial and often error prone. The recent commercial availability of non-volatile memory has prompted many researchers to also consider designing concurrent data structures that persist shared state allowing the data structure to be recovered following a power failure. These so called persistent concurrent data structures further complicate the process of achieving correct and efficient implementations. Universal constructions (UCs) which produce a concurrent object given a sequential object, have been studied extensively in the space of volatile shared memory as a means of more easily implementing correct concurrent data structures. In contrast, there are only a handful of persistent universal constructions (PUCs) which beyond producing a concurrent object from a sequential object, guarantees that the object can be recovered following a crash. Existing PUCs satisfy the correctness condition of durable linearizability which requires that operations are persisted before they complete. Satisfying the weaker correctness condition of buffered durable linearizability allows for improved performance at the cost of failing to recover some completed operations following a crash. In this work we design and implement both a buffered durable linearizable and a durable linearizable PUC based on the node replication UC. We demonstrate that we can achieve significantly better performance satisfying buffered durable linearizability while also restricting the maximum number of operations that can be lost after a crash.
{"title":"PREP-UC: A Practical Replicated Persistent Universal Construction","authors":"Gaetano Coccimiglio, Trevor Brown, Srivatsan Ravi","doi":"10.1145/3490148.3538568","DOIUrl":"https://doi.org/10.1145/3490148.3538568","url":null,"abstract":"The process of designing and implementing correct concurrent data structures is non-trivial and often error prone. The recent commercial availability of non-volatile memory has prompted many researchers to also consider designing concurrent data structures that persist shared state allowing the data structure to be recovered following a power failure. These so called persistent concurrent data structures further complicate the process of achieving correct and efficient implementations. Universal constructions (UCs) which produce a concurrent object given a sequential object, have been studied extensively in the space of volatile shared memory as a means of more easily implementing correct concurrent data structures. In contrast, there are only a handful of persistent universal constructions (PUCs) which beyond producing a concurrent object from a sequential object, guarantees that the object can be recovered following a crash. Existing PUCs satisfy the correctness condition of durable linearizability which requires that operations are persisted before they complete. Satisfying the weaker correctness condition of buffered durable linearizability allows for improved performance at the cost of failing to recover some completed operations following a crash. In this work we design and implement both a buffered durable linearizable and a durable linearizable PUC based on the node replication UC. We demonstrate that we can achieve significantly better performance satisfying buffered durable linearizability while also restricting the maximum number of operations that can be lost after a crash.","PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132529403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kunal Agrawal, M. A. Bender, Rathish Das, William Kuszmaul, E. Peserico, Michele Scquizzato
The classical paging problem can be described as follows: given a cache that can hold up to k pages (or blocks) and a sequence of requests to pages, how should we manage the cache so as to maximize performance-or, in other words, complete the sequence as quickly as possible. Whereas this sequential paging problem has been well understood for decades, the parallel version, where the cache is shared among p processors each issuing its own sequence of page requests, has been much more resistant. In this problem we are given p request sequences R1, R2, . . . , Rp , each of which accesses a disjoint set of pages, and we ask the question: how should the paging algorithm manage the cache to optimize the completion time of all sequences (i.e., the makespan). As for the classical sequential problem, the goal is to design an online paging algorithm that achieves an optimal competitive ratio, using O(1) resource augmentation.
{"title":"Online Parallel Paging with Optimal Makespan","authors":"Kunal Agrawal, M. A. Bender, Rathish Das, William Kuszmaul, E. Peserico, Michele Scquizzato","doi":"10.1145/3490148.3538577","DOIUrl":"https://doi.org/10.1145/3490148.3538577","url":null,"abstract":"The classical paging problem can be described as follows: given a cache that can hold up to k pages (or blocks) and a sequence of requests to pages, how should we manage the cache so as to maximize performance-or, in other words, complete the sequence as quickly as possible. Whereas this sequential paging problem has been well understood for decades, the parallel version, where the cache is shared among p processors each issuing its own sequence of page requests, has been much more resistant. In this problem we are given p request sequences R1, R2, . . . , Rp , each of which accesses a disjoint set of pages, and we ask the question: how should the paging algorithm manage the cache to optimize the completion time of all sequences (i.e., the makespan). As for the classical sequential problem, the goal is to design an online paging algorithm that achieves an optimal competitive ratio, using O(1) resource augmentation.","PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"27 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132530771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Networked systems are increasingly flexible and reconfigurable. This enables demand-aware infrastructures whose resources can be adjusted according to the traffic pattern they currently serve. This paper revisits the dynamic balanced graph partitioning problem, a generalization of the classic balanced graph partitioning problem. We are given a set P of n = kℓ processes which communicate over time according to a given request sequence σ. The processes are assigned to ℓ servers (each of capacity k), and a scheduler can change this assignment dynamically to reduce communication costs, at cost α per node move. Avin et al. showed an Ω(k) lower bound on the competitive ratio of any deterministic online algorithm, even in a model with resource augmentation, and presented an O(k log k)-competitive online algorithm. We study the offline version of this problem where σ is known to the algorithm. Our main contribution is a polynomial-time algorithm which provides an O(log n)-approximation with resource augmentation. Our algorithm relies on an integer linear program formulation in a metric space with spreading constraints. We relax the formulation to a linear program and employ Bartal's clustering algorithm in a novel way to round it.
{"title":"Approximate Dynamic Balanced Graph Partitioning","authors":"Harald Räcke, Stefan Schmid, R. Zabrodin","doi":"10.1145/3490148.3538563","DOIUrl":"https://doi.org/10.1145/3490148.3538563","url":null,"abstract":"Networked systems are increasingly flexible and reconfigurable. This enables demand-aware infrastructures whose resources can be adjusted according to the traffic pattern they currently serve. This paper revisits the dynamic balanced graph partitioning problem, a generalization of the classic balanced graph partitioning problem. We are given a set P of n = kℓ processes which communicate over time according to a given request sequence σ. The processes are assigned to ℓ servers (each of capacity k), and a scheduler can change this assignment dynamically to reduce communication costs, at cost α per node move. Avin et al. showed an Ω(k) lower bound on the competitive ratio of any deterministic online algorithm, even in a model with resource augmentation, and presented an O(k log k)-competitive online algorithm. We study the offline version of this problem where σ is known to the algorithm. Our main contribution is a polynomial-time algorithm which provides an O(log n)-approximation with resource augmentation. Our algorithm relies on an integer linear program formulation in a metric space with spreading constraints. We relax the formulation to a linear program and employ Bartal's clustering algorithm in a novel way to round it.","PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128567768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we study the following batch scheduling model: find a schedule that minimizes total flow time for n uniform length jobs, with release times and deadlines, where the machine is only actively processing jobs in at most k synchronized batches of size at most B. Prior work on such batch scheduling models has considered only feasibility with no regard to the flow time of the schedule. However, algorithms that minimize the cost from the scheduler's perspective---such as ones that minimize the active time of the processor---can result in schedules where the total flow time is arbitrarily high [15]. Such schedules are not valuable from the perspective of the client. In response, our work provides dynamic programs which minimize flow time subject to active time constraints. Our main contribution focuses on jobs with agreeable deadlines; for such job instances, we introduce dynamic programs that achieve runtimes of O(B ․ k ․ n) for unit jobs and O(B ․ O(B ․ n5) for uniform length jobs. These results improve upon our modification of a different, classical dynamic programming approach by Baptiste. While the modified DP works when deadlines are non-agreeable, this solution is more expensive, with runtime O(B ․ k2 ․ n7) [7].
{"title":"Balancing Flow Time and Energy Consumption","authors":"Sami Davies, S. Khuller, Shi-Han Zhang","doi":"10.1145/3490148.3538582","DOIUrl":"https://doi.org/10.1145/3490148.3538582","url":null,"abstract":"In this paper, we study the following batch scheduling model: find a schedule that minimizes total flow time for n uniform length jobs, with release times and deadlines, where the machine is only actively processing jobs in at most k synchronized batches of size at most B. Prior work on such batch scheduling models has considered only feasibility with no regard to the flow time of the schedule. However, algorithms that minimize the cost from the scheduler's perspective---such as ones that minimize the active time of the processor---can result in schedules where the total flow time is arbitrarily high [15]. Such schedules are not valuable from the perspective of the client. In response, our work provides dynamic programs which minimize flow time subject to active time constraints. Our main contribution focuses on jobs with agreeable deadlines; for such job instances, we introduce dynamic programs that achieve runtimes of O(B ․ k ․ n) for unit jobs and O(B ․ O(B ․ n5) for uniform length jobs. These results improve upon our modification of a different, classical dynamic programming approach by Baptiste. While the modified DP works when deadlines are non-agreeable, this solution is more expensive, with runtime O(B ․ k2 ․ n7) [7].","PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130076048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider global problems, i.e. problems that take at least diameter time, even when the bandwidth is not restricted. We show that all problems considered admit efficient solutions in low-treewidth graphs.
{"title":"Fully Polynomial-Time Distributed Computation in Low-Treewidth Graphs","authors":"Taisuke Izumi, Naoki Kitamura, Takamasa Naruse, Gregory Schwartzman","doi":"10.1145/3490148.3538590","DOIUrl":"https://doi.org/10.1145/3490148.3538590","url":null,"abstract":"We consider global problems, i.e. problems that take at least diameter time, even when the bandwidth is not restricted. We show that all problems considered admit efficient solutions in low-treewidth graphs.","PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131945096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Hajiaghayi, Marina Knittel, J. Olkowski, Hamed Saleh
We study the Weighted Min Cut problem in the Adaptive Massively Parallel Computation (AMPC) model. In 2019, Behnezhad et al. [3] introduced the AMPC model as an extension of the Massively Parallel Computation (MPC) model. In the past decade, research on highly scalable algorithms has had significant impact on many massive systems. The MPC model, introduced in 2010 by Karloff et al. [16], which is an abstraction of famous practical frameworks such as MapReduce, Hadoop, Flume, and Spark, has been at the forefront of this research. While great strides have been taken to create highly efficient MPC algorithms for a range of problems, recent progress has been limited by the 1-vs-2 Cycle Conjecture [20], which postulates that the simple problem of distinguishing between one and two cycles requires Ω(log n) MPC rounds. In the AMPC model, each machine has adaptive read access to a distributed hash table even when communication is restricted (i.e., in the middle of a round). While remaining practical [4], this gives algorithms the power to bypass limitations like the 1-vs-2 Cycle Conjecture.
研究了自适应大规模并行计算(AMPC)模型中的加权最小割问题。2019年,Behnezhad等人[3]引入了AMPC模型,作为大规模并行计算(MPC)模型的扩展。在过去的十年中,对高可扩展性算法的研究对许多大规模系统产生了重大影响。MPC模型由Karloff等人于2010年提出[16],它是对MapReduce、Hadoop、Flume和Spark等著名实用框架的抽象,一直处于该研究的前沿。虽然在为一系列问题创建高效的MPC算法方面已经取得了很大的进步,但最近的进展受到1 vs 2周期猜想的限制[20],该猜想假设区分一个和两个周期的简单问题需要Ω(log n)个MPC轮。在AMPC模型中,即使通信受到限制(例如,在一轮中),每台机器也对分布式哈希表具有自适应读访问。在保持实用性的同时[4],这使算法能够绕过1 vs 2周期猜想等限制。
{"title":"Adaptive Massively Parallel Algorithms for Cut Problems","authors":"M. Hajiaghayi, Marina Knittel, J. Olkowski, Hamed Saleh","doi":"10.1145/3490148.3538576","DOIUrl":"https://doi.org/10.1145/3490148.3538576","url":null,"abstract":"We study the Weighted Min Cut problem in the Adaptive Massively Parallel Computation (AMPC) model. In 2019, Behnezhad et al. [3] introduced the AMPC model as an extension of the Massively Parallel Computation (MPC) model. In the past decade, research on highly scalable algorithms has had significant impact on many massive systems. The MPC model, introduced in 2010 by Karloff et al. [16], which is an abstraction of famous practical frameworks such as MapReduce, Hadoop, Flume, and Spark, has been at the forefront of this research. While great strides have been taken to create highly efficient MPC algorithms for a range of problems, recent progress has been limited by the 1-vs-2 Cycle Conjecture [20], which postulates that the simple problem of distinguishing between one and two cycles requires Ω(log n) MPC rounds. In the AMPC model, each machine has adaptive read access to a distributed hash table even when communication is restricted (i.e., in the middle of a round). While remaining practical [4], this gives algorithms the power to bypass limitations like the 1-vs-2 Cycle Conjecture.","PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130650076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hussam Al Daas, Grey Ballard, L. Grigori, Suraj Kumar, Kathryn Rouse
Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored constant factors or not obtained the tightest possible values. The main result of this work is establishing memory-independent communication lower bounds with tight constants for parallel matrix multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the matrix aspect ratios and the number of processors.
{"title":"Brief Announcement: Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds","authors":"Hussam Al Daas, Grey Ballard, L. Grigori, Suraj Kumar, Kathryn Rouse","doi":"10.1145/3490148.3538552","DOIUrl":"https://doi.org/10.1145/3490148.3538552","url":null,"abstract":"Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic analysis have either ignored constant factors or not obtained the tightest possible values. The main result of this work is establishing memory-independent communication lower bounds with tight constants for parallel matrix multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the matrix aspect ratios and the number of processors.","PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125301929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}