
Proceedings of the 2018 International Conference on Supercomputing: Latest Publications

ComPEND
Pub Date: 2018-06-12 DOI: 10.1145/3205289.3205295
Dongwook Lee, Sungbum Kang, Kiyoung Choi
Negative inputs to ReLU are useless, yet deep neural networks spend a great deal of computing power calculating them. We propose a computation pruning technique that detects at an early stage that the result of a sum of products will be negative, by adopting an inverted two's complement representation for weights together with a bit-serial sum of products. The technique can therefore skip a large amount of computation for negative results and simply set the corresponding ReLU outputs to zero. Moreover, we devise a DNN accelerator architecture that applies the proposed technique efficiently. The evaluation shows that an accelerator using this computation pruning through early negative detection significantly improves both energy efficiency and performance.
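
The early-exit idea can be illustrated with a small sketch. The paper uses an inverted two's-complement weight encoding; this simplified Python version keeps plain two's complement but processes the negative sign plane first, so once the partial sum plus the largest possible remaining positive contribution is negative, the ReLU output is provably zero and the remaining bit planes can be skipped. This is a minimal sketch of the principle, not the accelerator's actual datapath.

```python
import numpy as np

def bit_serial_relu_dot(x, w, bits=8):
    """Bit-serial dot product with early negative detection (sketch).

    x: non-negative activations (e.g. outputs of a previous ReLU layer)
    w: signed integer weights in [-(2**(bits-1)), 2**(bits-1) - 1]
    """
    x = np.asarray(x, dtype=np.int64)
    w = np.asarray(w, dtype=np.int64)
    uw = w & ((1 << bits) - 1)                # two's-complement bit patterns

    # Sign plane first: each set MSB contributes -2^(bits-1) * x_i,
    # so the partial sum starts at its most negative point.
    msb = (uw >> (bits - 1)) & 1
    partial = -(1 << (bits - 1)) * int(np.dot(x, msb))

    # Remaining planes only add positive amounts, largest plane first.
    for b in range(bits - 2, -1, -1):
        plane = (uw >> b) & 1
        partial += (1 << b) * int(np.dot(x, plane))
        # Upper bound on what the unprocessed planes can still add.
        max_rest = ((1 << b) - 1) * int(np.sum(x))
        if partial + max_rest < 0:
            return 0                          # ReLU output provably zero
    return max(partial, 0)                    # ReLU on the exact result
```
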
{"title":"ComPEND","authors":"Dongwook Lee, Sungbum Kang, Kiyoung Choi","doi":"10.1145/3205289.3205295","DOIUrl":"https://doi.org/10.1145/3205289.3205295","url":null,"abstract":"While negative inputs for ReLU are useless, it consumes a lot of computing power to calculate them for deep neural networks. We propose a computation pruning technique that detects at an early stage that the result of a sum of products will be negative by adopting an inverted two's complement expression for weights and a bit-serial sum of products. Therefore, it can skip a large amount of computations for negative results and simply set the ReLU outputs to zero. Moreover, we devise a DNN accelerator architecture that can efficiently apply the proposed technique. The evaluation shows that the accelerator using the computation pruning through early negative detection technique significantly improves the energy efficiency and the performance.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114888509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
High-Performance, Low-Complexity Deadlock Avoidance for Arbitrary Topologies/Routings
Pub Date: 2018-06-12 DOI: 10.1145/3205289.3205307
J. A. Pascual, J. Navaridas
Recently, graph-based network topologies have been proposed as an alternative to traditional networks such as tori or fat-trees because of their very good topological characteristics. However, they pose practical implementation challenges, such as the lack of deadlock-avoidance strategies. Previous proposals either lack flexibility, underutilise network resources, or are exceedingly complex. We propose, and formally prove correct, three generic, low-complexity deadlock-avoidance mechanisms that require only local information. Our methods are topology- and routing-independent, and their virtual channel count is bounded by the length of the longest path. We evaluate our algorithms through an extensive simulation study that measures the impact on performance using both synthetic and realistic traffic. First, we compare against a well-known HPC mechanism for dragonfly networks and achieve a similar performance level. Then we move to graph-based networks and show that our mechanisms can greatly outperform traditional, spanning-tree-based mechanisms, even when these use a much larger number of virtual channels. Overall, our proposal provides a simple, flexible and high-performance deadlock-avoidance solution.
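
The abstract's bound (virtual channel count limited by the longest path) is exactly the property of classic hop-indexed VC assignment. The following minimal sketch shows that classic scheme for illustration; it is not necessarily one of the paper's three mechanisms.

```python
def assign_virtual_channels(path):
    """Use virtual channel i on the i-th hop of a packet's route."""
    return [(link, hop) for hop, link in enumerate(path)]

# A 3-hop route occupies VCs 0, 1, 2. Because the VC index strictly
# increases along every route, the (link, VC) dependency graph is
# acyclic, so no deadlock can form; with route lengths bounded by L,
# L virtual channels suffice: the same bound stated in the abstract.
print(assign_virtual_channels(["A->B", "B->C", "C->D"]))
# [('A->B', 0), ('B->C', 1), ('C->D', 2)]
```
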
Citations: 4
Towards Efficient SpMV on Sunway Manycore Architectures
Pub Date: 2018-06-12 DOI: 10.1145/3205289.3205313
Changxi Liu, Biwei Xie, Xin Liu, Wei Xue, Hailong Yang, Xu Liu
Sparse Matrix-Vector Multiplication (SpMV) is an essential computation kernel for many data-analytic workloads running in both supercomputers and data centers. The intrinsic irregularity of SpMV makes high performance challenging to achieve, especially when porting to new architectures. In this paper, we present our work on designing and implementing efficient SpMV algorithms on Sunway, a novel architecture with many unique features. To fully exploit the Sunway architecture, we have designed a dual-side multi-level partition mechanism over both sparse matrices and hardware resources to improve locality and parallelism. On one hand, we partition sparse matrices into blocks, tiles, and slices at different granularities. On the other hand, we partition the cores in a Sunway processor into fleets, and further dedicate part of the cores in a fleet as computation and I/O cores. Moreover, we have optimized the communication between partitions to further improve performance. Our scheme is generally applicable to different SpMV formats and implementations. For evaluation, we have applied our techniques atop a popular SpMV format, CSR. Experimental results on 18 datasets show that our optimization yields up to 15.5x (12.3x on average) speedups.
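
As background, here is a sketch of only the coarsest partitioning level described above: contiguous CSR row blocks, one per fleet. The paper's actual scheme adds tiles, slices, and dedicated I/O cores on top of this.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def blocked_csr_spmv(A, x, n_fleets=4):
    """Row-block partitioned CSR SpMV: one contiguous block per fleet.

    Each block writes a disjoint slice of y, so the blocks could run
    concurrently on different fleets without synchronization.
    """
    y = np.zeros(A.shape[0])
    bounds = np.linspace(0, A.shape[0], n_fleets + 1, dtype=int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        y[lo:hi] = A[lo:hi, :] @ x            # work assigned to one fleet
    return y

A = sparse_random(1000, 1000, density=0.01, format="csr")
x = np.ones(1000)
assert np.allclose(blocked_csr_spmv(A, x), A @ x)
```
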
Citations: 44
A Case for Granularity Aware Page Migration
Pub Date: 2018-06-12 DOI: 10.1145/3205289.3208064
Jee Ho Ryoo, L. John, Arkaprava Basu
Memory is becoming increasingly heterogeneous with the emergence of disparate memory technologies, ranging from non-volatile memories like PCM, STT-RAM, and memristors to 3D-stacked memories like HBM. In such systems, data is often migrated across memory regions backed by different technologies for better overall performance. An effective migration mechanism is a prerequisite in such systems. Prior works on OS-directed page migration have focused on what data to migrate and/or when to migrate. In this work, we demonstrate the need to investigate another dimension: how much to migrate. Specifically, we show that the amount of data migrated in a single migration operation (called "migration granularity") is vital to overall performance. Through analysis on real hardware, we further show that different applications benefit from different migration granularities, owing to their distinct memory-access characteristics. Since the preferred migration granularity may not be known a priori, we propose a novel scheme to infer it for any given application at runtime. When implemented in the Linux OS running on current hardware, performance improved by up to 36% over a baseline with a fixed migration granularity.
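
A hypothetical sketch of inferring the preferred granularity at runtime by direct measurement follows; `migrate` and `run_epoch` are placeholder hooks, and the abstract does not spell out the paper's actual inference scheme.

```python
import time

# Hypothetical hooks: migrate(granularity) performs one migration round
# at the given granularity; run_epoch() runs a slice of the application.
def pick_migration_granularity(migrate, run_epoch,
                               candidates=(4096, 64 * 1024, 2 * 1024 * 1024)):
    """Try each candidate granularity and keep the fastest epoch."""
    best, best_t = None, float("inf")
    for g in candidates:                      # 4 KB page ... 2 MB huge page
        migrate(g)
        t0 = time.perf_counter()
        run_epoch()
        elapsed = time.perf_counter() - t0
        if elapsed < best_t:
            best, best_t = g, elapsed
    return best
```
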
Citations: 11
Bootstrapping Parameter Space Exploration for Fast Tuning
Pub Date: 2018-06-12 DOI: 10.1145/3205289.3205321
Jayaraman J. Thiagarajan, Nikhil Jain, Rushil Anirudh, Alfredo Giménez, R. Sridhar, Aniruddha Marathe, Tao Wang, M. Emani, A. Bhatele, T. Gamblin
Tuning parameters to optimize performance or other metrics of interest, such as energy or variability, can be resource- and time-consuming. The presence of a large parameter space makes comprehensive exploration infeasible. In this paper, we propose a novel bootstrap scheme, called GEIST, for parameter space exploration that finds performance-optimizing configurations quickly. Our scheme represents the parameter space as a graph whose connectivity guides information propagation from known configurations. Guided by the predictions of a semi-supervised learning method over the parameter graph, GEIST is able to adaptively sample and find desirable configurations using limited results from experiments. We show the effectiveness of GEIST in selecting application input options, compiler flags, and runtime/system settings for several parallel codes, including LULESH, Kripke, Hypre, and OpenAtom.
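
A sketch of one bootstrap step, under the assumption that configurations are points whose k-NN graph stands in for the parameter graph; scikit-learn's LabelSpreading is used as the semi-supervised learner here and may differ from GEIST's actual method.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def next_configs_to_try(configs, labels, batch=8):
    """Propagate good/bad labels over a k-NN graph of configurations.

    configs: (n, d) array, one row per parameter configuration
    labels:  n-vector with 1 = good, 0 = bad, -1 = not yet measured
    Returns the unmeasured configurations most confidently predicted
    to be good: the candidates to run in the next bootstrap round.
    """
    model = LabelSpreading(kernel="knn", n_neighbors=5)
    model.fit(configs, labels)                # -1 marks unlabeled nodes
    good_col = list(model.classes_).index(1)
    unmeasured = np.where(labels == -1)[0]
    p_good = model.label_distributions_[unmeasured, good_col]
    return unmeasured[np.argsort(-p_good)][:batch]
```
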
Citations: 24
ProfDP
Pub Date: 2018-06-12 DOI: 10.1145/3205289.3205320
Shasha Wen, Lucy Cherkasova, F. Lin, Xu Liu
New memory technologies, such as non-volatile memory and stacked memory, have reshaped the memory hierarchies of modern and emerging computer architectures. It has become common to see memories of different types integrated into the same system, known as heterogeneous memory. Typically, a heterogeneous memory system consists of a small fast component and a large slow component. This encourages a new style of data processing and confronts developers with a new problem: given two memory types, how should we redesign applications to benefit from this memory arrangement and decide on efficient data placement? Existing methods perform detailed memory-access-pattern analysis to guide data placement. However, these methods are heavyweight and ignore the interactions between software and hardware. To address these issues, we develop ProfDP, a lightweight profiler that employs differential data-centric analysis to provide intuitive guidance for data placement in heterogeneous memory. Evaluated with a number of parallel benchmarks running on a state-of-the-art emulator and a real machine with heterogeneous memory, we show that ProfDP is able to guide nearly optimal data placement that maximizes performance with minimum programming effort.
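
One way to picture differential data-centric analysis, assuming each data object has been profiled in both memory types: rank objects by differential benefit per byte and greedily fill fast memory. The metric and the greedy packing are illustrative, not ProfDP's actual algorithm.

```python
def rank_for_fast_memory(objects, fast_capacity):
    """Greedy placement from two per-object profiles (slow vs. fast).

    objects: dicts with "name", "size" (bytes), and the measured access
    cost of the object when placed in slow ("t_slow") and fast ("t_fast")
    memory. Objects with the highest differential benefit per byte are
    packed into fast memory first.
    """
    scored = sorted(objects,
                    key=lambda o: (o["t_slow"] - o["t_fast"]) / o["size"],
                    reverse=True)
    placement, used = [], 0
    for o in scored:
        if used + o["size"] <= fast_capacity:
            placement.append(o["name"])
            used += o["size"]
    return placement
```
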
{"title":"ProfDP","authors":"Shasha Wen, Lucy Cherkasova, F. Lin, Xu Liu","doi":"10.1145/3205289.3205320","DOIUrl":"https://doi.org/10.1145/3205289.3205320","url":null,"abstract":"New memory technologies, such as non-volatile memory and stacked memory, have reformed the memory hierarchies in modern and emerging computer architectures. It becomes common to see memories of different types integrated into the same system, as known as heterogeneous memory. Typically, a heterogeneous memory system consists of a small fast component and a large slow component. This encourages new style of data processing and exposes developers with a new problem: given two memory types, how shall we redesign applications to benefit from this memory arrangement and decide on the efficient data placement? Existing methods perform detailed memory access pattern analysis to guide data placement. However, these methods are heavyweight and ignore the interactions between software and hardware. To address these issues, we develop ProfDP, a lightweight profiler that employs differential data-centric analysis to provide intuitive guidance for data placement in heterogeneous memory. Evaluated with a number of parallel benchmarks running on a state-of-the-art emulator and a real machine with heterogeneous memory, we show that ProfDP is able to guide nearly-optimal data placement to maximize performance with minimum programming efforts.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129160355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems
Pub Date: 2018-05-01 DOI: 10.1145/3205289.3205290
Yang You, J. Demmel, Cho-Jui Hsieh, R. Vuduc
Kernel Ridge Regression (KRR) is a fundamental method in machine learning. Given an n-by-d data matrix as input, a traditional implementation requires Θ(n²) memory to form an n-by-n kernel matrix and Θ(n³) flops to compute the final model. These time and storage costs prohibit KRR from scaling up to large datasets. For example, even on a relatively small dataset (a 520k-by-90 input requiring 357 MB), KRR requires 2 TB of memory just to store the kernel matrix. The reason is that n is usually much larger than d in real-world applications. On the other hand, weak scaling becomes a problem: if we keep d and n/p fixed as p grows (p is the number of machines), the memory needed grows as Θ(p) per processor and the flops as Θ(p²) per processor. In the perfect weak-scaling situation, both the memory needed and the flops grow as Θ(1) per processor (i.e., memory and flops are constant). The traditional distributed KRR implementation (DKRR) achieved only 0.32% weak-scaling efficiency from 96 to 1536 processors. We propose two new methods to address these problems: Balanced KRR (BKRR) and K-means KRR (KKRR). These methods consider alternative ways to partition the input dataset into p different parts, generating p different models, and then selecting the best model among them. Compared to a conventional implementation, KKRR2 (the optimized version of KKRR) improves weak-scaling efficiency from 0.32% to 38% and achieves a 591x speedup for the same accuracy, using the same data and the same hardware (1536 processors). BKRR2 (the optimized version of BKRR) achieves higher accuracy than the current fastest method using less training time on a variety of datasets. For applications requiring only approximate solutions, BKRR2 improves weak-scaling efficiency to 92% and achieves a 3505x speedup (theoretical speedup: 4096x).
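
A minimal single-node sketch of the KKRR idea as described above (k-means partition, one KRR per part, keep the model that validates best); the paper's KKRR2 and BKRR2 variants add further optimizations on top of this.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_ridge import KernelRidge

def kkrr_style_fit(X, y, X_val, y_val, p=4):
    """Partition with k-means, fit one KRR per part, keep the best.

    Each part holds roughly n/p samples, so each kernel matrix costs
    (n/p)^2 memory instead of n^2, which is the saving that lets the
    method scale.
    """
    parts = KMeans(n_clusters=p, n_init=10).fit_predict(X)
    best, best_err = None, np.inf
    for k in range(p):
        model = KernelRidge(kernel="rbf", alpha=1.0)
        model.fit(X[parts == k], y[parts == k])
        err = np.mean((model.predict(X_val) - y_val) ** 2)
        if err < best_err:
            best, best_err = model, err
    return best
```
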
Citations: 26
On Optimizing Distributed Tucker Decomposition for Sparse Tensors
Pub Date: 2018-04-25 DOI: 10.1145/3205289.3205315
Venkatesan T. Chakaravarthy, Jee W. Choi, Douglas J. Joseph, Prakash Murali, Yogish Sabharwal, Shivmaran S. Pandian, D. Sreedhar
The Tucker decomposition generalizes the notion of Singular Value Decomposition (SVD) to tensors, the higher-dimensional analogues of matrices. We study the problem of constructing the Tucker decomposition of sparse tensors on distributed-memory systems via the HOOI procedure, a popular iterative method. The scheme used for distributing the input tensor among the processors (MPI ranks) critically influences the HOOI execution time. Prior work has proposed different distribution schemes: an offline scheme based on a sophisticated hypergraph partitioning method, and simple, lightweight alternatives that can be used in real time. While the hypergraph-based scheme typically results in faster HOOI execution, its complexity means the time taken to determine the distribution is an order of magnitude higher than the execution time of a single HOOI iteration. Our main contribution is a lightweight distribution scheme that achieves the best of both worlds. We show that the scheme is near-optimal on certain fundamental metrics associated with the HOOI procedure and, as a result, near-optimal in computational load (flops). Though the scheme may incur higher communication volume, computation time is the dominant factor, so the scheme achieves better overall HOOI execution time. Our experimental evaluation on large real-life tensors (with up to 4 billion elements) shows that the scheme outperforms prior schemes on HOOI execution time by a factor of up to 3x. Its distribution time is comparable to the prior lightweight schemes and is typically less than the execution time of a single HOOI iteration.
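
For background, a plain single-node HOOI for a small dense tensor in NumPy; the paper's contribution is the distributed sparse-tensor distribution scheme, which this sketch does not attempt.

```python
import numpy as np

def mode_mult(T, M, mode):
    """Mode-n product: contract mode `mode` of tensor T with matrix M."""
    Tm = np.moveaxis(T, mode, 0)
    out = M @ Tm.reshape(Tm.shape[0], -1)
    return np.moveaxis(out.reshape((M.shape[0],) + Tm.shape[1:]), 0, mode)

def hooi(T, ranks, iters=5):
    """Tucker decomposition of a dense tensor via HOOI."""
    # HOSVD initialisation: leading left singular vectors per unfolding.
    U = [np.linalg.svd(np.moveaxis(T, n, 0).reshape(T.shape[n], -1),
                       full_matrices=False)[0][:, :r]
         for n, r in enumerate(ranks)]
    for _ in range(iters):
        for n in range(T.ndim):
            Y = T
            for m in range(T.ndim):            # project all other modes
                if m != n:
                    Y = mode_mult(Y, U[m].T, m)
            Yn = np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)
            U[n] = np.linalg.svd(Yn, full_matrices=False)[0][:, :ranks[n]]
    core = T
    for m in range(T.ndim):
        core = mode_mult(core, U[m].T, m)
    return core, U
```
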
Citations: 22