Although negative inputs to ReLU contribute nothing to the output, deep neural networks spend a large amount of computing power calculating them. We propose a computation pruning technique that detects at an early stage that the result of a sum of products will be negative, by adopting an inverted two's complement representation for weights and a bit-serial sum of products. The technique can therefore skip a large number of computations destined to produce negative results and simply set the corresponding ReLU outputs to zero. Moreover, we devise a DNN accelerator architecture that can efficiently apply the proposed technique. The evaluation shows that the accelerator using this computation pruning through early negative detection technique significantly improves energy efficiency and performance.
{"title":"ComPEND","authors":"Dongwook Lee, Sungbum Kang, Kiyoung Choi","doi":"10.1145/3205289.3205295","DOIUrl":"https://doi.org/10.1145/3205289.3205295","url":null,"abstract":"While negative inputs for ReLU are useless, it consumes a lot of computing power to calculate them for deep neural networks. We propose a computation pruning technique that detects at an early stage that the result of a sum of products will be negative by adopting an inverted two's complement expression for weights and a bit-serial sum of products. Therefore, it can skip a large amount of computations for negative results and simply set the ReLU outputs to zero. Moreover, we devise a DNN accelerator architecture that can efficiently apply the proposed technique. The evaluation shows that the accelerator using the computation pruning through early negative detection technique significantly improves the energy efficiency and the performance.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114888509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, the use of graph-based network topologies has been proposed as an alternative to traditional networks such as tori or fat-trees because of their very good topological characteristics. However, they pose practical implementation challenges, such as the lack of deadlock avoidance strategies. Previous proposals either lack flexibility, underutilise network resources, or are exceedingly complex. We propose, and prove formally, three generic, low-complexity deadlock avoidance mechanisms that only require local information. Our methods are topology- and routing-independent, and their virtual channel count is bounded by the length of the longest path. We evaluate our algorithms through an extensive simulation study, using both synthetic and realistic traffic, to measure the impact on performance. First we compare against a well-known HPC mechanism for dragonfly networks and achieve a similar performance level. Then we move to graph-based networks and show that our mechanisms can greatly outperform traditional, spanning-tree-based mechanisms, even when the latter use a much larger number of virtual channels. Overall, our proposal provides a simple, flexible and high-performance deadlock-avoidance solution.
{"title":"High-Performance, Low-Complexity Deadlock Avoidance for Arbitrary Topologies/Routings","authors":"J. A. Pascual, J. Navaridas","doi":"10.1145/3205289.3205307","DOIUrl":"https://doi.org/10.1145/3205289.3205307","url":null,"abstract":"Recently, the use of graph-based network topologies has been proposed as an alternative to traditional networks such as tori or fat-trees due to their very good topological characteristics. However they pose practical implementation challenges such as the lack of deadlock avoidance strategies. Previous proposals either lack flexibility, underutilise network resources or are exceedingly complex. We propose--and prove formally--three generic, low-complexity deadlock avoidance mechanisms that only require local information. Our methods are topology- and routing-independent and their virtual channel count is bounded by the length of the longest path. We evaluate our algorithms through an extensive simulation study to measure the impact on the performance using both synthetic and realistic traffic. First we compare against a well-known HPC mechanism for dragonfly and achieve similar performance level. Then we moved to Graph-based networks and show that our mechanisms can greatly outperform traditional, spanning-tree based mechanisms, even if these use a much larger number of virtual channels. Overall, our proposal provides a simple, flexible and high performance deadlock-avoidance solution.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"197 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123032834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse Matrix-Vector Multiplication (SpMV) is an essential computation kernel for many data-analytic workloads running in both supercomputers and data centers. The intrinsic irregularity of SpMV makes it challenging to achieve high performance, especially when porting to new architectures. In this paper, we present our work on designing and implementing efficient SpMV algorithms on Sunway, a novel architecture with many unique features. To fully exploit the Sunway architecture, we design a dual-side multi-level partitioning mechanism over both sparse matrices and hardware resources to improve locality and parallelism. On one hand, we partition sparse matrices into blocks, tiles, and slices at different granularities. On the other hand, we partition the cores in a Sunway processor into fleets, and further dedicate some cores in each fleet to computation and others to I/O. Moreover, we optimize the communication between partitions to further improve performance. Our scheme is generally applicable to different SpMV formats and implementations. For evaluation, we apply our techniques atop a popular SpMV format, CSR. Experimental results on 18 datasets show that our optimization yields speedups of up to 15.5x (12.3x on average).
{"title":"Towards Efficient SpMV on Sunway Manycore Architectures","authors":"Changxi Liu, Biwei Xie, Xin Liu, Wei Xue, Hailong Yang, Xu Liu","doi":"10.1145/3205289.3205313","DOIUrl":"https://doi.org/10.1145/3205289.3205313","url":null,"abstract":"Sparse Matrix-Vector Multiplication (SpMV) is an essential computation kernel for many data-analytic workloads running in both supercomputers and data centers. The intrinsic irregularity in SpMV is challenging to achieve high performance, especially when porting to new architectures. In this paper, we present our work on designing and implementing efficient SpMV algorithms on Sunway, a novel architecture with many unique features. To fully exploit the Sunway architecture, we have designed a dual-side multi-level partition mechanism on both sparse matrices and hardware resources to improve locality and parallelism. On one hand, we partition sparse matrices into blocks, tiles, and slices for different granularities. On the other hand, we partition cores in a Sunway processor into fleets, and further dedicate part of cores in a fleet as computation and I/O cores. Moreover, we have optimized the communication between partitions to further improve the performance. Our scheme is generally applicable to different SpMV formats and implementations. For evaluation, we have applied our techniques atop a popular SpMV format, CSR. Experimental results on 18 datasets show that our optimization yields up to 15.5x (12.3x on average) speedups.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123510152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory is becoming increasingly heterogeneous with the emergence of disparate memory technologies, ranging from non-volatile memories like PCM, STT-RAM, and memristors to 3D-stacked memories like HBM. In such systems, data is often migrated across memory regions backed by different technologies for better overall performance, so an effective migration mechanism is a prerequisite. Prior works on OS-directed page migration have focused on what data to migrate and/or when to migrate. In this work, we demonstrate the need to investigate another dimension: how much to migrate. Specifically, we show that the amount of data migrated in a single migration operation (the "migration granularity") is vital to overall performance. Through analysis on real hardware, we further show that different applications benefit from different migration granularities, owing to their distinct memory access characteristics. Since the preferred migration granularity may not be known a priori, we propose a novel scheme to infer it for any given application at runtime. When implemented in the Linux OS running on current hardware, the scheme improves performance by up to 36% over a baseline with a fixed migration granularity.
{"title":"A Case for Granularity Aware Page Migration","authors":"Jee Ho Ryoo, L. John, Arkaprava Basu","doi":"10.1145/3205289.3208064","DOIUrl":"https://doi.org/10.1145/3205289.3208064","url":null,"abstract":"Memory is becoming increasingly heterogeneous with the emergence of disparate memory technologies ranging from non-volatile memories like PCM, STT-RAM, and memristors to 3D-stacked memories like HBM. In such systems, data is of ten migrated across memory regions backed by different technologies for better overall performance. An effective migration mechanism is a prerequisite in such systems. Prior works on OS-directed page migration have focused on what data to migrate and/or on when to migrate. In this work, we demonstrate the need to investigate another dimension -- how much to migrate. Specifically, we show that the amount of data migrated in a single migration operation (called \"migration granularity\") is vital to the overall performance. Through analysis on real hardware, we further show that different applications benefit from different migration granularities, owing to their distinct memory access characteristics. Since this preferred migration granularity may not be known a priori, we propose a novel scheme to infer this for any given application at runtime. When implemented in the Linux OS, running on a current hardware, the performance improved by up to 36% over a baseline with a fixed migration granularity.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132608509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tuning parameters to optimize performance or other metrics of interest, such as energy or variability, can be resource- and time-consuming. The presence of a large parameter space makes comprehensive exploration infeasible. In this paper, we propose a novel bootstrap scheme, called GEIST, for parameter space exploration that finds performance-optimizing configurations quickly. Our scheme represents the parameter space as a graph whose connectivity guides information propagation from known configurations. Guided by the predictions of a semi-supervised learning method over the parameter graph, GEIST is able to adaptively sample and find desirable configurations using limited results from experiments. We show the effectiveness of GEIST for selecting application input options, compiler flags, and runtime/system settings for several parallel codes, including LULESH, Kripke, Hypre, and OpenAtom.
{"title":"Bootstrapping Parameter Space Exploration for Fast Tuning","authors":"Jayaraman J. Thiagarajan, Nikhil Jain, Rushil Anirudh, Alfredo Giménez, R. Sridhar, Aniruddha Marathe, Tao Wang, M. Emani, A. Bhatele, T. Gamblin","doi":"10.1145/3205289.3205321","DOIUrl":"https://doi.org/10.1145/3205289.3205321","url":null,"abstract":"The task of tuning parameters for optimizing performance or other metrics of interest such as energy, variability, etc. can be resource and time consuming. Presence of a large parameter space makes a comprehensive exploration infeasible. In this paper, we propose a novel bootstrap scheme, called GEIST, for parameter space exploration to find performance-optimizing configurations quickly. Our scheme represents the parameter space as a graph whose connectivity guides information propagation from known configurations. Guided by the predictions of a semi-supervised learning method over the parameter graph, GEIST is able to adaptively sample and find desirable configurations using limited results from experiments. We show the effectiveness of GEIST for selecting application input options, compiler flags, and runtime/system settings for several parallel codes including LULESH, Kripke, Hypre, and OpenAtom.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125634665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
New memory technologies, such as non-volatile memory and stacked memory, have reshaped the memory hierarchies of modern and emerging computer architectures. It has become common to see memories of different types integrated into the same system, known as heterogeneous memory. Typically, a heterogeneous memory system consists of a small fast component and a large slow component. This encourages a new style of data processing and confronts developers with a new problem: given two memory types, how should applications be redesigned to benefit from this memory arrangement, and how should efficient data placement be decided? Existing methods perform detailed memory access pattern analysis to guide data placement. However, these methods are heavyweight and ignore the interactions between software and hardware. To address these issues, we develop ProfDP, a lightweight profiler that employs differential data-centric analysis to provide intuitive guidance for data placement in heterogeneous memory. Evaluated with a number of parallel benchmarks running on a state-of-the-art emulator and a real machine with heterogeneous memory, ProfDP is shown to guide nearly optimal data placement that maximizes performance with minimal programming effort.
{"title":"ProfDP","authors":"Shasha Wen, Lucy Cherkasova, F. Lin, Xu Liu","doi":"10.1145/3205289.3205320","DOIUrl":"https://doi.org/10.1145/3205289.3205320","url":null,"abstract":"New memory technologies, such as non-volatile memory and stacked memory, have reformed the memory hierarchies in modern and emerging computer architectures. It becomes common to see memories of different types integrated into the same system, as known as heterogeneous memory. Typically, a heterogeneous memory system consists of a small fast component and a large slow component. This encourages new style of data processing and exposes developers with a new problem: given two memory types, how shall we redesign applications to benefit from this memory arrangement and decide on the efficient data placement? Existing methods perform detailed memory access pattern analysis to guide data placement. However, these methods are heavyweight and ignore the interactions between software and hardware. To address these issues, we develop ProfDP, a lightweight profiler that employs differential data-centric analysis to provide intuitive guidance for data placement in heterogeneous memory. Evaluated with a number of parallel benchmarks running on a state-of-the-art emulator and a real machine with heterogeneous memory, we show that ProfDP is able to guide nearly-optimal data placement to maximize performance with minimum programming efforts.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129160355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kernel Ridge Regression (KRR) is a fundamental method in machine learning. Given an n-by-d data matrix as input, a traditional implementation requires Θ(n²) memory to form an n-by-n kernel matrix and Θ(n³) flops to compute the final model. These time and storage costs prohibit KRR from scaling up to large datasets. For example, even on a relatively small dataset (a 520k-by-90 input requiring 357 MB), KRR requires 2 TB of memory just to store the kernel matrix, because n is usually much larger than d in real-world applications. Weak scaling is also a problem: if we keep d and n/p fixed as p grows (where p is the number of machines), the memory needed grows as Θ(p) per processor and the flops as Θ(p²) per processor. In the perfect weak scaling situation, both the memory and the flops grow as Θ(1) per processor (i.e., they remain constant). The traditional Distributed KRR implementation (DKRR) achieves only 0.32% weak scaling efficiency from 96 to 1536 processors. We propose two new methods to address these problems: Balanced KRR (BKRR) and K-means KRR (KKRR). These methods consider alternative ways to partition the input dataset into p different parts, generating p different models and then selecting the best model among them. Compared to a conventional implementation, KKRR2 (the optimized version of KKRR) improves the weak scaling efficiency from 0.32% to 38% and achieves a 591x speedup for reaching the same accuracy with the same data and the same hardware (1536 processors). BKRR2 (the optimized version of BKRR) achieves higher accuracy than the current fastest method while using less training time on a variety of datasets. For applications requiring only approximate solutions, BKRR2 improves the weak scaling efficiency to 92% and achieves a 3505x speedup (theoretical speedup: 4096x).
{"title":"Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems","authors":"Yang You, J. Demmel, Cho-Jui Hsieh, R. Vuduc","doi":"10.1145/3205289.3205290","DOIUrl":"https://doi.org/10.1145/3205289.3205290","url":null,"abstract":"Kernel Ridge Regression (KRR) is a fundamental method in machine learning. Given an n-by-d data matrix as input, a traditional implementation requires Θ(n2) memory to form an n-by-n kernel matrix and Θ(n3) flops to compute the final model. These time and storage costs prohibit KRR from scaling up to large datasets. For example, even on a relatively small dataset (a 520k-by-90 input requiring 357 MB), KRR requires 2 TB memory just to store the kernel matrix. The reason is that n usually is much larger than d for real-world applications. On the other hand, weak scaling becomes a problem: if we keep d and n/p fixed as p grows (p is # machines), the memory needed grows as Θ(p) per processor and the flops as Θ(p2) per processor. In the perfect weak scaling situation, both the memory needed and the flops grow as Θ(1) per processor (i.e. memory and flops are constant). The traditional Distributed KRR implementation (DKRR) only achieved 0.32% weak scaling efficiency from 96 to 1536 processors. We propose two new methods to address these problems: the Balanced KRR (BKRR) and K-means KRR (KKRR). These methods consider alternative ways to partition the input dataset into p different parts, generating p different models, and then selecting the best model among them. Compared to a conventional implementation, KKRR2 (optimized version of KKRR) improves the weak scaling efficiency from 0.32% to 38% and achieves a 591x speedup for getting the same accuracy by using the same data and the same hardware (1536 processors). BKRR2 (optimized version of BKRR) achieves a higher accuracy than the current fastest method using less training time for a variety of datasets. For the applications requiring only approximate solutions, BKRR2 improves the weak scaling efficiency to 92% and achieves 3505x speedup (theoretical speedup: 4096x).","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121192756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Tucker decomposition generalizes the notion of Singular Value Decomposition (SVD) to tensors, the higher-dimensional analogues of matrices. We study the problem of constructing the Tucker decomposition of sparse tensors on distributed-memory systems via the HOOI procedure, a popular iterative method. The scheme used for distributing the input tensor among the processors (MPI ranks) critically influences the HOOI execution time. Prior work has proposed different distribution schemes: an offline scheme based on a sophisticated hypergraph partitioning method, and simple, lightweight alternatives that can be used in real time. While the hypergraph-based scheme typically results in faster HOOI execution, its complexity makes the time taken to determine the distribution an order of magnitude higher than the execution time of a single HOOI iteration. Our main contribution is a lightweight distribution scheme that achieves the best of both worlds. We show that the scheme is near-optimal on certain fundamental metrics associated with the HOOI procedure and, as a result, near-optimal on the computational load (FLOPs). Though the scheme may incur a higher communication volume, computation time is the dominant factor, so the scheme achieves better overall HOOI execution time. Our experimental evaluation on large real-life tensors (with up to 4 billion elements) shows that the scheme outperforms the prior schemes on HOOI execution time by a factor of up to 3x, while its distribution time is comparable to the prior lightweight schemes and is typically less than the execution time of a single HOOI iteration.
{"title":"On Optimizing Distributed Tucker Decomposition for Sparse Tensors","authors":"Venkatesan T. Chakaravarthy, Jee W. Choi, Douglas J. Joseph, Prakash Murali, Yogish Sabharwal, Shivmaran S. Pandian, D. Sreedhar","doi":"10.1145/3205289.3205315","DOIUrl":"https://doi.org/10.1145/3205289.3205315","url":null,"abstract":"The Tucker decomposition generalizes the notion of Singular Value Decomposition (SVD) to tensors, the higher dimensional analogues of matrices. We study the problem of constructing the Tucker decomposition of sparse tensors on distributed memory systems via the HOOI procedure, a popular iterative method. The scheme used for distributing the input tensor among the processors (MPI ranks) critically influences the HOOI execution time. Prior work has proposed different distribution schemes: an offline scheme based on sophisticated hypergraph partitioning method and simple, lightweight alternatives that can be used real-time. While the hypergraph based scheme typically results in faster HOOI execution time, being complex, the time taken for determining the distribution is an order of magnitude higher than the execution time of a single HOOI iteration. Our main contribution is a lightweight distribution scheme, which achieves the best of both worlds. We show that the scheme is near-optimal on certain fundamental metrics associated with the HOOI procedure and as a result, near-optimal on the computational load (FLOPs). Though the scheme may incur higher communication volume, the computation time is the dominant factor and as the result, the scheme achieves better performance on the overall HOOI execution time. Our experimental evaluation on large real-life tensors (having up to 4 billion elements) shows that the scheme outperforms the prior schemes on the HOOI execution time by a factor of up to 3x. On the other hand, its distribution time is comparable to the prior lightweight schemes and is typically lesser than the execution time of a single HOOI iteration.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"175 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123400597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}