CELIA
Hao Yan, Hebin R. Cherian, Ethan C. Ahn, Lide Duan
DOI: 10.1145/3205289.3205297

HALO: A Hierarchical Memory Access Locality Modeling Technique For Memory System Explorations
Reena Panda, L. John
DOI: 10.1145/3205289.3205323

The growing complexity of applications poses new challenges to memory system design due to their data-intensive nature, complex access patterns, and larger footprints. The slowness of full-system simulators, the difficulty of running the deep software stacks of many emerging workloads in simulation, and the proprietary nature of software all hinder fast and accurate microarchitectural exploration of future memory hierarchies. One technique to mitigate this problem is to create spatio-temporal models of access streams and use them to explore memory system tradeoffs. However, existing memory stream models have weaknesses: they either model only temporal locality behavior or model spatio-temporal locality using global stride transitions, resulting in high storage/metadata overhead. In this paper, we propose HALO, a Hierarchical memory Access LOcality modeling technique that identifies patterns by isolating global memory references into localized streams and further zooming into each local stream to capture multi-granularity spatial locality patterns. HALO also models the degree of interleaving between localized stream accesses by leveraging coarse-grained reuse locality. We evaluate HALO's effectiveness in replicating original application performance using over 20K different memory system configurations and show that HALO achieves over 98.3%, 95.6%, 99.3%, and 96% accuracy in replicating the performance of prefetcher-enabled L1 and L2 caches, the TLB, and DRAM, respectively. HALO outperforms the state-of-the-art memory cloning schemes WEST and STM, while using ~39X less metadata storage than STM.
{"title":"HALO: A Hierarchical Memory Access Locality Modeling Technique For Memory System Explorations","authors":"Reena Panda, L. John","doi":"10.1145/3205289.3205323","DOIUrl":"https://doi.org/10.1145/3205289.3205323","url":null,"abstract":"Growing complexity of applications pose new challenges to memory system design due to their data intensive nature, complex access patterns, larger footprints, etc. The slow nature of full-system simulators, challenges of simulators to run deep software stacks of many emerging workloads, proprietary nature of software, etc. pose challenges to fast and accurate microarchitectural explorations of future memory hierarchies. One technique to mitigate this problem is to create spatio-temporal models of access streams and use them to explore memory system tradeoffs. However, existing memory stream models have weaknesses such as they only model temporal locality behavior or model spatio-temporal locality using global stride transitions, resulting in high storage/metadata overhead. In this paper, we propose HALO, a Hierarchical memory Access LOcality modeling technique that identifies patterns by isolating global memory references into localized streams and further zooming into each local stream capturing multi-granularity spatial locality patterns. HALO also models the interleaving degree between localized stream accesses leveraging coarse-grained reuse locality. We evaluate HALO's effectiveness in replicating original application performance using over 20K different memory system configurations and show that HALO achieves over 98.3%, 95.6%, 99.3% and 96% accuracy in replicating performance of prefetcher-enabled L1 & L2 caches, TLB and DRAM respectively. HALO outperforms the state-of-the-art memory cloning schemes, WEST and STM, while using ~39X less metadata storage than STM.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132557669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Warp-Consolidation: A Novel Execution Model for GPUs
Ang Li, Weifeng Liu, Linnan Wang, K. Barker, S. Song
DOI: 10.1145/3205289.3205294
With the unprecedented growth in compute capability and memory bandwidth on modern GPUs, parallel communication and synchronization have become a major concern for continued performance scaling. This is especially the case for emerging big-data applications. Instead of relying on a few heavily loaded CTAs that may expose opportunities for intra-CTA data reuse, current technology and design trends suggest the performance potential of allocating more lightweight CTAs that process individual tasks more independently, as the overheads of synchronization, communication, and cooperation may greatly outweigh the benefits of exploiting limited data reuse in heavily loaded CTAs. This paper follows this trend and proposes a novel execution model for modern GPUs that hides the CTA execution hierarchy of the classic GPU execution model while exposing the originally hidden warp-level execution. Specifically, it relies on individual warps to undertake the original CTAs' tasks. The key observation is that by replacing traditional inter-warp communication (e.g., via shared memory), cooperation (e.g., via bar primitives), and synchronization (e.g., via CTA barriers) with more efficient intra-warp communication (e.g., via register shuffling), cooperation (e.g., via warp voting), and synchronization (natural lockstep execution) across the SIMD lanes within a warp, significant performance gains can be achieved. We analyze the pros and cons of this design and propose corresponding solutions to counter potential negative effects. Experimental results on a diverse group of thirty-two representative applications show that our proposed Warp-Consolidation execution model achieves an average speedup of 1.7x, 2.3x, 1.5x, and 1.2x (up to 6.3x, 31x, 6.4x, and 3.8x) on NVIDIA Kepler (Tesla-K80), Maxwell (Tesla-M40), Pascal (Tesla-P100), and Volta (Tesla-V100) GPUs, respectively, demonstrating its applicability and portability. Our approach can be directly employed either to transform legacy codes or to write new algorithms on modern commodity GPUs.
{"title":"Warp-Consolidation: A Novel Execution Model for GPUs","authors":"Ang Li, Weifeng Liu, Linnan Wang, K. Barker, S. Song","doi":"10.1145/3205289.3205294","DOIUrl":"https://doi.org/10.1145/3205289.3205294","url":null,"abstract":"With the unprecedented development of compute capability and extension of memory bandwidth on modern GPUs, parallel communication and synchronization soon becomes a major concern for continuous performance scaling. This is especially the case for emerging big-data applications. Instead of relying on a few heavily-loaded CTAs that may expose opportunities for intra-CTA data reuse, current technology and design trends suggest the performance potential of allocating more lightweighted CTAs for processing individual tasks more independently, as the overheads from synchronization, communication and cooperation may greatly outweigh the benefits from exploiting limited data reuse in heavily-loaded CTAs. This paper proceeds this trend and proposes a novel execution model for modern GPUs that hides the CTA execution hierarchy from the classic GPU execution model; meanwhile exposes the originally hidden warp-level execution. Specifically, it relies on individual warps to undertake the original CTAs' tasks. The major observation is that by replacing traditional inter-warp communication (e.g., via shared memory), cooperation (e.g., via bar primitives) and synchronizations (e.g., via CTA barriers), with more efficient intra-warp communication (e.g., via register shuffling), cooperation (e.g., via warp voting) and synchronizations (naturally lockstep execution) across the SIMD-lanes within a warp, significant performance gain can be achieved. We analyze the pros and cons for this design and propose corresponding solutions to counter potential negative effects. Experimental results on a diverse group of thirty-two representative applications show that our proposed Warp-Consolidation execution model can achieve an average speedup of 1.7x, 2.3x, 1.5x and 1.2x (up to 6.3x, 31x, 6.4x and 3.8x) on NVIDIA Kepler (Tesla-K80), Maxwell (Tesla-M40), Pascal (Tesla-P100) and Volta (Tesla-V100) GPUs, respectively, demonstrating its applicability and portability. Our approach can be directly employed to either transform legacy codes or write new algorithms on modern commodity GPUs.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130423378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

cuMBIR: An Efficient Framework for Low-dose X-ray CT Image Reconstruction on GPUs
Xiuhong Li, Yun Liang, Wentai Zhang, Taide Liu, Haochen Li, Guojie Luo, M. Jiang
DOI: 10.1145/3205289.3205309

Low-dose X-ray computed tomography (XCT) is a popular imaging technique for visualizing the internal structure of an object non-destructively. Model-Based Iterative Reconstruction (MBIR) can reconstruct high-quality images, but at the cost of large computational demands. Therefore, MBIR often resorts to platforms with hardware accelerators such as GPUs to speed up the reconstruction process. In MBIR, reconstruction minimizes an objective function by updating the image iteratively. The X-ray source emits large numbers of X-rays from various views to cover the object as completely as possible, and different X-rays have complex and irregular geometric relationships. This inherent irregularity makes minimizing the objective function on GPUs very challenging. First, different implementations of the minimization have different impacts on convergence and GPU resource utilization. To this end, we explore different solvers for the minimization problem and different parallelism granularities for GPU kernel design. Second, the complex and irregular geometry of the X-rays introduces irregular memory behavior: two nearby X-rays may intersect and thus incur memory collisions, while two far-apart X-rays may incur non-coalesced memory accesses. We design a unified thread mapping algorithm to guide the mapping from X-rays to threads, which mitigates memory collisions and non-coalesced accesses together. Finally, we present a series of architecture-level optimizations to fully unleash the horsepower of GPUs. Evaluation results demonstrate that cuMBIR achieves a 1.48X speedup over the state-of-the-art GPU implementation.
{"title":"cuMBIR: An Efficient Framework for Low-dose X-ray CT Image Reconstruction on GPUs","authors":"Xiuhong Li, Yun Liang, Wentai Zhang, Taide Liu, Haochen Li, Guojie Luo, M. Jiang","doi":"10.1145/3205289.3205309","DOIUrl":"https://doi.org/10.1145/3205289.3205309","url":null,"abstract":"Low-dose X-ray computed tomography (XCT) is a popular imaging technique to visualize the inside structure of object non-destructively. Model-based Iterative Reconstruction (MBIR) method can reconstruct high-quality image but at the cost of large computational demands. Therefore, MBIR of ten resorts to the platforms with hardware accelerators such as GPUs to speed up the reconstruction process. For MBIR, the reconstruction process is to minimize an objective function by updating image iteratively. The X-ray source emits large amounts of X-rays from various views to cover the object as much as possible. Different X-rays always have complex and irregular geometric relationship. This inherent irregularity makes the minimization process of the objective function on GPUs very challenging. First, different implementations of the minimization of objective function have different impacts on the convergence and GPU resource utilization. To this end, we explore different solvers to the minimization problem and different parallelism granularities for GPU kernel design. Second, the complex and irregular geometric relationship of X-rays introduces irregular memory behaviors. Two nearby X-rays may intersect and thus incur memory collisions, while two far away X-rays may incur non-coalesced memory accesses. We design a unified thread mapping algorithm to guide the mapping from X-rays to threads, which can optimize the memory collisions and non-coalesced memory accesses together. Finally, we present a series of architecture level optimizations to fully release the horse power of GPUs. Evaluation results demonstrate that cuMBIR can achieve 1.48X speedup over the state-of-the-art implementation on GPUs.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124240514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Automated Analysis of Time Series Data to Understand Parallel Program Behaviors
Lai Wei, J. Mellor-Crummey
DOI: 10.1145/3205289.3205308

Traditionally, performance analysis tools have focused on collecting measurements, attributing them to program source code, and presenting them; responsibility for analysis and interpretation of measurement data falls to application developers. While profiles of parallel programs can identify the presence of performance problems, often developers need to analyze execution behavior over time to understand how and why parallel inefficiencies arise. With the growing scale of supercomputers, such manual analysis is becoming increasingly difficult. In many cases, performance problems of interest only appear at larger scales. Manual analysis of time series data from executions on extreme-scale parallel systems is daunting as the volume of data across processors and time makes it difficult to assimilate. To address this problem, we have developed an automated analysis framework that generates compact summaries of time series data for parallel program executions. These summaries provide users with high-level insight into patterns in the performance data and can quickly direct a user's attention to potential performance bottlenecks. We demonstrate the effectiveness of our framework by applying it to time-series measurements of two scientific codes.
{"title":"Automated Analysis of Time Series Data to Understand Parallel Program Behaviors","authors":"Lai Wei, J. Mellor-Crummey","doi":"10.1145/3205289.3205308","DOIUrl":"https://doi.org/10.1145/3205289.3205308","url":null,"abstract":"Traditionally, performance analysis tools have focused on collecting measurements, attributing them to program source code, and presenting them; responsibility for analysis and interpretation of measurement data falls to application developers. While profiles of parallel programs can identify the presence of performance problems, often developers need to analyze execution behavior over time to understand how and why parallel inefficiencies arise. With the growing scale of supercomputers, such manual analysis is becoming increasingly difficult. In many cases, performance problems of interest only appear at larger scales. Manual analysis of time series data from executions on extreme-scale parallel systems is daunting as the volume of data across processors and time makes it difficult to assimilate. To address this problem, we have developed an automated analysis framework that generates compact summaries of time series data for parallel program executions. These summaries provide users with high-level insight into patterns in the performance data and can quickly direct a user's attention to potential performance bottlenecks. We demonstrate the effectiveness of our framework by applying it to time-series measurements of two scientific codes.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126911800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Dynamic Load Balancing for Compressible Multiphase Turbulence
Keke Zhai, Tania Banerjee-Mishra, D. Zwick, J. Hackl, S. Ranka
DOI: 10.1145/3205289.3205304
CMT-nek is a new scientific application for performing high-fidelity predictive simulations of particle-laden, explosively dispersed turbulent flows. CMT-nek involves detailed simulations, is compute intensive, and is targeted for deployment on exascale platforms. The moving particles are the main source of load imbalance as the application is executed on parallel processors. In a demonstration problem, all the particles are initially in a closed container until a detonation occurs and the particles move apart. If all processors get an equal share of the fluid domain, then only some of the processors get sections of the domain that are initially laden with particles, leading to disparate loads on the processors. To eliminate load imbalance across processors and to reduce the makespan, we present several load balancing algorithms for CMT-nek on large-scale multicore platforms consisting of hundreds of thousands of cores. The load balancing algorithms are described in detail, their performance is compared, and the associated overheads are analyzed. Evaluations of the application with and without load balancing show that with load balancing, simulation time improves by a factor of up to 9.97.
{"title":"Dynamic Load Balancing for Compressible Multiphase Turbulence","authors":"Keke Zhai, Tania Banerjee-Mishra, D. Zwick, J. Hackl, S. Ranka","doi":"10.1145/3205289.3205304","DOIUrl":"https://doi.org/10.1145/3205289.3205304","url":null,"abstract":"CMT-nek is a new scientific application for performing high fidelity predictive simulations of particle laden explosively dispersed turbulent flows. CMT-nek involves detailed simulations, is compute intensive and is targeted to be deployed on exascale platforms. The moving particles are the main source of load imbalance as the application is executed on parallel processors. In a demonstration problem, all the particles are initially in a closed container until a detonation occurs and the particles move apart. If all processors get an equal share of the fluid domain, then only some of the processors get sections of the domain that are initially laden with particles, leading to disparate load on the processors. In order to eliminate load imbalance in different processors and to speedup the makespan, we present different load balancing algorithms for CMT-nek on large scale multicore platforms consisting of hundred of thousands of cores. The detailed process of the load balancing algorithms are presented. The performance of the different load balancing algorithms are compared and the associated overheads are analyzed. Evaluations on the application with and without load balancing are conducted and these show that with load balancing, simulation time becomes faster by a factor of up to 9.97.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127940920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Rethinking Node Allocation Strategy for Data-intensive Applications in Consideration of Spatially Bursty I/O
Jie Yu, Guangming Liu, Xin Liu, Wenrui Dong, Xiaoyong Li, Yusheng Liu
DOI: 10.1145/3205289.3205305
Job schedulers in HPC systems by default allocate adjacent compute nodes to a job to lower communication overhead. However, this strategy is no longer appropriate for data-intensive jobs running on systems with an I/O forwarding layer, where each I/O node performs I/O on behalf of a subset of nearby compute nodes. Under the default allocation strategy, a job's nodes are located close to each other, so the job uses only a limited number of I/O nodes. Since jobs' I/O activities are bursty, at any moment only a minority of jobs in the system are busy processing I/O. Consequently, the bursty I/O traffic in the system is also concentrated in space, making the load on I/O nodes highly unbalanced. In this paper, we use job logs and I/O traces collected from Tianhe-1A to quantitatively analyze the two causes of spatially bursty I/O: uneven I/O traffic across a job's processes and uneven distribution of a job's nodes. Based on this analysis, we propose a node allocation strategy that takes into account processes' differing amounts of I/O traffic, so that the I/O traffic can be processed by more I/O nodes more evenly. Our evaluations on Tianhe-1A with synthetic benchmarks and realistic applications show that the proposed strategy can further exploit the potential of the I/O forwarding layer and improve I/O performance.
{"title":"Rethinking Node Allocation Strategy for Data-intensive Applications in Consideration of Spatially Bursty I/O","authors":"Jie Yu, Guangming Liu, Xin Liu, Wenrui Dong, Xiaoyong Li, Yusheng Liu","doi":"10.1145/3205289.3205305","DOIUrl":"https://doi.org/10.1145/3205289.3205305","url":null,"abstract":"Job scheduling in HPC systems by default allocate adjacent compute nodes for jobs for lower communication overhead. However, it is no longer applicable to data-intensive jobs running on systems with I/O forwarding layer, where each I/O node performs I/O on behalf of a subset of compute nodes in the vicinity. Under the default node allocation strategy a job's nodes are located close to each other and thus it only uses a limited number of I/O nodes. Since the I/O activities of jobs are bursty, at any moment only a minority of jobs in the system are busy processing I/O. Consequently, the bursty I/O traffic in the system is also concentrated in space, making the load on I/O nodes highly unbalanced. In this paper, we use the job logs and I/O traces collected from Tianhe-1A to quantitatively analyze the two causes of spatially bursty I/O, including uneven I/O traffic of job's processes and uneven distribution of job's nodes. Based on the analysis we propose a node allocation strategy that takes account of processes' different amounts of I/O traffic, so that the I/O traffic can be processed by more I/O nodes more evenly. Our evaluations on Tianhe-1A with synthetic benchmarks and realistic applications show that the proposed strategy can further exploit the potential of I/O forwarding layer and promote the I/O performance.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125457046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

IRIS: I/O Redirection via Integrated Storage
Anthony Kougkas, H. Devarajan, Xian-He Sun
DOI: 10.1145/3205289.3205322

There is an ocean of available storage solutions in modern high-performance and distributed systems. These solutions comprise Parallel File Systems (PFS) for the more traditional high-performance computing (HPC) systems and Object Stores for emerging cloud environments. More often than not, these storage solutions are tied to specific APIs and data models, and thus bind developers, applications, and entire computing facilities to certain interfaces. Each storage system is designed and optimized for certain applications but does not perform well for others. Furthermore, modern applications have become more and more complex, consisting of a collection of phases with different computation and I/O requirements. In this paper, we propose a unified storage access system called IRIS (I/O Redirection via Integrated Storage). IRIS enables unified data access and seamlessly bridges the semantic gap between file systems and object stores. With IRIS, emerging High-Performance Data Analytics software has capable and diverse I/O support. IRIS can bring us closer to the convergence of HPC and Cloud environments by combining the best storage subsystems of both worlds. Experimental results show that IRIS can deliver more than 7x better performance than existing solutions.
{"title":"IRIS","authors":"Anthony Kougkas, H. Devarajan, Xian-He Sun","doi":"10.1145/3205289.3205322","DOIUrl":"https://doi.org/10.1145/3205289.3205322","url":null,"abstract":"There is an ocean of available storage solutions in modern high-performance and distributed systems. These solutions consist of Parallel File Systems (PFS) for the more traditional high-performance computing (HPC) systems and of Object Stores for emerging cloud environments. More of ten than not, these storage solutions are tied to specific APIs and data models and thus, bind developers, applications, and entire computing facilities to using certain interfaces. Each storage system is designed and optimized for certain applications but does not perform well for others. Furthermore, modern applications have become more and more complex consisting of a collection of phases with different computation and I/O requirements. In this paper, we propose a unified storage access system, called IRIS (i.e., I/O Redirection via Integrated Storage). IRIS enables unified data access and seamlessly bridges the semantic gap between file systems and object stores. With IRIS, emerging High-Performance Data Analytics software has capable and diverse I/O support. IRIS can bring us closer to the convergence of HPC and Cloud environments by combining the best storage subsystems from both worlds. Experimental results show that IRIS can grant more than 7x improvement in performance than existing solutions.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122891937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs
Ben Karsin, Volker Weichert, H. Casanova, J. Iacono, Nodari Sitchinava
DOI: 10.1145/3205289.3205298
We study the relationship between memory accesses, bank conflicts, thread multiplicity (also known as over-subscription), and instruction-level parallelism in comparison-based sorting algorithms for Graphics Processing Units (GPUs). We experimentally validate a proposed formula that relates these parameters to the asymptotic number of memory accesses performed by an algorithm. Using this formula, we analyze and compare several GPU sorting algorithms, identifying the key performance bottlenecks in each of them. Based on this analysis, we propose a GPU-efficient multiway merge-sort algorithm, GPU-MMS, which minimizes or eliminates these bottlenecks and balances various limiting factors for specific hardware. We implement GPU-MMS and compare it to sorting algorithm implementations in state-of-the-art GPU libraries on three GPU architectures. Despite these library implementations being highly optimized, we find that GPU-MMS outperforms them by an average of 21% for random integer inputs and 14% for random key-value pairs.
{"title":"Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs","authors":"Ben Karsin, Volker Weichert, H. Casanova, J. Iacono, Nodari Sitchinava","doi":"10.1145/3205289.3205298","DOIUrl":"https://doi.org/10.1145/3205289.3205298","url":null,"abstract":"We study the relationship between memory accesses, bank conflicts, thread multiplicity (also known as over-subscription) and instruction-level parallelism in comparison-based sorting algorithms for Graphics Processing Units (GPUs). We experimentally validate a proposed formula that relates these parameters with asymptotic analysis of the number of memory accesses by an algorithm. Using this formula we analyze and compare several GPU sorting algorithms, identifying key performance bottlenecks in each one of them. Based on this analysis we propose a GPU-efficient multiway merge-sort algorithm, GPU-MMS, which minimizes or eliminates these bottlenecks and balances various limiting factors for specific hardware. We realize an implementation of GPU-MMS and compare it to sorting algorithm implementations in state-of-the-art GPU libraries on three GPU architectures. Despite these library implementations being highly optimized, we find that GPU-MMS outperforms them by an average of 21% for random integer inputs and 14% for random key-value pairs.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133877252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}