CELIA
Hao Yan, Hebin R. Cherian, Ethan C. Ahn, Lide Duan
DOI: 10.1145/3205289.3205297

HALO: A Hierarchical Memory Access Locality Modeling Technique For Memory System Explorations
Reena Panda, L. John
DOI: 10.1145/3205289.3205323

The growing complexity of applications poses new challenges to memory system design due to their data-intensive nature, complex access patterns, and larger footprints. The slowness of full-system simulators, the difficulty of running the deep software stacks of many emerging workloads in simulation, and the proprietary nature of software all hinder fast and accurate microarchitectural exploration of future memory hierarchies. One technique to mitigate this problem is to create spatio-temporal models of access streams and use them to explore memory system tradeoffs. However, existing memory stream models have weaknesses: they either model only temporal locality behavior or model spatio-temporal locality using global stride transitions, resulting in high storage/metadata overhead. In this paper, we propose HALO, a Hierarchical memory Access LOcality modeling technique that identifies patterns by isolating global memory references into localized streams and further zooming into each local stream to capture multi-granularity spatial locality patterns. HALO also models the degree of interleaving between localized stream accesses by leveraging coarse-grained reuse locality. We evaluate HALO's effectiveness in replicating original application performance using over 20K different memory system configurations and show that HALO achieves over 98.3%, 95.6%, 99.3%, and 96% accuracy in replicating the performance of prefetcher-enabled L1 and L2 caches, the TLB, and DRAM, respectively. HALO outperforms the state-of-the-art memory cloning schemes WEST and STM, while using ~39X less metadata storage than STM.
{"title":"HALO: A Hierarchical Memory Access Locality Modeling Technique For Memory System Explorations","authors":"Reena Panda, L. John","doi":"10.1145/3205289.3205323","DOIUrl":"https://doi.org/10.1145/3205289.3205323","url":null,"abstract":"Growing complexity of applications pose new challenges to memory system design due to their data intensive nature, complex access patterns, larger footprints, etc. The slow nature of full-system simulators, challenges of simulators to run deep software stacks of many emerging workloads, proprietary nature of software, etc. pose challenges to fast and accurate microarchitectural explorations of future memory hierarchies. One technique to mitigate this problem is to create spatio-temporal models of access streams and use them to explore memory system tradeoffs. However, existing memory stream models have weaknesses such as they only model temporal locality behavior or model spatio-temporal locality using global stride transitions, resulting in high storage/metadata overhead. In this paper, we propose HALO, a Hierarchical memory Access LOcality modeling technique that identifies patterns by isolating global memory references into localized streams and further zooming into each local stream capturing multi-granularity spatial locality patterns. HALO also models the interleaving degree between localized stream accesses leveraging coarse-grained reuse locality. We evaluate HALO's effectiveness in replicating original application performance using over 20K different memory system configurations and show that HALO achieves over 98.3%, 95.6%, 99.3% and 96% accuracy in replicating performance of prefetcher-enabled L1 & L2 caches, TLB and DRAM respectively. HALO outperforms the state-of-the-art memory cloning schemes, WEST and STM, while using ~39X less metadata storage than STM.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132557669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Warp-Consolidation: A Novel Execution Model for GPUs
Ang Li, Weifeng Liu, Linnan Wang, K. Barker, S. Song
DOI: 10.1145/3205289.3205294
With the unprecedented growth in compute capability and memory bandwidth on modern GPUs, parallel communication and synchronization have become a major concern for continued performance scaling. This is especially the case for emerging big-data applications. Instead of relying on a few heavily loaded CTAs that may expose opportunities for intra-CTA data reuse, current technology and design trends suggest the performance potential of allocating more lightweight CTAs that process individual tasks more independently, as the overheads of synchronization, communication, and cooperation may greatly outweigh the benefits of exploiting limited data reuse in heavily loaded CTAs. This paper follows this trend and proposes a novel execution model for modern GPUs that hides the CTA execution hierarchy of the classic GPU execution model while exposing the originally hidden warp-level execution. Specifically, it relies on individual warps to undertake the original CTAs' tasks. The key observation is that by replacing traditional inter-warp communication (e.g., via shared memory), cooperation (e.g., via bar primitives), and synchronization (e.g., via CTA barriers) with more efficient intra-warp communication (e.g., via register shuffling), cooperation (e.g., via warp voting), and synchronization (natural lockstep execution) across the SIMD lanes within a warp, significant performance gains can be achieved. We analyze the pros and cons of this design and propose corresponding solutions to counter potential negative effects. Experimental results on a diverse group of thirty-two representative applications show that our proposed Warp-Consolidation execution model achieves an average speedup of 1.7x, 2.3x, 1.5x, and 1.2x (up to 6.3x, 31x, 6.4x, and 3.8x) on NVIDIA Kepler (Tesla-K80), Maxwell (Tesla-M40), Pascal (Tesla-P100), and Volta (Tesla-V100) GPUs, respectively, demonstrating its applicability and portability. Our approach can be directly employed either to transform legacy codes or to write new algorithms on modern commodity GPUs.
{"title":"Warp-Consolidation: A Novel Execution Model for GPUs","authors":"Ang Li, Weifeng Liu, Linnan Wang, K. Barker, S. Song","doi":"10.1145/3205289.3205294","DOIUrl":"https://doi.org/10.1145/3205289.3205294","url":null,"abstract":"With the unprecedented development of compute capability and extension of memory bandwidth on modern GPUs, parallel communication and synchronization soon becomes a major concern for continuous performance scaling. This is especially the case for emerging big-data applications. Instead of relying on a few heavily-loaded CTAs that may expose opportunities for intra-CTA data reuse, current technology and design trends suggest the performance potential of allocating more lightweighted CTAs for processing individual tasks more independently, as the overheads from synchronization, communication and cooperation may greatly outweigh the benefits from exploiting limited data reuse in heavily-loaded CTAs. This paper proceeds this trend and proposes a novel execution model for modern GPUs that hides the CTA execution hierarchy from the classic GPU execution model; meanwhile exposes the originally hidden warp-level execution. Specifically, it relies on individual warps to undertake the original CTAs' tasks. The major observation is that by replacing traditional inter-warp communication (e.g., via shared memory), cooperation (e.g., via bar primitives) and synchronizations (e.g., via CTA barriers), with more efficient intra-warp communication (e.g., via register shuffling), cooperation (e.g., via warp voting) and synchronizations (naturally lockstep execution) across the SIMD-lanes within a warp, significant performance gain can be achieved. We analyze the pros and cons for this design and propose corresponding solutions to counter potential negative effects. Experimental results on a diverse group of thirty-two representative applications show that our proposed Warp-Consolidation execution model can achieve an average speedup of 1.7x, 2.3x, 1.5x and 1.2x (up to 6.3x, 31x, 6.4x and 3.8x) on NVIDIA Kepler (Tesla-K80), Maxwell (Tesla-M40), Pascal (Tesla-P100) and Volta (Tesla-V100) GPUs, respectively, demonstrating its applicability and portability. Our approach can be directly employed to either transform legacy codes or write new algorithms on modern commodity GPUs.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130423378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

cuMBIR: An Efficient Framework for Low-dose X-ray CT Image Reconstruction on GPUs
Xiuhong Li, Yun Liang, Wentai Zhang, Taide Liu, Haochen Li, Guojie Luo, M. Jiang
DOI: 10.1145/3205289.3205309

Low-dose X-ray computed tomography (XCT) is a popular imaging technique for visualizing the internal structure of an object non-destructively. Model-Based Iterative Reconstruction (MBIR) can reconstruct high-quality images, but at the cost of large computational demands. Therefore, MBIR often resorts to platforms with hardware accelerators such as GPUs to speed up the reconstruction process. In MBIR, reconstruction minimizes an objective function by updating the image iteratively. The X-ray source emits large numbers of X-rays from various views to cover the object as completely as possible, and different X-rays have complex and irregular geometric relationships. This inherent irregularity makes minimizing the objective function on GPUs very challenging. First, different implementations of the minimization have different impacts on convergence and GPU resource utilization. To this end, we explore different solvers for the minimization problem and different parallelism granularities for GPU kernel design. Second, the complex and irregular geometry of the X-rays introduces irregular memory behavior: two nearby X-rays may intersect and thus incur memory collisions, while two far-apart X-rays may incur non-coalesced memory accesses. We design a unified thread mapping algorithm to guide the mapping from X-rays to threads, which mitigates memory collisions and non-coalesced accesses together. Finally, we present a series of architecture-level optimizations to fully unleash the horsepower of GPUs. Evaluation results demonstrate that cuMBIR achieves a 1.48X speedup over the state-of-the-art GPU implementation.
{"title":"cuMBIR: An Efficient Framework for Low-dose X-ray CT Image Reconstruction on GPUs","authors":"Xiuhong Li, Yun Liang, Wentai Zhang, Taide Liu, Haochen Li, Guojie Luo, M. Jiang","doi":"10.1145/3205289.3205309","DOIUrl":"https://doi.org/10.1145/3205289.3205309","url":null,"abstract":"Low-dose X-ray computed tomography (XCT) is a popular imaging technique to visualize the inside structure of object non-destructively. Model-based Iterative Reconstruction (MBIR) method can reconstruct high-quality image but at the cost of large computational demands. Therefore, MBIR of ten resorts to the platforms with hardware accelerators such as GPUs to speed up the reconstruction process. For MBIR, the reconstruction process is to minimize an objective function by updating image iteratively. The X-ray source emits large amounts of X-rays from various views to cover the object as much as possible. Different X-rays always have complex and irregular geometric relationship. This inherent irregularity makes the minimization process of the objective function on GPUs very challenging. First, different implementations of the minimization of objective function have different impacts on the convergence and GPU resource utilization. To this end, we explore different solvers to the minimization problem and different parallelism granularities for GPU kernel design. Second, the complex and irregular geometric relationship of X-rays introduces irregular memory behaviors. Two nearby X-rays may intersect and thus incur memory collisions, while two far away X-rays may incur non-coalesced memory accesses. We design a unified thread mapping algorithm to guide the mapping from X-rays to threads, which can optimize the memory collisions and non-coalesced memory accesses together. Finally, we present a series of architecture level optimizations to fully release the horse power of GPUs. Evaluation results demonstrate that cuMBIR can achieve 1.48X speedup over the state-of-the-art implementation on GPUs.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124240514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Automated Analysis of Time Series Data to Understand Parallel Program Behaviors
Lai Wei, J. Mellor-Crummey
DOI: 10.1145/3205289.3205308

Traditionally, performance analysis tools have focused on collecting measurements, attributing them to program source code, and presenting them; responsibility for analysis and interpretation of measurement data falls to application developers. While profiles of parallel programs can identify the presence of performance problems, often developers need to analyze execution behavior over time to understand how and why parallel inefficiencies arise. With the growing scale of supercomputers, such manual analysis is becoming increasingly difficult. In many cases, performance problems of interest only appear at larger scales. Manual analysis of time series data from executions on extreme-scale parallel systems is daunting as the volume of data across processors and time makes it difficult to assimilate. To address this problem, we have developed an automated analysis framework that generates compact summaries of time series data for parallel program executions. These summaries provide users with high-level insight into patterns in the performance data and can quickly direct a user's attention to potential performance bottlenecks. We demonstrate the effectiveness of our framework by applying it to time-series measurements of two scientific codes.
{"title":"Automated Analysis of Time Series Data to Understand Parallel Program Behaviors","authors":"Lai Wei, J. Mellor-Crummey","doi":"10.1145/3205289.3205308","DOIUrl":"https://doi.org/10.1145/3205289.3205308","url":null,"abstract":"Traditionally, performance analysis tools have focused on collecting measurements, attributing them to program source code, and presenting them; responsibility for analysis and interpretation of measurement data falls to application developers. While profiles of parallel programs can identify the presence of performance problems, often developers need to analyze execution behavior over time to understand how and why parallel inefficiencies arise. With the growing scale of supercomputers, such manual analysis is becoming increasingly difficult. In many cases, performance problems of interest only appear at larger scales. Manual analysis of time series data from executions on extreme-scale parallel systems is daunting as the volume of data across processors and time makes it difficult to assimilate. To address this problem, we have developed an automated analysis framework that generates compact summaries of time series data for parallel program executions. These summaries provide users with high-level insight into patterns in the performance data and can quickly direct a user's attention to potential performance bottlenecks. We demonstrate the effectiveness of our framework by applying it to time-series measurements of two scientific codes.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126911800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Dynamic Load Balancing for Compressible Multiphase Turbulence
Keke Zhai, Tania Banerjee-Mishra, D. Zwick, J. Hackl, S. Ranka
DOI: 10.1145/3205289.3205304
CMT-nek is a new scientific application for performing high-fidelity predictive simulations of particle-laden, explosively dispersed turbulent flows. CMT-nek involves detailed simulations, is compute intensive, and is targeted for deployment on exascale platforms. The moving particles are the main source of load imbalance as the application is executed on parallel processors. In a demonstration problem, all the particles are initially in a closed container until a detonation occurs and the particles move apart. If all processors get an equal share of the fluid domain, then only some of the processors get sections of the domain that are initially laden with particles, leading to disparate loads on the processors. To eliminate load imbalance across processors and to reduce the makespan, we present several load balancing algorithms for CMT-nek on large-scale multicore platforms consisting of hundreds of thousands of cores. The load balancing algorithms are described in detail, their performance is compared, and the associated overheads are analyzed. Evaluations of the application with and without load balancing show that with load balancing, simulation time improves by a factor of up to 9.97.
{"title":"Dynamic Load Balancing for Compressible Multiphase Turbulence","authors":"Keke Zhai, Tania Banerjee-Mishra, D. Zwick, J. Hackl, S. Ranka","doi":"10.1145/3205289.3205304","DOIUrl":"https://doi.org/10.1145/3205289.3205304","url":null,"abstract":"CMT-nek is a new scientific application for performing high fidelity predictive simulations of particle laden explosively dispersed turbulent flows. CMT-nek involves detailed simulations, is compute intensive and is targeted to be deployed on exascale platforms. The moving particles are the main source of load imbalance as the application is executed on parallel processors. In a demonstration problem, all the particles are initially in a closed container until a detonation occurs and the particles move apart. If all processors get an equal share of the fluid domain, then only some of the processors get sections of the domain that are initially laden with particles, leading to disparate load on the processors. In order to eliminate load imbalance in different processors and to speedup the makespan, we present different load balancing algorithms for CMT-nek on large scale multicore platforms consisting of hundred of thousands of cores. The detailed process of the load balancing algorithms are presented. The performance of the different load balancing algorithms are compared and the associated overheads are analyzed. Evaluations on the application with and without load balancing are conducted and these show that with load balancing, simulation time becomes faster by a factor of up to 9.97.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127940920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Rethinking Node Allocation Strategy for Data-intensive Applications in Consideration of Spatially Bursty I/O
Jie Yu, Guangming Liu, Xin Liu, Wenrui Dong, Xiaoyong Li, Yusheng Liu
DOI: 10.1145/3205289.3205305
Job schedulers in HPC systems by default allocate adjacent compute nodes to a job to lower communication overhead. However, this strategy is no longer appropriate for data-intensive jobs running on systems with an I/O forwarding layer, where each I/O node performs I/O on behalf of a subset of nearby compute nodes. Under the default allocation strategy, a job's nodes are located close to each other, so the job uses only a limited number of I/O nodes. Since jobs' I/O activities are bursty, at any moment only a minority of jobs in the system are busy processing I/O. Consequently, the bursty I/O traffic in the system is also concentrated in space, making the load on I/O nodes highly unbalanced. In this paper, we use job logs and I/O traces collected from Tianhe-1A to quantitatively analyze the two causes of spatially bursty I/O: uneven I/O traffic across a job's processes and uneven distribution of a job's nodes. Based on this analysis, we propose a node allocation strategy that takes into account processes' differing amounts of I/O traffic, so that the I/O traffic can be processed by more I/O nodes more evenly. Our evaluations on Tianhe-1A with synthetic benchmarks and realistic applications show that the proposed strategy can further exploit the potential of the I/O forwarding layer and improve I/O performance.
{"title":"Rethinking Node Allocation Strategy for Data-intensive Applications in Consideration of Spatially Bursty I/O","authors":"Jie Yu, Guangming Liu, Xin Liu, Wenrui Dong, Xiaoyong Li, Yusheng Liu","doi":"10.1145/3205289.3205305","DOIUrl":"https://doi.org/10.1145/3205289.3205305","url":null,"abstract":"Job scheduling in HPC systems by default allocate adjacent compute nodes for jobs for lower communication overhead. However, it is no longer applicable to data-intensive jobs running on systems with I/O forwarding layer, where each I/O node performs I/O on behalf of a subset of compute nodes in the vicinity. Under the default node allocation strategy a job's nodes are located close to each other and thus it only uses a limited number of I/O nodes. Since the I/O activities of jobs are bursty, at any moment only a minority of jobs in the system are busy processing I/O. Consequently, the bursty I/O traffic in the system is also concentrated in space, making the load on I/O nodes highly unbalanced. In this paper, we use the job logs and I/O traces collected from Tianhe-1A to quantitatively analyze the two causes of spatially bursty I/O, including uneven I/O traffic of job's processes and uneven distribution of job's nodes. Based on the analysis we propose a node allocation strategy that takes account of processes' different amounts of I/O traffic, so that the I/O traffic can be processed by more I/O nodes more evenly. Our evaluations on Tianhe-1A with synthetic benchmarks and realistic applications show that the proposed strategy can further exploit the potential of I/O forwarding layer and promote the I/O performance.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125457046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

IRIS: I/O Redirection via Integrated Storage
Anthony Kougkas, H. Devarajan, Xian-He Sun
DOI: 10.1145/3205289.3205322

There is an ocean of available storage solutions in modern high-performance and distributed systems. These solutions comprise Parallel File Systems (PFS) for the more traditional high-performance computing (HPC) systems and Object Stores for emerging cloud environments. More often than not, these storage solutions are tied to specific APIs and data models, and thus bind developers, applications, and entire computing facilities to certain interfaces. Each storage system is designed and optimized for certain applications but does not perform well for others. Furthermore, modern applications have become more and more complex, consisting of a collection of phases with different computation and I/O requirements. In this paper, we propose a unified storage access system called IRIS (I/O Redirection via Integrated Storage). IRIS enables unified data access and seamlessly bridges the semantic gap between file systems and object stores. With IRIS, emerging High-Performance Data Analytics software has capable and diverse I/O support. IRIS can bring us closer to the convergence of HPC and Cloud environments by combining the best storage subsystems of both worlds. Experimental results show that IRIS can deliver more than 7x better performance than existing solutions.
{"title":"IRIS","authors":"Anthony Kougkas, H. Devarajan, Xian-He Sun","doi":"10.1145/3205289.3205322","DOIUrl":"https://doi.org/10.1145/3205289.3205322","url":null,"abstract":"There is an ocean of available storage solutions in modern high-performance and distributed systems. These solutions consist of Parallel File Systems (PFS) for the more traditional high-performance computing (HPC) systems and of Object Stores for emerging cloud environments. More of ten than not, these storage solutions are tied to specific APIs and data models and thus, bind developers, applications, and entire computing facilities to using certain interfaces. Each storage system is designed and optimized for certain applications but does not perform well for others. Furthermore, modern applications have become more and more complex consisting of a collection of phases with different computation and I/O requirements. In this paper, we propose a unified storage access system, called IRIS (i.e., I/O Redirection via Integrated Storage). IRIS enables unified data access and seamlessly bridges the semantic gap between file systems and object stores. With IRIS, emerging High-Performance Data Analytics software has capable and diverse I/O support. IRIS can bring us closer to the convergence of HPC and Cloud environments by combining the best storage subsystems from both worlds. Experimental results show that IRIS can grant more than 7x improvement in performance than existing solutions.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122891937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs
Ben Karsin, Volker Weichert, H. Casanova, J. Iacono, Nodari Sitchinava
DOI: 10.1145/3205289.3205298
We study the relationship between memory accesses, bank conflicts, thread multiplicity (also known as over-subscription), and instruction-level parallelism in comparison-based sorting algorithms for Graphics Processing Units (GPUs). We experimentally validate a proposed formula that relates these parameters to the asymptotic number of memory accesses performed by an algorithm. Using this formula, we analyze and compare several GPU sorting algorithms, identifying the key performance bottlenecks in each of them. Based on this analysis, we propose a GPU-efficient multiway merge-sort algorithm, GPU-MMS, which minimizes or eliminates these bottlenecks and balances various limiting factors for specific hardware. We implement GPU-MMS and compare it to sorting algorithm implementations in state-of-the-art GPU libraries on three GPU architectures. Despite these library implementations being highly optimized, we find that GPU-MMS outperforms them by an average of 21% for random integer inputs and 14% for random key-value pairs.
{"title":"Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs","authors":"Ben Karsin, Volker Weichert, H. Casanova, J. Iacono, Nodari Sitchinava","doi":"10.1145/3205289.3205298","DOIUrl":"https://doi.org/10.1145/3205289.3205298","url":null,"abstract":"We study the relationship between memory accesses, bank conflicts, thread multiplicity (also known as over-subscription) and instruction-level parallelism in comparison-based sorting algorithms for Graphics Processing Units (GPUs). We experimentally validate a proposed formula that relates these parameters with asymptotic analysis of the number of memory accesses by an algorithm. Using this formula we analyze and compare several GPU sorting algorithms, identifying key performance bottlenecks in each one of them. Based on this analysis we propose a GPU-efficient multiway merge-sort algorithm, GPU-MMS, which minimizes or eliminates these bottlenecks and balances various limiting factors for specific hardware. We realize an implementation of GPU-MMS and compare it to sorting algorithm implementations in state-of-the-art GPU libraries on three GPU architectures. Despite these library implementations being highly optimized, we find that GPU-MMS outperforms them by an average of 21% for random integer inputs and 14% for random key-value pairs.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133877252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}