
Latest Publications in SC14: International Conference for High Performance Computing, Networking, Storage and Analysis

Managing DRAM Latency Divergence in Irregular GPGPU Applications
Niladrish Chatterjee, Mike O'Connor, G. Loh, N. Jayasena, R. Balasubramonian
Memory controllers in modern GPUs aggressively reorder requests for high bandwidth usage, often interleaving requests from different warps. This leads to high variance in the latency of different requests issued by the threads of a warp. Since a warp in a SIMT architecture can proceed only when all of its memory requests are returned by memory, such latency divergence causes significant slowdown when running irregular GPGPU applications. To solve this issue, we propose memory scheduling mechanisms that avoid inter-warp interference in the DRAM system to reduce the average memory stall latency experienced by warps. We further reduce latency divergence through mechanisms that coordinate scheduling decisions across multiple independent memory channels. Finally, we show that carefully orchestrating the memory scheduling policy can achieve low average latency for warps, without compromising bandwidth utilization. Our combined scheme yields a 10.1% performance improvement for irregular GPGPU workloads relative to a throughput-optimized GPU memory controller.
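The scheduling intuition can be illustrated with a toy model. The sketch below is a simplification, not the scheduler proposed in the paper; the function names, the unit service time, and the 8-warp/4-request workload are all invented for illustration. It contrasts a round-robin, bandwidth-style ordering with one that drains each warp's requests back-to-back: total channel occupancy is identical, but the average time until a warp's last request returns (the quantity that gates SIMT execution) drops.

```python
# Toy model, not the scheduler proposed in the paper: a single DRAM channel
# services the same 8 warps x 4 requests either interleaved round-robin
# across warps or grouped so one warp's requests finish back-to-back.
# A SIMT warp stalls until its LAST request returns, so grouping lowers the
# average stall even though total channel occupancy is unchanged.

SERVICE_TIME = 1  # abstract time units per DRAM request


def warp_finish_times(schedule):
    """Completion time of each warp's last request under a given service order."""
    finish = {}
    for slot, warp in enumerate(schedule, start=1):
        finish[warp] = slot * SERVICE_TIME  # overwritten until the warp's final request
    return finish


def simulate(num_warps=8, requests_per_warp=4):
    interleaved = [w for _ in range(requests_per_warp) for w in range(num_warps)]
    grouped = [w for w in range(num_warps) for _ in range(requests_per_warp)]
    for name, schedule in (("interleaved", interleaved), ("warp-grouped", grouped)):
        finish = warp_finish_times(schedule)
        avg_stall = sum(finish.values()) / len(finish)
        print(f"{name:12s} avg warp stall = {avg_stall:.1f} units "
              f"(channel busy for {len(schedule)} units either way)")


if __name__ == "__main__":
    simulate()
```

With these toy parameters the interleaved order gives an average warp stall of 28.5 units versus 18.0 for the warp-grouped order, while both keep the channel busy for 32 units, which is the trade-off the abstract describes between per-warp latency and raw bandwidth scheduling.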
{"title":"Managing DRAM Latency Divergence in Irregular GPGPU Applications","authors":"Niladrish Chatterjee, Mike O'Connor, G. Loh, N. Jayasena, R. Balasubramonian","doi":"10.1109/SC.2014.16","DOIUrl":"https://doi.org/10.1109/SC.2014.16","url":null,"abstract":"Memory controllers in modern GPUs aggressively reorder requests for high bandwidth usage, often interleaving requests from different warps. This leads to high variance in the latency of different requests issued by the threads of a warp. Since a warp in a SIMT architecture can proceed only when all of its memory requests are returned by memory, such latency divergence causes significant slowdown when running irregular GPGPU applications. To solve this issue, we propose memory scheduling mechanisms that avoid inter-warp interference in the DRAM system to reduce the average memory stall latency experienced by warps. We further reduce latency divergence through mechanisms that coordinate scheduling decisions across multiple independent memory channels. Finally we show that carefully orchestrating the memory scheduling policy can achieve low average latency for warps, without compromising bandwidth utilization. Our combined scheme yields a 10.1% performance improvement for irregular GPGPU workloads relative to a throughput-optimized GPU memory controller.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126368661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 80
Oil and Water Can Mix: An Integration of Polyhedral and AST-Based Transformations
J. Shirako, L. Pouchet, Vivek Sarkar
Optimizing compilers targeting modern multi-core machines require complex program restructuring to expose the best combinations of coarse- and fine-grain parallelism and data locality. The polyhedral compilation model has provided significant advancements in the seamless handling of compositions of loop transformations, thereby exposing multiple levels of parallelism and improving data reuse. However, it usually implements abstract optimization objectives, for example "maximize data reuse", which often does not deliver the best performance: e.g., the complex loop structures generated can be detrimental to short-vector SIMD performance. In addition, several key transformations such as pipeline-parallelism and unroll-and-jam are difficult to express in the polyhedral framework. In this paper, we propose a novel optimization flow that combines polyhedral and syntactic/AST-based transformations. It generates high-performance code that contains regular loops which can be effectively vectorized, while still implementing sufficient parallelism and data reuse. It combines several transformation stages using both polyhedral and AST-based transformations, delivering performance improvements of up to 3× over the PoCC polyhedral compiler on Intel Nehalem and IBM Power7 multicore processors.
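As a concrete, deliberately simplified illustration of one transformation named above, the sketch below applies unroll-and-jam by hand to a small 2-D reduction. The function names, the unroll factor, and the even-trip-count assumption are invented for this example and do not come from the paper's tool chain; the point is only the loop shape: the outer loop is unrolled by two and the resulting inner-loop copies are fused, producing the regular single inner loop that a backend compiler can map to short-vector SIMD.

```python
# Illustrative sketch of unroll-and-jam (not the paper's tool chain):
# unroll the outer loop by 2, then fuse ("jam") the two copies of the
# inner loop so each inner iteration updates two independent accumulators.
def sum_rows_baseline(A):
    n, m = len(A), len(A[0])
    out = [0.0] * n
    for i in range(n):          # outer loop
        for j in range(m):      # inner loop
            out[i] += A[i][j]
    return out


def sum_rows_unroll_jam(A, unroll=2):
    n, m = len(A), len(A[0])
    assert n % unroll == 0, "sketch assumes the outer trip count divides evenly"
    out = [0.0] * n
    for i in range(0, n, unroll):      # outer loop unrolled by `unroll`
        for j in range(m):             # single jammed inner loop
            out[i] += A[i][j]          # body copy 0
            out[i + 1] += A[i + 1][j]  # body copy 1
    return out


if __name__ == "__main__":
    A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
    assert sum_rows_baseline(A) == sum_rows_unroll_jam(A)
    print(sum_rows_unroll_jam(A))
```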
{"title":"Oil and Water Can Mix: An Integration of Polyhedral and AST-Based Transformations","authors":"J. Shirako, L. Pouchet, Vivek Sarkar","doi":"10.1109/SC.2014.29","DOIUrl":"https://doi.org/10.1109/SC.2014.29","url":null,"abstract":"Optimizing compilers targeting modern multi-core machines require complex program restructuring to expose the best combinations of coarse- and fine-grain parallelism and data locality. The polyhedral compilation model has provided significant advancements in the seamless handling of compositions of loop transformations, thereby exposing multiple levels of parallelism and improving data reuse. However, it usually implements abstract optimization objectives, for example \"maximize data reuse\", which often does not deliver best performance, e.g., The complex loop structures generated can be detrimental to short-vector SIMD performance. In addition, several key transformations such as pipeline-parallelism and unroll-and-jam are difficult to express in the polyhedral framework. In this paper, we propose a novel optimization flow that combines polyhedral and syntactic/AST-based transformations. It generates high-performance code that contains regular loops which can be effectively vectorized, while still implementing sufficient parallelism and data reuse. It combines several transformation stages using both polyhedral and AST-based transformations, delivering performance improvements of up to 3× over the PoCC polyhedral compiler on Intel Nehalem and IBM Power7 multicore processors.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125930042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30
Real-Time Scalable Cortical Computing at 46 Giga-Synaptic OPS/Watt with ~100× Speedup in Time-to-Solution and ~100,000× Reduction in Energy-to-Solution
A. Cassidy, Rodrigo Alvarez-Icaza, Filipp Akopyan, J. Sawada, J. Arthur, P. Merolla, Pallab Datta, Marc González, B. Taba, Alexander Andreopoulos, A. Amir, Steven K. Esser, J. Kusnitz, R. Appuswamy, C. Haymes, B. Brezzo, R. Moussalli, Ralph Bellofatto, C. Baks, M. Mastro, K. Schleupen, C. E. Cox, K. Inoue, S. Millman, N. Imam, E. McQuinn, Yutaka Nakamura, I. Vo, Chen Guok, Don N. Nguyen, S. Lekuch, S. Asaad, D. Friedman, Bryan L. Jackson, M. Flickner, W. Risk, R. Manohar, D. Modha
Drawing on neuroscience, we have developed a parallel, event-driven kernel for neurosynaptic computation, that is efficient with respect to computation, memory, and communication. Building on the previously demonstrated highly optimized software expression of the kernel, here, we demonstrate True North, a co-designed silicon expression of the kernel. True North achieves five orders of magnitude reduction in energy to-solution and two orders of magnitude speedup in time-to solution, when running computer vision applications and complex recurrent neural network simulations. Breaking path with the von Neumann architecture, True North is a 4,096 core, 1 million neuron, and 256 million synapse brain-inspired neurosynaptic processor, that consumes 65mW of power running at real-time and delivers performance of 46 Giga-Synaptic OPS/Watt. We demonstrate seamless tiling of True North chips into arrays, forming a foundation for cortex-like scalability. True North's unprecedented time-to-solution, energy-to-solution, size, scalability, and performance combined with the underlying flexibility of the kernel enable a broad range of cognitive applications.
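A quick back-of-the-envelope check of the headline figures quoted in this abstract; the snippet below is plain arithmetic on the stated numbers, not data taken from the paper. 46 Giga-synaptic OPS/Watt at 65 mW implies roughly 3 billion synaptic operations per second, and 1 million neurons across 4,096 cores with 256 million synapses works out to roughly 256 neurons and 256 incoming synapses per neuron.

```python
# Arithmetic check of the headline numbers quoted in the abstract; the only
# inputs are the figures stated there (4,096 cores, ~1 million neurons,
# ~256 million synapses, 65 mW, 46 Giga-synaptic OPS/Watt).
cores = 4096
neurons = 1_000_000
synapses = 256_000_000
power_w = 65e-3
sops_per_watt = 46e9

print(f"neurons per core     ~ {neurons / cores:.0f}")        # ~244 (the 1M figure is rounded)
print(f"synapses per neuron  ~ {synapses / neurons:.0f}")     # ~256
print(f"synaptic ops / sec   ~ {sops_per_watt * power_w / 1e9:.2f} G")  # ~2.99 G at 65 mW
```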
{"title":"Real-Time Scalable Cortical Computing at 46 Giga-Synaptic OPS/Watt with ~100× Speedup in Time-to-Solution and ~100,000× Reduction in Energy-to-Solution","authors":"A. Cassidy, Rodrigo Alvarez-Icaza, Filipp Akopyan, J. Sawada, J. Arthur, P. Merolla, Pallab Datta, Marc González, B. Taba, Alexander Andreopoulos, A. Amir, Steven K. Esser, J. Kusnitz, R. Appuswamy, C. Haymes, B. Brezzo, R. Moussalli, Ralph Bellofatto, C. Baks, M. Mastro, K. Schleupen, C. E. Cox, K. Inoue, S. Millman, N. Imam, E. McQuinn, Yutaka Nakamura, I. Vo, Chen Guok, Don N. Nguyen, S. Lekuch, S. Asaad, D. Friedman, Bryan L. Jackson, M. Flickner, W. Risk, R. Manohar, D. Modha","doi":"10.1109/SC.2014.8","DOIUrl":"https://doi.org/10.1109/SC.2014.8","url":null,"abstract":"Drawing on neuroscience, we have developed a parallel, event-driven kernel for neurosynaptic computation, that is efficient with respect to computation, memory, and communication. Building on the previously demonstrated highly optimized software expression of the kernel, here, we demonstrate True North, a co-designed silicon expression of the kernel. True North achieves five orders of magnitude reduction in energy to-solution and two orders of magnitude speedup in time-to solution, when running computer vision applications and complex recurrent neural network simulations. Breaking path with the von Neumann architecture, True North is a 4,096 core, 1 million neuron, and 256 million synapse brain-inspired neurosynaptic processor, that consumes 65mW of power running at real-time and delivers performance of 46 Giga-Synaptic OPS/Watt. We demonstrate seamless tiling of True North chips into arrays, forming a foundation for cortex-like scalability. True North's unprecedented time-to-solution, energy-to-solution, size, scalability, and performance combined with the underlying flexibility of the kernel enable a broad range of cognitive applications.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127047732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 83
NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing
Zhengzhang Chen, S. Son, W. Hendrix, Ankit Agrawal, W. Liao, A. Choudhary
Data checkpointing is an important fault tolerance technique in High Performance Computing (HPC) systems. As HPC systems move towards exascale, the storage space and time costs of checkpointing threaten to overwhelm not only the simulation but also the post-simulation data analysis. One common practice to address this problem is to apply compression algorithms to reduce the data size. However, traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data in which high-precision data is used and hence common patterns are rare to find. This paper exploits the fact that in many scientific applications, the relative changes in data values from one simulation iteration to the next are not very significantly different from each other. Thus, capturing the distribution of relative changes in data instead of storing the data itself allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order of magnitude data reduction becomes achievable within guaranteed user-defined error bounds for each data point. We propose NUMARCK, Northwestern University Machine learning Algorithm for Resiliency and Checkpointing, which makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate a superior performance in terms of compression ratio and compression accuracy. More importantly, our algorithm allows users to specify the maximum tolerable error on a per-point basis, while compressing the data by an order of magnitude.
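The core idea described above, replacing each value with a small index into a learned set of relative-change bins while honoring a per-point error bound, can be sketched roughly as follows. This is an illustrative simplification (uniform histogram bins plus an exception list; the `encode`/`decode` names and all parameters are invented here), not the paper's actual implementation or its choice of binning method.

```python
# Illustrative sketch of the idea in the abstract (NOT the actual NUMARCK
# code): encode each value's relative change between two checkpoints as a
# small bin index, provided the bin-center reconstruction stays within a
# user-given relative error bound; values that do not fit are kept exactly.
import numpy as np

def encode(prev, curr, error_bound=0.01, n_bins=256):
    """Return (bin_centers, indices, exceptions) approximating curr from prev."""
    ratio = (curr - prev) / np.where(prev == 0, 1.0, prev)   # relative change
    edges = np.linspace(ratio.min(), ratio.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(ratio, edges) - 1, 0, n_bins - 1)
    approx = prev * (1.0 + centers[idx])
    # Keep exact values wherever the bin-center reconstruction breaks the bound.
    bad = np.abs(approx - curr) > error_bound * np.abs(np.where(curr == 0, 1.0, curr))
    exceptions = {int(i): float(curr[i]) for i in np.flatnonzero(bad)}
    return centers, idx.astype(np.uint8), exceptions

def decode(prev, centers, idx, exceptions):
    out = prev * (1.0 + centers[idx])
    for i, v in exceptions.items():
        out[i] = v
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prev = rng.random(10_000) + 1.0
    curr = prev * (1.0 + rng.normal(0, 0.005, prev.shape))   # small relative drift
    centers, idx, exc = encode(prev, curr, error_bound=0.01)
    rec = decode(prev, centers, idx, exc)
    assert np.all(np.abs(rec - curr) <= 0.01 * np.abs(curr) + 1e-12)
    print(f"stored {idx.nbytes} bytes of indices + {len(exc)} exceptions "
          f"vs {curr.nbytes} bytes raw")
```

In this synthetic case each 8-byte double is replaced by a 1-byte bin index plus a shared table of 256 bin centers, which is where the order-of-magnitude reduction claimed in the abstract comes from when the change distribution is well captured.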
{"title":"NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing","authors":"Zhengzhang Chen, S. Son, W. Hendrix, Ankit Agrawal, W. Liao, A. Choudhary","doi":"10.1109/SC.2014.65","DOIUrl":"https://doi.org/10.1109/SC.2014.65","url":null,"abstract":"Data check pointing is an important fault tolerance technique in High Performance Computing (HPC) systems. As the HPC systems move towards exascale, the storage space and time costs of check pointing threaten to overwhelm not only the simulation but also the post-simulation data analysis. One common practice to address this problem is to apply compression algorithms to reduce the data size. However, traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data in which high-precision data is used and hence common patterns are rare to find. This paper exploits the fact that in many scientific applications, the relative changes in data values from one simulation iteration to the next are not very significantly different from each other. Thus, capturing the distribution of relative changes in data instead of storing the data itself allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order of magnitude data reduction becomes achievable within guaranteed user-defined error bounds for each data point. We propose NUMARCK, North western University Machine learning Algorithm for Resiliency and Check pointing, that makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate a superior performance in terms of compression ratio and compression accuracy. More importantly, our algorithm allows users to specify the maximum tolerable error on a per point basis, while compressing the data by an order of magnitude.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127072425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 50