
Proceedings of the 2018 International Conference on Supercomputing: Latest Publications

Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs
Pub Date : 2018-06-12 DOI: 10.1145/3205289.3205296
Jinsung Kim, Aravind Sukumaran-Rajam, Changwan Hong, Ajay Panyala, Rohit Kumar Srivastava, S. Krishnamoorthy, P. Sadayappan
Tensor contractions are higher-dimensional analogs of matrix multiplications, used in many computational contexts such as high-order models in quantum chemistry, deep learning, and finite element methods. In contrast to the wide availability of high-performance libraries for matrix multiplication on GPUs, the same is not true for tensor contractions. In this paper, we address the optimization of a set of symmetrized tensor contractions that form the computational bottleneck in the CCSD(T) coupled-cluster method in computational chemistry suites like NWChem. Some of the challenges in optimizing tensor contractions, which arise in practice from the variety of dimensionalities and shapes of tensors, include effective mapping of the high-dimensional iteration space to threads, the choice of data buffering in shared memory and registers, and the tile sizes for multi-level tiling. Furthermore, in the case of the symmetrized tensor contractions in CCSD(T), it is also a challenge to fuse contractions to reduce data-movement cost by exploiting reuse of intermediate tensors. We develop an efficient GPU implementation of the tensor contractions in CCSD(T) using shared-memory buffering, register tiling, loop fusion, and register transpose. Experimental results demonstrate significant improvement over the current state of the art.
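As a rough, CPU-side illustration of the register-tiling idea named in the abstract (not the authors' GPU kernels), the sketch below accumulates a small output tile in registers before writing it back; the row-major layouts and tile sizes are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Minimal CPU sketch of register tiling.  After flattening external and
// contracted indices, a tensor contraction reduces to a GEMM-like loop nest
//   C(i,j) += sum_k A(i,k) * B(k,j).
// The paper's GPU kernels additionally use shared-memory buffering, loop
// fusion, and register transpose; none of that is shown here.
constexpr std::size_t TI = 4, TJ = 4;   // register-tile sizes (assumed)

void contract_tiled(const std::vector<double>& A,   // N x K, row-major
                    const std::vector<double>& B,   // K x N, row-major
                    std::vector<double>& C,         // N x N, row-major
                    std::size_t N, std::size_t K) {
  for (std::size_t i0 = 0; i0 < N; i0 += TI)
    for (std::size_t j0 = 0; j0 < N; j0 += TJ) {
      double acc[TI][TJ] = {};                      // accumulators kept in registers
      for (std::size_t k = 0; k < K; ++k)
        for (std::size_t i = 0; i < TI && i0 + i < N; ++i)
          for (std::size_t j = 0; j < TJ && j0 + j < N; ++j)
            acc[i][j] += A[(i0 + i) * K + k] * B[k * N + (j0 + j)];
      for (std::size_t i = 0; i < TI && i0 + i < N; ++i)  // write the tile back once
        for (std::size_t j = 0; j < TJ && j0 + j < N; ++j)
          C[(i0 + i) * N + (j0 + j)] += acc[i][j];
    }
}
```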
Citations: 12
A two-phase recovery mechanism
Pub Date : 2018-06-12 DOI: 10.1145/3205289.3205300
Zhaoxiang Jin, Soner Önder
Superscalar processors take advantage of speculative execution to improve performance. When the speculation turns out to be incorrect, a recovery procedure is initiated. The back-end of the processor cannot simply be flushed, because it holds a mixture of valid and invalid instructions. A basic solution is to wait for all valid instructions to retire and then purge the invalid instructions. However, if a long-latency operation, such as a last-level cache (LLC) miss, appears before the misspeculation point, the back-end recovery time increases significantly. Many proposed mechanisms selectively flush invalid instructions in order to speed up back-end recovery. In general, these mechanisms rely on broadcasting misprediction-related tags to remove the instructions from the back-end structures, such as the ROB, LSQ, and RS. The hardware overhead of these mechanisms is nontrivial and can potentially affect the processor clock cycle time if they are on the critical path. Moreover, a checkpointing mechanism or a walker needs to be added to accelerate the recovery of the front-end register alias table (F-RAT). We propose a two-phase recovery mechanism that does not need any walking or broadcasting process and can still match the performance of state-of-the-art recovery approaches. The first phase works similarly to a typical basic recovery mechanism, and the second phase is not triggered until the back-end is stalled by an LLC-miss load. In that case, the second phase treats the load as a misspeculation and recovers from this load. Since the LLC-miss response time is usually much longer than the time to fill the entire pipeline with new instructions, in most cases our mechanism can completely overlap the branch misprediction recovery penalty with the cache miss penalty.
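A toy decision function, written only to restate the mechanism's trigger condition from the abstract; the structures and fields below are simplifications for illustration, not the paper's hardware design.

```cpp
#include <deque>

// Toy model of the two-phase idea.  Phase 1 simply lets valid instructions
// drain, as in a basic recovery scheme.  Phase 2 kicks in only when the
// valid region of the ROB contains a load stalled on an LLC miss: that load
// itself is then treated as the recovery point, so the recovery cost hides
// behind the miss latency instead of waiting for the miss to return.
struct Inst { bool valid; bool llc_miss_load; };

enum class Recovery { None, Phase1, Phase2 };

Recovery choose_recovery(const std::deque<Inst>& rob, bool mispredicted) {
  if (!mispredicted) return Recovery::None;
  for (const Inst& in : rob) {
    if (!in.valid) break;                   // reached the squashed region
    if (in.llc_miss_load) return Recovery::Phase2;
  }
  return Recovery::Phase1;
}
```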
Citations: 0
Demystifying Cache Policies for Photo Stores at Scale: A Tencent Case Study
Pub Date : 2018-06-12 DOI: 10.1145/3205289.3205299
Ke Zhou, Si Sun, Hua Wang, Ping-Hsiu Huang, Xubin He, Rui Lan, Wenyan Li, Wenjie Liu, Tianming Yang
Photo service providers are facing the critical challenge of dealing with huge amounts of photo storage, typically on the order of billions of photos, while ensuring nation-wide or world-wide satisfactory user experiences. Distributed photo caching architectures are widely deployed to meet high performance expectations, where efficient yet still mysterious caching policies play essential roles. In this work, we present a comprehensive study of internet-scale photo caching algorithms in the case of QQPhoto from Tencent Inc., the largest social network service company in China. We unveil that even advanced cache algorithms perform only at a level similar to simple baseline algorithms, and that there still exists a large performance gap between these cache algorithms and the theoretically optimal algorithm, due to the complicated access behaviors in such a large multi-tenant environment. We then expound the reasons behind that phenomenon by extensively investigating the characteristics of QQPhoto workloads. Finally, in order to realistically further improve QQPhoto cache efficiency, we propose to incorporate a prefetcher in the cache stack, based on the observed immediacy feature that is unique to the QQPhoto workload. Evaluation results show that with appropriate prefetching we improve the cache hit ratio by up to 7.4%, while reducing the average access latency by 6.9%, at a marginal cost of 4.14% additional backend network traffic compared to the original system, which performs no prefetching.
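To make the "prefetcher in the cache stack" idea concrete, here is a minimal LRU cache with a pluggable prefetch predictor; the predictor interface is our assumption, and the immediacy-based policy QQPhoto actually uses is not modeled.

```cpp
#include <functional>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal LRU photo cache with a prefetch hook sitting in the get() path.
// The predictor decides which keys to pull in alongside the requested one;
// the real QQPhoto policy (which sizes/URLs, driven by access immediacy)
// is not modeled here.  Capacity must be at least 1.
class PhotoCache {
 public:
  PhotoCache(std::size_t capacity,
             std::function<std::vector<std::string>(const std::string&)> predict)
      : capacity_(capacity), predict_(std::move(predict)) {}

  bool get(const std::string& key) {
    bool hit = touch(key);
    if (!hit) insert(key);                        // fetch from backend on a miss
    for (const auto& k : predict_(key))           // prefetch likely follow-up requests
      if (!touch(k)) insert(k);
    return hit;
  }

 private:
  bool touch(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return false;
    lru_.splice(lru_.begin(), lru_, it->second);  // move to MRU position
    return true;
  }
  void insert(const std::string& key) {
    if (lru_.size() == capacity_) {               // evict the least recently used entry
      index_.erase(lru_.back());
      lru_.pop_back();
    }
    lru_.push_front(key);
    index_[key] = lru_.begin();
  }

  std::size_t capacity_;
  std::function<std::vector<std::string>(const std::string&)> predict_;
  std::list<std::string> lru_;
  std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```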
Citations: 20
ChplBlamer
Pub Date : 2018-06-12 DOI: 10.1145/3205289.3205314
Hui Zhang, Jeffrey K. Hollingsworth
Parallel programming is hard, and it is even harder to analyze parallel programs and identify specific performance bottlenecks. Chapel is an emerging Partitioned-Global-Address-Space (PGAS) language that provides productive parallel programming. Most established profilers either completely lack the capacity to profile Chapel programs or generate information that cannot provide insightful guidance in a user-level context. To address this issue, we developed ChplBlamer to pinpoint performance losses due to data distribution and remote data accesses. We use a data-centric and code-centric combined approach to help Chapel users quickly identify performance bottlenecks in the source. To demonstrate the utility of ChplBlamer, we studied three multi-locale Chapel benchmarks. For each benchmark, ChplBlamer found the causes of the performance losses. With the optimization guidance provided by ChplBlamer, we significantly improved the performance by up to 4x with little code modification.
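A minimal sketch of the data-centric plus code-centric bookkeeping that blame-style profiling implies, written in C++ purely as an illustration; ChplBlamer's actual Chapel runtime instrumentation and blame-propagation rules are not reproduced here.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Toy illustration of combining the data-centric and code-centric views:
// each sampled remote access is charged both to the data object (e.g. a
// distributed array) and to the source location that issued it, so hot
// objects and hot statements can be ranked side by side.
struct BlameTable {
  std::map<std::string, std::uint64_t> by_object;    // data-centric view
  std::map<std::string, std::uint64_t> by_location;  // code-centric view

  void charge(const std::string& object, const std::string& src_loc,
              std::uint64_t remote_bytes) {
    by_object[object] += remote_bytes;
    by_location[src_loc] += remote_bytes;
  }
};
```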
Citations: 2
The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
Pub Date : 2018-06-12 DOI: 10.1145/3205289.3205291
B. Kerbl, Michael Kenzel, J. H. Mueller, D. Schmalstieg, M. Steinberger
Harnessing the power of massively parallel devices like the graphics processing unit (GPU) is difficult for algorithms that show dynamic or inhomogeneous workloads. To achieve high performance, such advanced algorithms require scalable, concurrent queues to collect and distribute work. We show that previous queuing approaches are unfit for this task, as they either (1) do not work well in a massively parallel environment, or (2) obstruct the use of individual threads on top of single-instruction-multiple-data (SIMD) cores, or (3) block during access, thus prohibiting multi-queue setups. With these issues in mind, we present the Broker Queue, a highly efficient, fully linearizable FIFO queue for fine-granular parallel work distribution on the GPU. We evaluate its performance and usability on modern GPU models against a wide range of existing algorithms. The Broker Queue is up to three orders of magnitude faster than nonblocking queues and can even outperform significantly simpler techniques that lack desired properties for fine-granular work distribution.
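For intuition about what a linearizable, bounded FIFO for work distribution looks like, here is a CPU-side sketch using per-slot sequence numbers (a classic ticket-based MPMC ring buffer); it is not the Broker Queue algorithm itself and ignores the GPU/SIMD concerns the paper addresses.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

// Bounded multi-producer multi-consumer FIFO: each slot carries a sequence
// number that tells whether it is free for the next enqueue ticket or holds
// a value for the next dequeue ticket.  Capacity must be a power of two.
template <typename T>
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t capacity)
      : slots_(capacity), mask_(capacity - 1) {
    for (std::size_t i = 0; i < capacity; ++i)
      slots_[i].seq.store(i, std::memory_order_relaxed);
  }

  bool try_enqueue(T value) {
    std::size_t pos = tail_.load(std::memory_order_relaxed);
    for (;;) {
      Slot& s = slots_[pos & mask_];
      std::size_t seq = s.seq.load(std::memory_order_acquire);
      if (seq == pos) {                             // slot is free for this ticket
        if (tail_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
          s.value = std::move(value);
          s.seq.store(pos + 1, std::memory_order_release);
          return true;
        }
      } else if (seq < pos) {
        return false;                               // queue is full
      } else {
        pos = tail_.load(std::memory_order_relaxed);
      }
    }
  }

  std::optional<T> try_dequeue() {
    std::size_t pos = head_.load(std::memory_order_relaxed);
    for (;;) {
      Slot& s = slots_[pos & mask_];
      std::size_t seq = s.seq.load(std::memory_order_acquire);
      if (seq == pos + 1) {                         // slot holds a value for this ticket
        if (head_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
          T out = std::move(s.value);
          s.seq.store(pos + mask_ + 1, std::memory_order_release);
          return out;
        }
      } else if (seq < pos + 1) {
        return std::nullopt;                        // queue is empty
      } else {
        pos = head_.load(std::memory_order_relaxed);
      }
    }
  }

 private:
  struct Slot { std::atomic<std::size_t> seq; T value; };
  std::vector<Slot> slots_;
  std::size_t mask_;
  std::atomic<std::size_t> head_{0}, tail_{0};
};
```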
Citations: 15
Runtime-Guided Management of Stacked DRAM Memories in Task Parallel Programs
Pub Date : 2018-06-12 DOI: 10.1145/3205289.3205312
Lluc Alvarez, Marc Casas, Jesús Labarta, E. Ayguadé, M. Valero, Miquel Moretó
Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited capacity is insufficient for modern HPC systems. For this reason, both stacked DRAM and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the stacked DRAM in the system. This paper proposes a runtime approach to transparently manage stacked DRAM memories in task-based programming models. In this approach the runtime system is in charge of copying the data accessed by the tasks to the stacked DRAM, without any complex hardware support or modifications to the application code. To mitigate the cost of copying data between the stacked DRAM and the off-chip memory, the proposal includes an optimization that parallelizes the copies across idle or additional helper threads. In addition, the runtime system is aware of the reuse pattern of the data accessed by the tasks, and can exploit this information to avoid copies to the stacked DRAM that would not pay off. Results on the Intel Knights Landing processor show that the proposed techniques achieve an average speedup of 14% over the state-of-the-art library for managing the stacked DRAM, and of 29% over a stacked DRAM architected as a hardware cache.
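A minimal sketch of the staging copy the runtime performs, with the copy split across helper threads as the abstract describes; how the near buffer is actually allocated in stacked DRAM (for example via a high-bandwidth-memory allocator on Knights Landing) is left out and assumed to happen elsewhere.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Copy the data a task is about to touch from "far" (off-chip) memory into a
// "near" (stacked-DRAM) buffer, splitting the copy across helper threads so
// the staging cost is amortized.  The task then runs on near_dst.
void staged_copy(const char* far_src, char* near_dst, std::size_t bytes,
                 unsigned helpers) {
  unsigned n = std::max(1u, helpers);
  std::size_t chunk = (bytes + n - 1) / n;
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < n; ++t) {
    std::size_t begin = std::min<std::size_t>(t * chunk, bytes);
    std::size_t end   = std::min<std::size_t>(begin + chunk, bytes);
    if (begin == end) break;
    pool.emplace_back([=] {
      std::memcpy(near_dst + begin, far_src + begin, end - begin);
    });
  }
  for (auto& th : pool) th.join();   // staging complete; schedule the task
}
```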
Citations: 16
Proceedings of the 2018 International Conference on Supercomputing
Pub Date : 2018-06-12 DOI: 10.1145/3205289
Citations: 0
Phase-Aware Web Browser Power Management on HMP Platforms
Pub Date : 2018-06-12 DOI: 10.1145/3205289.3205293
N. Peters, Sangyoung Park, Daniel Clifford, S. Kyostila, R. McIlroy, B. Meurer, H. Payer, S. Chakraborty
Over the last few years, web browsing has been steadily shifting from desktop computers to mobile devices like smartphones and tablets. However, the mobile browsers available today have mainly focused on performance rather than power consumption, although the battery life of a mobile device is one of the most important usability metrics. This is because many of these browsers originated in the desktop domain and have been ported to the mobile domain. Such browsers have multiple power-hungry components, such as the rendering engine and the JavaScript engine, and generate high workloads without considering the capabilities and the power consumption characteristics of the underlying hardware platform. Also, the lack of coordination between a browser application and the power manager in the operating system (such as Android) results in poor power savings. In this paper, we propose a power manager that takes into account the internal state of a browser, which we refer to as a phase, and show with Google's Chrome running on Android that up to 57.4% more energy can be saved over Android's default power managers. We implemented and evaluated our technique on a heterogeneous multiprocessing (HMP) ARM big.LITTLE platform such as those found in most modern smartphones.
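A toy phase-to-frequency mapping, shown only to illustrate what acting on a browser phase could look like on Linux cpufreq; the phase names, frequency values, sysfs path, and the choice of cpu4 as the big cluster are all assumptions, and the paper's manager integrates with Chrome and Android rather than writing raw sysfs entries.

```cpp
#include <fstream>

// The browser reports its current phase (e.g. page loading vs. idle) and the
// manager caps the big-cluster frequency accordingly.  Values and paths below
// are illustrative placeholders.
enum class Phase { Loading, Scrolling, Idle };

void apply_phase(Phase p) {
  long khz = 0;
  switch (p) {
    case Phase::Loading:   khz = 2'000'000; break;  // full speed for page load
    case Phase::Scrolling: khz = 1'200'000; break;  // enough for smooth frames
    case Phase::Idle:      khz =   600'000; break;  // save energy between interactions
  }
  // Assumed big-cluster cpufreq node on the target board.
  std::ofstream f("/sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq");
  f << khz;
}
```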
Citations: 9
Revisiting Loop Tiling for Datacenters: Live and Let Live
Pub Date : 2018-06-12 DOI: 10.1145/3205289.3205306
Jiacheng Zhao, Huimin Cui, Yalin Zhang, Jingling Xue, Xiaobing Feng
As DNNs gain popularity in modern datacenters, it becomes imperative to revisit compiler optimizations for DNNs in a colocation scenario. Loop tiling turns out to be the most significant compiler optimization, since DNNs typically apply a series of matrix computations iteratively to a massive amount of data. We introduce a reuse-pattern-centric approach to obtaining a peer-aware TSS (Tile Size Selection) model for a matrix-based application A. Our key insight is that the co-running cache behavior of A (once tiled) can be determined by its data reuse patterns, together with the cache pressure exerted by its co-running peers, without actually needing to analyze the code of its co-runners. Compared with static tiling (which determines a tile size for A statically, without considering its co-running peers), our peer-aware tiling enables compilers to generate either faster peer-aware efficient code for A (by optimizing the performance of A) or faster peer-aware nice code for A (by optimizing the performance of its co-runners). In addition, our peer-aware tiling also enables library developers to improve the performance of library routines (more effectively than static tiling).
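A back-of-the-envelope tile-size selection in the spirit of peer-aware TSS: the working set kept hot must fit in the share of the last-level cache left over after the co-runners' pressure is subtracted. The real model is built from the application's reuse patterns; the square-tile and bytes-per-element assumptions below are ours, not the paper's.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Pick a square tile size T such that three T x T tiles (A, B, C) fit in the
// cache budget remaining after subtracting the peers' estimated pressure:
//   3 * T^2 * elem_bytes <= llc_bytes - peer_pressure_bytes.
std::size_t pick_tile(std::size_t llc_bytes, std::size_t peer_pressure_bytes,
                      std::size_t elem_bytes = 8) {
  std::size_t budget = llc_bytes > peer_pressure_bytes
                           ? llc_bytes - peer_pressure_bytes
                           : llc_bytes / 8;                // floor if peers are heavy
  auto t = static_cast<std::size_t>(std::sqrt(budget / (3.0 * elem_bytes)));
  return std::max<std::size_t>(16, t & ~std::size_t(7));   // round down to a multiple of 8
}
```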
Citations: 7
Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs
Pub Date : 2018-06-12 DOI: 10.1145/3205289.3205324
Jacob Lambert, Seyong Lee, Jungwon Kim, J. Vetter, A. Malony
Reconfigurable architectures like Field Programmable Gate Arrays (FPGAs) have been used for accelerating computations from several domains because of their unique combination of flexibility, performance, and power efficiency. However, FPGAs have not been widely used for high-performance computing, primarily because of their programming complexity and difficulties in optimizing performance. In this paper, we present a directive-based, high-level optimization framework for high-performance computing with FPGAs, built on top of an OpenACC-to-FPGA translation framework called OpenARC. We propose directive extensions and corresponding compile-time optimization techniques to enable the compiler to generate more efficient FPGA hardware configuration files. Empirical evaluation of the proposed framework on an Intel Stratix V with five OpenACC benchmarks from various application domains shows that FPGA-specific optimizations can lead to significant increases in performance across all tested applications. We also demonstrate that applying these high-level directive-based optimizations can allow OpenACC applications to perform similarly to lower-level OpenCL applications with hand-written FPGA-specific optimizations, and offer runtime and power performance benefits compared to CPUs and GPUs.
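For readers unfamiliar with the directive-based style the framework builds on, here is a plain OpenACC-annotated loop of the kind an OpenACC-to-FPGA flow would consume; the pragmas shown are standard OpenACC, while the paper's FPGA-specific directive extensions are not reproduced because their exact spelling is not given in the abstract.

```cpp
// A simple OpenACC kernel: the compiler offloads the annotated loop and
// manages the data movement declared in the clauses.  An OpenACC-to-FPGA
// translator would lower this to a hardware configuration instead of a
// GPU kernel.
void saxpy(int n, float a, const float* x, float* y) {
  #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```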
Citations: 12