
Latest publications: Proceedings of the ACM International Conference on Computing Frontiers

Libra: an automated code generation and tuning framework for register-limited stencils on GPUs
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903158
Mengyao Jin, H. Fu, Zihong Lv, Guangwen Yang
Stencils account for a significant part of many scientific computing applications. Besides simple stencils that can be completed with a few arithmetic operations, there are also many register-limited stencils with hundreds or thousands of variables and operations. The massive register usage required by these stencils largely limits the parallelism of programs on current many-core architectures, and consequently degrades overall performance. Based on register usage, the major constraining factor for most register-limited stencils, we propose a DDG (data-dependency-graph) oriented code transformation approach to improve the performance of these stencils. This approach analyzes, reorders and transforms the original program on GPUs, and further explores the best tradeoff between the amount of computation and the degree of parallelism. Based on our graph-oriented code transformation approach, we further design and implement an automated code generation and tuning framework called Libra, to improve productivity and performance simultaneously. We apply Libra to 5 widely used stencils, and experimental results show that these stencils achieve speedups of 1.12x to 2.16x over the original fairly well-optimized implementations.
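For readers unfamiliar with the kernel class the abstract targets, a minimal CPU-side sketch of a plain 5-point stencil is shown below (illustrative only, not Libra's generated code); register-limited stencils extend this pattern to hundreds or thousands of inputs per output point, which is what creates the recomputation-versus-parallelism tradeoff the paper explores.

```python
def stencil_5pt(grid):
    """Apply a 5-point averaging stencil to the interior of a 2D grid."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]          # boundary points are left unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # each output needs 5 inputs; register-limited stencils need hundreds
            out[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1])
    return out

grid = [[float(i + j) for j in range(4)] for i in range(4)]
result = stencil_5pt(grid)
```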
Citations: 5
Mitigating sync overhead in single-level store systems
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903161
Yuanchao Xu, Hu Wan, Zeyi Hou, Keni Qiu
Emerging non-volatile memory technologies offer the durability of disk and the byte-addressability of DRAM, which makes it feasible to build single-level store systems. However, due to the extremely low latency of persistent writes to non-volatile memory, the software stack accounts for the majority of the overall performance overhead, part of which comes from crash-consistency guarantees. For persistent data structures to survive power failures or system crashes, measures such as write-ahead logging or copy-on-write, along with frequent cacheline flushes, must be taken to ensure the consistency of durable data, thereby incurring non-trivial sync overhead. In this paper, we propose two techniques to mitigate this sync overhead. First, we leverage write-optimized non-volatile memory to store log entries on chip instead of off chip, thereby eliminating sync overhead. Second, we present an adaptive caching-mode policy based on data access patterns to eliminate unnecessary sync overhead. Evaluation results indicate that the two techniques improve overall performance by 5.88x to 6.77x compared to conventional transactional persistent memory.
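To make the source of the sync overhead concrete, here is a hedged sketch of write-ahead (undo) logging, one of the consistency measures the abstract names. All names are illustrative; on real NVM, each point marked flush() would be a cacheline flush plus fence, which is exactly the cost the paper attacks.

```python
class UndoLogStore:
    """Toy transactional store using an undo log for crash consistency."""

    def __init__(self):
        self.data = {}
        self.log = []          # undo log: (key, old_value) pairs

    def flush(self):
        # stand-in for clflush + fence; this is where sync overhead arises
        pass

    def tx_write(self, key, value):
        self.log.append((key, self.data.get(key)))  # persist old value first...
        self.flush()                                # ...before the log entry counts
        self.data[key] = value                      # only then update in place
        self.flush()

    def commit(self):
        self.log.clear()       # a committed transaction discards its undo log

    def abort(self):
        # crash recovery / abort: roll back logged writes in reverse order
        for key, old in reversed(self.log):
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.log.clear()

store = UndoLogStore()
store.tx_write("balance", 100)
store.commit()
store.tx_write("balance", 50)
store.abort()                  # simulated crash: the old value is restored
```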
Citations: 0
Shared resource aware scheduling on power-constrained tiled many-core processors
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903490
S. S. Jha, W. Heirman, Ayose Falcón, Jordi Tubella, Antonio González, L. Eeckhout
Power management through dynamic core, cache and frequency adaptation is becoming a necessity in today's power-constrained many-core environments. Unfortunately, as core counts grow, the complexity of both the adaptation hardware and the power management algorithms increases. In this paper, we propose a two-tier hierarchical power management methodology to exploit per-tile voltage regulators and clustered last-level caches. In addition, we include a novel thread migration layer that (i) analyzes threads running on the tiled many-core processor for shared-resource sensitivity in tandem with core, cache and frequency adaptation, and (ii) co-schedules threads with compatible behavior on the same tile.
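The co-scheduling idea in point (ii) can be sketched in a few lines. This is a hypothetical illustration, not the paper's algorithm: pair a cache-sensitive thread with a cache-insensitive one on each tile, so that no tile's shared last-level cache is oversubscribed by two sensitive threads.

```python
def coschedule(threads):
    """threads: list of (name, cache_sensitivity) with sensitivity in [0, 1].

    Returns tiles (pairs) mixing the least and most cache-sensitive threads.
    """
    ranked = sorted(threads, key=lambda t: t[1])   # ascending sensitivity
    tiles = []
    while ranked:
        tile = [ranked.pop(0)]       # least sensitive remaining thread
        if ranked:
            tile.append(ranked.pop())  # paired with the most sensitive one
        tiles.append(tile)
    return tiles

threads = [("A", 0.9), ("B", 0.1), ("C", 0.8), ("D", 0.2)]
tiles = coschedule(threads)
```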
Citations: 11
Lock-based synchronization for GPU architectures
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903155
Yunlong Xu, Lan Gao, Rui Wang, Zhongzhi Luan, Weiguo Wu, D. Qian
Modern GPUs have shown promising results in accelerating compute-intensive and numerical workloads with limited data sharing. However, emerging GPU applications manifest ample data sharing among concurrently executing threads, and data sharing often requires a mutual exclusion mechanism to ensure data integrity in a multithreaded environment. Although modern GPUs provide atomic primitives that can be leveraged to construct fine-grained locks, existing GPU lock implementations either incur frequent concurrency bugs or lead to extremely low hardware utilization due to the Single Instruction Multiple Threads (SIMT) execution paradigm of GPUs. To let more applications with data sharing benefit from GPU acceleration, we propose a new locking scheme for GPU architectures. The proposed scheme allows lock stealing within individual warps to avoid the concurrency bugs caused by the SIMT execution of GPUs. Moreover, it adopts lock virtualization to reduce the memory cost of fine-grained GPU locks. To illustrate the usage and benefit of GPU locks, we apply the proposed locking scheme to Delaunay mesh refinement (DMR), an application involving massive data sharing among threads. Our lock-based implementation achieves a 1.22x speedup over an implementation based on algorithmic optimization (which uses a synchronization mechanism tailored for DMR) with 94% less memory cost.
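Lock virtualization, as described here, maps a large logical lock space onto a small physical lock table, trading memory for occasional false conflicts. A minimal host-side sketch of that mapping (names hypothetical; the warp-level lock-stealing part is GPU-specific and omitted):

```python
class VirtualLockTable:
    """Many logical locks backed by a small table of physical lock slots."""

    def __init__(self, physical_locks=64):
        self.locks = [False] * physical_locks   # False = free, True = held

    def _slot(self, logical_id):
        # two logical locks hashing to the same slot conflict (a false conflict)
        return hash(logical_id) % len(self.locks)

    def try_acquire(self, logical_id):
        slot = self._slot(logical_id)
        if self.locks[slot]:
            return False
        self.locks[slot] = True               # on a GPU this would be atomicCAS
        return True

    def release(self, logical_id):
        self.locks[self._slot(logical_id)] = False

table = VirtualLockTable(physical_locks=8)
ok1 = table.try_acquire("mesh_node_12345")    # acquires a physical slot
ok2 = table.try_acquire("mesh_node_12345")    # same logical lock: must fail
table.release("mesh_node_12345")
ok3 = table.try_acquire("mesh_node_12345")    # free again after release
```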
Citations: 24
Malevolent app pairs: an Android permission overpassing scheme
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2911706
Antonios Dimitriadis, P. Efraimidis, Vasilios Katos
Portable smart devices potentially store a wealth of personal data, making them attractive targets for data exfiltration attacks. Permission-based schemes are core security controls for reducing privacy and security risks. In this paper we demonstrate that current permission schemes cannot effectively mitigate the risks posed by covert channels. We show that a pair of apps with different permission settings may collude to effectively obtain the union of their permissions, creating opportunities to leak sensitive data while keeping the leak potentially unnoticed. We then propose a solution to such attacks.
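A toy illustration of the permission-union attack described above (all names hypothetical, and the channel is an abstract stand-in for any covert channel such as shared storage or intents): app A holds only a contacts-read permission, app B holds only network access, yet together they exfiltrate contacts.

```python
class Channel:
    """Stand-in for a covert channel between two colluding apps."""
    def __init__(self):
        self.buffer = None

def app_a(contacts, channel):
    # app A: permitted to read contacts, NOT permitted to use the network
    channel.buffer = ",".join(contacts)      # leak into the covert channel

def app_b(channel, network_log):
    # app B: permitted to use the network, NOT permitted to read contacts
    if channel.buffer is not None:
        network_log.append(channel.buffer)   # "send" the leaked data out

shared = Channel()
exfiltrated = []                             # models B's outbound traffic
app_a(["alice", "bob"], shared)
app_b(shared, exfiltrated)
```

Neither app alone violates its own permission set; the union of the two is what makes the exfiltration possible.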
Citations: 9
Power and clock gating modelling in coarse grained reconfigurable systems
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2911713
Tiziana Fanni, Carlo Sau, P. Meloni, L. Raffo, F. Palumbo
Power reduction is one of the biggest challenges in modern systems and tends to become a severe issue in complex scenarios. To provide high performance and flexibility, designers often opt for coarse-grained reconfigurable (CGR) systems. Nevertheless, these systems require specific attention to the power problem, since a large set of resources may be underutilized while computing a certain task. This paper focuses on this issue. Targeting CGR devices, we propose a way to model power and clock gating costs in advance, on the basis of the functional, technological and architectural parameters of the baseline CGR system. The proposed flow guides designers towards optimal implementations, saving designer effort and time.
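The kind of break-even question such a modelling flow evaluates can be sketched as follows (parameter names and values are illustrative, not the paper's model): gating an idle region pays off only when the leakage energy saved during the idle window exceeds the energy spent switching the region off and on.

```python
def gating_saves_energy(idle_cycles, leakage_per_cycle, gating_overhead):
    """Return True if gating an idle region is a net energy win."""
    saved = idle_cycles * leakage_per_cycle   # leakage avoided while gated
    return saved > gating_overhead            # vs. cost of the off/on transition

# short idle window: transition overhead dominates, keep the region ungated
short_idle = gating_saves_energy(idle_cycles=10, leakage_per_cycle=0.5,
                                 gating_overhead=20.0)
# long idle window: leakage savings dominate, gate the region
long_idle = gating_saves_energy(idle_cycles=1000, leakage_per_cycle=0.5,
                                gating_overhead=20.0)
```

Estimating such thresholds at design time, per region, is what lets a flow decide where gating logic is worth inserting at all.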
Citations: 14
Energy reduction in video systems: the GreenVideo project
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2911716
M. Pelcat, Erwan Nogues, X. Ducloux
With the current progress in microelectronics and the constant increase of network bandwidth, video applications are becoming ubiquitous, especially in the context of mobility. By 2019, 80% of worldwide Internet traffic will be video. Nevertheless, optimizing the energy consumption of video processing is still a challenge due to the large amount of data processed. This talk will concentrate on the energy optimization of video codecs. In the first part, the Green Metadata initiative will be presented. In November 2014, MPEG released a new standard, named Green Metadata, that fosters energy-efficient media on consumer devices. This standard specifies metadata to be transmitted between encoder and decoder for reducing power consumption during encoding, decoding and display. The different metadata considered in the standard will be presented; more specifically, the Green Adaptive Streaming proposition will be detailed. In the second part, the energy optimization of an HEVC decoder implemented on a modern MP-SoC will be presented, detailing the techniques used to implement an HEVC decoder efficiently on a general-purpose processor (GPP). Different levels of parallelism are exploited to increase and exploit slack time, and a sophisticated DVFS mechanism has been developed to handle the variability of the decoding process across frames. To obtain further energy gains, the concept of approximate computing is exploited to propose a modified HEVC decoder capable of tuning its energy savings while managing the decoding-quality-versus-energy trade-off. The work detailed in this second part of the talk is the result of the French GreenVideo FUI project.
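The per-frame DVFS idea can be sketched in a few lines (a simplified illustration, not the talk's mechanism; all parameters hypothetical): pick the lowest frequency that still decodes each frame before its display deadline, since power grows with frequency and, through voltage scaling, roughly quadratically with voltage.

```python
def pick_frequency(frame_cycles, deadline_s, freqs_hz):
    """Return the lowest frequency meeting the frame deadline, else the highest."""
    for f in sorted(freqs_hz):
        if frame_cycles / f <= deadline_s:   # decode time at frequency f
            return f
    return max(freqs_hz)                     # deadline miss: run flat out

freqs = [400e6, 800e6, 1200e6]               # hypothetical DVFS operating points
# at 30 fps each frame must decode within 1/30 s
easy_frame = pick_frequency(frame_cycles=10e6, deadline_s=1 / 30, freqs_hz=freqs)
hard_frame = pick_frequency(frame_cycles=35e6, deadline_s=1 / 30, freqs_hz=freqs)
```

The variability the talk mentions is visible here: a cheap frame can run at the lowest operating point while a complex frame forces the highest one.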
Citations: 1
A non von Neumann continuum computer architecture for scalability beyond Moore's law
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903486
M. Brodowicz, T. Sterling
A strategic challenge confronting the continued advance of high performance computing (HPC) to extreme scale is the approach of near-nanoscale semiconductor technology and the end of Moore's Law. This paper introduces the foundations of an innovative class of parallel architecture that reverses many conventional architectural directions while benefiting from substantial prior art of previous decades. The Continuum Computer Architecture, or CCA, eschews traditional von Neumann-derived processing logic, instead employing structures composed of fine-grain cells (fontons) that combine functional units, memory, and network. The paper describes how CCA systems of various scales may be organized and implemented using currently available technology. As programming such systems differs substantially from established practices, the still-experimental ParalleX execution model is introduced as a guide for the implementation of related software stack layers, ranging from the operating system to application-level constructs. Finally, the HPX-5 runtime system, an advanced implementation of the ParalleX core components, is presented as an intermediate software methodology for CCA system computation resource management.
Citations: 3
Optimizing sparse matrix computations through compiler-assisted programming
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2903157
K. Rietveld, H. Wijshoff
Existing high-performance implementations of sparse matrix codes are intricate and result in large code bases. In fact, a single floating-point operation requires 400 to 600 lines of additional code to "prepare" that operation. This imbalance severely obscures code development, complicating maintenance and portability. In this paper, we propose a drastically different approach in order to continue to handle these codes effectively: specifying only the essence of the computation at the level of individual matrix elements. All additional source code to embed these computations is then generated and optimized automatically by the compiler. This approach is far superior to existing library approaches and allows the code for scatter/gather operations, matrix reordering, matrix data structure handling, fill-in handling, etc., to be generated automatically. Experiments show that very efficient data structures can be generated and that the resulting codes can be very competitive.
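The contrast the abstract draws can be made concrete with a standard CSR sparse matrix-vector product (a generic sketch, not the paper's compiler output): the programmer's "essence" is the single line y[i] += A[i][j] * x[j]; the index arrays and loop structure around it are the kind of data-structure code that would be generated automatically.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product over a CSR-encoded matrix."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]   # the essence: y[i] += A[i][j]*x[j]
    return y

# CSR encoding of the 2x2 matrix [[2, 0], [1, 3]]
values, col_idx, row_ptr = [2.0, 1.0, 3.0], [0, 0, 1], [0, 1, 3]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0])
```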
Cited by: 5
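The core idea of the abstract above — the programmer writes only the per-element computation, and a compiler pass generates the sparse data structure handling around it — can be sketched as follows. The function names and the CSR layout here are illustrative assumptions for exposition, not the authors' actual interface.

```python
# Sketch of compiler-assisted sparse matrix programming (assumed names,
# not the paper's real API): the user supplies only the per-element
# kernel; the data-structure construction and traversal stand in for
# what the compiler would generate automatically.

def element_kernel(y, x, i, j, a_ij):
    """The 'essence' of SpMV, specified per matrix element."""
    y[i] += a_ij * x[j]

def to_csr(dense):
    """Stand-in for generated code: build a CSR structure,
    skipping zeros so no storage or work is wasted on them."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0.0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def spmv_generated(values, col_idx, row_ptr, x):
    """Stand-in for generated code: CSR traversal that invokes
    the user-written per-element kernel on each nonzero."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            element_kernel(y, x, i, col_idx[k], values[k])
    return y

A = [[2.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [4.0, 0.0, 5.0]]
x = [1.0, 2.0, 3.0]
print(spmv_generated(*to_csr(A), x))  # [5.0, 6.0, 19.0]
```

The point of the separation is that `element_kernel` is the only piece the programmer writes; everything else (format choice, reordering, fill-in handling) is the compiler's to generate and optimize.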
Big data analytics and the LHC
Pub Date : 2016-05-16 DOI: 10.1145/2903150.2917755
M. Girone
The Large Hadron Collider is one of the largest and most complicated pieces of scientific apparatus ever constructed. The detectors along the LHC ring see as many as 800 million proton-proton collisions per second. Only about one event in 10^11 is new physics, and a hierarchical series of steps is needed to extract this tiny signal from an enormous background. High energy physics (HEP) has long been a driver in managing and processing enormous scientific datasets and running the largest-scale high-throughput computing centers. HEP developed one of the first scientific computing grids, which now regularly operates 500k processor cores and half an exabyte of disk storage located on 5 continents across hundreds of connected facilities. In this presentation I will discuss the techniques used to extract scientific discovery from a large and complicated dataset. While HEP has developed many tools and techniques for handling big datasets, there is an increasing desire within the field to make more effective use of additional industry developments. I will discuss some of the ongoing work to adopt industry techniques in big data analytics to improve the discovery potential of the LHC and the effectiveness of the scientists who work on it.
{"title":"Big data analytics and the LHC","authors":"M. Girone","doi":"10.1145/2903150.2917755","DOIUrl":"https://doi.org/10.1145/2903150.2917755","url":null,"abstract":"The Large Hadron Collider is one of the largest and most complicated pieces of scientific apparatus ever constructed. The detectors along the LHC ring see as many as 800 million proton-proton collisions per second. An event in 10 to the 11th power is new physics and there is a hierarchical series of steps to extract a tiny signal from an enormous background. High energy physics (HEP) has long been a driver in managing and processing enormous scientific datasets and the largest scale high throughput computing centers. HEP developed one of the first scientific computing grids that now regularly operates 500k processor cores and half of an exabyte of disk storage located on 5 continents including hundred of connected facilities. In this presentation I will discuss the techniques used to extract scientific discovery from a large and complicated dataset. While HEP has developed many tools and techniques for handling big datasets, there is an increasing desire within the field to make more effective use of additional industry developments. I will discuss some of the ongoing work to adopt industry techniques in big data analytics to improve the discovery potential of the LHC and the effectiveness of the scientists who work on it.","PeriodicalId":226569,"journal":{"name":"Proceedings of the ACM International Conference on Computing Frontiers","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114168928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 1
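The hierarchical signal extraction described in the abstract above — successive selection stages cutting an enormous event rate down to a tiny candidate sample — can be illustrated with a toy multi-stage filter. The stage names, event model, and rejection thresholds below are illustrative assumptions, not the LHC's actual trigger configuration.

```python
import random

# Toy model of a hierarchical trigger chain: each stage keeps only
# events passing a selection, so the surviving rate drops sharply per
# stage. Thresholds and stage names are illustrative, not real LHC ones.

def run_trigger(events, stages):
    """Apply selection stages in order; return surviving events."""
    for name, predicate in stages:
        events = [e for e in events if predicate(e)]
    return events

random.seed(0)
# Each toy "event" is reduced to a single energy-like score in [0, 1).
events = [random.random() for _ in range(100_000)]

stages = [
    ("level-1 hardware",   lambda e: e > 0.90),  # keep roughly 10%
    ("high-level trigger", lambda e: e > 0.99),  # keep roughly 1% overall
]

selected = run_trigger(events, stages)
print(len(events), "->", len(selected))
```

The real pipeline works the same way in spirit: early stages are cheap and fast so they can run at the full collision rate, while later stages apply increasingly expensive reconstruction to an ever-smaller sample.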
Proceedings of the ACM International Conference on Computing Frontiers