
ACM International Conference on Computing Frontiers: latest publications

Software-defined massive multicore networking via freespace optical interconnect
Pub Date : 2013-05-14 DOI: 10.1145/2482767.2482802
Y. Katayama, A. Okazaki, N. Ohba
This paper presents a new frontier where future computer systems can continue to evolve as CMOS technology reaches its fundamental performance and density scaling limits. Our idea adopts freespace circuit-switched optical interconnect in massive multicore networking on chips and modules to flexibly configure private cache-coherent networks for allocated groups of cores in a software-defined manner. The proposed scheme can avoid networking inefficiencies due to core resource fragmentation by providing deterministically lower latencies and higher bandwidth, while advancing the technology roadmap with lower power consumption and improved cooling. We also discuss the implementation plan and challenges for our proposal.
Citations: 5
System integration of tightly-coupled processor arrays using reconfigurable buffer structures
Pub Date : 2013-05-14 DOI: 10.1145/2482767.2482770
Frank Hannig, Moritz Schmid, Vahid Lari, Srinivas Boppu, J. Teich
As data locality is a key factor for the acceleration of loop programs on processor arrays, we propose a buffer architecture that can be configured at run-time to select between different schemes for memory access. In addition to traditional address-based memory banks, the buffer architecture can deliver data in a streaming manner to the processing elements of the array, which supports dense and sparse stencil operations. Moreover, to minimize data transfers to the buffers, the design contains an interlinked mode, which is especially targeted at 2-D kernel computations. The buffers can be used individually to achieve high data throughput by utilizing a maximum number of I/O channels to the array, or concatenated to provide higher storage capacity at a reduced amount of I/O channels.
Citations: 15
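The streaming mode described in the abstract above delivers rows of data to the processing elements without storing a whole image, which is the classic line-buffer pattern for 2-D stencils. A minimal Python sketch of that pattern, offered only as a software analogy (the function name and the fixed 3x3 window are illustrative assumptions, not the paper's hardware design):

```python
from collections import deque

def stream_stencil_3x3(rows, kernel):
    """Streaming 3x3 stencil: rows arrive one at a time and only the last
    three are kept in a line buffer, so the full image is never stored.
    Emits one output row per complete window; image borders are skipped."""
    buf = deque(maxlen=3)   # the "line buffer": holds the last 3 rows only
    out = []
    for row in rows:
        buf.append(row)
        if len(buf) == 3:
            out_row = []
            for x in range(1, len(row) - 1):
                acc = 0
                for dy in range(3):
                    for dx in range(3):
                        acc += kernel[dy][dx] * buf[dy][x - 1 + dx]
                out_row.append(acc)
            out.append(out_row)
    return out

# A 4x4 image of ones under an all-ones kernel: each interior output is 9.
image = [[1, 1, 1, 1] for _ in range(4)]
ones = [[1, 1, 1]] * 3
result = stream_stencil_3x3(image, ones)  # [[9, 9], [9, 9]]
```

The point of the pattern is capacity: the buffer holds three rows regardless of image height, mirroring how a hardware buffer trades storage for I/O channels.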
Distributed queues in shared memory: multicore performance and scalability through quantitative relaxation
Pub Date : 2013-05-14 DOI: 10.1145/2482767.2482789
Andreas Haas, Michael Lippautz, T. Henzinger, H. Payer, A. Sokolova, C. Kirsch, A. Sezgin
A prominent remedy to multicore scalability issues in concurrent data structure implementations is to relax the sequential specification of the data structure. We present distributed queues (DQ), a new family of relaxed concurrent queue implementations. DQs implement relaxed queues with a linearizable emptiness check and either configurable or bounded out-of-order behavior, or pool behavior. Our experiments show that DQs outperform and outscale all of the strict and relaxed queue and pool implementations we considered, in both micro- and macrobenchmarks.
Citations: 57
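The relaxed-queue idea above can be sketched in a few lines: spread elements over several partial FIFOs, let each operation pick one, and only report "empty" after every partial queue has been seen empty. A toy single-process Python model (class and parameter names are illustrative, and this simple scan only approximates the paper's linearizable emptiness check):

```python
import random
import threading
from collections import deque

class DistributedQueue:
    """Toy model of a distributed queue: p partial FIFOs, each guarded by
    its own lock, so concurrent operations rarely contend on the same head."""
    def __init__(self, p=4):
        self.partials = [deque() for _ in range(p)]
        self.locks = [threading.Lock() for _ in range(p)]

    def enqueue(self, item):
        i = random.randrange(len(self.partials))
        with self.locks[i]:
            self.partials[i].append(item)

    def dequeue(self):
        # Start at a random partial queue and scan all of them, so "empty"
        # is only reported after every partial queue was observed empty.
        start = random.randrange(len(self.partials))
        for k in range(len(self.partials)):
            i = (start + k) % len(self.partials)
            with self.locks[i]:
                if self.partials[i]:
                    return self.partials[i].popleft()
        return None

dq = DistributedQueue(p=4)
for n in range(8):
    dq.enqueue(n)
drained = [dq.dequeue() for _ in range(8)]  # all 8 items, possibly out of FIFO order
```

The relaxation is visible in `drained`: every enqueued item comes back exactly once, but the interleaving across partial queues means the order need not be strict FIFO, which is exactly the quantitative slack that buys scalability.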
A divide-and-conquer approach for solving singular value decomposition on a heterogeneous system
Pub Date : 2013-05-14 DOI: 10.1145/2482767.2482813
Ding Liu, Ruixuan Li, D. Lilja, Weijun Xiao
Singular value decomposition (SVD) is a fundamental linear operation used in many applications, such as pattern recognition and statistical information processing. To accelerate this time-consuming operation, this paper presents a new divide-and-conquer approach for solving SVD on a heterogeneous CPU-GPU system. We carefully design our algorithm to match the mathematical requirements of SVD to the unique characteristics of a heterogeneous computing platform. This includes a high-performance solution to the secular equation with good numerical stability, overlapping the CPU and GPU tasks, and leveraging the GPU bandwidth in a heterogeneous system. The experimental results show that our algorithm outperforms MKL's divide-and-conquer routine [18] with four cores (eight hardware threads) when the size of the input matrix is larger than 3000. Furthermore, it is up to 33 times faster than LAPACK's divide-and-conquer routine [17], 3 times faster than MKL's divide-and-conquer routine with four cores, and 7 times faster than CULA on the same device, when the size of the matrix grows to 14,000. Our algorithm is also much faster than previous SVD approaches on GPUs.
Citations: 8
Uncovering CPU load balancing policies with harmony
Pub Date : 2013-05-14 DOI: 10.1145/2482767.2482784
Joe Meehean, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, M. Livny
We introduce Harmony, a system for extracting the multiprocessor scheduling policies from commodity operating systems. Harmony can be used to unearth many aspects of multiprocessor scheduling policy, including the nuanced behaviors of core scheduling mechanisms and policies. We demonstrate the effectiveness of Harmony by applying it to the analysis of the load-balancing behavior of three Linux schedulers: O(1), CFS, and BFS. Our analysis uncovers the strengths and weaknesses of each of these schedulers, and more generally shows how to utilize Harmony to perform detailed analyses of complex scheduling systems.
Citations: 2
Multi-processor architectural support for protecting virtual machine privacy in untrusted cloud environment
Pub Date : 2013-05-14 DOI: 10.1145/2482767.2482799
Y. Wen, Jong-Hyuk Lee, Ziyi Liu, Qingji Zheng, W. Shi, Shouhuai Xu, Taeweon Suh
Virtualization is fundamental to cloud computing because it allows multiple operating systems to run simultaneously on a physical machine. However, it also brings a range of security/privacy problems. One particularly challenging and important problem is: how can we protect the Virtual Machines (VMs) from being attacked by Virtual Machine Monitors (VMMs) and/or by the cloud vendors when they are not trusted? In this paper, we propose an architectural solution to the above problem in multi-processor cloud environments. Our key idea is to exploit hardware mechanisms to enforce access control over the shared resources (e.g., memory spaces), while protecting VM memory integrity as well as inter-processor communications and data sharing. We evaluate the solution using full-system emulation and cycle-based architecture models. Experiments based on 20 benchmark applications show that the performance overhead is 1.5%--10% when access control is enforced, and 9%--19% when VM memory is encrypted.
Citations: 9
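The key idea above, hardware-enforced access control over shared resources, can be illustrated with a small model in which each VM is granted address ranges and every access is checked against its grants. A hypothetical Python sketch (the `MemoryGuard` name and the range granularity are invented for illustration; the paper's mechanism is implemented in hardware, not software):

```python
class MemoryGuard:
    """Toy model of access control over shared memory: each VM is granted
    half-open address ranges, and every access is checked against the
    accessing VM's grants before it is allowed to proceed."""
    def __init__(self):
        self.grants = {}  # vm_id -> list of (start, end) half-open ranges

    def grant(self, vm_id, start, end):
        self.grants.setdefault(vm_id, []).append((start, end))

    def check(self, vm_id, addr):
        # An access is allowed only if some grant of this VM covers addr.
        return any(s <= addr < e for s, e in self.grants.get(vm_id, []))

g = MemoryGuard()
g.grant("vm0", 0x1000, 0x2000)
allowed = g.check("vm0", 0x1800)  # vm0 owns this range
denied = g.check("vm1", 0x1800)   # vm1 has no grant here
```

In the paper's setting the check sits in hardware on the access path, which is why enforcing it costs only a few percent of performance rather than a software trap per access.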
Kinship: efficient resource management for performance and functionally asymmetric platforms
Pub Date : 2013-05-14 DOI: 10.1145/2482767.2482787
Vishakha Gupta, Rob C. Knauerhase, P. Brett, K. Schwan
On-chip heterogeneity has become key to balancing performance and power constraints, resulting in disparate (functionally overlapping but not equivalent) cores on a single die. Requiring developers to deal with such heterogeneity can impede adoption through increased programming effort and result in cross-platform incompatibility. We propose that systems software must evolve to dynamically accommodate heterogeneity and to automatically choose task-to-resource mappings to best use these features. We describe the kinship approach for mapping workloads to heterogeneous cores. A hypervisor-level realization of the approach on a variety of experimental heterogeneous platforms demonstrates the general applicability and utility of kinship-based scheduling, matching dynamic workloads to available resources as well as scaling with the number of processes and with different types/configurations of compute resources. Performance advantages of kinship-based scheduling are evident for runs across multiple generations of heterogeneous platforms.
Citations: 12
Cost-effective soft-error protection for SRAM-based structures in GPGPUs
Pub Date : 2013-05-14 DOI: 10.1145/2482767.2482804
Jingweijia Tan, Zhi Li, Xin Fu
General-purpose computing on graphics processing units (GPGPU) is increasingly used to accelerate parallel applications. This makes reliability a growing concern in GPUs, as they were originally designed for graphics processing with relaxed requirements for execution correctness. With CMOS processing technologies continuously scaling down to the nano-scale, the on-chip soft error rate (SER) has been predicted to increase exponentially. GPGPUs with hundreds of cores integrated into a single chip are prone to a high SER. This paper aims to enhance GPGPU reliability in light of soft errors. We leverage GPGPU microarchitecture characteristics and propose energy-efficient protection mechanisms for two typical SRAM-based structures (i.e., the instruction buffer and registers) that are highly susceptible. We develop the Similarity-AWare Protection (SAWP) scheme, which leverages instruction similarity to provide near-full ECC protection to the instruction buffer with very little area and power overhead. Based on the observation that shared memory usually exhibits low utilization, we propose the SHAred memory to Register Protection (SHARP) scheme, which intelligently leverages shared memory to hold the ECCs of registers. Experimental results show that our techniques substantially reduce the vulnerability of these structures and significantly reduce power consumption compared to the full ECC protection mechanism.
Citations: 7
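Both SAWP and SHARP revolve around storing ECC bits for SRAM contents. The classic single-error-correcting code behind such protection is a Hamming code; a Hamming(7,4) sketch in Python shows how a stored syndrome locates and repairs one flipped bit (the 4-bit word size is purely illustrative, the paper does not specify its code parameters):

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit single-error-correcting codeword.
    Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6, 7."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d0..d3
    p1 = d[0] ^ d[1] ^ d[3]   # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]   # covers positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]   # covers positions 4,5,6,7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]  # positions 1..7
    code = 0
    for pos, b in enumerate(bits, start=1):
        code |= b << (pos - 1)
    return code

def hamming74_decode(code):
    """Return (corrected_nibble, flipped_position); position 0 means clean."""
    bits = [(code >> i) & 1 for i in range(7)]  # positions 1..7
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)  # names the corrupt bit position
    if syndrome:
        bits[syndrome - 1] ^= 1            # correct the single flipped bit
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome

code = hamming74_encode(0b1011)
fixed, pos = hamming74_decode(code ^ (1 << 4))  # flip codeword bit at position 5
# fixed == 0b1011 and pos == 5: the error is located and corrected
```

SHARP's contribution is not the code itself but where the check bits live: parking them in under-utilized shared memory avoids dedicating extra SRAM to ECC storage.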
Trace construction using enhanced performance monitoring
Pub Date : 2013-05-14 DOI: 10.1145/2482767.2482811
M. Serrano
We present a hardware-assisted approach for constructing software traces in binary translation systems. The new approach leverages an enhanced performance monitoring unit (PMU) with a combination of hardware techniques: branch prediction information, branch trace collection, and a hardware signature representing the calling context. Overall, the combined approach significantly reduces the time and overhead of building traces while capturing high-quality traces, compared to previous research that exploited a sampling PMU mechanism. The calling-context signature could also be used in other applications such as debugging, program understanding, security, and other optimizations.
Citations: 1
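The hardware calling-context signature can be approximated in software by folding the frames on the current call stack into a small fixed-width hash, so traces formed in different contexts get different keys. A toy Python sketch (the polynomial hash and the 16-bit width are arbitrary illustrative choices, not the paper's hardware function):

```python
def context_signature(frames, bits=16):
    """Fold a call stack (here, a list of function names standing in for
    return addresses) into a fixed-width signature."""
    mask = (1 << bits) - 1
    sig = 0
    for frame in frames:
        for ch in frame.encode():
            sig = (sig * 31 + ch) & mask  # simple polynomial rolling hash
    return sig

# The same path through the program always yields the same signature,
# while different calling contexts (usually) yield different ones.
a = context_signature(["main", "parse", "emit"])
b = context_signature(["main", "opt", "emit"])
```

Because the signature is order-sensitive, a trace builder keyed on it can keep separate traces for the same code reached through different callers, which is the property the paper exploits.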
A shared-FPU architecture for ultra-low power MPSoCs
Pub Date : 2013-05-14 DOI: 10.1145/2482767.2482772
M. R. Kakoee, Igor Loi, L. Benini
In this work we propose a shared floating-point unit (FPU) architecture for ultra-low-power (ULP) systems on chip operating at near-threshold voltage (NTV). Since high-performance FPUs are large and complex, but their utilization is relatively low, adding one FPU per core in a ULP multicore is costly and power hungry. In our approach, we share a few FPUs among all the cores in the system. This increases the utilization of the FPUs, leading to an energy-efficient design. As part of our approach, we propose two different FPU allocation techniques: optimal and random. Experimental results demonstrate that, compared to a traditional private-FPU approach, our technique in a multicore system with 8 processors and 2 shared FPUs can increase performance/(area*power) by 5x for applications with 10% FP operations and by 2.5x for applications with 25% FP operations.
Citations: 0
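The utilization argument above is easy to make quantitative with a toy contention model: each core issues an FP op with some probability per cycle, and at most as many ops as there are FPUs get serviced. A hedged Python sketch (the Bernoulli issue model and all parameter values are assumptions for illustration, not the paper's evaluation methodology):

```python
import random

def simulate(cores=8, fpus=2, fp_rate=0.10, cycles=10000, seed=1):
    """Toy FPU-sharing contention model: every cycle, each core issues an
    FP op with probability fp_rate; at most `fpus` ops are serviced and
    the rest stall. Returns (fpu_utilization, stall_fraction_of_fp_ops)."""
    rng = random.Random(seed)
    busy = stalled = issued = 0
    for _ in range(cycles):
        demand = sum(rng.random() < fp_rate for _ in range(cores))
        issued += demand
        served = min(demand, fpus)
        busy += served
        stalled += demand - served
    return busy / (fpus * cycles), stalled / max(issued, 1)

util_shared, stall_shared = simulate(cores=8, fpus=2, fp_rate=0.10)
util_private, stall_private = simulate(cores=8, fpus=8, fp_rate=0.10)
```

With 8 cores at a 10% FP rate, two shared FPUs sit near 40% busy while eight private ones sit near 10%: that utilization gap is what the shared-FPU design converts into area and power savings, at the cost of the small stall fraction the model also reports.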