
Latest publications from the ACM International Conference on Computing Frontiers

Improving the performance of k-means clustering through computation skipping and data locality optimizations
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212951
Orhan Kislal, P. Berman, M. Kandemir
We present three optimization techniques for the k-means clustering algorithm that improve running time without significantly decreasing the accuracy of the cluster centers. Our first optimization restructures loops to improve cache behavior when executing on multicore architectures. The remaining two optimizations skip selected points to reduce execution latency. Our sensitivity analysis suggests that performance can be enhanced through a good understanding of the data and careful configuration of the parameters.
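
To make the computation-skipping idea concrete, here is a minimal Python sketch (our own illustration, not the authors' code): a point skips the full distance scan whenever the center it was last assigned to moved less than a threshold in the previous iteration, so stable regions of the data stop paying the O(k) reassignment cost.

    import numpy as np

    def kmeans_with_skipping(points, k, iters=20, skip_threshold=1e-3, seed=0):
        # Illustrative k-means with computation skipping: a point skips the
        # O(k) reassignment scan while the center it was last assigned to is
        # stable. Stale assignments are the (small) accuracy cost of skipping.
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), size=k, replace=False)]
        assign = np.zeros(len(points), dtype=int)
        moved = np.full(k, np.inf)  # distance each center moved last round
        for _ in range(iters):
            for i, p in enumerate(points):
                if moved[assign[i]] < skip_threshold:
                    continue  # stable center: keep the old assignment
                assign[i] = int(np.argmin(((centers - p) ** 2).sum(axis=1)))
            new_centers = np.array(
                [points[assign == c].mean(axis=0) if np.any(assign == c)
                 else centers[c] for c in range(k)])
            moved = np.linalg.norm(new_centers - centers, axis=1)
            centers = new_centers
        return centers, assign
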
Citations: 3
CoreSymphony architecture
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212945
Tomoyuki Nagatsuka, Yoshito Sakaguchi, Kenji Kise
We propose the CoreSymphony architecture, which aims to balance single-thread and multi-thread performance on CMPs. The earlier version of CoreSymphony suffered from a complex branch predictor, re-order buffer, and in-order state management mechanism. In this paper, we solve these problems and evaluate the performance of CoreSymphony.
Citations: 1
Adaptive task duplication using on-line bottleneck detection for streaming applications
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212932
Yoonseo Choi, Cheng-Hong Li, D. D. Silva, A. Bivens, E. Schenfeld
In this paper we describe an approach to dynamically improve the progress of streaming applications on SMP multi-core systems. We show that run-time task duplication is an effective method for maximizing application throughput in the face of changes in available computing resources. Such changes cannot be fully handled by static optimizations. We derive a theoretical performance model to identify tasks in need of more computing resources. We propose two on-line algorithms that use indications from the performance model to detect computation bottlenecks. In these algorithms, a task can identify itself as a bottleneck using only its local data. The proposed technique is transparent to end programmers and portable to systems with fair scheduling. Our on-line detection algorithms can be applied to other dynamic scenarios, for example, involving run-time variation of workload. Our experiments using the StreamIt benchmarks [5] show that the proposed run-time task duplication achieves considerable speedups over the multi-threaded baseline on a 16-core machine and in scenarios with a dynamically changing number of processing cores. We also show that our algorithms achieve better application throughput than alternative approaches to task duplication.
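
As an illustration of self-identification from purely local data, the sketch below (hypothetical class name and thresholds; the paper's actual criterion comes from its theoretical performance model) flags a task as a bottleneck when its input queue is nearly full while its output queue is nearly empty:

    from collections import deque

    class StreamTask:
        # Illustrative streaming task that decides, from local data only,
        # whether it is a bottleneck: input backing up while output starves
        # suggests the task cannot keep pace and is a duplication candidate.
        def __init__(self, capacity=64, high_water=0.9, low_water=0.2):
            self.in_q, self.out_q = deque(), deque()
            self.capacity = capacity
            self.high_water, self.low_water = high_water, low_water

        def is_bottleneck(self):
            in_full = len(self.in_q) >= self.high_water * self.capacity
            out_empty = len(self.out_q) <= self.low_water * self.capacity
            return in_full and out_empty
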
Citations: 16
Selective search of inlining vectors for program optimization
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212947
Rosario Cammarota, A. Kejariwal, D. Donato, A. Nicolau, A. Veidenbaum
We propose a novel technique to select the inlining options of a compiler - referred to as an inlining vector - for program optimization. The proposed technique trains a machine learning algorithm to model the relation between inlining vectors and performance (completion time). The training set is composed of sample runs of the programs to optimize, compiled with a limited number of inlining vectors. For a given compiler, the model evaluates the benefit of inlining combined with other compiler heuristics. The model is subsequently used to select the inlining vector that minimizes the predicted completion time of a program with respect to a given level of optimization. We present a case study based on the GNU GCC compiler. We used our technique to improve the performance of 403.gcc from SPEC CPU2006 - a program that is notoriously hard to optimize - with the optimization level -O3 as the baseline. On the state-of-the-art Intel Xeon Westmere architecture, 403.gcc compiled using the inlining vectors selected by our technique outperforms the baseline by up to 9%.
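
The selection loop can be sketched as follows (a hedged illustration: the parameter names, values, and timings are synthetic placeholders, and the paper does not specify this particular regressor): fit a model on the sampled (inlining vector, completion time) pairs, then pick the candidate vector with the smallest predicted time.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Each row encodes one inlining vector, e.g. limits on inline growth;
    # the parameter values and timings below are synthetic placeholders.
    sampled_vectors = np.array([
        [40, 20, 100], [60, 30, 100], [90, 30, 200],
        [40, 60, 200], [120, 45, 400], [90, 60, 400],
    ])
    measured_seconds = np.array([52.1, 50.3, 49.7, 51.0, 48.9, 49.2])

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(sampled_vectors, measured_seconds)  # learn vector -> time

    # Score a wider candidate space and keep the vector whose predicted
    # completion time is smallest.
    candidates = np.array([[a, g, f] for a in (40, 60, 90, 120)
                           for g in (20, 30, 45, 60)
                           for f in (100, 200, 400)])
    best = candidates[int(np.argmin(model.predict(candidates)))]
    print("selected inlining vector:", best)
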
Citations: 1
The tradeoffs of fused memory hierarchies in heterogeneous computing architectures
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212924
Kyle Spafford, J. Meredith, Seyong Lee, Dong Li, P. Roth, J. Vetter
With the rise of general purpose computing on graphics processing units (GPGPU), the influence from consumer markets can now be seen across the spectrum of computer architectures. In fact, many of the high-ranking Top500 HPC systems now include these accelerators. Traditionally, GPUs have connected to the CPU via the PCIe bus, which has proved to be a significant bottleneck for scalable scientific applications. Now, a trend toward tighter integration between CPU and GPU has removed this bottleneck and unified the memory hierarchy for both CPU and GPU cores. We examine the impact of this trend for high performance scientific computing by investigating AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar, but discrete GPUs.
Citations: 60
A flexible OS-based approach for characterizing solid-state disk endurance
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212939
G. Kandiraju, Kaoutar El Maghraoui
The performance and power benefits of Flash memory have paved the way for its adoption in mass storage devices in the form of Solid-State Disks (SSDs). Despite these benefits, Flash memory's limited write endurance remains a big impediment to its wide adoption in the enterprise server market. Existing research efforts have mostly focused on proposing various mechanisms and algorithms to improve SSD performance and reliability. However, there is still a lack of flexible tools for characterizing SSD endurance (i.e., wear-out behavior) and investigating its impact on applications without affecting the lifetime of a real SSD device. To address this issue, SolidSim, a kernel-level simulator, has been enhanced with capabilities to simulate state-of-the-art wear-leveling, garbage collection, and other advanced internal management techniques of an SSD. These extensions have further increased SolidSim's flexibility to study both SSD performance and endurance characteristics. Our approach allows investigating these characteristics without requiring any changes to applications or gathering any workload traces. The paper presents insights into wear-out behavior, including logical, physical, and translation characteristics, and correlates them with application behavior and SSD lifetimes using a set of representative workloads.
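
For intuition about what such endurance characterization tracks, here is a toy per-block erase-count model (illustrative only; the class and method names are ours, not SolidSim's implementation or API):

    class FlashWearModel:
        # Toy per-block erase-count model of the kind an endurance
        # simulator maintains; illustrative, not SolidSim's code.
        def __init__(self, num_blocks, endurance=3000):
            self.erases = [0] * num_blocks
            self.endurance = endurance  # rated program/erase cycles

        def erase(self, block):
            self.erases[block] += 1

        def wear_level_target(self):
            # Static wear leveling: steer the next write to the least-worn block.
            return min(range(len(self.erases)), key=self.erases.__getitem__)

        def life_consumed(self):
            # Fraction of rated endurance used by the worst-worn block.
            return max(self.erases) / self.endurance
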
Citations: 2
A capacity-efficient insertion policy for dynamic cache resizing mechanisms
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212949
Masayuki Sato, Yusuke Tobo, Ryusuke Egawa, H. Takizawa, Hiroaki Kobayashi
Dynamic cache resizing mechanisms have been proposed to achieve both high performance and low energy consumption. The basic idea behind such mechanisms is to divide a cache into several parts and manage them independently, resizing the cache for resource allocation and energy saving. However, dynamic cache resizing mechanisms waste their resources storing many dead-on-fill blocks, which are never reused after being stored in the cache. To reduce the number of dead-on-fill blocks in the cache and thus improve the energy efficiency of dynamic cache resizing mechanisms, this paper proposes a dynamic LRU-K insertion policy. The policy stores an incoming block as the K-th least-recently-used one and adjusts K dynamically according to the application being executed. Therefore, the policy can balance early eviction of dead-on-fill blocks against retention of reusable blocks.
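
A software sketch of the insertion policy for a single cache set (our illustration; the paper describes a hardware mechanism, and the dynamic adjustment of K is omitted here):

    class LRUKInsertionSet:
        # One cache set with LRU-K insertion: a filled block enters at the
        # K-th least-recently-used position rather than at MRU, so a block
        # that is never reused is evicted without crossing the MRU side.
        def __init__(self, ways, k):
            self.ways, self.k = ways, k
            self.blocks = []  # index 0 = MRU ... last index = LRU

        def access(self, tag):
            if tag in self.blocks:          # hit: promote to MRU
                self.blocks.remove(tag)
                self.blocks.insert(0, tag)
                return True
            if len(self.blocks) >= self.ways:
                self.blocks.pop()           # miss in a full set: evict LRU
            pos = max(0, len(self.blocks) - (self.k - 1))
            self.blocks.insert(pos, tag)    # insert K-th from the LRU end
            return False

With K = 1 this degenerates to insertion at the LRU end, while K equal to the associativity recovers conventional MRU insertion; the application-driven adjustment of K is the part this sketch omits.
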
Citations: 0
A programmable processing array architecture supporting dynamic task scheduling and module-level prefetching
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212931
Junghee Lee, H. Lee, S. Ha, Jongman Kim, C. Nicopoulos
Massively Parallel Processing Arrays (MPPA) constitute programmable hardware accelerators that excel in the execution of applications exhibiting Data-Level Parallelism (DLP). The concept of employing such programmable accelerators as sidekicks to the more traditional, general-purpose processing cores has very recently entered the mainstream; both Intel and AMD have introduced processor architectures integrating a Graphics Processing Unit (GPU) alongside the main CPU cores. These GPU engines are expected to play a pivotal role in the espousal of General-Purpose computing on GPUs (GPGPU). However, the widespread adoption of MPPAs, in general, as hardware accelerators entails the effective tackling of some fundamental obstacles: the expressiveness of the programming model, the debugging capabilities, and the memory hierarchy design. Toward this end, this paper proposes a hardware architecture for MPPA that adopts an event-driven execution model. It supports dynamic task scheduling, which offers better expressiveness to the execution model and improves the utilization of processing elements. Moreover, a novel module-level prefetching mechanism - enabled by the specification of the execution model - hides the access time to memory and the scheduler. The execution model also ensures complete encapsulation of the modules, which greatly facilitates debugging. Finally, the fact that all associated inputs of a module are explicitly known can be exploited by the hardware to hide memory access latency without having to resort to caches and a cache coherence protocol. Results using a cycle-level simulator of the proposed architecture and a variety of real application benchmarks demonstrate the efficacy and efficiency of the proposed paradigm.
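
The event-driven model can be sketched in a few lines (hypothetical names; a software caricature of the hardware scheduler, not the proposed architecture itself): a module fires only once every declared input has arrived, and because the full input set is explicit, the runtime knows exactly what to prefetch before dispatch.

    from collections import deque

    class Module:
        # A module fires only when every declared input has arrived, so
        # the complete operand set is explicit before dispatch.
        def __init__(self, name, inputs, run):
            self.name, self.run = name, run
            self.waiting, self.values = set(inputs), {}

        def deliver(self, key, value):
            self.values[key] = value
            self.waiting.discard(key)
            return not self.waiting  # True -> module became ready

    def schedule(modules, events):
        ready = deque()
        for key, value in events:
            for m in modules:
                if key in m.waiting and m.deliver(key, value):
                    # All inputs are known here: a real runtime would issue
                    # module-level prefetches before dispatching the module.
                    ready.append(m)
        while ready:
            m = ready.popleft()
            print(m.name, "->", m.run(m.values))

    # One module that consumes two tokens:
    schedule([Module("add", ["a", "b"], lambda v: v["a"] + v["b"])],
             [("a", 2), ("b", 3)])  # prints: add -> 5
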
Citations: 2
Improving coherence protocol reactiveness by trading bandwidth for latency
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212929
L. G. Menezo, Valentin Puente, Pablo Abad Fidalgo, J. Gregorio
This paper describes how on-chip network particularities can be used to improve coherence protocol responsiveness. To achieve this, a new coherence protocol, named LOCKE, is proposed. LOCKE successfully exploits the large available on-chip bandwidth to improve cache-coherent chip multiprocessor performance and energy efficiency. Provided that the interconnection network is designed to support multicast traffic and the protocol maximizes the potential advantages that direct coherence brings, we demonstrate that a multicast-based coherence protocol can reduce energy requirements in the CMP memory hierarchy. The key idea is to establish a suitable level of on-chip network throughput to accelerate synchronization by two means: avoiding the protocol serialization inherent to directory-based coherence protocols, and reducing average access time more than other snoop-based coherence protocols when shared data is truly contended. LOCKE is developed on top of a Token coherence performance substrate, with a new set of simple proactive policies that speed up data synchronization and eliminate the passive token starvation avoidance mechanism. Using a full-system simulator that faithfully models on-chip interconnection, an aggressive core architecture, and precise memory hierarchy details, while running a broad spectrum of workloads, our proposal improves both directory-based and token-based coherence protocols in terms of both energy and performance, at least in systems with up to 16 aggressive out-of-order processors on the chip.
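
For readers unfamiliar with the Token coherence substrate LOCKE builds on, its core invariant is easy to state in code (the classic token-counting rules only, not LOCKE's added proactive policies):

    class TokenBlock:
        # Classic token-coherence invariant: each block has T tokens;
        # holding at least one token permits reading, and holding all T
        # permits writing, independent of how requests are ordered.
        def __init__(self, total_tokens):
            self.total = total_tokens

        def can_read(self, tokens_held):
            return tokens_held >= 1

        def can_write(self, tokens_held):
            return tokens_held == self.total
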
Citations: 4
Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling
Pub Date : 2012-05-15 DOI: 10.1145/2212908.2212912
L. Avinash, Kirthi Krishna Muntimadugu, C. Enz, R. Karp, K. Palem, C. Piguet
Owing to a growing desire to reduce energy consumption and widely anticipated hurdles to the continued technology scaling promised by Moore's law, techniques and technologies such as inexact circuits and probabilistic CMOS (PCMOS) have gained prominence. These radical approaches trade accuracy at the hardware level for significant gains in energy consumption, area, and speed. While holding great promise, their ability to influence the broader milieu of computing is limited by two shortcomings. First, they were mostly based on ad-hoc hand designs and did not consider algorithmically well-characterized automated design methodologies. Also, existing design approaches were limited to particular layers of abstraction such as the physical, architectural, and algorithmic or, more broadly, software layers. However, it is well known that significant gains can be achieved by optimizing across the layers. To respond to this need, in this paper we present an algorithmically well-founded cross-layer co-design framework (CCF) for automatically designing inexact hardware in the form of datapath elements, specifically adders and multipliers, and show that significant associated gains can be achieved in terms of energy, area, and delay or speed. Our algorithms can achieve these gains without adding any hardware overhead. The proposed CCF framework embodies a symbiotic relationship between architecture and logic-layer design through the technique of probabilistic pruning, combined with the novel confined voltage scaling technique introduced in this paper, applied at the physical layer. A second drawback of the state of the art in inexact design is the lack of physical evidence, established by measuring fabricated ICs, that the achievable gains and other benefits are valid. Again, in this paper, we address this shortcoming by using CCF to fabricate a prototype chip implementing inexact data-path elements: a range of 64-bit integer adders whose outputs can be erroneous. Through physical measurements of our prototype chip, in which the inexact adders admit expected relative error magnitudes of 10% or less, we have found that cumulative gains over comparable, fully accurate chips, quantified through the area-delay-energy product, can be a multiplicative factor of 15 or more. As evidence of the utility of these results, we demonstrate that despite admitting error while achieving gains, images processed using the FFT algorithm implemented with our inexact adders are visually discernible.
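
As a hedged illustration of the accuracy-for-efficiency trade (not the fabricated design, whose probabilistic pruning targets the adder's internal circuitry), consider an adder that simply drops a few low-order operand bits, and the relative error this induces:

    def inexact_add(a, b, pruned_bits):
        # Drop the low-order bits of each operand before adding, mimicking
        # the removal of low-significance carry circuitry.
        mask = ~((1 << pruned_bits) - 1)
        return (a & mask) + (b & mask)

    def relative_error(a, b, pruned_bits):
        exact = a + b
        return abs(exact - inexact_add(a, b, pruned_bits)) / exact if exact else 0.0

    # For wide (e.g., 64-bit) operands, pruning a few low bits keeps the
    # relative error far below the paper's 10% budget:
    print(relative_error(123456789, 987654321, pruned_bits=8))  # ~1.8e-07
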
Citations: 56