Title: Improving the performance of k-means clustering through computation skipping and data locality optimizations
Authors: Orhan Kislal, P. Berman, M. Kandemir
DOI: 10.1145/2212908.2212951

We present three optimization techniques for the k-means clustering algorithm that improve running time without significantly decreasing the accuracy of the cluster centers. Our first optimization restructures loops to improve cache behavior when executing on multicore architectures. The remaining two optimizations skip selected points to reduce execution latency. Our sensitivity analysis suggests that performance can be further enhanced through a good understanding of the data and careful configuration of the parameters.
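The abstract does not spell out the skipping criteria, so the following is only a minimal sketch of the computation-skipping idea in Python, using an invented margin heuristic: once a point's closest center is much closer than its runner-up, the point's assignment is frozen and its distance computations are skipped in later iterations. All names and the `skip_margin` parameter are illustrative, not the paper's method.

```python
import numpy as np

def kmeans_with_skipping(points, k, iters=20, skip_margin=2.0, seed=0):
    """Lloyd's k-means that skips stable points (assumes k >= 2 and
    float-typed points). Once a point's best center is skip_margin
    times closer than its second-best, the point is frozen."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    frozen = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        active = np.where(~frozen)[0]
        # Distances of every active point to every center.
        d = np.linalg.norm(points[active, None, :] - centers[None, :, :], axis=2)
        order = np.argsort(d, axis=1)
        labels[active] = order[:, 0]
        # Freeze points whose best center dominates the runner-up.
        best = d[np.arange(len(active)), order[:, 0]]
        second = d[np.arange(len(active)), order[:, 1]]
        frozen[active] = second > skip_margin * best
        # Centers are recomputed from all points; skipped points simply
        # keep their last assignment.
        for c in range(k):
            members = points[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers, labels
```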
{"title":"Improving the performance of k-means clustering through computation skipping and data locality optimizations","authors":"Orhan Kislal, P. Berman, M. Kandemir","doi":"10.1145/2212908.2212951","DOIUrl":"https://doi.org/10.1145/2212908.2212951","url":null,"abstract":"We present three different optimization techniques for k-means clustering algorithm to improve the running time without decreasing the accuracy of the cluster centers significantly. Our first optimization restructures loops to improve cache behavior when executing on multicore architectures. The remaining two optimizations skip select points to reduce execution latency. Our sensitivity analysis suggests that the performance can be enhanced through a good understanding of the data and careful configuration of the parameters.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126471846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: CoreSymphony architecture
Authors: Tomoyuki Nagatsuka, Yoshito Sakaguchi, Kenji Kise
DOI: 10.1145/2212908.2212945

We propose the CoreSymphony architecture, which aims to balance single-thread and multi-thread performance on CMPs. The earlier version of CoreSymphony suffered from a complex branch predictor, re-order buffer, and in-order state management mechanism. In this paper, we address these problems and evaluate the performance of CoreSymphony.
{"title":"CoreSymphony architecture","authors":"Tomoyuki Nagatsuka, Yoshito Sakaguchi, Kenji Kise","doi":"10.1145/2212908.2212945","DOIUrl":"https://doi.org/10.1145/2212908.2212945","url":null,"abstract":"We propose CoreSymphony architecture, which aims at balancing single-thread performance and multi-thread performance on CMPs. The former version of CoreSymphony had complex branch predictor, re-order buffer, and in-order state management mechanism. In this paper, we solve these problems and evaluate the performance of CoreSymphony.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116028583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Adaptive task duplication using on-line bottleneck detection for streaming applications
Authors: Yoonseo Choi, Cheng-Hong Li, D. D. Silva, A. Bivens, E. Schenfeld
DOI: 10.1145/2212908.2212932
In this paper we describe an approach to dynamically improve the progress of streaming applications on SMP multi-core systems. We show that run-time task duplication is an effective method for maximizing application throughput in the face of changes in available computing resources. Such changes cannot be fully handled by static optimizations. We derive a theoretical performance model to identify tasks in need of more computing resources. We propose two on-line algorithms that use indications from the performance model to detect computation bottlenecks. In these algorithms, a task can identify itself as a bottleneck using only its local data. The proposed technique is transparent to end programmers and portable to systems with fair scheduling. Our on-line detection algorithms can also be applied to other dynamic scenarios, for example, run-time variation of workload. Our experiments using the StreamIt benchmarks [5] show that the proposed run-time task duplication achieves considerable speedups over the multi-threaded baseline on a 16-core machine and in scenarios with a dynamically changing number of processing cores. We also show that our algorithms achieve better application throughput than alternative approaches to task duplication.
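As a rough illustration of self-detection from local data only, here is a Python sketch in which a pipeline stage flags itself as a bottleneck when its input queue grows persistently over a sampling window. The queue-growth test stands in for the paper's analytic performance model; `Task`, `window`, and the wiring are invented for the example.

```python
import queue
import threading

class Task(threading.Thread):
    """A pipeline stage that flags itself as a bottleneck using only
    local observations: a persistently growing input queue means the
    stage consumes slower than it is fed (an illustrative test, not
    the paper's performance model)."""
    def __init__(self, inbox, outbox, work_fn, window=100):
        super().__init__(daemon=True)
        self.inbox, self.outbox, self.work_fn = inbox, outbox, work_fn
        self.window, self.samples = window, []
        self.bottleneck = False

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:            # poison pill terminates the stage
                break
            self.outbox.put(self.work_fn(item))
            self.samples.append(self.inbox.qsize())
            if len(self.samples) >= self.window:
                # Net queue growth over the window => flag ourselves.
                self.bottleneck = self.samples[-1] > self.samples[0]
                self.samples.clear()
```

A supervisor polling `bottleneck` could then spawn a duplicate `Task` reading the same `inbox`; since `queue.Queue` is thread-safe, both copies drain the shared stream without further coordination.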
{"title":"Adaptive task duplication using on-line bottleneck detection for streaming applications","authors":"Yoonseo Choi, Cheng-Hong Li, D. D. Silva, A. Bivens, E. Schenfeld","doi":"10.1145/2212908.2212932","DOIUrl":"https://doi.org/10.1145/2212908.2212932","url":null,"abstract":"In this paper we describe an approach to dynamically improve the progress of streaming applications on SMP multi-core systems. We show that run-time task duplication is an effective method for maximizing application throughput in face of changes in available computing resources. Such changes can not be fully handled by static optimizations. We derive a theoretical performance model to identify tasks in need of more computing resources. We propose two on-line algorithms that use indications from the performance model to detect computation bottlenecks. In these algorithms, a task can identify itself as a bottleneck using only its local data. The proposed technique is transparent to end programmers and portable to systems with fair scheduling. Our on-line detection algorithms can be applied to other dynamic scenarios, for example, involving run-time variation of workload.\u0000 Our experiments using the StreamIt benchmarks [5] show that the proposed run-time task duplication achieves considerable speedups over the multi-threaded baseline on a 16-core machine and on the scenarios with dynamically changing number of processing cores. We also show that our algorithms achieve better application throughput than alternative approaches for task duplication.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134614383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Selective search of inlining vectors for program optimization
Authors: Rosario Cammarota, A. Kejariwal, D. Donato, A. Nicolau, A. Veidenbaum
DOI: 10.1145/2212908.2212947
We propose a novel technique to select the inlining options of a compiler (referred to as an inlining vector) for program optimization. The proposed technique trains a machine learning algorithm to model the relation between inlining vectors and performance (completion time). The training set is composed of sample runs of the programs to optimize, compiled with a limited number of inlining vectors. For a given compiler, the model evaluates the benefit of inlining combined with other compiler heuristics. The model is subsequently used to select the inlining vector which minimizes the predicted completion time of a program with respect to a given level of optimization. We present a case study based on the compiler GNU GCC. We used our technique to improve the performance of 403.gcc from SPEC CPU2006, a program which is notoriously hard to optimize, with the optimization level -O3 as the baseline. On the state-of-the-art Intel Xeon Westmere architecture, 403.gcc, compiled using the inlining vectors selected by our technique, outperforms the baseline by up to 9%.
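A minimal Python sketch of model-guided selection follows: fit a regressor on a few sampled (inlining vector, runtime) pairs, then compile only the vector the model predicts to be fastest. The `--param` knob names are real GCC inlining parameters, but the candidate grid, sample size, model choice, and `app.c` are all assumptions made for the example, not the paper's setup.

```python
import itertools
import random
import subprocess
import time
from sklearn.ensemble import RandomForestRegressor

# One value per GCC inlining knob; the grid of candidates is illustrative.
KNOBS = ["max-inline-insns-single", "max-inline-insns-auto", "inline-unit-growth"]
GRID = list(itertools.product([200, 400, 800], [40, 90, 180], [20, 40, 80]))

def measure(vector, src="app.c"):
    """Compile at -O3 with the given inlining vector and time one run."""
    flags = []
    for knob, value in zip(KNOBS, vector):
        flags += ["--param", f"{knob}={value}"]
    subprocess.run(["gcc", "-O3", *flags, src, "-o", "app"], check=True)
    start = time.perf_counter()
    subprocess.run(["./app"], check=True)
    return time.perf_counter() - start

# Train on a small random sample of the grid, then pick the predicted best.
random.seed(0)
sample = random.sample(GRID, 8)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(sample, [measure(v) for v in sample])
best = min(GRID, key=lambda v: model.predict([v])[0])
print("predicted-best inlining vector:", dict(zip(KNOBS, best)))
```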
{"title":"Selective search of inlining vectors for program optimization","authors":"Rosario Cammarota, A. Kejariwal, D. Donato, A. Nicolau, A. Veidenbaum","doi":"10.1145/2212908.2212947","DOIUrl":"https://doi.org/10.1145/2212908.2212947","url":null,"abstract":"We propose a novel technique to select the inlining options of a compiler - referred to as an inlining vector, for program optimization. The proposed technique trains a machine learning algorithm to model the relation between inlining vectors and performance (completion time). The training set is composed of sample runs of the programs to optimize - that are compiled with a limited number of inlining vectors. Subject to a given compiler, the model evaluates the benefit of inlining combined with other compiler heuristics. The model is subsequently used to select the inlining vector which minimizes the predicted completion time of a program with respect to a given level of optimization.\u0000 We present a case study based on the compiler GNU GCC. We used our technique to improve performance of 403.gcc from SPEC CPU2006 - a program which is notoriously hard to optimize - with respect to the optimization level -O3 as the baseline. On the state-of-the-art Intel Xeon Westmere architecture, 403.gcc, compiled using the inlining vectors selected by our technique, outperforms the baseline by up to 9%.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134058923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: The tradeoffs of fused memory hierarchies in heterogeneous computing architectures
Authors: Kyle Spafford, J. Meredith, Seyong Lee, Dong Li, P. Roth, J. Vetter
DOI: 10.1145/2212908.2212924
With the rise of general purpose computing on graphics processing units (GPGPU), the influence of consumer markets can now be seen across the spectrum of computer architectures. In fact, many of the high-ranking Top500 HPC systems now include these accelerators. Traditionally, GPUs have connected to the CPU via the PCIe bus, which has proved to be a significant bottleneck for scalable scientific applications. Now, a trend toward tighter integration between CPU and GPU has removed this bottleneck and unified the memory hierarchy for both CPU and GPU cores. We examine the impact of this trend on high-performance scientific computing by investigating AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar but discrete GPUs.
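The central tradeoff can be captured in a back-of-envelope model: a discrete GPU pays PCIe transfer time but runs kernels faster, while a fused APU avoids the copy with fewer compute units. The Python sketch below uses illustrative constants, not measured values from the paper.

```python
# Toy model of the fused-vs-discrete tradeoff. All constants are
# illustrative assumptions, not measurements.
PCIE_GBPS = 6.0            # effective PCIe bandwidth, GB/s
DISCRETE_GFLOPS = 1000.0   # discrete GPU throughput
FUSED_GFLOPS = 500.0       # fused APU GPU throughput

def faster_on_fused(bytes_moved, flops):
    """True when the copy-free fused part wins despite slower compute."""
    t_discrete = bytes_moved / (PCIE_GBPS * 1e9) + flops / (DISCRETE_GFLOPS * 1e9)
    t_fused = flops / (FUSED_GFLOPS * 1e9)
    return t_fused < t_discrete

# Low arithmetic intensity (few flops per byte) favors the fused part.
print(faster_on_fused(bytes_moved=1e9, flops=1e9))   # True: the copy dominates
print(faster_on_fused(bytes_moved=1e6, flops=1e12))  # False: compute dominates
```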
{"title":"The tradeoffs of fused memory hierarchies in heterogeneous computing architectures","authors":"Kyle Spafford, J. Meredith, Seyong Lee, Dong Li, P. Roth, J. Vetter","doi":"10.1145/2212908.2212924","DOIUrl":"https://doi.org/10.1145/2212908.2212924","url":null,"abstract":"With the rise of general purpose computing on graphics processing units (GPGPU), the influence from consumer markets can now be seen across the spectrum of computer architectures. In fact, many of the high-ranking Top500 HPC systems now include these accelerators. Traditionally, GPUs have connected to the CPU via the PCIe bus, which has proved to be a significant bottleneck for scalable scientific applications. Now, a trend toward tighter integration between CPU and GPU has removed this bottleneck and unified the memory hierarchy for both CPU and GPU cores. We examine the impact of this trend for high performance scientific computing by investigating AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar, but discrete GPUs.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134172724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: A flexible OS-based approach for characterizing solid-state disk endurance
Authors: G. Kandiraju, Kaoutar El Maghraoui
DOI: 10.1145/2212908.2212939

The performance and power benefits of Flash memory have paved the way for its adoption in mass storage devices in the form of Solid-State Disks (SSDs). Despite these benefits, Flash memory's limited write endurance remains a big impediment to its wide adoption in the enterprise server market. Existing research efforts have mostly focused on proposing various mechanisms and algorithms to improve SSD performance and reliability. However, there is still a lack of flexible tools for characterizing SSD endurance (i.e., wear-out behavior) and investigating its impact on applications without affecting the lifetime of the real SSD device. To address this issue, SolidSim, a kernel-level simulator, has been enhanced with capabilities to simulate state-of-the-art wear-leveling, garbage-collection, and other advanced internal management techniques of an SSD. These extensions have further increased SolidSim's flexibility to study both SSD performance and endurance characteristics. Our approach allows investigating these characteristics without requiring any changes to applications or gathering any workload traces. The paper presents insights into wear-out behavior, including logical, physical, and translation characteristics, and correlates them with application behavior and SSD lifetimes using a set of representative workloads.
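To make the kind of state such a simulator tracks concrete, here is a toy flash translation layer in Python: logical pages map to physical blocks, writes go to the least-worn block, and per-block erase counts expose the wear spread. This is only a sketch of the general idea; it is not SolidSim's actual design, and all names are invented.

```python
import random

class ToyFTL:
    """A toy flash translation layer with naive wear leveling: writes
    land on the least-erased block, and overwriting a logical page
    charges an erase to its old block."""
    def __init__(self, n_blocks):
        self.erases = [0] * n_blocks   # per-block wear counters
        self.l2p = {}                  # logical page -> physical block

    def write(self, lpage):
        old = self.l2p.get(lpage)
        if old is not None:
            self.erases[old] += 1      # old copy invalidated, block erased
        # Wear leveling: place the new copy on the least-worn block.
        self.l2p[lpage] = min(range(len(self.erases)), key=self.erases.__getitem__)

    def wear_spread(self):
        return max(self.erases) - min(self.erases)

ftl = ToyFTL(64)
for _ in range(10_000):
    ftl.write(random.randrange(256))   # random logical write traffic
print("erase-count spread:", ftl.wear_spread())
```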
{"title":"A flexible OS-based approach for characterizing solid-state disk endurance","authors":"G. Kandiraju, Kaoutar El Maghraoui","doi":"10.1145/2212908.2212939","DOIUrl":"https://doi.org/10.1145/2212908.2212939","url":null,"abstract":"The performance and power benefits of Flash memory have paved its adoption in mass storage devices in the form of Solid-State Disks (SSDs). Despite these benefits, Flash memory's limited write endurance remains a big impediment to its wide adoption in the enterprise server market. Existing research efforts have mostly focused on proposing various mechanisms and algorithms to improve SSD's performance and reliability. However, there is still a lack of flexible tools that allow characterizing SSD endurance (i.e., wear-out behavior) and investigating its impact on applications without affecting the lifetime of the real SSD device. To address this issue, SolidSim, a kernel-level simulator has been enhanced with capabilities to simulate state-of-the-art wear-leveling, garbage-collection and other advanced internal management techniques of an SSD. These extensions have further increased SolidSim's flexibility to study both SSD performance and endurance characteristics. Our approach allows investigating these characteristics without requiring any changes to applications or gathering any workload traces. The paper presents insights into wear-out behavior including logical, physical and translation characteristics, and correlates them with application behavior and SSD life-times using a set of representative workloads.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127636149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: A capacity-efficient insertion policy for dynamic cache resizing mechanisms
Authors: Masayuki Sato, Yusuke Tobo, Ryusuke Egawa, H. Takizawa, Hiroaki Kobayashi
DOI: 10.1145/2212908.2212949
Dynamic cache resizing mechanisms have been proposed to achieve both high performance and low energy consumption. The basic idea behind such mechanisms is to divide a cache into parts and manage them independently, resizing the cache for resource allocation and energy saving. However, dynamic cache resizing mechanisms waste capacity storing many dead-on-fill blocks, which are never reused after being stored in the cache. To reduce the number of dead-on-fill blocks in the cache, and thus improve the energy efficiency of dynamic cache resizing mechanisms, this paper proposes a dynamic LRU-K insertion policy. The policy stores an incoming block as the K-th least-recently-used one and adjusts K dynamically according to the application being executed. The policy can therefore balance early eviction of dead-on-fill blocks against retention of reusable blocks.
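A small Python sketch of the insertion idea follows: a miss-filled block enters the recency stack at the K-th least-recently-used position instead of at MRU, so a dead-on-fill block is evicted after roughly K further misses rather than surviving a full LRU traversal. The paper adjusts K at run time per application; the sketch reduces that to a fixed constructor parameter.

```python
class LRUKInsertCache:
    """One cache set with LRU-K insertion. stack[0] is MRU and
    stack[-1] is LRU (the 1st least-recently-used block)."""
    def __init__(self, ways, k):
        self.ways, self.k = ways, k
        self.stack = []

    def access(self, tag):
        if tag in self.stack:       # hit: promote to MRU, as plain LRU does
            self.stack.remove(tag)
            self.stack.insert(0, tag)
            return True
        if len(self.stack) == self.ways:
            self.stack.pop()        # evict the true LRU block
        # Insert so the new block becomes the K-th least-recently-used.
        self.stack.insert(max(0, len(self.stack) + 1 - self.k), tag)
        return False
```

With k=1 this degenerates to LRU-position insertion, evicting never-reused blocks almost immediately; larger K gives incoming blocks more time to prove they are reusable.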
{"title":"A capacity-efficient insertion policy for dynamic cache resizing mechanisms","authors":"Masayuki Sato, Yusuke Tobo, Ryusuke Egawa, H. Takizawa, Hiroaki Kobayashi","doi":"10.1145/2212908.2212949","DOIUrl":"https://doi.org/10.1145/2212908.2212949","url":null,"abstract":"Dynamic cache resizing mechanisms have been proposed to achieve both high performance and low energy consumption. The basic idea behind such mechanisms is to divide a cache into some parts, and manage them independently to resize the cache for resource allocation and energy saving. However, dynamic cache resizing mechanisms waste their resource to store a lot of dead-on-fill blocks, which are not reused after being stored in the cache. To reduce the number of dead-on-fill blocks in the cache and thus improve energy efficiency of dynamic cache resizing mechanisms, this paper proposes a dynamic LRU-K insertion policy. The policy stores a new coming block as the K-th least-recently-used one and adjusts K dynamically according to the application to be executed. Therefore, the policy can balance between early eviction of dead-on-fill blocks and retainment of reusable blocks.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120973509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: A programmable processing array architecture supporting dynamic task scheduling and module-level prefetching
Authors: Junghee Lee, H. Lee, S. Ha, Jongman Kim, C. Nicopoulos
DOI: 10.1145/2212908.2212931
Massively Parallel Processing Arrays (MPPA) constitute programmable hardware accelerators that excel in the execution of applications exhibiting Data-Level Parallelism (DLP). The concept of employing such programmable accelerators as sidekicks to the more traditional, general-purpose processing cores has very recently entered the mainstream; both Intel and AMD have introduced processor architectures integrating a Graphics Processing Unit (GPU) alongside the main CPU cores. These GPU engines are expected to play a pivotal role in the espousal of General-Purpose computing on GPUs (GPGPU). However, the widespread adoption of MPPAs, in general, as hardware accelerators entails the effective tackling of some fundamental obstacles: the expressiveness of the programming model, the debugging capabilities, and the memory hierarchy design. Toward this end, this paper proposes a hardware architecture for MPPA that adopts an event-driven execution model. It supports dynamic task scheduling, which offers better expressiveness to the execution model and improves the utilization of processing elements. Moreover, a novel module-level prefetching mechanism - enabled by the specification of the execution model - hides the access time to memory and the scheduler. The execution model also ensures complete encapsulation of the modules, which greatly facilitates debugging. Finally, the fact that all associated inputs of a module are explicitly known can be exploited by the hardware to hide memory access latency without having to resort to caches and a cache coherence protocol. Results using a cycle-level simulator of the proposed architecture and a variety of real application benchmarks demonstrate the efficacy and efficiency of the proposed paradigm.
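The enabling property, that a module's inputs are all explicitly known before it runs, can be sketched in a few lines of Python: a module becomes ready only when every declared input has arrived, and a prefetch for its operand set can be issued at enqueue time, before dispatch. This is an illustrative software analogue of the hardware mechanism; all names are invented.

```python
from collections import deque

class Module:
    """An encapsulated module that fires only when every declared
    input has arrived, so its full operand set is known up front."""
    def __init__(self, name, inputs, fn):
        self.name, self.fn = name, fn
        self.waiting = set(inputs)
        self.values = {}

    def deliver(self, key, value):
        self.values[key] = value
        self.waiting.discard(key)
        return not self.waiting        # True => ready to be scheduled

ready = deque()

def prefetch(module):
    # Stand-in for the hardware prefetcher: module.values enumerates
    # every operand explicitly, so their memory could be fetched here,
    # overlapped with whatever executes before this module dispatches.
    pass

def schedule(module):
    prefetch(module)                   # issue the prefetch at enqueue time
    ready.append(module)

def run():
    while ready:
        m = ready.popleft()
        m.fn(m.values)                 # all inputs guaranteed present
```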
{"title":"A programmable processing array architecture supporting dynamic task scheduling and module-level prefetching","authors":"Junghee Lee, H. Lee, S. Ha, Jongman Kim, C. Nicopoulos","doi":"10.1145/2212908.2212931","DOIUrl":"https://doi.org/10.1145/2212908.2212931","url":null,"abstract":"Massively Parallel Processing Arrays (MPPA) constitute programmable hardware accelerators that excel in the execution of applications exhibiting Data-Level Parallelism (DLP). The concept of employing such programmable accelerators as sidekicks to the more traditional, general-purpose processing cores has very recently entered the mainstream; both Intel and AMD have introduced processor architectures integrating a Graphics Processing Unit (GPU) alongside the main CPU cores. These GPU engines are expected to play a pivotal role in the espousal of General-Purpose computing on GPUs (GPGPU). However, the widespread adoption of MPPAs, in general, as hardware accelerators entails the effective tackling of some fundamental obstacles: the expressiveness of the programming model, the debugging capabilities, and the memory hierarchy design. Toward this end, this paper proposes a hardware architecture for MPPA that adopts an event-driven execution model. It supports dynamic task scheduling, which offers better expressiveness to the execution model and improves the utilization of processing elements. Moreover, a novel module-level prefetching mechanism - enabled by the specification of the execution model - hides the access time to memory and the scheduler. The execution model also ensures complete encapsulation of the modules, which greatly facilitates debugging. Finally, the fact that all associated inputs of a module are explicitly known can be exploited by the hardware to hide memory access latency without having to resort to caches and a cache coherence protocol. Results using a cycle-level simulator of the proposed architecture and a variety of real application benchmarks demonstrate the efficacy and efficiency of the proposed paradigm.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121108600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Improving coherence protocol reactiveness by trading bandwidth for latency
Authors: L. G. Menezo, Valentin Puente, Pablo Abad Fidalgo, J. Gregorio
DOI: 10.1145/2212908.2212929
This paper describes how on-chip network particularities can be used to improve coherence protocol responsiveness. To achieve this, a new coherence protocol, named LOCKE, is proposed. LOCKE successfully exploits the large available on-chip bandwidth to improve cache-coherent chip multiprocessor performance and energy efficiency. Provided that the interconnection network is designed to support multicast traffic and the protocol maximizes the potential advantages that direct coherence brings, we demonstrate that a multicast-based coherence protocol can reduce energy requirements in the CMP memory hierarchy. The key idea is to establish a suitable level of on-chip network throughput to accelerate synchronization by two means: avoiding the protocol serialization inherent to directory-based coherence protocols, and reducing average access time more than other snoop-based coherence protocols when shared data is truly contended. LOCKE is developed on top of a Token coherence performance substrate, with a new set of simple proactive policies that speeds up data synchronization and eliminates the passive token starvation avoidance mechanism. Using a full-system simulator that faithfully models the on-chip interconnect, an aggressive core architecture, and precise memory hierarchy details, while running a broad spectrum of workloads, our proposal improves on both directory-based and token-based coherence protocols in terms of both energy and performance, at least in systems with up to 16 aggressive out-of-order processors on the chip.
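The bandwidth-for-latency trade can be illustrated with a toy hop-count model in Python: a directory protocol resolves a miss to a remote dirty line in three serialized network traversals, while a multicast request removes the directory indirection at the cost of injecting one message per sharer. The numbers and formulas below are illustrative assumptions, not LOCKE's actual design.

```python
HOP_NS = 15  # assumed latency of one on-chip network traversal

def directory_miss(sharers):
    # requester -> directory -> owner -> requester:
    # 3 serialized hops, 3 point-to-point messages.
    return 3 * HOP_NS, 3

def multicast_miss(sharers):
    # requester multicasts to all potential sharers, owner replies:
    # 2 serialized hops, but 1 + sharers messages on the network.
    return 2 * HOP_NS, 1 + sharers

for s in (1, 4, 15):
    (dl, dm), (ml, mm) = directory_miss(s), multicast_miss(s)
    print(f"sharers={s:2d}  directory: {dl}ns/{dm} msgs  multicast: {ml}ns/{mm} msgs")
```

The model makes the paper's premise visible: multicast always wins on serialized latency, and the cost it pays is extra traffic, which is affordable precisely when on-chip bandwidth is plentiful.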
{"title":"Improving coherence protocol reactiveness by trading bandwidth for latency","authors":"L. G. Menezo, Valentin Puente, Pablo Abad Fidalgo, J. Gregorio","doi":"10.1145/2212908.2212929","DOIUrl":"https://doi.org/10.1145/2212908.2212929","url":null,"abstract":"This paper describes how on-chip network particularities could be used to improve coherence protocol responsiveness. In order to achieve this, a new coherence protocol, named LOCKE, is proposed. LOCKE successfully exploits large on-chip bandwidth availability to improve cache-coherent chip multiprocessor performance and energy efficiency. Provided that the interconnection network is designed to support multicast traffic and the protocol maximizes the potential advantages that direct coherence brings, we demonstrate that a multicast-based coherence protocol could reduce energy requirements in the CMP memory hierarchy. The key idea presented is to establish a suitable level of on-chip network throughput to accelerate synchronization by two means: avoiding the protocol serialization, inherent to directory-based coherence protocol, and reducing average access time more than in other snoop-based coherence protocols, when shared data is truly contended. LOCKE is developed on top of a Token coherence performance substrate, with a new set of simple proactive policies that speeds up data synchronization and eliminates the passive token starvation avoidance mechanism. Using a full-system simulator that faithfully models on-chip interconnection, aggressive core architecture and precise memory hierarchy details, while running a broad spectrum of workloads, our proposal can improve both directory-based and token-based coherence protocols both in terms of energy and performance, at least in systems with up to 16 aggressive out-of-order processors in the chip.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115515565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling
Authors: L. Avinash, Kirthi Krishna Muntimadugu, C. Enz, R. Karp, K. Palem, C. Piguet
DOI: 10.1145/2212908.2212912
Owing to a growing desire to reduce energy consumption and widely anticipated hurdles to the continued technology scaling promised by Moore's law, techniques and technologies such as inexact circuits and probabilistic CMOS (PCMOS) have gained prominence. These radical approaches trade accuracy at the hardware level for significant gains in energy consumption, area, and speed. While holding great promise, their ability to influence the broader milieu of computing is limited due to two shortcomings. First, they were mostly based on ad-hoc hand designs and did not consider algorithmically well-characterized automated design methodologies. Also, existing design approaches were limited to particular layers of abstraction, such as the physical, architectural, and algorithmic (or more broadly, software) layers. However, it is well known that significant gains can be achieved by optimizing across the layers. To respond to this need, in this paper we present an algorithmically well-founded cross-layer co-design framework (CCF) for automatically designing inexact hardware in the form of datapath elements, specifically adders and multipliers, and show that significant associated gains can be achieved in terms of energy, area, and delay or speed. Our algorithms can achieve these gains without adding any hardware overhead. The proposed CCF framework embodies a symbiotic relationship between architecture- and logic-layer design through the technique of probabilistic pruning, combined with the novel confined voltage scaling technique introduced in this paper and applied at the physical layer. A second drawback of the state of the art in inexact design is the lack of physical evidence, established by measuring fabricated ICs, that the claimed gains and other benefits are valid. In this paper, we have addressed this shortcoming by using CCF to fabricate a prototype chip implementing inexact datapath elements: a range of 64-bit integer adders whose outputs can be erroneous. Through physical measurements of our prototype chip, wherein the inexact adders admit expected relative error magnitudes of 10% or less, we have found that cumulative gains over comparable, fully accurate chips, quantified through the area-delay-energy product, can be a multiplicative factor of 15 or more. As evidence of the utility of these results, we demonstrate that despite admitting error while achieving gains, images processed using the FFT algorithm implemented with our inexact adders remain visually discernible.
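To make the notion of an inexact adder with a bounded expected relative error concrete, here is a Python stand-in that simply removes the lowest carry columns of the adder (the paper's probabilistic pruning chooses which parts to remove from switching-activity profiles; plain low-order truncation is just the simplest instance, used here for illustration).

```python
import random

def pruned_add(a, b, pruned_bits, width=64):
    """Adder with its lowest `pruned_bits` columns removed: those result
    bits are zero and no carry propagates out of them (a software
    stand-in for pruned adder hardware)."""
    mask = ~((1 << pruned_bits) - 1) & ((1 << width) - 1)
    return ((a & mask) + (b & mask)) & ((1 << width) - 1)

# Estimate the mean relative error magnitude against the exact sum.
random.seed(1)
errors = []
for _ in range(10_000):
    a, b = random.getrandbits(32), random.getrandbits(32)
    exact = a + b
    if exact:
        errors.append(abs(exact - pruned_add(a, b, 8)) / exact)
print(f"mean relative error: {sum(errors) / len(errors):.2e}")
```

Sweeping `pruned_bits` trades error magnitude against the hardware that would be saved, which is the knob the cross-layer framework tunes against energy, area, and delay.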
{"title":"Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling","authors":"L. Avinash, Kirthi Krishna Muntimadugu, C. Enz, R. Karp, K. Palem, C. Piguet","doi":"10.1145/2212908.2212912","DOIUrl":"https://doi.org/10.1145/2212908.2212912","url":null,"abstract":"Owing to a growing desire to reduce energy consumption and widely anticipated hurdles to the continued technology scaling promised by Moore's law, techniques and technologies such as inexact circuits and probabilistic CMOS (PCMOS) have gained prominence. These radical approaches trade accuracy at the hardware level for significant gains in energy consumption, area, and speed. While holding great promise, their ability to influence the broader milieu of computing is limited due to two shortcomings. First, they were mostly based on ad-hoc hand designs and did not consider algorithmically well-characterized automated design methodologies. Also, existing design approaches were limited to particular layers of abstraction such as physical, architectural and algorithmic or more broadly software. However, it is well-known that significant gains can be achieved by optimizing across the layers. To respond to this need, in this paper, we present an algorithmically well-founded cross-layer co-design framework (CCF) for automatically designing inexact hardware in the form of datapath elements. Specifically adders and multipliers, and show that significant associated gains can be achieved in terms of energy, area, and delay or speed. Our algorithms can achieve these gains with adding any additional hardware overhead. The proposed CCF framework embodies a symbiotic relationship between architecture and logic-layer design through the technique of probabilistic pruning combined with the novel confined voltage scaling technique introduced in this paper, applied at the physical layer. A second drawback of the state of the art with inexact design is the lack of physical evidence established through measuring fabricated ICs that the gains and other benefits that can be achieved are valid. Again, in this paper, we have addressed this shortcoming by using CCF to fabricate a prototype chip implementing inexact data-path elements; a range of 64-bit integer adders whose outputs can be erroneous. Through physical measurements of our prototype chip wherein the inexact adders admit expected relative error magnitudes of 10% or less, we have found that cumulative gains over comparable and fully accurate chips, quantified through the area-delay-energy product, can be a multiplicative factor of 15 or more. As evidence of the utility of these results, we demonstrate that despite admitting error while achieving gains, images processed using the FFT algorithm implemented using our inexact adders are visually discernible.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117028363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}