Md. Tauhidur Rahman, Domenic Forte, J. Fahrny, M. Tehranipoor
Physically Unclonable Functions (PUFs) have emerged as a security block with the potential to generate chip-specific identifiers and cryptographic keys. However it has been shown that the stability of these identifiers and keys is heavily impacted by aging and environmental variations. Previous techniques have mostly focused on improving PUF robustness against supply noise and temperature but aging has been largely neglected. In this paper, we propose a new aging resistant design for the popular ring-oscillator (RO)-PUF. Simulation results demonstrate that our aging resistant RO-PUF (called ARO-PUF) can produce unique, random, and more reliable keys. Only 7.7% bits get flipped on average over 10 years operation period for an ARO-PUF due to aging where the value is 32% for a conventional RO-PUF. The ARO-PUF shows an average interchip HD of 49.67% (close to ideal value 50%) and better than the conventional RO-PUF (~45%). With lower error, ARO-PUF offers ~ 24X area reduction for a 128-bit key because of reduced ECC complexity and smaller PUF footprint.
{"title":"ARO-PUF: An aging-resistant ring oscillator PUF design","authors":"Md. Tauhidur Rahman, Domenic Forte, J. Fahrny, M. Tehranipoor","doi":"10.7873/DATE.2014.082","DOIUrl":"https://doi.org/10.7873/DATE.2014.082","url":null,"abstract":"Physically Unclonable Functions (PUFs) have emerged as a security block with the potential to generate chip-specific identifiers and cryptographic keys. However it has been shown that the stability of these identifiers and keys is heavily impacted by aging and environmental variations. Previous techniques have mostly focused on improving PUF robustness against supply noise and temperature but aging has been largely neglected. In this paper, we propose a new aging resistant design for the popular ring-oscillator (RO)-PUF. Simulation results demonstrate that our aging resistant RO-PUF (called ARO-PUF) can produce unique, random, and more reliable keys. Only 7.7% bits get flipped on average over 10 years operation period for an ARO-PUF due to aging where the value is 32% for a conventional RO-PUF. The ARO-PUF shows an average interchip HD of 49.67% (close to ideal value 50%) and better than the conventional RO-PUF (~45%). With lower error, ARO-PUF offers ~ 24X area reduction for a 128-bit key because of reduced ECC complexity and smaller PUF footprint.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"9 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84065403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Smartphones provide applications that are increasingly similar to those of interactive desktop programs, providing rich graphics and animations. To simplify the creation of these interactive applications, mobile operating systems employ highlevel object-oriented programming languages and shared libraries to manipulate the device's peripherals and provide common userinterface frameworks. The presence of dynamic dispatch and polymorphism allows for robust and extensible application coding. Unfortunately, the presence of dynamic dispatch also introduces significant overheads during method calls, which directly impact execution time. Furthermore, since these applications rely heavily on shared libraries and helper routines, the quantity of these method calls is higher than those found in typical desktop-based programs. Optimizing these method calls centrally before consumers download the application onto a given phone is exacerbated due to the large diversity of hardware and operating system versions that the application could run on. This paper proposes a methodology to tailor a given Objective-C application and its associated device-specific shared library codebase using on-device post-compilation code optimization and transformation. In doing so, many polymorphic sites can be resolved statically, improving the overall application performance.
{"title":"On-device objective-C application optimization framework for high-performance mobile processors","authors":"Garo Bournoutian, A. Orailoglu","doi":"10.7873/DATE.2014.098","DOIUrl":"https://doi.org/10.7873/DATE.2014.098","url":null,"abstract":"Smartphones provide applications that are increasingly similar to those of interactive desktop programs, providing rich graphics and animations. To simplify the creation of these interactive applications, mobile operating systems employ highlevel object-oriented programming languages and shared libraries to manipulate the device's peripherals and provide common userinterface frameworks. The presence of dynamic dispatch and polymorphism allows for robust and extensible application coding. Unfortunately, the presence of dynamic dispatch also introduces significant overheads during method calls, which directly impact execution time. Furthermore, since these applications rely heavily on shared libraries and helper routines, the quantity of these method calls is higher than those found in typical desktop-based programs. Optimizing these method calls centrally before consumers download the application onto a given phone is exacerbated due to the large diversity of hardware and operating system versions that the application could run on. This paper proposes a methodology to tailor a given Objective-C application and its associated device-specific shared library codebase using on-device post-compilation code optimization and transformation. In doing so, many polymorphic sites can be resolved statically, improving the overall application performance.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"IA-20 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84602982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Promising advantages offered by resistive NonVolatile Memories (NVMs) have brought great attention to replace existing volatile memory technologies. While NVMs were primarily studied to be used in the memory hierarchy, they can also provide benefits in Field-Programmable Gate Arrays (FPGAs). One major limitation of employing NVMs in FPGAs is significant power and area overheads imposed by the Peripheral Circuitry (PC) of NVM configuration bits. In this paper, we investigate the applicability of different NVM technologies for configuration bits of FPGAs and propose a power-efficient reconfigurable architecture based on Phase Change Memory (PCM). The proposed PCM-based architecture has been evaluated using different technology nodes and it is compared to the SRAM-based FPGA architecture. Power and Power Delay Product (PDP) estimations of the proposed architecture show up to 37.7% and 35.7% improvements over SRAM-based FPGAs, respectively, with less than 3.2% performance overhead.
{"title":"A power-efficient reconfigurable architecture using PCM configuration technology","authors":"A. Ahari, H. Asadi, Behnam Khaleghi, M. Tahoori","doi":"10.7873/DATE.2014.349","DOIUrl":"https://doi.org/10.7873/DATE.2014.349","url":null,"abstract":"Promising advantages offered by resistive NonVolatile Memories (NVMs) have brought great attention to replace existing volatile memory technologies. While NVMs were primarily studied to be used in the memory hierarchy, they can also provide benefits in Field-Programmable Gate Arrays (FPGAs). One major limitation of employing NVMs in FPGAs is significant power and area overheads imposed by the Peripheral Circuitry (PC) of NVM configuration bits. In this paper, we investigate the applicability of different NVM technologies for configuration bits of FPGAs and propose a power-efficient reconfigurable architecture based on Phase Change Memory (PCM). The proposed PCM-based architecture has been evaluated using different technology nodes and it is compared to the SRAM-based FPGA architecture. Power and Power Delay Product (PDP) estimations of the proposed architecture show up to 37.7% and 35.7% improvements over SRAM-based FPGAs, respectively, with less than 3.2% performance overhead.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"7 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78262118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a novel tree arbiter cell that allows a pipelined processing of asynchronous requests. In this way it can achieve significantly lower delay in the critical case of frequent requests coming from different clients. We elaborate the necessary extension to facilitate a cascaded use of this cell in a tree-like fashion, and we show by theoretical analysis that in this configuration our cell provides better fairness than the standard approach. We implement our approach and quantitatively compare its performance properties with related work in a gatelevel simulation. In our sample asynchronous Networks-on-Chip application our new cell proves to increase the throughput of three different designs available in literature by approximately 61.28%, 69.24%, and 186.85% respectively.
{"title":"A tree arbiter cell for high speed resource sharing in asynchronous environments","authors":"S. R. Naqvi, A. Steininger","doi":"10.7873/DATE.2014.308","DOIUrl":"https://doi.org/10.7873/DATE.2014.308","url":null,"abstract":"We present a novel tree arbiter cell that allows a pipelined processing of asynchronous requests. In this way it can achieve significantly lower delay in the critical case of frequent requests coming from different clients. We elaborate the necessary extension to facilitate a cascaded use of this cell in a tree-like fashion, and we show by theoretical analysis that in this configuration our cell provides better fairness than the standard approach. We implement our approach and quantitatively compare its performance properties with related work in a gatelevel simulation. In our sample asynchronous Networks-on-Chip application our new cell proves to increase the throughput of three different designs available in literature by approximately 61.28%, 69.24%, and 186.85% respectively.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"27 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78034902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seyab Khan, I. Agbo, S. Hamdioui, H. Kukner, B. Kaczer, P. Raghavan, F. Catthoor
Bias Temperature Instability (BTI) is posing a major reliability challenge for today's and future semiconductor devices as it degrades their performance. This paper provides a comprehensive BTI impact analysis, in terms of time-dependent degradation, of FinFET based SRAM cell. The evaluation metrics are read Static Noise Margin (SNM), hold SNM and Write Trip Point (WTP); while the aspects investigated include BTI impact dependence on the supply voltage, cell strength, and design styles (6 versus 8 Transistors cell). A comparison between FinFET and planar CMOS based SRAM cells degradation is also covered. The simulation performed on FinFET based cells for 108 seconds of operation under nominal Vdd show that Read SNM degradation is 16.72%, which is 1.17× faster than hold SNM, while WTP improves by 6.82%. In addition, a supply voltage increment of 25% reduces the Read SNM degradation by 40%, while strengthening the cell pull-down transistors by 1.5× reduces the degradation by only 22%. Moreover, the results reveal that 8T cell degrades 1.31× faster than 6T cell, and that FinFET cells are more vulnerable (~2×) to BTI degradation than planar CMOS cells.
偏置温度不稳定性(BTI)降低了半导体器件的性能,对当今和未来的半导体器件的可靠性构成了重大挑战。本文提供了一个全面的BTI影响分析,在时间相关的退化,基于FinFET的SRAM电池。评价指标为读静态噪声裕度(SNM)、保持SNM和写跳闸点(WTP);而研究的方面包括BTI对供电电压、电池强度和设计风格(6 vs 8晶体管电池)的影响。比较了FinFET和基于平面CMOS的SRAM电池的退化。基于FinFET的电池在额定Vdd下运行108秒的仿真表明,Read SNM的退化率为16.72%,比hold SNM快1.17倍,而WTP提高了6.82%。此外,电源电压增加25%可使Read SNM退化降低40%,而将电池下拉晶体管加强1.5倍仅可使退化降低22%。结果表明,8T电池的降解速度比6T电池快1.31倍,而FinFET电池比平面CMOS电池更容易受到BTI降解的影响(~2倍)。
{"title":"Bias Temperature Instability analysis of FinFET based SRAM cells","authors":"Seyab Khan, I. Agbo, S. Hamdioui, H. Kukner, B. Kaczer, P. Raghavan, F. Catthoor","doi":"10.7873/DATE.2014.044","DOIUrl":"https://doi.org/10.7873/DATE.2014.044","url":null,"abstract":"Bias Temperature Instability (BTI) is posing a major reliability challenge for today's and future semiconductor devices as it degrades their performance. This paper provides a comprehensive BTI impact analysis, in terms of time-dependent degradation, of FinFET based SRAM cell. The evaluation metrics are read Static Noise Margin (SNM), hold SNM and Write Trip Point (WTP); while the aspects investigated include BTI impact dependence on the supply voltage, cell strength, and design styles (6 versus 8 Transistors cell). A comparison between FinFET and planar CMOS based SRAM cells degradation is also covered. The simulation performed on FinFET based cells for 108 seconds of operation under nominal Vdd show that Read SNM degradation is 16.72%, which is 1.17× faster than hold SNM, while WTP improves by 6.82%. In addition, a supply voltage increment of 25% reduces the Read SNM degradation by 40%, while strengthening the cell pull-down transistors by 1.5× reduces the degradation by only 22%. Moreover, the results reveal that 8T cell degrades 1.31× faster than 6T cell, and that FinFET cells are more vulnerable (~2×) to BTI degradation than planar CMOS cells.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"12 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79858452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Ramini, Alberto Ghiribaldi, P. Grani, S. Bartolini, H. Tatenguem, D. Bertozzi
Many crossbenchmarking results reported in the open literature raise optimistic expectations on the use of optical networks-on-chip (ONoCs) for high-performance and low-power on-chip communication. However, most of those previous works ultimately fail to make a compelling case for chip-level nanophotonic NoCs, especially for the lack of aggressive electronic baselines (ENoC), and the poor accuracy in physical- and architecture-layer analysis of the ONoC. This paper aims at providing the guidelines and minimum requirements so that nanophotonic emerging technology may become of practical relevance. The key differentiating factor of this work consists of contrasting ONoC solutions with an aggressive ENoC architecture with realistic complexity, performance, and power figures, synthesized on an industrial 40nm low-power technology. At the same time, key physical design issues and network interface architecture requirements for the ONoC under test are carefully assessed, thus paving the way for a well-grounded definition of the requirements for the emerging ONoC technology to achieve the energy break-even point with respect to pure electronic interconnect solutions in future multi- and many-core systems.
{"title":"Assessing the energy break-even point between an optical NoC architecture and an aggressive electronic baseline","authors":"L. Ramini, Alberto Ghiribaldi, P. Grani, S. Bartolini, H. Tatenguem, D. Bertozzi","doi":"10.7873/DATE.2014.321","DOIUrl":"https://doi.org/10.7873/DATE.2014.321","url":null,"abstract":"Many crossbenchmarking results reported in the open literature raise optimistic expectations on the use of optical networks-on-chip (ONoCs) for high-performance and low-power on-chip communication. However, most of those previous works ultimately fail to make a compelling case for chip-level nanophotonic NoCs, especially for the lack of aggressive electronic baselines (ENoC), and the poor accuracy in physical- and architecture-layer analysis of the ONoC. This paper aims at providing the guidelines and minimum requirements so that nanophotonic emerging technology may become of practical relevance. The key differentiating factor of this work consists of contrasting ONoC solutions with an aggressive ENoC architecture with realistic complexity, performance, and power figures, synthesized on an industrial 40nm low-power technology. At the same time, key physical design issues and network interface architecture requirements for the ONoC under test are carefully assessed, thus paving the way for a well-grounded definition of the requirements for the emerging ONoC technology to achieve the energy break-even point with respect to pure electronic interconnect solutions in future multi- and many-core systems.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"82 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80327637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Sabry, A. Sridhar, David Atienza Alonso, P. Ruch, B. Michel
The soaring demand for computing power in our digital information age has produced, as an undesirable side-effect, a surge in power consumption and heat density for Multiprocessors Systems-on-Chip (MPSoCs). The resulting temperature rise results in operating conditions that already preclude operating all the cores at maximum performance levels, in order to prevent system overheating and failures. With more power demands, MPSoCs will face a power delivery wall due to the reliability limitations of the underlying power delivery medium. Thus, state-of-the-art power and cooling delivery solutions are reaching their performance limits and it will no longer be possible to power up simultaneously all the available on-chip cores (situation known as dark silicon). In this paper we investigate a recently proposed disruptive approach to overcome the prevailing worst-case power and cooling provisioning paradigms for MPSoCs. This proposed approach integrates MPSoC with an on-chip microfluidic fuel cell network for joint cooling and power supply (i.e., localized power generation and delivery). By providing alternative means to power delivery integrated with cooling, MPSoCs are expected to gain in I/O connectivity. Based on this disruptive technology, we can envision the removal of the current limits of power delivery and heat dissipation in MPSoC designs, subsequently avoiding dark silicon and enabling a paradigm shift in future energy-proportional computing architecture designs.
{"title":"Integrated microfluidic power generation and cooling for bright silicon MPSoCs","authors":"M. Sabry, A. Sridhar, David Atienza Alonso, P. Ruch, B. Michel","doi":"10.7873/DATE.2014.147","DOIUrl":"https://doi.org/10.7873/DATE.2014.147","url":null,"abstract":"The soaring demand for computing power in our digital information age has produced, as an undesirable side-effect, a surge in power consumption and heat density for Multiprocessors Systems-on-Chip (MPSoCs). The resulting temperature rise results in operating conditions that already preclude operating all the cores at maximum performance levels, in order to prevent system overheating and failures. With more power demands, MPSoCs will face a power delivery wall due to the reliability limitations of the underlying power delivery medium. Thus, state-of-the-art power and cooling delivery solutions are reaching their performance limits and it will no longer be possible to power up simultaneously all the available on-chip cores (situation known as dark silicon). In this paper we investigate a recently proposed disruptive approach to overcome the prevailing worst-case power and cooling provisioning paradigms for MPSoCs. This proposed approach integrates MPSoC with an on-chip microfluidic fuel cell network for joint cooling and power supply (i.e., localized power generation and delivery). By providing alternative means to power delivery integrated with cooling, MPSoCs are expected to gain in I/O connectivity. Based on this disruptive technology, we can envision the removal of the current limits of power delivery and heat dissipation in MPSoC designs, subsequently avoiding dark silicon and enabling a paradigm shift in future energy-proportional computing architecture designs.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"11 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81865451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale cache-coherent systems often impose unnecessary overhead on data that is thread-private for the whole of its lifetime. These include resources devoted to tracking the coherence state of the data, as well as unnecessary coherence messages sent out over the interconnect. In this paper we show how the memory allocation strategy for non-uniform memory access (NUMA) systems can be exploited to remove any coherence-related traffic for thread-local data, as well removing the need to track those cache lines in sparse directories. Our strategy is to allocate directory state only on a miss from a node in a different affinity domain from the directory. We call this ALLocAte on Remote Miss, or ALLARM. Our solution is entirely backward compatible with existing operating systems and software, and provides a means to scale cache coherence into the many-core era. On a mix of SPLASH2 and Parsec workloads, ALLARM is able to improve performance by 13% on average while reducing dynamic energy consumption by 9% in the on-chip network and 15% in the directory controller. This is achieved through a 46% reduction in the number of sparse directory entries evicted.
{"title":"ALLARM: Optimizing sparse directories for thread-local data","authors":"Amitabha Roy, Timothy M. Jones","doi":"10.7873/DATE.2014.091","DOIUrl":"https://doi.org/10.7873/DATE.2014.091","url":null,"abstract":"Large-scale cache-coherent systems often impose unnecessary overhead on data that is thread-private for the whole of its lifetime. These include resources devoted to tracking the coherence state of the data, as well as unnecessary coherence messages sent out over the interconnect. In this paper we show how the memory allocation strategy for non-uniform memory access (NUMA) systems can be exploited to remove any coherence-related traffic for thread-local data, as well removing the need to track those cache lines in sparse directories. Our strategy is to allocate directory state only on a miss from a node in a different affinity domain from the directory. We call this ALLocAte on Remote Miss, or ALLARM. Our solution is entirely backward compatible with existing operating systems and software, and provides a means to scale cache coherence into the many-core era. On a mix of SPLASH2 and Parsec workloads, ALLARM is able to improve performance by 13% on average while reducing dynamic energy consumption by 9% in the on-chip network and 15% in the directory controller. This is achieved through a 46% reduction in the number of sparse directory entries evicted.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"8 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78671289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Approximate circuits have been considered for error-tolerant applications that can tolerate some loss of accuracy with improved performance and energy efficiency. Multipliers are key arithmetic circuits in many such applications such as digital signal processing (DSP). In this paper, a novel approximate multiplier with a lower power consumption and a shorter critical path than traditional multipliers is proposed for high-performance DSP applications. This multiplier leverages a newly-designed approximate adder that limits its carry propagation to the nearest neighbors for fast partial product accumulation. Different levels of accuracy can be achieved through a configurable error recovery by using different numbers of most significant bits (MSBs) for error reduction. The approximate multiplier has a low mean error distance, i.e., most of the errors are not significant in magnitude. Compared to the Wallace multiplier, a 16-bit approximate multiplier implemented in a 28nm CMOS process shows a reduction in delay and power of 20% and up to 69%, respectively. It is shown that by utilizing an appropriate error recovery, the proposed approximate multiplier achieves similar processing accuracy as traditional exact multipliers but with significant improvements in power and performance.
{"title":"A low-power, high-performance approximate multiplier with configurable partial error recovery","authors":"Cong Liu, Jie Han, F. Lombardi","doi":"10.7873/DATE.2014.108","DOIUrl":"https://doi.org/10.7873/DATE.2014.108","url":null,"abstract":"Approximate circuits have been considered for error-tolerant applications that can tolerate some loss of accuracy with improved performance and energy efficiency. Multipliers are key arithmetic circuits in many such applications such as digital signal processing (DSP). In this paper, a novel approximate multiplier with a lower power consumption and a shorter critical path than traditional multipliers is proposed for high-performance DSP applications. This multiplier leverages a newly-designed approximate adder that limits its carry propagation to the nearest neighbors for fast partial product accumulation. Different levels of accuracy can be achieved through a configurable error recovery by using different numbers of most significant bits (MSBs) for error reduction. The approximate multiplier has a low mean error distance, i.e., most of the errors are not significant in magnitude. Compared to the Wallace multiplier, a 16-bit approximate multiplier implemented in a 28nm CMOS process shows a reduction in delay and power of 20% and up to 69%, respectively. It is shown that by utilizing an appropriate error recovery, the proposed approximate multiplier achieves similar processing accuracy as traditional exact multipliers but with significant improvements in power and performance.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"137 1","pages":"1-4"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86357256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As cloud computing becomes mainstream, the need to ensure the privacy of the data entrusted to third parties keeps rising. Cloud providers resort to numerous security controls and encryption to thwart potential attackers. Still, since the actual computation inside cloud microprocessors remains unencrypted, the opportunity of leakage is theoretically possible. Therefore, in order to address the challenge of protecting the computation inside the microprocessor, we introduce a novel general purpose architecture for secure data processing, called HEROIC (Homomorphically EncRypted One Instruction Computer). This new design utilizes a single instruction architecture and provides native processing of encrypted data at the architecture level. The security of the solution is assured by a variant of Paillier's homomorphic encryption scheme, used to encrypt both instructions and data. Experimental results using our hardware-cognizant software simulator, indicate an average execution overhead between 5 and 45 times for the encrypted computation (depending on the security parameter), compared to the unencrypted variant, for a 16-bit single instruction architecture.
{"title":"HEROIC: Homomorphically EncRypted One Instruction Computer","authors":"N. G. Tsoutsos, M. Maniatakos","doi":"10.7873/DATE2014.259","DOIUrl":"https://doi.org/10.7873/DATE2014.259","url":null,"abstract":"As cloud computing becomes mainstream, the need to ensure the privacy of the data entrusted to third parties keeps rising. Cloud providers resort to numerous security controls and encryption to thwart potential attackers. Still, since the actual computation inside cloud microprocessors remains unencrypted, the opportunity of leakage is theoretically possible. Therefore, in order to address the challenge of protecting the computation inside the microprocessor, we introduce a novel general purpose architecture for secure data processing, called HEROIC (Homomorphically EncRypted One Instruction Computer). This new design utilizes a single instruction architecture and provides native processing of encrypted data at the architecture level. The security of the solution is assured by a variant of Paillier's homomorphic encryption scheme, used to encrypt both instructions and data. Experimental results using our hardware-cognizant software simulator, indicate an average execution overhead between 5 and 45 times for the encrypted computation (depending on the security parameter), compared to the unencrypted variant, for a 16-bit single instruction architecture.","PeriodicalId":6550,"journal":{"name":"2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"14 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2014-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82808956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}