Computational bit-width allocation for operations in vector calculus
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413121
A. Kinsman, N. Nicolici
Automated bit-width allocation is a key step in the design of hardware accelerators. The application of computational methods based on SAT-Modulo Theory (SMT) to the problem of finite-precision bit-width allocation has recently been shown to overcome challenges faced by the known art, particularly in the scientific computing domain. However, many such real-life applications are specified in terms of vectors and matrices, and expanding them into scalar equations renders the problem infeasible. This paper proposes a framework that incorporates operations from vector calculus, thus enabling the approach to tackle applications of practically relevant complexity.
{"title":"Computational bit-width allocation for operations in vector calculus","authors":"A. Kinsman, N. Nicolici","doi":"10.1109/ICCD.2009.5413121","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413121","url":null,"abstract":"Automated bit-width allocation is a key step required for the design of hardware accelerators. The use of computational methods based on SAT-Modulo Theory to the problem of finite-precision bit-width allocation has recently been shown to overcome challenges faced by the known-art, particularly in the scientific computing domain. However, many such real-life applications are specified in terms of vectors and matrices and they are rendered infeasible by expansion into scalar equations. This paper proposes a framework to include operations from vector calculus and thus it enables tackling applications of practically relevant complexity.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131475792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using checksum to reduce power consumption of display systems for low-motion content
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413176
Kyungtae Han, Zhen Fang, Paul Diefenbaugh, Richard Forand, R. Iyer, D. Newell
Power consumption of the display subsystem has been relatively less explored than that of other components of a mobile device, such as the computing, storage, and networking units, even though the display often constitutes one of the most power-hungry portions of the system. Typical applications on a mobile device, such as web browsing and text editing, tend to have rather static image content: each frame hardly changes from the previous one. Efficiently detecting and handling no-motion scenarios is thus critical to extending battery life. This paper focuses on image change detection. We propose using checksums to detect image changes. Specifically, CRC hardware is used to optimize the power consumption of 1) refresh of a local display and 2) data compression for a wireless remote display. Compared with a traditional pixel-by-pixel comparison approach, using a checksum for image change detection is not only fast but also reduces accesses to the frame buffer, resulting in significant power savings. We have built an FPGA prototype to verify that CRC can capture image changes well enough to ensure “visually lossless” quality.
{"title":"Using checksum to reduce power consumption of display systems for low-motion content","authors":"Kyungtae Han, Zhen Fang, Paul Diefenbaugh, Richard Forand, R. Iyer, D. Newell","doi":"10.1109/ICCD.2009.5413176","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413176","url":null,"abstract":"Power consumption of the display subsytem has been a relatively less explored area compared to other components of a mobile device including computing, storage, and networking units, although the former often constitutes one of the most power-hungry portions of the system. Typical applications on a mobile device such as web browsing and text editing tend to have rather static image content; each frame hardly changes from the previous one. Efficiently detecting and handling no-motion scenarios is thus critical to extend the battery life. This paper focuses on image change detection. We propose to use checksum to detect image changes. Specifically, CRC hardware is used to optimize the power consumption of 1) refresh of a local display and 2) data compression for wireless remote display. Compared with a traditional, pixel-by-pixel comparison approach, using checksum for image change detection is not only fast, but also reduces accesses to the frame buffer, resulting in significant power savings. We have built a FPGA prototype to verify that CRC can capture image changes well enough to ensure a “visually lossless” quality.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133932528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quality improvement and cost reduction using statistical outlier methods
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413175
A. Nahar, K. Butler, J. Carulli, Charles Weinberger
Quality improvement and cost reduction in the overall IC manufacturing and test processes are continuously sought. Outlier screening methods can address both of these needs. As technology scales, it has become increasingly difficult to screen outliers without excessive Type I or Type II errors. Hundreds of parameters are collected at wafer probe, but a systematic way of selecting outlier screens has been lacking. In this paper we describe a statistical approach to both identify outliers and select beneficial screening parameters more effectively. Results from applying the approach to reduce burn-in failures on a 90 nm design are described.
{"title":"Quality improvement and cost reduction using statistical outlier methods","authors":"A. Nahar, K. Butler, J. Carulli, Charles Weinberger","doi":"10.1109/ICCD.2009.5413175","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413175","url":null,"abstract":"Quality improvement and cost reduction in the overall IC manufacturing and test processes are being continuously sought. Outlier screening methods can address both of these needs. As technology scales, it has become increasingly difficult to screen outliers without excessive Type I or II errors. Hundreds of parameters are collected at wafer probe, but there lacks a systematic way of selecting outlier screens. In this paper we describe a statistical approach to both identify outliers and select beneficial screening parameters more effectively. Results on a 90nm design to reduce the burn-in fails are described.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132737819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413143
Jiayuan Meng, K. Skadron
Without high-bandwidth broadcast, large numbers of cores require a scalable point-to-point interconnect and a directory protocol. In such cases, a shared, inclusive last-level cache (LLC) can improve data sharing and avoid three-way communication for shared reads. However, if inclusion encompasses thread-private data, two problems arise with the shared LLC. First, current memory allocators align stack bases on page boundaries, which emerges as a source of severe conflict misses for large numbers of threads in data-parallel applications. Second, correctness does not require the private data to reside in the shared directory or the LLC. This paper advocates stack-base randomization, which eliminates the major source of conflict misses for large numbers of threads. However, when capacity becomes a limitation for the directory or last-level cache, this is not sufficient. We then propose a non-inclusive, semi-coherent cache organization (NISC) that removes the requirement for inclusion of private data and reduces capacity misses. Our data-parallel benchmarks show that these limitations prevent scaling beyond 8 cores, while our techniques allow scaling to at least 32 cores for most benchmarks. At 8 cores, stack-base randomization provides a mean speedup of 1.2X, and at 32 cores it gives a speedup of 2.7X over the best baseline configuration. Compared to conventional performance with a 2 MB LLC, our technique achieves similar performance with a 256 KB LLC, suggesting that LLCs may be typically overprovisioned. When very limited LLC resources are available, NISC can further improve system performance by 1.8X.
{"title":"Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling","authors":"Jiayuan Meng, K. Skadron","doi":"10.1109/ICCD.2009.5413143","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413143","url":null,"abstract":"Without high-bandwidth broadcast, large numbers of cores require a scalable point-to-point interconnect and a directory protocol. In such cases, a shared, inclusive last level cache (LLC) can improve data sharing and avoid three-way communication for shared reads. However, if inclusion encompasses thread-private data, two problems arise with the shared LLC. First, current memory allocators align stack bases on page boundaries, which emerges as a source of severe conflict misses for large numbers of threads on data-parallel applications. Second, correctness does not require the private data to reside in the shared directory or the LLC. This paper advocates stack-base randomization that eliminates the major source of conflict misses for large numbers of threads. However, when capacity becomes a limitation for the directory or last-level cache, this is not sufficient. We then propose non-inclusive, semi-coherent cache organization (NISC) that removes the requirement for inclusion of private data and reduces capacity misses. Our data-parallel benchmarks show that these limitations prevent scaling beyond 8 cores, while our techniques allow scaling to at least 32 cores for most benchmarks. At 8 cores, stack randomization provides a mean speedup of 1.2X, but stack randomization with 32 cores gives a speedup of 2.7X over the best baseline configuration. Comparing to conventional performance with a 2 MB LLC, our technique achieves similar performance with a 256 KB LLC, suggesting LLCs may be typically overprovisioned. When very limited LLC resources are available, NISC can further improve system performance by 1.8X.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114554482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic synthesis of computation interference constraints for relative timing verification
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413183
Yang Xu, K. Stevens
Asynchronous sequential circuit or protocol design requires formal verification to ensure correct behavior under all operating conditions. However, most asynchronous circuits or protocols cannot be proven conformant to a specification without adding timing assumptions. Relative Timing (RT) is an approach to modeling and verifying circuits and protocols that require timing assumptions to operate correctly. The process of creating path-based RT constraints has previously been done by hand with the aid of a formal verification engine. This time-consuming and error-prone method vastly restricts the application of RT and the capability to implement circuits and protocols. This paper describes an algorithm for automatic generation of RT constraints based on signal traces produced by a formal verification (FV) engine that supports relative timing constraints. The algorithm has been implemented in a CAD tool called Automatic Relative Timing Identifier based on Signal Traces (ARTIST), which has been embedded into the FV engine. A set of asynchronous and clocked designs and protocols has been verified and proven hazard-free with the RT constraints generated by ARTIST, a task that would have taken months to perform by hand. A comparison between hand-generated and ARTIST-generated RT constraints in terms of efficiency and quality is also presented.
{"title":"Automatic synthesis of computation interference constraints for relative timing verification","authors":"Yang Xu, K. Stevens","doi":"10.1109/ICCD.2009.5413183","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413183","url":null,"abstract":"Asynchronous sequential circuit or protocol design requires formal verification to ensure correct behavior under all operating conditions. However, most asynchronous circuits or protocols cannot be proven conformant to a specification without adding timing assumptions. Relative Timing (RT) is an approach to model and verify circuits and protocols that require timing assumptions to operate correctly. The process of creating path-based RT constraints has previously been done by hand with the aid of a formal verification engine. This time consuming and error prone method vastly restricts the application of RT and the capability to implement circuits and protocols. This paper describes an algorithm for automatic generation of RT constraints based on signal traces generated from a formal verification (FV) engine that supports relative timing constraints. This algorithm has been implemented in a CAD tool called Automatic Relative Timing Identifier based on Signal Traces (ARTIST) which has been embedded into the FV engine. A set of asynchronous and clocked designs and protocols have been verified and proven to be hazard-free with the RT constraints generated by ARTIST which would have taken months to perform by hand. A comparison of RT constraints between hand-generated and ARTIST generated constraints is also described in terms of efficiency and quality.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124600209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LRU-PEA: A smart replacement policy for non-uniform cache architectures on chip multiprocessors
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413142
Javier Lira, Carlos Molina, Antonio González
The increasing speed gap between processor and memory and the limited memory bandwidth make last-level cache performance crucial for CMP architectures. Non-uniform cache architectures (NUCA) have been introduced to deal with this problem. This memory organization divides the whole memory space into smaller pieces, or banks, allowing nearer banks to have better access latencies than farther banks. Moreover, an adaptive replacement policy that efficiently reduces misses in the last-level cache could boost performance, particularly if set associativity is adopted. Unfortunately, traditional replacement policies do not behave properly, as they were designed for single-processor systems. This paper focuses on bank replacement. This policy involves three key decisions when there is a miss: where to place a data block within the cache set, which data to evict from the cache set, and finally, where to place the evicted data. We propose a novel replacement technique that enables more intelligent replacement decisions. The technique is based on the observation that some types of data are less commonly accessed depending on which bank they reside in. We call this technique LRU-PEA (Least Recently Used with a Priority Eviction Approach). We show that the proposed technique significantly reduces requests to the off-chip memory by increasing the hit ratio in the NUCA cache. This translates into an average IPC improvement of 8% and an energy-per-instruction (EPI) reduction of 5%.
{"title":"LRU-PEA: A smart replacement policy for non-uniform cache architectures on chip multiprocessors","authors":"Javier Lira, Carlos Molina, Antonio González","doi":"10.1109/ICCD.2009.5413142","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413142","url":null,"abstract":"The increasing speed-gap between processor and memory and the limited memory bandwidth make last-level cache performance crucial for CMP architectures. non uniform cache architectures (NUCA) have been introduced to deal with this problem. This memory organization divides the whole memory space into smaller pieces or banks allowing nearer banks to have better access latencies than further banks. Moreover, an adaptive replacement policy that efficiently reduces misses in the last-level cache could boost performance, particularly if set associativity is adopted. Unfortunately, traditional replacement policies do not behave properly as they were designed for single-processors. This paper focuses on bank replacement. This policy involves three key decisions when there is a miss: where to place a data block within the cache set, which data to evict from the cache set and finally, where to place the evicted data. We propose a novel replacement technique that enables more intelligent replacement decisions to be taken. This technique is based on the observation that some types of data are less commonly accessed depending on which bank they reside in. We call this technique LRU-PEA (least recently used with a priority eviction approach). We show that the proposed technique significantly reduces the requests to the off-chip memory by increasing the hit ratio in the NUCA cache. This translates into an average IPC improvement of 8% and into an Energy per Instruction (EPI) reduction of 5%.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123332224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D simulation and analysis of the radiation tolerance of voltage scaled digital circuit
Pub Date: 2009-10-04 | DOI: 10.1109/ICCD.2009.5413111
Rajesh Garg, S. Khatri
In recent times, dynamic supply voltage scaling (DVS) has been extensively employed to minimize the power and energy of VLSI systems. Sub-threshold circuits are also becoming more popular. At the same time, the reliability of VLSI systems under Single Event Upsets (SEUs) has become a major concern; SEUs are problematic even for circuits operating at nominal voltages. With the increasing demand for low-power, reliable systems, it is therefore necessary to harden DVS and sub-threshold circuits efficiently. In this paper, we perform 3D simulations of radiation particle strikes in an inverter implemented using DVS and sub-threshold design. We analyze the sensitivity of the inverter to radiation particle strikes by varying the inverter size, the inverter load, the supply voltage (VDD), and the energy of the radiation particles. From these 3D simulations, we make several observations that are important to consider during radiation hardening of DVS and sub-threshold circuits. Based on these observations, we propose several guidelines for radiation hardening of DVS and sub-threshold circuit designs. These guidelines suggest that traditional radiation hardening approaches need to be revisited for DVS and sub-threshold designs. We also propose a charge collection model for DVS circuits. Our model accurately estimates (with an average error of 6.3%) the charge collected at the output of a gate for different supply voltages and different gate sizes under medium- and high-energy particle strikes. The parameters of our charge collection model can be included in the SPICE model cards of transistors to improve the accuracy of SPICE-based radiation simulations for DVS circuits.
{"title":"3D simulation and analysis of the radiation tolerance of voltage scaled digital circuit","authors":"Rajesh Garg, S. Khatri","doi":"10.1109/ICCD.2009.5413111","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413111","url":null,"abstract":"In recent times, dynamic supply voltage scaling (DVS) has been extensively employed to minimize the power and energy of VLSI systems. Also, sub-threshold circuits are becoming more popular. At the same time, the reliability of VLSI systems has become a major concern under Single Event Upsets (SEUs). SEUs are very problematic even for circuits operating at nominal voltages. With the increasing demand for low power reliable systems, it is therefore necessary to harden DVS and sub-threshold circuits efficiently. In this paper, we perform 3D simulations of radiation particle strikes in an inverter implemented using DVS and sub-threshold design. We analyze the sensitivity of the inverter to radiation particle strikes by varying the inverter size, the inverter load, the supply voltage (VDD) and the energy of the radiation particles. From these 3D simulations, we make several observations which are important to consider during radiation hardening of DVS and sub-threshold circuits. Based on these observations, we propose several guidelines for radiation hardening of DVS and sub-threshold circuit designs. These guidelines suggest that the traditional radiation hardening approaches need to be revisited for DVS and sub-threshold designs. We also propose a charge collection model for DVS circuits. Our model can accurately estimate (with an average error of 6.3%) the charge collected at the output of a gate for different supply voltages and different gate sizes for medium and high energy particle strikes. The parameters of our charge collection model can be included in SPICE model cards of transistors, to improve the accuracy of SPICE based radiation simulations for DVS circuits.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126286039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On improving the algorithmic robustness of a low-power FIR filter
Pub Date: 2009-10-01 | DOI: 10.1109/ICCD.2009.5413126
Sourabh Khire, S. Mukhopadhyay
Voltage scaling is a promising approach to reducing the power consumption of signal processing circuits. However, aggressive voltage scaling can introduce errors in the output signal, degrading the algorithmic performance of the circuit. We consider the specific case of the finite impulse response (FIR) filter and identify two sources of errors arising from voltage scaling: (a) errors introduced by increased delay along the logic path and (b) errors caused by failures in the memory due to process variations. We design an FIR filter that uses a simple feedback-based approach to reduce the memory errors and a linear predictor structure to correct the logic errors. The proposed filter is more robust to both logic and memory errors caused by voltage scaling. The results show a considerable improvement in the output signal-to-noise ratio (at least around 10 dB) for a probability of error (Perr) as high as 0.5. We also apply the proposed technique to an image filtering application and observe a considerable improvement in the visual quality of the output image, along with an improvement of over 10 dB in the peak signal-to-noise ratio for Perr as high as 0.5.
{"title":"On improving the algorithmic robustness of a low-power FIR filter","authors":"Sourabh Khire, S. Mukhopadhyay","doi":"10.1109/ICCD.2009.5413126","DOIUrl":"https://doi.org/10.1109/ICCD.2009.5413126","url":null,"abstract":"Voltage scaling is a promising approach to reduce the power consumption in signal processing circuits. However aggressive voltage scaling can introduce errors in the output signal, thus degrading the algorithmic performance of the circuit. We consider the specific case of the finite impulse response (FIR) filter, and identify two different sources of errors occurring due to voltage scaling: (a) errors introduced because of increased delay along the logic path and (b) errors caused by failures in the memory due to process variations. We design a FIR filter by using a simple feedback based approach to reduce the memory errors and a linear predictor structure for correcting the logic errors. The proposed filter is more robust to both logic and memory errors caused by voltage scaling. The results show a considerable improvement in the output Signal to Noise ratio (at least around 10 dB) for a probability of error (Perr) even as high as 0.5. We also utilize the proposed technique for an image filtering application and observe a considerable improvement in the visual quality of the output image along with an improvement of over 10 dB in the Peak Signal to Noise ratio for Perr as high as 0.5.","PeriodicalId":256908,"journal":{"name":"2009 IEEE International Conference on Computer Design","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130385411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}