This paper presents a novel digitally assisted automatic frequency-tuning technique. The self-calibration technique is verified on a 4th-order biquad baseband low-pass filter in 130 nm CMOS with a 20 MHz cut-off frequency, which satisfies typical LTE receiver specifications. The proposed tuning method includes hardware-reduction methods, coherent sampling, and a magnitude calculator using the "alpha max plus beta min" algorithm, which together reduce chip area significantly with negligible accuracy degradation. The cut-off frequency is tunable from 16.2 MHz to 24.4 MHz, and the tuning error is less than 0.4% over the whole tuning range. The estimated area is 0.027 mm² at 80% device density, and the power dissipation is 0.16 mW at a 128 MHz clock with a 1.2 V supply.
{"title":"A novel mixed-signal self-calibration technique for baseband filters in systems-on-chip mobile transceivers","authors":"Yongsuk Choi, Yong-Bin Kim","doi":"10.1145/2591513.2591522","DOIUrl":"https://doi.org/10.1145/2591513.2591522","journal":{"name":"ACM Great Lakes Symposium on VLSI"},"publicationDate":"2014-05-20"}
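The "alpha max plus beta min" magnitude calculator named in the abstract above approximates sqrt(I² + Q²) with one comparison, two multiplications, and one addition, which is why it saves area compared with a true square root. The following is a minimal Python sketch of the general technique, not the authors' circuit. The coefficient pairs are assumptions: the abstract does not state which values the design uses. The pair (1, 1/2) is a common shift-friendly choice, and the default pair is a commonly quoted setting that minimizes the worst-case error.

```python
import math


def alpha_max_beta_min(i, q, alpha=0.96043387, beta=0.397824735):
    """Estimate sqrt(i*i + q*q) without a square root.

    alpha and beta are illustrative assumptions, not values from the paper.
    """
    a, b = abs(i), abs(q)
    return alpha * max(a, b) + beta * min(a, b)


def worst_relative_error(alpha, beta, steps=3600):
    """Sweep a unit vector around the circle and return the largest |estimate - 1|."""
    worst = 0.0
    for k in range(steps):
        theta = 2.0 * math.pi * k / steps
        est = alpha_max_beta_min(math.cos(theta), math.sin(theta), alpha, beta)
        worst = max(worst, abs(est - 1.0))
    return worst


# Shift-friendly coefficients: beta = 1/2 is a single right shift in hardware.
estimate_3_4 = alpha_max_beta_min(3, 4, alpha=1.0, beta=0.5)  # true magnitude is 5
```

With the shift-friendly pair the worst-case error is just under 12%, while the default pair stays below 4%. Which trade-off counts as "negligible" depends on the tuning loop, which the abstract does not detail.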
F. Pereira, A. Soares, A. Susin, A. Bonatto, M. Negreiros
This paper presents a resource-optimized hardware solution for the H.264 8x8 inverse transform. Row/column decomposition is used, arithmetic units are reused, and the transpose memory is replaced by a shift register. The architecture performs the 8x8 integer transform in 144 cycles with as few as 431 LUTs on a Xilinx Virtex-6 FPGA at 16-bit resolution. Enabling the module to process all inverse transforms in H.264 increases the LUT count to 681. When used to calculate all transforms for H.264 video, the design supports resolutions up to 1280x720 at 30 fps when running at 84 MHz.
{"title":"H.264 8x8 inverse transform architecture optimization","authors":"F. Pereira, A. Soares, A. Susin, A. Bonatto, M. Negreiros","doi":"10.1145/2591513.2591564","DOIUrl":"https://doi.org/10.1145/2591513.2591564","journal":{"name":"ACM Great Lakes Symposium on VLSI"},"publicationDate":"2014-05-20"}
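The row/column decomposition mentioned in the abstract above rests on the fact that a separable 2-D transform T·X·Tᵀ can be computed by running the same 1-D transform over every row, transposing, and running it again. The Python sketch below illustrates only that identity. The matrix `PLACEHOLDER_T` is an arbitrary integer matrix chosen for illustration and is not the H.264 coefficient set, and the explicit `transpose` call stands in for whatever storage the hardware uses between the two passes.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product, used only as the reference result."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def transform_1d(t, vec):
    """Apply the 1-D transform t to one row or column vector."""
    return [sum(t[k][n] * vec[n] for n in range(len(vec))) for k in range(len(t))]


def transform_2d_row_column(t, block):
    """Compute t * block * t^T with two passes of the same 1-D unit."""
    rows_done = [transform_1d(t, row) for row in block]   # pass 1: rows
    columns = transpose(rows_done)                        # columns become rows
    cols_done = [transform_1d(t, col) for col in columns] # pass 2: columns
    return transpose(cols_done)


# Hypothetical 8x8 data: NOT the H.264 matrix, just deterministic integers.
PLACEHOLDER_T = [[((3 * k + 5 * n) % 7) - 3 for n in range(8)] for k in range(8)]
BLOCK = [[(r * 8 + c) % 11 for c in range(8)] for r in range(8)]

decomposed = transform_2d_row_column(PLACEHOLDER_T, BLOCK)
reference = matmul(matmul(PLACEHOLDER_T, BLOCK), transpose(PLACEHOLDER_T))
```

Because both passes call the same `transform_1d`, a single 1-D arithmetic unit can be shared between them, which matches the reuse of arithmetic units that the abstract describes.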
Predictors are used in many fields of computer architecture to enhance performance. With good estimates of future system behaviour, policies can be developed to improve system performance or reduce power consumption. These policies become more effective if the predictors are implemented in hardware and provide quantified forecasts rather than only binary ones. In this paper, we present and evaluate a generic predictor, implemented in VHDL and running on an FPGA, that produces quantified forecasts. A complete scalability analysis shows that our implementation has a maximum device utilization of less than 5%. We also analyse the power consumption of the predictor running on an FPGA and show that the implementation can be clocked at over 210 MHz. Finally, we evaluate a power-saving policy based on our hardware predictor. Using predicted idle periods, this policy selects power-saving modes and reduces memory power consumption by 14.3%.
{"title":"A generic implementation of a quantified predictor on FPGAs","authors":"G. Thomas, A. Elhossini, B. Juurlink","doi":"10.1145/2591513.2591517","DOIUrl":"https://doi.org/10.1145/2591513.2591517","journal":{"name":"ACM Great Lakes Symposium on VLSI"},"publicationDate":"2014-05-20"}
Bufferless NoC routers employing deflection routing are gaining popularity due to their power and area efficiency. We propose WeDBless, a bufferless deflection router that reduces the deflection rate of flits by allocating ports based on the weighted deflection of flits. The proposed method directs frequently misrouted flits towards their destination by increasing their probability of getting a productive output port. Our evaluations on synthetic traffic patterns show that, compared to the state-of-the-art bufferless router, WeDBless achieves a significant reduction in deflection rate and average flit latency, an improved network saturation point, and reduced complexity in the route-computing logic.
{"title":"WeDBless: weighted deflection bufferless router for mesh NoCs","authors":"Simi Zerine Sleeba, John Jose, M. G. Mini","doi":"10.1145/2591513.2591559","DOIUrl":"https://doi.org/10.1145/2591513.2591559","journal":{"name":"ACM Great Lakes Symposium on VLSI"},"publicationDate":"2014-05-20"}
Z. Wang, Chao Chen, Piyush Sharma, A. Chattopadhyay
The power density of digital circuits has increased at an alarming rate in deep sub-micron CMOS technology, turning reliability into a serious design concern. At the same time, ever-growing task complexity under strict performance budgets has forced designers to adopt complex, heterogeneous MPSoCs as the implementation choice. Several commercial system-level design platforms currently exist for the design, exploration and implementation of MPSoCs. In this paper, we propose a system-level reliability exploration framework that extends a commercial system-level design flow. Using this framework, we design a heterogeneous MPSoC that can accept a custom mapping algorithm based on the MPSoC topology before the actual task deployment. The dynamic reliability-aware task management considers the desired reliability constraints of tasks as well as the reliability levels of the system components. We report our experimental findings using state-of-the-art benchmark applications.
{"title":"System-level reliability exploration framework for heterogeneous MPSoC","authors":"Z. Wang, Chao Chen, Piyush Sharma, A. Chattopadhyay","doi":"10.1145/2591513.2591519","DOIUrl":"https://doi.org/10.1145/2591513.2591519","journal":{"name":"ACM Great Lakes Symposium on VLSI"},"publicationDate":"2014-05-20"}
Dynamic Random Access Memories (DRAMs) are widely used in processor design. Different cells have been proposed in the past to overcome concerns associated with low retention time, performance degradation due to process variations, and susceptibility to soft errors. This paper proposes two novel DRAM cells (referred to as 4TI and 4T1D) that use gated-diode and forward body-biasing techniques to overcome these issues. The cell designs are evaluated by HSPICE simulation; different figures of merit (such as Read delay, Write delay, retention time, power dissipation, critical charge and layout area) are assessed, and the proposed cells are compared with existing cells. The 4TI cell achieves the best power dissipation, while the 4T1D achieves the best retention time, the highest critical charge and the lowest average Read delay. An extensive simulation-based evaluation of process variations, using static and Monte Carlo analysis, is also presented to confirm that the proposed cells are likely to be less affected by process variations (in threshold voltage and effective channel length) than the other cells found in the technical literature.
{"title":"New 4T-based DRAM cell designs","authors":"Wei Wei, K. Namba, F. Lombardi","doi":"10.1145/2591513.2591515","DOIUrl":"https://doi.org/10.1145/2591513.2591515","journal":{"name":"ACM Great Lakes Symposium on VLSI"},"publicationDate":"2014-05-20"}
A novel class of linear soft-input soft-output detectors featuring boosted communications performance is introduced. Compared to state-of-the-art linear detectors, the detector has an SNR gain of up to 2.4 dB. We briefly summarize the algorithm and sketch a suitable architecture. The corresponding ASIC implementation shows the feasibility and efficiency of the concept. It achieves the IEEE 802.11n standard's peak data rate of 600 Mbit/s.
{"title":"VLSI implementation of linear MIMO detection with boosted communications performance: extended abstract","authors":"Dominik Auras, D. Rieth, R. Leupers, G. Ascheid","doi":"10.1145/2591513.2591551","DOIUrl":"https://doi.org/10.1145/2591513.2591551","journal":{"name":"ACM Great Lakes Symposium on VLSI"},"publicationDate":"2014-05-20"}
Matheus T. Moreira, R. Guazzelli, G. Heck, Ney Laert Vilar Calazans
The correct functionality of quasi-delay-insensitive (QDI) asynchronous circuits can be jeopardized by the presence and propagation of transient faults. If these faults are latched, they corrupt data validity and can cause the whole circuit to stall, given the strict event-ordering constraints imposed by handshaking protocols. This is particularly concerning for the delay-insensitive minterm synthesis logic style, widely adopted by asynchronous designers to implement combinational QDI logic, because it makes extensive use of C-elements and these components are rather vulnerable to transient effects. This paper demonstrates that this logic style subjects C-elements to their most vulnerable states during operation. It accordingly proposes delay-insensitive maxterm synthesis as an alternative for hardening QDI circuits against transient faults. The latter is a logic style based on the return-to-one 4-phase protocol. Although this style also relies on extensive use of C-elements, it avoids the states in which these components are most vulnerable. Results show improvements of over 300% in C-element tolerance to transient faults in the best case.
{"title":"Hardening QDI circuits against transient faults using delay-insensitive maxterm synthesis","authors":"Matheus T. Moreira, R. Guazzelli, G. Heck, Ney Laert Vilar Calazans","doi":"10.1145/2591513.2591531","DOIUrl":"https://doi.org/10.1145/2591513.2591531","journal":{"name":"ACM Great Lakes Symposium on VLSI"},"publicationDate":"2014-05-20"}
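To make the role of C-elements in the abstract above concrete, the following Python sketch models the textbook behaviour of a C-element: the output rises when all inputs are 1, falls when all inputs are 0, and otherwise holds its previous value. The class name `CElement` and the glitch scenario are illustrative assumptions. The abstract does not say which states the paper identifies as most vulnerable, so the sketch only shows how a short pulse can be latched by the hold behaviour in general.

```python
class CElement:
    """Behavioural model of a Muller C-element (hypothetical helper class)."""

    def __init__(self, n_inputs=2, initial=0):
        self.n_inputs = n_inputs
        self.output = initial

    def update(self, *inputs):
        """Apply new input values and return the resulting output."""
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        if all(v == 1 for v in inputs):
            self.output = 1      # all inputs high: output rises
        elif all(v == 0 for v in inputs):
            self.output = 0      # all inputs low: output falls
        return self.output       # inputs disagree: previous value is held


c = CElement()
trace = [
    c.update(0, 0),  # idle, output 0
    c.update(1, 0),  # first input arrives, output still 0
    c.update(1, 1),  # transient pulse on the second input, output flips to 1
    c.update(1, 0),  # pulse is gone, but the wrong value is held
]
# trace == [0, 0, 1, 1]
```

The last step is the point of the example: once the pulse has flipped the output, the element keeps the erroneous value until all inputs return to 0, which is how a transient can turn into corrupted data in a handshaking circuit.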
Cong Xu, Dimin Niu, Yang Zheng, Shimeng Yu, Yuan Xie
Transition metal oxide (TMO) resistive random access memory (ReRAM) has been identified as one of the most promising candidates for next-generation non-volatile memory (NVM) technology. Numerous TMO ReRAMs based on different materials have been developed and demonstrate attractive characteristics, such as fast read/write speed, low power consumption, high integration density, and good scalability. Among these, the most attractive characteristic of ReRAM is its cross-point structure, which features a 4F² cell size. However, sneak current and the voltage drop along the wire resistance in a cross-point array introduce extra design challenges. In addition, a robust ReRAM design needs to deal with both soft and hard errors. In this paper, we summarize the mechanisms of both soft and hard errors in ReRAM cells and propose a unified model to characterize different failure behaviors. We quantitatively analyze the impact of cell failure modes on the reliability of the cross-point array. We also propose an error-resilient architecture that avoids unnecessary writes in the hard error detection unit. Experimental results show that our design can extend the lifetime of ReRAM by up to 75% over a design without hard error detection and by up to 12% over a design with a "write-verify" detection mechanism.
{"title":"Reliability-aware cross-point resistive memory design","authors":"Cong Xu, Dimin Niu, Yang Zheng, Shimeng Yu, Yuan Xie","doi":"10.1145/2591513.2591528","DOIUrl":"https://doi.org/10.1145/2591513.2591528","journal":{"name":"ACM Great Lakes Symposium on VLSI"},"publicationDate":"2014-05-20"}
Due to embedded systems' stringent design constraints, much prior work has focused on optimizing energy consumption and/or performance. However, since embedded systems have fewer cooling options, rising temperature, and thus temperature optimization, is an emergent concern. We present thermal-aware phase-based tuning (TaPT), which determines Pareto-optimal configurations for fine-grained execution time, energy, and temperature tradeoffs. Results show that TaPT reduces execution time, energy, and temperature by as much as 5%, 30%, and 25%, respectively, while adhering to designer-specified constraints.
{"title":"Thermal-aware phase-based tuning of embedded systems","authors":"Tosiron Adegbija, A. Gordon-Ross","doi":"10.1145/2591513.2591586","DOIUrl":"https://doi.org/10.1145/2591513.2591586","journal":{"name":"ACM Great Lakes Symposium on VLSI"},"publicationDate":"2014-05-20"}
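The notion of Pareto-optimal configurations used in the abstract above can be illustrated with a short Python sketch. A configuration is Pareto optimal when no other configuration is at least as good in every objective and strictly better in at least one. The sketch shows only this filtering step under the assumption that all three objectives (execution time, energy, temperature) are to be minimized. The sample tuples are made-up numbers, not measurements from the paper, and the abstract does not describe how TaPT searches the configuration space.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (execution time, energy, temperature) tuples for five configurations.
CONFIGS = [
    (10, 5, 60),
    (12, 4, 55),
    (11, 6, 65),   # dominated by (10, 5, 60)
    (9, 7, 70),
    (12, 5, 58),   # dominated by (12, 4, 55)
]

front = pareto_front(CONFIGS)
# front == [(10, 5, 60), (12, 4, 55), (9, 7, 70)]
```

A designer-specified constraint, such as a temperature ceiling, could then be applied by filtering `front` further; that step is left out here because the abstract gives no detail on how constraints are expressed.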