A novel class of linear soft-input soft-output detectors featuring boosted communications performance is introduced. Compared to state-of-the-art linear detectors, the detector has an SNR gain of up to 2.4 dB. We briefly summarize the algorithm and sketch a suitable architecture. The corresponding ASIC implementation shows the feasibility and efficiency of the concept. It achieves the IEEE 802.11n standard's peak data rate of 600 Mbit/s.
{"title":"VLSI implementation of linear MIMO detection with boosted communications performance: extended abstract","authors":"Dominik Auras, D. Rieth, R. Leupers, G. Ascheid","doi":"10.1145/2591513.2591551","DOIUrl":"https://doi.org/10.1145/2591513.2591551","url":null,"abstract":"A novel class of linear soft-input soft-output detectors featuring boosted communications performance is introduced. Compared to state-of-the-art linear detectors, the detector has an SNR gain of up to 2.4 dB. We shortly summarize the algorithm, and sketch a suitable architecture. The corresponding ASIC implementation shows the feasibility and efficiency of the concept. It achieves the IEEE 802.11n standard's peak data rate of 600 Mbit/s.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128328407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to embedded systems' stringent design constraints, much prior work has focused on optimizing energy consumption and/or performance. However, since embedded systems have fewer cooling options, rising temperature, and thus temperature optimization, is an emergent concern. We present thermal-aware phase-based tuning (TaPT), which determines Pareto-optimal configurations for fine-grained execution time, energy, and temperature tradeoffs. Results show that TaPT reduces execution time, energy, and temperature by as much as 5%, 30%, and 25%, respectively, while adhering to designer-specified design constraints.
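As a rough illustration of what "Pareto optimal" means for the three metrics named above, the sketch below filters a list of candidate configurations down to the non-dominated ones. It is not the TaPT algorithm itself; the tuple layout, the function names, and the sample numbers are all assumptions made for this example, and lower is taken to be better for every metric.

```python
from typing import List, Tuple

# Hypothetical layout: (execution time, energy, temperature), lower is better.
Config = Tuple[float, float, float]


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b in every metric and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up sample points, not measurements from the paper.
candidates = [
    (1.0, 5.0, 60.0),
    (1.2, 4.0, 55.0),
    (1.1, 5.5, 65.0),  # worse than the first point in all three metrics
    (0.9, 6.0, 70.0),
]
front = pareto_front(candidates)
```

A designer-specified constraint (for example, a temperature ceiling) could then be applied as a plain filter over `front` before picking a configuration.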
{"title":"Thermal-aware phase-based tuning of embedded systems","authors":"Tosiron Adegbija, A. Gordon-Ross","doi":"10.1145/2591513.2591586","DOIUrl":"https://doi.org/10.1145/2591513.2591586","url":null,"abstract":"Due to embedded systems' stringent design constraints, much prior work focused on optimizing energy consumption and/or performance. However, since embedded systems have fewer cooling options, rising temperature, and thus temperature optimization, is an emergent concern. We present thermal-aware phase-based tuning--TaPT--that determines Pareto optimal configurations for fine-grained execution time, energy, and temperature tradeoffs. Results show that TaPT reduces execution time, energy, and temperature by as much as 5%, 30%, and 25%, respectively, while adhering to designer-specified design constraints.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128854642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Pereira, A. Soares, A. Susin, A. Bonatto, M. Negreiros
This paper presents a resource-optimized hardware solution to perform the H.264 8x8 inverse transform. Row/column decomposition is used, arithmetic units are reused, and the transpose memory is replaced by a shift register. The architecture is able to perform the 8x8 integer transform calculation in 144 cycles with as few as 431 LUTs on a Xilinx Virtex-6 FPGA for 16-bit resolution. To enable the module to process all inverse transforms in H.264, the number of LUTs is increased to 681. When used to calculate all transforms for H.264 videos, the design supports resolutions up to 1280x720@30fps when running at 84 MHz.
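The row/column decomposition mentioned above can be sketched in software: a separable 2-D transform K·X·Kᵀ is computed as a 1-D pass over the rows, a transpose, and a second 1-D pass. The kernel below is an arbitrary integer matrix chosen for illustration, not the H.264 coefficient set, and the function names are hypothetical. The hardware design replaces the transpose step with a shift register, which this sketch does not model.

```python
from typing import List

Matrix = List[List[int]]


def transform_rows(block: Matrix, kernel: Matrix) -> Matrix:
    """Apply the 1-D transform `kernel` to every row of `block`."""
    n = len(kernel)
    return [
        [sum(kernel[k][j] * row[j] for j in range(n)) for k in range(n)]
        for row in block
    ]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def transform_2d(block: Matrix, kernel: Matrix) -> Matrix:
    """Row pass, transpose, row pass (acting on columns), transpose back."""
    first = transform_rows(block, kernel)
    second = transform_rows(transpose(first), kernel)
    return transpose(second)


def transform_2d_direct(block: Matrix, kernel: Matrix) -> Matrix:
    """Reference: evaluate K * X * K^T term by term, for comparison only."""
    n = len(kernel)
    return [
        [
            sum(kernel[a][i] * block[i][j] * kernel[b][j]
                for i in range(n) for j in range(n))
            for b in range(n)
        ]
        for a in range(n)
    ]


# Arbitrary 8x8 test data and kernel (assumptions, not H.264 values).
N = 8
kernel = [[((k + 1) * (j + 1)) % 5 - 2 for j in range(N)] for k in range(N)]
block = [[(3 * i + 7 * j) % 11 for j in range(N)] for i in range(N)]
result = transform_2d(block, kernel)
```

The same 1-D unit is used for both passes, which is the property that lets a hardware implementation reuse its arithmetic units.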
{"title":"H.264 8x8 inverse transform architecture optimization","authors":"F. Pereira, A. Soares, A. Susin, A. Bonatto, M. Negreiros","doi":"10.1145/2591513.2591564","DOIUrl":"https://doi.org/10.1145/2591513.2591564","url":null,"abstract":"This paper presents a resource optimized hardware solution to perform the H.264 8x8 inverse transform. Row/column decomposition is used, arithmetic units are re-used and the transpose memory is replaced by a shift register. The architecture is able to perform 8x8 integer transform calculation in 144 cycles with as few as 431 LUTs on a Xilinx virtex 6 FPGA for 16-bit resolution. To enable the module to process all inverse transforms in H.264, the number of LUTs is increased to 681. When used to calculate all transforms for H.264 videos, the design supports resolutions up to 1280x720@30fps when running at 84 MHz.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125103776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predictors are used in many fields of computer architecture to enhance performance. With good estimations of future system behaviour, policies can be developed to improve system performance or reduce power consumption. These policies become more effective if the predictors are implemented in hardware and can provide quantified forecasts and not only binary ones. In this paper, we present and evaluate a generic predictor implemented in VHDL running on an FPGA which produces quantified forecasts. Moreover, a complete scalability analysis is presented which shows that our implementation has a maximum device utilization of less than 5%. Furthermore, we analyse the power consumption of the predictor running on an FPGA. Additionally, we show that this implementation can be clocked at over 210 MHz. Finally, we evaluate a power-saving policy based on our hardware predictor. Based on predicted idle periods, this power-saving policy uses power-saving modes and is able to reduce memory power consumption by 14.3%.
{"title":"A generic implementation of a quantified predictor on FPGAs","authors":"G. Thomas, A. Elhossini, B. Juurlink","doi":"10.1145/2591513.2591517","DOIUrl":"https://doi.org/10.1145/2591513.2591517","url":null,"abstract":"Predictors are used in many fields of computer architectures to enhance performance. With good estimations of future system behaviour, policies can be developed to improve system performance or reduce power consumption. These policies become more effective if the predictors are implemented in hardware and can provide quantified forecasts and not only binary ones. In this paper, we present and evaluate a generic predictor implemented in VHDL running on an FPGA which produces quantified forecasts. Moreover, a complete scalability analysis is presented which shows that our implementation has a maximum device utilization of less than 5%. Furthermore, we analyse the power consumption of the predictor running on an FPGA. Additionally, we show that this implementation can be clocked by over 210 MHz. Finally, we evaluate a power-saving policy based on our hardware predictor. Based on predicted idle periods, this power-saving policy uses power-saving modes and is able to reduce memory power consumption by 14.3%.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115274740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Z. Wang, Chao Chen, Piyush Sharma, A. Chattopadhyay
The power density of digital circuits has increased at an alarming rate in deep sub-micron CMOS technology, turning reliability into a serious design concern. On the other hand, ever-growing task complexity with strict performance budgets has forced designers to adopt complex, heterogeneous MPSoCs as the implementation choice. Several commercial system-level design platforms currently exist for the design, exploration and implementation of MPSoCs. In this paper, we propose a system-level reliability exploration framework by extending a commercial system-level design flow. Using this framework, a heterogeneous MPSoC is designed which can accept a custom mapping algorithm based on the MPSoC topology before the actual task deployment. The dynamic reliability-aware task management is able to consider the desired reliability constraints of tasks as well as reliability levels of the system components. We report our experimental findings using state-of-the-art benchmark applications.
{"title":"System-level reliability exploration framework for heterogeneous MPSoC","authors":"Z. Wang, Chao Chen, Piyush Sharma, A. Chattopadhyay","doi":"10.1145/2591513.2591519","DOIUrl":"https://doi.org/10.1145/2591513.2591519","url":null,"abstract":"Power density of digital circuits increased at alarming rate for deep sub-micron CMOS technology, turning reliability into a serious design concern. On the other hand, ever-growing task complexity with strict performance budget forced designers to adopt complex, heterogeneous MPSoCs as the implementation choice. Several commercial system-level design platforms exist currently for design, exploration and implementation of MPSoC. In this paper, we propose a system-level reliability exploration framework by extending a commercial system-level design flow. Using this framework, a heterogeneous MPSoC is designed which can accept a custom mapping algorithm based on the MPSoC topology before the actual task deployment. The dynamic reliability-aware task management is able to consider the desired reliability constraints of tasks as well as reliability levels of the system components. We report our experimental findings using state-of-the-art benchmark applications.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115574302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marta Ortín-Obón, L. Ramini, H. Tatenguem, V. Viñals, D. Bertozzi
Although many valuable research works have investigated the properties of optical networks-on-chip (ONoCs), the vast majority of them lack an accurate exploration of the network interface (NI) architecture required to support optical communications on the silicon chip. The complexity of this architecture is especially critical for a specific kind of ONoC: wavelength-routed ones. From a logical viewpoint, they can be considered full nonblocking crossbars, so the control complexity is implemented at the NIs. To our knowledge, this paper proposes the first complete NI architecture for wavelength-routed optical NoCs, coping with the intricacy of networking issues such as flow control, buffering strategy, deadlock avoidance, and serialization, and above all with their codesign in a complete architecture.
{"title":"A complete electronic network interface architecture for global contention-free communication over emerging optical networks-on-chip","authors":"Marta Ortín-Obón, L. Ramini, H. Tatenguem, V. Viñals, D. Bertozzi","doi":"10.1145/2591513.2591536","DOIUrl":"https://doi.org/10.1145/2591513.2591536","url":null,"abstract":"Although many valuable research works have investigated the properties of optical networks-on-chip (ONoCs), the vast majority of them lack an accurate exploration of the network interface architecture (NI) required to support optical communications on the silicon chip. The complexity of this architecture is especially critical for a specific kind of ONoCs: wavelength-routed ones. From a logical viewpoint, they can be considered as full nonblocking crossbars, thus the control complexity is implemented at the NIs. To our knowledge, this paper proposes the first complete NI architecture for wavelength-routed optical NoCs, by coping with the intricacy of networking issues such as flow control, buffering strategy, deadlock avoidance, serialization, and above all, with their codesign in a complete architecture.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114319873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bufferless NoC routers employing deflection routing are gaining popularity due to their power and area efficiency. We propose WeDBless, a bufferless deflection router that reduces the deflection rate of flits by employing port allocation based on weighted deflection of flits. The proposed method directs frequently misrouted flits towards their destination by increasing their probability of getting a productive output port. Our evaluations on synthetic traffic patterns show that, compared to the state-of-the-art bufferless router, WeDBless achieves a significant reduction in deflection rate and average flit latency, an improved network saturation point, and reduced complexity in the route-computation logic.
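One way to picture port allocation that favours frequently misrouted flits is the toy model below. It is a sketch under loud assumptions, not the WeDBless design: the abstract does not give the weighting scheme, so this example simply lets the flit with the highest deflection count choose first, and the `Flit` type, port names, and tie-breaking are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

PORTS = ["N", "E", "S", "W"]  # assumed output ports of a mesh router


@dataclass
class Flit:
    ident: str
    productive: Set[str]  # ports that move the flit closer to its destination
    deflections: int = 0  # how often it has been misrouted so far


def allocate_ports(flits: List[Flit]) -> Dict[str, str]:
    """Assign one output port per flit; assumes len(flits) <= len(PORTS)."""
    free = list(PORTS)
    assignment: Dict[str, str] = {}
    # Assumed policy: most-deflected flits pick first, so they are the
    # likeliest to obtain a productive port.
    for flit in sorted(flits, key=lambda f: f.deflections, reverse=True):
        choice = next((p for p in free if p in flit.productive), None)
        if choice is None:
            choice = free[0]       # no productive port left: deflect
            flit.deflections += 1  # remember the misroute for the next hop
        free.remove(choice)
        assignment[flit.ident] = choice
    return assignment


# Two flits compete for the north port; "b" has been deflected more often.
a = Flit("a", {"N"}, deflections=0)
b = Flit("b", {"N"}, deflections=3)
ports = allocate_ports([a, b])
```

In this model every flit always leaves on some port, which is the defining property of a bufferless deflection router; only the choice of who gets deflected changes.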
{"title":"WeDBless: weighted deflection bufferless router for mesh NoCs","authors":"Simi Zerine Sleeba, John Jose, M. G. Mini","doi":"10.1145/2591513.2591559","DOIUrl":"https://doi.org/10.1145/2591513.2591559","url":null,"abstract":"Bufferless NoC routers employing deflection routing are gaining popularity due to their power and area efficiency. We propose WeDBless, a bufferless deflection router that reduces deflection rate of flits by employing port allocation based on weighted deflection of flits. The proposed method directs the frequently misrouted flits towards their destination by increasing their probability of getting a productive output port. Our evaluations on synthetic traffic patterns show that WeDBless achieves significant reduction in deflection rate, average flit latency and improvement in network saturation point compared to the state-of-the-art bufferless router and reduced complexity in route computing logic.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116742907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose an FPGA implementation of a genetic algorithm (GA) for parameter identification of linear and nonlinear autoregressive moving average (ARMA) models. The GA features genetic operators designed specifically for adaptive filtering applications. The design was implemented using a very short fixed-point word length, with only 6-bit arithmetic. The implementation experiments show strong parameter identification capability and a low footprint.
{"title":"FPGA based implementation of a genetic algorithm for ARMA model parameters identification","authors":"H. Merabti, D. Massicotte","doi":"10.1145/2591513.2591579","DOIUrl":"https://doi.org/10.1145/2591513.2591579","url":null,"abstract":"In this paper, we propose an FPGA implementation of a genetic algorithm (GA) for linear and nonlinear auto regressive moving average (ARMA) model parameters identification. The GA features specifically designed genetic operators for adaptive filtering applications. The design was implemented using very low bit-wordlength fixed-point representation, where only 6-bit wordlength arithmetic was used. The implementation experiments show high parameters identification capabilities and low footprint.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124534332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a novel digitally-assisted automatic frequency tuning technique. The self-calibration technique is verified on a 130 nm CMOS 4th-order biquad baseband low-pass filter with a 20 MHz cut-off frequency, which satisfies typical LTE receiver specifications. The proposed tuning method includes hardware reduction methods, coherent sampling, and a magnitude calculator using the "alpha max plus beta min" algorithm, giving a significant chip area reduction with negligible accuracy degradation. The cut-off frequency is tunable from 16.2 MHz to 24.4 MHz, and the tuning error is less than 0.4% over the whole frequency tuning range. The estimated area is 0.027 mm² at 80% device density, and the power dissipation is 0.16 mW at a 128 MHz clock speed with a 1.2 V supply voltage.
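The "alpha max plus beta min" algorithm named above approximates the magnitude of a complex sample without a square root: |I + jQ| ≈ α·max(|I|, |Q|) + β·min(|I|, |Q|). The sketch below uses α = 1 and β = 1/2, a common multiplier-free choice; these coefficients are an assumption, since the abstract does not state which values the design uses.

```python
import math


def alpha_max_beta_min(i: float, q: float,
                       alpha: float = 1.0, beta: float = 0.5) -> float:
    """Approximate sqrt(i*i + q*q) as alpha*max(|i|,|q|) + beta*min(|i|,|q|).

    alpha=1, beta=1/2 are assumed defaults chosen for this example; with
    them the multiply by beta reduces to a one-bit right shift in hardware.
    """
    hi, lo = max(abs(i), abs(q)), min(abs(i), abs(q))
    return alpha * hi + beta * lo


def relative_error(i: float, q: float) -> float:
    """Signed relative error of the approximation against math.hypot."""
    exact = math.hypot(i, q)
    return (alpha_max_beta_min(i, q) - exact) / exact


# Sweep the unit circle to see how far the approximation strays.
worst = max(
    abs(relative_error(math.cos(t), math.sin(t)))
    for t in (2 * math.pi * k / 3600 for k in range(3600))
)
```

With these coefficients the estimate never falls below the true magnitude and overshoots by at most about 12%, the peak occurring where the smaller component is half the larger one. Other coefficient pairs trade a smaller peak error for slightly more arithmetic.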
{"title":"A novel mixed-signal self-calibration technique for baseband filters in systems-on-chip mobile transceivers","authors":"Yongsuk Choi, Yong-Bin Kim","doi":"10.1145/2591513.2591522","DOIUrl":"https://doi.org/10.1145/2591513.2591522","url":null,"abstract":"This paper presents a novel digitally-assisted automatic frequency tuning technique, and the self calibration technique is verified for a 130nm CMOS 4th order biquad baseband low-pass filter case with 20MHz cut-off frequency, which satisfies the typical LTE receiver specifications. The proposed tuning method includes hardware reduction methods, coherent sampling, and magnitude calculator using \"alpha max plus beta min\" algorithm for significant chip area reduction with negligible accuracy degradation. The cut-off frequency turns out to be tunable in the range of 16.2MHz to 24.4MHz, and the tuning error is less than 0.4% over the whole frequency tuning range. The estimated area consumption is 0.027mm2 with 80% device density, and power dissipation is 0.16mW at 128MHz clock speed with a 1.2V supply voltage.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124812799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wire sizing can be used to reduce the delays of critical nets. However, because of the forbidden pitch issue in sub-20nm designs, wide wires may no longer be an attractive solution, given the restrictive wire spacing requirements of advanced lithography. In this work, we investigate the suitability of the parallel wiring technique, in which multiple parallel wires are used to route the same net, as an alternative to routing a net using a single wide wire. In particular, we study the trade-offs between parasitics, timing, power, and routing resources. Our study reveals that wire sizing using both parallel wires and wide wires can be advantageous. Moreover, if high layout densities are required, parallel wiring can be a viable approach to solving timing problems for sub-20nm designs.
{"title":"A study on the use of parallel wiring techniques for sub-20nm designs","authors":"Rickard Ewetz, Wen-Hao Liu, Kai-Yuan Chao, Ting-Chi Wang, Cheng-Kok Koh","doi":"10.1145/2591513.2591588","DOIUrl":"https://doi.org/10.1145/2591513.2591588","url":null,"abstract":"Wire sizing can be used to reduce the delays of critical nets. However, because of the forbidden pitch issue in sub-20nm designs, wide wires may no longer be an attractive solution because of the restrictive wire spacing requirement from advanced lithography. In this work, we investigate the suitability of the parallel wiring technique, in which multiple parallel wires are used to route the same net, as an alternative to routing a net using a single wide wire. In particular, we study the trade offs between parasitics, timing, power, and routing resources. Our study reveals that wire sizing using both parallel wires and wide wires can be advantageous. Moreover, if high layout densities are required, parallel wiring can be a viable approach in solving timing problems for sub-20nm designs.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132923898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}