Program execution can be tampered with by malicious attackers who exploit software vulnerabilities. Changing program behavior by compromising control data and decision data has become the most serious threat to computer system security. Although several hardware approaches have been presented to validate program execution, most suffer from large hardware area overhead or poor ambiguity handling. In this paper, we propose a new hardware-based approach that leverages existing speculative architectures for run-time program validation. The on-chip branch target buffer (BTB) is used as a cache of the legitimate control flow transfers stored in a secure memory region. In addition, the BTB is extended to store the correct program path information. At each indirect branch site, the BTB is used to validate the decision history of the conditional branches preceding it, and more information about the future decision path is fetched to monitor the execution path at run-time. The implementation of this approach is transparent to the operating system and the programs above it, so it is applicable to legacy code. Because executable programs have good code locality and branch prediction is effective, run-time control flow validations against the secure off-chip memory are infrequent. Our experimental results show a negligible performance penalty and a small storage overhead, with reduced ambiguity.
Title: Leveraging speculative architectures for run-time program validation. Authors: Juan Carlos Martínez Santos, Yunsi Fei. DOI: 10.1145/2512456. Venue: 2008 IEEE International Conference on Computer Design. Pub Date: 2008-10-01.
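The caching idea in the abstract above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the class name `ControlFlowValidator`, the LRU replacement policy, and the representation of a transfer as a `(branch_pc, target_pc)` pair are all hypothetical, and the decision-history check for conditional branches is not modelled.

```python
from collections import OrderedDict


class ControlFlowValidator:
    """Toy model: a small BTB-like cache in front of a secure transfer table."""

    def __init__(self, legitimate_transfers, btb_entries=4):
        # Stand-in for the legitimate transfers kept in secure memory.
        self.secure_table = frozenset(legitimate_transfers)
        # Stand-in for the on-chip BTB; LRU replacement is an assumption.
        self.btb = OrderedDict()
        self.btb_entries = btb_entries
        self.secure_lookups = 0  # counts the slow validations

    def validate(self, branch_pc, target_pc):
        transfer = (branch_pc, target_pc)
        if transfer in self.btb:
            # Hit: the transfer was validated before, no secure lookup needed.
            self.btb.move_to_end(transfer)
            return True
        # Miss: fall back to the secure table.
        self.secure_lookups += 1
        if transfer not in self.secure_table:
            return False  # illegitimate transfer
        self.btb[transfer] = True
        if len(self.btb) > self.btb_entries:
            self.btb.popitem(last=False)  # evict the least recently used entry
        return True
```

With good locality most calls to `validate` hit in the cache, so `secure_lookups` grows slowly, which mirrors the abstract's argument that validations against secure memory are infrequent.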
Pub Date: 2008-10-01, DOI: 10.1109/ICCD.2008.4751871
A. H. Gholamipour, E. Bozorgzadeh, L. Bao
Software Defined Radio (SDR) base stations can compensate for failures in disaster scenarios by assimilating different communication technologies. FPGAs play an important role in the platform of an SDR base station because of the flexibility and DSP processing power they deliver. The flexibility of FPGAs comes at the high cost of reconfiguration time overhead, which can be a serious deterrent because of the QoS requirements of real-time traffic. In this paper, we propose a solution that reduces reconfiguration time overhead at the system level, given the configuration of each wireless system. We then go a step further and integrate our solution into a floorplanner to generate placements for wireless systems that can systematically hide or reduce reconfiguration time overhead. Our experiments show the effectiveness of our approach.
Title: Seamless sequence of software defined radio designs through hardware reconfigurability of FPGAs (2008 IEEE International Conference on Computer Design)
Pub Date: 2008-10-01, DOI: 10.1109/ICCD.2008.4751863
J. Lee, R. Mahapatra
The operational lifetimes of SoCs and microprocessors face growing threats from technology scaling and increasing device temperature and power density. In-field (or on-line) testing of NoC-based SoCs is an important technique for ensuring system integrity throughout this potentially shorter lifetime. Whether in-field testing is conducted concurrently with normal applications or executed in isolation, application intrusion must be minimized in order to maintain system availability. Specialized infrastructure IP has been proposed to manage on-line testing by scheduling tests and delivering test vectors to the various cores within the SoC from a centralized location. However, as the number of cores integrated into a single chip continues to increase, issuing test vectors from a centralized location is not a scalable solution. The increased distances that test vectors must travel have become a major concern for on-line testing because of their direct impact on application intrusion in terms of energy consumption, network load, and latency. In this paper, we apply a distributed storage technique to bound and minimize this distance, thereby minimizing network load, energy consumption, and test delivery latency across the entire network. Our experiments show that test delivery latency and energy consumption are reduced by approximately 90% for moderately sized NoCs.
Title: In-field NoC-based SoC testing with distributed test vector storage (2008 IEEE International Conference on Computer Design)
Pub Date: 2008-10-01, DOI: 10.1109/ICCD.2008.4751848
Chunchen Liu, Junjie Su, Yiyu Shi
Temperature variation in microprocessors is a workload-dependent problem. In such designs, the clock skew should be minimized with respect to temperature variation. Existing work has studied clock tree embedding perturbation under time-variant temperature variation, but no existing method can reduce skew variation. This paper develops an efficient and effective method, TMST, that simultaneously performs hotspot-avoiding embedding and thermal-aware routing: hotspot-avoiding embedding keeps the tree topology out of areas that are likely to be hot, and thermal-aware routing reduces skew by routing tree paths through areas with smoother temperature profiles. With a thermally tolerable tree structure, our method reduces not only delay skew but also skew variation (skew violation range). Our TMST solution reduces skew variation by 2X compared with the greedy-DME (GDME) method of Edahiro and the existing thermal-aware clock synthesis methods TACO and PECO. As the number of temperature maps scales from 100 down to 1, TMST also guarantees the smallest wire length overflow. TMST reduces the worst-case skew by up to 4X relative to PECO and up to 5X relative to TACO.
Title: Temperature-aware clock tree synthesis considering spatiotemporal hot spot correlations (2008 IEEE International Conference on Computer Design)
Pub Date: 2008-10-01, DOI: 10.1109/ICCD.2008.4751843
Alberto A. Del Barrio, M. Molina, J. Mendias, Esther Andres Perez, R. Hermida, F. Tirado
This paper justifies the use of estimation and prediction of carries to increase the performance of functional units built by replicating full adders while keeping the area penalty low. Adders and multipliers are the most representative modules in this group of functional units. These design techniques allow the implementation of modules with performance improvements ranging from 20% to 50% with an area overhead of only around 5%. These functional units are suitable for asynchronous circuits, but they could also be introduced in synchronous circuits with speculative techniques. The basic idea consists of estimating the carry out of some parts of the functional unit, allowing every part to operate independently and in parallel. These modules are then connected to build bigger ones. Simulation results show that for some applications it is possible to make predictions that are even more accurate than the bit-based estimation. Predictions also have the advantage that they can be introduced in multiplier designs, whereas estimators cannot. These predictions are similar to the ones used for branch prediction in a processor.
Title: Applying speculation techniques to implement functional units (2008 IEEE International Conference on Computer Design)
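The basic idea of estimating carries between independent blocks can be sketched in software. The following is a minimal illustration, not the circuit from the paper: the function name `speculative_add`, the block width, and the estimator (guess a carry-in of 1 only when both operands have the top bit of the previous block set) are assumptions chosen for simplicity. The model always returns the correct sum and only counts how often the guess would have needed correction.

```python
def speculative_add(a, b, width=32, block_bits=8):
    """Add a and b block by block, counting carry-in mispredictions.

    Returns (sum modulo 2**width, number of blocks whose guessed
    carry-in differed from the true carry-in).
    """
    mask = (1 << block_bits) - 1
    result = 0
    mispredictions = 0
    true_carry = 0  # true carry into the current block
    for shift in range(0, width, block_bits):
        a_block = (a >> shift) & mask
        b_block = (b >> shift) & mask
        if shift == 0:
            guess = 0  # the lowest block has no carry-in
        else:
            # Hypothetical estimator: look only at the most significant
            # bits of the previous block of each operand.
            guess = ((a >> (shift - 1)) & 1) & ((b >> (shift - 1)) & 1)
        if guess != true_carry:
            mispredictions += 1  # hardware would need a correction here
        total = a_block + b_block + true_carry
        result |= (total & mask) << shift
        true_carry = total >> block_bits
    return result & ((1 << width) - 1), mispredictions
```

For example, `speculative_add(0x80, 0x80)` guesses the carry into the second block correctly, while `speculative_add(0xFF, 0x01)` does not, because the carry there is produced by propagation through the whole block rather than by the top bits alone.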
Pub Date: 2008-10-01, DOI: 10.1109/ICCD.2008.4751894
A. Miyamoto, N. Homma, T. Aoki, Akashi Satoh
The present paper proposes a systematic design approach that provides the optimal high-radix Montgomery multipliers for an RSA processor satisfying user requirements. We introduce three multiplier-based architectures using different intermediate-data forms, (i) single form, (ii) semi carry-save form, and (iii) carry-save form, and combine them with a wide variety of arithmetic components. Their radices are also parameterized from 2^8 to 2^64. A total of 202 designs for 1,024-bit RSA processors were obtained for each radix and were synthesized using a 90-nm CMOS standard cell library. The resulting designs range from the smallest, at 0.9 Kgates with 137.8 ms/RSA, to the fastest, at 1.8 ms/RSA with 74.7 Kgates. The optimal design for a given set of user requirements can be easily selected from all the combinations. In addition to choosing the datapath architecture, the arithmetic component, and the radix parameters, the proposed systematic approach can also adopt other process technologies.
Title: Systematic design of high-radix Montgomery multipliers for RSA processors (2008 IEEE International Conference on Computer Design)
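For readers unfamiliar with the operation being parameterized, the sketch below shows textbook word-serial Montgomery multiplication with a configurable radix 2^k. It is a software reference model only and says nothing about the paper's datapath or intermediate-data forms; the function and parameter names (`montgomery_multiply`, `word_bits`, `num_words`) are hypothetical. It assumes an odd modulus, operands smaller than the modulus, and Python 3.8 or later for `pow` with a negative exponent.

```python
def montgomery_multiply(x, y, modulus, word_bits, num_words):
    """Return x * y * R^-1 mod modulus, with R = 2**(word_bits * num_words).

    Requires: modulus odd, modulus < R, and 0 <= x, y < modulus.
    """
    radix = 1 << word_bits
    mask = radix - 1
    # Precomputed constant: -modulus^-1 mod radix.
    n_prime = (-pow(modulus, -1, radix)) % radix
    acc = 0
    for i in range(num_words):
        x_word = (x >> (i * word_bits)) & mask  # next radix-2^k digit of x
        acc += x_word * y
        # Choose q so that acc + q * modulus is divisible by the radix.
        q = ((acc & mask) * n_prime) & mask
        acc = (acc + q * modulus) >> word_bits
    # acc is below 2 * modulus here, so one conditional subtraction suffices.
    if acc >= modulus:
        acc -= modulus
    return acc
```

A larger `word_bits` means fewer loop iterations but wider multiplications per iteration, which is the kind of speed versus area trade-off the radix parameter exposes in a hardware design.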
Pub Date: 2008-10-01, DOI: 10.1109/ICCD.2008.4751898
Shilpa Bhoj, D. Bhatia
Increasing transistor densities, the rising popularity of mobile applications, and the migration towards eco-friendly computing systems have made power dissipation a key FPGA design issue. To meet stringent budgets, system architects need accurate estimates of power distribution at various design stages. In this work, we make several key contributions to FPGA leakage power estimation. First, we develop an accurate and efficient model to estimate total interconnect leakage power at various design stages prior to routing. Our methods derive leakage power estimates from predicted values of routing congestion and interconnect resource utilization. We then extend the model to accommodate complex segmented routing architectures and low-leakage architectures. Finally, we formulate relations to generate post-placement leakage power estimates for individual routing channels. Our models for overall leakage power estimation achieve average accuracy rates of 93% and 89% for uniform and segmented routing architectures, respectively. Experimental results also establish the accuracy of the channel-level estimation models at 85% and 80% for uniform and segmented routing structures, respectively. Our models and techniques help designers make informed decisions by providing information on the power consumption of the interconnect fabric well before routing. Additionally, the equations can be used for architectural exploration and embedded in power- and thermal-aware CAD tools.
Title: Early stage FPGA interconnect leakage power estimation (2008 IEEE International Conference on Computer Design)
Pub Date: 2008-10-01, DOI: 10.1109/ICCD.2008.4751929
Wei-Chiu Tseng, Yu-Hsing Chen, Rung-Bin Lin
In this paper, we propose a synergetic approach that integrates router design and cell library engineering to improve the post-routing via1 (via between M1 and M2) doubling rate at pins. We develop a double-via (DV) aware multilevel router that exploits the via1 doubling possibilities provided to the cells in a conventional as well as a DV-driven cell library. Compared to a non-DV-aware router using a conventional cell library, our approach using a DV-driven library can on average raise the via1 doubling rate by 34%, raise the total via doubling rate by 11%, reduce the total number of vias by 3%, and reduce the total number of via1s by 8%. All this can be achieved without incurring any performance or area penalties.
Title: Router and cell library co-development for improving redundant via insertion at pins (2008 IEEE International Conference on Computer Design)
Pub Date: 2008-10-01, DOI: 10.1109/ICCD.2008.4751862
K. Ramakrishnan, Xiaoxia Wu, N. Vijaykrishnan, Yuan Xie
Mitigating the circuit aging effect in digital circuits has become a very important concern for current and future technology nodes. Negative Bias Temperature Instability (NBTI) is one of the most important circuit aging mechanisms, which can incur timing errors. Flip-flops play a vital role as storage elements in pipelined architectures and are prone to effects of aging. NBTI increases the transistor threshold voltage, affecting the performance of the chip. In this paper, we study the effects of NBTI on the timing characteristics of different types of low power and high performance flip-flops. Factors such as input data probability and temperature which affect the degradation rate are also analyzed.
Title: Comparative analysis of NBTI effects on low power and high performance flip-flops (2008 IEEE International Conference on Computer Design)
Pub Date: 2008-10-01, DOI: 10.1109/ICCD.2008.4751881
Zhenyu Liu, S. Goto, T. Ikenaga
The variable block size motion estimation algorithm is an efficient approach to reducing temporal redundancy, and it has been adopted by the latest video coding standard, H.264/AVC. The increase in computational complexity caused by the variable block size technique makes a hardwired accelerator essential, especially for real-time applications. In this paper, the authors apply architecture-level and circuit-level approaches to improve the performance of the Propagate Partial SAD and SAD Tree hardwired engines, which outperform their counterparts when the impact of supporting the variable block size technique is considered. Experiments demonstrate that, compared with the original architectures, the proposed approaches save 14.7% and 18.0% of the hardware cost for the Propagate Partial SAD architecture and the SAD Tree architecture, respectively. With TSMC 0.18 µm 1P6M CMOS technology, the proposed Propagate Partial SAD architecture attains a 231.6 MHz operating frequency at a cost of 84.1 k gates. Correspondingly, the execution speed of the optimized SAD Tree architecture is improved to 204.8 MHz with a hardware overhead of 88.5 k gates.
Title: Optimization of Propagate Partial SAD and SAD tree motion estimation hardwired engine for H.264 (2008 IEEE International Conference on Computer Design)
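The reason variable block sizes add cost, and the reuse that makes them tractable, can be shown with a short model of the sum of absolute differences (SAD). This is a sketch of the general idea only, not of the Propagate Partial SAD or SAD Tree engines evaluated in the paper; the function names `sad_4x4_grid` and `merge_sads` and the list-of-lists pixel layout are assumptions. The SADs of the sixteen 4x4 sub-blocks of a 16x16 macroblock are computed once, then summed to obtain the SADs of larger partitions.

```python
def sad_4x4_grid(current, reference):
    """Return a 4x4 grid holding the SAD of each 4x4 sub-block.

    current and reference are 16x16 lists of pixel values.
    """
    grid = [[0] * 4 for _ in range(4)]
    for row in range(16):
        for col in range(16):
            grid[row // 4][col // 4] += abs(current[row][col] - reference[row][col])
    return grid


def merge_sads(grid, rows_per_block, cols_per_block):
    """Sum 4x4 SADs into larger partitions.

    rows_per_block and cols_per_block are given in units of 4x4
    sub-blocks, e.g. (2, 2) for 8x8 and (4, 2) for 16x8 pixels
    tall by wide.
    """
    merged = []
    for r in range(0, 4, rows_per_block):
        merged.append([
            sum(grid[r + dr][c + dc]
                for dr in range(rows_per_block)
                for dc in range(cols_per_block))
            for c in range(0, 4, cols_per_block)
        ])
    return merged
```

Calling `merge_sads(grid, 2, 2)` yields the four 8x8 SADs and `merge_sads(grid, 4, 4)` the single 16x16 SAD, without touching the pixels again.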