Temperature-aware clock tree synthesis considering spatiotemporal hot spot correlations
Pub Date: 2008-10-01 | DOI: 10.1109/ICCD.2008.4751848
Chunchen Liu, Junjie Su, Yiyu Shi
Temperature variation in microprocessors is a workload-dependent problem, and clock skew should be minimized with respect to this variation. Existing work has studied clock tree embedding perturbation under time-variant temperature variation, but no existing method reduces skew variation itself. This paper develops an efficient yet effective method (TMST) that combines hotspot-avoiding embedding with thermal-aware routing: the embedding step keeps the tree topology out of regions with a high probability of elevated temperature, and the routing step reduces skew by steering tree paths through areas with smoother temperature profiles. With such a thermally tolerant tree structure, our method reduces not only delay skew but also skew variation (the range over which skew violations occur). Compared with existing temperature-aware clock tree methods, our TMST solution reduces skew variation by 2X relative to the greedy-DME (GDME) method of Edahiro and to the existing thermal-aware clock synthesis methods TACO and PECO. As the number of temperature maps scales from 100 down to 1, TMST also yields the smallest wire length overhead. TMST reduces the worst-case skew by up to 4X compared with PECO and 5X compared with TACO.
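As background for the thermal skew problem the abstract describes, the sketch below uses the standard Elmore delay model with a linear temperature dependence of wire resistance to show how a hotspot along one branch of a clock tree produces skew between two otherwise identical sink paths. The temperature coefficient, segment resistances, and capacitances are illustrative assumptions, not data from the paper.

```python
# Hedged illustration: Elmore delay skew induced by on-chip temperature gradients.
# The linear temperature coefficient and per-segment values below are illustrative
# assumptions, not parameters from the TMST paper.

ALPHA = 0.0039      # per-degree-C resistance coefficient of copper (approximate)
T_REF = 25.0        # reference temperature (C)

def segment_resistance(r_ref, temp_c):
    """Wire resistance at temperature temp_c, linear model around T_REF."""
    return r_ref * (1.0 + ALPHA * (temp_c - T_REF))

def elmore_delay(segments, temps, c_seg, c_sink):
    """Elmore delay of a single source-to-sink path.

    segments: per-segment reference resistances (ohms)
    temps:    per-segment temperatures (C), e.g. sampled from a temperature map
    c_seg:    per-segment capacitance (F)
    c_sink:   sink load capacitance (F)
    """
    delay, upstream_r = 0.0, 0.0
    for r_ref, t in zip(segments, temps):
        upstream_r += segment_resistance(r_ref, t)
        delay += upstream_r * c_seg
    return delay + upstream_r * c_sink

# Two symmetric paths: one through a hotspot (95 C), one through a cool region (45 C).
path = [50.0, 50.0, 50.0, 50.0]            # ohms per segment
hot  = elmore_delay(path, [95, 95, 95, 95], c_seg=20e-15, c_sink=50e-15)
cool = elmore_delay(path, [45, 45, 45, 45], c_seg=20e-15, c_sink=50e-15)
print(f"skew from thermal gradient: {abs(hot - cool) * 1e12:.2f} ps")
```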
{"title":"Temperature-aware clock tree synthesis considering spatiotemporal hot spot correlations","authors":"Chunchen Liu, Junjie Su, Yiyu Shi","doi":"10.1109/ICCD.2008.4751848","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751848","url":null,"abstract":"Temperature variation in microprocessors is a workload dependent problem. In such a design, the clock skew should be minimized with respect to temperature variation. Existing work has studied clock tree embedding perturbation considering time variant temperature variation. There is no existing method that can reduce skew variation. This paper develops an efficient yet effective simultaneous hotspot avoid embedding and thermal aware routing (TMST) method, where hotspot embedding avoid tree topology located in area with high temperature possibility and thermal aware routing reduce skew in tree path with more smooth temperature area. With a thermally tolerable tree structure, our method can reduce not only delay skew but also skew variation (skew violation range). Compared with existing temperature-aware clock tree method, our TMST solution reduces skew variation by 2X compared with the greedy-DME (GDME) method of Edahiro and existing thermal aware clock synthesis TACO and PECO. With the scale from 100 down to 1 temperature maps, our TMST also guarantees the smallest wire length overflow. TMST reduces the worst case skew up to 4X than PECO and 5X than TACO.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122443834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Two dimensional highly associative level-two cache design
Pub Date: 2008-10-01 | DOI: 10.1109/ICCD.2008.4751934
Chuanjun Zhang, Bing Xue
High associativity is important for level-two cache designs [9]. Implementing CAM-based highly associative caches (CAM-HAC), however, is costly in hardware and scales poorly. We propose to implement the CAM-HAC in macro-blocks to improve scalability. Each macro-block contains 128 rows and 8 columns of cache blocks; we name it the Two-dimensional Cache, or T-Cache. Each macro-block has an associativity equivalent to 128 × 8 = 1024-way. Twelve bits of the T-Cache's tag are implemented using CAM, while the remaining tag bits use SRAM. Furthermore, random replacement is used across rows to balance cache-set usage, while LRU is used across the columns of a row to select the victim. The hardware complexity of replacement is greatly reduced compared with a traditional CAM-HAC that uses LRU alone. Experimental results show that the T-Cache achieves a 16% miss-rate reduction over a traditional 8-way unified L2 cache. This translates into an average IPC improvement of 5%, and as high as 18%. The T-Cache also exhibits a 4% saving in total memory-access-related energy due to the reduction in application execution time.
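The row/column split of the replacement policy is the core of the T-Cache idea, so a minimal software sketch may help: on a miss, a row of the macro-block is chosen at random and the LRU entry among that row's columns is evicted. The grid dimensions follow the abstract; everything else (lookup, timestamps) is a simplifying assumption.

```python
# Hedged sketch of a T-Cache-style two-dimensional replacement policy:
# a macro-block is a 128x8 grid of cache blocks; on a miss we pick a row
# at random and evict the LRU entry among that row's 8 columns.
import random

ROWS, COLS = 128, 8

class MacroBlock:
    def __init__(self):
        # Each entry holds (tag, last_use_time); None means invalid.
        self.blocks = [[None] * COLS for _ in range(ROWS)]
        self.clock = 0

    def access(self, tag):
        self.clock += 1
        # Fully associative lookup within the macro-block (CAM-assisted in hardware).
        for r in range(ROWS):
            for c in range(COLS):
                entry = self.blocks[r][c]
                if entry is not None and entry[0] == tag:
                    self.blocks[r][c] = (tag, self.clock)
                    return True   # hit
        self._fill(tag)
        return False              # miss

    def _fill(self, tag):
        row = random.randrange(ROWS)           # random replacement across rows
        cols = self.blocks[row]
        # Prefer an invalid slot; otherwise evict the LRU column in this row.
        victim = min(range(COLS),
                     key=lambda c: -1 if cols[c] is None else cols[c][1])
        cols[victim] = (tag, self.clock)
```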
{"title":"Two dimensional highly associative level-two cache design","authors":"Chuanjun Zhang, Bing Xue","doi":"10.1109/ICCD.2008.4751934","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751934","url":null,"abstract":"High associativity is important for level-two cache designs [9]. Implementing CAM-based highly associative caches (CAM-HAC), however, is both costly in hardware and exhibits poor scalability. We propose to implement the CAM-HAC in macro-blocks to improve scalability. Each macro-block contains 128-row and 8-column of cache blocks. We name it Two dimensional Cache, or T-Cache. Each macro-block has an associativity equivalent to 128times8=1024-way. Twelve bits of the T-Cachepsilas tag are implemented by using CAM, while the remaining tag uses SRAM; Furthermore, random replacement is used in rows to balance cache sets usage while LRU is used in columns to select the victim from a row. The hardware complexity for replacement is reduced greatly compared to a traditional CAM-HAC using LRU solely. Experimental results show that the T-Cache achieves a 16% miss rate reduction over a traditional 8-way unified L2 cache. This translates into an average IPC improvement of 5% and as high as 18%. The T-Cache exhibits a 4% total memory access-related energy savings due to the reduction to applicationspsila execution time.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121530197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Systematic design of high-radix Montgomery multipliers for RSA processors
Pub Date: 2008-10-01 | DOI: 10.1109/ICCD.2008.4751894
A. Miyamoto, N. Homma, T. Aoki, Akashi Satoh
The present paper proposes a systematic design approach for generating optimal high-radix Montgomery multipliers for an RSA processor that satisfies user requirements. We introduce three multiplier-based architectures using different intermediate-data forms ((i) single form, (ii) semi carry-save form, and (iii) carry-save form) and combine them with a wide variety of arithmetic components. Their radices are parameterized from 2^8 to 2^64. A total of 202 designs for 1,024-bit RSA processors were obtained for each radix and synthesized using a 90-nm CMOS standard cell library. The results range from the smallest design, 0.9 Kgates at 137.8 ms/RSA, to the fastest design, 1.8 ms/RSA at 74.7 Kgates. The optimal design meeting given user requirements can then be easily selected from all the combinations. Beyond choosing the datapath architecture, the arithmetic components, and the radix parameters, the proposed systematic approach can also be adapted to other process technologies.
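For readers unfamiliar with high-radix Montgomery multiplication, the following word-serial software model illustrates the arithmetic such multiplier datapaths implement; a radix of 2^w corresponds to processing w bits of the multiplier per iteration. This is the textbook algorithm, not the paper's specific single/semi-carry-save/carry-save datapaths, and the demo modulus is arbitrary.

```python
def montgomery_multiply(a, b, n, w, s):
    """Word-serial (radix-2^w) Montgomery multiplication: returns a*b*R^-1 mod n,
    where R = 2^(w*s), n is odd, a < R, and b < n. A plain software model of the
    high-radix iteration; the radix 2^w is the parameter explored in the paper."""
    mask = (1 << w) - 1
    n0_inv = pow(-n, -1, 1 << w)           # -n^-1 mod 2^w (requires Python 3.8+)
    t = 0
    for i in range(s):
        a_i = (a >> (w * i)) & mask         # next w-bit digit of the multiplier
        t += a_i * b
        m = (t & mask) * n0_inv & mask      # quotient digit that zeroes the low word
        t = (t + m * n) >> w                # exact division by the radix
    return t - n if t >= n else t           # single conditional final subtraction

if __name__ == "__main__":
    w, s = 8, 4                             # radix 2^8, 32-bit modulus for the demo
    n = 0xF123456B                          # arbitrary odd modulus
    R = 1 << (w * s)
    a, b = 0x12345678 % n, 0x0BADF00D % n
    got = montgomery_multiply(a * R % n, b * R % n, n, w, s)
    assert got == (a * b * R) % n           # MonPro(aR, bR) = abR mod n
```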
{"title":"Systematic design of high-radix Montgomery multipliers for RSA processors","authors":"A. Miyamoto, N. Homma, T. Aoki, Akashi Satoh","doi":"10.1109/ICCD.2008.4751894","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751894","url":null,"abstract":"The present paper proposes a systematic design approach to provide the optimal high-radix Montgomery multipliers for an RSA processor satisfying user requirements. We introduces three multiplier-based architectures using different intermediate-data forms ((i) single form, (ii) semi carry-save form, and (iii) carry-save form, and combined them with a wide variety of arithmetic components. Their radices are also parameterized from 28 to 264. A total of 202 designs for 1,024-bit RSA processors were obtained for each radix, and were synthesized using a 90-nm CMOS standard cell library. The smallest design of 0.9 Kgates with 137.8 ms/RSA to the fastest design of 1.8 ms/RSA at 74.7 Kgates were then obtained. In addition, the optimal design to meet the user requirements can be easily obtained from all the combinations. In addition to choosing the datapath architecture, the arithmetic component, and the radix parameters, the proposed systematic approach can also adopt other process technologies.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127832731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Seamless sequence of software defined radio designs through hardware reconfigurability of FPGAs
Pub Date: 2008-10-01 | DOI: 10.1109/ICCD.2008.4751871
A. H. Gholamipour, E. Bozorgzadeh, L. Bao
Software Defined Radio (SDR) base stations can compensate for failures in disaster scenarios by assimilating different communication technologies. FPGAs play an important role in SDR base station platforms because of the flexibility and DSP processing power they deliver. This flexibility comes at the high cost of reconfiguration time overhead, which can be a serious deterrent given the QoS requirements of real-time traffic. In this paper we propose a system-level solution for reducing reconfiguration time overhead, where the configuration of each wireless system is given. We then go a step further and integrate our solution into a floorplanner that generates placements for wireless systems which systematically hide or reduce reconfiguration time overhead. Our experiments show the effectiveness of our approach.
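The benefit of reusing modules shared between successive SDR configurations can be sketched with a simple model (an illustration, not the paper's floorplanner): only the modules that differ between the current and target systems need to be reloaded, so the reconfiguration time is bounded by their partial bitstream sizes. The module names, bitstream sizes, and configuration port throughput below are assumptions.

```python
# Hedged sketch: when switching between two SDR system configurations, only the
# modules that differ need to be reconfigured. Keeping shared modules resident
# and placed in the same regions bounds reconfiguration time by the bitstream
# size of the differing modules only.

def reconfig_time(current, target, module_bytes, bytes_per_sec):
    """Estimate reconfiguration time when switching current -> target.

    current, target: sets of module names making up each wireless system
    module_bytes:    dict module -> partial bitstream size in bytes (assumed)
    bytes_per_sec:   configuration port throughput (assumed)
    """
    to_load = target - current              # shared modules stay in place
    return sum(module_bytes[m] for m in to_load) / bytes_per_sec

# Illustrative numbers only.
module_bytes = {"fft": 300_000, "viterbi": 250_000, "turbo": 400_000,
                "ofdm_mod": 200_000, "cdma_despread": 350_000}
sys_a = {"fft", "viterbi", "ofdm_mod"}
sys_b = {"fft", "turbo", "ofdm_mod"}
full_reload = sum(module_bytes[m] for m in sys_b) / 100e6
incremental = reconfig_time(sys_a, sys_b, module_bytes, 100e6)
print(f"full reload {full_reload*1e3:.1f} ms vs incremental {incremental*1e3:.1f} ms")
```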
{"title":"Seamless sequence of software defined radio designs through hardware reconfigurability of FPGAs","authors":"A. H. Gholamipour, E. Bozorgzadeh, L. Bao","doi":"10.1109/ICCD.2008.4751871","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751871","url":null,"abstract":"Software Defined Radio (SDR) base stations can compensate for failures in disaster scenarios by assimilating different communication technologies. FPGAs play an important role in the platform of an SDR base station because of flexibility and DSP processing power that they deliver. The flexibility of FPGAs comes at the high cost of reconfiguration time overhead which can be a serious deterrence because of QoS requirements of real time traffic. In this paper we propose a solution to reduce reconfiguration time overhead at system-level where we are provided the configuration of each wireless system. Following that we step further and integrate our solution in to a floorplanner to generate placements for wireless systems which can systematically hide or reduce reconfiguration time overhead. Our experiments show the effectiveness of our approach.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128197281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Design and evaluation of an optical CPU-DRAM interconnect
Pub Date: 2008-10-01 | DOI: 10.1109/ICCD.2008.4751906
Amit Hadke, Tony Benavides, R. Amirtharajah, M. Farrens, V. Akella
We present OCDIMM (Optically Connected DIMM), a CPU-DRAM interface that uses multiwavelength optical interconnects. We show that OCDIMM is more scalable and offers higher bandwidth and lower latency than FBDIMM (Fully-Buffered DIMM), a state-of-the-art electrical alternative. Though OCDIMM is more power efficient than FBDIMM, we show that ultimately the total power consumption in the memory subsystem is a key impediment to scalability and thus to achieving truly balanced computing systems in the terascale era.
{"title":"Design and evaluation of an optical CPU-DRAM interconnect","authors":"Amit Hadke, Tony Benavides, R. Amirtharajah, M. Farrens, V. Akella","doi":"10.1109/ICCD.2008.4751906","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751906","url":null,"abstract":"We present OCDIMM (Optically Connected DIMM), a CPU-DRAM interface that uses multiwavelength optical interconnects. We show that OCDIMM is more scalable and offers higher bandwidth and lower latency than FBDIMM (Fully-Buffered DIMM), a state-of-the-art electrical alternative. Though OCDIMM is more power efficient than FBDIMM, we show that ultimately the total power consumption in the memory subsystem is a key impediment to scalability and thus to achieving truly balanced computing systems in the terascale era.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131987309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Prototyping a hybrid main memory using a virtual machine monitor
Pub Date: 2008-10-01 | DOI: 10.1109/ICCD.2008.4751873
Dong Ye, Aravind Pavuluri, Carl A. Waldspurger, Brian Tsang, Bohuslav Rychlik, Steven Woo
We use a novel virtualization-based approach for computer architecture performance analysis. We present a case study analyzing a hypothetical hybrid main memory, which consists of a first-level DRAM augmented by a 10-100x slower second-level memory. This architecture is motivated by the recent emergence of lower-cost, higher-density, and lower-power alternative memory technologies. To model such a system, we customize a virtual machine monitor (VMM) with delay-simulation and instrumentation code. Benchmarks representing server, technical computing, and desktop productivity workloads are evaluated in virtual machines (VMs). Relative to baseline all-DRAM systems, these workloads experience widely varying performance degradation when run on hybrid main memory systems which have significant amounts of second-level memory.
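A back-of-the-envelope model (illustrative constants, not the paper's methodology or data) of why the observed degradation varies so widely across workloads: the effective main-memory latency depends on how often a workload's accesses fall in the slower second level.

```python
# Hedged model of a two-level main memory: a DRAM first level plus a second
# level that is 10-100x slower. The latency numbers are assumptions.

def effective_latency(dram_ns, slow_factor, first_level_hit_rate):
    """Average main-memory access latency when a fraction of accesses
    (1 - first_level_hit_rate) goes to the slower second level."""
    slow_ns = dram_ns * slow_factor
    return first_level_hit_rate * dram_ns + (1 - first_level_hit_rate) * slow_ns

for hit_rate in (0.99, 0.95, 0.80):
    for slow_factor in (10, 100):
        lat = effective_latency(60.0, slow_factor, hit_rate)
        print(f"hit={hit_rate:.2f} slow={slow_factor:3d}x -> {lat:6.1f} ns "
              f"({lat / 60.0:.1f}x DRAM)")
```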
{"title":"Prototyping a hybrid main memory using a virtual machine monitor","authors":"Dong Ye, Aravind Pavuluri, Carl A. Waldspurger, Brian Tsang, Bohuslav Rychlik, Steven Woo","doi":"10.1109/ICCD.2008.4751873","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751873","url":null,"abstract":"We use a novel virtualization-based approach for computer architecture performance analysis. We present a case study analyzing a hypothetical hybrid main memory, which consists of a first-level DRAM augmented by a 10-100x slower second-level memory. This architecture is motivated by the recent emergence of lower-cost, higher-density, and lower-power alternative memory technologies. To model such a system, we customize a virtual machine monitor (VMM) with delay-simulation and instrumentation code. Benchmarks representing server, technical computing, and desktop productivity workloads are evaluated in virtual machines (VMs). Relative to baseline all-DRAM systems, these workloads experience widely varying performance degradation when run on hybrid main memory systems which have significant amounts of second-level memory.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134400656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Early stage FPGA interconnect leakage power estimation
Pub Date: 2008-10-01 | DOI: 10.1109/ICCD.2008.4751898
Shilpa Bhoj, D. Bhatia
Increasing transistor densities, the rising popularity of mobile applications, and the migration towards eco-friendly computing systems have made power dissipation a key FPGA design issue. To meet stringent power budgets, system architects need accurate estimates of power distribution at various design stages. In this work, we make several key contributions to FPGA leakage power estimation. First, we develop an accurate and efficient model to estimate total interconnect leakage power at various design stages prior to routing. Our methods derive leakage power estimates based on predicted values of routing congestion and interconnect resource utilization. We then extend the model to accommodate complex segmented routing architectures and low-leakage architectures. Finally, we formulate relations to generate post-placement leakage power estimates of individual routing channels. Our models for overall leakage power estimation achieve average accuracy rates of 93% and 89% for uniform and segmented routing architectures, respectively. Experimental results also establish the accuracy of the channel-level estimation models at 85% and 80% for uniform and segmented routing structures. Our models and techniques help designers make informed decisions by providing information on the power consumption of the interconnect fabric well before routing. Additionally, the equations can be used for architectural exploration and embedded in power- and thermal-aware CAD tools.
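The flavor of a pre-routing estimate of this kind can be sketched as follows: per-channel leakage is projected from predicted track demand (congestion) before any routing exists, then summed over the fabric. The per-track leakage values, channel width, and demand map below are illustrative assumptions, not the paper's calibrated model.

```python
# Hedged sketch of a pre-routing interconnect leakage estimate: each routing
# channel's leakage is split between tracks predicted to be used and idle
# tracks, using assumed per-track leakage values.

def channel_leakage(predicted_demand, tracks_per_channel,
                    leak_used_uw, leak_idle_uw):
    """Leakage (uW) of one routing channel given predicted wire demand."""
    used = min(predicted_demand, tracks_per_channel)
    idle = tracks_per_channel - used
    return used * leak_used_uw + idle * leak_idle_uw

def total_interconnect_leakage(demand_map, tracks_per_channel=32,
                               leak_used_uw=1.8, leak_idle_uw=0.6):
    """Sum the per-channel estimates over a 2-D map of predicted channel demand."""
    return sum(channel_leakage(d, tracks_per_channel, leak_used_uw, leak_idle_uw)
               for row in demand_map for d in row)

# Illustrative 3x3 congestion map (expected tracks needed per channel).
demand = [[10, 28, 33],
          [ 5, 18, 22],
          [ 2,  9, 14]]
print(f"estimated interconnect leakage: {total_interconnect_leakage(demand):.1f} uW")
```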
{"title":"Early stage FPGA interconnect leakage power estimation","authors":"Shilpa Bhoj, D. Bhatia","doi":"10.1109/ICCD.2008.4751898","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751898","url":null,"abstract":"Increasing transistor densities, rising popularity in mobile applications and migration towards eco-friendly computing systems have made power dissipation a key FPGA design issue. To meet stringent budgets, system architects need accurate estimates of power distribution at various design stages. In this work, we make several key contributions to FPGA leakage power estimation. First, we develop an accurate and efficient model to estimate total interconnect leakage power at various design stages prior to routing. Our methods derive leakage power estimates based on predicted values of routing congestion and interconnect resource utilization. We then extend the model to accomodate complex segmented routing architectures and low leakage architectures. Finally we formulate relations to generate post place leakage power estimates of individual routing channels. Our models for overall leakage power estimation achieve average accuracy rates of 93% and 89% for uniform and segmented routing architectures respectively. Experimentation results also establish the accuracy of the channel level estimation models at 85% and 80% for uniform and segmented routing structures. Our models and techniques would help designers make informed decisions by providing information on the power consumption of the interconnect fabric well before routing. Additionally, the equations can be used for architectural explorations and embedded in power and thermal aware CAD tools.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129021637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Router and cell library co-development for improving redundant via insertion at pins
Pub Date: 2008-10-01 | DOI: 10.1109/ICCD.2008.4751929
Wei-Chiu Tseng, Yu-Hsing Chen, Rung-Bin Lin
In this paper we propose a synergetic approach that integrates router design and cell library engineering for improving post-routing via1 (via between M1 and M2) doubling rate at pins. We develop a double-via (DV) aware multilevel router to exploit the via1 doubling possibilities provided to the cells in a conventional as well as a DV-driven cell library. Compared to a non-DV-aware router using a conventional cell library, our approach using a DV-driven library can on average raise via1 doubling rate by 34%, raise total via doubling rate by 11%, reduce the total number of vias by 3%, and reduce the total number of via1s by 8%. All this can be achieved without incurring any performance and area penalties.
{"title":"Router and cell library co-development for improving redundant via insertion at pins","authors":"Wei-Chiu Tseng, Yu-Hsing Chen, Rung-Bin Lin","doi":"10.1109/ICCD.2008.4751929","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751929","url":null,"abstract":"In this paper we propose a synergetic approach that integrates router design and cell library engineering for improving post-routing via1 (via between M1 and M2) doubling rate at pins. We develop a double-via (DV) aware multilevel router to exploit the via1 doubling possibilities provided to the cells in a conventional as well as a DV-driven cell library. Compared to a non-DV-aware router using a conventional cell library, our approach using a DV-driven library can on average raise via1 doubling rate by 34%, raise total via doubling rate by 11%, reduce the total number of vias by 3%, and reduce the total number of via1s by 8%. All this can be achieved without incurring any performance and area penalties.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129308142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Comparative analysis of NBTI effects on low power and high performance flip-flops
Pub Date: 2008-10-01 | DOI: 10.1109/ICCD.2008.4751862
K. Ramakrishnan, Xiaoxia Wu, N. Vijaykrishnan, Yuan Xie
Mitigating circuit aging in digital circuits has become a very important concern for current and future technology nodes. Negative Bias Temperature Instability (NBTI) is one of the most important circuit aging mechanisms and can incur timing errors. Flip-flops play a vital role as storage elements in pipelined architectures and are prone to the effects of aging. NBTI increases the transistor threshold voltage, affecting the performance of the chip. In this paper, we study the effects of NBTI on the timing characteristics of different types of low-power and high-performance flip-flops. Factors that affect the degradation rate, such as input data probability and temperature, are also analyzed.
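As a rough illustration of the dependencies the paper analyzes (a heavily simplified model with illustrative constants, not the one used in the paper), the sketch below combines a power-law-in-time, Arrhenius-in-temperature NBTI threshold shift that scales with the input signal probability, and translates the shift into a relative gate-delay increase via the alpha-power law.

```python
# Hedged, simplified NBTI illustration: Vth shift grows with stress probability,
# temperature, and time, and the shift slows the gate through an alpha-power
# delay dependence. All constants are illustrative assumptions.
import math

K_BOLTZ = 8.617e-5          # Boltzmann constant, eV/K

def delta_vth(stress_prob, temp_k, years, a0=0.1, ea=0.13, n=1.0 / 6.0):
    """Simplified long-term NBTI Vth shift (V) for a PMOS stressed a fraction
    stress_prob of the time (the input-data-probability dependence), at
    temperature temp_k, after `years` of operation."""
    t_sec = years * 365 * 24 * 3600
    return a0 * math.exp(-ea / (K_BOLTZ * temp_k)) * (stress_prob * t_sec) ** n

def relative_delay(dvth, vdd=1.1, vth0=0.35, alpha=1.3):
    """Gate delay relative to the fresh device (alpha-power law)."""
    return ((vdd - vth0) / (vdd - (vth0 + dvth))) ** alpha

for prob in (0.1, 0.5, 0.9):
    for temp in (300, 380):
        dv = delta_vth(prob, temp, years=3)
        print(f"prob={prob:.1f} T={temp}K  dVth={dv * 1000:5.1f} mV  "
              f"delay x{relative_delay(dv):.3f}")
```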
{"title":"Comparative analysis of NBTI effects on low power and high performance flip-flops","authors":"K. Ramakrishnan, Xiaoxia Wu, N. Vijaykrishnan, Yuan Xie","doi":"10.1109/ICCD.2008.4751862","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751862","url":null,"abstract":"Mitigating the circuit aging effect in digital circuits has become a very important concern for current and future technology nodes. Negative Bias Temperature Instability (NBTI) is one of the most important circuit aging mechanisms, which can incur timing errors. Flip-flops play a vital role as storage elements in pipelined architectures and are prone to effects of aging. NBTI increases the transistor threshold voltage, affecting the performance of the chip. In this paper, we study the effects of NBTI on the timing characteristics of different types of low power and high performance flip-flops. Factors such as input data probability and temperature which affect the degradation rate are also analyzed.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"322 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116364565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Optimization of Propagate Partial SAD and SAD tree motion estimation hardwired engine for H.264
Pub Date: 2008-10-01 | DOI: 10.1109/ICCD.2008.4751881
Zhenyu Liu, S. Goto, T. Ikenaga
The variable block size motion estimation algorithm is an efficient approach to reducing temporal redundancy and has been adopted by the latest video coding standard, H.264/AVC. The increase in computational complexity introduced by the variable block size technique makes a hardwired accelerator essential, especially for real-time applications. In this paper, the authors apply architecture-level and circuit-level approaches to improve the performance of the Propagate Partial SAD and SAD Tree hardwired engines, which outperform other counterparts when the impact of supporting the variable block size technique is considered. Experiments demonstrate that, compared with the original architectures, the proposed approaches save 14.7% and 18.0% of the hardware cost for the Propagate Partial SAD and SAD Tree architectures, respectively. With TSMC 0.18 µm 1P6M CMOS technology, the proposed Propagate Partial SAD architecture attains a 231.6 MHz operating frequency at a cost of 84.1 k gates. Correspondingly, the execution speed of the optimized SAD Tree architecture improves to 204.8 MHz with 88.5 k gates of hardware.
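The data reuse that both Propagate Partial SAD and SAD Tree engines exploit can be shown in a few lines of software (a plain model, not the paper's circuits): the sixteen 4×4 SADs of a macroblock are computed once and then summed to obtain the SADs of every larger H.264 partition.

```python
# Hedged sketch of SAD reuse for variable-block-size motion estimation: compute
# the 4x4 grid of 4x4-block SADs for a 16x16 macroblock, then merge them into
# the SADs of all larger partitions (4x8, 8x4, 8x8, 8x16, 16x8, 16x16), which a
# SAD-tree style engine does with an adder tree in hardware.

def sad_4x4_grid(cur, ref):
    """4x4 grid of 4x4-block SADs between two 16x16 pixel blocks (lists of lists)."""
    grid = [[0] * 4 for _ in range(4)]
    for by in range(4):
        for bx in range(4):
            grid[by][bx] = sum(
                abs(cur[by * 4 + y][bx * 4 + x] - ref[by * 4 + y][bx * 4 + x])
                for y in range(4) for x in range(4))
    return grid

def merge(grid, h, w):
    """SADs of all (h*4) x (w*4) partitions, reusing the precomputed 4x4 SADs."""
    return [[sum(grid[by + dy][bx + dx] for dy in range(h) for dx in range(w))
             for bx in range(0, 4, w)] for by in range(0, 4, h)]

# Example: sad16x16 = merge(grid, 4, 4)[0][0]; the four 8x8 SADs = merge(grid, 2, 2).
```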
{"title":"Optimization of Propagate Partial SAD and SAD tree motion estimation hardwired engine for H.264","authors":"Zhenyu Liu, S. Goto, T. Ikenaga","doi":"10.1109/ICCD.2008.4751881","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751881","url":null,"abstract":"Variable block size motion estimation algorithm is the effcient approach to reduce the temporal redundancies and it has been adopted by the latest video coding standard H.264/AVC. The computational complexity augment coming from the variable block size technique makes the hardwired accelerator essential, especially for real-time applications. In this paper, the authors apply the architecture level and the circuits level approaches to improve the performance of Propagate Partial SAD and SAD Tree hardwired engines, which outperform other counterparts when considering the impact of supporting the variable block size technique. Experiments demonstrate that by using the proposed approaches, compared with the original architectures, 14.7% and 18.0% hardware cost can be saved for Propagate Partial SAD architecture and SAD Tree architecture, respectively. With TSMC 0.18 mm 1P6M CMOS technology, the proposed Propagate Partial SAD architecture attains 231.6 MHz operating frequency at a cost of 84.1 k gates. Correspondingly, the execution speed of the optimized SAD Tree architecture is improved to 204.8 MHz with 88.5 k gate hardware overhead.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114443297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}