Pub Date : 2008-10-01DOI: 10.1109/ICCD.2008.4751876
D. Bol, R. Ambroise, D. Flandre, J. Legat
Over the last decade, the design of ultra-low-power digital circuits in subthreshold regime has been driven by the quest for minimum energy per operation. In this contribution, we observe that operating at minimum-energy point is not straightforward as design constraints from real-life applications have an important impact on energy. Therefore, we introduce the alternative concept of practical energy, taking functional-yield and throughput constraints on minimum Vdd into account. In this context, we demonstrate for the first time the detrimental impact of DIBL on minimum Vdd. Practical energy gives a useful analysis framework of circuit optimization to reach minimum-energy point, while considering the throughput as an input variable dictated by the application. From simulation of a benchmark multiplier in 45 nm technology, we find out that practical energy can be far higher than minimum energy point, in the case of low-throughput applications (ap 10-100 kOp/s) because of static leakage energy and robustness-limited minimum Vdd. With the proposed framework, we investigate the capability of conventional optimization techniques to make practical energy meet minimum energy point. Amongst these techniques, channel length upsize is shown to be more efficient than MTCMOS power gating, body biasing, Vt selection or device width upsize, as it increases robustness while simultaneously reducing static leakage energy. A small length upsize with low area overhead is shown to reduce practical energy at low throughput to less than 2.1 times the minimum energy level. At medium throughput, it even brings practical energy 30% lower than minimum energy level without optimization techniques.
{"title":"Analysis and minimization of practical energy in 45nm subthreshold logic circuits","authors":"D. Bol, R. Ambroise, D. Flandre, J. Legat","doi":"10.1109/ICCD.2008.4751876","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751876","url":null,"abstract":"Over the last decade, the design of ultra-low-power digital circuits in subthreshold regime has been driven by the quest for minimum energy per operation. In this contribution, we observe that operating at minimum-energy point is not straightforward as design constraints from real-life applications have an important impact on energy. Therefore, we introduce the alternative concept of practical energy, taking functional-yield and throughput constraints on minimum Vdd into account. In this context, we demonstrate for the first time the detrimental impact of DIBL on minimum Vdd. Practical energy gives a useful analysis framework of circuit optimization to reach minimum-energy point, while considering the throughput as an input variable dictated by the application. From simulation of a benchmark multiplier in 45 nm technology, we find out that practical energy can be far higher than minimum energy point, in the case of low-throughput applications (ap 10-100 kOp/s) because of static leakage energy and robustness-limited minimum Vdd. With the proposed framework, we investigate the capability of conventional optimization techniques to make practical energy meet minimum energy point. Amongst these techniques, channel length upsize is shown to be more efficient than MTCMOS power gating, body biasing, Vt selection or device width upsize, as it increases robustness while simultaneously reducing static leakage energy. A small length upsize with low area overhead is shown to reduce practical energy at low throughput to less than 2.1 times the minimum energy level. At medium throughput, it even brings practical energy 30% lower than minimum energy level without optimization techniques.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133756057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-10-01DOI: 10.1109/ICCD.2008.4751891
Arun A. Nair, L. John
Increasing sizes of benchmarks make detailed simulation an extremely time consuming process. Statistical techniques such as the SimPoint methodology have been proposed in order to address this problem during the initial design phase. The SimPoint methodology attempts to identify repetitive, long, large-grain phases in programs and predict the performance of the architecture based on its aggregate performance on the individual phases. This study attempts to compare accuracy of the SimPoint methodology for the SPEC CPU 2006 benchmark suite with that of SPEC CPU 2000 and to study the large-grain phases in the two benchmark suites using the SimPoint methodology. We find that there has not been a significant increase in the number of simulation points required to accurately predict the behavior of the programs in SPEC CPU 2006, despite its significantly larger data footprint and dynamic instruction count. We also find that the programs in both benchmark suites have similar characteristics in terms of the number of phases that contribute significantly towards overall behavior, further emphasizing the similarity between the two benchmark suites with respect to the number of simulation points required for similar accuracies.
越来越多的基准测试使得详细的模拟成为一个非常耗时的过程。为了在初始设计阶段解决这个问题,已经提出了SimPoint方法等统计技术。SimPoint方法试图识别程序中重复的、长时间的、大粒度的阶段,并根据单个阶段的总体性能预测体系结构的性能。本研究试图比较SPEC CPU 2006基准套件与SPEC CPU 2000基准套件的SimPoint方法的准确性,并使用SimPoint方法研究两个基准套件中的大粒度阶段。我们发现,在SPEC CPU 2006中,准确预测程序行为所需的模拟点数量并没有显著增加,尽管它的数据占用和动态指令计数明显增加。我们还发现,两个基准套件中的程序在对总体行为有重大贡献的阶段数量方面具有相似的特征,进一步强调了两个基准套件之间在相似精度所需的模拟点数量方面的相似性。
{"title":"Simulation points for SPEC CPU 2006","authors":"Arun A. Nair, L. John","doi":"10.1109/ICCD.2008.4751891","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751891","url":null,"abstract":"Increasing sizes of benchmarks make detailed simulation an extremely time consuming process. Statistical techniques such as the SimPoint methodology have been proposed in order to address this problem during the initial design phase. The SimPoint methodology attempts to identify repetitive, long, large-grain phases in programs and predict the performance of the architecture based on its aggregate performance on the individual phases. This study attempts to compare accuracy of the SimPoint methodology for the SPEC CPU 2006 benchmark suite with that of SPEC CPU 2000 and to study the large-grain phases in the two benchmark suites using the SimPoint methodology. We find that there has not been a significant increase in the number of simulation points required to accurately predict the behavior of the programs in SPEC CPU 2006, despite its significantly larger data footprint and dynamic instruction count. We also find that the programs in both benchmark suites have similar characteristics in terms of the number of phases that contribute significantly towards overall behavior, further emphasizing the similarity between the two benchmark suites with respect to the number of simulation points required for similar accuracies.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125095867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-10-01DOI: 10.1109/ICCD.2008.4751836
C. Fobel, G. Grewal
Global routing of macro cells remains an important but time-consuming step in the VLSI design cycle. Macro cells are large, irregularly sized parameterized circuit modules that typically contain large numbers of terminals that must be interconnected. The interconnection pattern for each set of terminals (net) that must be connected is a Steiner tree, and the primary sub-problem in the global routing of macro cells is to find a set of dissimilar, low-cost Steiner trees for each net that must be routed. In this paper, a two-phase, parallel (multi-processor) algorithm is proposed for quickly constructing a diverse pool of high-quality Steiner trees for routing of multi-terminal nets. In the first phase, a single Steiner tree is constructed using a heuristic, called Shrubbery. Then, in the second phase, a pool of dissimilar, high-quality trees are created from the original tree, by running multiple instances of a local search in parallel. Computational experiments performed on over 800 commonly used benchmarks show that running multiple instances of the local search in parallel results in near-linear speed-up over the serial case. Most importantly, the trees produced are both high-quality and dissimilar, allowing for numerous routing possibilities for each net.
{"title":"A parallel Steiner tree heuristic for macro cell routing","authors":"C. Fobel, G. Grewal","doi":"10.1109/ICCD.2008.4751836","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751836","url":null,"abstract":"Global routing of macro cells remains an important but time-consuming step in the VLSI design cycle. Macro cells are large, irregularly sized parameterized circuit modules that typically contain large numbers of terminals that must be interconnected. The interconnection pattern for each set of terminals (net) that must be connected is a Steiner tree, and the primary sub-problem in the global routing of macro cells is to find a set of dissimilar, low-cost Steiner trees for each net that must be routed. In this paper, a two-phase, parallel (multi-processor) algorithm is proposed for quickly constructing a diverse pool of high-quality Steiner trees for routing of multi-terminal nets. In the first phase, a single Steiner tree is constructed using a heuristic, called Shrubbery. Then, in the second phase, a pool of dissimilar, high-quality trees are created from the original tree, by running multiple instances of a local search in parallel. Computational experiments performed on over 800 commonly used benchmarks show that running multiple instances of the local search in parallel results in near-linear speed-up over the serial case. Most importantly, the trees produced are both high-quality and dissimilar, allowing for numerous routing possibilities for each net.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124779796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-10-01DOI: 10.1142/S0218126614500315
Qianghua Zhu, Fei Xia, Guoqing Jin
In the field of RNA secondary structure prediction, the Zuker algorithm is one of the most popular methods using free energy minimization. However, general-purpose computers including parallel computers or multi-core computers exhibit parallel efficiency of no more than 50% on Zuker. FPGA chips provide a new approach to accelerate the Zuker algorithm by exploiting fine-grained custom design. Zuker shows complicated data dependences, in which the dependence distance is variable, and the dependence direction is also across two dimensions. We propose a systolic array structure including one master PE and multiple slave PEs for fine grain hardware implementation on FPGA. We exploit data reuse schemes to reduce the need to load energy matrices from external memory. We also propose several methods to reduce energy table parameter size by 85%. To our knowledge, our implementation with 16 PEs is the only FPGA accelerator implementing the complete Zuker algorithm. The experimental results show a factor of 14 speedup over the ViennaRNA-1.6.5 software for 2981-residue RNA sequence running on a PC platform with Pentium 4 2.6 GHz CPU.
{"title":"Fine-grained parallel application specific computing for RNA secondary structure prediction on FPGA","authors":"Qianghua Zhu, Fei Xia, Guoqing Jin","doi":"10.1142/S0218126614500315","DOIUrl":"https://doi.org/10.1142/S0218126614500315","url":null,"abstract":"In the field of RNA secondary structure prediction, the Zuker algorithm is one of the most popular methods using free energy minimization. However, general-purpose computers including parallel computers or multi-core computers exhibit parallel efficiency of no more than 50% on Zuker. FPGA chips provide a new approach to accelerate the Zuker algorithm by exploiting fine-grained custom design. Zuker shows complicated data dependences, in which the dependence distance is variable, and the dependence direction is also across two dimensions. We propose a systolic array structure including one master PE and multiple slave PEs for fine grain hardware implementation on FPGA. We exploit data reuse schemes to reduce the need to load energy matrices from external memory. We also propose several methods to reduce energy table parameter size by 85%. To our knowledge, our implementation with 16 PEs is the only FPGA accelerator implementing the complete Zuker algorithm. The experimental results show a factor of 14 speedup over the ViennaRNA-1.6.5 software for 2981-residue RNA sequence running on a PC platform with Pentium 4 2.6 GHz CPU.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130234332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-10-01DOI: 10.1109/ICCD.2008.4751869
Zhibin Xiao, B. Baas
This paper presents a high-performance parallel context-based adaptive length coding (CAVLC) encoder implemented on a fine-grained many-core system. The software encoder is designed for a H.264/AVC baseline profile encoder. By utilizing arithmetic table elimination and compression techniques, the data-flow of the CAVLC encoder has been partitioned and mapped to an array of 15 small processors. The parallel workload of each processor is characterized and balanced for further throughput optimization. The proposed parallel CAVLC encoder achieves the real-time processing requirement of 30 frames per second for 720 p HDTV. Our experiments show that the presented CAVLC encoder has 4.86 to 6.83 times higher throughput and requires far smaller chip area than the identical encoder implemented on state-of-art general-purpose processors. In comparison to published implementations on common DSP processors, the design has approximately 1.0 to 6.15 times higher throughput while requiring less than 6 times smaller area.
{"title":"A high-performance parallel CAVLC encoder on a fine-grained many-core system","authors":"Zhibin Xiao, B. Baas","doi":"10.1109/ICCD.2008.4751869","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751869","url":null,"abstract":"This paper presents a high-performance parallel context-based adaptive length coding (CAVLC) encoder implemented on a fine-grained many-core system. The software encoder is designed for a H.264/AVC baseline profile encoder. By utilizing arithmetic table elimination and compression techniques, the data-flow of the CAVLC encoder has been partitioned and mapped to an array of 15 small processors. The parallel workload of each processor is characterized and balanced for further throughput optimization. The proposed parallel CAVLC encoder achieves the real-time processing requirement of 30 frames per second for 720 p HDTV. Our experiments show that the presented CAVLC encoder has 4.86 to 6.83 times higher throughput and requires far smaller chip area than the identical encoder implemented on state-of-art general-purpose processors. In comparison to published implementations on common DSP processors, the design has approximately 1.0 to 6.15 times higher throughput while requiring less than 6 times smaller area.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123303639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-10-01DOI: 10.1109/ICCD.2008.4751834
Rajesh Garg, S. Khatri
In this paper, we present a new radiation tolerant CMOS standard cell library, and demonstrate its effectiveness in implementing radiation hardened digital circuits. We exploit the fact that if a gate is implemented using only PMOS (NMOS) transistors then a radiation particle strike can result only in logic a 0 to 1 (1 to 0) flip. Based on this observation, we derive our radiation hardened gates from regular static CMOS gates. In particular, we separate the PMOS and NMOS devices, and split the gate output into two signals. One of these outputs of our radiation tolerant gate is generated using PMOS transistors, and it drives other PMOS transistors (only) in its fanout. Similarly, the other output (generated from NMOS transistors) drives only other NMOS transistors in its fanout. Now, if a radiation particle strikes one of the outputs of the radiation tolerant gate, then the gates in the fanout enter a high-impedance state, and hence preserve their output values. Our radiation hardened gates exhibit an extremely high degree of SEU tolerance, which is validated at the circuit level. Using these gates, we also implement circuit level hardening based on logical masking, to selectively harden those gates in a circuit which contribute most to the soft error failure of the circuit. The gates with a low probability of logical masking are replaced by SEU tolerant gates from our new library, such that the digital design achieves a 90% soft error rate reduction. Experimental results demonstrate that this reduction is achieved with a modest layout area and delay penalty of 62% and 29% respectively, for area mapped designs. In contrast with existing approaches, our approach results in SEU immunity for extremely large critical charge values (>650fC).
{"title":"A novel, highly SEU tolerant digital circuit design approach","authors":"Rajesh Garg, S. Khatri","doi":"10.1109/ICCD.2008.4751834","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751834","url":null,"abstract":"In this paper, we present a new radiation tolerant CMOS standard cell library, and demonstrate its effectiveness in implementing radiation hardened digital circuits. We exploit the fact that if a gate is implemented using only PMOS (NMOS) transistors then a radiation particle strike can result only in logic a 0 to 1 (1 to 0) flip. Based on this observation, we derive our radiation hardened gates from regular static CMOS gates. In particular, we separate the PMOS and NMOS devices, and split the gate output into two signals. One of these outputs of our radiation tolerant gate is generated using PMOS transistors, and it drives other PMOS transistors (only) in its fanout. Similarly, the other output (generated from NMOS transistors) drives only other NMOS transistors in its fanout. Now, if a radiation particle strikes one of the outputs of the radiation tolerant gate, then the gates in the fanout enter a high-impedance state, and hence preserve their output values. Our radiation hardened gates exhibit an extremely high degree of SEU tolerance, which is validated at the circuit level. Using these gates, we also implement circuit level hardening based on logical masking, to selectively harden those gates in a circuit which contribute most to the soft error failure of the circuit. The gates with a low probability of logical masking are replaced by SEU tolerant gates from our new library, such that the digital design achieves a 90% soft error rate reduction. Experimental results demonstrate that this reduction is achieved with a modest layout area and delay penalty of 62% and 29% respectively, for area mapped designs. In contrast with existing approaches, our approach results in SEU immunity for extremely large critical charge values (>650fC).","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120961689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-10-01DOI: 10.1109/ICCD.2008.4751904
M. Gschwind
Heterogeneous Chip Multiprocessors (HMPs), such as the Cell Broadband Engine, offer a new design optimization opportunity by allowing designers to provide accelerators for application specific domains. Data sharing between CPUs and accelerators, and memory access mechanisms and protocols are crucial decisions in the design of an HMP. In this article, we analyze the choices between hardware and software managed coherence between CPU and accelerators for DMA-based data sharing, and find that hardware-coherent DMA shows a performance benefit of up to 3x, even for simple workloads.We explore memory address translation architecture choices for DMA-based data sharing. In multiprogramming environments, address translation is commonly used to separate processes. For efficiency, direct access to system memory requires address translation capabilities in the accelerator. We find that hardware managed address translation shows a performance benefit of up to 5x, even for simple workloads, by avoiding the costs of accelerator/CPU communication and supervisor management of the translation context and the introduction of a serial bottleneck on the CPU.
{"title":"Optimizing data sharing and address translation for the Cell BE Heterogeneous Chip Multiprocessor","authors":"M. Gschwind","doi":"10.1109/ICCD.2008.4751904","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751904","url":null,"abstract":"Heterogeneous Chip Multiprocessors (HMPs), such as the Cell Broadband Engine, offer a new design optimization opportunity by allowing designers to provide accelerators for application specific domains. Data sharing between CPUs and accelerators, and memory access mechanisms and protocols are crucial decisions in the design of an HMP. In this article, we analyze the choices between hardware and software managed coherence between CPU and accelerators for DMA-based data sharing, and find that hardware-coherent DMA shows a performance benefit of up to 3x, even for simple workloads.We explore memory address translation architecture choices for DMA-based data sharing. In multiprogramming environments, address translation is commonly used to separate processes. For efficiency, direct access to system memory requires address translation capabilities in the accelerator. We find that hardware managed address translation shows a performance benefit of up to 5x, even for simple workloads, by avoiding the costs of accelerator/CPU communication and supervisor management of the translation context and the introduction of a serial bottleneck on the CPU.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125674744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-10-01DOI: 10.1109/ICCD.2008.4751880
A. Suleiman, H. Saleh, A. Hussein, D. Akopian
The paper presents a family of architectures for FFT implementation based on the decomposition of the perfect shuffle permutation, which can be designed with variable number of processing elements. This provides designers with a trade-off choice of speed vs. complexity (cost and area.). A detailed case study is provided on the implementation of 1024-point FFT with 2 processing elements using 45 nm process technology, including area, timing, power and place-and-route results.
{"title":"A family of scalable FFT architectures and an implementation of 1024-point radix-2 FFT for real-time communications","authors":"A. Suleiman, H. Saleh, A. Hussein, D. Akopian","doi":"10.1109/ICCD.2008.4751880","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751880","url":null,"abstract":"The paper presents a family of architectures for FFT implementation based on the decomposition of the perfect shuffle permutation, which can be designed with variable number of processing elements. This provides designers with a trade-off choice of speed vs. complexity (cost and area.). A detailed case study is provided on the implementation of 1024-point FFT with 2 processing elements using 45 nm process technology, including area, timing, power and place-and-route results.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126157235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-10-01DOI: 10.1109/ICCD.2008.4751837
I. Jiang, Yen-Ting Yu
The rectilinear Steiner minimal tree (RSMT) problem is essential in physical design. Moreover, the variant constraints for fabrication issues, including obstacle avoidance, multiple routing layers, layer-specific routing directions, cannot be ignored during RSMT construction for modern SoC and nano technologies. This paper proposes a construction-by-correction approach for obstacle-avoiding preferred direction rectilinear Steiner tree construction. Experimental results show that our algorithm is promising and outperforms the state-of-the-art works.
{"title":"Configurable rectilinear Steiner tree construction for SoC and nano technologies","authors":"I. Jiang, Yen-Ting Yu","doi":"10.1109/ICCD.2008.4751837","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751837","url":null,"abstract":"The rectilinear Steiner minimal tree (RSMT) problem is essential in physical design. Moreover, the variant constraints for fabrication issues, including obstacle avoidance, multiple routing layers, layer-specific routing directions, cannot be ignored during RSMT construction for modern SoC and nano technologies. This paper proposes a construction-by-correction approach for obstacle-avoiding preferred direction rectilinear Steiner tree construction. Experimental results show that our algorithm is promising and outperforms the state-of-the-art works.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127255182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-10-01DOI: 10.1109/ICCD.2008.4751912
Shih-Hung Weng, Yu-Min Kuo, Shih-Chieh Chang, M. Marek-Sadowska
IR drop noise has become a critical issue in advanced process technologies. Traditionally, timing analysis in which the IR drop noise is considered assumes a worst-case IR drop for each gate; however, using this assumption provides unduly pessimistic results. In this paper, we describe a timing analysis approach for power gating designs. To improve the accuracy of the gate delay calculation we determine the virtual voltage level by taking into account the IR drop waveforms across the sleep transistors. These can be obtained efficiently using a linear programming approach. Our experimental results are very promising.
{"title":"Timing analysis considering IR drop waveforms in power gating designs","authors":"Shih-Hung Weng, Yu-Min Kuo, Shih-Chieh Chang, M. Marek-Sadowska","doi":"10.1109/ICCD.2008.4751912","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751912","url":null,"abstract":"IR drop noise has become a critical issue in advanced process technologies. Traditionally, timing analysis in which the IR drop noise is considered assumes a worst-case IR drop for each gate; however, using this assumption provides unduly pessimistic results. In this paper, we describe a timing analysis approach for power gating designs. To improve the accuracy of the gate delay calculation we determine the virtual voltage level by taking into account the IR drop waveforms across the sleep transistors. These can be obtained efficiently using a linear programming approach. Our experimental results are very promising.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126456009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}