A majority of applications require the cooperation of two or more independently designed, separately located, but mutually affecting subsystems. In addition to good behavior of each subsystem, effective coordination is essential to achieve the desired overall performance. However, such coordination is very difficult to attain, mainly due to the lack of precise system models and/or dynamic parameters. In such situations, evolvable hardware (EHW) techniques, which can achieve the sophisticated level of information processing the brain is capable of, can excel. In this paper, a new virtual-reconfigurable-circuit-based drive circuit for the array elements of a smart antenna, built using the techniques of evolved operators, is presented. The idea of this work is to develop a system that is tolerant to array element failure (fault tolerance) by utilizing a phased array input programmer connected to a programmable VLSI chip. The approach chosen here is based on functional-level evolution, whose architecture contains many nonlinear functions and uses an evolutionary algorithm to evolve the best configuration. The system is tested for its effectiveness by choosing real-time phase control in a three-element smart antenna array with three input phases and introducing different element failures, such as an element failing as an open circuit, a sensor failing as a short circuit, noise added to an individual element, and multiple element failures. In each case the mean square error is computed and used as the performance index.
{"title":"Fault Tolerant Dynamic Antenna Array in Smart Antenna System Using Evolved Virtual Reconfigurable Circuit","authors":"D. Dhanasekaran, K. Bagan","doi":"10.1109/VLSI.2008.32","DOIUrl":"https://doi.org/10.1109/VLSI.2008.32","url":null,"abstract":"A majority of applications require cooperation of two or more independently designed, separately located, but mutually affecting subsystems. In addition to good behavior of each of the subsystems, an effective coordination is very important to achieve the desired overall performance. However, such a co-ordination is very difficult to attain mainly due to the lack of precise system models and/or dynamic parameters. In such situations, the evolvable hardware (EHW) techniques, which can achieve the sophisticated level of information processing the brain is capable of, can excel. In this paper, a new virtual reconfigurable circuit based drive circuit for array elements in smart antenna using the techniques of evolved operators is presented. The idea of this work is to develop a system that is tolerant to array element failure (fault tolerance) by utilizing phased array input programmer connected to a programmable VLSI chip. The approach chosen here is based on functional level evolution whose architecture contains many nonlinear functions and uses an evolutionary algorithm to evolve the best configuration. The system is tested for its effectiveness by choosing a real-time phase control in three element array of smart antenna with three input phases and introducing different element failures such as: element fails as open circuit, sensor fails as short circuit, noise added to individual element, multiple element failure etc.. 
In each case the mean square error is computed and used as the performance index.","PeriodicalId":143886,"journal":{"name":"21st International Conference on VLSI Design (VLSID 2008)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115423170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a retimed, decomposed, inversion-less serial Berlekamp-Massey (BM) architecture for Reed-Solomon (RS) decoding. The key idea is to apply the retiming technique to the critical path in order to achieve high decoding performance. The standard-basis irregular fully parallel multiplier is separated into partial product generation (PPG) and partial product reduction (PPR) stages to implement the proposed modified decomposed inversion-less serial BM algorithm. The proposed RS(255,239) decoder is implemented in Verilog HDL and synthesized with a 0.18 μm CMOS std 130 standard cell library. The proposed architecture achieves an almost 76% increase in speed and throughput, and can be used in high-speed, high-throughput applications such as DVD and optical fiber communications.
{"title":"Retimed Decomposed Serial Berlekamp-Massey (BM) Architecture for High-Speed Reed-Solomon Decoding","authors":"Shahid Rizwan","doi":"10.1109/VLSI.2008.45","DOIUrl":"https://doi.org/10.1109/VLSI.2008.45","url":null,"abstract":"This paper presents a retimed decomposed inversion-less serial Berlekamp-Massey (BM) architecture for Reed Solomon (RS) decoding. The key idea is to apply the retiming technique into the critical path in order to achieve high decoding performance. The standard basis irregular fully parallel multiplier is separated into partial product generation (PPG) and partial product reduction (PPR) stages to implement the proposed modified decomposed inversion-less serial BM algorithm. The proposed RS (255,239) decoder is implemented in verilog HDL and synthesized with 0.18 mum CMOS std 130 standard cell library. The proposed architecture achieves almost 76 % increase in speed and throughput, and can be used in high-speed and high-throughput applications such as DVD, optical fiber communications, etc.","PeriodicalId":143886,"journal":{"name":"21st International Conference on VLSI Design (VLSID 2008)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130776542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
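The split of a standard-basis multiplier into partial product generation and partial product reduction can be illustrated in software. The following is only a minimal sketch, not the authors' hardware design: it assumes the field GF(2^8) with the polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), a common choice for RS(255,239) that the abstract does not confirm, and the function names are hypothetical.

```python
# Sketch of a GF(2^8) standard-basis multiplier split into two stages.
# ASSUMPTION: field polynomial 0x11D; the abstract does not name it.
FIELD_POLY = 0x11D
M = 8  # field degree, so elements fit in 8 bits


def partial_product_generation(a: int, b: int) -> int:
    """Stage 1 (PPG): XOR together the shifted copies of a selected by b.

    The result is the carry-less product, up to 2*M - 1 = 15 bits wide.
    """
    product = 0
    for i in range(M):
        if (b >> i) & 1:
            product ^= a << i
    return product


def partial_product_reduction(product: int) -> int:
    """Stage 2 (PPR): reduce the wide product modulo the field polynomial."""
    for bit in range(2 * M - 2, M - 1, -1):
        if (product >> bit) & 1:
            # Cancel the high bit by XORing an aligned copy of the polynomial.
            product ^= FIELD_POLY << (bit - M)
    return product


def gf_mul(a: int, b: int) -> int:
    """Full multiplication: the two stages applied back to back."""
    return partial_product_reduction(partial_product_generation(a, b))
```

In a hardware pipeline the boundary between the two functions is a natural place for a register, which is the kind of cut that retiming can exploit; the sketch only shows the functional decomposition.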
Macroblock (also known as partition) pin assignment and routing are important tasks in a typical top-down hierarchical physical design flow. Routers use pin locations as connection points to route the design with the goal of minimizing congestion. However, determining suitable pin locations itself depends on the availability of a congestion-free routing topology as a seed input, which results in a catch-22 situation. In this paper, we present an approach, applied during the prototyping phase, to generate a fast-and-dirty congestion-free routing topology in the top channels. This is a real chip routing topology in the sense that the routing topology of every net adheres to the physical hierarchy, as would happen during hierarchical implementation. It is passed as a seed to the pin assignment engine, which then produces congestion-free pin locations. The novelty of this approach lies in the efficient detection of those inter-partition nets whose routing topology has little or no bearing on top-channel congestion. These nets are then either not routed or routed in a fast, hierarchy-unaware manner. We show that this routing topology is good enough (less than 10% error margin) to establish suitable cross points at partition boundaries, while the speedup achieved is around 6X compared to routing all nets in a hierarchy-aware manner. Experimental results demonstrate its efficiency and effectiveness. Furthermore, it can also be used effectively as a seed input for decisions such as channel sizing between partitions and budgeting timing constraints to partitions.
{"title":"Fast Congestion Aware Routing for Pin Assignment","authors":"S. Prasad","doi":"10.1109/VLSI.2008.110","DOIUrl":"https://doi.org/10.1109/VLSI.2008.110","url":null,"abstract":"Macroblock (aka partition) pin assignment and routing are important tasks in typical top-down hierarchical physical design. Routers use pin locations as connection points to route the design with a goal of minimizing congestion. However, determining suitable pin locations it self depends on availability of congestion free routing topology as a seed input. This results in a catch-22 situation. In this paper, we present an approach, during prototyping phase, to generate fast-and- dirty congestion free routing topology, in top channels. This is real chip routing topology, in the sense that, the routing topology of every net adheres to physical hierarchy, as would happen during hierarchical implementation. This is passed as seed to pin assignment engine, which thus, results in congestion-free pin locations. The novelty of this approach lies in efficient detection of those inter-partition nets whose routing topology have little or no bearing to top channel congestion. These nets are then either not routed or routed in a fast hierarchy unaware manner. We will show that this routing topology is good enough (less than 10% error margin) to establish suitable cross points at partition boundaries, while the speed up achieved is around 6X compared to routing all nets in hierarchy aware manner. Experimental results demonstrate its efficiency and effectiveness. 
Furthermore, it can also be effectively used as seed input for decisions like channel sizing between partitions, and budgeting timing constraints to partitions.","PeriodicalId":143886,"journal":{"name":"21st International Conference on VLSI Design (VLSID 2008)","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117094075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In chip multiprocessor (CMP) systems, the various effects of technology scaling make on-chip components more susceptible to faults. Most earlier schemes that address fault tolerance in CMPs adopt redundant-thread techniques. These techniques are mostly effective, except that they fail to detect errors resulting from faults in on-chip hardware components that commonly serve multiple cores. The cache coherence controller (CC) logic, which ensures consistency of data shared among multiple threads, is a vital common component in CMPs. A fault in the CC logic of any of the processors may lead to errors in the data states of the entire CMP system. It is observed that up to 59.6% of memory references cause a change in cache state for SPLASH-2 applications. We propose a novel scheme with verification logic that can dynamically detect errors in the CC logic of multiple cores in a CMP system. The entire verification logic is designed with a negligible area of 0.1372 mm² using a TSMC 0.18 μm 4-metal-layer process technology. Even at highly aggressive fault injection rates, the logic achieves an average error coverage of more than 95% (and almost 100% for some applications).
{"title":"Dynamic Error Detection for Dependable Cache Coherency in Multicore Architectures","authors":"Hui Wang, Sandeep Baldawa, R. Sangireddy","doi":"10.1109/VLSI.2008.68","DOIUrl":"https://doi.org/10.1109/VLSI.2008.68","url":null,"abstract":"In chip multiprocessor (CMP) systems the various effects of technology scaling make the on chip components more susceptible to faults. Most of the earlier schemes that address fault tolerance issues in CMPs adopt redundant-thread techniques. These techniques are mostly effective, except that they fail to detect errors resulting from faults in hardware components on chip that commonly serve multiple cores. The cache coherence controller (CC) logic, which ensures consistency of data shared among multiple threads, is a vital common component in CMPs. A fault in CC logic of any of the processors may lead to errors in the data states in the entire CMP system. It is observed that up to 59.6% of the memory references cause a change in cache state for SPLASH-2 applications. We propose a novel scheme with a verification logic that can dynamically detect errors in the CC logic of multiple cores in a CMP system. The entire verification logic is designed with a negligible area of 0.1372 sq.mm using a TSMC 0.18 mu4-metal layer process technology. 
Even at highly aggressive fault injection rates, the logic achieves an average error coverage of more than 95% (and almost 100% for some applications)","PeriodicalId":143886,"journal":{"name":"21st International Conference on VLSI Design (VLSID 2008)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125281016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Variability in circuit delay is a significant challenge in the design and synthesis of digital circuits. While the challenge is being addressed at various levels of the design hierarchy, we argue that modern register-transfer level (RTL) synthesis tools can be enhanced to deal with this problem in an alternative, yet effective, manner. Our solution involves the design of variability-tolerant, correct circuits assuming common-case, rather than worst-case, values for critical path delays. We propose a methodology to design variability-tolerant circuits that can, at runtime, detect and efficiently recover from the delay errors that are inevitably introduced by the use of common-case delay values. Variability-agnostic designs are automatically transformed into variability-tolerant circuits by introducing shadow logic to detect and recover from runtime errors, while exploiting data speculation to derive performance benefits. For various benchmark circuits, we show that the area overhead imposed by our scheme is only 11.4% on average, while achieving up to 16.3% performance speedup over margined designs.
{"title":"Variability-Tolerant Register-Transfer Level Synthesis","authors":"Anish Muttreja, S. Ravi, N. Jha","doi":"10.1109/VLSI.2008.114","DOIUrl":"https://doi.org/10.1109/VLSI.2008.114","url":null,"abstract":"Variability in circuit delay is a significant challenge in the design and synthesis of digital circuits. While the challenge is being addressed at various levels of the design hierarchy, we argue that modern register-transfer level (RTL) synthesis tools can be enhanced to deal with this problem in an alternate, yet effective, manner. Our solution involves the design of variability- tolerant, correct circuits assuming common-case, rather than worst-case, values for critical path delays. We propose a methodology to design variability-tolerant circuits that can, at runtime, detect and efficiently recover from delay errors, which would be inevitably introduced due to the use of common-case delay values. Variability-agnostic designs are automatically transformed into variability-tolerant circuits by the introduction of shadow logic to detect and recover from runtime errors, while exploiting data speculation to derive performance benefits. For various benchmark circuits, we show that the area overhead imposed by our scheme is only 11.4% on an average, while achieving upto 16.3% performance speedup over margined designs.","PeriodicalId":143886,"journal":{"name":"21st International Conference on VLSI Design (VLSID 2008)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122926202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper demonstrates a formal verification-planning process and presents an associated verification strategy that we believe is an essential (yet often neglected) step in an ASIC or SoC functional formal verification flow. Our contribution is to present a way to apply the verification planning process and a set of abstraction techniques to a non-trivial open-source example (the Sun OpenSPARC™ DDR2 controller). The process and verification strategy can be applied to DDR2 controllers in particular and generalized to other designs.
{"title":"Formal Verification of a Public-Domain DDR2 Controller Design","authors":"Abhishek Datta, V. Singhal","doi":"10.1109/VLSI.2008.94","DOIUrl":"https://doi.org/10.1109/VLSI.2008.94","url":null,"abstract":"This paper demonstrates a formal verification- planning process and presents associated verification strategy that we believe is an essential (yet often neglected) step in an ASIC or SoC functional formal verification flow. Our contribution is to present a way to apply the verification planning process and a set of abstraction techniques on a non-trivial open-source example (the Sun OpenSPARCtrade DDR2 controller). The process and verification strategy can be applied to DDR2 controllers in particular and generalized for other designs.","PeriodicalId":143886,"journal":{"name":"21st International Conference on VLSI Design (VLSID 2008)","volume":"45 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120893371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most contemporary SAT solvers use a conjunctive-normal-form (CNF) representation for logic functions due to the availability of efficient algorithms for this form, such as deduction through unit propagation and conflict-driven learning using clause resolution. The use of CNF generally entails transformation to this form from other representations such as logic circuits (Tseitin, 1970). However, this transformation results in a loss of information such as the direction of signal flow and the observability of signals at circuit outputs (Een, 2003; Fu, 2005). This has prompted the development of various circuit-based solvers (Ganai et al., 2002), hybrid CNF+circuit-based solvers (Fu, 2005), as well as augmented CNF solvers (Een, 2003). Having the circuit available provides additional capabilities at a cost, and thus requires careful analysis to determine the viability of each approach. This paper highlights one specific capability provided by a circuit: the ability to consider reconvergent paths in unit propagation. Unit propagation is the workhorse of contemporary SAT solvers, so any improvement to it has significant practical potential. We first demonstrate that the Tseitin circuit-to-CNF transformation limits backward unit propagation, and show how additional implications can be derived when unit propagation across multiple paths is considered. Next, we show how these implications can be exploited by statically learning clauses during circuit pre-processing. The results of the practical implementation of these algorithms show that static learning can provide significant speed-up on several classes of benchmark circuits. Finally, we discuss how this work compares with other circuit-based approaches, especially those arising from the automatic-test-pattern-generation (ATPG) community (e.g. recursive learning), and with circuit-based and non-circuit-based pre-processors.
{"title":"Exploiting Circuit Reconvergence through Static Learning in CNF SAT Solvers","authors":"Yinlei Yu, C. Brien, S. Malik","doi":"10.1109/VLSI.2008.90","DOIUrl":"https://doi.org/10.1109/VLSI.2008.90","url":null,"abstract":"Most contemporary SAT solvers use a conjunctive-normal-form (CNF) representation for logic functions due to the availability of efficient algorithms for this form, such as deduction through unit propagation and conflict driven learning using clause resolution. The use of CNF generally entails transformation to this form from other representations such as logic circuits (Tseitin, 1970). However, this transformation results in loss of information such as direction of signal flow and observability of signals at circuit outputs (Een, 2003)(Fu, 2005). This has prompted the development of various circuit-based solvers (Ganai et al., 2002), hybrid CNF+circuit-based solvers (Fu, 2005), as well as augmented CNF solvers (Een, 2003). Having the circuit available provides for additional capabilities at a cost, and thus requires careful analysis to determine the viability of each approach. This paper highlights one specific capability provided by a circuit: the ability to consider reconvergent paths in unit propagation. Unit propagation is the workhorse of contemporary SAT solvers, thus any improvement to this has significant practical potential. We first demonstrate that the Tseitin circuit-to-CNF transformation limits backward unit propagation and how additional implications can be derived when unit propagation across multiple paths is considered. Next, we show how these implications can be exploited by statically learning clauses during circuit pre-processing. The results of the practical implementation of these algorithms show that the static learning can provide significant speed-up on several classes of benchmark circuits. 
Finally, we discuss how this work compares with other circuit-based approaches, especially those arising from the automatic-test-pattern-generation (ATPG) community (e.g. recursive learning) and circuit and non- circuit based pre-processors.","PeriodicalId":143886,"journal":{"name":"21st International Conference on VLSI Design (VLSID 2008)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124446430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
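The limitation on backward unit propagation can be seen on a tiny reconvergent circuit. The sketch below is an illustration under stated assumptions rather than the authors' algorithm: the circuit z = (a AND b) OR (a AND NOT b), the variable numbering, and the helper names are all hypothetical, and the learned clause is written by hand instead of being derived automatically.

```python
from typing import Dict, List, Optional

Clause = List[int]  # DIMACS-style literals: +v means v is true, -v means false


def tseitin_and(out: int, x: int, y: int) -> List[Clause]:
    """Tseitin clauses for out <-> (x AND y); x and y may be negated literals."""
    return [[-out, x], [-out, y], [out, -x, -y]]


def tseitin_or(out: int, x: int, y: int) -> List[Clause]:
    """Tseitin clauses for out <-> (x OR y)."""
    return [[out, -x], [out, -y], [-out, x, y]]


def unit_propagate(clauses: List[Clause],
                   assumptions: List[int]) -> Optional[Dict[int, bool]]:
    """Return every assignment implied by unit propagation, or None on conflict."""
    assignment = {abs(lit): lit > 0 for lit in assumptions}
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                var = abs(lit)
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == (lit > 0):
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return None  # every literal is false: conflict
            if len(unassigned) == 1:
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0  # unit clause forces this literal
                changed = True
    return assignment


# Hypothetical reconvergent circuit: input a fans out to two AND gates whose
# outputs reconverge at an OR gate. Logically, z = 1 forces a = 1.
A, B, X1, X2, Z = 1, 2, 3, 4, 5
cnf = tseitin_and(X1, A, B) + tseitin_and(X2, A, -B) + tseitin_or(Z, X1, X2)

plain = unit_propagate(cnf, [Z])                 # Tseitin clauses only
learned = unit_propagate(cnf + [[-Z, A]], [Z])   # plus a statically learned clause
```

With only the Tseitin clauses, asserting z leaves the clause (x1 OR x2) with two free literals, so propagation stalls and `a` stays unassigned. Adding the implication z -> a as an extra clause lets unit propagation reach `a` directly, which is the kind of clause that static learning over reconvergent paths aims to supply.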
S. Veeramachaneni, K. M. Krishna, V. PrateekG., S. Subroto, S. Bharat, M. Srinivas
With the increasing prominence of commercial, financial, and Internet-based applications that process decimal data, there is growing interest in providing hardware support for such data. In this paper, a new architecture for an efficient binary and binary coded decimal (BCD) adder/subtractor is presented. It employs a new method of subtraction, unlike existing designs, which mostly use 10's complement, to obtain a much lower latency. Although a correction is necessary in some cases, the delay overhead is minimal. A complete discussion of such cases and the logic required to process them is presented. The architecture is run-time reconfigurable to support both BCD and binary operations, including signed and unsigned numbers. The proposed circuits are compared, both qualitatively and quantitatively, with existing circuits in the literature and are shown to perform better. Simulation results show that the proposed architecture is at least 11% faster than existing designs.
{"title":"A Novel Carry-Look Ahead Approach to a Unified BCD and Binary Adder/Subtractor","authors":"S. Veeramachaneni, K. M. Krishna, V. PrateekG., S. Subroto, S. Bharat, M. Srinivas","doi":"10.1109/VLSI.2008.80","DOIUrl":"https://doi.org/10.1109/VLSI.2008.80","url":null,"abstract":"Increasing prominence of commercial, financial and Internet-based applications, which process decimal data, there is an increasing interest in providing hardware support for such data. In this paper, new architecture for efficient binary and binary coded decimal (BCD) adder/subtracter is presented. This employs a new method of subtraction unlike the existing designs which mostly use 10's complements, to obtain a much lower latency. Though there is a necessity of correction in some cases, the delay overhead is minimal. A complete discussion about such cases and the required logic to process is presented. The architecture is run-time reconfigurable to facilitate both BCD and binary operations, including signed and unsigned numbers. The proposed circuits are compared (both qualitatively as well as quantitatively) with the existing circuits in literature and are shown to perform better. Simulation results show that the proposed architecture is at least 11% faster than the existing designs.","PeriodicalId":143886,"journal":{"name":"21st International Conference on VLSI Design (VLSID 2008)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131766638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
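For context, the conventional scheme that the abstract contrasts itself with can be sketched in a few lines. This is the textbook approach (binary digit addition with a +6 correction, and subtraction through the 10's complement), not the new subtraction method of the paper, which the abstract does not describe; all function names are hypothetical.

```python
from typing import List, Tuple


def bcd_add_digit(a: int, b: int, carry_in: int = 0) -> Tuple[int, int]:
    """Add two BCD digits (0-9) as a 4-bit binary adder plus correction would."""
    raw = a + b + carry_in      # plain binary sum, in the range 0..19
    if raw > 9:
        raw += 6                # skip the six unused codes 1010..1111
    return raw & 0xF, raw >> 4  # (sum digit, decimal carry out)


def bcd_add(x: List[int], y: List[int], carry_in: int = 0) -> Tuple[List[int], int]:
    """Ripple-add two equal-length BCD numbers, least significant digit first."""
    digits = []
    carry = carry_in
    for a, b in zip(x, y):
        digit, carry = bcd_add_digit(a, b, carry)
        digits.append(digit)
    return digits, carry


def bcd_sub_tens_complement(x: List[int], y: List[int]) -> Tuple[List[int], int]:
    """Compute x - y as x + (9's complement of y) + 1.

    A carry out of 1 means x >= y and the digits hold the difference.
    """
    nines_complement = [9 - d for d in y]
    return bcd_add(x, nines_complement, carry_in=1)


def to_bcd(value: int, width: int) -> List[int]:
    """Split a non-negative integer into `width` decimal digits, LSD first."""
    return [(value // 10 ** i) % 10 for i in range(width)]


def from_bcd(digits: List[int]) -> int:
    """Inverse of to_bcd."""
    return sum(d * 10 ** i for i, d in enumerate(digits))
```

The complement step and the per-digit correction both sit on the carry path in this baseline, which is why a design that avoids the 10's complement can plausibly shorten the latency.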
This paper presents an architecture for implementing a Karatsuba multiplier on an FPGA platform. A detailed analysis is carried out of how existing algorithms utilize FPGA resources. Based on these observations, the work develops a hybrid technique that has a better area-delay product than the known algorithms. The results are demonstrated in practice through a large number of experiments. Subsequently, the work develops a masking strategy to prevent power-based side-channel attacks on the multiplier. The proposed masked hybrid Karatsuba multiplier is found to be more compact than existing designs.
{"title":"Power Attack Resistant Efficient FPGA Architecture for Karatsuba Multiplier","authors":"C. Rebeiro, Debdeep Mukhopadhyay","doi":"10.1109/VLSI.2008.65","DOIUrl":"https://doi.org/10.1109/VLSI.2008.65","url":null,"abstract":"The paper presents an architecture to implement Karatsuba Multiplier on an FPGA platform. Detailed analysis has been carried out on how existing algorithms utilize FPGA resources. Based on the observations the work develops a hybrid technique which has a better area delay product compared to the known algorithms. The results have been practically demonstrated through a large number of experiments. Subsequently, the work develops a masking strategy to prevent power based side channel attacks on the multiplier. It has been found that the proposed masked Hybrid Karatsuba multiplier is more compact compared to existing designs.","PeriodicalId":143886,"journal":{"name":"21st International Conference on VLSI Design (VLSID 2008)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124600499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
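The general idea of a hybrid Karatsuba multiplier can be sketched as a recursion that switches to a simpler method below some operand size. The code below is a generic illustration, not the hybrid scheme of the paper: it assumes polynomials over GF(2) packed into Python integers (the abstract does not state the operand type), and `THRESHOLD` is a hypothetical cut-over point rather than a value taken from the work.

```python
# ASSUMPTION: operands are polynomials over GF(2) packed into integers,
# so addition and subtraction are both XOR.
THRESHOLD = 8  # hypothetical size at or below which schoolbook is used


def schoolbook_clmul(a: int, b: int) -> int:
    """Shift-and-XOR (carry-less) multiplication of two GF(2) polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def hybrid_karatsuba(a: int, b: int, bits: int) -> int:
    """Multiply two GF(2) polynomials that each fit in `bits` bits."""
    if bits <= THRESHOLD:
        return schoolbook_clmul(a, b)
    half = (bits + 1) // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    # Three half-size products instead of the four a direct split would need.
    low = hybrid_karatsuba(a_lo, b_lo, half)
    high = hybrid_karatsuba(a_hi, b_hi, half)
    mid = hybrid_karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, half)
    cross = mid ^ low ^ high  # equals a_lo*b_hi + a_hi*b_lo over GF(2)
    return low ^ (cross << half) ^ (high << (2 * half))
```

A call such as `hybrid_karatsuba(a, b, 233)` recurses through five levels before the operand width drops to 8 bits and the schoolbook routine takes over. Where that cut-over sits is the main tuning knob in any such hybrid; on an FPGA it would presumably be chosen from resource utilization, as the abstract suggests.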
As technology scales, the reduction in transistor size creates many opportunities for increased circuit capabilities in a reduced chip area. In modern wide-issue processors, performance is directly affected by the delay of the dynamic scheduling logic. In this paper, we analyze how the delay of instruction select logic scales across submicron technologies, and also present novel designs that provide a single selection tree for two similar functional units. The designs are based on a tree structure using arbiter cells with two and four inputs, which can handle one or two functional units. The effects of technology and design decisions are shown based on simulations using four submicron technologies. The delays in the select logic trees are shown to decrease by an average of 60% from 130 nm technology to 45 nm technology when servicing a single functional unit. The double-grant arbiter cells are shown to build a tree that serves multiple functional units simultaneously with 65% less delay than multiple single-grant trees.
{"title":"An Optimal Multi-Functional Unit Dynamic Instruction Selection Logic at Submicron Technologies","authors":"Terrell R. Bennett, R. Sangireddy","doi":"10.1109/VLSI.2008.55","DOIUrl":"https://doi.org/10.1109/VLSI.2008.55","url":null,"abstract":"As the technology scales, reduction in transistor size creates many opportunities for increased circuit capabilities in reduced chip area. In modern wide-issue processors, performance of the processor is directly impacted by the time delay complexity of the dynamic scheduling logic. In this paper, we analyze the scaling of time delay of instruction select logic at the submicron technologies, and also present novel designs that provide a single selection tree for two similar functional units. The designs are based on a tree structure using arbiter cells of two and four inputs which can handle one or two functional units. The effects of technology and design decisions are shown based on simulations using four submicron technologies. The delays in the select logic trees are shown to decrease by an average of 60% from 130 nm technology to 45 nm technology when servicing a single functional unit. The double grant arbiter cells are shown to build a tree that will serve multiple functional units simultaneously with 65% lesser delay as compared to multiple single-grant trees1.","PeriodicalId":143886,"journal":{"name":"21st International Conference on VLSI Design (VLSID 2008)","volume":"123 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124649391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
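The difference between a single-grant and a double-grant selection tree can be modeled behaviorally. The sketch below is a rough functional model under assumptions of my own, not the authors' circuit: it uses only two-input cells, gives the left input fixed priority, and represents the number of grants as an integer passed down the tree; both function names are hypothetical.

```python
from typing import List


def single_grant_tree(requests: List[bool], enable: bool = True) -> List[bool]:
    """Grant at most one request using a tree of two-input arbiter cells."""
    if len(requests) == 1:
        return [requests[0] and enable]
    mid = len(requests) // 2
    left, right = requests[:mid], requests[mid:]
    left_request = any(left)  # request signal the left subtree raises
    # Arbiter cell: pass the grant to the left child if it requests,
    # otherwise to the right child.
    grant_left = enable and left_request
    grant_right = enable and not left_request and any(right)
    return single_grant_tree(left, grant_left) + single_grant_tree(right, grant_right)


def double_grant_tree(requests: List[bool], grants: int = 2) -> List[bool]:
    """Grant up to two requests in one pass, e.g. for two functional units."""
    if len(requests) == 1:
        return [requests[0] and grants > 0]
    mid = len(requests) // 2
    left, right = requests[:mid], requests[mid:]
    left_count = min(sum(left), 2)     # a cell only needs to report 0, 1, or 2+
    to_left = min(left_count, grants)  # left subtree is served first
    to_right = grants - to_left        # leftover grants go to the right
    return double_grant_tree(left, to_left) + double_grant_tree(right, to_right)
```

For the request vector `[False, True, False, True, True, False, False, True]`, the single-grant tree selects only index 1, while the double-grant tree selects indices 1 and 3 in the same traversal. Serving two units with single-grant trees would need a second tree or a second pass, which is the comparison the abstract's delay figure refers to.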