Scan compression techniques are widely used to contain test application time and test data volume. Smart techniques exist to match the scan compression CoDec (compactor-decompressor) module with the DUT (design under test), to realize high levels of compression with no loss of coverage. DUT partitioning is often desirable for ease of implementing sub-chips and integrating them into an SOC (system-on-chip). This paper presents various multi-CoDec configurations for partitioned DUTs to enable efficient scan testing, which address the requirements of reduced test mode power with no compromise in test quality. Different configurations are examined, tradeoffs discussed, and the most suitable one amongst them identified. It is shown how the preferred configuration can be architected with low implementation overhead (with no new requirements for bounding when creating the individual partitions), and how the different CoDec – DUT partitions can be operated together to meet dual goals of high quality and low power, with no increase in test time. Experimental data is presented on industrial circuits to illustrate the benefits.
{"title":"Multi-CoDec Configurations for Low Power and High Quality Scan Test","authors":"A. Jain, S. Subramanian, R. Parekhji, S. Ravi","doi":"10.1109/VLSID.2011.15","DOIUrl":"https://doi.org/10.1109/VLSID.2011.15","url":null,"PeriodicalId":371062,"journal":{"name":"2011 24th International Conference on VLSI Design","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116664386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Standing wave oscillators (SWOs) are attractive since they can sustain extremely high oscillation frequencies with very low power consumption due to their resonant nature. In this paper, we present a technique to design a high frequency SWO to cover a large area on an IC. We achieve this by combining two techniques. The first technique increases the area coverage of an individual SWO by ensuring that it sustains an odd number (greater than one) of standing waves along the ring. The second approach further increases the area coverage by tiling multiple SWOs side by side, and connecting them such that they oscillate with the same high frequency and phase. The combined approach is simulated for a 3×3 array of tiles, using 3D, skin-effect adjusted RLC parasitic extraction. Our simulations are performed using a 90nm process, and indicate that this tiled structure can oscillate at about 7.25 GHz, with low power (about 68 mW per SWO tile) and low jitter (about 3.1% of the nominal clock period).
{"title":"Interconnected Tile Standing Wave Resonant Oscillator Based Clock Distribution Circuits","authors":"Ayan Mandal, V. Karkala, S. Khatri, R. Mahapatra","doi":"10.1109/VLSID.2011.70","DOIUrl":"https://doi.org/10.1109/VLSID.2011.70","url":null,"PeriodicalId":371062,"journal":{"name":"2011 24th International Conference on VLSI Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129986480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Variations in timing can occur due to multiple sources on a chip. Many circuit level statistical techniques are used to analyze timing in the presence of these sources of variation. It is desirable to have “variation awareness” at the Register Transfer Level (RTL), and estimate block level delay distributions early in the design cycle, to evaluate design choices quickly and minimize post-synthesis simulation costs. We introduce SHARPE, a rigorous, systematic methodology to verify design correctness in RTL in the presence of variations. In this paper, we describe SHARPE in the context of computing statistical delay invariants in the presence of input variations. We treat the RTL source code as a program and use static program analysis techniques to compute probabilities. We model the probabilistic RTL modules as Discrete Time Markov Chains (DTMCs) that are then checked formally for probabilistic invariants using PRISM, a probabilistic model checker. Our technique is illustrated on the RTL description of the data path of OR1200, an open source embedded processor. We demonstrate the enhanced scalability of SHARPE by applying compositional reasoning for probabilistic model checking.
{"title":"Variation-Conscious Formal Timing Verification in RTL","authors":"Jayanand Asok Kumar, Shobha Vasudevan","doi":"10.1109/VLSID.2011.48","DOIUrl":"https://doi.org/10.1109/VLSID.2011.48","url":null,"PeriodicalId":371062,"journal":{"name":"2011 24th International Conference on VLSI Design","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133158466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
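The kind of probabilistic invariant that the SHARPE abstract checks on a DTMC can be illustrated with a bounded-reachability computation: the probability that a block reaches a "done" state within a given number of cycles. The sketch below is not the SHARPE flow and does not call PRISM; the state names, the transition probabilities and the function `bounded_reachability` are assumptions made up for illustration.

```python
from fractions import Fraction


def bounded_reachability(transitions, start, target, steps):
    """Probability of reaching `target` from `start` within `steps` transitions.

    `transitions` maps each non-target state to a dict {successor: probability}.
    The target state is treated as absorbing. All names are hypothetical.
    """
    dist = {start: Fraction(1)}
    for _ in range(steps):
        nxt = {}
        for state, prob in dist.items():
            if state == target:
                # Once the target is reached, the probability mass stays there.
                nxt[state] = nxt.get(state, Fraction(0)) + prob
                continue
            for succ, q in transitions[state].items():
                nxt[succ] = nxt.get(succ, Fraction(0)) + prob * q
        dist = nxt
    return dist.get(target, Fraction(0))


# Hypothetical block: each cycle it finishes with probability 3/4, else stalls.
toy_dtmc = {"busy": {"done": Fraction(3, 4), "busy": Fraction(1, 4)}}
print(bounded_reachability(toy_dtmc, "busy", "done", 2))  # prints 15/16
```

Exact fractions are used so the result can be compared against a bound without floating-point noise; a model checker such as PRISM performs this kind of computation on far larger state spaces.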
A 1.8GHz high-accuracy, ring-oscillator based Digital Phase Lock Loop (DPLL), suitable for Serializer-Deserializer (SERDES) applications like HDMI, eSATA and USB2.0, is presented here. Sigma-Delta (ΣΔ) dithering followed by passive filtering, along with Temperature Compensation, is used to ensure frequency accuracy and low accumulated jitter over a large temperature range. A re-circulating delay line based Time to Digital Converter (T2D) is used to handle large phase differences between the reference and feedback clocks. The DPLL is built in 65nm technology, and provides up to 1.8GHz output, with a phase noise of –87dBc/Hz at 1 MHz offset, and a frequency accuracy of +/-100ppm. It supports input frequencies in the range 0.7MHz to 50MHz, occupies a core area of 0.11 sq mm, and does not require external components.
{"title":"A 1.8GHz Digital PLL in 65nm CMOS","authors":"B. Chattopadhyay, Anant S. Kamath, G. Nayak","doi":"10.1109/VLSID.2011.32","DOIUrl":"https://doi.org/10.1109/VLSID.2011.32","url":null,"PeriodicalId":371062,"journal":{"name":"2011 24th International Conference on VLSI Design","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127136916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
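The sigma-delta dithering mentioned in the DPLL abstract can be illustrated with a first-order modulator: a coarse control word is toggled between two adjacent levels so that its time average equals a fractional target, and a low-pass filter then smooths the toggling. The abstract does not state the modulator order or structure, so the accumulator-based form below and the name `sigma_delta_dither` are assumptions for illustration only.

```python
def sigma_delta_dither(numerator, denominator, cycles):
    """First-order sigma-delta bit stream whose mean approaches numerator/denominator.

    Integer arithmetic keeps the accumulator exact. This is a generic textbook
    modulator, not the circuit from the paper.
    """
    if denominator <= 0 or not 0 <= numerator <= denominator:
        raise ValueError("need 0 <= numerator <= denominator and denominator > 0")
    acc = 0
    bits = []
    for _ in range(cycles):
        acc += numerator
        if acc >= denominator:
            # Overflow: emit the upper level and keep the residue.
            acc -= denominator
            bits.append(1)
        else:
            bits.append(0)
    return bits


# A fractional setting of 3/8 of one LSB, dithered over 16 cycles.
stream = sigma_delta_dither(3, 8, 16)
print(stream[:8])                 # prints [0, 0, 1, 0, 0, 1, 0, 1]
print(sum(stream) / len(stream))  # prints 0.375
```

Over any whole number of periods (here 8 cycles) the count of ones is exact, which is why the average after filtering lands on the fractional target.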
The capacity of the available on-chip trace buffer is limited. To increase its effective capacity, we propose real-time compression of the trace data via novel source transformation functions, namely real-time difference vector computation, an efficient interconnect network, and real-time alternate vector reversal, which together reduce the entropy of the trace data. The proposed compression technique is implemented in hardware and operates in real time to capture debug data. Experimental results for sequential benchmark circuits show that the proposed method achieves a better compression percentage than prior work. The area overhead of our trace compressor is up to 20X lower than that of dictionary-based codes, and the method yields up to a 4X improvement in the compression ratio.
{"title":"Trace Buffer-Based Silicon Debug with Lossless Compression","authors":"S. Prabhakar, R. Sethuram, M. Hsiao","doi":"10.1109/VLSID.2011.31","DOIUrl":"https://doi.org/10.1109/VLSID.2011.31","url":null,"PeriodicalId":371062,"journal":{"name":"2011 24th International Conference on VLSI Design","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124610846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
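The difference-vector idea in the trace compression abstract can be sketched as follows, assuming the difference is a bitwise XOR of consecutive trace vectors (the abstract does not say which operator is used). When successive samples differ in only a few bits, the transformed stream is mostly zeros, which a later entropy coder can exploit, and the transform is lossless because XOR is its own inverse. The function names and the sample trace are hypothetical.

```python
def difference_vectors(trace):
    """Replace each trace vector by its XOR with the previous one (first vs. 0)."""
    prev = 0
    diffs = []
    for vec in trace:
        diffs.append(vec ^ prev)
        prev = vec
    return diffs


def restore_trace(diffs):
    """Invert difference_vectors by accumulating the XORs."""
    prev = 0
    trace = []
    for d in diffs:
        prev ^= d
        trace.append(prev)
    return trace


def ones(vectors):
    """Total number of 1 bits, a rough proxy for how compressible the data is."""
    return sum(bin(v).count("1") for v in vectors)


# Hypothetical 8-bit trace in which neighbouring samples change by one bit.
sample = [0b10110010, 0b10110011, 0b10110111, 0b10110110]
diffs = difference_vectors(sample)
print(ones(sample), ones(diffs))       # prints 20 7
print(restore_trace(diffs) == sample)  # prints True
```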
Ayan Mandal, N. Jayakumar, Kalyana C. Bollapalli, S. Khatri, R. Mahapatra
In recent fabrication technologies, buffered clock distribution networks have become increasingly popular due to increasing on-chip wiring delays. Traditionally, clock distribution networks have been optimized to minimize the end-to-end skew of the distribution network. However, since most ICs have an on-chip PLL, we argue that the design goal of minimizing end-to-end jitter is more relevant. In this paper, we present a dynamic programming based approach to synthesize a minimum cost buffered H-tree clock distribution network. Our cost functions are a weighted sum of power and jitter, and a weighted sum of power and end-to-end delay of the distribution network. Our approach is based on precharacterizing the delay, jitter and power of buffered segments of different lengths, topologies, buffer sizes and wire-codes. Using this information, a dynamic programming (DP) engine automatically generates the optimal H-tree that minimizes the appropriate cost function. Compared to a manually constructed buffered H-tree network, our approaches reduce jitter by as much as 28% and power by as much as 46%. When optimizing for minimum jitter, the DP engine generates an H-tree with lower jitter than when optimizing for minimum delay, thereby validating our approach and demonstrating its usefulness.
{"title":"An Automated Approach for Minimum Jitter Buffered H-Tree Construction","authors":"Ayan Mandal, N. Jayakumar, Kalyana C. Bollapalli, S. Khatri, R. Mahapatra","doi":"10.1109/VLSID.2011.69","DOIUrl":"https://doi.org/10.1109/VLSID.2011.69","url":null,"PeriodicalId":371062,"journal":{"name":"2011 24th International Conference on VLSI Design","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125662586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
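The dynamic programming step described in the H-tree abstract can be sketched as a level-by-level search over precharacterized segment options, minimizing a weighted sum of power and jitter. Everything specific here is an assumption for illustration: the option tuples `(buffer_size, power, jitter)`, the linear accumulation of jitter, the binary `fanout`, and the `mismatch_penalty` that couples adjacent levels are not taken from the paper.

```python
def synthesize_htree(levels, weight_power, mismatch_penalty, fanout=2):
    """Pick one segment option per tree level to minimize a weighted cost.

    levels[d] is a list of hypothetical (buffer_size, power, jitter) options
    for depth d. Power is multiplied by the number of segments at that depth;
    jitter is counted once per level along a root-to-leaf path.
    Returns (total_cost, [chosen buffer sizes]).
    """
    # best maps the buffer size chosen at the previous level to (cost, choices).
    best = {None: (0.0, [])}
    for depth, options in enumerate(levels):
        segments = fanout ** depth
        nxt = {}
        for size, power, jitter in options:
            seg_cost = weight_power * power * segments + (1 - weight_power) * jitter
            for prev_size, (cost, path) in best.items():
                # Assumed coupling: changing buffer size between levels costs extra.
                penalty = 0.0 if prev_size is None else mismatch_penalty * abs(size - prev_size)
                total = cost + seg_cost + penalty
                if size not in nxt or total < nxt[size][0]:
                    nxt[size] = (total, path + [size])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


# Made-up characterization data for a three-level tree.
example_levels = [
    [(1, 4.0, 6.0), (2, 6.0, 3.0)],
    [(1, 2.0, 5.0), (4, 5.0, 1.0)],
    [(2, 1.0, 4.0), (4, 3.0, 2.0)],
]
cost, sizes = synthesize_htree(example_levels, weight_power=0.5, mismatch_penalty=0.5)
print(cost, sizes)
```

Because only the best partial solution per buffer size is kept at each level, the search grows with the number of options per level rather than with the number of complete trees. Setting `weight_power` closer to 0 shifts the result toward low jitter at the expense of power.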
J. Manikandan, B. Venkataramani, K. Girish, H. Karthic, V. Siddharth
Continuous, real-time speech recognition is required for various mobile and hands-free applications. In this paper, hardware implementations of a real-time speech recognition system are proposed using two approaches, and their performance is evaluated. The first approach uses Mel Filter Banks with Mel Frequency Cepstrum Coefficients (MFCC) as feature input and the second approach uses Cochlear Filter Banks with Zero-crossings (ZC) as feature input for recognition. The features extracted from input speech are fed to a multi-class Support Vector Machine (SVM) classifier for recognition. The proposed recognition systems are implemented on a Texas Instruments TMS320C6713 floating point digital signal processor for recognizing isolated digits (0-9), and their performance is compared. It is observed that the program memory required for MFCC feature extraction is 44.42% higher than that required for feature extraction using Cochlear filters. Recognition accuracies of 93.33% and 98.67% are achieved for feature inputs from Mel filter banks and Cochlear filter banks respectively. It is also observed that the computational complexity of feature extraction using cochlear filters is 1.53 times that required for MFCC feature extraction. The recognition performance is also studied for different combinations of test and training utterances. It is found that training using 15 utterances of each digit results in the best recognition accuracy. The techniques proposed here can be adapted for various other hands-free consumer applications such as washing machines, hands-free cordless and many more.
{"title":"Hardware Implementation of Real-Time Speech Recognition System Using TMS320C6713 DSP","authors":"J. Manikandan, B. Venkataramani, K. Girish, H. Karthic, V. Siddharth","doi":"10.1109/VLSID.2011.12","DOIUrl":"https://doi.org/10.1109/VLSID.2011.12","url":null,"PeriodicalId":371062,"journal":{"name":"2011 24th International Conference on VLSI Design","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130527006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
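The zero-crossing (ZC) feature named in the speech recognition abstract can be illustrated with a per-frame crossing count. In the paper the crossings are taken on cochlear filter bank outputs; the sketch below omits the filter bank and works on a raw sample list, and the function name, frame length and sample values are assumptions for illustration.

```python
def zero_crossing_features(samples, frame_len):
    """Return one zero-crossing count per non-overlapping frame.

    A crossing is a sign change between neighbouring samples inside a frame;
    zero is treated as non-negative. Trailing samples that do not fill a
    whole frame are ignored.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        signs = [s >= 0 for s in frame]
        features.append(sum(a != b for a, b in zip(signs, signs[1:])))
    return features


# Hypothetical signal: a fast-alternating frame followed by a slow one.
print(zero_crossing_features([1, -1, 1, -1, 2, 3, -1, -2], frame_len=4))  # prints [3, 1]
```

A vector of such counts, one per filter channel and frame, is the kind of feature that could then be passed to a classifier such as an SVM.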
A reconfigurable processor tailored for accelerating Phylogenetic Inference is proposed. In this paper, a programmable and scalable architectural platform instantiates an array of coarse-grained, lightweight processing elements, allows arbitrary partitioning and scheduling schemes, and is capable of running the complete Maximum Likelihood algorithm and handling arbitrarily large sequences. The key difference between the proposed coarse-grained reconfigurable architecture (CGRA) based solution and FPGA and GPU based solutions is a much better match between the architecture and the algorithm, both for the core computational need and for the system level architectural need. For the same degree of parallelism, we provide a 2.27X speed-up compared to an FPGA with the same amount of core logic, and an 81.87X speed-up compared to a GPU with the same silicon area.
{"title":"A Reconfigurable Processor for Phylogenetic Inference","authors":"Pei Liu, A. Hemani, K. Paul","doi":"10.1109/VLSID.2011.74","DOIUrl":"https://doi.org/10.1109/VLSID.2011.74","url":null,"PeriodicalId":371062,"journal":{"name":"2011 24th International Conference on VLSI Design","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122753536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A SPICE macro model for the transient analysis of a lossy, dispersive, coupled GaAs interconnect line system is considered. The model is based on the finite Fourier integral transform in the spatial domain and is used to study the transient nature of the signals, signal delays, distortions and crosstalk in the interconnections of digital integrated circuits. An equivalent circuit model is derived from the resulting nonlinear differential equations and is implemented as a macro model in a general purpose circuit simulator, SPICE. The model provides an easy method of including the skin effect and dispersion of the lines. This macro model is an alternative to modeling distributed systems with multiple lumped-element PI or Tee sections. Its simulation times and accuracy compare well with those of reduced order PI section lumped element models.
{"title":"A SPICE Macromodel for the Analysis of Lossy Dispersive Coupled GaAs Interconnect Line System","authors":"Bhaskar Gopalan","doi":"10.1109/VLSID.2011.11","DOIUrl":"https://doi.org/10.1109/VLSID.2011.11","url":null,"PeriodicalId":371062,"journal":{"name":"2011 24th International Conference on VLSI Design","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122962386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Various novel NoC designs attempt to improve network throughput and latency by leveraging asynchronous bypass and specialized clock routing. The performance of such architectures is limited by the skewing of signal transitions on the bit-lines of a link due to crosstalk noise. This work proposes a two-step technique, TransSync-RecSync, to eliminate packet errors resulting from inter-bit-line transition skew. TransSync preemptively adds delay to bits in a flit before they are transmitted to overcome skewing of transitions on the link, while RecSync de-skews the bits at the receiving end by delaying all the transitions by the same amount as the maximum skew on the bus. The approach adds minimally to router complexity and involves no wire overhead. When used to augment a NoC design with an asynchronous bypass channel, the proposed scheme was found to improve the average network latency by 38%.
{"title":"Intra-Flit Skew Reduction for Asynchronous Bypass Channel in NoCs","authors":"Reeshav Kumar, Yoon Seok Yang, G. Choi","doi":"10.1109/VLSID.2011.73","DOIUrl":"https://doi.org/10.1109/VLSID.2011.73","url":null,"PeriodicalId":371062,"journal":{"name":"2011 24th International Conference on VLSI Design","volume":"16 12","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114128986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}