The rates of transient faults such as soft errors have been significantly impacted due to the aggressive scaling trends in the nanometer regime. In the past, several circuit optimization techniques have been proposed for preventing soft errors in logic circuits. These approaches include, inclusion of concurrent error detection circuits on selective nodes, selective gate sizing, dual-VDD assignment and selective node hardening at the transistor level. However, we show in this paper that larger wirelengths for nets can act as larger RC ladders and can effectively filter out the transient glitches due to radiation strikes. Based on this, we propose a simulated annealing based placement algorithm that significantly reduces the SER of logic circuits. We accurately capture the soft error masking effects by using a new metric called the {em logical observability}. The cost function for simulated annealing is modeled as the summation of the logical observability weighted with the netlength for each net, while simultaneously constraining the total area and the total wirelength. The algorithm tries to assign higher wirelengths for nets with low masking probability for higher glitch reduction, while maintaining low delay and area penalty for the overall circuit. Each placement configuration is represented as a sequence pair and the moves in the space of sequence pairs are probabilistically accepted depending upon the cost gradient and the iteration count. Higher cost moves have a higher probability of acceptance at initial iterations for better state space exploration, while at later iterations the algorithm greedily tries to minimize the cost. To the best of our knowledge, this is the first time that soft error rate reduction is attempted during the placement stage. The proposed algorithm has been implemented and validated on the ISCAS85 benchmarks. We have experimented using the FreePDK 45nm Process Design Kit and the OSU cell library which indicate that our radiation immune placement algorithm can significantly reduce the SER in logic circuits with very low overheads in delay and area.
{"title":"A New Placement Algorithm for Reduction of Soft Errors in Macrocell Based Design of Nanometer Circuits","authors":"K. Bhattacharya, N. Ranganathan","doi":"10.1109/ISVLSI.2009.37","DOIUrl":"https://doi.org/10.1109/ISVLSI.2009.37","url":null,"abstract":"The rates of transient faults such as soft errors have been significantly impacted due to the aggressive scaling trends in the nanometer regime. In the past, several circuit optimization techniques have been proposed for preventing soft errors in logic circuits. These approaches include, inclusion of concurrent error detection circuits on selective nodes, selective gate sizing, dual-VDD assignment and selective node hardening at the transistor level. However, we show in this paper that larger wirelengths for nets can act as larger RC ladders and can effectively filter out the transient glitches due to radiation strikes. Based on this, we propose a simulated annealing based placement algorithm that significantly reduces the SER of logic circuits. We accurately capture the soft error masking effects by using a new metric called the {em logical observability}. The cost function for simulated annealing is modeled as the summation of the logical observability weighted with the netlength for each net, while simultaneously constraining the total area and the total wirelength. The algorithm tries to assign higher wirelengths for nets with low masking probability for higher glitch reduction, while maintaining low delay and area penalty for the overall circuit. Each placement configuration is represented as a sequence pair and the moves in the space of sequence pairs are probabilistically accepted depending upon the cost gradient and the iteration count. Higher cost moves have a higher probability of acceptance at initial iterations for better state space exploration, while at later iterations the algorithm greedily tries to minimize the cost. To the best of our knowledge, this is the first time that soft error rate reduction is attempted during the placement stage. The proposed algorithm has been implemented and validated on the ISCAS85 benchmarks. We have experimented using the FreePDK 45nm Process Design Kit and the OSU cell library which indicate that our radiation immune placement algorithm can significantly reduce the SER in logic circuits with very low overheads in delay and area.","PeriodicalId":137508,"journal":{"name":"2009 IEEE Computer Society Annual Symposium on VLSI","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126482646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hai Helen Li, Haiwen Xi, Yiran Chen, J. Stricklin, Xiaobin Wang, Tong Zhang
Thermal-assisted spin-transfer torque random access memory (STT-RAM) has been considered as a promising candidate of next-generation nonvolatile memory technology. We conducted finite element simulation on thermal dynamics in the programming process of thermal-assisted STT-RAM. Special attentions have been paid to the scalability and design space of the thermal-assist programming scheme by varying the memory element dimension and resistance-area product. We also provide systematic analysis and comparison between the thermal-assisted STT-RAM and standard STT-RAM. Discussions on the writeability and scalability of thermal-assisted STT-RAM are also conducted.
{"title":"Thermal-Assisted Spin Transfer Torque Memory (STT-RAM) Cell Design Exploration","authors":"Hai Helen Li, Haiwen Xi, Yiran Chen, J. Stricklin, Xiaobin Wang, Tong Zhang","doi":"10.1109/ISVLSI.2009.17","DOIUrl":"https://doi.org/10.1109/ISVLSI.2009.17","url":null,"abstract":"Thermal-assisted spin-transfer torque random access memory (STT-RAM) has been considered as a promising candidate of next-generation nonvolatile memory technology. We conducted finite element simulation on thermal dynamics in the programming process of thermal-assisted STT-RAM. Special attentions have been paid to the scalability and design space of the thermal-assist programming scheme by varying the memory element dimension and resistance-area product. We also provide systematic analysis and comparison between the thermal-assisted STT-RAM and standard STT-RAM. Discussions on the writeability and scalability of thermal-assisted STT-RAM are also conducted.","PeriodicalId":137508,"journal":{"name":"2009 IEEE Computer Society Annual Symposium on VLSI","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122317513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving the efficiency and the power consumption of a RF transmitter in a wireless sensor network is a major design challenge. Unlike conventional transmitter architectures, injection-locked transmitter (ILTX) provides high efficiency with reduced power consumption. The core block of an ILTX is an injection-locked oscillator. Therefore this paper reports a low-voltage low-power injection-locked LC oscillator employing self-cascode structure and body-terminal coupling. Self-cascode structure of the oscillator provides low-voltage and low-power operation while body terminal coupling facilitates low-power operation without degrading voltage headroom. The proposed oscillator has been fabricated using 0.18-μm RF CMOS process. Measurement results indicate that the designed oscillator can operate with a supply voltage as low as 1 V and exhibits 36.75 MHz locking range for an injection signal of 1.29 GHz at 0 dBm. The core oscillator consumes only 2 mW of power that makes the proposed design highly suitable for low-power transmitter applications.
{"title":"Power-Efficient Body-Coupled Self-Cascode LC Oscillator for Low-Power Injection-Locked Transmitter Applications","authors":"M. Haider, S. Islam","doi":"10.1109/ISVLSI.2009.14","DOIUrl":"https://doi.org/10.1109/ISVLSI.2009.14","url":null,"abstract":"Improving the efficiency and the power consumption of a RF transmitter in a wireless sensor network is a major design challenge. Unlike conventional transmitter architectures, injection-locked transmitter (ILTX) provides high efficiency with reduced power consumption. The core block of an ILTX is an injection-locked oscillator. Therefore this paper reports a low-voltage low-power injection-locked LC oscillator employing self-cascode structure and body-terminal coupling. Self-cascode structure of the oscillator provides low-voltage and low-power operation while body terminal coupling facilitates low-power operation without degrading voltage headroom. The proposed oscillator has been fabricated using 0.18-μm RF CMOS process. Measurement results indicate that the designed oscillator can operate with a supply voltage as low as 1 V and exhibits 36.75 MHz locking range for an injection signal of 1.29 GHz at 0 dBm. The core oscillator consumes only 2 mW of power that makes the proposed design highly suitable for low-power transmitter applications.","PeriodicalId":137508,"journal":{"name":"2009 IEEE Computer Society Annual Symposium on VLSI","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134293852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Die stacking is a promising new technology that enables integration of devices in the third dimension. It allows the stacking of multiple active layers directly on top of one another with short, dense die-to-die vias providing communication. Previous work has shown significant bene¿ts at all design targets, from stacking memory on logic to partitioning individual architectural units across multiple layers. Many high-speed processor units—ALUs, register ¿les, caches, and instruction schedulers—have all been designed in 3D, achieving signi¿cant, simultaneous power savings and performance boosts. Other work has looked at the implementation of network-on-chip in a die stack but restricted the focus to planar designs of the various unit(processors, routers, etc.). This work follows up on these two re-search areas to explore the 3D design of router components, speci¿cally the crossbar. We examine the implementation of a crossbar and two multistage interconnect networks to determine the potential bene¿ts of 3D implementations. Compared to equivalent planar designs,we achieve a maximum delay reduction of 26% and maximum power savings of 24%.
{"title":"High Performance Non-blocking Switch Design in 3D Die-Stacking Technology","authors":"D. L. Lewis, S. Yalamanchili, H. Lee","doi":"10.1109/ISVLSI.2009.53","DOIUrl":"https://doi.org/10.1109/ISVLSI.2009.53","url":null,"abstract":"Die stacking is a promising new technology that enables integration of devices in the third dimension. It allows the stacking of multiple active layers directly on top of one another with short, dense die-to-die vias providing communication. Previous work has shown significant bene¿ts at all design targets, from stacking memory on logic to partitioning individual architectural units across multiple layers. Many high-speed processor units—ALUs, register ¿les, caches, and instruction schedulers—have all been designed in 3D, achieving signi¿cant, simultaneous power savings and performance boosts. Other work has looked at the implementation of network-on-chip in a die stack but restricted the focus to planar designs of the various unit(processors, routers, etc.). This work follows up on these two re-search areas to explore the 3D design of router components, speci¿cally the crossbar. We examine the implementation of a crossbar and two multistage interconnect networks to determine the potential bene¿ts of 3D implementations. Compared to equivalent planar designs,we achieve a maximum delay reduction of 26% and maximum power savings of 24%.","PeriodicalId":137508,"journal":{"name":"2009 IEEE Computer Society Annual Symposium on VLSI","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131857922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gennette Gill, John Hansen, Ankur Agiwal, L. Vicci, Montek Singh
This paper presents the design of a greatest common divisor (GCD) chip as a case study in asynchronous or clockless design. The design uses fine-grain asynchronous pipelining to achieve fairly high performance. At the same time, the use of robust asynchronous handshaking in lieu of clocking allows the design to gracefully adapt its operation to voltage and temperature variations, without the need for clock recalibration.The design was fabricated in a 0.13$mu$m CMOS process, using standard cells and with full testability support. Resulting chips were evaluated for performance and robustness, using a large set of test vectors for good fault coverage. Under nominal operating conditions (1.5V and 27C), the fabricated parts were able to deliver up to 8 giga GCD algorithmic iterations per second (equivalent to 1 GHz clock speed). Moreover, they were functionally correct across a wide range of voltages (0.5V to 4V) and temperatures (-45C to 150C). This case study bolsters our confidence in the potential of aynchronous design techniques to help produce reliable ASICS that are fast, testable, and that operate under a wide range of conditions.
{"title":"A High-Speed GCD Chip: A Case Study in Asynchronous Design","authors":"Gennette Gill, John Hansen, Ankur Agiwal, L. Vicci, Montek Singh","doi":"10.1109/ISVLSI.2009.47","DOIUrl":"https://doi.org/10.1109/ISVLSI.2009.47","url":null,"abstract":"This paper presents the design of a greatest common divisor (GCD) chip as a case study in asynchronous or clockless design. The design uses fine-grain asynchronous pipelining to achieve fairly high performance. At the same time, the use of robust asynchronous handshaking in lieu of clocking allows the design to gracefully adapt its operation to voltage and temperature variations, without the need for clock recalibration.The design was fabricated in a 0.13$mu$m CMOS process, using standard cells and with full testability support. Resulting chips were evaluated for performance and robustness, using a large set of test vectors for good fault coverage. Under nominal operating conditions (1.5V and 27C), the fabricated parts were able to deliver up to 8 giga GCD algorithmic iterations per second (equivalent to 1 GHz clock speed). Moreover, they were functionally correct across a wide range of voltages (0.5V to 4V) and temperatures (-45C to 150C). This case study bolsters our confidence in the potential of aynchronous design techniques to help produce reliable ASICS that are fast, testable, and that operate under a wide range of conditions.","PeriodicalId":137508,"journal":{"name":"2009 IEEE Computer Society Annual Symposium on VLSI","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134320790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-performance buses often use staggered repeaters to mitigate the adverse impact on latency of worst-case capacitive crosstalk between adjacent wires by exploiting the data-dependent nature of crosstalk. An undesirable side effect of staggered repeaters is that they may increase the overall energy of a bus carrying highly correlated traffic associated with real-world benchmarks. In this paper, we introduce an energy model for a staggered-repeater bus (SRB)configuration and propose a low-power dynamic encoding scheme that yields average bus energy reductions for an SRB in excess of 28% and 26% for data and instruction traffic, respectively, for SPEC CPU2k benchmarks.
{"title":"Energy-Efficient Encoding for High-Performance Buses with Staggered Repeaters","authors":"S. Jayaprakash, N. Mahapatra","doi":"10.1109/ISVLSI.2009.58","DOIUrl":"https://doi.org/10.1109/ISVLSI.2009.58","url":null,"abstract":"High-performance buses often use staggered repeaters to mitigate the adverse impact on latency of worst-case capacitive crosstalk between adjacent wires by exploiting the data-dependent nature of crosstalk. An undesirable side effect of staggered repeaters is that they may increase the overall energy of a bus carrying highly correlated traffic associated with real-world benchmarks. In this paper, we introduce an energy model for a staggered-repeater bus (SRB)configuration and propose a low-power dynamic encoding scheme that yields average bus energy reductions for an SRB in excess of 28% and 26% for data and instruction traffic, respectively, for SPEC CPU2k benchmarks.","PeriodicalId":137508,"journal":{"name":"2009 IEEE Computer Society Annual Symposium on VLSI","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129473745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D integration is an emerging technology that allows for the vertical stacking of multiple silicon die. These stacked die are tightly integrated with through-silicon vias and promise significant power and area reductions by replacing long global wires with short vertical connections. This technology necessitates that neighboring logical blocks exist on different layers in the stack. However, such functional partitions disable intra-chip communication pre-bond and thus disrupt traditional test techniques.Previous work has described a general test architecture that enables pre-bond testability of an architecturally partitioned 3D processor and provided mechanisms for basic layer functionality. This work proposes new test methods for designs partitioned at the circuits level,in which the gates and transistors of individual circuits could be split across multiple die layers. We investigated a bit-partitioned adder unit and a port-split register file, which represents the most difficult circuit-partitioned design to test pre-bond but which is used widely in many circuits. Two layouts of each circuit, planar and 3D, are produced. Our experiments verify the performance and power results and examine the test coverage achieved.
{"title":"Testing Circuit-Partitioned 3D IC Designs","authors":"D. L. Lewis, H. Lee","doi":"10.1109/ISVLSI.2009.48","DOIUrl":"https://doi.org/10.1109/ISVLSI.2009.48","url":null,"abstract":"3D integration is an emerging technology that allows for the vertical stacking of multiple silicon die. These stacked die are tightly integrated with through-silicon vias and promise significant power and area reductions by replacing long global wires with short vertical connections. This technology necessitates that neighboring logical blocks exist on different layers in the stack. However, such functional partitions disable intra-chip communication pre-bond and thus disrupt traditional test techniques.Previous work has described a general test architecture that enables pre-bond testability of an architecturally partitioned 3D processor and provided mechanisms for basic layer functionality. This work proposes new test methods for designs partitioned at the circuits level,in which the gates and transistors of individual circuits could be split across multiple die layers. We investigated a bit-partitioned adder unit and a port-split register file, which represents the most difficult circuit-partitioned design to test pre-bond but which is used widely in many circuits. Two layouts of each circuit, planar and 3D, are produced. Our experiments verify the performance and power results and examine the test coverage achieved.","PeriodicalId":137508,"journal":{"name":"2009 IEEE Computer Society Annual Symposium on VLSI","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121458139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Field Programmable Gate Arrays offer flexibility to program hardware systems together with the possibility to explore any level of parallelism available in the application. Unfortunately, this flexibility costs a huge amount of circuit area necessary to implement all the routing switches and wires. Also, device scaling in new and future technologies brings along a severe increase in the soft error rate of circuits, for combinational and sequential logic. In order to reduce the impact of the wires and switches and cope with SETs in FPGAs, this work proposes a low power voltage-mode quaternary LUT (QLUT) design that uses quaternary logic to reduce the area spent in switches and routing wires. At the same time, the proposed QLUT provides robustness against SETs. Results show that the fault tolerant QLU There proposed detects all faults that can cause an error with significant less area and less power when comparing to the binary correspondent LUT protected with the DWC technique. In order to evaluate how the proposed QLUT will deal with the process variability of sub 90nm technologies, extensive Monte Carlo simulations were performed and these results are here discussed.
{"title":"A Low Cost Low Power Quaternary LUT Cell for Fault Tolerant Applications in Future Technologies","authors":"E. Rhod, L. Carro","doi":"10.1109/ISVLSI.2009.34","DOIUrl":"https://doi.org/10.1109/ISVLSI.2009.34","url":null,"abstract":"Field Programmable Gate Arrays offer flexibility to program hardware systems together with the possibility to explore any level of parallelism available in the application. Unfortunately, this flexibility costs a huge amount of circuit area necessary to implement all the routing switches and wires. Also, device scaling in new and future technologies brings along a severe increase in the soft error rate of circuits, for combinational and sequential logic. In order to reduce the impact of the wires and switches and cope with SETs in FPGAs, this work proposes a low power voltage-mode quaternary LUT (QLUT) design that uses quaternary logic to reduce the area spent in switches and routing wires. At the same time, the proposed QLUT provides robustness against SETs. Results show that the fault tolerant QLU There proposed detects all faults that can cause an error with significant less area and less power when comparing to the binary correspondent LUT protected with the DWC technique. In order to evaluate how the proposed QLUT will deal with the process variability of sub 90nm technologies, extensive Monte Carlo simulations were performed and these results are here discussed.","PeriodicalId":137508,"journal":{"name":"2009 IEEE Computer Society Annual Symposium on VLSI","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122026322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amongst the video compression standards, the latest one is the H.264/AVC. This standard reaches the highest compression rates when compared to the previous standards. On the other hand, it has a high computational complexity. This high computational complexity makes it difficult the development of software applications running in a current processor when high definitions videos are considered. Thus, hardware implementations become essential. Addressing the hardware architectures, this work presents the architectural design for the variable block size motion estimation (VBSME) defined in the H.264/AVC standard. This architecture is based on full search motion estimation algorithm and SAD calculation. This architecture is able to produce the 41 motion vectors within a macroblock as specified in the standard. The implementation of this architecture was based on standard cell methodology in 0.18μm CMOS technology. The architecture reached a throughput of 34 1080HD frames per second.
{"title":"Hardware Design of the H.264/AVC Variable Block Size Motion Estimation for Real-Time 1080HD Video Encoding","authors":"R. Porto, L. Agostini, S. Bampi","doi":"10.1109/ISVLSI.2009.11","DOIUrl":"https://doi.org/10.1109/ISVLSI.2009.11","url":null,"abstract":"Amongst the video compression standards, the latest one is the H.264/AVC. This standard reaches the highest compression rates when compared to the previous standards. On the other hand, it has a high computational complexity. This high computational complexity makes it difficult the development of software applications running in a current processor when high definitions videos are considered. Thus, hardware implementations become essential. Addressing the hardware architectures, this work presents the architectural design for the variable block size motion estimation (VBSME) defined in the H.264/AVC standard. This architecture is based on full search motion estimation algorithm and SAD calculation. This architecture is able to produce the 41 motion vectors within a macroblock as specified in the standard. The implementation of this architecture was based on standard cell methodology in 0.18μm CMOS technology. The architecture reached a throughput of 34 1080HD frames per second.","PeriodicalId":137508,"journal":{"name":"2009 IEEE Computer Society Annual Symposium on VLSI","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129265900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huaxi Gu, Mo Kwai Hung Morton, Jiang Xu, Wei Zhang
Networks-on-chip (NoCs) can improve the communication bandwidth and power efficiency of multiprocessor systems-on-chip (MPSoC). However, traditional metallic interconnects consume significant amount of power to deliver even higher communication bandwidth required in the near future. Optical NoCs are based on optical interconnects and optical routers, and have significant bandwidth and power advantages. This paper proposed a high-performance low-power low-cost optical router, Cygnus, for optical NoCs. Cygnus is non-blocking and based on silicon microresonators. We compared Cygnus with other microresonator-based routers, and analyzed their power consumption, optical power insertion loss, and the number of microresonators used in detail. The results show that Cygnus has the lowest power consumption and losses, and requires the lowest number of microresonators. For example, Cygnus has 50% less power consumption, 51% less optical power insertion loss, and 20% less microresonators than the optimized traditional optical crossbar router. Comparing to a high-performance 45nm electronic router, Cygnus consumes 96% less power. Moreover, the passive routing feature of Cygnus guarantees that, while using dimension order routing algorithm, the maximum power consumption to route a packet through a network is a small constant number, regardless of the network size. For example, the maximum power consumption is 4.80fJ/bit under current technologies. We simulated and analyzed an 8x8 2D mesh NoC built from Cygnus and showed the end-to-end delay and network throughput under different offered loads and packet sizes.
{"title":"A Low-power Low-cost Optical Router for Optical Networks-on-Chip in Multiprocessor Systems-on-Chip","authors":"Huaxi Gu, Mo Kwai Hung Morton, Jiang Xu, Wei Zhang","doi":"10.1109/ISVLSI.2009.19","DOIUrl":"https://doi.org/10.1109/ISVLSI.2009.19","url":null,"abstract":"Networks-on-chip (NoCs) can improve the communication bandwidth and power efficiency of multiprocessor systems-on-chip (MPSoC). However, traditional metallic interconnects consume significant amount of power to deliver even higher communication bandwidth required in the near future. Optical NoCs are based on optical interconnects and optical routers, and have significant bandwidth and power advantages. This paper proposed a high-performance low-power low-cost optical router, Cygnus, for optical NoCs. Cygnus is non-blocking and based on silicon microresonators. We compared Cygnus with other microresonator-based routers, and analyzed their power consumption, optical power insertion loss, and the number of microresonators used in detail. The results show that Cygnus has the lowest power consumption and losses, and requires the lowest number of microresonators. For example, Cygnus has 50% less power consumption, 51% less optical power insertion loss, and 20% less microresonators than the optimized traditional optical crossbar router. Comparing to a high-performance 45nm electronic router, Cygnus consumes 96% less power. Moreover, the passive routing feature of Cygnus guarantees that, while using dimension order routing algorithm, the maximum power consumption to route a packet through a network is a small constant number, regardless of the network size. For example, the maximum power consumption is 4.80fJ/bit under current technologies. We simulated and analyzed an 8x8 2D mesh NoC built from Cygnus and showed the end-to-end delay and network throughput under different offered loads and packet sizes.","PeriodicalId":137508,"journal":{"name":"2009 IEEE Computer Society Annual Symposium on VLSI","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115488711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}