Somrita Ghosh, P. Ghosal, Nabanita Das, S. Mohanty, Oghenekarho Okobiah
Achieving high-speed data communication in Chip Multi-Processor (CMP) based systems and Networks-on-Chip (NoCs) is always desired to meet target performance. Data communication links inside the communication fabric of CMP or NoC architectures have a strong impact on performance and power dissipation. While several approaches exist to reduce the power dissipation of parallel on-chip interconnect links, very few techniques have been reported for power reduction in serial links. The existing serial-link power reduction techniques do not necessarily account for the correlation exhibited in the data and hence are limited in terms of accuracy. In this paper, a novel data encoding scheme is proposed for serial links that decreases the number of self transitions in order to reduce the power consumed in data transmission. The proposed scheme accounts for the correlations in the data and hence is more effective for real-life applications. The system architecture and the encoding and decoding schemes have been implemented to explore the proposed algorithm, which is applicable to any CMP or NoC architecture. The proposed encoding scheme has been analyzed with various types of real-life data streams. Experimental results show that up to 27% reduction in power dissipation is possible in NoC links with the proposed scheme.
{"title":"Data Correlation Aware Serial Encoding for Low Switching Power On-Chip Communication","authors":"Somrita Ghosh, P. Ghosal, Nabanita Das, S. Mohanty, Oghenekarho Okobiah","doi":"10.1109/ISVLSI.2014.48","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.48","journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI"},"publicationDate":"2014-07-09","platform":"Semanticscholar","paperid":"131777477"}
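The quantity this abstract targets, the number of self transitions on a serial link, can be illustrated with a short sketch. The code below only counts transitions in an unencoded stream; it does not reproduce the paper's encoding scheme, which the abstract does not describe. The function names and the LSB-first serialization order are assumptions made for illustration.

```python
def serialize(words, width):
    """Flatten fixed-width words into one bit stream, least significant bit first.

    The LSB-first order is an assumption for this sketch.
    """
    return [(word >> i) & 1 for word in words for i in range(width)]


def self_transitions(bits):
    """Count positions where two consecutive bits on the serial wire differ."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


if __name__ == "__main__":
    alternating = serialize([0b1010, 0b0101], width=4)
    constant = serialize([0b0000, 0b0000], width=4)
    print(self_transitions(alternating), self_transitions(constant))
```

Each counted transition corresponds to one charge or discharge of the wire, so an encoder that lowers this count for the data actually transmitted lowers the switching power of the link.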
With the reduction of supply voltage motivated by power reduction, the signal-to-noise ratio of digital signals has decreased. Alternatively, a signal can be represented as a current while the supply voltage remains small. This gives rise to the field of current-mode signal processing circuits. In this work, we propose a current-mode analog Walsh-Hadamard processor whose control mechanism remains digital. The design is implemented in 0.35 μm CMOS technology. The Walsh-Hadamard transform is a complete transform and finds significant applications in image processing, filter design, and multiplexing. To the best of our knowledge, no such implementation exists in the published literature.
{"title":"A New Walsh Hadamard Transform Architecture Using Current Mode Circuit","authors":"S. Bhattacharya, S. Talapatra","doi":"10.1109/ISVLSI.2014.71","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.71","journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI"},"publicationDate":"2014-07-09","platform":"Semanticscholar","paperid":"125178315"}
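For readers unfamiliar with the transform that the proposed processor computes, the following is a minimal software sketch of the unnormalized Walsh-Hadamard transform in natural (Hadamard) order. It is a plain digital reference, not a model of the current-mode analog circuit; the function name `fwht` is an assumption for this example.

```python
def fwht(values):
    """Return the unnormalized Walsh-Hadamard transform of a power-of-two-length sequence."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart using sums and differences.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = [3, -1, 4, 1, 5, 9, 2, 6]
    spectrum = fwht(x)
    # Applying the transform twice returns the input scaled by its length.
    print(spectrum, fwht(spectrum))
```

Because the transform uses only additions and subtractions, applying it twice recovers the input scaled by the sequence length, so no information is lost.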
In this paper, we focus on the hypergraph bipartitioning problem and present a new multilevel hypergraph partitioning algorithm that is much faster than hMETIS while achieving similar quality. In the coarsening phase, successively coarser hypergraphs are constructed using the MFCC (Modified First-Choice Coarsening) algorithm. After obtaining a small hypergraph containing only a small number of vertices, we use a randomized algorithm to obtain an initial partition and then apply an A-FM (Alternating Fiduccia-Mattheyses) refinement algorithm to optimize it. In the uncoarsening phase, we extract clusters level by level and apply A-FM repeatedly. Experiments on large benchmarks released in the DAC 2012 Routability-Driven Placement Contest show that we can achieve similar or even better quality (1% improvement in minimum cut on average) and save 50% to 80% of the running time compared with the state-of-the-art partitioner hMETIS.
{"title":"A Fast Hypergraph Bipartitioning Algorithm","authors":"Wenzan Cai, Evangeline F. Y. Young","doi":"10.1109/ISVLSI.2014.58","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.58","journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI"},"publicationDate":"2014-07-09","platform":"Semanticscholar","paperid":"116201708"}
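The objective that Fiduccia-Mattheyses style refinement improves can be sketched in a few lines. The code below computes the cut size of a bipartition and the gain of moving one vertex by naive recomputation. It is not the A-FM algorithm from the paper, and practical FM implementations update gains incrementally with bucket structures; the names `cut_size` and `move_gain` are assumptions for this example.

```python
def cut_size(nets, side):
    """Count the nets (hyperedges) that have vertices on both sides of the bipartition.

    `side` maps each vertex to 0 or 1.
    """
    return sum(len({side[v] for v in net}) == 2 for net in nets)


def move_gain(nets, side, vertex):
    """Return the reduction in cut size obtained by moving `vertex` to the other side."""
    flipped = dict(side)
    flipped[vertex] = 1 - side[vertex]
    return cut_size(nets, side) - cut_size(nets, flipped)


if __name__ == "__main__":
    nets = [("a", "b"), ("b", "c", "d"), ("a", "d")]
    side = {"a": 0, "b": 0, "c": 1, "d": 1}
    print(cut_size(nets, side))
    print({v: move_gain(nets, side, v) for v in side})
```

A refinement pass repeatedly moves the vertex with the best gain, subject to a balance constraint, and keeps the best partition seen during the pass.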
A security bug in the OpenSSL library, codenamed Heartbleed, allowed attackers to read the contents of the affected server's memory, effectively revealing passwords, master keys, and users' session cookies. As long as the server memory contents are in the clear, it is only a matter of time until the next bug or attack hands information over to attackers. In this paper, we investigate the applicability of privacy-preserving general-purpose computation, which would potentially render any leaked information indecipherable to attackers. Privacy is ensured by the use of homomorphically encrypted memory contents. To this end, we explore the boundaries of general-purpose computation constrained for user data privacy. Specifically, we explore the minimum amount of information required for general-purpose computation, which typically requires control flow and branches, and to what extent such information can be kept private from threats that have theoretically unlimited resources, including access to the internals of a target system.
{"title":"Trust No One: Thwarting \"heartbleed\" Attacks Using Privacy-Preserving Computation","authors":"N. G. Tsoutsos, M. Maniatakos","doi":"10.1109/ISVLSI.2014.86","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.86","journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI"},"publicationDate":"2014-07-09","platform":"Semanticscholar","paperid":"126704072"}
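To make the idea of computing on encrypted memory contents concrete, here is a toy sketch of an additively homomorphic cryptosystem. It uses the Paillier scheme as an assumption, since the abstract does not name the scheme the authors use. The tiny primes and the fixed blinding values `r` are for illustration only and provide no security; a real deployment would use large primes and fresh random `r`. The code relies on `pow` with a negative exponent, which needs Python 3.8 or later.

```python
from math import gcd


def keygen(p, q):
    """Build a toy Paillier key pair from two primes, using generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, message, r):
    """Encrypt `message` (0 <= message < n) with blinding value r coprime to n."""
    n2 = n * n
    return (pow(n + 1, message, n2) * pow(r, n, n2)) % n2


def decrypt(n, private_key, ciphertext):
    """Recover the plaintext modulo n."""
    lam, mu = private_key
    u = pow(ciphertext, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiply ciphertexts to obtain an encryption of the sum of the plaintexts."""
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    n, private_key = keygen(11, 13)
    c_sum = add_encrypted(n, encrypt(n, 20, r=7), encrypt(n, 22, r=5))
    print(decrypt(n, private_key, c_sum))
```

The party performing `add_encrypted` never sees the plaintexts, which is the property that would keep leaked memory contents indecipherable.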
Programmable reversible logic is emerging as a prospective logic design style for implementation in modern nanotechnology and quantum computing with minimal impact on circuit heat generation. Adiabatic logic is a design methodology for reversible logic in CMOS in which the current flow through the circuit is controlled such that the energy dissipation due to switching and capacitor dissipation is minimized. Production of cost-effective secure integrated chips, such as smart cards, requires hardware designers to consider tradeoffs in size, security, and power consumption. To produce successful security-centric designs, the low-level hardware must contain built-in protection mechanisms that supplement cryptographic algorithms such as AES and Triple DES by preventing side-channel attacks, such as Differential Power Analysis (DPA). Dynamic logic obfuscates the output waveforms and the circuit operation, reducing the effectiveness of the DPA attack. In this dissertation, I address theory, synthesis, and application of adiabatic and reversible logic circuits for security applications. First, we present a mathematical proof to demonstrate that reversible logic can be used to design sequential computing structures. Next, a novel algorithm for synthesis of adiabatic circuits in CMOS is presented. This approach is unique because it correlates the offsets in the permutation matrix to the transistors required for synthesis, instead of determining an equivalent circuit and substituting a previously synthesized circuit from a library. Using the ESPRESSO heuristic method for minimization of Boolean functions on each output node in parallel, we optimize the synthesized circuit. It is demonstrated that the algorithm produces a 32.86% improvement over previously synthesized circuit benchmarks. For stronger mitigation of DPA attacks, we propose the implementation of Adiabatic Dynamic Differential Logic for applications in secure IC design. A Performance Adiabatic Dynamic Differential Logic (PADDL) is presented for implementation in high-frequency secure ICs. This method improves the differential power over previous dynamic and differential logic methods by up to 89.65. Then, we present an adiabatic S-box which significantly reduces energy imbalance compared to previous benchmarks. The design is capable of forward encryption and reverse decryption with minimal overhead, allowing for efficient hardware reuse.
{"title":"Theory, Synthesis, and Application of Adiabatic and Reversible Logic Circuits for Security Applications","authors":"Matthew Morrison","doi":"10.1109/ISVLSI.2014.88","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.88","journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI"},"publicationDate":"2014-07-09","platform":"Semanticscholar","paperid":"126187333"}
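The link between reversible logic and permutation matrices mentioned in this abstract can be shown with a small sketch. A gate is reversible exactly when it maps distinct inputs to distinct outputs, so its truth table is a permutation of the input states. The Toffoli gate is used here as a standard example; it is an assumption for illustration and is not taken from the dissertation's synthesis algorithm.

```python
from itertools import product


def toffoli(a, b, c):
    """Controlled-controlled-NOT: flips c only when both a and b are 1."""
    return a, b, c ^ (a & b)


def is_reversible(gate, width):
    """A gate is reversible if no two input states share an output state."""
    inputs = list(product((0, 1), repeat=width))
    outputs = {gate(*bits) for bits in inputs}
    return len(outputs) == len(inputs)


def permutation_matrix(gate, width):
    """Row i holds a single 1 in the column of the output state for input state i."""
    states = list(product((0, 1), repeat=width))
    index = {state: i for i, state in enumerate(states)}
    matrix = [[0] * len(states) for _ in states]
    for state in states:
        matrix[index[state]][index[gate(*state)]] = 1
    return matrix


if __name__ == "__main__":
    print(is_reversible(toffoli, 3))
    for row in permutation_matrix(toffoli, 3):
        print(row)
```

For the Toffoli gate the matrix is the identity except that the rows for inputs 110 and 111 are swapped, and the off-diagonal entries mark where the gate actually changes state.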
With growing applications and increased integration of functionalities on multi-electrode biosensors, more attention is being paid to the need for on-chip temperature measurement, both to monitor the ambient temperature of bio-samples and to record the heat generated by biosensor chips and its potential damage to bio-samples. This paper presents an integrated temperature sensor design which is intended to provide ambient temperature monitoring in a highly integrated biosensor system. Special attention was paid to improving power supply rejection (PSR) performance at the 1 MHz clock frequency of the integrated biosensor system using PSR-enhanced OTAs. The temperature sensor design was implemented using a commercial 0.18 μm CMOS process. The temperature sensor achieves an inaccuracy of -0.34°C to 0.27°C from -30°C to 80°C. At 36°C, the PSR is around -50 dB at 1 MHz and -89.5 dB at DC.
{"title":"A CMOS Temperature Sensor with -0.34°C to 0.27°C Inaccuracy from -30°C to 80°C","authors":"Hai Chi, Tom Chen","doi":"10.1109/ISVLSI.2014.30","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.30","journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI"},"publicationDate":"2014-07-09","platform":"Semanticscholar","paperid":"128559406"}
Juan Yi, Weichen Liu, Weiwen Jiang, Mingwen Qin, Lei Yang, Duo Liu, Chunming Xiao, Luelue Du, E. Sha
With the increasing power density and number of cores integrated into a single chip, thermal management is widely recognized as one of the essential issues in Multi-Processor Systems-on-Chip (MPSoCs). An uncontrolled temperature could significantly decrease system performance, lead to high cooling and packaging costs, and even cause serious damage. These issues have made temperature one of the major factors that must be addressed in MPSoC designs. Static scheduling of applications should take the thermal effects of task executions into consideration to keep the chip temperature under a safety threshold. However, inaccurate temperature estimation can cause processor overheating or system performance degradation. In this paper, we propose an improved thermal modeling technique that can be used to predict the chip temperature more accurately and efficiently at design time. We further develop a simulated annealing (SA)-based algorithm to address the static application mapping and scheduling problem based on the improved thermal model. The thermal condition is greatly improved and the total energy consumption is minimized. Experimental results show that the improved thermal modeling technique provides an average temperature prediction accuracy of over 99% when compared with the results of HotSpot simulations. Based on this model, the SA-based algorithm reduces the chance that the temperature threshold is violated at runtime by 24.3%.
{"title":"An Improved Thermal Model for Static Optimization of Application Mapping and Scheduling in Multiprocessor System-on-Chip","authors":"Juan Yi, Weichen Liu, Weiwen Jiang, Mingwen Qin, Lei Yang, Duo Liu, Chunming Xiao, Luelue Du, E. Sha","doi":"10.1109/ISVLSI.2014.40","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.40","journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI"},"publicationDate":"2014-07-09","platform":"Semanticscholar","paperid":"128233846"}
This paper presents a novel neuromemristive architecture for pattern classification based on extreme learning machines (ELMs). Specifically, we propose CMOS current-mode neuron circuits, memristor-based bipolar synapse circuits, and a stochastic, hardware-friendly training approach based on the least-mean-squares (LMS) learning algorithm. These components are integrated into a current-mode ELM architecture. We show that the current-mode design is especially efficient for implementing constant network weights between the ELM's input and hidden layers. The neuromemristive ELM was simulated in the Cadence AMS design environment. We used a memristor model based on experimental data from an HfO_x device. The top-level design was validated by training a network with 10 hidden nodes to detect edges in binary patterns. Results indicate that the proposed architecture and learning approach are able to yield 100% classification accuracy.
{"title":"Neuromemristive Extreme Learning Machines for Pattern Classification","authors":"Cory E. Merkel, D. Kudithipudi","doi":"10.1109/ISVLSI.2014.67","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.67","journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI"},"publicationDate":"2014-07-09","platform":"Semanticscholar","paperid":"113950113"}
In typical NoC systems, most Routing Algorithms (RAs) abandon the interconnection between two adjacent routers if one traffic direction is broken, regardless of whether the other direction is still functional. In this paper, we propose a distributed logic based RA, which can efficiently utilize the UnPaired Functional (UPF) links in such partially defective interconnects. The basic fault pattern tolerated by the proposed RA is a fault wall, which is composed of adjacent broken links with the same outgoing direction. Messages are routed around the fault walls along the misrouting contours of the broken links. The proposed RA requires at least 3 Virtual Channels (VCs) and dynamically reserves them for misrouted messages to avoid deadlock. Our experiments indicate that, for random and localized traffic patterns, we achieve an average saturation throughput 20% higher than the Solid Fault Region Tolerant (SFRT) RA, and 22% and 14% higher than the Ariadne routing table based RA, respectively. For the real applications, sample and satell, our proposal requires a routing execution time at least 16% shorter than both SFRT and Ariadne. Synthesis results with Synopsys Design Compiler and TSMC 65 nm technology indicate that embedding the proposed RA into a baseline router results in 11% area overhead, which is only 3% higher than that of SFRT. In contrast, the Ariadne area overhead is 15% for an 8 × 8 NoC and increases to 21% for a 10 × 10 NoC.
{"title":"Towards an Effective Utilization of Partially Defected Interconnections in 2D Mesh NoCs","authors":"Changlin Chen, S. Cotofana","doi":"10.1109/ISVLSI.2014.70","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.70","journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI"},"publicationDate":"2014-07-09","platform":"Semanticscholar","paperid":"130132380"}
Due to the limitations of traditional bus-based systems, Network-on-Chip (NoC) has evolved as the dominant technology in the communication-centric paradigm, where, besides computation, communication between the cores is an indispensable aspect of an SoC. Furthermore, the emergence of three-dimensional integrated circuits (3D-ICs) has resulted in better performance, functionality, and packaging density compared to traditional planar ICs. The amalgamation of these two technologies, the 3D NoC architecture, can combine the benefits of both domains to offer an unprecedented performance gain. In this paper, we present a new 3D topological NoC design based on the butterfly fat tree (BFT) topology with an efficient table-based uniform routing algorithm for 3D NoC. Extensive simulation experiments have been performed for BFT and compared to mesh, torus, butterfly, and flattened butterfly topologies against four performance metrics, viz. overall average latency, overall average acceptance rate, overall minimum acceptance rate, and average hop count. There are significant latency improvements of 43-89%, 83-88%, 46-96%, and 31-95% over the other topologies, respectively. Average hop count is improved by 30% and 13% over mesh and torus, respectively. Also, there are improvements in average acceptance rate and minimum acceptance rate of 1-8% and 5-14%, respectively, over flattened butterfly, and of 6-9% and 6-13% over torus. The results show that BFT is a very good choice for low network latency and faster communication.
{"title":"A Low Latency Scalable 3D NoC Using BFT Topology with Table Based Uniform Routing","authors":"Avik Bose, P. Ghosal, S. Mohanty","doi":"10.1109/ISVLSI.2014.51","DOIUrl":"https://doi.org/10.1109/ISVLSI.2014.51","journal":{"name":"2014 IEEE Computer Society Annual Symposium on VLSI"},"publicationDate":"2014-07-09","platform":"Semanticscholar","paperid":"133905678"}
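Average hop count, one of the metrics compared in this abstract, can be computed directly for the two baseline topologies. The sketch below assumes minimal routing, uniform traffic between all pairs of distinct nodes, and a k-ary grid in `dims` dimensions. It does not model the BFT topology or the paper's simulator, and the function names are assumptions for this example.

```python
from itertools import product


def mesh_hops(src, dst, k):
    """Minimal hop count in a mesh: Manhattan distance (k is unused, kept for a uniform signature)."""
    return sum(abs(s - d) for s, d in zip(src, dst))


def torus_hops(src, dst, k):
    """Minimal hop count in a torus: each dimension may wrap around."""
    return sum(min(abs(s - d), k - abs(s - d)) for s, d in zip(src, dst))


def average_hops(hops, k, dims=2):
    """Average minimal hop count over all ordered pairs of distinct nodes."""
    nodes = list(product(range(k), repeat=dims))
    total = sum(hops(s, d, k) for s in nodes for d in nodes if s != d)
    return total / (len(nodes) * (len(nodes) - 1))


if __name__ == "__main__":
    for dims in (2, 3):
        print(dims, average_hops(mesh_hops, 4, dims), average_hops(torus_hops, 4, dims))
```

Setting `dims=3` gives the corresponding figures for a 3D grid, which is the kind of baseline a 3D NoC topology is measured against.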