Networks-on-Chip (NoCs) serve as efficient and scalable communication substrates for many-core architectures. Currently, the bandwidth provided in NoCs is overprovisioned for their typical use case. In real-world multi-core applications, less than 5% of channels are utilized on average. Large bandwidth resources serve to keep network latency low during periods of peak communication demand. Increasing the average channel utilization through narrower channels could improve the efficiency of NoCs in terms of area and power; however, in current NoC architectures this degrades overall system performance. Based on thorough analysis of the dynamic behaviour of real workloads, we design a novel NoC architecture that adapts to changing application demands. Our architecture uses fine-grained bandwidth-adaptive bidirectional channels to improve channel utilization without negatively affecting network latency. Running PARSEC benchmarks on a cycle-accurate full-system simulator, we show that fine-grained bandwidth adaptivity can save up to 75% of channel resources while achieving 92% of overall system performance compared to the baseline network; no performance is sacrificed when our network design is configured with 50% of the channel resources used in the baseline.
{"title":"Fine-Grained Bandwidth Adaptivity in Networks-on-Chip Using Bidirectional Channels","authors":"R. Hesse, J. Nicholls, Natalie D. Enright Jerger","doi":"10.1109/NOCS.2012.23","DOIUrl":"https://doi.org/10.1109/NOCS.2012.23","journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","pages":"132-141"},"publicationDate":"2012-05-09"}
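As a rough illustration of the bandwidth-adaptive bidirectional channels described in this abstract, the sketch below divides a fixed pool of lanes on one link between its two directions according to pending demand. The function name `split_lanes`, the proportional policy, and the one-lane minimum per direction are assumptions made for illustration; the abstract does not specify the paper's actual allocation mechanism.

```python
def split_lanes(total_lanes, demand_fwd, demand_rev):
    """Return (forward_lanes, reverse_lanes) for one bidirectional link.

    Hypothetical policy: share lanes in proportion to pending flits, but keep
    at least one lane per direction so that neither side is starved.
    """
    if total_lanes < 2:
        raise ValueError("need at least one lane per direction")
    total_demand = demand_fwd + demand_rev
    if total_demand == 0:
        fwd = total_lanes // 2  # idle link: split evenly
    else:
        fwd = round(total_lanes * demand_fwd / total_demand)
    # Clamp so that each direction always keeps one lane.
    fwd = max(1, min(total_lanes - 1, fwd))
    return fwd, total_lanes - fwd
```

With four lanes and traffic pending in only one direction, this policy assigns three lanes to the busy direction and keeps one in reserve for the other, which is the sense in which a narrower link can still serve a burst in one direction.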
Chris Fallin, Greg Nazario, Xiangyao Yu, K. Chang, Rachata Ausavarungnirun, O. Mutlu
A conventional Network-on-Chip (NoC) router uses input buffers to store in-flight packets. These buffers improve performance, but consume significant power. It is possible to bypass these buffers when they are empty, reducing dynamic power, but static buffer power, and dynamic power when buffers are utilized, remain. To improve energy efficiency, bufferless deflection routing removes input buffers, and instead uses deflection (misrouting) to resolve contention. However, at high network load, deflections cause unnecessary network hops, wasting power and reducing performance. In this work, we propose a new NoC router design called the minimally-buffered deflection (MinBD) router. This router combines deflection routing with a small "side buffer," which is much smaller than conventional input buffers. A MinBD router places some network traffic that would have otherwise been deflected in this side buffer, reducing deflections significantly. The router buffers only a fraction of traffic, thus making more efficient use of buffer space than a router that holds every flit in its input buffers. We evaluate MinBD against input-buffered routers of various sizes that implement buffer bypassing, a bufferless router, and a hybrid design, and show that MinBD is more energy efficient than all prior designs, and has performance that approaches the conventional input-buffered router with area and power close to the bufferless router.
{"title":"MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect","authors":"Chris Fallin, Greg Nazario, Xiangyao Yu, K. Chang, Rachata Ausavarungnirun, O. Mutlu","doi":"10.1109/NOCS.2012.8","DOIUrl":"https://doi.org/10.1109/NOCS.2012.8","journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","pages":"1-10"},"publicationDate":"2012-05-09"}
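The interplay of deflection and a small side buffer described in this abstract can be sketched for a single router cycle as follows. This is a simplified model under stated assumptions, not the MinBD microarchitecture: the function `route_cycle`, the oldest-first priority order, and the rule that any losing flit may enter the side buffer while space remains are all hypothetical.

```python
from collections import deque


def route_cycle(flits, num_ports, side_buffer, side_capacity):
    """Assign flits to output ports for one cycle of a deflection router.

    flits: list of (flit_id, wanted_port), ordered by priority (oldest first).
    Returns (assignment, deflected): assignment maps port -> flit_id, and
    deflected lists the flit ids sent to a port they did not want.
    Assumes len(flits) <= num_ports and 0 <= wanted_port < num_ports.
    """
    assignment = {}
    losers = []
    for flit_id, wanted in flits:
        if wanted not in assignment:
            assignment[wanted] = flit_id   # productive hop
        else:
            losers.append(flit_id)         # lost arbitration for its port

    deflected = []
    free_ports = [p for p in range(num_ports) if p not in assignment]
    for flit_id in losers:
        if len(side_buffer) < side_capacity:
            side_buffer.append(flit_id)    # buffered instead of deflected
        else:
            assignment[free_ports.pop(0)] = flit_id
            deflected.append(flit_id)      # misrouted to a free port
    return assignment, deflected


# Usage: three flits contend for port 0 in a 4-port router with a 1-entry side buffer.
side = deque()
ports, misrouted = route_cycle([("a", 0), ("b", 0), ("c", 0)], 4, side, 1)
```

In this usage, flit `a` wins port 0, flit `b` is held in the side buffer, and only flit `c` is deflected. With `side_capacity=0` the same call deflects both `b` and `c`, which corresponds to the purely bufferless case.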
Hyunjun Jang, Baik Song An, Nikhil Kulkarni, K. H. Yum, Eun Jung Kim
As chip multiprocessor (CMP) design moves toward many-core architectures, communication delay in the Network-on-Chip (NoC) has become a major bottleneck in CMP systems. Using high-density memories in input buffers helps to reduce the bottleneck by increasing throughput. Spin-Torque Transfer Magnetic RAM (STT-MRAM) can be a suitable solution due to its high density and near-zero leakage power, but its long latency and high power consumption in write operations still need to be addressed. We explore the design issues in using STT-MRAM for NoC input buffers. Motivated by short intra-router latency, we use a previously proposed write latency reduction technique that sacrifices retention time. We then propose a hybrid design of input buffers using both SRAM and STT-MRAM to hide the long write latency efficiently. Considering that simple data migration in the hybrid buffer consumes more dynamic power than SRAM, we provide a lazy migration scheme that reduces the dynamic power consumption of the hybrid buffer. Simulation results show that the proposed scheme enhances the throughput by 21% on average.
{"title":"A Hybrid Buffer Design with STT-MRAM for On-Chip Interconnects","authors":"Hyunjun Jang, Baik Song An, Nikhil Kulkarni, K. H. Yum, Eun Jung Kim","doi":"10.1109/NOCS.2012.30","DOIUrl":"https://doi.org/10.1109/NOCS.2012.30","journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","pages":"193-200"},"publicationDate":"2012-05-09"}
High-end MPSoC systems with built-in high-radix topologies achieve good performance because of the improved connectivity and the reduced network diameter. In high-end MPSoC systems, fault tolerance support is becoming a compulsory feature. In this work, we propose a combined method to address permanent and transient link and router failures in those systems. The LBDRhr mechanism is proposed to tolerate permanent link failures in some popular high-radix topologies. The increased router complexity may lead to more transient router errors than in routers using a simple XY routing algorithm. We exploit the inherent information redundancy (IIR) in LBDRhr logic to manage transient errors in the network routers. Thorough analyses are provided to discover the appropriate internal nodes and the forbidden signal patterns for transient error detection. Simulation results show that LBDRhr logic can tolerate all of the permanent failure combinations of long-range links and 80% of link failures on short-range links. Case studies show that the error detection method based on the new IIR extraction method reduces the power consumption by 33% and the residual error rate by up to two orders of magnitude, compared to triple modular redundancy. The impact of network topologies on the efficiency of the detection mechanism is also examined in this work.
{"title":"Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip","authors":"Qiaoyan Yu, José Cano, J. Flich, P. Ampadu","doi":"10.1109/NOCS.2012.27","DOIUrl":"https://doi.org/10.1109/NOCS.2012.27","journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","pages":"169-176"},"publicationDate":"2012-05-09"}
Stavros Volos, Ciprian Seiculescu, Boris Grot, Naser Khosro Pour, B. Falsafi, G. Micheli
Many-core chips are emerging as the architecture of choice to provide power efficiency and improve performance, while riding Moore's Law. In these architectures, on-chip interconnects play a pivotal role in ensuring power and performance scalability. As supply voltages begin to level off in future technologies, chip designs in general and interconnects in particular will require specialization to meet power and performance objectives. In this work, we make the observation that cache-coherent many-core server chips exhibit a duality in on-chip network traffic. Request traffic largely consists of simple control messages, while response traffic often carries cache-block-sized payloads. We present Cache-Coherence Network-on-Chip (CCNoC), a design that specializes the NoC to fit the demands of server workloads via a pair of asymmetric networks tuned to the type of traffic traversing them. The networks differ in their datapath width, router microarchitecture, flow control strategy, and delay. The resulting heterogeneous CCNoC architecture enables significant gains in power efficiency over conventional NoC designs at similar performance levels. Our evaluation reveals that a 4×4 mesh-based chip multiprocessor with the proposed CCNoC organization running commercial server workloads is 15-28% more energy efficient than various state-of-the-art single- and dual-network organizations.
{"title":"CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers","authors":"Stavros Volos, Ciprian Seiculescu, Boris Grot, Naser Khosro Pour, B. Falsafi, G. Micheli","doi":"10.1109/NOCS.2012.15","DOIUrl":"https://doi.org/10.1109/NOCS.2012.15","journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","pages":"67-74"},"publicationDate":"2012-05-09"}
P. Bogdan, R. Marculescu, Siddhartha Jain, Rafael Tornero Gavilá
Reducing energy consumption in multi-processor systems-on-chip (MPSoCs) where communication happens via the network-on-chip (NoC) approach calls for multiple voltage/frequency island (VFI)-based designs. In turn, such multi-VFI architectures need efficient, robust, and accurate run-time control mechanisms that can exploit the workload characteristics in order to save power. Despite being tractable, the linear control models for power management cannot capture some important workload characteristics (e.g., fractality, non-stationarity) observed in heterogeneous NoCs. If ignored, such characteristics lead to inefficient communication and resource allocation, as well as high power dissipation in MPSoCs. To mitigate such limitations, we propose a paradigm shift from power optimization based on linear models to control approaches based on fractal-state equations. As such, our approach is the first to propose a controller for fractal workloads with precise constraints on state and control variables and specific time bounds. Our results show that significant power savings (about 70%) can be achieved at run-time while running a variety of benchmark applications.
{"title":"An Optimal Control Approach to Power Management for Multi-Voltage and Frequency Islands Multiprocessor Platforms under Highly Variable Workloads","authors":"P. Bogdan, R. Marculescu, Siddhartha Jain, Rafael Tornero Gavilá","doi":"10.1109/NOCS.2012.32","DOIUrl":"https://doi.org/10.1109/NOCS.2012.32","journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","pages":"35-42"},"publicationDate":"2012-05-09"}
Aggressive MOS transistor size scaling substantially increases the probability of faults in NoC links due to manufacturing defects, process variations, and chip wear-out effects. Strategies have been proposed to tolerate faulty wires by replacing them with spare ones or by partially using the defective links. However, these strategies either suffer from high area and power overheads, or significantly increase the average network latency. In this paper, we propose a novel flit serialization method, which divides the links and flits into several sections, and serializes flit sections of adjacent flits to transmit them on all available fault-free link sections, avoiding the complete waste of the defective links' bandwidth. Experimental results indicate that our method reduces the latency overhead significantly and enables graceful performance degradation when compared with related partially faulty link usage proposals, and saves area and power overheads by up to 29% and 43.1%, respectively, when compared with spare wire replacement methods.
{"title":"A Novel Flit Serialization Strategy to Utilize Partially Faulty Links in Networks-on-Chip","authors":"Changlin Chen, Ye Lu, S. Cotofana","doi":"10.1109/NOCS.2012.22","DOIUrl":"https://doi.org/10.1109/NOCS.2012.22","journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","pages":"124-131"},"publicationDate":"2012-05-09"}
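The flit serialization idea in this abstract, sending sections of adjacent flits over whichever link sections are still fault-free, can be sketched as a simple scheduling function. The name `serialize`, the data layout, and the in-order packing policy are assumptions for illustration; the abstract does not describe the paper's hardware scheme at this level of detail.

```python
def serialize(flits, link_ok):
    """Schedule flit sections onto the fault-free sections of a link.

    flits: list of flits, each given as a list of sections.
    link_ok: list of booleans, True where a link section is fault-free.
    Returns a list of cycles; each cycle maps a link-section index to the
    (flit index, section index) it carries in that cycle.
    """
    good = [i for i, ok in enumerate(link_ok) if ok]
    if not good:
        raise ValueError("link has no fault-free section")
    # Flatten all sections into one stream so that a single cycle can carry
    # the tail of one flit together with the head of the next.
    stream = [(f, s) for f, flit in enumerate(flits) for s in range(len(flit))]
    cycles = []
    for start in range(0, len(stream), len(good)):
        chunk = stream[start:start + len(good)]
        cycles.append(dict(zip(good, chunk)))
    return cycles


# Usage: two 4-section flits over a 4-section link whose section 1 is faulty.
schedule = serialize(
    [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]],
    [True, False, True, True],
)
```

In this usage the eight sections fit into three cycles on the three healthy link sections. Serializing each flit on its own would need two cycles per flit, four in total, so packing across flit boundaries is where the latency saving comes from in this toy model.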
Snaider Carrillo, J. Harkin, L. McDaid, S. Pande, Seamus Cawley, Brian McGinley, F. Morgan
The complexity of inter-neuron connectivity is prohibiting scalable hardware implementations of spiking neural networks (SNNs). Traditional neuron interconnect using a shared bus topology is not scalable due to non-linear growth of neuron connections with the neural network size. This paper presents a novel hierarchical NoC (H-NoC) architecture for SNN hardware which addresses the scalability issue by creating a 3-dimensional array of clusters of neurons with a hierarchical structure of low and high-level routers. The H-NoC architecture also incorporates a spike traffic compression technique to exploit SNN traffic patterns, thus reducing traffic overhead and improving throughput on the network. In addition, adaptive routing capabilities between clusters balance local and global traffic loads to sustain throughput under bursting activity. Simulation results show a high throughput per cluster (3.33×10⁹ spikes/second), and synthesis results using 65-nm CMOS technology demonstrate low cost area (0.587 mm²) and power consumption (13.16 mW at 100 MHz) for a single cluster of 400 neurons, which outperforms existing SNN hardware strategies.
{"title":"Hierarchical Network-on-Chip and Traffic Compression for Spiking Neural Network Implementations","authors":"Snaider Carrillo, J. Harkin, L. McDaid, S. Pande, Seamus Cawley, Brian McGinley, F. Morgan","doi":"10.1109/NOCS.2012.17","DOIUrl":"https://doi.org/10.1109/NOCS.2012.17","journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","pages":"83-90"},"publicationDate":"2012-05-09"}
N. Nikitin, Javier de San Pedro, J. Carmona, J. Cortadella
The continuous scaling of nanoelectronics is increasing the complexity of chip multiprocessors (CMPs) and exacerbating the memory wall problem. As CMPs become more complex, the memory subsystem is organized into more hierarchical structures to better exploit locality. During the exploration and design of CMP architectures, it is essential to efficiently analyze their performance. However, performance is highly determined by the latency of the memory subsystem, which in turn has a cyclic dependency with the memory traffic generated by the cores. This paper proposes a scalable analytical method to estimate the performance of highly parallel CMPs (hundreds of cores) with hierarchical interconnect fabrics. The method can use customizable probabilistic models and solves the cyclic dependencies by using a fixed-point strategy. The technique is shown to be a very accurate and efficient strategy when compared to the results obtained by simulation.
{"title":"Analytical Performance Modeling of Hierarchical Interconnect Fabrics","authors":"N. Nikitin, Javier de San Pedro, J. Carmona, J. Cortadella","doi":"10.1109/NOCS.2012.20","DOIUrl":"https://doi.org/10.1109/NOCS.2012.20","journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","pages":"107-114"},"publicationDate":"2012-05-09"}
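The cyclic dependency this abstract mentions (memory latency depends on traffic, and traffic depends on latency) is the kind of problem a fixed-point iteration resolves. The sketch below is a toy instance under loud assumptions: both model equations, the M/M/1-style queueing term, the damping factor, and every parameter name are hypothetical placeholders, not the paper's probabilistic models.

```python
def solve_fixed_point(n_cores, think_time, service_time, base_latency,
                      tol=1e-9, max_iter=10_000):
    """Iterate latency -> traffic -> latency until the two agree.

    Returns (latency, total_request_rate). All model equations here are
    hypothetical placeholders chosen only to show the iteration.
    """
    latency = base_latency
    for _ in range(max_iter):
        # Each core issues a request, waits `latency`, then computes for
        # `think_time` before issuing the next one.
        rate_per_core = 1.0 / (think_time + latency)
        total_rate = n_cores * rate_per_core
        utilization = min(total_rate * service_time, 0.999)
        # Queueing term: latency grows as the fabric approaches saturation.
        new_latency = base_latency + service_time / (1.0 - utilization)
        if abs(new_latency - latency) < tol:
            return new_latency, total_rate
        latency = 0.5 * (latency + new_latency)  # damping for stable convergence
    raise RuntimeError("fixed point did not converge")


# Usage with made-up numbers (time in cycles).
latency, traffic = solve_fixed_point(
    n_cores=64, think_time=100.0, service_time=1.0, base_latency=10.0
)
```

The returned latency satisfies the model's own equation to within the tolerance, so latency and traffic are mutually consistent. An analytical method of this shape avoids cycle-by-cycle simulation, at the cost of trusting the chosen latency and traffic models.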
A. Rahmani, Kameswar Rao Vaddina, Khalid Latif, P. Liljeberg, J. Plosila, H. Tenhunen
Three-dimensional integrated circuits (3D ICs) achieve enhanced system integration and improved performance at lower cost and reduced area footprint. To exploit the intrinsic capability of 3D ICs to reduce wire length, the 3D NoC-Bus Hybrid mesh architecture was proposed, which provides performance, power consumption, and area benefits. Besides its various advantages, this architecture offers a unique and hitherto unexplored way to implement an efficient system-wide monitoring network. In this paper, an integrated low-cost monitoring platform for 3D stacked mesh architectures is proposed which can be efficiently used for various system management purposes such as traffic monitoring, thermal management, and fault tolerance. The proposed generic monitoring and management infrastructure, called ARB-NET, utilizes bus arbiters to exchange the monitoring information directly with each other without using the data network. As a test case, based on the proposed monitoring and management platform, a fully congestion-aware and inter-layer fault-tolerant routing algorithm named AdaptiveXYZ is presented, taking advantage of viable information generated by the bus arbiter network. In addition, we propose a thermal monitoring and management strategy on top of our ARB-NET infrastructure. Compared to recently proposed stacked mesh 3D NoCs, our extensive simulations with synthetic and real benchmarks reveal that our architecture using the AdaptiveXYZ routing can help achieve significant power and performance improvements while preserving system reliability with negligible hardware overhead.
{"title":"Generic Monitoring and Management Infrastructure for 3D NoC-Bus Hybrid Architectures","authors":"A. Rahmani, Kameswar Rao Vaddina, Khalid Latif, P. Liljeberg, J. Plosila, H. Tenhunen","doi":"10.1109/NOCS.2012.28","DOIUrl":"https://doi.org/10.1109/NOCS.2012.28","journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","pages":"177-184"},"publicationDate":"2012-05-09"}