The increased communication bandwidth demands of HPC-systems calling at the same time for reduced latency and increased power efficiency have designated optical interconnects as the key technology in order to achieve the target of exascale performance. In this realm, technology advances have to be accompanied by corresponding simulation tools that support end-to-end system modeling in order to evaluate the performance benefits offered by optical components at system-environment. We present here the OptoHPC-Sim, which supports the utilization of optical interconnect and electro-optical routing technologies at system-scale offering complete end-to-end simulation of HPC-systems and allowing for reliable comparison with existing HPC platforms. OptoHPC-sim has been developed using the Omnet++ platform and is designed to offer the optimum balance between the model detail and the simulation execution time. We describe the design of the simulation engine and demonstrate the capabilities of OptoHPC-sim by comparing an HPC system employing state-of-the-art optoelectronic routers and optical interconnects with the Cray XK7 system platform.
{"title":"Bringing OptoBoards to HPC-scale environments: An OptoHPC simulation engine","authors":"N. Terzenidis, P. Maniotis, N. Pleros","doi":"10.1145/2857058.2857062","DOIUrl":"https://doi.org/10.1145/2857058.2857062","url":null,"abstract":"The increased communication bandwidth demands of HPC-systems calling at the same time for reduced latency and increased power efficiency have designated optical interconnects as the key technology in order to achieve the target of exascale performance. In this realm, technology advances have to be accompanied by corresponding simulation tools that support end-to-end system modeling in order to evaluate the performance benefits offered by optical components at system-environment. We present here the OptoHPC-Sim, which supports the utilization of optical interconnect and electro-optical routing technologies at system-scale offering complete end-to-end simulation of HPC-systems and allowing for reliable comparison with existing HPC platforms. OptoHPC-sim has been developed using the Omnet++ platform and is designed to offer the optimum balance between the model detail and the simulation execution time. We describe the design of the simulation engine and demonstrate the capabilities of OptoHPC-sim by comparing an HPC system employing state-of-the-art optoelectronic routers and optical interconnects with the Cray XK7 system platform.","PeriodicalId":292715,"journal":{"name":"AISTECS '16","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115672874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes a proposal for FAT tree based Network-on-Chip system based on MPLS forwarding mechanism. The FAT tree includes processing nodes and communication switches. IP node (processing nodes) has a message generator unit which randomly generates messages to different destinations with different packet lengths and buffering. The switch is based on MPLS technique and consists of the following units: crossbar switch, input/output link controllers and routing and arbitration units. A simulator has been developed in C++ to analyze the proposed architecture. A comparison with wormhole switch is provided to show the efficiency of the MPLS designed switch.
{"title":"Designing an Efficient MPLS-Based Switch for FAT Tree Network-on-Chip Systems","authors":"Najwa Salama, A. M. Sllame","doi":"10.1145/2857058.2857059","DOIUrl":"https://doi.org/10.1145/2857058.2857059","url":null,"abstract":"This paper describes a proposal for FAT tree based Network-on-Chip system based on MPLS forwarding mechanism. The FAT tree includes processing nodes and communication switches. IP node (processing nodes) has a message generator unit which randomly generates messages to different destinations with different packet lengths and buffering. The switch is based on MPLS technique and consists of the following units: crossbar switch, input/output link controllers and routing and arbitration units. A simulator has been developed in C++ to analyze the proposed architecture. A comparison with wormhole switch is provided to show the efficiency of the MPLS designed switch.","PeriodicalId":292715,"journal":{"name":"AISTECS '16","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116829033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. K. V. Maeda, Peng Yang, Xiaowen Wu, Zhe Wang, Jiang Xu, Zhehui Wang, Haoran Li, Luan H. K. Duong, Zhifei Wang
Recent advances in the computing industry towards multiprocessor technologies shifted the dominant method of performance increase from frequency scaling to parallelism. Due to its huge design space, evaluating candidate multicore architectures in early design stages, when the number of variables is at its maximum, is challenging. Simulation plays an important role in estimating architecture performance, and evaluating how the system would perform on average, as well as boundary cases, would require many iterations to cover various cases in the application input domain. Since simulation of heterogeneous systems with enough details are naturally slow, exhaustively evaluating the system for all possible inputs require tremendous amount of time and resources. While there exist quite a few multiprocessor simulators available, they often rely on individual input specification, demanding extensive input enumeration and simulation runs, diminishing their effectiveness for complex systems evaluation. Aiming to fulfill this gap, we publicly release a heterogeneous multiprocessor system simulation platform called JADE, targeting fast initial architecture explorations. Opposing to most simulators, JADE uses statistical models that follow distributions extracted from internal structures of the application, providing a more convenient and systematic exploration approach to evaluate systems performance. JADE simulation features include detailed electrical and optical interconnections, detailed memory hierarchy infrastructure, and built-in energy analysis allowing studies of a broad spectrum of systems.
{"title":"JADE: a Heterogeneous Multiprocessor System Simulation Platform Using Recorded and Statistical Application Models","authors":"R. K. V. Maeda, Peng Yang, Xiaowen Wu, Zhe Wang, Jiang Xu, Zhehui Wang, Haoran Li, Luan H. K. Duong, Zhifei Wang","doi":"10.1145/2857058.2857066","DOIUrl":"https://doi.org/10.1145/2857058.2857066","url":null,"abstract":"Recent advances in the computing industry towards multiprocessor technologies shifted the dominant method of performance increase from frequency scaling to parallelism. Due to its huge design space, evaluating candidate multicore architectures in early design stages, when the number of variables is at its maximum, is challenging. Simulation plays an important role in estimating architecture performance, and evaluating how the system would perform on average, as well as boundary cases, would require many iterations to cover various cases in the application input domain. Since simulation of heterogeneous systems with enough details are naturally slow, exhaustively evaluating the system for all possible inputs require tremendous amount of time and resources. While there exist quite a few multiprocessor simulators available, they often rely on individual input specification, demanding extensive input enumeration and simulation runs, diminishing their effectiveness for complex systems evaluation. Aiming to fulfill this gap, we publicly release a heterogeneous multiprocessor system simulation platform called JADE, targeting fast initial architecture explorations. Opposing to most simulators, JADE uses statistical models that follow distributions extracted from internal structures of the application, providing a more convenient and systematic exploration approach to evaluate systems performance. JADE simulation features include detailed electrical and optical interconnections, detailed memory hierarchy infrastructure, and built-in energy analysis allowing studies of a broad spectrum of systems.","PeriodicalId":292715,"journal":{"name":"AISTECS '16","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129308986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Rumley, M. Bahadori, K. Wen, D. Nikolova, K. Bergman
Silicon Photonics is emerging as a key technology for high-performance computing interconnects. Yet few tools are available to investigate how to best leverage this technology in current or future computer architectures and, furthermore, how this technology will impact real application workloads. In this paper, we present a multi-layer simulation and modeling software solution -- PhoenixSim. PhoenixSim enables integrated and interactive design space exploration over the physical, networking and application layers. In this paper, we report its general organization and constituting models. We show how the different layers of the tool can be utilized to design and analyze an optical interconnect network for supporting the HPCG (High Performance Conjugate Gradient) benchmark.
{"title":"PhoenixSim: Crosslayer Design and Modeling of Silicon Photonic Interconnects","authors":"S. Rumley, M. Bahadori, K. Wen, D. Nikolova, K. Bergman","doi":"10.1145/2857058.2857061","DOIUrl":"https://doi.org/10.1145/2857058.2857061","url":null,"abstract":"Silicon Photonics is emerging as a key technology for high-performance computing interconnects. Yet few tools are available to investigate how to best leverage this technology in current or future computer architectures and, furthermore, how this technology will impact real application workloads. In this paper, we present a multi-layer simulation and modeling software solution -- PhoenixSim. PhoenixSim enables integrated and interactive design space exploration over the physical, networking and application layers. In this paper, we report its general organization and constituting models. We show how the different layers of the tool can be utilized to design and analyze an optical interconnect network for supporting the HPCG (High Performance Conjugate Gradient) benchmark.","PeriodicalId":292715,"journal":{"name":"AISTECS '16","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125400527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Muhammad Ridwan Madarbux, A. Laer, P. Watts, Timothy M. Jones
Optical network-on-chip (NoC) are being investigated to reduce the latency and power consumption of networks for multicore processors. Our previous work has shown that switched optical networks can achieve lower latency for a given power consumption and component count in shared memory processors compared with arbitration-free networks such as single writer multiple reader. We have also shown the advantage of leaving optical circuits open after being generated to capture multiple memory transactions. However invalidation processes, where numerous cores are sharing a memory block, need to establish a large number of very short lived circuits and this increases the average message latency and overall on-chip contention. In this paper, a low power broadcast architecture is proposed which deals specifically with multicast messages. Separating multicast messages from unicast ones shows an improvement in average arbitration latency of up to 88.2% for the Vips benchmark while the Swaptions benchmark shows the highest improvement in average memory access time (up to 21.1%). Vips also sees an increase of 147% in the average number of messages passing through an open optical circuit. Obtaining these advantages requires an additional broadcast network which consumes only 66.1mW power.
{"title":"Energy Efficient And Low Latency Interconnection Network For Multicast Invalidates In Shared Memory Systems","authors":"Muhammad Ridwan Madarbux, A. Laer, P. Watts, Timothy M. Jones","doi":"10.1145/2857058.2857065","DOIUrl":"https://doi.org/10.1145/2857058.2857065","url":null,"abstract":"Optical network-on-chip (NoC) are being investigated to reduce the latency and power consumption of networks for multicore processors. Our previous work has shown that switched optical networks can achieve lower latency for a given power consumption and component count in shared memory processors compared with arbitration-free networks such as single writer multiple reader. We have also shown the advantage of leaving optical circuits open after being generated to capture multiple memory transactions. However invalidation processes, where numerous cores are sharing a memory block, need to establish a large number of very short lived circuits and this increases the average message latency and overall on-chip contention.\u0000 In this paper, a low power broadcast architecture is proposed which deals specifically with multicast messages. Separating multicast messages from unicast ones shows an improvement in average arbitration latency of up to 88.2% for the Vips benchmark while the Swaptions benchmark shows the highest improvement in average memory access time (up to 21.1%). Vips also sees an increase of 147% in the average number of messages passing through an open optical circuit. Obtaining these advantages requires an additional broadcast network which consumes only 66.1mW power.","PeriodicalId":292715,"journal":{"name":"AISTECS '16","volume":"32 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129989434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gabriele Miorandi, Mahdi Tala, Marco Balboni, L. Ramini, D. Bertozzi
Networks-on-chip (NoCs) are today at the core of multi- and many-core systems, acting as the system-level integration framework. In order to support scaling to future device generations, NoCs will struggle to deliver the required communication performance within tight power budgets. In this respect, evolutionary as well as revolutionary interconnect technologies are currently being considered. On one hand, clockless handshaking materializes GALS systems that completely remove the system clock while reducing idle power to only the leakage power. On the other hand, the technology platform could be changed, by replacing electrical wires with optical links and networks. This paper provides a comprehensive power analysis of the two technologies under test on a path-by-path basis, by comparing them with each other and with a baseline synchronous NoC. The outcome of this paper can support the selection of interconnect solutions for future manycore systems where power is the primary concern, as well as the runtime selection policy of routing paths in the context of hybrid interconnect fabrics.
{"title":"Evolutionary vs. Revolutionary Interconnect Technologies for Future Low-Power Multi-Core Systems","authors":"Gabriele Miorandi, Mahdi Tala, Marco Balboni, L. Ramini, D. Bertozzi","doi":"10.1145/2857058.2857063","DOIUrl":"https://doi.org/10.1145/2857058.2857063","url":null,"abstract":"Networks-on-chip (NoCs) are today at the core of multi- and many-core systems, acting as the system-level integration framework. In order to support scaling to future device generations, NoCs will struggle to deliver the required communication performance within tight power budgets. In this respect, evolutionary as well as revolutionary interconnect technologies are currently being considered. On one hand, clockless handshaking materializes GALS systems that completely remove the system clock while reducing idle power to only the leakage power. On the other hand, the technology platform could be changed, by replacing electrical wires with optical links and networks. This paper provides a comprehensive power analysis of the two technologies under test on a path-by-path basis, by comparing them with each other and with a baseline synchronous NoC. The outcome of this paper can support the selection of interconnect solutions for future manycore systems where power is the primary concern, as well as the runtime selection policy of routing paths in the context of hybrid interconnect fabrics.","PeriodicalId":292715,"journal":{"name":"AISTECS '16","volume":"848 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114059044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bufferless deflection routing enables energy and hardware efficient Network-on-Chips (NoCs). However, due to the lack of buffers, packet switching can not be deployed for such NoCs. Therefore, it is crucial to determine an appropriate flit size and link width, which can be considerably larger compared to packet switched NoCs. In this work, we investigate the effect of the flit size on hardware costs and on performance for NoCs based on a permutation network and additionally on deflection routing. We show that hardware requirements for a permutation network based router increase linearly. The performance decreases exponentially with smaller link widths, however a moderate reduction of the link width can be an option.
{"title":"Consideration of the Flit Size for Deflection Routing based Network-on-Chips","authors":"Armin Runge, Reiner Kolla","doi":"10.1145/2857058.2857060","DOIUrl":"https://doi.org/10.1145/2857058.2857060","url":null,"abstract":"Bufferless deflection routing enables energy and hardware efficient Network-on-Chips (NoCs). However, due to the lack of buffers, packet switching can not be deployed for such NoCs. Therefore, it is crucial to determine an appropriate flit size and link width, which can be considerably larger compared to packet switched NoCs. In this work, we investigate the effect of the flit size on hardware costs and on performance for NoCs based on a permutation network and additionally on deflection routing. We show that hardware requirements for a permutation network based router increase linearly. The performance decreases exponentially with smaller link widths, however a moderate reduction of the link width can be an option.","PeriodicalId":292715,"journal":{"name":"AISTECS '16","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126308977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchy and communication locality are a must for many-core systems. As systems scale to dozens or hundreds of cores, we simply cannot afford the power consumption and latency of random communication that spans the entire chip. Existing hierarchical Networks-on-Chip (NoCs) support communication locality only for a fixed cluster of nodes; providing a fixed hierarchy is too restrictive in terms of parallelism and data placement. Therefore, we propose a new, more flexible class of hierarchical NoCs: Elastic Hierarchical NoCs. Elastic Hierarchical NoCs dynamically adjust the number and size of clusters during runtime according to the system's communication demands. The interconnect can adapt to changes in communication locality across different application phases, between applications and in the presence of server consolidation. Our design improves overall system performance by up to 46% and 13% on average over a conventional 2D mesh and by up to 16% and 6% on average over an existing hierarchical NoC implementation. Power consumption is reduced by 45% and 7% respectively on average.
{"title":"Hierarchical Clustering for On-Chip Networks","authors":"R. Hesse, Natalie D. Enright Jerger","doi":"10.1145/2857058.2857064","DOIUrl":"https://doi.org/10.1145/2857058.2857064","url":null,"abstract":"Hierarchy and communication locality are a must for many-core systems. As systems scale to dozens or hundreds of cores, we simply cannot afford the power consumption and latency of random communication that spans the entire chip. Existing hierarchical Networks-on-Chip (NoCs) support communication locality only for a fixed cluster of nodes; providing a fixed hierarchy is too restrictive in terms of parallelism and data placement. Therefore, we propose a new, more flexible class of hierarchical NoCs: Elastic Hierarchical NoCs. Elastic Hierarchical NoCs dynamically adjust the number and size of clusters during runtime according to the system's communication demands. The interconnect can adapt to changes in communication locality across different application phases, between applications and in the presence of server consolidation. Our design improves overall system performance by up to 46% and 13% on average over a conventional 2D mesh and by up to 16% and 6% on average over an existing hierarchical NoC implementation. Power consumption is reduced by 45% and 7% respectively on average.","PeriodicalId":292715,"journal":{"name":"AISTECS '16","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131383076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}