Marta Ortín-Obón, Alexandra Ferreron, Jorge Albericio, D. S. Gracia, M. Villarroya-Gaudó, C. Izu, V. Viñals
The importance of the interconnection network grows as the number of cores integrated on a chip increases. Communication among nodes becomes a bottleneck and impacts system performance and power consumption. This work targets general-purpose CMPs, where there is a rising concern about finding low-power alternatives. We explore the implications of the interconnect choice on overall performance by comparing the behaviour of three topologies: ring, mesh, and torus. We also evaluate two additional ring configurations (one with increased bandwidth and another with reduced-pipeline routers) and concentrated versions of the topologies. Running full-system simulations allows us to carefully model the processors, memory hierarchy, and interconnection network, and to execute realistic parallel and multiprogrammed workloads. We determine that the network diameter is critical for system performance and that a concentrated mesh offers the best area-energy-delay tradeoff for both 16- and 64-core chips. Traffic is very light and highly unbalanced, indicating the need for a heterogeneous network with more resources located in specific areas.
{"title":"Characterization and cost-efficient selection of NoC topologies for general purpose CMPs","authors":"Marta Ortín-Obón, Alexandra Ferreron, Jorge Albericio, D. S. Gracia, M. Villarroya-Gaudó, C. Izu, V. Viñals","doi":"10.1145/2482759.2482765","DOIUrl":"https://doi.org/10.1145/2482759.2482765","journal":{"name":"IMA-OCMC '13"},"publicationDate":"2013-01-23"}
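The role of network diameter mentioned in the abstract above can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes a bidirectional ring and square k x k mesh and torus layouts, and the `cores_per_router` parameter is a hypothetical way to model concentration (the abstract does not state the concentration degree).

```python
import math


def ring_diameter(n: int) -> int:
    """Diameter in hops of a bidirectional ring with n routers (assumed)."""
    return n // 2


def mesh_diameter(k: int) -> int:
    """Diameter of a k x k 2D mesh: corner to opposite corner."""
    return 2 * (k - 1)


def torus_diameter(k: int) -> int:
    """Diameter of a k x k 2D torus: wraparound links halve each dimension."""
    return 2 * (k // 2)


def diameters(cores: int, cores_per_router: int = 1) -> dict:
    """Diameters for a chip with `cores` cores.

    `cores_per_router` is a hypothetical concentration factor: several
    cores share one router, so the network has fewer routers.
    """
    if cores % cores_per_router != 0:
        raise ValueError("cores must be a multiple of cores_per_router")
    routers = cores // cores_per_router
    k = math.isqrt(routers)
    if k * k != routers:
        raise ValueError("this sketch only handles square layouts")
    return {
        "ring": ring_diameter(routers),
        "mesh": mesh_diameter(k),
        "torus": torus_diameter(k),
    }


if __name__ == "__main__":
    for cores in (16, 64):
        print(cores, diameters(cores))
    # Assumed concentration of 4 cores per router on a 64-core chip.
    print("64 cores, 4 per router:", diameters(64, cores_per_router=4))
```

Under these assumptions the ring diameter grows linearly with the router count while the mesh and torus diameters grow with its square root, and concentration shrinks all three by reducing the number of routers.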
In this paper, we propose a fast algorithm to reprogram the routing function of an on-chip network (NoC) at runtime. This reconfiguration algorithm comes with the following key novelties. First, it copes with the absence of routing tables, which scale poorly and are slow to reconfigure. Second, it can deal with any number of faults that might be progressively detected over time (i.e., full coverage of fault patterns). Third, it preserves ultra-fast reconfiguration times even for the most challenging scenarios.
{"title":"A fast algorithm for runtime reconfiguration to maximize the lifetime of nanoscale NoCs","authors":"F. Triviño, D. Bertozzi, J. Flich","doi":"10.1145/2482759.2482760","DOIUrl":"https://doi.org/10.1145/2482759.2482760","journal":{"name":"IMA-OCMC '13"},"publicationDate":"2013-01-23"}
Nikolaos Chrysos, F. Neeser, M. Gusat, R. Clauberg, C. Minkenberg, C. Basso, Kenneth M. Valk
Network devices supporting above-100G links are needed today in order to scale communication bandwidth along with the processing capabilities of computing nodes in data centers and warehouse computers. In this paper, we propose a light-weight, fair scheduler for such ultra-high-speed links and an arbitrarily large number of requestors. We show that, in practice, our first algorithm, as well as its predecessor, DRR, may result in bursty service even in the common case where flow weights are approximately equal, and we identify applications where this can damage performance. Our second contribution is an enhancement that improves short-term fairness to deliver very smooth service when flow weights are approximately equal, whilst allocating bandwidth in a weighted fair manner.
{"title":"Arbitration of many thousand flows at 100G and beyond","authors":"Nikolaos Chrysos, F. Neeser, M. Gusat, R. Clauberg, C. Minkenberg, C. Basso, Kenneth M. Valk","doi":"10.1145/2482759.2482761","DOIUrl":"https://doi.org/10.1145/2482759.2482761","journal":{"name":"IMA-OCMC '13"},"publicationDate":"2013-01-23"}
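The bursty service attributed to DRR in the abstract above can be seen in a minimal sketch of textbook deficit round robin (taking DRR to mean that scheduler). This is not the algorithm proposed in the paper; the function name, the packet sizes, and the quanta are illustrative assumptions.

```python
from collections import deque


def drr_schedule(queues, quanta, max_rounds=1000):
    """Textbook deficit round robin over a fixed set of backlogged flows.

    queues: dict flow_id -> list of packet sizes (head of queue first)
    quanta: dict flow_id -> credit added per round (proportional to weight)
    Returns the service order as a list of (flow_id, packet_size).
    """
    deficits = {flow: 0 for flow in queues}
    pending = {flow: deque(pkts) for flow, pkts in queues.items()}
    order = []
    for _ in range(max_rounds):
        if not any(pending.values()):
            break
        for flow, q in pending.items():
            if not q:
                continue
            deficits[flow] += quanta[flow]
            # Serve packets while the head packet fits in the credit.
            while q and q[0] <= deficits[flow]:
                size = q.popleft()
                deficits[flow] -= size
                order.append((flow, size))
            if not q:
                deficits[flow] = 0  # an emptied flow keeps no credit
    return order


if __name__ == "__main__":
    # Two equally weighted flows; the quantum covers two packets.
    queues = {"A": [100] * 4, "B": [100] * 4}
    quanta = {"A": 200, "B": 200}
    print([flow for flow, _ in drr_schedule(queues, quanta)])
```

With these assumed values each flow is served two packets back to back per round, even though the weights are equal; a smoother scheduler would interleave the flows packet by packet.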
Magdalena García, E. Vallejo, R. Beivide, M. Odriozola, C. Camarero, M. Valero, J. Labarta, G. Rodríguez
Dragonfly networks are composed of interconnected groups of routers. Adaptive routing allows packets to be forwarded minimally or non-minimally, adapting to the traffic conditions in the network. While minimal routing sends traffic directly between groups, non-minimal routing employs an intermediate group to balance network load. A random selection of this intermediate group (denoted as RRG) typically implies an extra local hop in the source group, which increases average path length and can reduce performance. In this paper we identify different policies for the selection of such an intermediate group and explore their performance. Interestingly, simulation results show that an eager policy (denoted as CRG) that selects the intermediate group only among those directly connected to the current router causes starvation in some network nodes. In contrast, the best performance is obtained by a "mixed mode" policy (denoted as MM) that adds a local hop when the packet has moved away from the source router.
{"title":"Global misrouting policies in two-level hierarchical networks","authors":"Magdalena García, E. Vallejo, R. Beivide, M. Odriozola, C. Camarero, M. Valero, J. Labarta, G. Rodríguez","doi":"10.1145/2482759.2482763","DOIUrl":"https://doi.org/10.1145/2482759.2482763","journal":{"name":"IMA-OCMC '13"},"publicationDate":"2013-01-23"}
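The difference between the RRG and CRG policies described in the abstract above comes down to which groups are eligible as the intermediate group. The sketch below only illustrates that candidate set; it is not the authors' implementation. The function names, the integer group ids, and the exclusion of the source and destination groups are assumptions, and the MM policy is not modelled.

```python
import random


def rrg_candidates(num_groups, src_group, dst_group):
    """RRG-style choice: any group other than source and destination (assumed)."""
    return [g for g in range(num_groups) if g not in (src_group, dst_group)]


def crg_candidates(router_global_links, src_group, dst_group):
    """CRG-style choice: only groups reached by the current router's own
    global links. `router_global_links` is a hypothetical list of group ids."""
    return [g for g in router_global_links if g not in (src_group, dst_group)]


def pick_intermediate(candidates, rng):
    """Pick one candidate uniformly at random, or None if there is none."""
    return rng.choice(candidates) if candidates else None


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed example: 9 groups, current router has global links to groups 4 and 7.
    print("RRG candidates:", rrg_candidates(9, src_group=0, dst_group=4))
    print("CRG candidates:", crg_candidates([4, 7], src_group=0, dst_group=4))
    print("CRG pick:", pick_intermediate(crg_candidates([4, 7], 0, 4), rng))
```

In this toy setup the CRG candidate set is much smaller than the RRG one, because it is limited to the global links of a single router; the abstract reports that this eager restriction leads to starvation at some nodes.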
Key to the economic viability of clouds and datacenters is their elastic scalability. Therefore, most active related research areas focus on datacenter fabric scalability, efficiency, performance, virtualization, and optimal virtual machine (VM) allocation and migration. Here we ask the questions: Given a set of tenant workloads running on generic servers interconnected by a 10-100G Ethernet fabric with modern network virtualization and transport protocols, how can the datacenter operator reach the optimal operation region? How is this optimum defined, how is it traded off between operator and tenants, and with what metrics is it measured? In this paper we propose an evaluation methodology and a set of simple, but descriptive, metrics as a first attempt to answer the questions raised above. As proof of concept, we investigate a multitenant virtualized datacenter network running a 3-tier workload. Our proposal enables a quantitative comparison between competing datacenter fabrics and virtualization architectures.
{"title":"How elastic is your virtualized datacenter fabric?","authors":"D. Crisan, R. Birke, Nikolaos Chrysos, M. Gusat","doi":"10.1145/2482759.2482764","DOIUrl":"https://doi.org/10.1145/2482759.2482764","journal":{"name":"IMA-OCMC '13"},"publicationDate":"2013-01-23"}
Ana Jokanovic, B. Prisacari, G. Rodríguez, C. Minkenberg
Dragonflies are one of the most promising topologies for the Exascale effort because of their scalability and cost. Dragonflies achieve very high throughput under uniform traffic, but exhibit pathological behavior under other regular traffic patterns, some of them very common in HPC applications. A recent study showed that randomization of task placement can make pathological, regular (multi-dimensional stencil) traffic patterns behave similarly to uniform traffic. In this work we provide a theoretical model that is able to predict the expected performance of a generic dragonfly network under uniform traffic and characterize performance-optimal dragonflies. We then analyze whether this model can be extended to other patterns by benchmarking the performance of multiple such patterns under both contiguous and randomized task placement. We conclude that, although randomization does lead to a significant improvement in performance for pathological communication patterns in comparison with contiguous task placement, this performance is not on par with that of uniform traffic, but rather half of it.
{"title":"Randomizing task placement does not randomize traffic (enough)","authors":"Ana Jokanovic, B. Prisacari, G. Rodríguez, C. Minkenberg","doi":"10.1145/2482759.2482762","DOIUrl":"https://doi.org/10.1145/2482759.2482762","journal":{"name":"IMA-OCMC '13"},"publicationDate":"2013-01-23"}
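The contrast between contiguous and randomized task placement in the abstract above can be illustrated with a small sketch. It is not the authors' model and says nothing about throughput: it only counts what fraction of stencil neighbor pairs cross a group boundary. The 4-point non-periodic 2D stencil, the group size, and all function names are assumptions.

```python
import random


def stencil_pairs(rows, cols):
    """Neighbor pairs of a non-periodic 4-point 2D stencil, each pair once."""
    pairs = []
    for r in range(rows):
        for c in range(cols):
            task = r * cols + c
            if c + 1 < cols:
                pairs.append((task, task + 1))      # right neighbor
            if r + 1 < rows:
                pairs.append((task, task + cols))   # neighbor below
    return pairs


def contiguous_placement(num_tasks):
    """Task i runs on node i."""
    return list(range(num_tasks))


def randomized_placement(num_tasks, seed):
    """Tasks are assigned to nodes through a random permutation."""
    nodes = list(range(num_tasks))
    random.Random(seed).shuffle(nodes)
    return nodes


def inter_group_fraction(pairs, placement, nodes_per_group):
    """Fraction of communicating pairs whose nodes lie in different groups."""
    crossing = sum(
        1 for a, b in pairs
        if placement[a] // nodes_per_group != placement[b] // nodes_per_group
    )
    return crossing / len(pairs)


if __name__ == "__main__":
    pairs = stencil_pairs(4, 4)
    for name, placement in (
        ("contiguous", contiguous_placement(16)),
        ("randomized", randomized_placement(16, seed=1)),
    ):
        print(name, inter_group_fraction(pairs, placement, nodes_per_group=4))
```

With contiguous placement in this toy case, each stencil row sits in one group, so all vertical exchanges hit the same few group-to-group links; a random permutation spreads the pairs across groups instead.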
Networks-on-Chip (NoCs) play a central role in determining performance and reliability in current and future multi-core architectures. Continuous scaling of CMOS technology enables widespread adoption of multi-core architectures but, unfortunately, poses severe concerns regarding failures. Process variation (PV) worsens the scenario, decreasing device lifetime and performance predictability during chip fabrication. This paper proposes two solutions that exploit power gating to cope with NBTI effects in NoC buffers. The techniques are evaluated with respect to a variable number of virtual channels (VCs), in the presence of process variation. Moreover, the power-gating delay overhead is accounted for. Experiments reveal a net NBTI Vth saving of up to 54.2% against the baseline NoC, with an area overhead below 5%.
{"title":"NBTI-aware design of NoC buffers","authors":"Davide Zoni, W. Fornaciari","doi":"10.1145/2482759.2482766","DOIUrl":"https://doi.org/10.1145/2482759.2482766","journal":{"name":"IMA-OCMC '13"},"publicationDate":"2013-01-23"}