
Latest publications from IEEE Transactions on Multi-Scale Computing Systems

Bi-Objective Cost Function for Adaptive Routing in Network-on-Chip
Pub Date : 2018-02-27 DOI: 10.1109/TMSCS.2018.2810223
Asma Benmessaoud Gabis;Pierre Bomel;Marc Sevaux
This paper proposes a new fully adaptive routing protocol for 2D-mesh Networks-on-Chip (NoCs). Inspired by the A* search algorithm, it is called the Heuristic-based Routing Algorithm (HRA). It is distributed, congestion-aware, and fault-tolerant, using only local information from each router's neighbors. HRA does not use Virtual Channels (VCs) but reduces the risk of deadlock by avoiding 2-node and 4-node loops. HRA is based on a bi-objective weighted-sum cost function whose goal is to optimize latency and throughput. Experiments show that HRA maintains a good reliability rate despite the presence of many faulty links. In addition, our approach delivers good latency and average throughput values when a non-dominated solution is chosen.
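To make the weighted-sum idea concrete, the following is a minimal sketch of how a router could combine a latency proxy and a congestion proxy when choosing a next hop from purely local information. The weight alpha, the two proxies (remaining Manhattan distance and neighbor buffer occupancy), and all names are illustrative assumptions, not the cost function actually defined in the paper.

```python
# Minimal sketch of a bi-objective weighted-sum routing cost for a 2D mesh,
# selecting the next hop A*-style from local neighbor information only.
# The weight, the latency/congestion proxies, and the data layout are
# illustrative assumptions, not the exact cost function from the paper.

def route_cost(hops_to_dest, occupancy, alpha=0.5):
    # Weighted sum: remaining Manhattan distance as a latency proxy,
    # neighbor buffer occupancy in [0, 1] as an (inverse) throughput proxy.
    return alpha * hops_to_dest + (1.0 - alpha) * occupancy

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def pick_next_hop(current, dest, neighbors, occupancy, faulty=frozenset()):
    # neighbors: iterable of (x, y) router coordinates adjacent to `current`
    # occupancy: dict mapping neighbor -> buffer occupancy in [0, 1]
    candidates = [n for n in neighbors if n not in faulty]
    if not candidates:
        return None
    return min(candidates,
               key=lambda n: route_cost(manhattan(n, dest), occupancy[n]))

# Example: router (1, 1) routing toward (3, 2), with a congested east link.
print(pick_next_hop((1, 1), (3, 2),
                    neighbors=[(2, 1), (1, 2), (0, 1), (1, 0)],
                    occupancy={(2, 1): 0.9, (1, 2): 0.1, (0, 1): 0.0, (1, 0): 0.2}))
# -> (1, 2): the lightly loaded hop wins over the congested one at equal distance
```

In this toy example the congested east neighbor is skipped in favor of the lightly loaded north neighbor, even though both are equally far from the destination.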
Citations: 4
Design Methodology for Responsive and Robust MIMO Control of Heterogeneous Multicores
Pub Date : 2018-02-26 DOI: 10.1109/TMSCS.2018.2808524
Tiago Mück;Bryan Donyanavard;Kasra Moazzemi;Amir M. Rahmani;Axel Jantsch;Nikil Dutt
Heterogeneous multicore processors (HMPs) are commonly deployed to meet the performance and power requirements of emerging workloads. HMPs demand adaptive and coordinated resource management techniques to control such complex systems. While Multiple-Input-Multiple-Output (MIMO) control theory has been applied to adaptively coordinate resources for single-core processors, the coordinated management of HMPs poses significant additional challenges for achieving robustness and responsiveness, due to the unmanageable complexity of modeling the system dynamics. This paper presents, for the first time, a methodology to design robust MIMO controllers with rapid response and formal guarantees for coordinated management of HMPs. Our approach addresses the challenges of: (1) system decomposition and identification; (2) selection of suitable sensor and actuator granularity; and (3) appropriate system modeling to make the system identifiable as well as controllable. We demonstrate the practical applicability of our approach on an ARM big.LITTLE HMP platform running Linux, and show the efficiency and robustness of our method by designing MIMO-based resource managers.
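As a concrete illustration of what a MIMO controller for such a platform computes at each control epoch, here is a minimal discrete-time state-feedback sketch: two measurements (e.g., normalized throughput and power) are driven to two targets by actuating two knobs at once. The model matrices, gain, and setpoints are invented for illustration and are not an identified model of a big.LITTLE system or the controllers designed in the paper.

```python
# Minimal sketch of a discrete-time MIMO state-feedback loop of the kind used
# for coordinated resource management (e.g., actuating core counts/frequencies
# from performance and power measurements). All matrices and targets are
# made up for illustration.
import numpy as np

A = np.array([[0.8, 0.1],       # identified dynamics: x[k+1] = A x[k] + B u[k]
              [0.0, 0.7]])
B = np.array([[0.5, 0.0],
              [0.1, 0.4]])
K = np.array([[0.6, 0.0],       # stabilizing state-feedback gain (e.g., via LQR)
              [0.0, 0.5]])

x_ref = np.array([1.0, 0.8])    # targets, e.g., [normalized throughput, power cap]
u_ff = np.linalg.solve(B, (np.eye(2) - A) @ x_ref)   # steady-state feedforward input

x = np.zeros(2)
for _ in range(30):
    u = u_ff + K @ (x_ref - x)  # MIMO law: both knobs actuated from both errors
    x = A @ x + B @ u           # plant response (simulated here)

print(np.round(x, 3))           # converges to x_ref for this stable closed loop
```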
Citations: 9
Incremental Maintenance of Maximal Bicliques in a Dynamic Bipartite Graph
Pub Date : 2018-02-06 DOI: 10.1109/TMSCS.2018.2802920
Apurba Das;Srikanta Tirthapura
We consider incremental maintenance of maximal bicliques in a dynamic bipartite graph that changes over time due to the addition of edges. When new edges are added to the graph, we seek to enumerate the change in the set of maximal bicliques, without enumerating the maximal bicliques that remain unaffected. The challenge for an efficient algorithm is to enumerate the change without explicitly enumerating the set of all maximal bicliques. In this work, we present (1) near-tight bounds on the magnitude of change in the set of maximal bicliques of a graph due to a change in the edge set, and (2) an incremental algorithm for enumerating the change in the set of maximal bicliques. For the case when a constant number of edges are added to the graph, our algorithm is "change-sensitive", i.e., its time complexity is proportional to the magnitude of change in the set of maximal bicliques. To our knowledge, this is the first incremental algorithm for enumerating maximal bicliques in a dynamic graph with a provable performance guarantee. Our algorithm is easy to implement, and experimental results show that its performance exceeds that of baseline implementations by orders of magnitude.
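A small sketch of the incremental flavor of the problem, under simplifying assumptions: when a single edge (u, v) is inserted, one maximal biclique that contains the new edge can be obtained by closing the neighborhood of u. This only illustrates how new bicliques are seeded by an update; it is not the paper's change-sensitive enumeration algorithm or its bounds.

```python
# Minimal sketch: after inserting edge (u, v) into a bipartite graph, compute
# one maximal biclique guaranteed to contain the new edge by closing the
# endpoint's neighborhood. Not the paper's full change enumeration.

def maximal_biclique_containing(adj_left, adj_right, u, v):
    # adj_left[u]  = set of right-vertices adjacent to left-vertex u
    # adj_right[v] = set of left-vertices adjacent to right-vertex v
    Y = set(adj_left[u])                              # all right-neighbors of u (contains v)
    X = set.intersection(*(adj_right[y] for y in Y))  # left side: common neighbors of Y
    return X, Y                                       # (X, Y) is maximal and contains (u, v)

# Example bipartite graph, then insert the edge (1, 'b').
adj_left = {1: {'a'}, 2: {'a', 'b'}, 3: {'b'}}
adj_right = {'a': {1, 2}, 'b': {2, 3}}
adj_left[1].add('b'); adj_right['b'].add(1)           # the update
print(maximal_biclique_containing(adj_left, adj_right, 1, 'b'))
# -> ({1, 2}, {'a', 'b'}), a new maximal biclique created by the inserted edge
```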
Citations: 12
Docker Container Scheduler for I/O Intensive Applications Running on NVMe SSDs
Pub Date : 2018-02-02 DOI: 10.1109/TMSCS.2018.2801281
Janki Bhimani;Zhengyu Yang;Ningfang Mi;Jingpei Yang;Qiumin Xu;Manu Awasthi;Rajinikanth Pandurangan;Vijay Balakrishnan
By using fast back-end storage, the performance benefits of a lightweight container platform can be leveraged with quick I/O response. Nevertheless, the performance of simultaneously executing multiple instances of the same or different applications may vary significantly with the number of containers. The performance may also vary with the nature of the applications, because different applications can exhibit different behavior on SSDs in terms of I/O type (read/write), I/O access pattern (random/sequential), I/O size, etc. Therefore, this paper aims to investigate and analyze the performance characteristics of both homogeneous and heterogeneous mixtures of I/O intensive containerized applications operating with high-performance NVMe SSDs, and to derive novel design guidelines for achieving optimal and fair operation of both homogeneous and heterogeneous mixtures. By leveraging these design guidelines, we further develop a new Docker controller for scheduling workload containers of different types of applications. Our controller decides the optimal batches of simultaneously operating containers in order to minimize total execution time and maximize resource utilization. Meanwhile, our controller also strives to balance the throughput among all simultaneously running applications. We develop this new Docker controller by solving an optimization problem using five different optimization solvers. We conduct our experiments on a platform of multiple Docker containers operating on an array of three enterprise NVMe drives. We further evaluate our controller using different applications with diverse I/O behaviors and compare it with the simultaneous operation of containers without the controller. Our evaluation results show that our new Docker workload controller helps speed up the overall execution of multiple applications on SSDs.
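The batching idea can be illustrated with a toy greedy scheduler that groups containers so that the estimated aggregate I/O demand of each concurrent batch stays under a device budget. The container names, bandwidth estimates, and budget below are made-up assumptions; the paper's controller instead solves a formal optimization problem with five different solvers.

```python
# Toy sketch of batching containers so that the aggregate I/O demand of each
# concurrently running batch stays under a device budget. Illustrative only.

def schedule_batches(containers, bw_budget_mbps):
    # containers: list of (name, estimated_bandwidth_mbps); pack heaviest first
    batches, current, used = [], [], 0.0
    for name, bw in sorted(containers, key=lambda c: -c[1]):
        if current and used + bw > bw_budget_mbps:
            batches.append(current)          # close the batch once the budget is hit
            current, used = [], 0.0
        current.append(name)
        used += bw
    if current:
        batches.append(current)
    return batches   # each inner list runs concurrently; batches run back to back

workload = [("mysql", 900), ("rocksdb", 700), ("wordpress", 150),
            ("ftp", 300), ("grep", 120)]
print(schedule_batches(workload, bw_budget_mbps=1200))
# -> [['mysql'], ['rocksdb', 'ftp', 'wordpress'], ['grep']]
```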
Citations: 34
Application-Arrival Rate Aware Distributed Run-Time Resource Management for Many-Core Computing Platforms
Pub Date : 2018-02-02 DOI: 10.1109/TMSCS.2018.2793189
Vasileios Tsoutsouras;Sotirios Xydis;Dimitrios Soudris
Modern many-core computing platforms execute a diverse set of dynamic workloads in the presence of varying application arrival rates. This imposes strict requirements on run-time management to efficiently allocate system resources. On the way towards kilo-core processor architectures, centralized resource management approaches will most probably form a severe performance bottleneck, so focus has turned to the study of Distributed Run-Time Resource Management (DRTRM) schemes. In this article, we examine the behavior of a DRTRM for dynamic applications with malleable characteristics under stressful incoming application arrival-rate scenarios, using the Intel SCC as the target many-core system. We show that resource allocation is highly affected by the application input rate and propose an application-arrival aware DRTRM framework implementing an effective admission control strategy by carefully utilizing voltage and frequency scaling on parts of its resource allocation infrastructure. Through extensive experimental evaluation, we quantitatively analyze the behavior of the introduced DRTRM scheme and show that it achieves up to 44 percent performance gains while consuming 31 percent less energy, in comparison to a state-of-the-art DRTRM solution. In comparison to a centralized RTRM, the gains rise to up to 62 percent in performance and 45 percent in energy, respectively.
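For intuition, here is a tiny sketch of arrival-rate-aware admission control: the incoming application rate is estimated over a sliding window and new applications are deferred once the rate exceeds a threshold. The window length, threshold, and admit/defer actions are illustrative assumptions, not the policy or DVFS mechanism used in the paper.

```python
# Tiny sketch of arrival-rate-aware admission control: estimate the incoming
# application rate over a sliding window and defer admissions when the rate
# exceeds what the distributed managers can absorb. Illustrative only.
from collections import deque

class AdmissionController:
    def __init__(self, window_s=10.0, max_rate_per_s=0.5):
        self.arrivals = deque()
        self.window_s = window_s
        self.max_rate = max_rate_per_s

    def admit(self, now_s):
        # Drop arrivals that fell out of the sliding window.
        while self.arrivals and now_s - self.arrivals[0] > self.window_s:
            self.arrivals.popleft()
        rate = len(self.arrivals) / self.window_s
        if rate >= self.max_rate:
            return False            # defer: queue the application for later
        self.arrivals.append(now_s)
        return True                 # admit and start distributed core allocation

ac = AdmissionController()
print([ac.admit(t) for t in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 12.5)])
# -> [True, True, True, True, True, False, False, True]; admissions resume
#    once old arrivals age out of the window
```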
Citations: 2
Multilevel Parallelism for the Exploration of Large-Scale Graphs
Pub Date : 2018-01-23 DOI: 10.1109/TMSCS.2018.2797195
Massimo Bernaschi;Mauro Bisson;Enrico Mastrostefano;Flavio Vella
We present the most recent release of our parallel implementation of the BFS and BC algorithms for the study of large-scale graphs. Although our reference platform is a high-end cluster of new-generation Nvidia GPUs and some of our optimizations are CUDA specific, most of our ideas can be applied to other platforms offering multiple levels of parallelism. We exploit multi-level parallel processing through a hybrid programming paradigm that combines highly tuned CUDA kernels, for the computations performed by each node, and explicit data exchange through the Message Passing Interface (MPI), for the communications among nodes. The results of the numerical experiments show that the performance of our code is comparable to or better than other state-of-the-art solutions. For the BFS, for instance, we reach a peak performance of 200 GTEPS on a single GPU and 5.5 TTEPS on 1024 Pascal GPUs. We release our source code both to allow reproduction of the results and to facilitate its use as a building block for the implementation of other algorithms.
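The structure being parallelized is the classic level-synchronous BFS sweep, sketched below in serial form. In the hybrid scheme described above, each sweep over the frontier would be a tuned CUDA kernel on the locally owned part of the graph, and the new frontier would be exchanged between nodes via MPI; the sketch only shows the per-level structure and is not the released code.

```python
# Level-synchronous BFS: the frontier of the current level is expanded into
# the next frontier in one sweep. Serial illustration of the structure that
# the hybrid CUDA+MPI implementation parallelizes.

def bfs_levels(adj, source):
    # adj: dict mapping vertex -> list of neighbors
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:                 # parallelized over the frontier on a GPU
            for v in adj[u]:
                if v not in level:         # visited test (a bitmap in tuned kernels)
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier           # exchanged among nodes in the MPI version
    return level

graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_levels(graph, source=0))         # -> {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```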
Citations: 9
Scalable and Performant Graph Processing on GPUs Using Approximate Computing
Pub Date : 2018-01-22 DOI: 10.1109/TMSCS.2018.2795543
Somesh Singh;Rupesh Nasre
Graph algorithms are widely used in several application domains. It has been established that parallelizing graph algorithms is challenging. The parallelization issues get exacerbated when graphics processing units (GPUs) are used to execute graph algorithms. While prior art has shown effective parallelization of several graph algorithms on GPUs, some algorithms remain expensive. In this work, we address the scalability issues in graph parallelization. In particular, we aim to improve the execution time by tolerating a small amount of approximation in the computation. We study the effects of four heuristic approximations on six graph algorithms with five graphs and show that if an application can tolerate a small inaccuracy, this can be leveraged to achieve considerable performance benefits. We also study the effects of the approximations on GPU-based processing and provide interesting takeaways.
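To illustrate the general accuracy-for-speed trade-off (this is a generic heuristic, not necessarily one of the four studied in the paper), the sketch below runs a push-style PageRank in which each vertex forwards rank along only a sampled fraction of its edges per iteration; setting keep=1.0 recovers the exact computation, while smaller values reduce work at the cost of some error. All names and numbers are assumptions for illustration.

```python
# One generic flavor of approximation: process only a sampled fraction of each
# vertex's edges per iteration, trading result accuracy for less work.
import random

def approx_pagerank(adj, iters=30, d=0.85, keep=0.5, seed=0):
    random.seed(seed)
    n = len(adj)
    pr = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1.0 - d) / n for v in adj}
        for u, nbrs in adj.items():
            sample = [v for v in nbrs if random.random() < keep] or nbrs
            share = d * pr[u] / len(sample)      # push rank along sampled edges only
            for v in sample:
                nxt[v] += share
        pr = nxt
    return pr

graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
exact = approx_pagerank(graph, keep=1.0)          # keep=1.0 -> no approximation
approx = approx_pagerank(graph, keep=0.5)         # half the edge work per sweep
print({v: round(exact[v] - approx[v], 3) for v in graph})  # per-vertex deviation
```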
Citations: 9
Speedup and Power Scaling Models for Heterogeneous Many-Core Systems
Pub Date : 2018-01-12 DOI: 10.1109/TMSCS.2018.2791531
Ashur Rafiev;Mohammed A. N. Al-Hayanni;Fei Xia;Rishad Shafik;Alexander Romanovsky;Alex Yakovlev
Traditional speedup models, such as Amdahl's law, Gustafson's, and Sun and Ni's, have helped the research community and industry better understand system performance capabilities and application parallelizability. As they mostly target homogeneous hardware platforms or limited forms of processor heterogeneity, these models do not cover newly emerging multi-core heterogeneous architectures. This paper reports on novel speedup and energy consumption models based on a more general representation of heterogeneity, referred to as the normal form heterogeneity, that supports a wide range of heterogeneous many-core architectures. The modelling method aims to predict system power efficiency and performance ranges, and facilitates research and development at the hardware and system software levels. The models were validated through extensive experimentation on an off-the-shelf big.LITTLE heterogeneous platform and a dual-GPU laptop, with an average error of 1 percent for speedup and less than 6.5 percent for power dissipation. A quantitative efficiency analysis targeting the system load balancer on the Odroid XU3 platform was used to demonstrate the practical use of the method.
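For orientation, the snippet below evaluates the textbook heterogeneity-aware extension of Amdahl's law, in which the serial fraction runs on one big core and the parallel fraction is spread over the aggregate throughput of all cores, normalized to a single little core. It is a common special case shown for illustration, not the paper's normal-form heterogeneity model.

```python
# Amdahl-style speedup for an asymmetric multicore, under the usual simplifying
# assumptions: serial fraction on one big core, parallel fraction over the
# aggregate throughput, measured relative to a single little core.

def hetero_speedup(serial_frac, n_big, n_little, big_perf, little_perf=1.0):
    parallel_frac = 1.0 - serial_frac
    aggregate = n_big * big_perf + n_little * little_perf
    return 1.0 / (serial_frac / big_perf + parallel_frac / aggregate)

# 4 big cores (2x faster) + 4 little cores, 10% serial code:
print(round(hetero_speedup(0.10, n_big=4, n_little=4, big_perf=2.0), 2))
# -> 8.0 relative to one little core
```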
Citations: 13
CHOAMP: Cost Based Hardware Optimization for Asymmetric Multicore Processors
Pub Date : 2018-01-11 DOI: 10.1109/TMSCS.2018.2791955
Jyothi Krishna Viswakaran Sreelatha;Shankar Balachandran;Rupesh Nasre
Heterogeneous Multiprocessors (HMPs) are popular due to their energy efficiency over Symmetric Multicore Processors (SMPs). Asymmetric Multicore Processors (AMPs) are a special case of HMPs where different kinds of cores share the same instruction set but offer different power-performance trade-offs. Due to the computational-power difference between these cores, finding an optimal hardware configuration for executing a given parallel program is quite challenging. An inherent difficulty in this problem stems from the fact that the original program is written for SMPs. This challenge is exacerbated by the interplay of several configuration parameters that are allowed to be changed in AMPs. In this work, we propose a probabilistic method named CHOAMP to choose the best available hardware configuration for a given parallel program. Selection of a configuration is guided by a user-provided run-time property such as energy-delay product (EDP), and CHOAMP aims to optimize that property when choosing a configuration. The core of our probabilistic method relies on identifying the behavior of various program constructs on the different classes of CPU cores in the AMP, and how they influence the chosen cost function. We implement the proposed technique in a compiler which automatically transforms code optimized for an SMP to run efficiently on an AMP, without requiring any user annotations. CHOAMP transforms the same source program for different hardware configurations based on different user requirements. We evaluate the efficiency of our method for three different run-time properties: execution time, energy consumption, and EDP, on the NAS Parallel Benchmarks for OpenMP. Our experimental evaluation shows that CHOAMP achieves an average of 65, 28, and 57 percent improvement over baseline HMP scheduling while optimizing for energy, execution time, and EDP, respectively.
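The selection step can be pictured as an argmin over candidate configurations of a predicted cost assembled from per-construct estimates, as in the toy sketch below. The construct names, configurations, and per-construct costs are invented placeholders, not CHOAMP's trained cost model or its probabilistic formulation.

```python
# Toy version of cost-based configuration selection: predict the chosen
# run-time property (here EDP) for each candidate hardware configuration from
# per-construct cost estimates and pick the argmin. Illustrative numbers only.

def pick_config(construct_counts, cost_model, configs):
    def predicted_edp(cfg):
        return sum(construct_counts[c] * cost_model[cfg][c] for c in construct_counts)
    return min(configs, key=predicted_edp)

# Profile of the program: how often each parallel construct is executed.
construct_counts = {"parallel_for": 120, "barrier": 40, "critical": 10}

# Per-construct EDP cost (arbitrary units) measured offline on each configuration.
cost_model = {
    "4big":         {"parallel_for": 1.0, "barrier": 0.8, "critical": 0.5},
    "4little":      {"parallel_for": 2.2, "barrier": 0.4, "critical": 0.3},
    "2big+2little": {"parallel_for": 1.4, "barrier": 0.5, "critical": 0.4},
}

print(pick_config(construct_counts, cost_model, list(cost_model)))
# -> '4big' for this particular profile and cost model
```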
Citations: 14
Execution Trace Graph of Dataflow Process Networks
Pub Date : 2018-01-08 DOI: 10.1109/TMSCS.2018.2790921
Simone Casale-Brunet;Marco Mattavelli
The paper introduces and specifies a formalism that provides complete representations of dataflow process network (DPN) program executions by means of directed acyclic graphs. Such graphs, also known as execution trace graphs (ETGs), are composed of nodes representing each action firing and directed arcs representing the dataflow program execution constraints between two action firings. Action firings are atomic operations that encompass the algorithmic part of the action executions applied to both the input data and the actor state variables. The paper describes how an ETG can be effectively derived from a dataflow program, specifies the types of dependencies that need to be included, and the processing that needs to be applied so that an ETG becomes capable of representing all the admissible trajectories that dynamic dataflow programs can execute. The paper also describes how some characteristics of the ETG, related to specific implementations of the dataflow program, can be evaluated by means of high-level, architecture-independent executions of the program. Furthermore, some examples are provided showing how the analysis of ETGs can support efficient exploration, reduction, and optimization of the design space, providing results in terms of design alternatives, without requiring any partial implementation or reduction of the expressiveness of the original DPN dataflow program.
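To make the formalism concrete, here is a minimal sketch of recording an ETG while firings occur: each firing becomes a node, and arcs capture two simplified kinds of constraints, token producer-to-consumer dependencies and successive firings that touch the same actor's state. The dependency kinds and data layout are a simplification chosen for illustration; the paper defines a richer set of dependencies.

```python
# Minimal sketch of building an execution trace graph: every action firing is
# a node, and directed arcs record ordering constraints between firings
# (here only token and actor-state dependencies). Illustrative simplification.

class TraceGraph:
    def __init__(self):
        self.firings = []               # node id -> (actor, action) label
        self.arcs = []                  # (src_id, dst_id, kind)
        self.last_state_firing = {}     # actor -> id of last firing touching its state
        self.pending_tokens = {}        # channel -> producer firing ids not yet consumed

    def fire(self, actor, action, consumes=(), produces=(), touches_state=False):
        fid = len(self.firings)
        self.firings.append((actor, action))
        for ch in consumes:                              # token dependencies
            if self.pending_tokens.get(ch):
                self.arcs.append((self.pending_tokens[ch].pop(0), fid, "token"))
        for ch in produces:
            self.pending_tokens.setdefault(ch, []).append(fid)
        if touches_state:                                # internal-state dependency
            if actor in self.last_state_firing:
                self.arcs.append((self.last_state_firing[actor], fid, "state"))
            self.last_state_firing[actor] = fid
        return fid

etg = TraceGraph()
etg.fire("Source", "emit", produces=["c1"], touches_state=True)
etg.fire("Source", "emit", produces=["c1"], touches_state=True)
etg.fire("Sink", "consume", consumes=["c1"])
print(etg.arcs)    # -> [(0, 1, 'state'), (0, 2, 'token')]
```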
Citations: 1