
Latest publications from the 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

MLNoC: A Machine Learning Based Approach to NoC Design
N. Rao, Akshay Ramachandran, Amish Shah
Modern Systems on Chip (SoCs) are becoming increasingly complex with a growing number of CPUs, caches, accelerators, memory and I/O subsystems. For such designs, a packet-based distributed network-on-chip (NoC) interconnect can provide scalability, performance and efficiency. However, the design of such a NoC involves optimizing a large number of variables such as topology, routing choices, arbitration and quality of service (QoS) policies, buffer sizes, and deadlock avoidance policies. Widely varying die sizes, power, floorplan and performance constraints across a variety of different market segments, ranging from high-end servers to low-end IoT devices, impose additional design challenges. In this paper we demonstrate that there is a strong correlation between SoC characteristics and good NoC design practices. However, this correlation is highly non-linear and multidimensional, with dimensions indicative of the features of the SoC, design goals and properties of the NoC. This results in a high-dimensional NoC design space and a complex search process that is inefficient to solve with classic algorithms. Using a variety of real SoCs and training data sets, we demonstrate that a machine learning (ML) based approach yields near-optimal NoC designs quickly. We determine a number of SoC and NoC features, describe reduction methods, and also show that a multi-model approach yields better designs. We demonstrate that for a wide variety of SoCs, ML-based NoC designs are far superior to those designed and optimized manually over years on almost all quality metrics.
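The core idea — learning a mapping from SoC characteristics to good NoC design choices — can be illustrated with a minimal sketch. The features, training data, and nearest-neighbour model below are entirely hypothetical stand-ins for the paper's (much richer) feature sets and multi-model approach:

```python
from math import sqrt

# Toy training set (hypothetical values, not from the paper): each entry maps
# SoC features -- (core_count, die_area_mm2, normalized_bandwidth_target) --
# to a NoC topology label that worked well for that class of SoC.
TRAINING = [
    ((64, 600.0, 1.00), "mesh"),
    ((48, 500.0, 0.90), "mesh"),
    ((8,  100.0, 0.30), "ring"),
    ((4,   50.0, 0.20), "ring"),
    ((16, 200.0, 0.60), "crossbar"),
]

def distance(a, b):
    """Euclidean distance in a crudely normalized feature space."""
    scales = (64, 600.0, 1.0)
    return sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))

def predict_topology(soc_features, k=3):
    """Majority vote among the k nearest training SoCs."""
    nearest = sorted(TRAINING, key=lambda entry: distance(entry[0], soc_features))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# A large server-class SoC lands near the two mesh training points.
print(predict_topology((56, 550.0, 0.95)))
```

The same shape of model extends naturally to predicting further NoC parameters (buffer sizes, routing policies), which is where the paper's multi-model approach comes in.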
Citations: 5
Exploiting Limited Access Distance for Kernel Fusion Across the Stages of Explicit One-Step Methods on GPUs
Matthias Korch, Tim Werner
The performance of explicit parallel methods solving large systems of ordinary differential equations (ODEs) on GPUs is often memory bound. Therefore, locality optimizations, such as kernel fusion, are desirable. This paper exploits a special property of a large class of right-hand-side (RHS) functions to enable the fusion of computations of blocks of components across multiple stages of the method. This leads to a tiling of the stages within one time step. Our approach is based on a representation of the ODE method by a data flow graph and allows efficient GPU code with fused kernels to be generated automatically for user-defined tilings. In particular, we investigate two generalized tiling strategies, trapezoidal and hexagonal tiling, which are evaluated experimentally for several different high-order Runge-Kutta (RK) methods.
Citations: 3
Optimization of a Sparse Grid-Based Data Mining Kernel for Architectures Using AVX-512
Paul-Cristian Sarbu, H. Bungartz
Sparse grids have already been successfully used in various high-performance computing (HPC) applications, including data mining. In this article, we take a legacy classification kernel previously optimized for the AVX2 instruction set and investigate the benefits of using the newer AVX-512-based multi- and many-core architectures. In particular, the Knights Landing (KNL) processor is used to study the possible performance gains of the code. Not all kernels benefit equally from such architectures; therefore, choices in optimization steps and in KNL cluster and memory modes need to be filtered through the lens of the code implementation at hand. With a less traditional approach of manual vectorization through instruction-level intrinsics, our kernel provides a differently faceted look into the optimization process. Observations stem from results obtained for node- and cluster-level classification simulations with up to 2^28 multidimensional training data points, using the CooLMUC-3 cluster of the Leibniz Supercomputing Center (LRZ) in Garching, Germany.
Citations: 1
Exploring the Potential of Next Generation Software-Defined in Memory Frameworks
Shouwei Chen, I. Rodero
As in-memory data analytics become increasingly important in a wide range of domains, the ability to develop large-scale and sustainable platforms faces significant challenges related to storage latency and memory size constraints. These challenges can be resolved by adopting new and effective formulations and novel architectures such as software-defined infrastructure. This paper investigates the key issue of data persistency for in-memory processing systems by evaluating persistence methods using different storage and memory devices for Apache Spark and the use of Alluxio. It also proposes and evaluates via simulation a Spark execution model for using disaggregated off-rack memory and non-volatile memory targeting next-generation software-defined infrastructure. Experimental results provide better understanding of behaviors and requirements for improving data persistence in current in-memory systems and provide data points to better understand requirements and design choices for next-generation software-defined infrastructure. The findings indicate that in-memory processing systems can benefit from ongoing software-defined infrastructure implementations; however current frameworks need to be enhanced appropriately to run efficiently at scale.
Citations: 0
Exploring Self-Adaptivity Towards Performance and Energy for Time-Stepping Methods
Natalia Kalinnik, R. Kiesel, T. Rauber, Marcel Richter, G. Rünger
Time-stepping simulation methods offer potential for self-adaptivity, since the first time steps of the simulation can be used to explore the hardware characteristics and measure which of several available implementation variants leads to a good performance and energy consumption on the given hardware platform. The version with the best performance or the smallest energy consumption can then be used for the remaining time steps. However, the number of variants to test may be quite large and different simulation methods may require different approaches for self-adaptivity. In this article, we explore the potential for self-adaptivity of several methods from scientific computing. In particular, we consider particle simulation methods, solution methods for differential equations, as well as sparse matrix computations and explore the potential for self-adaptivity of these methods, considering both performance and energy consumption as target functions.
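The selection mechanism described above — time each variant during the first few steps, then commit to the fastest — can be sketched in a few lines. The two variants and the tuning loop below are illustrative stand-ins, not the paper's actual candidates:

```python
import time

# Two hypothetical implementation variants of the same per-step computation:
# a slow, loop-based version and an algebraically equivalent closed form.
def step_loop(x):
    s = 0.0
    for _ in range(200000):
        s += x
    return s

def step_direct(x):
    return 200000 * x

VARIANTS = {"loop": step_loop, "direct": step_direct}

def autotune(variants, warmup_steps=3, x=1.5):
    """Run each variant for the first few time steps, measure wall-clock
    time, and return the name of the fastest one plus all timings."""
    timings = {}
    for name, fn in variants.items():
        start = time.perf_counter()
        for _ in range(warmup_steps):
            fn(x)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get), timings

best, timings = autotune(VARIANTS)
print("selected variant:", best)
```

An energy-aware variant of the same loop would simply replace (or combine) the wall-clock measurement with an energy counter reading, which is the trade-off the paper explores.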
Citations: 1
DOACROSS Parallelization Based on Component Annotation and Loop-Carried Probability
Luis Mattos, D. C. S. Lucas, Juan Salamanca, J. P. L. Carvalho, M. Pereira, G. Araújo
Although modern compilers implement many loop parallelization techniques, their application is typically restricted to loops that have no loop-carried dependences (DOALL) or that contain well-known structured dependence patterns (e.g. reduction). These restrictions preclude the parallelization of many computationally intensive DOACROSS loops. In such loops, either the compiler finds at least one loop-carried dependence or it cannot prove, at compile-time, that the loop is free of such dependences, even though they might never show up at runtime. In any case, most compilers end up not parallelizing DOACROSS loops. This paper brings three contributions to address this problem. First, it integrates three algorithms (TLS, DOAX, and BDX) into a simple OpenMP clause that enables the programmer to select the best algorithm for a given loop. Second, it proposes an annotation approach to separate the sequential components of a loop, thus exposing other components to parallelization. Finally, it shows that loop-carried probability is an effective metric to decide when to use TLS or other non-speculative techniques (e.g. DOAX or BDX) to parallelize DOACROSS loops. Experimental results reveal that, for certain loops, slow-downs can be transformed into 2× speed-ups by quickly selecting the appropriate algorithm.
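The notion of loop-carried probability — how often a potential cross-iteration dependence actually materializes at runtime — can be made concrete with a small profiling sketch. The profile data and the 0.3 decision threshold below are made up for illustration; the paper's actual instrumentation and policy are more involved:

```python
def loop_carried_probability(iterations):
    """Given, per iteration, the addresses it reads and writes, return the
    fraction of iterations whose reads hit an address written by an
    earlier iteration -- i.e. that realize a loop-carried dependence."""
    written = set()
    dependent = 0
    for reads, writes in iterations:
        if written & set(reads):
            dependent += 1
        written |= set(writes)
    return dependent / len(iterations)

# Hypothetical profile of four iterations: only iteration 2 reads a value
# ("b0") produced by an earlier iteration.
profile = [
    (["a0"], ["b0"]),
    (["a1"], ["b1"]),
    (["b0"], ["b2"]),
    (["a3"], ["b3"]),
]
p = loop_carried_probability(profile)

# A simple policy in the spirit of the paper: speculate (TLS) only when
# actual dependences are rare, otherwise fall back to DOAX/BDX.
strategy = "TLS" if p < 0.3 else "DOAX/BDX"
print(p, strategy)
```

With a low probability, mis-speculation (and its rollback cost) is rare, which is why TLS wins in that regime.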
Citations: 2
Assessing Time Predictability Features of ARM big.LITTLE Multicores
Gabriel Fernandez, F. Cazorla, J. Abella, Sylvain Girbal
The increasing performance needs in critical real-time embedded systems (CRTES), such as the automotive domain, push for the adoption of high-performance hardware from the consumer electronics domain. However, their time-predictability features are largely unexplored. The ARM big.LITTLE architecture is a good candidate for adoption in the CRTES market (in the automotive market it has already started being used). In this paper we study ARM big.LITTLE's capabilities to meet CRTES requirements. In particular, we perform a qualitative and quantitative assessment of its timing characteristics, focusing on shared multicore resources, and how this architecture can be reliably used in CRTES.
Citations: 3
Performance Comparison of a Parallel Recommender Algorithm Across Three Hadoop-Based Frameworks
Christina Diedhiou, Bryan Carpenter, A. Shafi, Soumabha Sarkar, Ramazan Esmeli, Ryan Gadsdon
One of the challenges our society faces is the ever-increasing amount of data. Among existing platforms that address the system requirements, Hadoop is a framework widely used to store and analyze “big data”. On the human side, one of the aids to finding the things people really want is recommendation systems. This paper evaluates highly scalable parallel algorithms for recommendation systems with application to very large data sets. A particular goal is to evaluate an open source Java message passing library for parallel computing called MPJ Express, which has been integrated with Hadoop. As a demonstration we use MPJ Express to implement collaborative filtering on various data sets using the algorithm ALSWR (Alternating-Least-Squares with Weighted-λ-Regularization). We benchmark the performance and demonstrate parallel speedup on Movielens and Yahoo Music data sets, comparing our results with two other frameworks: Mahout and Spark. Our results indicate that MPJ Express implementation of ALSWR has very competitive performance and scalability in comparison with the two other frameworks.
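The ALS-WR algorithm itself alternates between solving for user factors with item factors held fixed and vice versa, with the regularization term weighted by each row's rating count. A minimal rank-1, pure-Python sketch on a toy rating matrix (all data and hyperparameters invented for illustration; the paper's implementation is parallel and higher-rank):

```python
# Toy rating matrix: (user, item) -> rating. Values are made up.
RATINGS = {
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 4.0, (2, 2): 2.0,
}
N_USERS, N_ITEMS, LAMBDA = 3, 3, 0.05

def als_step(fixed, solve_for_users):
    """Closed-form rank-1 update: each user (or item) factor is a least-
    squares fit against the fixed side, with lambda scaled by that row's
    rating count (the 'weighted-lambda' in ALS-WR)."""
    out = [0.0] * (N_USERS if solve_for_users else N_ITEMS)
    for idx in range(len(out)):
        num, den, n_obs = 0.0, 0.0, 0
        for (u, i), r in RATINGS.items():
            row, col = (u, i) if solve_for_users else (i, u)
            if row == idx:
                num += r * fixed[col]
                den += fixed[col] ** 2
                n_obs += 1
        out[idx] = num / (den + LAMBDA * n_obs) if n_obs else 0.0
    return out

def rmse(users, items):
    err = [(r - users[u] * items[i]) ** 2 for (u, i), r in RATINGS.items()]
    return (sum(err) / len(err)) ** 0.5

items = [1.0] * N_ITEMS
users = als_step(items, True)
before = rmse(users, items)
for _ in range(10):                 # alternate between the two factor sets
    items = als_step(users, False)
    users = als_step(items, True)
after = rmse(users, items)
print(before, after)
```

Because each half-step is an exact least-squares solve, the regularized error is non-increasing, which is why the reconstruction error drops across the alternations; the parallel versions benchmarked in the paper distribute these independent per-row solves across workers.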
Citations: 3
Adaptive Partitioning for Iterated Sequences of Irregular OpenCL Kernels
Pierre Huchant, Denis Barthou, M. Counilh
OpenCL defines a common parallel programming language for all devices, although writing tasks adapted to the devices and managing communication and load-balancing issues are left to the programmer. We propose in this paper a static/dynamic approach for the execution of an iterated sequence of data-dependent kernels on a multi-device heterogeneous architecture. The method automatically distributes irregular kernels onto multiple devices and tackles, without training, both load-balancing and data-transfer issues coming from hardware heterogeneity, load imbalance within the application itself, and load variations between repeated executions of the sequence.
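The dynamic half of such a scheme can be sketched as a feedback loop: start from a static split of the iteration space, then after each execution of the sequence adjust the split from the measured per-device times. This toy two-device model (all numbers invented) shows the split converging to the devices' relative throughput:

```python
def rebalance(split, time_a, time_b):
    """Given the current fraction of work on device A and the measured
    execution times of both devices, return the fraction that would make
    both devices finish simultaneously next time."""
    throughput_a = split / time_a          # work per second on device A
    throughput_b = (1.0 - split) / time_b  # work per second on device B
    return throughput_a / (throughput_a + throughput_b)

# Simulate a device A that is 3x faster than device B: execution time is
# the assigned share divided by the device's speed.
split = 0.5                                # initial static even split
for _ in range(5):
    time_a = split / 3.0
    time_b = (1.0 - split) / 1.0
    split = rebalance(split, time_a, time_b)
print(round(split, 3))
```

Here the split settles at 0.75, i.e. three quarters of the work on the device that is three times faster; in the real setting the same feedback also has to account for data-transfer costs and per-execution load variation.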
Citations: 1
Network-Aware Energy-Efficient Virtual Machine Management in Distributed Cloud Infrastructures with On-Site Photovoltaic Production
Benjamin Camus, F. Dufossé, A. Blavette, M. Quinson, Anne-Cécile Orgerie
Distributed Clouds are nowadays an essential component for providing Internet services to an ever-growing number of connected devices. This growth makes the energy consumption of these distributed infrastructures a worrying environmental and economic concern. In order to reduce energy costs and carbon footprint, Cloud providers could resort to producing on-site renewable energy, for instance with solar panels. In this paper, we propose NEMESIS: a Network-aware Energy-efficient Management framework for distributEd cloudS Infrastructures with on-Site photovoltaic production. NEMESIS optimizes VM placement and balances VM migration and green energy consumption in Cloud infrastructures embedding geographically distributed data centers with on-site photovoltaic power supply. We use the SimGrid simulation toolbox to evaluate the energy efficiency of NEMESIS against state-of-the-art approaches.
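The flavor of green-aware placement can be conveyed with a greedy toy: among the data centers that can host a VM, prefer the one with the most spare on-site solar power. The data-center names, capacities, and the single-criterion policy are invented for illustration; NEMESIS itself also weighs migration and network costs:

```python
# Hypothetical fleet state: spare cores and currently available on-site
# photovoltaic power per data center. All numbers are made up.
DATACENTERS = {
    "dc-paris":  {"free_cores": 10, "solar_watts": 500.0},
    "dc-rennes": {"free_cores": 6,  "solar_watts": 1200.0},
    "dc-lyon":   {"free_cores": 0,  "solar_watts": 2000.0},
}

def place_vm(cores_needed):
    """Greedily pick the feasible data center with the highest available
    solar power; return None if no data center can host the VM."""
    feasible = {name: dc for name, dc in DATACENTERS.items()
                if dc["free_cores"] >= cores_needed}
    if not feasible:
        return None
    best = max(feasible, key=lambda name: feasible[name]["solar_watts"])
    DATACENTERS[best]["free_cores"] -= cores_needed
    return best

# dc-lyon has the most solar but no free cores, so the VM lands elsewhere.
chosen = place_vm(4)
print(chosen)
```

A realistic framework would re-run such decisions as solar production varies over the day, trading the energy saved by a better placement against the network cost of migrating running VMs.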
Citations: 6
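The green-energy-aware placement that NEMESIS performs can be illustrated with a toy greedy heuristic: prefer the data center whose unused on-site solar production can absorb the new VM's power draw, otherwise fall back to the least-loaded site. This is only an illustrative sketch, not the NEMESIS algorithm; all names and figures are hypothetical.

```python
# Toy green-aware VM placement: pick the data center with the largest
# unused on-site solar surplus that can cover the VM's power draw.

def place_vm(vm_power, sites):
    """sites: dicts with 'name', 'solar' (W produced) and 'load' (W consumed)."""
    def surplus(s):
        return s["solar"] - s["load"]
    # Candidate sites whose green surplus covers the VM's draw.
    green = [s for s in sites if surplus(s) >= vm_power]
    target = max(green, key=surplus) if green else min(sites, key=lambda s: s["load"])
    target["load"] += vm_power
    return target["name"]

sites = [
    {"name": "dc-sunny", "solar": 500.0, "load": 100.0},
    {"name": "dc-cloudy", "solar": 50.0, "load": 20.0},
]
print(place_vm(120.0, sites))   # prints "dc-sunny" (surplus 400 W covers 120 W)
```

A full framework like NEMESIS must also account for what this toy ignores: the network cost of migrating running VMs between sites and the time-varying nature of photovoltaic production.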
Journal
2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)