2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP): Latest Articles

Closed-Form Solutions for Dense Matrix-Matrix Multiplication on Heterogeneous Platforms Using Divisible Load Analysis

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00067

G. Barlas, L. E. Hiny

In this paper we analytically solve the partitioning problem for performing matrix multiplication on a cluster of heterogeneous multicore machines equipped with an accelerator, typically a GPU. We derive closed-form solutions that not only solve the problem exactly but also allow for predictive analysis that can guide system design. Our work allows an optimum partitioning to be calculated in time linear in the number of cores in the system. The static partitioning afforded by our Divisible Load Theory (DLT)-based analysis minimizes communication overhead and improves efficiency. Our work leverages existing optimized Dense Linear Algebra (DLA) libraries, such as cuBLAS and BLAS, which makes deployment easy and lets it readily exploit state-of-the-art tools. A comparison study concludes the paper, highlighting the beneficial effect of our partitioning approach.
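
The abstract does not give the closed-form solutions themselves, so the following is only a minimal sketch of the simplest divisible-load idea: if communication cost is ignored, all workers finish at the same time when each one receives a share of the rows proportional to its speed. The function name `partition_rows` and the `speeds` values are hypothetical, and this is not the authors' formulation.

```python
from itertools import accumulate


def partition_rows(total_rows, speeds):
    """Split total_rows among workers in proportion to their relative speeds.

    speeds[i] is a hypothetical throughput figure (e.g. rows per second) for
    worker i. Communication cost is ignored, which is a simplification.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if not speeds or any(s <= 0 for s in speeds):
        raise ValueError("speeds must be a non-empty list of positive numbers")

    # Running totals of speed; the last entry is the aggregate speed.
    cumulative = list(accumulate(speeds))
    total_speed = cumulative[-1]

    # Round each cumulative boundary, then take differences, so the integer
    # shares always add up to total_rows. One pass: linear in len(speeds).
    rows = []
    previous = 0
    for c in cumulative:
        boundary = round(total_rows * (c / total_speed))
        rows.append(boundary - previous)
        previous = boundary
    return rows
```

For example, `partition_rows(1000, [1, 1, 2])` returns `[250, 250, 500]`: the worker assumed to be twice as fast receives twice as many rows. A model closer to the paper would also have to account for transfer times to each node and accelerator.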

Citations: 0

A Unified Programming Model for Time- and Data-Driven Embedded Applications

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00013

G. Breaban, S. Stuijk, K. Goossens

Modern embedded systems encompass a rapidly growing range of applications, spanning automotive, multimedia, and industrial automation. To tackle the increasing design complexity, the model-based design paradigm promotes the use of Models of Computation (MoCs) to capture the essential application properties. Existing MoCs are split between the event/time-triggered paradigm and the data-driven paradigm. However, time and data are two inter-related dimensions that are essential for defining the correct application behavior. In this paper we advocate a unified MoC that integrates the notions of time and data while accounting for imperfect clocks. We present the formal properties of our model and show how the Synchronous Data Flow (SDF) MoC can be used to analyze the timing performance guarantees.
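
As background for the SDF analysis mentioned above, here is a small sketch of the standard first step in analyzing any SDF graph: solving the balance equations to obtain the repetition vector (how often each actor fires per iteration). It illustrates plain SDF only, not the unified model proposed in the paper; the function name and the edge tuple layout `(src, dst, produce, consume)` are assumptions made for this example. It requires Python 3.9 or later for multi-argument `math.lcm` and `math.gcd`.

```python
from fractions import Fraction
from math import gcd, lcm


def repetition_vector(num_actors, edges):
    """Return the smallest positive integer firing counts of an SDF graph.

    edges is a list of (src, dst, produce, consume): src writes `produce`
    tokens per firing and dst reads `consume` tokens per firing.
    """
    # Balance equation per edge: q[src] * produce == q[dst] * consume.
    neighbours = [[] for _ in range(num_actors)]
    for src, dst, produce, consume in edges:
        neighbours[src].append((dst, Fraction(produce, consume)))
        neighbours[dst].append((src, Fraction(consume, produce)))

    # Fix actor 0 at rate 1 and propagate relative rates through the graph.
    rate = [None] * num_actors
    rate[0] = Fraction(1)
    stack = [0]
    while stack:
        actor = stack.pop()
        for other, ratio in neighbours[actor]:
            expected = rate[actor] * ratio
            if rate[other] is None:
                rate[other] = expected
                stack.append(other)
            elif rate[other] != expected:
                raise ValueError("inconsistent rates: no bounded schedule")
    if any(r is None for r in rate):
        raise ValueError("graph is not connected")

    # Scale the rational rates to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate))
    counts = [int(r * scale) for r in rate]
    divisor = gcd(*counts)
    return [c // divisor for c in counts]
```

For a three-actor chain where actor 0 produces 2 tokens and actor 1 consumes 3, then actor 1 produces 1 and actor 2 consumes 2, the result is `[3, 2, 1]`.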

Citations: 1

Context-Aware Optimization for Energy-Efficient and QoS Wireless Body Area Networks with Human Dynamics

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00016

Da-Ren Chen, Ping-Feng Wang

Taking human body dynamics into account, we propose an energy-aware, quality-of-service (QoS) method for wireless body area networks (WBANs). The method exploits the characteristics of the physical (PHY) layer, based on narrowband signaling with M-PSK modulation and a low-power micro sensor transceiver. It improves the energy efficiency of both the receiver and transmitter front-end components while satisfying the QoS metrics as link quality varies with human posture changes or movements. It can meet the various QoS requirements of each sensor, improve bandwidth utilization, and reduce energy consumption. The method saves an average of 6.2% of energy consumption over the TPC- and LA-based methods at a power of -25 dBm.

Citations: 2

Increasing Efficiency in Parallel Programming Teaching

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00053

M. Danelutto, M. Torquati

The ability to teach parallel programming principles and techniques is becoming fundamental to prepare a new generation of programmers able to master the pervasive parallelism made available by hardware vendors. Classical parallel programming courses leverage either low-level programming frameworks (e.g. those based on Pthreads) or higher level frameworks such as OpenMP or MPI. We discuss our teaching experience within the Master in "Computer Science and networking" where parallel programming is taught leveraging structured parallel programming principles and frameworks. The paper summarizes the results achieved in eight years of experience and shows how the adoption of a structured parallel programming approach improves the efficiency of the teaching process.

Citations: 1

DuoFS: A Hybrid Storage System Balancing Energy-Efficiency, Reliability, and Performance

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00082

Shu Yin, Bing Jiao, Xiaomin Zhu, X. Ruan, Si Chen, Zhuo Tang

As the Energy Wall and the Reliability Wall become unavoidable, reducing energy consumption in the large-scale storage systems of modern data centers while retaining acceptable system reliability is a demanding and challenging task. We propose a reliable, energy-efficient storage system called DuoFS, which aims to balance the energy efficiency, reliability, and performance of parallel storage systems by seamlessly integrating one HDD-based file system and one SSD-based file system. At the heart of DuoFS is a transformative middleware layer that dispatches files to one of the two independent parallel file systems based on the files' I/O access popularity. By replicating popular files to the SSD-based file system and pushing the HDD-based file system into low-power mode under light workload conditions, DuoFS can significantly reduce energy consumption, avoid major factors that harm the storage system's reliability, and exploit the SSDs' good I/O performance. Experimental results show that DuoFS saves up to 40% of energy and achieves up to 50% better I/O performance while sacrificing less than 15% of the system's reliability.
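
The abstract describes popularity-based dispatching only at a high level. The sketch below shows one way such a policy could look, assuming a simple access-count threshold; the class name `PopularityDispatcher`, the `hot_threshold` parameter, and the `"ssd"`/`"hdd"` labels are all hypothetical and do not come from DuoFS itself.

```python
from collections import Counter


class PopularityDispatcher:
    """Toy dispatcher: files become 'hot' after enough accesses.

    Hot files are treated as replicated to the SSD tier; all other
    requests are served from the HDD tier.
    """

    def __init__(self, hot_threshold):
        if hot_threshold < 1:
            raise ValueError("hot_threshold must be at least 1")
        self.hot_threshold = hot_threshold
        self.access_counts = Counter()
        self.on_ssd = set()

    def record_access(self, path):
        """Count one access and return the tier that holds the file afterwards."""
        self.access_counts[path] += 1
        if self.access_counts[path] >= self.hot_threshold:
            # Assumption: replication to SSD happens as soon as a file is hot.
            self.on_ssd.add(path)
        return "ssd" if path in self.on_ssd else "hdd"

    def hdd_can_sleep(self, recent_paths):
        """True if every recently requested file already has an SSD replica."""
        return all(path in self.on_ssd for path in recent_paths)
```

With `hot_threshold=3`, the first two accesses to a file report `"hdd"` and the third reports `"ssd"`. A real system would also need eviction, write handling, and a workload-based rule for entering low-power mode, none of which the abstract specifies.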

Citations: 1

Collective I/O Performance on the Santos Dumont Supercomputer

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00015

A. Carneiro, J. L. Bez, F. Boito, Bruno Alves Fagundes, Carla Osthoff, P. Navaux

The historical gap between processing and data access speeds causes many applications to spend a large portion of their execution time on I/O operations. For a large-scale, expensive supercomputer, it is important to ensure that applications achieve the best I/O performance so that the machine is used efficiently. In this paper, we evaluate the I/O infrastructure of the Santos Dumont supercomputer, the largest in Latin America. More specifically, we investigate the performance of collective I/O operations. By analyzing a scientific application that uses the machine, we identify large performance differences between the available MPI implementations. We then further study the observed phenomenon using the BT-IO and IOR benchmarks, in addition to a custom microbenchmark. We conclude that the customized MPI implementation by Bull (used by more than 20% of the jobs) presents the worst performance for small collective write operations. Our results are being used to help Santos Dumont users achieve the best performance for their applications. Additionally, by investigating the observed phenomenon, we provide information to help improve future MPI-IO collective write implementations.

Citations: 5

Simulation-Based Evaluation Strategy for Task Mapping Approaches in WNoC Platforms

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00104

Luis Germán García Morales, J. E. A. Cobo, N. Bagherzadeh

Network-On-Chip (NoC) along with its extension Wireless NoC (WNoC) were proposed to increase system performance in future generations of Multi-Processor System-on-Chip (MPSoC) with hundreds / thousands of cores. For such platforms, designers seek to propose efficient task mapping mechanisms that establish the arrangement of executable tasks taking advantage of available communication links. These proposals are then evaluated and compared against other designs in terms of execution time, latency, energy, communication cost and other metrics, employing simulation tools to that end. However, current WNoC simulators only aim to evaluate the performance of the communication network; they lack the ability to estimate relevant metrics needed for the evaluation of mapping strategies. In this paper, we present an evaluation strategy aimed to assess the performance of task mapping approaches for WNoC and integrate it into a well-known state-of-the-art simulation tool. Several experiments are conducted to demonstrate the benefits of using our proposed strategy.

Citations: 2

Evaluating and Optimizing OpenCL Base64 Data Unpacking Kernel with FPGA

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00046

Zheming Jin, Iris Johnson, H. Finkel

Development of applications using OpenCL targeting FPGAs is an emerging approach on heterogeneous computing systems. This paper uses the data unpacking algorithm in Base64 encoding as a case study to present programming and optimization techniques, and experimental results of the OpenCL-based implementations on an FPGA. We explain the algorithm and evaluate the performance of the kernel implementations with Intel's FPGA OpenCL SDK. The experimental results show kernel vectorization and duplication are two optimization techniques that can improve the kernel performance. The performance of kernel duplication is also closely related to the local work size. Our experiment shows 16-lane vectorization increases the bandwidth by a factor of 2 to 10 for large input data sizes. Moreover, the performance of kernel duplication using 16 compute units is 40% to 1.5% less than that of kernel vectorization depending on the input size. Tuning the local work size can improve the kernel performance by a factor of 3 to 23. For this kernel, using local memory is not an effective technique to improve the kernel performance because input data is not reused. A combination of vectorization and duplication achieves the highest performance of 12.3 GiB/s. Compared to an Intel Xeon E5 CPU and an Nvidia Tesla K80 GPU, the performance of the kernel on the Arria 10 FPGA is 6.7X faster than the CPU and 3X slower than the GPU. The performance per watt on the FPGA is 20.5X higher than the CPU and 1.19X lower than the GPU.
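
For readers unfamiliar with the operation being accelerated, the sketch below shows Base64 unpacking in plain Python, assuming "unpacking" refers to the standard decode step in which four 6-bit symbols are merged into three bytes. It is a scalar reference for the arithmetic only; it says nothing about how the OpenCL kernel in the paper is structured, and the function name `unpack_base64` is hypothetical.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
DECODE = {ch: i for i, ch in enumerate(ALPHABET)}


def unpack_base64(text):
    """Decode standard padded Base64 text into bytes, four symbols at a time."""
    if len(text) % 4 != 0:
        raise ValueError("padded Base64 length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(text), 4):
        quad = text[i:i + 4]
        symbols = quad.rstrip("=")
        pad = len(quad) - len(symbols)
        if pad > 2 or (pad and i + 4 != len(text)):
            raise ValueError("invalid padding")
        try:
            # Padding positions contribute zero bits.
            vals = [DECODE[ch] for ch in symbols] + [0] * pad
        except KeyError as exc:
            raise ValueError(f"invalid Base64 symbol: {exc.args[0]!r}") from None
        # Pack four 6-bit values into one 24-bit word, then split into 3 bytes.
        word = (vals[0] << 18) | (vals[1] << 12) | (vals[2] << 6) | vals[3]
        out += word.to_bytes(3, "big")[:3 - pad]
    return bytes(out)
```

Each group of four symbols is independent of the others, which is why the operation lends itself to the vectorization and compute-unit duplication discussed in the abstract. For instance, `unpack_base64("TWFu")` yields `b"Man"`.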

Citations: 1

Heterogeneous Computing and Multi-Clustering Support Via Peer-To-Peer HPC

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00050

Bilal Fakih, D. E. Baz

This paper presents Peer-To-Peer HPC, a decentralized environment that facilitates the use of heterogeneous multi-cluster platforms for loosely synchronous applications. The goal is to exploit all the computing resources (all the available cores of the computing nodes) as well as all networks, e.g., Ethernet, InfiniBand, and Myrinet. Peer-To-Peer HPC functionality relies on a reconfigurable multi-network protocol (RMNP) for controlling multiple network adapters and on OpenMP for exploiting all the available cores of the computing nodes. We report on the efficiency obtained on the Grid5000 testbed by combining synchronous and asynchronous iterative schemes of computation with Peer-To-Peer HPC. The experimental results show that our environment scales well.

Citations: 3

Memory-Aware Genetic Algorithms for Task Mapping on Hard Real-Time Networks-on-Chip

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00101

Lloyd Robert Still, L. Indrusiak

The problem of mapping hard real-time tasks onto networks-on-chip has previously been successfully addressed by genetic algorithms. However, none of the existing problem formulations consider memory constraints. State-of-the-art genetic mappers can therefore produce fully schedulable mappings that are incompatible with the memory limitations of realistic platforms. In this paper, we extend the problem formulation and devise a memory architecture in the form of private local memories. We then propose three memory models of increasing complexity and realism, and evaluate the impact these additional constraints have on the genetic search. We conduct extensive experiments using tasks and communications from a realistic benchmark application, and compare the proposed approach against a state-of-the-art baseline mapper.
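
To make the idea of a memory constraint on a task mapping concrete, the following sketch checks a candidate mapping against per-core private memory capacities and folds any overflow into a fitness penalty. The abstract does not describe the three memory models, so this is a generic penalty approach under assumed names (`memory_overflow`, `penalized_fitness`, `weight`), not the authors' method.

```python
def memory_overflow(mapping, task_memory, core_capacity):
    """Total memory demand exceeding capacity, summed over all cores.

    mapping[t] is the core index assigned to task t; task_memory[t] is the
    task's footprint; core_capacity[c] is the private local memory of core c.
    """
    if len(mapping) != len(task_memory):
        raise ValueError("mapping and task_memory must have the same length")
    used = [0] * len(core_capacity)
    for task, core in enumerate(mapping):
        used[core] += task_memory[task]
    return sum(max(0, u - cap) for u, cap in zip(used, core_capacity))


def penalized_fitness(base_fitness, mapping, task_memory, core_capacity, weight=1.0):
    """Lower a schedulability-based fitness score by the weighted overflow."""
    overflow = memory_overflow(mapping, task_memory, core_capacity)
    return base_fitness - weight * overflow
```

A mapping is memory-feasible exactly when `memory_overflow` returns 0. With footprints `[4, 5, 3]` and two cores of capacity 8, the mapping `[0, 0, 1]` overflows core 0 by 1, while `[0, 1, 0]` fits.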

Citations: 7
