
2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP): Latest Papers

Closed-Form Solutions for Dense Matrix-Matrix Multiplication on Heterogeneous Platforms Using Divisible Load Analysis
G. Barlas, L. E. Hiny
In this paper we analytically solve the partitioning problem for performing matrix multiplication on a cluster of heterogeneous multicore machines, equipped with an accelerator, typically a GPU. We derive closed-form solutions that not only solve the problem exactly but also allow for predictive analysis that can guide system design. Our work allows an optimum partitioning to be calculated in linear time with respect to the number of cores in the system. The static partitioning afforded by our Divisible Load Theory (DLT) based analysis minimizes communication overhead and improves efficiency. Our work leverages existing optimized Dense Linear Algebra (DLA) libraries, such as cuBLAS and BLAS, which translates to easy deployment that can readily exploit state-of-the-art tools. A comparison study concludes the paper, highlighting the beneficial effect of our partitioning approach.
DOI: https://doi.org/10.1109/PDP2018.2018.00067 (published 2018-03-21)
Citations: 0
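The abstract above does not give the closed-form solution itself, so the following is only a minimal sketch of the basic Divisible Load Theory idea: choose the load fractions so that all workers finish at the same time. It assumes communication cost is zero and that each worker is described by a single hypothetical parameter, `secs_per_row` (time to process one row of the result matrix). The function name and parameters are illustrative and are not taken from the paper.

```python
from fractions import Fraction


def partition_load(total_rows, secs_per_row):
    """Split total_rows among workers so that all finish at about the same time.

    Assumption: no communication cost, so the ideal share of each worker
    is proportional to its speed (1 / secs_per_row).
    """
    speeds = [Fraction(1) / Fraction(w) for w in secs_per_row]
    total_speed = sum(speeds)
    # Ideal (possibly fractional) number of rows per worker.
    shares = [total_rows * s / total_speed for s in speeds]
    # Round down, then hand the leftover rows to the workers whose
    # ideal share had the largest fractional part.
    rows = [int(s) for s in shares]
    leftover = total_rows - sum(rows)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - rows[i], reverse=True)
    for i in order[:leftover]:
        rows[i] += 1
    return rows


# Three workers; the second is 2x slower and the third 4x slower than the first.
print(partition_load(1000, [1, 2, 4]))  # [571, 286, 143]
```

Computing the ideal shares is linear in the number of workers; the sort is used here only to round to whole rows. A model that includes transfer times and accelerators, as the paper describes, would need additional terms that this sketch leaves out.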
A Unified Programming Model for Time- and Data-Driven Embedded Applications
G. Breaban, S. Stuijk, K. Goossens
Modern embedded systems encompass a rapidly growing range of applications, spanning automotive, multimedia, and industrial automation. To tackle the increasing design complexity, the model-based design paradigm promotes the use of Models of Computation (MoCs) to capture the essential application properties. Existing MoCs are split between the event/time-triggered paradigm and the data-driven paradigm. However, time and data are two inter-related dimensions that are essential for defining the correct application behavior. In this paper we advocate a unified MoC that integrates the notions of time and data while accounting for imperfect clocks. We present the formal properties of our model and show how the Synchronous Data Flow (SDF) MoC can be used to analyze the time performance guarantees.
DOI: https://doi.org/10.1109/PDP2018.2018.00013 (published 2018-03-21)
Citations: 1
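For readers unfamiliar with Synchronous Data Flow, the sketch below illustrates the standard SDF balance equations, which underlie most SDF timing analysis: for every edge, the tokens produced per iteration must equal the tokens consumed. It is not the unified model proposed in the paper. The graph encoding (a list of `(src, dst, produced, consumed)` tuples) and the function name are assumptions made for this example.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Return the smallest positive firing counts q with
    q[src] * produced == q[dst] * consumed for every edge.

    Assumes a connected graph; raises ValueError otherwise or if the
    rates are inconsistent.
    """
    rate = {actors[0]: Fraction(1)}  # fix the first actor, propagate to the rest
    pending = list(edges)
    changed = True
    while pending and changed:
        changed = False
        for edge in list(pending):
            src, dst, prod, cons = edge
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
            elif src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent rates")
            else:
                continue  # neither endpoint known yet, retry later
            pending.remove(edge)
            changed = True
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the rational rates to integers via the LCM of the denominators.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    return {a: int(rate[a] * scale) for a in actors}


# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
print(repetition_vector(["A", "B", "C"],
                        [("A", "B", 2, 3), ("B", "C", 1, 2)]))
# {'A': 3, 'B': 2, 'C': 1}
```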
Context-Aware Optimization for Energy-Efficient and QoS Wireless Body Area Networks with Human Dynamics
Da-Ren Chen, Ping-Feng Wang
Considering the context of human body dynamics, we propose an energy-aware and quality-of-service (QoS) method for wireless body area networks (WBANs). The method exploits the characteristics of the physical (PHY) layer, based on narrowband signaling with M-PSK modulation and a low-power micro-sensor transceiver. It improves the energy efficiency of both receiver and transmitter front-end components while satisfying the QoS metrics in accordance with the link quality resulting from human posture changes or movements. It can meet the various QoS requirements of each sensor, improve bandwidth utilization, and reduce energy consumption. The method saves an average of 6.2% of energy consumption compared with the TPC and LA-based methods at a power of -25 dBm.
DOI: https://doi.org/10.1109/PDP2018.2018.00016 (published 2018-03-21)
Citations: 2
Increasing Efficiency in Parallel Programming Teaching
M. Danelutto, M. Torquati
The ability to teach parallel programming principles and techniques is becoming fundamental to prepare a new generation of programmers able to master the pervasive parallelism made available by hardware vendors. Classical parallel programming courses leverage either low-level programming frameworks (e.g. those based on Pthreads) or higher level frameworks such as OpenMP or MPI. We discuss our teaching experience within the Master in "Computer Science and networking" where parallel programming is taught leveraging structured parallel programming principles and frameworks. The paper summarizes the results achieved in eight years of experience and shows how the adoption of a structured parallel programming approach improves the efficiency of the teaching process.
DOI: https://doi.org/10.1109/PDP2018.2018.00053 (published 2018-03-21)
Citations: 1
DuoFS: A Hybrid Storage System Balancing Energy-Efficiency, Reliability, and Performance
Shu Yin, Bing Jiao, Xiaomin Zhu, X. Ruan, Si Chen, Zhuo Tang
As the Energy Wall and the Reliability Wall become unavoidable, reducing energy consumption in the large-scale storage systems of modern data centers while retaining acceptable system reliability is a demanding and challenging task. We propose a reliable, energy-efficient storage system called DuoFS, which aims to balance the energy efficiency, reliability, and performance of parallel storage systems by seamlessly integrating one HDD-based file system and one SSD-based file system. At the heart of DuoFS is a transformative middleware layer that dispatches files to one of the two independent parallel file systems based on the files' I/O access popularity. By replicating popular files to the SSD-based file system and pushing the HDD-based file system into low-power mode under light workload conditions, DuoFS can significantly reduce energy consumption, avoid major factors that harm storage system reliability, and exploit the SSDs' good I/O performance. Experimental results show that DuoFS saves up to 40% of energy and achieves up to 50% better I/O performance while sacrificing less than 15% of the system's reliability.
DOI: https://doi.org/10.1109/PDP2018.2018.00082 (published 2018-03-21)
Citations: 1
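The popularity-based dispatching described in the DuoFS abstract can be illustrated with a toy model. The sketch below is not the DuoFS middleware: the class name, the access-count threshold, and the "HDD may sleep" check are all hypothetical and chosen only to show the idea of replicating frequently read files to the SSD tier and serving them from there.

```python
from collections import Counter


class PopularityDispatcher:
    """Toy two-tier dispatcher: hot files are replicated to the SSD tier."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold  # hypothetical popularity cut-off
        self.access_counts = Counter()
        self.on_ssd = set()

    def read(self, path):
        """Record one access and return the tier that serves it."""
        self.access_counts[path] += 1
        tier = "ssd" if path in self.on_ssd else "hdd"
        # Replicate (not move) once the file becomes popular; the HDD copy stays.
        if tier == "hdd" and self.access_counts[path] >= self.hot_threshold:
            self.on_ssd.add(path)
        return tier

    def hdd_can_sleep(self, recent_paths):
        """The HDD tier may enter low-power mode if every recently
        requested file already has an SSD replica."""
        return all(p in self.on_ssd for p in recent_paths)


d = PopularityDispatcher(hot_threshold=2)
print([d.read("a.dat") for _ in range(3)])  # ['hdd', 'hdd', 'ssd']
print(d.hdd_can_sleep(["a.dat"]))           # True
```

A real system would also have to handle writes, replica consistency, and SSD capacity limits, none of which this sketch models.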
Collective I/O Performance on the Santos Dumont Supercomputer
A. Carneiro, J. L. Bez, F. Boito, Bruno Alves Fagundes, Carla Osthoff, P. Navaux
The historical gap between processing and data access speeds causes many applications to spend a large portion of their execution time on I/O operations. From the point of view of a large-scale, expensive supercomputer, it is important to ensure that applications achieve the best I/O performance to promote efficient usage of the machine. In this paper, we evaluate the I/O infrastructure of the Santos Dumont supercomputer, the largest in Latin America. More specifically, we investigate the performance of collective I/O operations. By conducting an analysis of a scientific application that uses the machine, we identify large performance differences between the available MPI implementations. We then further study the observed phenomenon using the BT-IO and IOR benchmarks, in addition to a custom microbenchmark. We conclude that the customized MPI implementation by Bull (used by more than 20% of the jobs) presents the worst performance for small collective write operations. Our results are being used to help Santos Dumont users achieve the best performance for their applications. Additionally, by investigating the observed phenomenon, we provide information to help improve future MPI-IO collective write implementations.
DOI: https://doi.org/10.1109/PDP2018.2018.00015 (published 2018-03-21)
Citations: 5
Simulation-Based Evaluation Strategy for Task Mapping Approaches in WNoC Platforms
Luis Germán García Morales, J. E. A. Cobo, N. Bagherzadeh
Network-on-Chip (NoC), along with its extension Wireless NoC (WNoC), was proposed to increase system performance in future generations of Multi-Processor Systems-on-Chip (MPSoCs) with hundreds or thousands of cores. For such platforms, designers seek to propose efficient task mapping mechanisms that establish the arrangement of executable tasks, taking advantage of the available communication links. These proposals are then evaluated and compared against other designs in terms of execution time, latency, energy, communication cost, and other metrics, employing simulation tools to that end. However, current WNoC simulators only aim to evaluate the performance of the communication network; they lack the ability to estimate the metrics needed to evaluate mapping strategies. In this paper, we present an evaluation strategy for assessing the performance of task mapping approaches for WNoC and integrate it into a well-known state-of-the-art simulation tool. Several experiments are conducted to demonstrate the benefits of using our proposed strategy.
DOI: https://doi.org/10.1109/PDP2018.2018.00104 (published 2018-03-21)
Citations: 2
Evaluating and Optimizing OpenCL Base64 Data Unpacking Kernel with FPGA
Zheming Jin, Iris Johnson, H. Finkel
Development of applications using OpenCL targeting FPGAs is an emerging approach on heterogeneous computing systems. This paper uses the data unpacking algorithm in Base64 encoding as a case study to present programming and optimization techniques, and experimental results of the OpenCL-based implementations on an FPGA. We explain the algorithm and evaluate the performance of the kernel implementations with Intel's FPGA OpenCL SDK. The experimental results show kernel vectorization and duplication are two optimization techniques that can improve the kernel performance. The performance of kernel duplication is also closely related to the local work size. Our experiment shows 16-lane vectorization increases the bandwidth by a factor of 2 to 10 for large input data sizes. Moreover, the performance of kernel duplication using 16 compute units is 40% to 1.5% less than that of kernel vectorization depending on the input size. Tuning the local work size can improve the kernel performance by a factor of 3 to 23. For this kernel, using local memory is not an effective technique to improve the kernel performance because input data is not reused. A combination of vectorization and duplication achieves the highest performance of 12.3 GiB/s. Compared to an Intel Xeon E5 CPU and an Nvidia Tesla K80 GPU, the performance of the kernel on the Arria 10 FPGA is 6.7X faster than the CPU and 3X slower than the GPU. The performance per watt on the FPGA is 20.5X higher than the CPU and 1.19X lower than the GPU.
DOI: https://doi.org/10.1109/PDP2018.2018.00046 (published 2018-03-21)
Citations: 1
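As background for the kernel discussed above, here is a plain Python sketch of Base64 decoding, assuming that "data unpacking" refers to turning each group of four Base64 characters back into three bytes. It is a reference illustration of the algorithm only, not the paper's OpenCL kernel, and the function name is made up for this example. Because every 4-character group is decoded independently, the loop body is the kind of work that can be spread across vector lanes or compute units.

```python
import base64
import string

# Standard Base64 alphabet (RFC 4648): A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
DECODE = {ch: i for i, ch in enumerate(ALPHABET)}


def unpack_base64(text):
    """Decode padded Base64 text; len(text) must be a multiple of 4."""
    pad = len(text) - len(text.rstrip("="))
    out = bytearray()
    for i in range(0, len(text), 4):
        # Treat '=' padding as zero bits; the surplus bytes are cut off below.
        group = text[i:i + 4].replace("=", "A")
        a, b, c, d = (DECODE[ch] for ch in group)
        # Four 6-bit values form one 24-bit word, i.e. three output bytes.
        word = (a << 18) | (b << 12) | (c << 6) | d
        out += word.to_bytes(3, "big")
    return bytes(out[:len(out) - pad])


sample = "TWFueSBoYW5kcyBtYWtlIGxpZ2h0IHdvcmsu"
# Cross-check against the standard library decoder.
print(unpack_base64(sample) == base64.b64decode(sample))  # True
```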
Heterogeneous Computing and Multi-Clustering Support Via Peer-To-Peer HPC
Bilal Fakih, D. E. Baz
This paper presents Peer-To-Peer HPC, a decentralized environment that facilitates the use of heterogeneous multi-cluster platforms for loosely synchronous applications. The goal is to exploit all the computing resources (all the available cores of the computing nodes) as well as all networks, e.g., Ethernet, InfiniBand and Myrinet. Peer-To-Peer HPC functionality relies on a reconfigurable multi-network protocol, RMNP, for controlling multiple network adapters and on OpenMP for exploiting all the available cores of the computing nodes. We report on the efficiency obtained on the Grid5000 testbed by combining synchronous and asynchronous iterative schemes of computation with Peer-To-Peer HPC. The experimental results show that our environment scales well.
DOI: https://doi.org/10.1109/PDP2018.2018.00050 (published 2018-03-21)
Citations: 3
Combining Parallel Genetic Algorithms and Machine Learning to Improve the Research of Optimal Vaccination Protocols
M. Pennisi, G. Russo, F. Pappalardo
The development of novel prophylactic and therapeutic vaccine candidates in the field of cancer immunology has led to very promising results against tumors, providing full protection with fewer of the typical side effects of current conventional treatments. However, such treatments require a constant, life-long administration procedure to maintain protection. As both the period of protection and the relative number of administrations grow, the problem of finding the best administration protocol, in time and dosage, becomes more and more complex. Such a problem usually cannot be solved in in vivo experiments, as the costs in terms of time, money, and people would be prohibitive. We propose a hybrid approach that integrates machine learning and parallel genetic algorithms to enhance the in silico search for optimal administration protocols for a cancer vaccine. A neural network is used to improve both the crossover and mutation operators. Preliminary results suggest that this approach could lead to better administration protocols with a similar computational effort.
在癌症免疫学领域,新型预防和治疗性候选疫苗的开发带来了非常有希望的抗肿瘤效果,可以充分保护肿瘤,同时减少实际常规治疗的典型副作用。然而,这种治疗需要持续的、终生的管理程序来保持保护。随着保护期和相对给药次数的增加,在时间和剂量上寻找最佳给药方案的问题变得越来越复杂。这样的问题通常不能在体内实验中解决,因为在时间、金钱和人员方面的成本将是令人望而却步的。我们提出了一种结合机器学习和并行遗传算法的混合方法,以加强对癌症疫苗最佳给药方案的计算机研究。利用神经网络对交叉算子和变异算子进行改进。初步结果表明,使用这种方法可以使用类似的计算工作带来更好的管理协议。
{"title":"Combining Parallel Genetic Algorithms and Machine Learning to Improve the Research of Optimal Vaccination Protocols","authors":"M. Pennisi, G. Russo, F. Pappalardo","doi":"10.1109/PDP2018.2018.00070","DOIUrl":"https://doi.org/10.1109/PDP2018.2018.00070","journal":{"name":"2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)"},"publicationDate":"2018-03-21","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"132818672"}
Citations: 4