
2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP): Latest Papers

Performability Analysis of Mesh-Based NoCs Using Markov Reward Model
Jie Hou, M. Radetzki
Technology scaling makes it possible to implement systems with hundreds of processing cores, and thousands in the future. Communication in such systems is enabled by Networks-on-Chip (NoCs). A downside of technology scaling is the increased susceptibility of NoC resources to failures. Ensuring reliable operation despite such failures degrades NoC performance and may even invalidate the performance benefits expected from scaling. Thus, it is not enough to analyze performance and reliability in isolation, as is usually done. Instead, we suggest treating both aspects together using the concept of performability and analyzing it with Markov reward models. Our methodology is exemplified for mesh NoCs and transient faults but can be transferred to other topologies and fault models. We investigate how performability develops with scaling towards larger NoCs and explore the limits of scaling by determining the break-even failure rates under which scaling can achieve a net performance increase.
DOI: https://doi.org/10.1109/PDP2018.2018.00102. Published: 2018-03-21.
Citations: 2
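The idea of a Markov reward model in the entry above can be illustrated with a toy case. The sketch below is not the authors' model: it assumes a hypothetical two-state continuous-time Markov chain (fault-free and degraded), with made-up failure rate, repair rate, and per-state rewards standing in for NoC throughput. Performability is then the steady-state expected reward, and a break-even failure rate is the rate at which that expectation falls to a chosen baseline.

```python
"""Two-state Markov reward model: a toy illustration of performability."""


def steady_state(fail_rate: float, repair_rate: float) -> tuple[float, float]:
    """Return (p_ok, p_degraded) for a two-state continuous-time Markov chain."""
    total = fail_rate + repair_rate
    return repair_rate / total, fail_rate / total


def performability(fail_rate, repair_rate, reward_ok, reward_degraded):
    """Expected steady-state reward: sum over states of probability * reward."""
    p_ok, p_degraded = steady_state(fail_rate, repair_rate)
    return p_ok * reward_ok + p_degraded * reward_degraded


def break_even_fail_rate(repair_rate, reward_ok, reward_degraded, baseline):
    """Failure rate at which the expected reward drops to `baseline`.

    Solves (mu * r_ok + lam * r_deg) / (lam + mu) = baseline for lam,
    which requires reward_degraded < baseline < reward_ok.
    """
    if not reward_degraded < baseline < reward_ok:
        raise ValueError("baseline must lie strictly between the two rewards")
    return repair_rate * (reward_ok - baseline) / (baseline - reward_degraded)


if __name__ == "__main__":
    # Hypothetical numbers: reward 100 when fault-free, 20 when degraded.
    print(performability(1.0, 3.0, 100.0, 20.0))
    print(break_even_fail_rate(3.0, 100.0, 20.0, 80.0))
```

In this toy setting, a larger NoC would be worth building only while its failure rate stays below the break-even value computed against the smaller NoC's expected reward.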
Novel Application of Parallel Computing Techniques in Soft X-Rays Plasma Measurement Systems for the WEST Experimental Thermal Fusion Reactor
R. Krawczyk, P. Linczuk, A. Wojeński, K. Poźniak, G. Kasprowicz, Wojciech Zabolotny, M. Gąska, D. Mazon, A. Jardin, T. Czarski, P. Kolasiński, M. Chernyshova, E. Kowalska-Strzeciwilk, K. Malinowski
The article presents results of a novel approach that combines high-performance and parallel computing solutions with front-end electronics to develop a scalable, specialized soft X-ray measurement tool for large-scale plasma physics experiments with thermal fusion devices. To meet the need for easily modifiable advanced diagnostics of tokamak hot plasma content, a heterogeneous system consisting of FPGAs and a PC server was introduced. The objective is to provide data quality monitoring and evaluation mechanisms along with an algorithm benchmarking tool for fast, low-latency measurements of soft X-rays emitted by hot tokamak plasma. The article describes a method for developing the computation pipeline on the server side. The novel parallel algorithms and results are discussed. This new approach aims to adapt HPC techniques to new areas of science, where comprehensive low-latency measurements and instrumentation are increasingly desired. The presented solution is deployed in the operational tokamak WEST (Tungsten Environment in Steady-State Tokamak) in collaboration with the French Alternative Energies and Atomic Energy Commission (CEA), Cadarache, France.
DOI: https://doi.org/10.1109/PDP2018.2018.00024. Published: 2018-03-21.
Citations: 5
Data-Layout Reorganization for an Efficient Intra-Node Assembly of a Spectral Finite-Element Method
Gauthier Sornet, S. Jubertie, F. Dupros, F. D. Martin, P. Thierry, Sébastien Limet
The Finite-Element Method (FEM) is routinely used to solve Partial Differential Equations (PDE) in various scientific domains. For seismic wave modeling, the Spectral Element Method (SEM), which is a specific formulation of the classical FEM approach, has gained significant attention over the last two decades. This is explained both by the very good numerical accuracy of this method and by the parallel performance of classical MPI-based implementations that scale up to several tens of thousands of computing cores. Nevertheless, the trend towards processors with an increasing level of low-level parallelism requires significant efforts at the shared-memory level. One major bottleneck comes from the standard FEM assembly phase, which leads to a significant amount of irregular memory accesses. This prevents, for instance, any efficient automatic optimization by the compiler. In this paper, we extract a kernel from a spectral-element application dedicated to earthquake simulations in complex geological media (the EFISPEC code developed at BRGM, the French Geological Survey). We study the intra-node behavior and propose different levels of optimization (data layout, manual vectorization, multi-threading) to fully benefit from SIMD units and NUMA architectures. Experiments performed on the Intel Broadwell architecture show that the proposed optimizations dramatically improve the intra-node performance of the mini-application. Moreover, our results show a good match with roofline theoretical performance models. We believe that these optimizations are not specific to this mini-application and may be implemented in different SEM- and FEM-based solvers as well.
DOI: https://doi.org/10.1109/PDP2018.2018.00043. Published: 2018-03-21.
Citations: 1
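To make the two ideas named in the entry above concrete (a data-layout change and the irregular accesses of FEM assembly), here is a minimal sketch. It is not taken from EFISPEC or the paper: the function names, the array-of-structures to structure-of-arrays conversion, and the tiny connectivity table are all illustrative assumptions, and plain Python only shows the access pattern, not the SIMD benefit.

```python
from array import array


def aos_to_soa(points):
    """Split a list of (x, y, z) records into three contiguous arrays."""
    xs = array("d", (p[0] for p in points))
    ys = array("d", (p[1] for p in points))
    zs = array("d", (p[2] for p in points))
    return xs, ys, zs


def assemble(connectivity, local_values, n_nodes):
    """Scatter-add element-local values into a global nodal array.

    connectivity[e][i] is the global node index of local node i of element e.
    The indirect write global_values[node] is the irregular access pattern
    that makes automatic vectorization hard.
    """
    global_values = array("d", [0.0] * n_nodes)
    for elem_nodes, elem_vals in zip(connectivity, local_values):
        for node, val in zip(elem_nodes, elem_vals):
            global_values[node] += val
    return global_values


if __name__ == "__main__":
    xs, ys, zs = aos_to_soa([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])
    print(list(xs), list(ys), list(zs))
    # Two 1D elements sharing node 1 (hypothetical mesh).
    print(list(assemble([[0, 1], [1, 2]], [[1.0, 2.0], [3.0, 4.0]], 3)))
```

In a compiled language, the structure-of-arrays layout lets consecutive SIMD lanes load consecutive values of one field, which is the general motivation for this kind of reorganization.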
Getmewhere: A Location-Based Privacy-Preserving Information Service
G. Bella, Francesco Marino, Gianpiero Costantino, F. Martinelli
Mobile users have got used to getting useful information while they are literally on the move. An implication of this habit is that certain live information, such as that for navigation, for dating and for handling emergencies, should be tailored to the user's current location. While this is technically feasible with current technology, it raises concerns about the user's location privacy. To address the delicate tradeoff between the user's location privacy and the appropriateness of the information for that location, this paper discusses three information delivery protocols. One is Android's widely adopted protocol; the other two are the authors' novel ones, termed the AL protocol and the LBPP protocol respectively. The former conceals the user's location within a geographical area, while the latter employs secure two-party computation. The privacy of all protocols is analysed, motivating the choice to implement the LBPP protocol. It is made available as the "Getmewhere" service for the reader to download.
DOI: https://doi.org/10.1109/PDP2018.2018.00089. Published: 2018-03-21.
Citations: 0
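The entry above describes one protocol as concealing the user's location within a geographical area. The abstract does not specify how, so the following is only a generic spatial-cloaking sketch under stated assumptions: the function name `cloak`, the square grid, and the 0.5 degree cell size are all hypothetical and are not the AL protocol itself. The client reports the corner of the grid cell it is in rather than its exact coordinates.

```python
import math


def cloak(lat: float, lon: float, cell_deg: float = 0.5):
    """Return the south-west corner of the grid cell containing (lat, lon).

    Every position inside the same cell maps to the same reported point,
    so the service learns the cell but not the exact position.
    """
    return (math.floor(lat / cell_deg) * cell_deg,
            math.floor(lon / cell_deg) * cell_deg)


if __name__ == "__main__":
    print(cloak(45.123, -9.87))
```

A larger cell hides the user among more possible positions but makes the returned information less specific to where the user actually is, which is the tradeoff the abstract refers to.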
GPU Enabled Serverless Computing Framework
T. Jun, Daeyoun Kang, Dohyeun Kim, Daeyoung Kim
A new form of cloud computing, serverless computing, is drawing attention as a new way to design micro-service architectures. In a serverless computing environment, services are developed as service functional units. The function development environment of all current serverless computing frameworks is CPU-based. In this paper, we propose a GPU-supported serverless computing framework that can deploy services faster than existing serverless computing frameworks that use CPUs. Our core approach is to integrate an open source serverless computing framework with NVIDIA-Docker and deploy services in containers with GPU support. We have developed an API that connects the open source framework to NVIDIA-Docker, together with commands that enable GPU programming. In our experiments, we measured the performance of the framework in various environments. As a result, developers who want to build services with the framework can deploy high-performance micro-services, and developers who want to run deep learning programs without a local GPU environment can run code on remote GPUs with little performance degradation.
DOI: https://doi.org/10.1109/PDP2018.2018.00090. Published: 2018-03-21.
Citations: 25
Business Model of a Botnet
C. Putman, Abhishta Abhishta, L. Nieuwenhuis
Botnets continue to be an active threat against companies and individuals worldwide. Previous research on botnets has unveiled information on how these systems and their stakeholders operate, but insight into the economic structure that supports these stakeholders is lacking. The objective of this research is to analyse the business model and determine the revenue stream of a botnet owner. We also study the botnet life-cycle and determine the costs associated with it on the basis of four case studies. We conclude that building a full-scale cyber army from scratch is very expensive, whereas acquiring a previously developed botnet costs little. We find that initial setup and monthly costs were minimal compared to total revenue.
DOI: https://doi.org/10.1109/PDP2018.2018.00077. Published: 2018-03-21.
Citations: 31
Extending PluTo for Multiple Devices by Integrating OpenACC
Tim Süß, Tunahan Kaya, Dustin Feld
For many years now, processor vendors have increased the performance of their devices by adding more cores and wider vectorization units to their CPUs instead of scaling up the processors' clock frequency. Moreover, GPUs have become popular for solving problems with even more parallel compute power. To exploit the full potential of modern compute devices, specific codes are necessary, and these are often written in a hardware-specific manner. Usually, the codes for CPUs are not usable for GPUs and vice versa. The programming API OpenACC tries to close this gap by enabling one code base to be suitable and optimized for many devices. Nevertheless, OpenACC is rarely used by 'standard programmers', and while different code transformers (like PluTo) allow for (semi-)automatic code parallelization for multi-core CPUs, they generally do not support OpenACC yet. We present the first promising results of our PluTo extension that generates parallelized codes using OpenACC. Using our transformer, we create programs which exploit the parallelism of different platforms without any manual modifications, and we achieve performance speedups of up to 100 in comparison to the original unoptimized programs and accelerations of 2.05 in comparison to equally generated OpenMP codes.
DOI: https://doi.org/10.1109/PDP2018.2018.00049. Published: 2018-03-21.
Citations: 0
Structured Grid-Based Parallel Simulation of a Simple DEM Model on Heterogeneous Systems
A. Rango, Pietro Napoli, D. D'Ambrosio, W. Spataro, A. D. Renzo, F. Maio
Here we present different preliminary parallel grid-based implementations of a simple particle system in order to evaluate its performance on multi- and many-core computational devices. The system is modeled by means of the Discrete Element Method and the Extended Cellular Automata formalism, while OpenMP and OpenCL are used for parallelization. In particular, both the 3.1 and 4.5 OpenMP specifications have been considered, the latter also being able to run on many-core computational devices like GPUs. The results of a first test simulation of a cubic domain with about 316,000 particles have shown a clear advantage of OpenCL on the considered Nvidia Tesla K40 GPU, while the OpenMP 3.1 implementation has performed better than the corresponding OpenMP 4.5 implementation on the considered Intel Xeon E5-2650 16-thread CPU.
DOI: https://doi.org/10.1109/PDP2018.2018.00099. Published: 2018-03-21.
Citations: 6
Accelerating Blockchain Search of Full Nodes Using GPUs
Shin Morishima, Hiroki Matsutani
Blockchain is a distributed ledger system based on P2P network and originally used for a crypto currency system. The P2P network of Blockchain is maintained by full nodes which are in charge of verifying all the transactions in the network. However, most Blockchain user nodes do not act as full nodes, because workload of full nodes is quite high for personal mobile devices. Blockchain search queries, such as confirming balance, transaction contents, and transaction histories, from many users go to the full nodes. As a result, search throughput of full nodes would be a new bottleneck of Blockchain system, because the number of full nodes is less than the number of users of Blockchain systems. In this paper, we propose an acceleration method of Blockchain search using GPUs. More specifically, we introduce an array-based Patricia tree structure suitable for GPU processing so that we can make effective use of Blockchain feature that there are no update and delete queries. In the evaluations, the proposed method is compared with an existing GPU-based key-value search and a conventional CPU-based search in terms of the throughput of Blockchain key search. As a result, the throughput of our proposal is 3.4 times higher than that of the existing GPU-based search and 14.1 times higher than that of the CPU search when the number of keys is 80 ×2^20 and the key length is 256-bit in Blockchain search queries.
区块链是一个基于P2P网络的分布式账本系统,最初用于加密货币系统。区块链的P2P网络由全节点维护,全节点负责对网络中的所有交易进行验证。但是,大多数区块链用户节点不充当完整节点,因为对于个人移动设备来说,完整节点的工作负载相当高。区块链搜索查询,如确认余额、事务内容和事务历史记录,将从许多用户转到完整节点。因此,全节点的搜索吞吐量将成为区块链系统的新瓶颈,因为全节点的数量少于区块链系统的用户数量。本文提出了一种利用gpu加速区块链搜索的方法。更具体地说,我们引入了一个适合GPU处理的基于数组的Patricia树结构,这样我们就可以有效地利用区块链特性,没有更新和删除查询。在评估中,就区块链键搜索的吞吐量与现有的基于gpu的键值搜索和传统的基于cpu的搜索进行了比较。因此,在区块链搜索查询中,当密钥数为80 ×2^20,密钥长度为256位时,我们提出的吞吐量是现有基于gpu的搜索的3.4倍,是CPU搜索的14.1倍。
Shin Morishima, Hiroki Matsutani. "Accelerating Blockchain Search of Full Nodes Using GPUs." 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2018-03-21. DOI: https://doi.org/10.1109/PDP2018.2018.00041
Citations: 26
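The array-based Patricia tree mentioned in the abstract above can be illustrated with a small CPU-side sketch. This is not the authors' GPU implementation or memory layout; the names `build_patricia`, `lookup`, and `demo_keys` are hypothetical, and the keys are SHA-256 digests standing in for 256-bit Blockchain keys. It only shows why a read-only workload helps: with no update or delete queries, the tree can be built once from sorted keys and stored in flat parallel arrays with integer indices instead of pointers.

```python
import hashlib
from bisect import bisect_left


def build_patricia(keys):
    """Build a static Patricia tree over distinct non-negative integer keys.

    Internal nodes live in three parallel arrays (bit, left, right).
    A child value >= 0 is an internal node index; a negative value is a
    leaf, encoded as the bitwise complement of an index into sorted keys.
    """
    keys = sorted(set(keys))
    bit, left, right = [], [], []

    def build(lo, hi):
        if hi - lo == 1:
            return ~lo  # single key: leaf
        # In a sorted range, the first and last key differ at the highest
        # bit position on which any two keys in the range differ.
        b = (keys[lo] ^ keys[hi - 1]).bit_length() - 1
        # Keys with bit b clear sort before keys with bit b set.
        threshold = (keys[hi - 1] >> b) << b
        mid = bisect_left(keys, threshold, lo, hi)
        node = len(bit)
        bit.append(b)
        left.append(0)
        right.append(0)
        left[node] = build(lo, mid)
        right[node] = build(mid, hi)
        return node

    root = build(0, len(keys)) if keys else None
    return keys, bit, left, right, root


def lookup(tree, key):
    """Return the index of key in the sorted key array, or -1 if absent."""
    keys, bit, left, right, root = tree
    if root is None:
        return -1
    node = root
    while node >= 0:
        # Test only the discriminating bit stored at this node.
        node = right[node] if (key >> bit[node]) & 1 else left[node]
    idx = ~node
    # Skipped bits were never checked, so confirm with a full comparison.
    return idx if keys[idx] == key else -1


# Hypothetical 256-bit keys standing in for transaction hashes.
demo_keys = [
    int.from_bytes(hashlib.sha256(str(i).encode()).digest(), "big")
    for i in range(1000)
]
tree = build_patricia(demo_keys)
```

Each lookup touches only integer arrays and ends with one full-key comparison, which is the kind of access pattern that could be mapped to one GPU thread per query; how the paper actually organizes the arrays on the GPU is not stated in the abstract.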
Analysis of the Impact Factors on Data Error Propagation in HPC Applications
G. Utrera, Marisa Gil, X. Martorell
Algorithmic codes for scientific computing may exhibit different levels of tolerance to memory errors, depending on how the program behaves when accessing data. Some factors that can be controlled in an HPC program may influence its degree of tolerance to memory errors. Characterizing the degree of vulnerability an application exhibits can help improve its security as well as save time and resources. In this work, we study some of the main factors that may affect the propagation of errors originating from memory accesses.
G. Utrera, Marisa Gil, X. Martorell. "Analysis of the Impact Factors on Data Error Propagation in HPC Applications." 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2018-03-21. DOI: https://doi.org/10.1109/PDP2018.2018.00092
Citations: 2