
2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP): Latest Publications

GPU-Powered Multi-Swarm Parameter Estimation of Biological Systems: A Master-Slave Approach
A. Tangherloni, L. Rundo, S. Spolaor, P. Cazzaniga, Marco S. Nobile
In silico investigation of biological systems requires the knowledge of numerical parameters that cannot be easily measured in laboratory experiments, leading to the Parameter Estimation (PE) problem, in which the unknown parameters are automatically inferred by means of optimization algorithms exploiting the available experimental data. Here we present MS2PSO, an efficient parallel and distributed implementation of a PE method based on Particle Swarm Optimization (PSO) for the estimation of reaction constants in mathematical models of biological systems, considering as target for the estimation a set of discrete-time measurements of molecular species amounts. In particular, this PE method accounts for the availability of experimental data typically measured under different experimental conditions by considering a multi-swarm PSO in which the best particles of the swarms can migrate. This strategy makes it possible to infer a common set of reaction constants that simultaneously fits all target data used in the PE. To efficiently tackle the PE problem, MS2PSO embeds the execution of cupSODA, a deterministic simulator that relies on Graphics Processing Units to achieve a massive parallelization of the simulations required in the fitness evaluation of particles. In addition, a further level of parallelism is realized by exploiting the Master-Slave distributed programming paradigm. We apply MS2PSO to the PE of synthetic biochemical models with 10, 20 and 30 parameters to be estimated, and compare the performance obtained with different GPUs and different configurations (i.e., numbers of processes) of the Master-Slave scheme.
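As a rough illustration of the multi-swarm strategy described above, the sketch below runs several independent PSO swarms on a toy fitness function and periodically migrates each swarm's best particle into a neighbouring swarm. All names, constants and the sphere fitness are illustrative assumptions; the actual MS2PSO evaluates fitness through GPU-accelerated cupSODA simulations and distributes swarms over separate processes.

```python
# Minimal single-threaded sketch of multi-swarm PSO with best-particle migration.
# The sphere fitness stands in for the simulation-based fitness of MS2PSO.
import numpy as np

def fitness(x):
    # Toy stand-in for the distance between simulated and measured species amounts.
    return np.sum(x ** 2, axis=-1)

def multi_swarm_pso(n_swarms=4, n_particles=32, dim=10, iters=200,
                    migrate_every=20, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, (n_swarms, n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = fitness(pbest)                         # shape (n_swarms, n_particles)
    for it in range(iters):
        gbest_idx = pbest_val.argmin(axis=1)
        gbest = pbest[np.arange(n_swarms), gbest_idx]  # per-swarm best, (n_swarms, dim)
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest[:, None, :] - pos)
        pos = pos + vel
        val = fitness(pos)
        improved = val < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = val[improved]
        # Migration: every migrate_every iterations, each swarm's best particle
        # replaces the worst particle of the next swarm (ring topology).
        if (it + 1) % migrate_every == 0:
            worst_idx = pbest_val.argmax(axis=1)
            for s in range(n_swarms):
                t = (s + 1) % n_swarms
                pbest[t, worst_idx[t]] = gbest[s]
                pbest_val[t, worst_idx[t]] = fitness(gbest[s])
    best = pbest_val.argmin()
    return pbest.reshape(-1, dim)[best], pbest_val.min()

if __name__ == "__main__":
    params, err = multi_swarm_pso()
    print("best fitness:", err)
```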
Citations: 7
Analysis of the Impact Factors on Data Error Propagation in HPC Applications
G. Utrera, Marisa Gil, X. Martorell
Algorithmic codes for scientific computing may exhibit diverse levels of tolerance to memory errors, depending on the program behavior when accessing data. There are factors that can be controlled in an HPC program and may influence the degree of tolerance to memory errors. A characterization of the degree of vulnerability an application exhibits can help to improve its security as well as save time and resources. In this work, we study the main factors that may have an impact on the propagation of errors originating from memory accesses.
Citations: 2
Structured Grid-Based Parallel Simulation of a Simple DEM Model on Heterogeneous Systems
A. Rango, Pietro Napoli, D. D'Ambrosio, W. Spataro, A. D. Renzo, F. Maio
Here we present different preliminary parallel grid-based implementations of a simple particle system with the purpose of evaluating their performance on multi- and many-core computational devices. The system is modeled by means of the Discrete Element Method and the Extended Cellular Automata formalism, while OpenMP and OpenCL are used for parallelization. In particular, both the 3.1 and 4.5 OpenMP specifications have been considered, the latter of which can also run on many-core computational devices such as GPUs. The results of a first test simulation, performed by considering a cubic domain with about 316,000 particles, have shown a clear advantage of OpenCL on the considered Tesla K40 Nvidia GPU, while the OpenMP 3.1 implementation has performed better than the corresponding OpenMP 4.5 implementation on the considered Intel Xeon E5-2650 16-thread CPU.
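The structured-grid representation at the core of this approach can be illustrated with a small, purely sequential sketch: particles are binned into uniform cells, and interaction candidates are searched only in neighbouring cells, which is the property that makes the per-cell update amenable to OpenMP and OpenCL parallelization. Function names and sizes below are assumptions, not the authors' code.

```python
# Bin particles into a uniform 3D grid and collect candidate interaction pairs
# from the 27-cell neighbourhood of each cell.
import numpy as np
from collections import defaultdict
from itertools import product

def neighbour_candidates(positions, cell_size):
    """Return candidate pairs (i, j) whose cells are identical or adjacent."""
    cells = defaultdict(list)
    for i, p in enumerate(positions):
        cells[tuple((p // cell_size).astype(int))].append(i)
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in cells.get((cx + dx, cy + dy, cz + dz), []):
                for i in members:
                    if i < j:          # count each pair once
                        pairs.add((i, j))
    return pairs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.uniform(0.0, 1.0, size=(1000, 3))
    print(len(neighbour_candidates(pos, cell_size=0.1)), "candidate pairs")
```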
Citations: 6
Geo-Distributed BigData Processing for Maximizing Profit in Federated Clouds Environment
Thouraya Gouasmi, Wajdi Louati, A. Kacem
Managing and processing BigData in geo-distributed datacenters has gained much attention in recent years. Despite the increasing attention on this topic, most efforts have been focused on user-centric solutions, and unfortunately much less on the difficulties encountered by Cloud providers to improve their profits. A highly efficient framework for geo-distributed BigData processing in a cloud federation environment is crucial for maximizing the profit of cloud providers. The objective of this paper is to maximize the profit for cloud providers by minimizing costs and penalties. This work proposes to move computation to the geo-distributed data, outsourcing only the required data to idle resources of the federated clouds, in order to minimize job costs, and proposes a dynamic job-reordering approach to minimize penalty costs. The performance evaluation shows that the proposed algorithm can maximize profit, reduce MapReduce job costs and improve the utilization of cluster resources.
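As a hedged illustration of the job-reordering idea (not the paper's actual policy), the sketch below orders jobs by their penalty rate relative to processing time, a classical rule that minimizes total weighted completion time when every job accrues a penalty proportional to how long it waits; the job fields are assumed for the example.

```python
# Reorder jobs so that high-penalty, short jobs run first (Smith's rule).
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    proc_time: float     # estimated MapReduce job running time
    penalty_rate: float  # cost per time unit the job's completion is delayed

def reorder(jobs):
    """Order that minimizes the sum of penalty_rate * completion_time."""
    return sorted(jobs, key=lambda j: j.penalty_rate / j.proc_time, reverse=True)

def total_penalty(jobs):
    t, cost = 0.0, 0.0
    for j in jobs:
        t += j.proc_time
        cost += j.penalty_rate * t
    return cost

if __name__ == "__main__":
    jobs = [Job("A", 10, 1.0), Job("B", 2, 5.0), Job("C", 5, 2.0)]
    print("FIFO penalty     :", total_penalty(jobs))
    print("reordered penalty:", total_penalty(reorder(jobs)))
```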
Citations: 2
Stingray-HPC: A Scalable Parallel Seismic Raytracing System
Mohammad Alaul Haque Monil, A. Malony, D. Toomey, K. Huck
The Stingray raytracer was developed for marine seismology to compute minimum travel time from all sources in an earth model to determine the 3D geophysical structure below the ocean floor. The original sequential implementation of Stingray used Dijkstra's single-source, shortest-path (SSSP) algorithm. A data parallel version of Stingray was developed based on the Bellman-Ford-Moore iterative SSSP algorithm. Single node experiments demonstrated performance improvements from parallelization with multicore (using OpenMP) and manycore processors (using CUDA). Calculating seismic ray paths for larger earth models requires distributed, multi-node algorithms utilizing domain decomposition methods. Preliminary 2D decomposition strategies show promising scaling results. However, a general 3D decomposition methodology is needed to handle any seismic raytracing problem on any HPC computing platform. In this paper, we present Stingray-HPC, a framework for scalable seismic raytracing which can automatically decompose a 3D earth model across nodes in a distributed environment, allocate ghost cell regions for iterative updates, coordinate ghost cell communications, and test for global convergence. Stingray-HPC is implemented with MPI and either OpenMP or CUDA for node-level calculations. Our results validate Stingray-HPC's ability to handle large models (over a billion points) and to solve these models efficiently at scale up to 512 GPU nodes.
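The iterative Bellman-Ford-Moore scheme underlying the data-parallel version can be sketched in a few lines: every edge is relaxed repeatedly until no travel time changes, which mirrors the global convergence test that Stingray-HPC performs across subdomains. The toy graph below stands in for the 3D earth model, whose partitions exchange ghost cells via MPI in the real system.

```python
# Iterative (Bellman-Ford-Moore style) single-source shortest path with a
# convergence test: keep relaxing until no distance improves.
import numpy as np

def iterative_sssp(n_nodes, edges, source):
    """edges: list of (u, v, weight); returns minimum travel time per node."""
    dist = np.full(n_nodes, np.inf)
    dist[source] = 0.0
    while True:
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
            if dist[v] + w < dist[u]:      # undirected: relax both directions
                dist[u] = dist[v] + w
                changed = True
        if not changed:                    # global convergence reached
            break
    return dist

if __name__ == "__main__":
    edges = [(0, 1, 2.0), (1, 2, 1.5), (0, 2, 5.0), (2, 3, 1.0)]
    print(iterative_sssp(4, edges, source=0))
```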
Citations: 2
Memory-Aware Tree Partitioning on Homogeneous Platforms
Changjiang Gou, A. Benoit, L. Marchal
Scientific applications are commonly modeled as the processing of directed acyclic graphs of tasks, and for some of them, the graph takes the special form of a rooted tree. This tree expresses both the computational dependencies between tasks and their storage requirements. The problem of scheduling/traversing such a tree on a single processor to minimize its memory footprint has already been widely studied. Hence, we move to parallel processing and study how to partition the tree for a homogeneous multiprocessor platform, where each processor is equipped with its own memory. We formally state the problem of partitioning the tree into subtrees such that each subtree can be processed on a single processor and the total resulting processing time is minimized. We prove that the problem is NP-complete, and we design polynomial-time heuristics to address it. An extensive set of simulations demonstrates the usefulness of these heuristics.
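To make the problem statement concrete, the following sketch shows one simple bottom-up greedy of the same flavour: it cuts off child subtrees, heaviest first, whenever the memory still attached to a node would exceed a single processor's capacity. This is only an illustration under the assumption that no single task exceeds the capacity; the polynomial-time heuristics designed and evaluated in the paper are different.

```python
# Greedy bottom-up cut of a rooted task tree into memory-feasible subtrees.
def partition_tree(children, mem, root, capacity):
    """children: dict node -> list of children; mem: dict node -> memory need
    (each assumed <= capacity). Returns the subtree roots of the parts; the
    memory attached to each part fits within `capacity`."""
    parts = []

    def visit(node):
        # Memory of node's subtree that is still attached after cutting below it.
        attached = [(visit(c), c) for c in children.get(node, [])]
        total = mem[node] + sum(m for m, _ in attached)
        attached.sort(key=lambda e: e[0])   # heaviest children are popped first
        while total > capacity and attached:
            m, c = attached.pop()           # largest remaining child subtree
            parts.append(c)                 # it becomes its own part
            total -= m
        return total

    visit(root)
    parts.append(root)                      # whatever stays attached to the root
    return parts

if __name__ == "__main__":
    children = {0: [1, 2], 1: [3, 4], 2: [5]}
    mem = {0: 2, 1: 3, 2: 1, 3: 4, 4: 2, 5: 6}
    print(partition_tree(children, mem, root=0, capacity=8))   # e.g. [3, 2, 0]
```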
Citations: 2
Scalable Mapping of Streaming Applications onto MPSoCs Using Optimistic Mixed Integer Linear Programming
Neela Gayen, J. Ax, Martin Flasskamp, Christian Klarhorst, T. Jungeblut, Maolin Tang, W. Kelly
Embedded streaming applications are facing increasingly demanding performance requirements in terms of throughput. A common mechanism for providing high compute power with a low energy budget is to use a very large number of low-power cores, often in the form of a Massively Parallel System on Chip (MPSoC). The challenge with programming such massively parallel systems is deciding how to optimally map the computation to individual cores for maximizing throughput. In this work we present an automatic parallelizing compiler for the StreamIt programming language that efficiently and effectively maps computation to individual cores. The compiler must be effective, meaning that it does a good job of optimizing for throughput, but also efficient, in that the time taken to find such a mapping must scale well as the number of cores and the size of the StreamIt program increase. We improve on previous work that used Integer Linear Programming (ILP) to map StreamIt programs to multicore systems by formulating the mapping problem in a different way, using mostly real rather than integer variables. Using this so-called Mixed Integer Linear Programming (MILP) dramatically reduces the cost compared to standard ILP. The alternative formulation creates what we call an optimistic solution that we then adjust slightly to obtain a final feasible solution. We show that this new approach is always close, if not better, in terms of effectiveness, while being dramatically better in terms of scalability and efficiency.
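The relax-then-adjust idea can be illustrated with a generic load-balancing formulation (not the paper's exact MILP for StreamIt graphs): the 0/1 task-to-core assignment variables are relaxed to the interval [0, 1], the resulting LP is solved cheaply, and the fractional solution is rounded to a feasible integral mapping. The sketch below assumes SciPy's linprog is available; task weights and core counts are toy values.

```python
# "Optimistic" LP relaxation of task-to-core mapping, then rounding.
import numpy as np
from scipy.optimize import linprog

def optimistic_mapping(work, n_cores):
    n = len(work)
    n_x = n * n_cores                       # x[i, j] flattened, plus makespan T
    c = np.zeros(n_x + 1)
    c[-1] = 1.0                             # minimize T (the throughput bottleneck)
    # Each task is assigned exactly once: sum_j x[i, j] == 1
    A_eq = np.zeros((n, n_x + 1))
    for i in range(n):
        A_eq[i, i * n_cores:(i + 1) * n_cores] = 1.0
    b_eq = np.ones(n)
    # Each core's load stays below T: sum_i work[i] * x[i, j] - T <= 0
    A_ub = np.zeros((n_cores, n_x + 1))
    for j in range(n_cores):
        for i in range(n):
            A_ub[j, i * n_cores + j] = work[i]
        A_ub[j, -1] = -1.0
    b_ub = np.zeros(n_cores)
    bounds = [(0, 1)] * n_x + [(0, None)]   # relaxed 0/1 variables, T >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    x = res.x[:n_x].reshape(n, n_cores)
    # Adjustment step: place each task on the core with the largest fractional share.
    return x.argmax(axis=1), res.x[-1]

if __name__ == "__main__":
    mapping, lower_bound = optimistic_mapping(work=[3, 1, 4, 1, 5, 9, 2, 6], n_cores=3)
    print("mapping:", mapping, "LP lower bound on makespan:", lower_bound)
```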
Citations: 2
Improving Communication and Load Balancing with Thread Mapping in Manycore Systems
E. Cruz, M. Diener, M. Serpa, P. Navaux, L. Pilla, I. Koren
Communication and load balancing have a significant impact on the performance of parallel applications and have been the subject of extensive research in multicore architectures. Thread mapping has been one of the solutions adopted in multicore architectures to address both communication and load balancing. However, the impact of such issues on more recently introduced manycore architectures is still unknown. Most related work on manycore architectures focuses on execution time and idleness information for scheduling decisions. In this paper, we improve the state of the art by performing a very detailed analysis of the impact of thread mapping on communication and load balancing in two manycore systems from Intel, namely Knights Corner and Knights Landing. We observed that the widely used metric of CPU time provides very inaccurate information for load balancing. We also evaluated the usage of thread mapping based on the communication and load information of the applications to improve the performance of manycore systems.
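A minimal sketch of what communication-aware thread mapping looks like is given below: given a thread-to-thread communication matrix and a core-to-core distance matrix, each thread is greedily placed on the free core that minimizes its weighted distance to the threads already placed. This is a generic illustration; the mappings evaluated in the paper also account for load information, and the matrices here are toy assumptions.

```python
# Greedy communication-aware placement of threads onto cores.
import numpy as np

def greedy_mapping(comm, dist):
    """comm[i, k]: bytes exchanged by threads i and k; dist[j, l]: hop distance
    between cores j and l. Returns the core index chosen for each thread."""
    n = comm.shape[0]
    placement = np.full(n, -1)
    free_cores = set(range(dist.shape[0]))
    # Start with the most communication-intensive thread on core 0.
    order = np.argsort(-comm.sum(axis=1))
    placement[order[0]] = 0
    free_cores.remove(0)
    for t in order[1:]:
        placed = np.where(placement >= 0)[0]
        best_core = min(
            free_cores,
            key=lambda c: sum(comm[t, p] * dist[c, placement[p]] for p in placed),
        )
        placement[t] = best_core
        free_cores.remove(best_core)
    return placement

if __name__ == "__main__":
    comm = np.array([[0, 9, 1, 0], [9, 0, 1, 0], [1, 1, 0, 8], [0, 0, 8, 0]])
    dist = np.array([[0, 1, 2, 2], [1, 0, 2, 2], [2, 2, 0, 1], [2, 2, 1, 0]])
    print(greedy_mapping(comm, dist))
```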
Citations: 7
Low Precision Deep Learning Training on Mobile Heterogeneous Platform
Olivier Valery, Pangfeng Liu, Jan-Jan Wu
Recent advances in System-on-Chip architectures have made the use of deep learning suitable for a number of applications on mobile devices. Unfortunately, due to the computational cost of neural network training, deep learning on mobile devices is often limited to inference tasks, e.g., prediction. In this paper, we propose a deep learning framework that enables both training and inference tasks on mobile devices. While accommodating the heterogeneity of computing technology on mobile devices, it uses OpenCL to efficiently leverage modern SoC capabilities, e.g., multi-core CPU, integrated GPU and shared memory architecture, and to accelerate deep learning computation. In addition, our system encodes the arithmetic operations of deep networks down to 8-bit fixed-point on mobile devices. As a proof of concept, we trained three well-known neural networks on mobile devices and observed significant performance gains, energy consumption reductions, and memory savings.
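The 8-bit fixed-point encoding mentioned in the abstract can be sketched as follows: values are scaled by a power of two, rounded and saturated to signed 8-bit integers, and converted back after the integer arithmetic. The number of fractional bits and the saturation policy below are illustrative assumptions, not the framework's actual choices.

```python
# Quantize floating-point values to signed 8-bit fixed point and back.
import numpy as np

def to_fixed(x, frac_bits=5):
    scaled = np.round(x * (1 << frac_bits))
    return np.clip(scaled, -128, 127).astype(np.int8)   # saturate to the int8 range

def to_float(q, frac_bits=5):
    return q.astype(np.float32) / (1 << frac_bits)

if __name__ == "__main__":
    w = np.array([0.75, -1.5, 0.03125, 2.0], dtype=np.float32)
    q = to_fixed(w)
    print("quantized:", q)
    print("reconstructed:", to_float(q))
    print("max error:", np.abs(w - to_float(q)).max())   # bounded by 2**-(frac_bits+1)
```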
Citations: 6
Scheduler Accelerator for TDMA Data Centers
I. Patronas, Nikolaos Gkatzios, V. Kitsakis, D. Reisis, K. Christodoulopoulos, Emmanouel Varvarigos
Today's data center networks depend on optical switching to overcome the scalability limitations of traditional architectures. All-optical networks most often use slotted Time Division Multiple Access (TDMA) operation; their buffers are located at the optical network edges, and their organization relies on effective scheduling of the TDMA frames to achieve efficient sharing of the network resources and collision-free network operation. Scheduling decisions have to be taken in real time, a process that becomes computationally demanding as the network size increases. Accelerators provide a solution, and this paper proposes a scheduler accelerator for a data center network divided into points of delivery (pods) of racks that exploits hybrid electro-optical top-of-rack (ToR) switches accessing an all-optical inter-rack network. The scheduler accelerator is a parallel, scalable architecture with application-specific processing engines. Case studies of 2-, 4-, 8- and 16-processor configurations are presented for the processing of all TDMA time-slot transfer requests for 512 and 1024 ToR network nodes. The architecture is realized on a Xilinx VC707 board to validate the results.
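The scheduling problem the accelerator targets can be illustrated with a simple first-fit greedy (a software sketch, not the paper's hardware algorithm): transfer requests between ToR switches are packed into TDMA slots so that no source and no destination appears twice in the same slot. The request format is an assumption for the example.

```python
# First-fit packing of (source, destination) transfer requests into TDMA slots
# such that each slot is collision-free.
def schedule_slots(requests):
    """requests: list of (src, dst) pairs. Returns a list of slots, each a
    collision-free list of (src, dst) transfers."""
    slots = []
    for src, dst in requests:
        for slot in slots:
            if all(src != s and dst != d for s, d in slot):
                slot.append((src, dst))
                break
        else:
            slots.append([(src, dst)])     # no existing slot fits: open a new one
    return slots

if __name__ == "__main__":
    reqs = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0), (1, 0)]
    for i, slot in enumerate(schedule_slots(reqs)):
        print("slot", i, ":", slot)
```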
Citations: 5