2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP): Latest Publications

GPU-Powered Multi-Swarm Parameter Estimation of Biological Systems: A Master-Slave Approach

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00115

A. Tangherloni, L. Rundo, S. Spolaor, P. Cazzaniga, Marco S. Nobile

In silico investigation of biological systems requires knowledge of numerical parameters that cannot be easily measured in laboratory experiments. This leads to the Parameter Estimation (PE) problem, in which the unknown parameters are automatically inferred by optimization algorithms that exploit the available experimental data. Here we present MS²PSO, an efficient parallel and distributed implementation of a PE method based on Particle Swarm Optimization (PSO) for estimating reaction constants in mathematical models of biological systems, using as the estimation target a set of discrete-time measurements of molecular species amounts. In particular, this PE method accounts for experimental data that are typically measured under different experimental conditions by using a multi-swarm PSO in which the best particles of the swarms can migrate. This strategy makes it possible to infer a common set of reaction constants that simultaneously fits all target data used in the PE. To tackle the PE problem efficiently, MS²PSO embeds the execution of cupSODA, a deterministic simulator that relies on Graphics Processing Units to massively parallelize the simulations required for the fitness evaluation of particles. A further level of parallelism is realized by exploiting the Master-Slave distributed programming paradigm. We apply MS²PSO to the PE of synthetic biochemical models with 10, 20 and 30 parameters to be estimated, and compare the performance obtained with different GPUs and different Master-Slave configurations (i.e., numbers of processes).
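
The multi-swarm idea in the abstract above can be illustrated with a small, CPU-only sketch. It is not the authors' implementation: the toy fitness (a weighted squared error around a hypothetical TRUE_K) stands in for the GPU simulations, and the inertia and acceleration coefficients, the ring-shaped migration, and all function names are assumptions chosen for illustration.

```python
import random

TRUE_K = [1.0, -2.0]  # hypothetical "true" constants; a real PE problem does not know these

def fitness(params, weights):
    """Toy stand-in for simulate-and-compare: weighted squared error under one condition."""
    return sum(w * (p - k) ** 2 for w, p, k in zip(weights, params, TRUE_K))

def multi_swarm_pso(conditions, n_particles=10, iters=100, migrate_every=10, seed=0):
    rng = random.Random(seed)
    dim = len(TRUE_K)
    # One swarm per experimental condition; each particle has a position and a velocity.
    swarms = [[{"x": [rng.uniform(-5, 5) for _ in range(dim)], "v": [0.0] * dim}
               for _ in range(n_particles)] for _ in conditions]
    for swarm in swarms:
        for p in swarm:
            p["best"] = list(p["x"])  # personal best starts at the initial position

    def swarm_best(swarm, weights):
        return min((p["best"] for p in swarm), key=lambda b: fitness(b, weights))

    for it in range(1, iters + 1):
        for swarm, weights in zip(swarms, conditions):
            gbest = list(swarm_best(swarm, weights))
            for p in swarm:
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    p["v"][d] = (0.7 * p["v"][d]
                                 + 1.5 * r1 * (p["best"][d] - p["x"][d])
                                 + 1.5 * r2 * (gbest[d] - p["x"][d]))
                    p["x"][d] += p["v"][d]
                if fitness(p["x"], weights) < fitness(p["best"], weights):
                    p["best"] = list(p["x"])
        if it % migrate_every == 0:
            # Migration: each swarm's best replaces the worst particle of the next swarm.
            bests = [list(swarm_best(s, w)) for s, w in zip(swarms, conditions)]
            for i, (swarm, weights) in enumerate(zip(swarms, conditions)):
                worst = max(swarm, key=lambda p: fitness(p["best"], weights))
                worst["x"], worst["best"] = list(bests[i - 1]), list(bests[i - 1])
                worst["v"] = [0.0] * dim

    # Return the candidate with the lowest summed error over all conditions.
    candidates = [swarm_best(s, w) for s, w in zip(swarms, conditions)]
    return min(candidates, key=lambda c: sum(fitness(c, w) for w in conditions))

conditions = [[1.0, 1.0], [2.0, 0.5], [0.5, 3.0]]  # hypothetical per-condition error weights
print(multi_swarm_pso(conditions))
```

Because every condition shares the same underlying constants, a particle that is good for one swarm is a useful starting point for its neighbour, which is the intuition behind letting the best particles migrate.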

Citations: 7

Parallelizable Strategy for the Estimation of the 3D Structure of Biological Macromolecules

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00026

C. Caudai, M. Zoppè, E. Salerno, I. Merelli, A. Tonazzini

We present a parallelizable, multilevel algorithm for the study of the three-dimensional structure of biological macromolecules, applied to two fundamental topics: the 3D reconstruction of chromatin and the modeling of protein motion. For chromatin, starting from contact data obtained through Chromosome Conformation Capture techniques, our method first subdivides the data matrix into biologically relevant blocks, and then treats them separately, at several levels, depending on the initial data resolution. The result is a family of configurations for the entire fiber, each one compatible with both the experimental data and prior knowledge about specific genomes. For proteins, the method is conceived as a solution to the problem of identifying motion and conformations alternative to the deposited structures. The algorithm, using quaternions, processes the main chain and the amino acid side chains independently; it then uses a Monte Carlo method to select biologically acceptable conformations based on energy evaluation, and finally returns a family of conformations and trajectories at single-atom resolution.

Citations: 0

Hybrid OpenMP-MPI Parallelism: Porting Experiments from Small to Large Clusters

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00051

M. Ferretti, L. Santangelo

After a brief introduction to Cross Motif Search and its OpenMP and hybrid OpenMP-MPI implementations, this paper compares the scalability, efficiency and speedup of the hybrid implementation on a small cluster and on a real HPC system, explaining which factors make the application more efficient when it runs on the real HPC architecture. Profiling and tracing tools showed that the hybrid implementation cannot exploit the OpenMP parallelism because of several factors (heap contention among the threads, spin time and overhead time introduced by OpenMP, and thread-safe external functions), which makes the pure MPI implementation better than any hybrid one. By characterizing the workload, we also discovered that the application improves when the order in which tasks are processed is changed. This observation leads to the introduction of a new selection policy, named Longest Job First. The new policy is a winning solution for task submission among all running MPI processes.
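
To see why processing order can matter, the sketch below dispatches tasks greedily to the least-loaded of several workers, first in arrival order and then longest-first. The dispatch model, the function names and the task durations are assumptions made for illustration; they are not taken from the paper's Cross Motif Search implementation.

```python
import heapq

def makespan(durations, n_workers):
    """Give each task, in the order listed, to the currently least-loaded worker."""
    loads = [0] * n_workers          # a list of equal values is already a valid heap
    for d in durations:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + d)
    return max(loads)                # finishing time of the busiest worker

def longest_job_first(durations):
    """Reorder the tasks so that the longest ones are submitted first."""
    return sorted(durations, reverse=True)

tasks = [1, 1, 1, 1, 4]              # hypothetical task durations
print(makespan(tasks, 2))                      # arrival order -> 6
print(makespan(longest_job_first(tasks), 2))   # longest first -> 4
```

In arrival order the long task arrives last and lands on an already busy worker; submitting it first lets the short tasks fill the remaining gap.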

Citations: 6

Geo-Distributed BigData Processing for Maximizing Profit in Federated Clouds Environment

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00020

Thouraya Gouasmi, Wajdi Louati, A. Kacem

Managing and processing BigData in geo-distributed datacenters has gained much attention in recent years. Despite this, most efforts have focused on user-centric solutions, and far fewer on the difficulties Cloud providers face in improving their profits. A highly efficient framework for geo-distributed BigData processing in a cloud federation environment is crucial to maximizing the profit of cloud providers. The objective of this paper is to maximize the profit of cloud providers by minimizing costs and penalties. This work proposes moving computation to the geo-distributed data and outsourcing only the desired data to idle resources of federated clouds in order to minimize job costs; it also proposes a dynamic job-reordering approach to minimize penalty costs. The performance evaluation shows that the proposed algorithm can maximize profit, reduce MapReduce job costs and improve the utilization of cluster resources.

Citations: 2

Stingray-HPC: A Scalable Parallel Seismic Raytracing System

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00035

Mohammad Alaul Haque Monil, A. Malony, D. Toomey, K. Huck

The Stingray raytracer was developed for marine seismology to compute minimum travel time from all sources in an earth model to determine the 3D geophysical structure below the ocean floor. The original sequential implementation of Stingray used Dijkstra's single-source, shortest-path (SSSP) algorithm. A data parallel version of Stingray was developed based on the Bellman-Ford-Moore iterative SSSP algorithm. Single node experiments demonstrated performance improvements from parallelization with multicore (using OpenMP) and manycore processors (using CUDA). Calculating seismic ray paths for larger earth models requires distributed, multi-node algorithms utilizing domain decomposition methods. Preliminary 2D decomposition strategies show promising scaling results. However, a general 3D decomposition methodology is needed to handle any seismic raytracing problem on any HPC computing platform. In this paper, we present Stingray-HPC, a framework for scalable seismic raytracing which can automatically decompose a 3D earth model across nodes in a distributed environment, allocate ghost cell regions for iterative updates, coordinate ghost cell communications, and test for global convergence. Stingray-HPC is implemented with MPI and either OpenMP or CUDA for node-level calculations. Our results validate Stingray-HPC's ability to handle large models (over a billion points) and to solve these models efficiently at scales of up to 512 GPU nodes.
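
The appeal of an iterative SSSP formulation for data-parallel hardware can be shown with a minimal Bellman-Ford-style sketch. It is not Stingray's code: the edge-list graph, the function name, and the choice to read only the previous sweep's distances (so that every relaxation within a sweep is independent) are assumptions made for illustration.

```python
import math

def bellman_ford_sweeps(n_nodes, edges, source):
    """Minimum travel time from `source` to every node by repeated relaxation sweeps.

    `edges` is a list of (u, v, travel_time) triples with non-negative times.
    Each sweep reads only the previous sweep's distances, so all relaxations
    inside a sweep could run in parallel.
    """
    dist = [math.inf] * n_nodes
    dist[source] = 0.0
    for _ in range(n_nodes - 1):          # at most n-1 sweeps are ever needed
        new_dist = list(dist)
        for u, v, w in edges:
            if dist[u] + w < new_dist[v]:
                new_dist[v] = dist[u] + w
        if new_dist == dist:              # convergence test: nothing changed
            break
        dist = new_dist
    return dist

# Hypothetical four-node graph with directed travel times.
edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1), (2, 3, 5)]
print(bellman_ford_sweeps(4, edges, 0))   # [0.0, 3.0, 1.0, 4.0]
```

Dijkstra's algorithm processes one node at a time from a priority queue, which is hard to parallelize; the sweep form does more total work but exposes independent updates, and its "nothing changed" check is the single-node analogue of a global convergence test.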

Citations: 2

Memory-Aware Tree Partitioning on Homogeneous Platforms

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00056

Changjiang Gou, A. Benoit, L. Marchal

Scientific applications are commonly modeled as the processing of directed acyclic graphs of tasks, and for some of them, the graph takes the special form of a rooted tree. This tree expresses both the computational dependencies between tasks and their storage requirements. The problem of scheduling/traversing such a tree on a single processor to minimize its memory footprint has already been widely studied. Hence, we move to parallel processing and study how to partition the tree for a homogeneous multiprocessor platform, where each processor is equipped with its own memory. We formally state the problem of partitioning the tree into subtrees such that each subtree can be processed on a single processor and the total resulting processing time is minimized. We prove that the problem is NP-complete, and we design polynomial-time heuristics to address it. An extensive set of simulations demonstrates the usefulness of these heuristics.

Citations: 2

Scalable Mapping of Streaming Applications onto MPSoCs Using Optimistic Mixed Integer Linear Programming

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00062

Neela Gayen, J. Ax, Martin Flasskamp, Christian Klarhorst, T. Jungeblut, Maolin Tang, W. Kelly

Embedded streaming applications are facing increasingly demanding performance requirements in terms of throughput. A common mechanism for providing high compute power with a low energy budget is to use a very large number of low-power cores, often in the form of a Massively Parallel System on Chip (MPSoC). The challenge with programming such massively parallel systems is deciding how to optimally map the computation to individual cores to maximize throughput. In this work we present an automatic parallelizing compiler for the StreamIt programming language that efficiently and effectively maps computation to individual cores. The compiler must be effective, meaning that it does a good job of optimizing for throughput, but also efficient, in that the time taken to find such a mapping must scale well as the number of cores and the size of the stream program increase. We improve on previous work that used Integer Linear Programming (ILP) to map StreamIt programs to multicore systems by formulating the mapping problem in a different way, using mostly real rather than integer variables. Using so-called Mixed Integer Linear Programming (MILP) dramatically reduces the cost compared to standard ILP. This alternative formulation creates what we call an optimistic solution, which we then need to adjust slightly to obtain a final feasible solution. We show that this new approach is always close to, if not better than, the previous one in terms of effectiveness, while being dramatically better in terms of scalability and efficiency.

Citations: 2

Improving Communication and Load Balancing with Thread Mapping in Manycore Systems

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00021

E. Cruz, M. Diener, M. Serpa, P. Navaux, L. Pilla, I. Koren

Communication and load balancing have a significant impact on the performance of parallel applications and have been the subject of extensive research in multicore architectures. Thread mapping has been one of the solutions adopted in multicore architectures to address both communication and load balancing. However, the impact of such issues on more recently introduced manycore architectures is still unknown. Most related work on manycore architectures focuses on execution time and idleness information for scheduling decisions. In this paper, we improve the state of the art by performing a very detailed analysis of the impact of thread mapping on communication and load balancing in two manycore systems from Intel, namely Knights Corner and Knights Landing. We observed that the widely used metric of CPU time provides very inaccurate information for load balancing. We also evaluated the use of thread mapping based on the communication and load information of the applications to improve the performance of manycore systems.

Citations: 7

Low Precision Deep Learning Training on Mobile Heterogeneous Platform

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00023

Olivier Valery, Pangfeng Liu, Jan-Jan Wu

Recent advances in System-on-Chip architectures have made deep learning suitable for a number of applications on mobile devices. Unfortunately, due to the computational cost of neural network training, deep learning on mobile devices is often limited to inference tasks, e.g., prediction. In this paper, we propose a deep learning framework that enables both training and inference tasks on mobile devices. While accommodating the heterogeneity of computing hardware on mobile devices, it also uses OpenCL to efficiently leverage modern SoC capabilities, e.g., multi-core CPUs, integrated GPUs and shared memory architectures, and to accelerate deep learning computation. In addition, our system encodes the arithmetic operations of deep networks down to 8-bit fixed-point on mobile devices. As a proof of concept, we trained three well-known neural networks on mobile devices and observed a significant performance gain, reduced energy consumption, and memory savings.
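
As a rough illustration of what 8-bit fixed-point arithmetic involves, the sketch below encodes values as signed 8-bit integers with a fixed number of fractional bits. The format (4 fractional bits, saturation on overflow) and the function names are assumptions chosen for illustration; the abstract does not specify the encoding the framework actually uses.

```python
def to_fixed8(x, frac_bits=4):
    """Encode a float as a signed 8-bit fixed-point integer with `frac_bits` fractional bits."""
    scaled = round(x * (1 << frac_bits))
    return max(-128, min(127, scaled))      # saturate to the int8 range

def from_fixed8(q, frac_bits=4):
    """Decode a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

def fixed8_mul(a, b, frac_bits=4):
    """Multiply two fixed-point values; the raw product carries 2*frac_bits fractional bits."""
    product = (a * b) >> frac_bits          # shift back to frac_bits fractional bits
    return max(-128, min(127, product))

a, b = to_fixed8(1.5), to_fixed8(2.0)       # 24 and 32
print(from_fixed8(fixed8_mul(a, b)))        # 3.0
print(to_fixed8(100.0))                     # 127: out of range, so the value saturates
```

Each value occupies one byte instead of the four used by a 32-bit float, which is where memory savings come from; the price is a coarse resolution (here 1/16) and a narrow range (here about -8 to 8).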

Citations: 6

Scheduler Accelerator for TDMA Data Centers

2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)

Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00030

I. Patronas, Nikolaos Gkatzios, V. Kitsakis, D. Reisis, K. Christodoulopoulos, Emmanouel Varvarigos

Today's data center networks depend on optical switching to overcome the scalability limitations of traditional architectures. All-optical networks most often use slotted Time Division Multiple Access (TDMA) operation; their buffers are located at the optical network edges, and their organization relies on effective scheduling of the TDMA frames to achieve efficient sharing of the network resources and collision-free network operation. Scheduling decisions have to be taken in real time, a process that becomes computationally demanding as the network size increases. Accelerators provide a solution, and the present paper proposes a scheduler accelerator for a data center network that is divided into points of delivery (pods) of racks and uses hybrid electro-optical top-of-rack (ToR) switches that access an all-optical inter-rack network. The scheduler accelerator is a parallel, scalable architecture with application-specific processing engines. Case studies of configurations with 2, 4, 8 and 16 processors are presented for the processing of all the TDMA time slot transfer requests in networks of 512 and 1024 ToR nodes. The architecture is realized on a Xilinx VC707 board to validate the results.

Citations: 5
