Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00115
A. Tangherloni, L. Rundo, S. Spolaor, P. Cazzaniga, Marco S. Nobile
In silico investigation of biological systems requires knowledge of numerical parameters that cannot be easily measured in laboratory experiments. This leads to the Parameter Estimation (PE) problem, in which the unknown parameters are automatically inferred by optimization algorithms that exploit the available experimental data. Here we present MS²PSO, an efficient parallel and distributed implementation of a PE method based on Particle Swarm Optimization (PSO) for estimating reaction constants in mathematical models of biological systems, using as the estimation target a set of discrete-time measurements of molecular species amounts. In particular, this PE method accounts for experimental data that are typically measured under different experimental conditions by using a multi-swarm PSO in which the best particles of the swarms can migrate. This strategy makes it possible to infer a common set of reaction constants that simultaneously fits all target data used in the PE. To tackle the PE problem efficiently, MS²PSO embeds the execution of cupSODA, a deterministic simulator that relies on Graphics Processing Units to massively parallelize the simulations required to evaluate particle fitness. A further level of parallelism is achieved through the Master-Slave distributed programming paradigm. We apply MS²PSO to the PE of synthetic biochemical models with 10, 20 and 30 parameters to be estimated, and compare the performance obtained with different GPUs and different Master-Slave configurations (i.e., numbers of processes).
Title: GPU-Powered Multi-Swarm Parameter Estimation of Biological Systems: A Master-Slave Approach. Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). https://doi.org/10.1109/PDP2018.2018.00115
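To make the idea of PSO-based parameter estimation against data from several experimental conditions more concrete, here is a minimal sketch. It is not the authors' implementation: it uses a single swarm (no multi-swarm migration), a toy first-order decay model in place of the GPU simulator, and hypothetical names (`simulate`, `fitness`, `pso`) and coefficients chosen purely for illustration.

```python
import math
import random


def simulate(k, x0, times):
    """Toy model (assumption): first-order decay x(t) = x0 * exp(-k * t)."""
    return [x0 * math.exp(-k * t) for t in times]


def fitness(k, targets, times):
    """Sum of squared errors over all experimental conditions.

    `targets` is a list of (x0, observed_series) pairs, one per condition,
    so a single k must fit every condition at once.
    """
    err = 0.0
    for x0, observed in targets:
        predicted = simulate(k, x0, times)
        err += sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return err


def pso(targets, times, n_particles=20, n_iters=100, lo=0.0, hi=5.0, seed=0):
    """Global-best PSO over one parameter; returns (best_k, best_fitness)."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration weights (illustrative)
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    best_pos = pos[:]
    best_fit = [fitness(p, targets, times) for p in pos]
    g = min(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g], best_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (w * vel[i]
                      + c1 * r1 * (best_pos[i] - pos[i])
                      + c2 * r2 * (g_pos - pos[i]))
            # Keep the particle inside the search interval.
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))
            f = fitness(pos[i], targets, times)
            if f < best_fit[i]:
                best_pos[i], best_fit[i] = pos[i], f
                if f < g_fit:
                    g_pos, g_fit = pos[i], f
    return g_pos, g_fit


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0, 3.0]
    true_k = 0.8
    # Two synthetic "experimental conditions" with different initial amounts.
    targets = [(10.0, simulate(true_k, 10.0, times)),
               (4.0, simulate(true_k, 4.0, times))]
    print(pso(targets, times))
```

In a setting like the one the abstract describes, the call to `fitness` is the expensive step, which is why offloading the simulations to a GPU and distributing swarms across processes would pay off.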
Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00026
C. Caudai, M. Zoppè, E. Salerno, I. Merelli, A. Tonazzini
We present a parallelizable, multilevel algorithm for the study of the three-dimensional structure of biological macromolecules, applied to two fundamental topics: the 3D reconstruction of chromatin and the modeling of protein motion. For chromatin, starting from contact data obtained through Chromosome Conformation Capture techniques, our method first subdivides the data matrix into biologically relevant blocks and then treats them separately, at several levels, depending on the initial data resolution. The result is a family of configurations for the entire fiber, each one compatible with both the experimental data and prior knowledge about specific genomes. For proteins, the method is conceived as a solution to the problem of identifying motion and conformations alternative to the deposited structures. The algorithm, using quaternions, processes the main chain and the amino acid side chains independently; it then uses a Monte Carlo method to select biologically acceptable conformations based on energy evaluation, and finally returns a family of conformations and trajectories at single-atom resolution.
Title: Parallelizable Strategy for the Estimation of the 3D Structure of Biological Macromolecules. Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). https://doi.org/10.1109/PDP2018.2018.00026
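The abstract above mentions quaternions as the tool for manipulating chain geometry. As a small illustration of why they are convenient, the sketch below rotates a 3D point about an arbitrary axis with a unit quaternion. The function names are hypothetical and the code shows only the standard rotation formula p' = q p q*, not the paper's algorithm.

```python
import math


def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`."""
    ax, ay, az = axis
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), ax * s, ay * s, az * s)


def quat_mul(a, b):
    """Hamilton product of two quaternions."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def rotate(point, q):
    """Rotate a 3D point by the unit quaternion q via q * p * conj(q)."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    p = (0.0, point[0], point[1], point[2])
    _, rx, ry, rz = quat_mul(quat_mul(q, p), conj)
    return (rx, ry, rz)


if __name__ == "__main__":
    # Rotating the x unit vector by 90 degrees about z should give the y unit vector.
    q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2.0)
    print(rotate((1.0, 0.0, 0.0), q))
```

Composing successive rotations reduces to multiplying quaternions with `quat_mul`, which avoids the singularities of Euler angles when a chain of bonds is rotated step by step.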
Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00051
M. Ferretti, L. Santangelo
After a brief introduction to Cross Motif Search and its OpenMP and hybrid OpenMP-MPI implementations, this paper compares the scalability, efficiency and speedup of the hybrid implementation on a small cluster and on a real HPC system, explaining which factors make the application more efficient when it runs on the real HPC architecture. Profiling and tracing tools showed that the hybrid implementation cannot exploit OpenMP parallelism because of several factors (heap contention among the threads, spin time and overhead introduced by OpenMP, and thread-safe external functions), which makes the pure MPI implementation better than any hybrid one. By characterizing the workload, we also discovered that the application improves when the order in which tasks are processed is changed. This observation leads to the introduction of a new selection policy, named Longest Job First. The new policy proves effective for submitting tasks among all running MPI processes.
Title: Hybrid OpenMP-MPI Parallelism: Porting Experiments from Small to Large Clusters. Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). https://doi.org/10.1109/PDP2018.2018.00051
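The effect of a Longest Job First ordering, as named in the abstract above, can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's scheduler: task durations are assumed to be known in advance, workers stand in for MPI processes, and each task is handed to whichever worker becomes free first. The names `makespan` and `longest_job_first` are hypothetical.

```python
import heapq


def makespan(durations, n_workers):
    """Simulate dynamic dispatch in the given order.

    Each task goes to the worker that frees up first; the return value is
    the time at which the last worker finishes.
    """
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + d)  # busy until start + d
    return max(free_at)


def longest_job_first(durations):
    """Reorder tasks so the longest ones are submitted first."""
    return sorted(durations, reverse=True)


if __name__ == "__main__":
    tasks = [1, 1, 1, 1, 8]  # one long task arriving last in FIFO order
    print("FIFO makespan:", makespan(tasks, 2))
    print("LJF makespan: ", makespan(longest_job_first(tasks), 2))
```

In this toy case, submitting the long task last leaves one worker idle while the other finishes it, whereas submitting it first lets the short tasks fill the remaining worker's time.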
Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00020
Thouraya Gouasmi, Wajdi Louati, A. Kacem
Managing and processing BigData in geo-distributed datacenters has gained much attention in recent years. Despite this, most efforts have focused on user-centric solutions, and far fewer on the difficulties Cloud providers face in improving their profits. A highly efficient framework for geo-distributed BigData processing in a cloud federation environment is crucial to maximizing the providers' profit. The objective of this paper is to maximize the profit of cloud providers by minimizing costs and penalties. This work proposes moving computation to the geo-distributed data and outsourcing only the desired data to idle resources of federated clouds in order to minimize job costs; it also proposes a dynamic job-reordering approach to minimize penalty costs. The performance evaluation shows that the proposed algorithm can increase profit, reduce the cost of MapReduce jobs and improve the utilization of cluster resources.
Title: Geo-Distributed BigData Processing for Maximizing Profit in Federated Clouds Environment. Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). https://doi.org/10.1109/PDP2018.2018.00020
Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00035
Mohammad Alaul Haque Monil, A. Malony, D. Toomey, K. Huck
The Stingray raytracer was developed for marine seismology to compute minimum travel times from all sources in an earth model in order to determine the 3D geophysical structure below the ocean floor. The original sequential implementation of Stingray used Dijkstra's single-source shortest-path (SSSP) algorithm. A data-parallel version of Stingray was developed based on the Bellman-Ford-Moore iterative SSSP algorithm. Single-node experiments demonstrated performance improvements from parallelization with multicore (using OpenMP) and manycore processors (using CUDA). Calculating seismic ray paths for larger earth models requires distributed, multi-node algorithms that use domain decomposition methods. Preliminary 2D decomposition strategies show promising scaling results. However, a general 3D decomposition methodology is needed to handle any seismic raytracing problem on any HPC platform. In this paper, we present Stingray-HPC, a framework for scalable seismic raytracing which can automatically decompose a 3D earth model across nodes in a distributed environment, allocate ghost cell regions for iterative updates, coordinate ghost cell communications, and test for global convergence. Stingray-HPC is implemented with MPI and either OpenMP or CUDA for node-level calculations. Our results validate Stingray-HPC's ability to handle large models (over a billion points) and to solve these models efficiently at scales of up to 512 GPU nodes.
Title: Stingray-HPC: A Scalable Parallel Seismic Raytracing System. Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). https://doi.org/10.1109/PDP2018.2018.00035
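The iterative Bellman-Ford-Moore scheme mentioned in the abstract above lends itself to data parallelism because every sweep relaxes all edges using only the values from the previous sweep. The sketch below shows that pattern on a tiny hand-made graph. It is a sequential illustration with hypothetical names; the domain decomposition, ghost cells and MPI/CUDA layers of the actual system are not modeled.

```python
import math


def bellman_ford_moore(n_nodes, edges, source):
    """Iterative SSSP: sweep over all edges until travel times stop changing.

    `edges` is a list of directed (u, v, weight) triples with non-negative
    weights. Each sweep reads only the previous sweep's values, which is the
    property that lets the relaxations within a sweep run in parallel.
    """
    dist = [math.inf] * n_nodes
    dist[source] = 0.0
    for _ in range(n_nodes - 1):
        new_dist = dist[:]
        for u, v, w in edges:
            if dist[u] + w < new_dist[v]:
                new_dist[v] = dist[u] + w
        if new_dist == dist:  # global convergence test
            break
        dist = new_dist
    return dist


if __name__ == "__main__":
    undirected = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0), (2, 3, 1.0)]
    # Travel is possible in both directions, so add each edge twice.
    edges = undirected + [(v, u, w) for u, v, w in undirected]
    print(bellman_ford_moore(4, edges, source=0))
```

In a distributed setting, the convergence test would have to be a global reduction across all subdomains, which corresponds to the "test for global convergence" step the abstract lists.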
Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00056
Changjiang Gou, A. Benoit, L. Marchal
Scientific applications are commonly modeled as the processing of directed acyclic graphs of tasks, and for some of them, the graph takes the special form of a rooted tree. This tree expresses both the computational dependencies between tasks and their storage requirements. The problem of scheduling/traversing such a tree on a single processor to minimize its memory footprint has already been widely studied. Hence, we move to parallel processing and study how to partition the tree for a homogeneous multiprocessor platform, where each processor is equipped with its own memory. We formally state the problem of partitioning the tree into subtrees such that each subtree can be processed on a single processor and the total resulting processing time is minimized. We prove that the problem is NP-complete, and we design polynomial-time heuristics to address it. An extensive set of simulations demonstrates the usefulness of these heuristics.
Title: Memory-Aware Tree Partitioning on Homogeneous Platforms. Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). https://doi.org/10.1109/PDP2018.2018.00056
Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00062
Neela Gayen, J. Ax, Martin Flasskamp, Christian Klarhorst, T. Jungeblut, Maolin Tang, W. Kelly
Embedded streaming applications are facing increasingly demanding performance requirements in terms of throughput. A common mechanism for providing high compute power with a low energy budget is to use a very large number of low-power cores, often in the form of a Massively Parallel System on Chip (MPSoC). The challenge in programming such massively parallel systems is deciding how to optimally map the computation to individual cores to maximize throughput. In this work we present an automatic parallelizing compiler for the StreamIt programming language that efficiently and effectively maps computation to individual cores. The compiler must be both effective, meaning that it does a good job of optimizing for throughput, and efficient, in that the time taken to find such a mapping must scale well as the number of cores and the size of the stream program increase. We improve on previous work that used Integer Linear Programming (ILP) to map StreamIt programs to multicore systems by formulating the mapping problem in a different way, using mostly real rather than integer variables. Using so-called Mixed Integer Linear Programming (MILP) dramatically reduces the cost compared to standard ILP. This alternative formulation creates what we call an optimistic solution, which we then need to adjust slightly to obtain a final feasible solution. We show that this new approach is always close to, if not better than, the previous one in terms of effectiveness, while being dramatically better in terms of scalability and efficiency.
Title: Scalable Mapping of Streaming Applications onto MPSoCs Using Optimistic Mixed Integer Linear Programming. Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). https://doi.org/10.1109/PDP2018.2018.00062
Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00021
E. Cruz, M. Diener, M. Serpa, P. Navaux, L. Pilla, I. Koren
Communication and load balancing have a significant impact on the performance of parallel applications and have been the subject of extensive research in multicore architectures. Thread mapping has been one of the solutions adopted in multicore architectures to address both communication and load balancing. However, the impact of such issues on more recently introduced manycore architectures is still unknown. Most related work on manycore architectures focuses on execution time and idleness information for scheduling decisions. In this paper, we improve the state of the art by performing a detailed analysis of the impact of thread mapping on communication and load balancing in two manycore systems from Intel, namely Knights Corner and Knights Landing. We observed that the widely used metric of CPU time provides very inaccurate information for load balancing. We also evaluated the use of thread mapping based on the communication and load information of the applications to improve the performance of manycore systems.
Title: Improving Communication and Load Balancing with Thread Mapping in Manycore Systems. Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). https://doi.org/10.1109/PDP2018.2018.00021
Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00023
Olivier Valery, Pangfeng Liu, Jan-Jan Wu
Recent advances in System-on-Chip architectures have made deep learning suitable for a number of applications on mobile devices. Unfortunately, due to the computational cost of neural network training, deep learning on mobile devices is often limited to inference tasks, e.g., prediction. In this paper, we propose a deep learning framework that enables both training and inference tasks on mobile devices. While accommodating the heterogeneity of computing device technology on mobile devices, it also uses OpenCL to efficiently leverage modern SoC capabilities, e.g., multi-core CPU, integrated GPU and shared memory architecture, and to accelerate deep learning computation. In addition, our system encodes the arithmetic operations of deep networks down to 8-bit fixed-point on mobile devices. As a proof of concept, we trained three well-known neural networks on mobile devices and observed a significant performance gain, reduced energy consumption, and memory savings.
Title: Low Precision Deep Learning Training on Mobile Heterogeneous Platform. Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). https://doi.org/10.1109/PDP2018.2018.00023
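To illustrate what encoding arithmetic "down to 8-bit fixed-point" can look like, the sketch below converts real values to signed 8-bit fixed-point integers and multiplies them. The abstract does not specify the exact format, so the choice of 6 fractional bits, round-to-nearest with saturation, and the function names are all assumptions made for illustration.

```python
FRAC_BITS = 6            # assumed number of fractional bits
Q_MIN, Q_MAX = -128, 127  # range of a signed 8-bit integer


def to_fixed(x, frac_bits=FRAC_BITS):
    """Quantize a real value to signed 8-bit fixed-point, saturating on overflow."""
    q = round(x * (1 << frac_bits))
    return max(Q_MIN, min(Q_MAX, q))


def from_fixed(q, frac_bits=FRAC_BITS):
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << frac_bits)


def fixed_mul(a, b, frac_bits=FRAC_BITS):
    """Multiply two fixed-point values.

    The raw product carries 2 * frac_bits fractional bits, so it is shifted
    back down and saturated to the 8-bit range.
    """
    product = (a * b) >> frac_bits
    return max(Q_MIN, min(Q_MAX, product))


if __name__ == "__main__":
    a, b = to_fixed(0.5), to_fixed(0.25)
    print(a, b, from_fixed(fixed_mul(a, b)))
    print(to_fixed(10.0))  # out-of-range values saturate
```

With 6 fractional bits the representable range is roughly [-2, 2) in steps of 1/64, which shows the trade-off between range and resolution that any 8-bit scheme has to manage.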
Pub Date : 2018-03-21 DOI: 10.1109/PDP2018.2018.00030
I. Patronas, Nikolaos Gkatzios, V. Kitsakis, D. Reisis, K. Christodoulopoulos, Emmanouel Varvarigos
Today's data center networks depend on optical switching to overcome the scalability limitations of traditional architectures. All-optical networks most often use slotted Time Division Multiple Access (TDMA) operation; their buffers are located at the optical network edges, and their organization relies on effective scheduling of the TDMA frames to achieve efficient sharing of the network resources and collision-free network operation. Scheduling decisions have to be taken in real time, a process that becomes computationally demanding as the network size increases. Accelerators provide a solution, and the present paper proposes a scheduler accelerator for a data center network that is divided into points of delivery (pods) of racks and uses hybrid electro-optical top-of-rack (ToR) switches that access an all-optical inter-rack network. The scheduler accelerator is a parallel, scalable architecture with application-specific processing engines. Case studies of 2-, 4-, 8- and 16-processor configurations are presented for processing all the TDMA time slot transfer requests in networks of 512 and 1024 ToR nodes. The architecture is realized on a Xilinx VC707 board to validate the results.
Title: Scheduler Accelerator for TDMA Data Centers. Published in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). https://doi.org/10.1109/PDP2018.2018.00030
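As a rough illustration of collision-free TDMA slot scheduling of the kind the abstract above refers to, the sketch below assigns each transfer request to the earliest slot of a frame in which neither endpoint is already in use. The constraint model (one transmission per source and one reception per destination in each slot), the greedy first-fit rule, and the name `schedule` are assumptions for illustration; the paper's accelerator and its algorithm are not reproduced here.

```python
def schedule(requests, n_slots):
    """Greedy first-fit assignment of (src, dst) requests to TDMA slots.

    Returns a list with the slot index chosen for each request, or None when
    no slot in the frame can carry the request without a collision.
    """
    busy_src = [set() for _ in range(n_slots)]  # sources transmitting per slot
    busy_dst = [set() for _ in range(n_slots)]  # destinations receiving per slot
    assignment = []
    for src, dst in requests:
        for slot in range(n_slots):
            if src not in busy_src[slot] and dst not in busy_dst[slot]:
                busy_src[slot].add(src)
                busy_dst[slot].add(dst)
                assignment.append(slot)
                break
        else:
            assignment.append(None)  # frame is full for this pair
    return assignment


if __name__ == "__main__":
    requests = [(0, 1), (0, 2), (1, 2), (3, 1)]
    print(schedule(requests, n_slots=2))
```

The inner search over slots is independent for requests that share no endpoint, which hints at why partitioning the request set across several processing engines can speed up the scheduler as the number of ToR nodes grows.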