Pub Date: 2020-05-01
DOI: 10.1109/IPDPSW50202.2020.00171
Quentin G. Anthony, A. Awan, Arpan Jain, H. Subramoni, D. Panda
Deep Learning (DL) models for semantic image segmentation are an emerging trend in response to the explosion of multi-class, high resolution image and video data. However, segmentation models are highly compute-intensive, and even the fastest Volta GPUs cannot train them in a reasonable time frame. In our experiments, we observed just 6.7 images/second on a single Volta GPU for training DeepLab-v3+ (DLv3+), a state-of-the-art Encoder-Decoder model for semantic image segmentation. For comparison, a Volta GPU can process 300 images/second for training ResNet-50, a state-of-the-art model for image classification. In this context, we see a clear opportunity to utilize supercomputers to speed up training of segmentation models. However, most published studies on the performance of novel DL models such as DLv3+ require the user to significantly change Horovod, MPI, and the DL model to improve performance. Our work proposes an alternative tuning method that achieves near-linear scaling without significant changes to Horovod, MPI, or the DL model. In this paper, we select DLv3+ as the candidate TensorFlow model and implement Horovod-based distributed training for DLv3+. We observed poor default scaling performance of DLv3+ on the Summit system at Oak Ridge National Laboratory. To address this, we conducted an in-depth performance tuning of various Horovod/MPI knobs to achieve better performance over the default parameters. We present a comprehensive scaling comparison for Horovod with MVAPICH2-GDR up to 132 GPUs on Summit. Our optimization approach achieves near-linear (92%) scaling with MVAPICH2-GDR. We achieved a “mIOU” accuracy of 80.8% for distributed training, which is on par with published accuracy for this model. Further, we demonstrate an improvement in scaling efficiency by 23.9% over default Horovod training, which translates to a 1.3× speedup in training performance.
Title: Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
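The scaling figures quoted in the abstract can be related to throughput with a short calculation. The sketch below assumes the common definition of scaling efficiency (measured throughput divided by the GPU count times single-GPU throughput); the abstract does not spell out its own definition, and the function names are illustrative only. The derived throughput values are back-of-the-envelope estimates, not measurements reported by the paper.

```python
def scaling_efficiency(throughput_n, throughput_1, n_gpus):
    """Fraction of ideal (linear) throughput achieved on n_gpus.

    Assumed definition: measured throughput / (n_gpus * single-GPU throughput).
    """
    return throughput_n / (n_gpus * throughput_1)


def throughput_at_efficiency(throughput_1, n_gpus, efficiency):
    """Aggregate throughput implied by a given scaling efficiency."""
    return efficiency * n_gpus * throughput_1


# Numbers taken from the abstract: 6.7 images/second on one Volta GPU,
# 132 GPUs on Summit, and 92% scaling after tuning.
SINGLE_GPU_IMAGES_PER_S = 6.7
N_GPUS = 132

ideal = throughput_at_efficiency(SINGLE_GPU_IMAGES_PER_S, N_GPUS, 1.0)
tuned = throughput_at_efficiency(SINGLE_GPU_IMAGES_PER_S, N_GPUS, 0.92)

if __name__ == "__main__":
    print(f"ideal linear scaling: {ideal:.1f} images/second")
    print(f"at 92% efficiency:    {tuned:.1f} images/second")
```

Under these assumptions, perfectly linear scaling on 132 GPUs would correspond to roughly 884 images/second, and 92% efficiency to roughly 814 images/second.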
Pub Date: 2020-05-01
DOI: 10.1109/IPDPSW50202.2020.00129
Huda Alrammah, Yi Gu, Zhifeng Liu
Cloud computing has become the most popular distributed computing paradigm, delivering scalable resources for the efficient execution of large-scale scientific workflows. However, the large number of user requests and the limited cloud resources pose significant challenges for resource allocation, scheduling/mapping, power consumption, monetary cost, and so on. Therefore, how to schedule and optimize workflow execution in a cloud environment has become the most critical factor in improving overall performance. Moreover, Multi-objective Optimization Problems (MOPs) and heterogeneous cloud environments make resource utilization and workflow scheduling even more challenging. In this work, we propose a novel algorithm, named Multi-objective Optimization for Makespan, Cost and Energy (MOMCE), that efficiently assigns tasks to cloud resources in order to reduce the total execution time, monetary cost, and energy consumption of scientific workflows. The experimental results demonstrate the optimization stability and robustness of the MOMCE algorithm, which achieves a better fitness value than other existing algorithms.
Title: Tri-Objective Workflow Scheduling and Optimization in Heterogeneous Cloud Environments
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
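To make the idea of a single fitness value over three objectives concrete, here is a minimal sketch of a weighted-sum fitness. This is an assumption for illustration: the abstract does not say how MOMCE computes its fitness, and the `Schedule` type, the equal weights, and the normalization against a reference schedule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    """Hypothetical summary of one task-to-resource assignment."""
    makespan_s: float   # total execution time in seconds
    cost_usd: float     # monetary cost
    energy_j: float     # energy consumption in joules


def fitness(candidate, reference, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three objectives, each normalized by a reference.

    Lower is better. A value below 1.0 means the candidate improves on the
    reference schedule under the chosen weights. This is plain weighted-sum
    scalarization, not the MOMCE algorithm itself.
    """
    w_makespan, w_cost, w_energy = weights
    return (
        w_makespan * candidate.makespan_s / reference.makespan_s
        + w_cost * candidate.cost_usd / reference.cost_usd
        + w_energy * candidate.energy_j / reference.energy_j
    )


baseline = Schedule(makespan_s=1000.0, cost_usd=20.0, energy_j=5.0e6)
faster_but_pricier = Schedule(makespan_s=700.0, cost_usd=26.0, energy_j=5.0e6)

if __name__ == "__main__":
    print(fitness(faster_but_pricier, baseline))
```

A weighted sum is the simplest way to compare schedules on several objectives at once; the choice of weights encodes how much a user values time against cost and energy.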
Pub Date: 2020-05-01
DOI: 10.1109/IPDPSW50202.2020.00138
Ayse Bagbaba
Collective input and output (I/O) is an essential approach in high performance computing (HPC) applications. Achieving effective collective I/O is nontrivial because of the complex interdependencies between the layers of the I/O stack. These layers expose a number of tunable parameters intended to deliver the best possible I/O performance. Unfortunately, the right combination of parameters depends on the application and the HPC platform. As the configuration space grows, it becomes difficult for humans to keep track of the interactions between configuration options. Engineers have neither the time nor the experience to explore good configuration parameters for each problem because the benchmarking phase is long. In most cases the default settings are used, often leading to poor I/O efficiency. I/O profiling tools cannot identify the optimal settings without considerable effort spent analyzing the tracing results. An auto-tuning solution that optimizes collective I/O requests and provides system administrators or engineers with statistical information is therefore strongly needed. This paper presents a study of machine-learning-supported collective I/O auto-tuning, including its architecture and software stack. A random forest regression model is used to develop a performance predictor that captures parallel I/O behavior as a function of application and file system characteristics. The modeling approach can provide insights into the metrics that significantly impact I/O performance.
Title: Improving Collective I/O Performance with Machine Learning Supported Auto-tuning
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01
DOI: 10.1109/IPDPSW50202.2020.00154
Marat Dukhan
Neural network frameworks today commonly implement the Deconvolution operator and the closely related Convolution operator via a combination of GEMM (dense matrix-matrix multiplication) and a memory transformation. The recently proposed Indirect Convolution algorithm suggests a more efficient implementation of Convolution via the Indirect GEMM primitive, a modification of GEMM where pointers to rows are loaded from a buffer rather than computed assuming a constant stride. However, the algorithm is inefficient for Deconvolution with non-unit stride, which is typical in computer vision models. We describe a novel Indirect Deconvolution algorithm for efficient evaluation of the Deconvolution operator with non-unit stride. It splits a Deconvolution with a large kernel into multiple subconvolutions with smaller, variable-size kernels, which can be efficiently implemented on top of the Indirect GEMM primitive.
Title: Indirect Deconvolution Algorithm
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
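The splitting of a strided deconvolution into smaller subconvolutions can be illustrated in one dimension. The sketch below is not the paper's implementation and does not model the Indirect GEMM pointer buffer; it only shows, under the assumption of a 1D input with no padding, that a stride-s deconvolution equals s ordinary convolutions whose kernels take every s-th tap of the original kernel. The function names are illustrative.

```python
def deconv1d_direct(x, w, stride):
    """Reference 1D deconvolution (transposed convolution), no padding.

    Each input element scatters a scaled copy of the kernel into the output.
    """
    out = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        for k, wk in enumerate(w):
            out[i * stride + k] += xi * wk
    return out


def deconv1d_split(x, w, stride):
    """Same result, computed as `stride` independent subconvolutions.

    Output positions congruent to `phase` modulo `stride` depend only on the
    kernel taps w[phase], w[phase + stride], ..., so each phase is a plain
    unit-stride convolution with a smaller kernel. The sub-kernel sizes vary
    when the kernel length is not a multiple of the stride.
    """
    out = [0.0] * ((len(x) - 1) * stride + len(w))
    for phase in range(stride):
        sub_kernel = w[phase::stride]
        if not sub_kernel:
            continue  # stride larger than the kernel: this phase stays zero
        for q in range(len(x) + len(sub_kernel) - 1):
            acc = 0.0
            for j, wj in enumerate(sub_kernel):
                i = q - j
                if 0 <= i < len(x):
                    acc += x[i] * wj
            out[phase + stride * q] = acc
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    w = [1.0, 10.0, 100.0, 1000.0, 10000.0]
    print(deconv1d_direct(x, w, stride=2))
    print(deconv1d_split(x, w, stride=2))
```

With a kernel of 5 taps and stride 2, the two sub-kernels have 3 and 2 taps, which matches the "smaller, variable-size kernels" described in the abstract.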
Pub Date: 2020-05-01
DOI: 10.1109/ipdpsw50202.2020.00075
Jia Guo, G. Agrawal
In recent years, there has been considerable interest in developing frameworks for processing streaming data. Like the precursor commercial systems for data-intensive processing, these systems have largely not used methods popular within the HPC community (for example, MPI for communication). In this paper, we demonstrate Smart Streaming, a system for stream processing that offers a high-level API to users (similar to MapReduce), is fault-tolerant, and is also more efficient and scalable than current solutions. In particular, a cost-efficient MPI/OpenMP-based fault-tolerance scheme is incorporated so that the system can survive node failures with only a modest degradation of performance. We evaluate both the functionality and efficiency of Smart Streaming using four common applications in machine learning and data analytics. A comparison against state-of-the-art streaming frameworks shows that our system boosts the throughput of test cases by up to 10X and achieves desirable parallelism when scaled out. Additionally, the performance loss upon failures is only proportional to the share of failed resources.
Title: Smart Streaming: A High-Throughput Fault-tolerant Online Processing System
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01
DOI: 10.1109/IPDPSW50202.2020.00044
Saman P. Amarasinghe
In recent years, large graphs with billions of vertices and trillions of edges have emerged in many domains, such as social network analytics, machine learning, physical simulations, and biology. However, optimizing the performance of graph applications is notoriously difficult due to irregular memory access patterns and load imbalance across cores. The performance of graph programs depends highly on the algorithm, the size and structure of the input graphs, and the features of the underlying hardware. No single set of optimizations or single hardware platform works well across all applications.
Title: GrAPL 2020 Keynote Speaker: The GraphIt Universal Graph Framework: Achieving High Performance across Algorithms, Graph Types, and Architectures
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01
DOI: 10.1109/IPDPSW50202.2020.00145
Raphael N. C. B. Rocha, L. L. Filho, M. Aredes, F. França, P. Lima
Capable of supplying local loads, DC microgrids have received much attention in the last decade for alleviating power flow through the main power grid. This has been achieved by using edge devices to control the converters, but, among other problems, microgrids have stability issues when Constant Power Loads (CPL) are present. This problem has already been solved in the literature with the DC STATCOM power converter, which can handle grid operation in normal operation mode. However, in fault cases, the available solutions still fail to ignore faults or even contribute to them. The present work explores the potential of a lightweight machine learning algorithm, a Weightless Artificial Neural Network (WANN), for predicting the output of the original controller used in the DC STATCOM on an edge device connected to a converter, and investigates its generalization capability under microgrid fault situations. The WANN used is based on the regression variant of the Wilkes, Stonham, and Aleksander Recognition Device (WiSARD), named Regression WiSARD (ReW). The evaluation criteria measured the capability of the controller to reject the fault condition. Initial experiments showed surprisingly good results in comparison to the original DC STATCOM controller, indicating that a ReW-based controller performs the role of the DC STATCOM well and is able to cope with fault situations.
Title: Regression WiSARD application of controller on DC STATCOM converter under fault conditions
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01
DOI: 10.1109/IPDPSW50202.2020.00085
M. Diener, L. Kalé
Data movement between host and accelerators is one of the most challenging aspects of developing applications for heterogeneous systems. Most existing runtime systems for GPGPU programming require developers to perform data movement manually in the source code, while having to support different hardware and software environments. In this paper, we present a novel way to perform data movement for distributed applications based on the Charm++ programming system. We extend Charm++’s support for migration across memory address spaces to handle accelerator devices by making use of the description of data contained in Charm++’s parallel objects. This allows the Charm++ runtime to handle data movement automatically to a large extent, while supporting different hardware platforms transparently. This increases both developer productivity and the portability of Charm++ applications. We demonstrate our proposal with a Charm++ application that runs offloaded CUDA code on three different hardware platforms with a single data movement specification.
Title: Unified data movement for offloading Charm++ applications
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01
DOI: 10.1109/IPDPSW50202.2020.00041
M. Fujimoto, Cole A. Lyman, M. Clement
K-mers form the backbone of many bioinformatic algorithms. They are, however, difficult to store and use efficiently because the number of k-mers increases exponentially as k increases. Many algorithms exist for compressed storage of k-mers, but they either suffer from slow insert times or are probabilistic, resulting in false-positive k-mers. Furthermore, k-mer libraries usually specialize in associating specific values with k-mers, such as a color in colored de Bruijn graphs or a k-mer count. We present kcollections, a compressed and parallel data structure designed for k-mers generated from whole, assembled genomes. Kcollections is available for C++ and provides set- and map-like structures as well as a k-mer counting data structure, all of which use parallel operations designed around a MapReduce paradigm. Additionally, we provide basic Python bindings for rapid prototyping. Kcollections makes developing bioinformatic algorithms simpler by abstracting away the tedious task of storing k-mers. Kcollections is available at https://www.github.com/masakistan/kcollections
Title: Kcollections: A Fast and Efficient Library for K-mers
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
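For readers unfamiliar with k-mers, the sketch below shows a common baseline for storing and counting them: packing each DNA base into 2 bits and counting with a rolling window. It is not the kcollections API or its data structure; the function names are illustrative, and the handling of non-ACGT characters (resetting the window) is an assumption of this sketch.

```python
from collections import Counter

_BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
_BITS_TO_BASE = "ACGT"


def encode_kmer(kmer):
    """Pack a k-mer over {A, C, G, T} into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | _BASE_TO_BITS[base]
    return value


def decode_kmer(value, k):
    """Inverse of encode_kmer for a known k."""
    return "".join(
        _BITS_TO_BASE[(value >> (2 * (k - 1 - i))) & 3] for i in range(k)
    )


def count_kmers(sequence, k):
    """Count every k-mer in `sequence` using a rolling 2-bit encoding.

    Characters outside ACGT (for example N) reset the window, so no k-mer
    spanning them is counted.
    """
    mask = (1 << (2 * k)) - 1
    counts = Counter()
    value = 0
    valid = 0  # consecutive valid bases currently in the window
    for base in sequence:
        bits = _BASE_TO_BITS.get(base)
        if bits is None:
            value, valid = 0, 0
            continue
        value = ((value << 2) | bits) & mask
        valid += 1
        if valid >= k:
            counts[value] += 1
    return counts


if __name__ == "__main__":
    counts = count_kmers("ACGTACGT", 4)
    for encoded, n in sorted(counts.items()):
        print(decode_kmer(encoded, 4), n)
```

Because the DNA alphabet has four letters, there are 4^k possible k-mers, which is the exponential growth the abstract refers to and the reason plain hash-based counting like this becomes memory-hungry for large k.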
Pub Date: 2020-05-01
DOI: 10.1109/ipdpsw50202.2020.00157
P. Beckman, E. Jeannot, Swann Perarnau
Efficient dynamic allocation of compute-node resources, such as cores, by independent libraries or runtime systems can be a nightmare. Scientists writing application components have no way to efficiently specify and compose resource-hungry components. As application software stacks become deeper and multiple runtime layers compete for resources from the operating system, it has become clear that intelligent cooperation is needed. Resources such as compute cores, in-package memory, and even electrical power must be orchestrated dynamically across application components, which need the ability to query each other and respond appropriately. A more integrated solution would reduce intra-application resource competition and improve performance. Furthermore, application runtime systems could request and allocate specific hardware assets and adjust runtime tuning parameters up and down the software stack. The goal of this workshop is to gather and share the latest scholarly research from the community working on these issues, at all levels of the HPC software stack. This includes thread allocation, resource arbitration and management, containers, and so on, from runtime-system designers to compilers. We will also use panel sessions and keynote talks to discuss these issues, share visions, and present solutions.
Title: Workshop on Resource Arbitration for Dynamic Runtimes (RADR)
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)