Title: Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00171
Quentin G. Anthony, A. Awan, Arpan Jain, H. Subramoni, D. Panda
Deep Learning (DL) models for semantic image segmentation are an emerging trend in response to the explosion of multi-class, high-resolution image and video data. However, segmentation models are highly compute-intensive, and even the fastest Volta GPUs cannot train them in a reasonable time frame. In our experiments, we observed just 6.7 images/second on a single Volta GPU for training DeepLab-v3+ (DLv3+), a state-of-the-art Encoder-Decoder model for semantic image segmentation. For comparison, a Volta GPU can process 300 images/second for training ResNet-50, a state-of-the-art model for image classification. In this context, we see a clear opportunity to utilize supercomputers to speed up the training of segmentation models. However, most published studies on the performance of novel DL models such as DLv3+ require the user to significantly change Horovod, MPI, and the DL model to improve performance. Our work proposes an alternative tuning method that achieves near-linear scaling without significant changes to Horovod, MPI, or the DL model. In this paper, we select DLv3+ as the candidate TensorFlow model and implement Horovod-based distributed training for DLv3+. We observed poor default scaling performance of DLv3+ on the Summit system at Oak Ridge National Laboratory. To address this, we conducted an in-depth performance tuning of various Horovod/MPI knobs to achieve better performance over the default parameters. We present a comprehensive scaling comparison for Horovod with MVAPICH2-GDR up to 132 GPUs on Summit. Our optimization approach achieves near-linear (92%) scaling with MVAPICH2-GDR. We achieved an mIoU (mean Intersection over Union) accuracy of 80.8% for distributed training, which is on par with the published accuracy for this model. Further, we demonstrate an improvement in scaling efficiency of 23.9% over default Horovod training, which translates to a 1.3× speedup in training performance.
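For context on the Horovod-based setup described above, here is a minimal sketch of Horovod data-parallel training in TensorFlow/Keras. It is not the authors' DLv3+ training script: the model and dataset are placeholders, and the tuning knobs mentioned in the comments (e.g., HOROVOD_FUSION_THRESHOLD, HOROVOD_CYCLE_TIME) are examples of commonly tuned Horovod parameters rather than the settings used in the paper.

```python
# Minimal sketch of Horovod data-parallel training in TensorFlow/Keras.
# This is NOT the paper's DLv3+ code; the model and dataset are placeholders.
# Commonly tuned knobs are set in the job script, e.g. HOROVOD_FUSION_THRESHOLD
# (tensor-fusion buffer size) and HOROVOD_CYCLE_TIME (fusion cycle time).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each Horovod rank to one local GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=21)  # placeholder model

# Scale the learning rate by the number of workers and wrap the optimizer so that
# gradients are allreduced (the MPI library underneath may be MVAPICH2-GDR).
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# dataset = ...  # per-rank shard of the training data (omitted)
# model.fit(dataset, epochs=..., callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)
```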
{"title":"Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR","authors":"Quentin G. Anthony, A. Awan, Arpan Jain, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW50202.2020.00171","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00171","url":null,"abstract":"Deep Learning (DL) models for semantic image segmentation are an emerging trend in response to the explosion of multi-class, high resolution image and video data. However, segmentation models are highly compute-intensive, and even the fastest Volta GPUs cannot train them in a reasonable time frame. In our experiments, we observed just 6.7 images/second on a single Volta GPU for training DeepLab-v3+ (DLv3+), a state-of-the-art Encoder-Decoder model for semantic image segmentation. For comparison, a Volta GPU can process 300 images/second for training ResNet-50, a state-of-the-art model for image classification. In this context, we see a clear opportunity to utilize supercomputers to speed up training of segmentation models. However, most published studies on the performance of novel DL models such as DLv3+ require the user to significantly change Horovod, MPI, and the DL model to improve performance. Our work proposes an alternative tuning method that achieves near-linear scaling without significant changes to Horovod, MPI, or the DL model. In this paper, we select DLv3+ as the candidate TensorFlow model and implement Horovod-based distributed training for DLv3+. We observed poor default scaling performance of DLv3+ on the Summit system at Oak Ridge National Laboratory. To address this, we conducted an in-depth performance tuning of various Horovod/MPI knobs to achieve better performance over the default parameters. We present a comprehensive scaling comparison for Horovod with MVAPICH2-GDR up to 132 GPUs on Summit. Our optimization approach achieves near-linear (92%) scaling with MVAPICH2-GDR. We achieved a “mIOU” accuracy of 80.8% for distributed training, which is on par with published accuracy for this model. Further, we demonstrate an improvement in scaling efficiency by 23.9% over default Horovod training, which translates to a 1.3× speedup in training performance.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121643823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Tri-Objective Workflow Scheduling and Optimization in Heterogeneous Cloud Environments
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00129
Huda Alrammah, Yi Gu, Zhifeng Liu
Cloud computing has become the most popular distributed computing paradigm, delivering scalable resources for the efficient execution of large-scale scientific workflows. However, the large number of user requests and the limited cloud resources pose significant challenges for resource allocation, scheduling/mapping, power consumption, and monetary cost. Therefore, how to schedule and optimize workflow execution in a cloud environment has become the most critical factor in improving overall performance. Moreover, Multi-objective Optimization Problems (MOPs) along with heterogeneous cloud environments have made resource utilization and workflow scheduling even more challenging. In this work, we propose a novel algorithm, named Multi-objective Optimization for Makespan, Cost and Energy (MOMCE), to efficiently assign tasks to cloud resources in order to reduce the total execution time, monetary cost, and energy consumption of scientific workflows. Experimental results demonstrate the optimization stability and robustness of the MOMCE algorithm, which achieves better fitness values than other existing algorithms.
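The abstract does not spell out MOMCE's formulation; as a generic illustration of how makespan, monetary cost, and energy can be folded into a single fitness value for comparing candidate schedules, here is a hedged sketch using a normalized weighted sum. The weights, normalization bounds, and example numbers are hypothetical, not MOMCE's actual parameters.

```python
# Hedged sketch: a generic normalized weighted-sum fitness for a tri-objective
# workflow-scheduling problem (makespan, monetary cost, energy). The weights and
# the min/max normalization bounds are hypothetical; MOMCE's actual fitness may differ.
from dataclasses import dataclass

@dataclass
class Objectives:
    makespan: float  # seconds
    cost: float      # dollars
    energy: float    # joules

def normalize(value, lo, hi):
    """Map an objective value into [0, 1] given estimated bounds."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def fitness(obj, bounds, w_makespan=0.4, w_cost=0.3, w_energy=0.3):
    """Lower is better: weighted sum of normalized objectives."""
    return (w_makespan * normalize(obj.makespan, *bounds['makespan'])
            + w_cost * normalize(obj.cost, *bounds['cost'])
            + w_energy * normalize(obj.energy, *bounds['energy']))

# Example: compare two candidate task-to-resource assignments.
bounds = {'makespan': (100.0, 1000.0), 'cost': (1.0, 50.0), 'energy': (1e5, 1e7)}
a = Objectives(makespan=400.0, cost=12.0, energy=2.5e6)
b = Objectives(makespan=550.0, cost=8.0, energy=1.8e6)
best = min((a, b), key=lambda o: fitness(o, bounds))
print('preferred schedule:', best)
```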
{"title":"Tri-Objective Workflow Scheduling and Optimization in Heterogeneous Cloud Environments","authors":"Huda Alrammah, Yi Gu, Zhifeng Liu","doi":"10.1109/IPDPSW50202.2020.00129","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00129","url":null,"abstract":"Cloud computing has become the most popular distributed computing paradigm among others which delivers scalable resources for efficient execution of large-scale scientific workflows. However, the large number of user requests and the limited cloud resources have posed a significant challenge on resource allocation, scheduling/mapping, power consumption, monetary cost, and so on. Therefore, how to schedule and optimize workflow execution in a cloud environment has become the most critical factor in improving the overall performance. Moreover, Multi-objective Optimization Problems (MOPs) along with heterogeneous cloud environments have made resource utilization and workflow scheduling even more challenging. In this work, we propose a novel algorithm, named Multi-objective Optimization for Makespan, Cost and Energy (MOMCE), to efficiently assign tasks to cloud resources in order to reduce total execution time, monetary cost, and energy consumption of scientific workflows. The experimental results have demonstrated the optimization stability and robustness of MOMCE algorithm for achieving a better fitness value in comparison with other existing algorithms.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132942269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Improving Collective I/O Performance with Machine Learning Supported Auto-tuning
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00138
Ayse Bagbaba
Collective input and output (I/O) is an essential approach in high-performance computing (HPC) applications. Achieving effective collective I/O is nontrivial due to the complex interdependencies between the layers of the I/O stack. These layers offer the best possible I/O performance through a number of tunable parameters; unfortunately, the correct combination of parameters depends on the application and the HPC platform. As the configuration space grows, it becomes difficult for humans to reason about the interactions between configuration options. Engineers often have neither the time nor the experience to explore good configuration parameters for each problem because of the long benchmarking phase. In most cases, the default settings are used, often leading to poor I/O efficiency. I/O profiling tools cannot identify optimal configurations without substantial effort spent analyzing tracing results. An auto-tuning solution that optimizes collective I/O requests and provides system administrators or engineers with statistical information is therefore strongly needed. In this paper, a study of machine-learning-supported collective I/O auto-tuning, including the architecture and software stack, is performed. A random forest regression model is used to develop a performance predictor that captures parallel I/O behavior as a function of application and file-system characteristics. The modeling approach can provide insights into the metrics that impact I/O performance significantly.
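As a hedged sketch of the modeling step described above, the snippet below trains a random-forest regressor to predict collective I/O bandwidth from tunable parameters. The feature set and the synthetic data are illustrative assumptions, not the paper's dataset or tool.

```python
# Hedged sketch: training a random-forest regressor to predict collective I/O
# bandwidth from application/file-system features, in the spirit of the paper.
# The feature names and synthetic data are illustrative, not the author's dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(1, 512, n),       # number of MPI processes
    rng.integers(1, 64, n),        # number of I/O aggregators (cb_nodes)
    2 ** rng.integers(16, 27, n),  # collective buffer size (bytes)
    2 ** rng.integers(20, 32, n),  # file-system stripe size (bytes)
    rng.integers(1, 32, n),        # file-system stripe count
])
# Synthetic target: bandwidth in MB/s (placeholder relationship, not real data).
y = 50 + 0.5 * X[:, 1] + 1e-6 * X[:, 2] + rng.normal(0, 10, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print('R^2 on held-out configurations:', model.score(X_test, y_test))
print('feature importances:', model.feature_importances_)
```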
{"title":"Improving Collective I/O Performance with Machine Learning Supported Auto-tuning","authors":"Ayse Bagbaba","doi":"10.1109/IPDPSW50202.2020.00138","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00138","url":null,"abstract":"Collective Input and output (I/O) is an essential approach in high performance computing (HPC) applications. The achievement of effective collective I/O is a nontrivial job due to the complex interdependencies between the layers of I/O stack. These layers provide the best possible I/O performance through a number of tunable parameters. Sadly, the correct combination of parameters depends on diverse applications and HPC platforms. When a configuration space gets larger, it becomes difficult for humans to monitor the interactions between the configuration options. Engineers has no time or experience for exploring good configuration parameters for each problem because of long benchmarking phase. In most cases, the default settings are implemented, often leading to poor I/O efficiency. I/O profiling tools can not tell the optimal default setups without too much effort to analyzing the tracing results. In this case, an auto-tuning solution for optimizing collective I/O requests and providing system administrators or engineers the statistic information is strongly required. In this paper, a study of the machine learning supported collective I/O auto-tuning including the architecture and software stack is performed. Random forest regression model is used to develop a performance predictor model that can capture parallel I/O behavior as a function of application and file system characteristics. The modeling approach can provide insights into the metrics that impact I/O performance significantly.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133094308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Indirect Deconvolution Algorithm
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00154
Marat Dukhan
Neural network frameworks today commonly implement the Deconvolution operator and the closely related Convolution operator via a combination of GEMM (dense matrix-matrix multiplication) and a memory transformation. The recently proposed Indirect Convolution algorithm suggests a more efficient implementation of Convolution via the Indirect GEMM primitive - a modification of GEMM where pointers to rows are loaded from a buffer rather than being computed assuming a constant stride. However, the algorithm is inefficient for Deconvolution with non-unit stride, which is typical in computer vision models. We describe a novel Indirect Deconvolution algorithm for efficient evaluation of the Deconvolution operator with non-unit stride by splitting a Deconvolution with a large kernel into multiple subconvolutions with smaller, variable-size kernels, which can be efficiently implemented on top of the Indirect GEMM primitive.
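The decomposition described above can be verified directly in one dimension: a stride-2 deconvolution with a length-4 kernel equals two subconvolutions with length-2 sub-kernels whose outputs are interleaved. The NumPy sketch below is only an illustration of that split, not the Indirect-GEMM implementation.

```python
# Illustrative 1-D check (not the paper's Indirect-GEMM code): a stride-2
# deconvolution with a length-4 kernel equals two subconvolutions with
# length-2 sub-kernels whose outputs are interleaved.
import numpy as np

def deconv1d_direct(x, w, stride=2):
    """Reference transposed convolution: y[stride*k + j] += x[k] * w[j]."""
    y = np.zeros(stride * (len(x) - 1) + len(w))
    for k, xk in enumerate(x):
        y[stride * k: stride * k + len(w)] += xk * w
    return y

def deconv1d_subconv(x, w):
    """Stride-2 deconvolution via two smaller convolutions (the split idea)."""
    even = np.convolve(x, w[0::2])  # produces output positions 0, 2, 4, ...
    odd = np.convolve(x, w[1::2])   # produces output positions 1, 3, 5, ...
    y = np.empty(len(even) + len(odd))
    y[0::2], y[1::2] = even, odd
    return y

x = np.array([1.0, 2.0, -1.0, 3.0])
w = np.array([10.0, 20.0, 30.0, 40.0])
assert np.allclose(deconv1d_direct(x, w), deconv1d_subconv(x, w))
print(deconv1d_subconv(x, w))
```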
{"title":"Indirect Deconvolution Algorithm","authors":"Marat Dukhan","doi":"10.1109/IPDPSW50202.2020.00154","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00154","url":null,"abstract":"Neural network frameworks today commonly implement Deconvolution and closely related Convolution operator via a combination of GEMM (dense matrix-matrix multiplication) and a memory transformation. The recently proposed Indirect Convolution algorithm suggests a more efficient implementation of Convolution via the Indirect GEMM primitive - a modification of GEMM where pointers to rows are loaded from a buffer rather than being computed assuming constant stride. However, the algorithm is inefficient for Deconvolution with non-unit stride, which is typical in computer vision models. We describe a novel Indirect Deconvolution algorithm for efficient evaluation of the Deconvolution operator with nonunit stride by splitting Deconvolution with a large kernel into multiple subconvolutions with smaller, variable-size kernels, which can be efficiently implemented on top of the Indirect GEMM primitive.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"35 29","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131806225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Regression WiSARD application of controller on DC STATCOM converter under fault conditions
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00145
Raphael N. C. B. Rocha, L. L. Filho, M. Aredes, F. França, P. Lima
Capable of supplying local loads, DC microgrids have received much attention in the last decade for alleviating power flow through the main power grid. This has been achieved through the use of edge devices to control the converters, but, among other problems, microgrids have stability issues when Constant Power Loads (CPLs) are present. This problem has already been addressed in the literature with the DC STATCOM power converter, which can handle grid operation in normal operating mode. However, in fault cases, the available solutions still fail to reject faults or may even contribute to them. The present work explores the potential of a lightweight machine learning algorithm, a Weightless Artificial Neural Network (WANN), for predicting the output of the original DC STATCOM controller on an edge device connected to a converter, and investigates its generalization capability under microgrid fault situations. The WANN used is based on the regression variant of the Wilkes, Stonham, and Aleksander Recognition Device (WiSARD), coined as Regression WiSARD (ReW). The evaluation criteria employed measure the capability of the controller to reject the fault condition. Initial experiments showed surprisingly good results in comparison to the original DC STATCOM controller, indicating that a ReW-based controller performs the role of the DC STATCOM well and is able to cope with fault situations.
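The abstract does not detail ReW's internals; the sketch below is a simplified, generic n-tuple (WiSARD-style) regressor in which each RAM stores a running sum and count per address, and prediction averages the stored means. The tuple size, thermometer encoding, and toy training signal are assumptions for illustration, not the authors' configuration.

```python
# Minimal, generic sketch of an n-tuple (WiSARD-style) regressor: each RAM stores
# a running (sum, count) per address and prediction averages the stored means.
# This is a simplified stand-in for Regression WiSARD (ReW), not the authors' code.
import random
from collections import defaultdict

class NTupleRegressor:
    def __init__(self, input_bits, tuple_size=4, seed=0):
        rng = random.Random(seed)
        bits = list(range(input_bits))
        rng.shuffle(bits)
        # Partition the (shuffled) input bits into tuples, one RAM per tuple.
        self.tuples = [bits[i:i + tuple_size] for i in range(0, input_bits, tuple_size)]
        self.rams = [defaultdict(lambda: [0.0, 0]) for _ in self.tuples]

    def _addresses(self, x_bits):
        for tup in self.tuples:
            yield tuple(x_bits[b] for b in tup)

    def train(self, x_bits, y):
        for ram, addr in zip(self.rams, self._addresses(x_bits)):
            cell = ram[addr]
            cell[0] += y
            cell[1] += 1

    def predict(self, x_bits):
        means = []
        for ram, addr in zip(self.rams, self._addresses(x_bits)):
            s, c = ram[addr]
            if c > 0:
                means.append(s / c)
        return sum(means) / len(means) if means else 0.0

def thermometer(value, lo, hi, bits=16):
    """Simple thermometer encoding of a scalar into a binary vector."""
    level = int(round((value - lo) / (hi - lo) * bits))
    return [1 if i < level else 0 for i in range(bits)]

# Toy usage: learn a noisy linear controller output from a scalar measurement.
model = NTupleRegressor(input_bits=16, tuple_size=4)
for _ in range(2000):
    v = random.uniform(0.0, 1.0)
    model.train(thermometer(v, 0.0, 1.0), 2.0 * v + 0.1)
print(model.predict(thermometer(0.5, 0.0, 1.0)))  # target for v = 0.5 is 2*0.5 + 0.1 = 1.1
```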
{"title":"Regression WiSARD application of controller on DC STATCOM converter under fault conditions","authors":"Raphael N. C. B. Rocha, L. L. Filho, M. Aredes, F. França, P. Lima","doi":"10.1109/IPDPSW50202.2020.00145","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00145","url":null,"abstract":"Capable of supplying local loads, DC microgrids have received much attention in the last decade for alleviating power flow through the main power grid. This has been achieved through the use of edge devices on the control of the converters, but, among other problems, microgrids have stability issues when Constant Power Loads (CPL) are present. This problem was already solved in the literature with the DC STATCOM power converter, in normal operation mode, it can deal with the grid operation. However, in fault cases, the solutions available still fail to ignore faults or even contribute to them. The present work aims to explore the potential of a light machine learning algorithm of the type Weightless Artificial Neural Network (WANN) for predicting the output of the original controller used in the DC STATCOM on an edge device connected to a converter, and investigate its generalization capability under microgrid fault situations. The WANN used is based on the regression variant of the Wilkes, Stonham, and Aleksander Recognition Device (WiSARD), coined as Regression WiSARD (ReW). The evaluation criteria employed measured the capability of the controller to reject the fault condition. Initial results showed surprisingly good results in comparison to the original DC STATCOM controller, indicating that a ReW-based controller plays well the role of the DC STATCOM and was able to cope with fault situations.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116748427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Pinocchio: A Blockchain-Based Algorithm for Sensor Fault Tolerance in Low Trust Environment
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00077
Chen Zeng, Yifan Wang, Fan Liang, Xiaohui Peng
With the advancement of the IoT market and technology, sensor-equipped devices belonging to different interest groups have more opportunities to cooperate in fulfilling an assignment. Ensuring agreement on correct sensed data in such a low-trust environment is a new challenge in today's market. To bridge this gap, we propose Pinocchio, a blockchain-based algorithm that tolerates faults in Wireless Sensor Networks (WSNs) and defends sensed data against malicious data-tampering and masquerade attacks. Compared to other distributed approaches to sensor fault tolerance, Pinocchio greatly reduces the message complexity of the entire network to $O(N)$ and that of a single node to $O(1)$. Considering the possible waste of resources brought about by vicious competition for hash power in the blockchain-based approach, we design the Geppetto algorithm to supervise and control hash power in a distributed manner; its effectiveness is demonstrated by experiments.
{"title":"Pinocchio: A Blockchain-Based Algorithm for Sensor Fault Tolerance in Low Trust Environment","authors":"Chen Zeng, Yifan Wang, Fan Liang, Xiaohui Peng","doi":"10.1109/IPDPSW50202.2020.00077","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00077","url":null,"abstract":"As the advancement of the IoT market and technology, sensor-equipped devices belonging to different interest groups have more opportunities to cooperate, fulfilling an assignment. How to ensure an agreement on correct sensed data in such a low trust environment is a new challenge in nowadays market. To bridge the gap, we propose Pinocchio, a blockchain-based algorithm which can tolerate faults in WSN (Wireless Sensor Network) along with the ability to defend sensed data against malicious attacks of data tampering and masquerade. Compared to other distributed approaches of sensor fault tolerance, Pinocchio greatly reduced the message complexity of the entire network to $O(N)$ and that of a single node to $O(1)$. Considering the possible waste of resources brought about by vicious competition of hash power in the blockchain-based approach, we design the Geppetto algorithm to supervise and control hash power in a distributed manner, and its effectiveness is demonstrated by experiments.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126405275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00083
N. Fujita, Ryohei Kobayashi, Y. Yamaguchi, Tomohiro Ueno, K. Sano, T. Boku
In recent years, many High Performance Computing (HPC) researchers have been attracted to utilizing Field Programmable Gate Arrays (FPGAs) for HPC applications. FPGAs can be used for communication as well as computation thanks to their I/O capabilities. HPC scientists have struggled to utilize FPGAs for their applications because of the difficulty of FPGA development; however, High-Level Synthesis (HLS) allows them to be used at a reasonable development cost. In this study, we propose the Communication Integrated Reconfigurable CompUting System (CIRCUS), which enables the high-speed interconnect of FPGAs to be used from OpenCL. CIRCUS builds a single fused pipeline combining computation and communication, hiding the communication latency by completely overlapping the two. In this paper, we present the details of the implementation and evaluation results using two benchmarks: a pingpong benchmark and an allreduce benchmark.
{"title":"Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA","authors":"N. Fujita, Ryohei Kobayashi, Y. Yamaguchi, Tomohiro Ueno, K. Sano, T. Boku","doi":"10.1109/IPDPSW50202.2020.00083","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00083","url":null,"abstract":"In recent years, much High Performance Computing (HPC) researchers attract to utilize Field Programmable Gate Arrays (FPGAs) for HPC applications. We can use FPGAs for communication as well as computation thanks to FPGA’s I/O capabilities. HPC scientists cannot utilize FPGAs for their applications because of the difficulty of the FPGA development, however High Level Synthesis (HLS) allows them to use with appropriate costs. In this study, we propose a Communication Integrated Reconfigurable CompUting System (CIRCUS) to enable us to utilize high-speed interconnection of FPGAS from OpenCL. CIRCUS makes a fused single pipeline combining the computation and the communication, which hides the communication latency by completely overlapping them. In this paper, we present the detail of the implementation and the evaluation result using two benchmarks: pingpong benchmark and allreduce benchmark.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125726824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00147
Amit Ruhela, Shulei Xu, K. V. Manian, H. Subramoni, D. Panda
Communication interfaces of High Performance Computing (HPC) systems, Cloud middleware, and Deep Learning (DL) frameworks have been continually evolving to meet the ever-increasing communication demands placed on them by HPC, Cloud, and DL applications. Modern high-performance interconnects like InfiniBand EDR and InfiniBand HDR are capable of delivering 100 Gbps and 200 Gbps speeds, respectively. However, no previous study has demonstrated how much benefit an end-user in the HPC, Cloud, and DL computing domains can expect by utilizing newer generations of these interconnects over the older ones. In this paper, we evaluate the InfiniBand EDR and HDR high-performance interconnects over the PCIe Gen3 interface with HPC, Cloud, and DL workloads. Our comprehensive analysis, done at different levels, provides a global view of the impact these modern interconnects have on the performance of HPC, Cloud, and DL applications. The results of our experiments show that the latest InfiniBand HDR interconnect gives the best performance for all three computing domains.
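Interconnect comparisons of this kind are usually driven by point-to-point microbenchmarks (OSU-style latency/bandwidth tests). The mpi4py loop below is a simplified bandwidth sketch for illustration only; it is not the benchmark suite, message sizes, or methodology used in the paper.

```python
# Simplified point-to-point bandwidth microbenchmark (OSU-style) in mpi4py.
# Illustrative only -- not the benchmark suite or parameters used in the paper.
# Run with two ranks, e.g.: mpirun -np 2 python bw_test.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iters, warmup = 100, 10

for size in [2 ** p for p in range(10, 25)]:  # 1 KiB .. 16 MiB
    buf = np.zeros(size, dtype=np.uint8)
    ack = np.zeros(1, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for i in range(iters + warmup):
        if i == warmup:
            t0 = MPI.Wtime()  # restart the timer after warmup iterations
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1)
        elif rank == 1:
            comm.Recv([buf, MPI.BYTE], source=0)
    # Rank 1 acknowledges so rank 0's timer covers full delivery of all messages.
    if rank == 0:
        comm.Recv([ack, MPI.BYTE], source=1)
        elapsed = MPI.Wtime() - t0
        print(f"{size:>10d} bytes  {size * iters / elapsed / 1e9:8.2f} GB/s")
    elif rank == 1:
        comm.Send([ack, MPI.BYTE], dest=0)
```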
{"title":"Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR","authors":"Amit Ruhela, Shulei Xu, K. V. Manian, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW50202.2020.00147","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00147","url":null,"abstract":"Communication interfaces of High Performance Computing (HPC) systems, Cloud middleware, and Deep Learning (DL) frameworks have been continually evolving to meet the ever-increasing communication demands being placed on them by HPC, Cloud, and DL applications. Modern high performance interconnects like InfiniBand EDR 100 Gbps, InfiniBand HDR 200 Gbps are capable of delivering 100 Gbps and 200 Gbps speeds. However, no previous study has demonstrated how much benefit an end-user in the HPC, Cloud, and DL computing domain can expect by utilizing newer generations of these interconnects over the older ones. In this paper, we evaluate the InfiniBand EDR and HDR high performance interconnects over the PCIe Gen3 interface with HPC, Cloud, and DL workloads. Our comprehensive analysis, done at different levels, provides a global scope of the impact these modern interconnects have on the performance of HPC, Cloud, and DL applications. The results of our experiments show that the latest InfiniBand HDR interconnect gives the best performance for all three computing domains.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125907614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Kcollections: A Fast and Efficient Library for K-mers
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00041
M. Fujimoto, Cole A. Lyman, M. Clement
K-mers form the backbone of many bioinformatic algorithms. They are, however, difficult to store and use efficiently because the number of k-mers increases exponentially as $k$ increases. Many algorithms exist for compressed storage of k-mers but suffer from slow insert times or are probabilistic, resulting in false-positive k-mers. Furthermore, k-mer libraries usually specialize in associating specific values with k-mers, such as a color in colored de Bruijn graphs or a k-mer count. We present kcollections (https://www.github.com/masakistan/kcollections), a compressed and parallel data structure designed for k-mers generated from whole, assembled genomes. Kcollections is available for C++ and provides set- and map-like structures as well as a k-mer counting data structure, all of which utilize parallel operations designed using a MapReduce paradigm. Additionally, we provide basic Python bindings for rapid prototyping. Kcollections makes developing bioinformatic algorithms simpler by abstracting away the tedious task of storing k-mers.
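For context on the workload such a library handles, the snippet below extracts and counts canonical k-mers with plain Python containers. It is a generic illustration, not the kcollections API, and it shows why memory grows so quickly with k (up to 4^k distinct k-mers).

```python
# Generic illustration of the k-mer workload a library like kcollections handles:
# extract canonical k-mers from a sequence and count them with a plain dict.
# (This is not the kcollections API; it just shows what must be stored.)
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_kmers(seq, k):
    """Yield the lexicographically smaller of each k-mer and its reverse complement."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(COMPLEMENT)[::-1]
        yield min(kmer, rc)

seq = "ACGTACGTGACCTGA"
counts = Counter(canonical_kmers(seq, k=5))
print(len(counts), "distinct canonical 5-mers")
print(counts.most_common(3))
# With k = 31 there are up to 4**31 (~4.6e18) possible k-mers, which is why
# compressed, parallel structures are needed for whole-genome inputs.
```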
{"title":"Kcollections: A Fast and Efficient Library for K-mers","authors":"M. Fujimoto, Cole A. Lyman, M. Clement","doi":"10.1109/IPDPSW50202.2020.00041","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00041","url":null,"abstract":"K-mers form the backbone of many bioinformatic algorithms. They are, however, difficult to store and use efficiently because the number of k-mers increases exponentially as $k$ increases. Many algorithms exist for compressed storage of kmers but suffer from slow insert times or are probabilistic resulting in false-positive k-mers. Furthermore, k-mer libraries usually specialize in associating specific values with k-mers such as a color in colored de Bruijn Graphs or k-mer count. We present kcollections1, a compressed and parallel data structure designed for k-mers generated from whole, assembled genomes. Kcollections is available for $mathrm {C}++$ and provides set-and maplike structures as well as a k-mer counting data structure all of which utilize parallel operations designed using a MapReduce paradigm. Additionally, we provide basic Python bindings for rapid prototyping. Kcollections makes developing bioinformatic algorithms simpler by abstracting away the tedious task of storing k-mers.1https://www.github.com/masakistan/kcollections","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122391858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Real-time Automatic Modulation Classification using RFSoC
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00021
Stephen Tridgell, D. Boland, P. Leong, R. Kastner, Alireza Khodamoradi, Siddhartha
The computational complexity of deep learning has led to research efforts to reduce the computation required. The use of low precision is particularly effective on FPGAs as they are not restricted to byte-addressable operations. However, very low precision activations and weights can have a significant impact on accuracy. This work demonstrates that, by exploiting throughput matching, higher precision on certain layers can be used to recover this accuracy. This is applied to the domain of automatic modulation classification for radio signals, leveraging the RF capabilities offered by the Xilinx ZCU111 RFSoC platform. The implemented networks achieve high-speed real-time performance with a classification latency of approximately 8 μs and an operational throughput of 488k classifications per second. On the open-source RadioML dataset, we demonstrate how our technique recovers 4.3% in accuracy with the same hardware usage.
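As a generic illustration of why very low precision hurts accuracy (the effect the mixed-precision layers are meant to recover), the NumPy sketch below applies uniform quantization to placeholder activations at a few bit-widths. The bit-widths and value range are arbitrary examples, not the configuration used in the paper's networks.

```python
# Generic illustration (not the paper's design): uniform quantization of
# activations to a few bits, showing the error that motivates keeping
# selected layers at higher precision.
import numpy as np

def quantize(x, bits, x_min=-1.0, x_max=1.0):
    """Uniform affine quantization of x to 2**bits levels over [x_min, x_max]."""
    levels = 2 ** bits - 1
    scale = (x_max - x_min) / levels
    q = np.clip(np.round((x - x_min) / scale), 0, levels)
    return q * scale + x_min

rng = np.random.default_rng(0)
acts = np.tanh(rng.normal(size=10000))  # placeholder activations in (-1, 1)
for bits in (2, 4, 8):
    err = np.mean((acts - quantize(acts, bits)) ** 2)
    print(f"{bits}-bit activations: mean squared error {err:.2e}")
```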
{"title":"Real-time Automatic Modulation Classification using RFSoC","authors":"Stephen Tridgell, D. Boland, P. Leong, R. Kastner, Alireza Khodamoradi, Siddhartha","doi":"10.1109/IPDPSW50202.2020.00021","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00021","url":null,"abstract":"The computational complexity of deep learning has led to research efforts to reduce the computation required. The use of low precision is particularly effective on FPGAs as they are not restricted to byte addressable operations. Very low precision activations and weights can have a significant impact on the accuracy however. This work demonstrates by exploiting throughput matching that higher precision on certain layers can be used to recover this accuracy. This is applied to the domain of automatic modulation classification for radio signals leveraging the RF capabilities offered by the Xilinx ZCU111 RFSoC platform. The implemented networks achieve high-speed real-time performance with a classification latency of $approx8mu$s, and an operational throughput of 488k classifications per second. On the open-source RadioML dataset, we demonstrate how to recover 4.3% in accuracy with the same hardware usage with our technique.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122262462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}