Title: Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00171
Quentin G. Anthony, A. Awan, Arpan Jain, H. Subramoni, D. Panda
Deep Learning (DL) models for semantic image segmentation are an emerging trend in response to the explosion of multi-class, high-resolution image and video data. However, segmentation models are highly compute-intensive, and even the fastest Volta GPUs cannot train them in a reasonable time frame. In our experiments, we observed just 6.7 images/second on a single Volta GPU for training DeepLab-v3+ (DLv3+), a state-of-the-art Encoder-Decoder model for semantic image segmentation. For comparison, a Volta GPU can process 300 images/second for training ResNet-50, a state-of-the-art model for image classification. In this context, we see a clear opportunity to utilize supercomputers to speed up the training of segmentation models. However, most published studies on the performance of novel DL models such as DLv3+ require the user to significantly change Horovod, MPI, and the DL model to improve performance. Our work proposes an alternative tuning method that achieves near-linear scaling without significant changes to Horovod, MPI, or the DL model. In this paper, we select DLv3+ as the candidate TensorFlow model and implement Horovod-based distributed training for DLv3+. We observed poor default scaling performance of DLv3+ on the Summit system at Oak Ridge National Laboratory. To address this, we conducted an in-depth performance tuning of various Horovod/MPI knobs to achieve better performance over the default parameters. We present a comprehensive scaling comparison for Horovod with MVAPICH2-GDR up to 132 GPUs on Summit. Our optimization approach achieves near-linear (92%) scaling with MVAPICH2-GDR. We achieved an mIoU (mean Intersection over Union) accuracy of 80.8% for distributed training, which is on par with the published accuracy for this model. Further, we demonstrate an improvement in scaling efficiency of 23.9% over default Horovod training, which translates to a 1.3× speedup in training performance.
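For context on the Horovod-based setup described above, here is a minimal sketch of Horovod data-parallel training in TensorFlow/Keras. It is not the authors' DLv3+ training script: the model and dataset are placeholders, and the tuning knobs mentioned in the comments (e.g., HOROVOD_FUSION_THRESHOLD, HOROVOD_CYCLE_TIME) are examples of commonly tuned Horovod parameters rather than the settings used in the paper.

```python
# Minimal sketch of Horovod data-parallel training in TensorFlow/Keras.
# This is NOT the paper's DLv3+ code; the model and dataset are placeholders.
# Commonly tuned knobs are set in the job script, e.g. HOROVOD_FUSION_THRESHOLD
# (tensor-fusion buffer size) and HOROVOD_CYCLE_TIME (fusion cycle time).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each Horovod rank to one local GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=21)  # placeholder model

# Scale the learning rate by the number of workers and wrap the optimizer so that
# gradients are allreduced (the MPI library underneath may be MVAPICH2-GDR).
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# dataset = ...  # per-rank shard of the training data (omitted)
# model.fit(dataset, epochs=..., callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)
```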
{"title":"Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR","authors":"Quentin G. Anthony, A. Awan, Arpan Jain, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW50202.2020.00171","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00171","url":null,"abstract":"Deep Learning (DL) models for semantic image segmentation are an emerging trend in response to the explosion of multi-class, high resolution image and video data. However, segmentation models are highly compute-intensive, and even the fastest Volta GPUs cannot train them in a reasonable time frame. In our experiments, we observed just 6.7 images/second on a single Volta GPU for training DeepLab-v3+ (DLv3+), a state-of-the-art Encoder-Decoder model for semantic image segmentation. For comparison, a Volta GPU can process 300 images/second for training ResNet-50, a state-of-the-art model for image classification. In this context, we see a clear opportunity to utilize supercomputers to speed up training of segmentation models. However, most published studies on the performance of novel DL models such as DLv3+ require the user to significantly change Horovod, MPI, and the DL model to improve performance. Our work proposes an alternative tuning method that achieves near-linear scaling without significant changes to Horovod, MPI, or the DL model. In this paper, we select DLv3+ as the candidate TensorFlow model and implement Horovod-based distributed training for DLv3+. We observed poor default scaling performance of DLv3+ on the Summit system at Oak Ridge National Laboratory. To address this, we conducted an in-depth performance tuning of various Horovod/MPI knobs to achieve better performance over the default parameters. We present a comprehensive scaling comparison for Horovod with MVAPICH2-GDR up to 132 GPUs on Summit. Our optimization approach achieves near-linear (92%) scaling with MVAPICH2-GDR. We achieved a “mIOU” accuracy of 80.8% for distributed training, which is on par with published accuracy for this model. Further, we demonstrate an improvement in scaling efficiency by 23.9% over default Horovod training, which translates to a 1.3× speedup in training performance.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121643823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Tri-Objective Workflow Scheduling and Optimization in Heterogeneous Cloud Environments
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00129
Huda Alrammah, Yi Gu, Zhifeng Liu
Cloud computing has become the most popular distributed computing paradigm, delivering scalable resources for the efficient execution of large-scale scientific workflows. However, the large number of user requests and the limited cloud resources pose significant challenges for resource allocation, scheduling/mapping, power consumption, and monetary cost. Therefore, how to schedule and optimize workflow execution in a cloud environment has become the most critical factor in improving overall performance. Moreover, Multi-objective Optimization Problems (MOPs) along with heterogeneous cloud environments have made resource utilization and workflow scheduling even more challenging. In this work, we propose a novel algorithm, named Multi-objective Optimization for Makespan, Cost and Energy (MOMCE), to efficiently assign tasks to cloud resources in order to reduce the total execution time, monetary cost, and energy consumption of scientific workflows. Experimental results demonstrate the optimization stability and robustness of the MOMCE algorithm, which achieves better fitness values than other existing algorithms.
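The abstract does not spell out MOMCE's formulation; as a generic illustration of how makespan, monetary cost, and energy can be folded into a single fitness value for comparing candidate schedules, here is a hedged sketch using a normalized weighted sum. The weights, normalization bounds, and example numbers are hypothetical, not MOMCE's actual parameters.

```python
# Hedged sketch: a generic normalized weighted-sum fitness for a tri-objective
# workflow-scheduling problem (makespan, monetary cost, energy). The weights and
# the min/max normalization bounds are hypothetical; MOMCE's actual fitness may differ.
from dataclasses import dataclass

@dataclass
class Objectives:
    makespan: float  # seconds
    cost: float      # dollars
    energy: float    # joules

def normalize(value, lo, hi):
    """Map an objective value into [0, 1] given estimated bounds."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def fitness(obj, bounds, w_makespan=0.4, w_cost=0.3, w_energy=0.3):
    """Lower is better: weighted sum of normalized objectives."""
    return (w_makespan * normalize(obj.makespan, *bounds['makespan'])
            + w_cost * normalize(obj.cost, *bounds['cost'])
            + w_energy * normalize(obj.energy, *bounds['energy']))

# Example: compare two candidate task-to-resource assignments.
bounds = {'makespan': (100.0, 1000.0), 'cost': (1.0, 50.0), 'energy': (1e5, 1e7)}
a = Objectives(makespan=400.0, cost=12.0, energy=2.5e6)
b = Objectives(makespan=550.0, cost=8.0, energy=1.8e6)
best = min((a, b), key=lambda o: fitness(o, bounds))
print('preferred schedule:', best)
```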
{"title":"Tri-Objective Workflow Scheduling and Optimization in Heterogeneous Cloud Environments","authors":"Huda Alrammah, Yi Gu, Zhifeng Liu","doi":"10.1109/IPDPSW50202.2020.00129","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00129","url":null,"abstract":"Cloud computing has become the most popular distributed computing paradigm among others which delivers scalable resources for efficient execution of large-scale scientific workflows. However, the large number of user requests and the limited cloud resources have posed a significant challenge on resource allocation, scheduling/mapping, power consumption, monetary cost, and so on. Therefore, how to schedule and optimize workflow execution in a cloud environment has become the most critical factor in improving the overall performance. Moreover, Multi-objective Optimization Problems (MOPs) along with heterogeneous cloud environments have made resource utilization and workflow scheduling even more challenging. In this work, we propose a novel algorithm, named Multi-objective Optimization for Makespan, Cost and Energy (MOMCE), to efficiently assign tasks to cloud resources in order to reduce total execution time, monetary cost, and energy consumption of scientific workflows. The experimental results have demonstrated the optimization stability and robustness of MOMCE algorithm for achieving a better fitness value in comparison with other existing algorithms.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132942269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Improving Collective I/O Performance with Machine Learning Supported Auto-tuning
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00138
Ayse Bagbaba
Collective input and output (I/O) is an essential approach in high-performance computing (HPC) applications. Achieving effective collective I/O is nontrivial due to the complex interdependencies between the layers of the I/O stack. These layers offer the best possible I/O performance through a number of tunable parameters; unfortunately, the correct combination of parameters depends on the application and the HPC platform. As the configuration space grows, it becomes difficult for humans to reason about the interactions between configuration options. Engineers often have neither the time nor the experience to explore good configuration parameters for each problem because of the long benchmarking phase. In most cases, the default settings are used, often leading to poor I/O efficiency. I/O profiling tools cannot identify optimal configurations without substantial effort spent analyzing tracing results. An auto-tuning solution that optimizes collective I/O requests and provides system administrators or engineers with statistical information is therefore strongly needed. In this paper, a study of machine-learning-supported collective I/O auto-tuning, including the architecture and software stack, is performed. A random forest regression model is used to develop a performance predictor that captures parallel I/O behavior as a function of application and file-system characteristics. The modeling approach can provide insights into the metrics that impact I/O performance significantly.
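As a hedged sketch of the modeling step described above, the snippet below trains a random-forest regressor to predict collective I/O bandwidth from tunable parameters. The feature set and the synthetic data are illustrative assumptions, not the paper's dataset or tool.

```python
# Hedged sketch: training a random-forest regressor to predict collective I/O
# bandwidth from application/file-system features, in the spirit of the paper.
# The feature names and synthetic data are illustrative, not the author's dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(1, 512, n),       # number of MPI processes
    rng.integers(1, 64, n),        # number of I/O aggregators (cb_nodes)
    2 ** rng.integers(16, 27, n),  # collective buffer size (bytes)
    2 ** rng.integers(20, 32, n),  # file-system stripe size (bytes)
    rng.integers(1, 32, n),        # file-system stripe count
])
# Synthetic target: bandwidth in MB/s (placeholder relationship, not real data).
y = 50 + 0.5 * X[:, 1] + 1e-6 * X[:, 2] + rng.normal(0, 10, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print('R^2 on held-out configurations:', model.score(X_test, y_test))
print('feature importances:', model.feature_importances_)
```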
{"title":"Improving Collective I/O Performance with Machine Learning Supported Auto-tuning","authors":"Ayse Bagbaba","doi":"10.1109/IPDPSW50202.2020.00138","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00138","url":null,"abstract":"Collective Input and output (I/O) is an essential approach in high performance computing (HPC) applications. The achievement of effective collective I/O is a nontrivial job due to the complex interdependencies between the layers of I/O stack. These layers provide the best possible I/O performance through a number of tunable parameters. Sadly, the correct combination of parameters depends on diverse applications and HPC platforms. When a configuration space gets larger, it becomes difficult for humans to monitor the interactions between the configuration options. Engineers has no time or experience for exploring good configuration parameters for each problem because of long benchmarking phase. In most cases, the default settings are implemented, often leading to poor I/O efficiency. I/O profiling tools can not tell the optimal default setups without too much effort to analyzing the tracing results. In this case, an auto-tuning solution for optimizing collective I/O requests and providing system administrators or engineers the statistic information is strongly required. In this paper, a study of the machine learning supported collective I/O auto-tuning including the architecture and software stack is performed. Random forest regression model is used to develop a performance predictor model that can capture parallel I/O behavior as a function of application and file system characteristics. The modeling approach can provide insights into the metrics that impact I/O performance significantly.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133094308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Indirect Deconvolution Algorithm
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00154
Marat Dukhan
Neural network frameworks today commonly implement the Deconvolution operator and the closely related Convolution operator via a combination of GEMM (dense matrix-matrix multiplication) and a memory transformation. The recently proposed Indirect Convolution algorithm suggests a more efficient implementation of Convolution via the Indirect GEMM primitive - a modification of GEMM where pointers to rows are loaded from a buffer rather than being computed assuming a constant stride. However, the algorithm is inefficient for Deconvolution with non-unit stride, which is typical in computer vision models. We describe a novel Indirect Deconvolution algorithm for efficient evaluation of the Deconvolution operator with non-unit stride by splitting a Deconvolution with a large kernel into multiple subconvolutions with smaller, variable-size kernels, which can be efficiently implemented on top of the Indirect GEMM primitive.
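The decomposition described above can be verified directly in one dimension: a stride-2 deconvolution with a length-4 kernel equals two subconvolutions with length-2 sub-kernels whose outputs are interleaved. The NumPy sketch below is only an illustration of that split, not the Indirect-GEMM implementation.

```python
# Illustrative 1-D check (not the paper's Indirect-GEMM code): a stride-2
# deconvolution with a length-4 kernel equals two subconvolutions with
# length-2 sub-kernels whose outputs are interleaved.
import numpy as np

def deconv1d_direct(x, w, stride=2):
    """Reference transposed convolution: y[stride*k + j] += x[k] * w[j]."""
    y = np.zeros(stride * (len(x) - 1) + len(w))
    for k, xk in enumerate(x):
        y[stride * k: stride * k + len(w)] += xk * w
    return y

def deconv1d_subconv(x, w):
    """Stride-2 deconvolution via two smaller convolutions (the split idea)."""
    even = np.convolve(x, w[0::2])  # produces output positions 0, 2, 4, ...
    odd = np.convolve(x, w[1::2])   # produces output positions 1, 3, 5, ...
    y = np.empty(len(even) + len(odd))
    y[0::2], y[1::2] = even, odd
    return y

x = np.array([1.0, 2.0, -1.0, 3.0])
w = np.array([10.0, 20.0, 30.0, 40.0])
assert np.allclose(deconv1d_direct(x, w), deconv1d_subconv(x, w))
print(deconv1d_subconv(x, w))
```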
{"title":"Indirect Deconvolution Algorithm","authors":"Marat Dukhan","doi":"10.1109/IPDPSW50202.2020.00154","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00154","url":null,"abstract":"Neural network frameworks today commonly implement Deconvolution and closely related Convolution operator via a combination of GEMM (dense matrix-matrix multiplication) and a memory transformation. The recently proposed Indirect Convolution algorithm suggests a more efficient implementation of Convolution via the Indirect GEMM primitive - a modification of GEMM where pointers to rows are loaded from a buffer rather than being computed assuming constant stride. However, the algorithm is inefficient for Deconvolution with non-unit stride, which is typical in computer vision models. We describe a novel Indirect Deconvolution algorithm for efficient evaluation of the Deconvolution operator with nonunit stride by splitting Deconvolution with a large kernel into multiple subconvolutions with smaller, variable-size kernels, which can be efficiently implemented on top of the Indirect GEMM primitive.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"35 29","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131806225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Regression WiSARD application of controller on DC STATCOM converter under fault conditions
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00145
Raphael N. C. B. Rocha, L. L. Filho, M. Aredes, F. França, P. Lima
Capable of supplying local loads, DC microgrids have received much attention in the last decade for alleviating power flow through the main power grid. This has been achieved through the use of edge devices to control the converters, but, among other problems, microgrids have stability issues when Constant Power Loads (CPLs) are present. This problem has already been addressed in the literature with the DC STATCOM power converter, which can handle grid operation in normal operating mode. However, in fault cases, the available solutions still fail to reject faults or may even contribute to them. The present work explores the potential of a lightweight machine learning algorithm, a Weightless Artificial Neural Network (WANN), for predicting the output of the original DC STATCOM controller on an edge device connected to a converter, and investigates its generalization capability under microgrid fault situations. The WANN used is based on the regression variant of the Wilkes, Stonham, and Aleksander Recognition Device (WiSARD), coined as Regression WiSARD (ReW). The evaluation criteria employed measure the capability of the controller to reject the fault condition. Initial experiments showed surprisingly good results in comparison to the original DC STATCOM controller, indicating that a ReW-based controller performs the role of the DC STATCOM well and is able to cope with fault situations.
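The abstract does not detail ReW's internals; the sketch below is a simplified, generic n-tuple (WiSARD-style) regressor in which each RAM stores a running sum and count per address, and prediction averages the stored means. The tuple size, thermometer encoding, and toy training signal are assumptions for illustration, not the authors' configuration.

```python
# Minimal, generic sketch of an n-tuple (WiSARD-style) regressor: each RAM stores
# a running (sum, count) per address and prediction averages the stored means.
# This is a simplified stand-in for Regression WiSARD (ReW), not the authors' code.
import random
from collections import defaultdict

class NTupleRegressor:
    def __init__(self, input_bits, tuple_size=4, seed=0):
        rng = random.Random(seed)
        bits = list(range(input_bits))
        rng.shuffle(bits)
        # Partition the (shuffled) input bits into tuples, one RAM per tuple.
        self.tuples = [bits[i:i + tuple_size] for i in range(0, input_bits, tuple_size)]
        self.rams = [defaultdict(lambda: [0.0, 0]) for _ in self.tuples]

    def _addresses(self, x_bits):
        for tup in self.tuples:
            yield tuple(x_bits[b] for b in tup)

    def train(self, x_bits, y):
        for ram, addr in zip(self.rams, self._addresses(x_bits)):
            cell = ram[addr]
            cell[0] += y
            cell[1] += 1

    def predict(self, x_bits):
        means = []
        for ram, addr in zip(self.rams, self._addresses(x_bits)):
            s, c = ram[addr]
            if c > 0:
                means.append(s / c)
        return sum(means) / len(means) if means else 0.0

def thermometer(value, lo, hi, bits=16):
    """Simple thermometer encoding of a scalar into a binary vector."""
    level = int(round((value - lo) / (hi - lo) * bits))
    return [1 if i < level else 0 for i in range(bits)]

# Toy usage: learn a noisy linear controller output from a scalar measurement.
model = NTupleRegressor(input_bits=16, tuple_size=4)
for _ in range(2000):
    v = random.uniform(0.0, 1.0)
    model.train(thermometer(v, 0.0, 1.0), 2.0 * v + 0.1)
print(model.predict(thermometer(0.5, 0.0, 1.0)))  # target for v = 0.5 is 2*0.5 + 0.1 = 1.1
```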
{"title":"Regression WiSARD application of controller on DC STATCOM converter under fault conditions","authors":"Raphael N. C. B. Rocha, L. L. Filho, M. Aredes, F. França, P. Lima","doi":"10.1109/IPDPSW50202.2020.00145","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00145","url":null,"abstract":"Capable of supplying local loads, DC microgrids have received much attention in the last decade for alleviating power flow through the main power grid. This has been achieved through the use of edge devices on the control of the converters, but, among other problems, microgrids have stability issues when Constant Power Loads (CPL) are present. This problem was already solved in the literature with the DC STATCOM power converter, in normal operation mode, it can deal with the grid operation. However, in fault cases, the solutions available still fail to ignore faults or even contribute to them. The present work aims to explore the potential of a light machine learning algorithm of the type Weightless Artificial Neural Network (WANN) for predicting the output of the original controller used in the DC STATCOM on an edge device connected to a converter, and investigate its generalization capability under microgrid fault situations. The WANN used is based on the regression variant of the Wilkes, Stonham, and Aleksander Recognition Device (WiSARD), coined as Regression WiSARD (ReW). The evaluation criteria employed measured the capability of the controller to reject the fault condition. Initial results showed surprisingly good results in comparison to the original DC STATCOM controller, indicating that a ReW-based controller plays well the role of the DC STATCOM and was able to cope with fault situations.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116748427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Pinocchio: A Blockchain-Based Algorithm for Sensor Fault Tolerance in Low Trust Environment
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00077
Chen Zeng, Yifan Wang, Fan Liang, Xiaohui Peng
With the advancement of the IoT market and technology, sensor-equipped devices belonging to different interest groups have more opportunities to cooperate in fulfilling an assignment. Ensuring agreement on correct sensed data in such a low-trust environment is a new challenge in today's market. To bridge this gap, we propose Pinocchio, a blockchain-based algorithm that tolerates faults in Wireless Sensor Networks (WSNs) and defends sensed data against malicious data-tampering and masquerade attacks. Compared to other distributed approaches to sensor fault tolerance, Pinocchio greatly reduces the message complexity of the entire network to $O(N)$ and that of a single node to $O(1)$. Considering the possible waste of resources brought about by vicious competition for hash power in the blockchain-based approach, we design the Geppetto algorithm to supervise and control hash power in a distributed manner; its effectiveness is demonstrated by experiments.
{"title":"Pinocchio: A Blockchain-Based Algorithm for Sensor Fault Tolerance in Low Trust Environment","authors":"Chen Zeng, Yifan Wang, Fan Liang, Xiaohui Peng","doi":"10.1109/IPDPSW50202.2020.00077","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00077","url":null,"abstract":"As the advancement of the IoT market and technology, sensor-equipped devices belonging to different interest groups have more opportunities to cooperate, fulfilling an assignment. How to ensure an agreement on correct sensed data in such a low trust environment is a new challenge in nowadays market. To bridge the gap, we propose Pinocchio, a blockchain-based algorithm which can tolerate faults in WSN (Wireless Sensor Network) along with the ability to defend sensed data against malicious attacks of data tampering and masquerade. Compared to other distributed approaches of sensor fault tolerance, Pinocchio greatly reduced the message complexity of the entire network to $O(N)$ and that of a single node to $O(1)$. Considering the possible waste of resources brought about by vicious competition of hash power in the blockchain-based approach, we design the Geppetto algorithm to supervise and control hash power in a distributed manner, and its effectiveness is demonstrated by experiments.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126405275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00083
N. Fujita, Ryohei Kobayashi, Y. Yamaguchi, Tomohiro Ueno, K. Sano, T. Boku
In recent years, many High Performance Computing (HPC) researchers have been attracted to utilizing Field Programmable Gate Arrays (FPGAs) for HPC applications. FPGAs can be used for communication as well as computation thanks to their I/O capabilities. HPC scientists have struggled to utilize FPGAs for their applications because of the difficulty of FPGA development; however, High-Level Synthesis (HLS) allows them to be used at a reasonable development cost. In this study, we propose the Communication Integrated Reconfigurable CompUting System (CIRCUS), which enables the high-speed interconnect of FPGAs to be used from OpenCL. CIRCUS builds a single fused pipeline combining computation and communication, hiding the communication latency by completely overlapping the two. In this paper, we present the details of the implementation and evaluation results using two benchmarks: a pingpong benchmark and an allreduce benchmark.
{"title":"Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA","authors":"N. Fujita, Ryohei Kobayashi, Y. Yamaguchi, Tomohiro Ueno, K. Sano, T. Boku","doi":"10.1109/IPDPSW50202.2020.00083","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00083","url":null,"abstract":"In recent years, much High Performance Computing (HPC) researchers attract to utilize Field Programmable Gate Arrays (FPGAs) for HPC applications. We can use FPGAs for communication as well as computation thanks to FPGA’s I/O capabilities. HPC scientists cannot utilize FPGAs for their applications because of the difficulty of the FPGA development, however High Level Synthesis (HLS) allows them to use with appropriate costs. In this study, we propose a Communication Integrated Reconfigurable CompUting System (CIRCUS) to enable us to utilize high-speed interconnection of FPGAS from OpenCL. CIRCUS makes a fused single pipeline combining the computation and the communication, which hides the communication latency by completely overlapping them. In this paper, we present the detail of the implementation and the evaluation result using two benchmarks: pingpong benchmark and allreduce benchmark.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125726824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00147
Amit Ruhela, Shulei Xu, K. V. Manian, H. Subramoni, D. Panda
Communication interfaces of High Performance Computing (HPC) systems, Cloud middleware, and Deep Learning (DL) frameworks have been continually evolving to meet the ever-increasing communication demands placed on them by HPC, Cloud, and DL applications. Modern high-performance interconnects like InfiniBand EDR and InfiniBand HDR are capable of delivering 100 Gbps and 200 Gbps speeds, respectively. However, no previous study has demonstrated how much benefit an end-user in the HPC, Cloud, and DL computing domains can expect by utilizing newer generations of these interconnects over the older ones. In this paper, we evaluate the InfiniBand EDR and HDR high-performance interconnects over the PCIe Gen3 interface with HPC, Cloud, and DL workloads. Our comprehensive analysis, done at different levels, provides a global view of the impact these modern interconnects have on the performance of HPC, Cloud, and DL applications. The results of our experiments show that the latest InfiniBand HDR interconnect gives the best performance for all three computing domains.
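Interconnect comparisons of this kind are usually driven by point-to-point microbenchmarks (OSU-style latency/bandwidth tests). The mpi4py loop below is a simplified bandwidth sketch for illustration only; it is not the benchmark suite, message sizes, or methodology used in the paper.

```python
# Simplified point-to-point bandwidth microbenchmark (OSU-style) in mpi4py.
# Illustrative only -- not the benchmark suite or parameters used in the paper.
# Run with two ranks, e.g.: mpirun -np 2 python bw_test.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iters, warmup = 100, 10

for size in [2 ** p for p in range(10, 25)]:  # 1 KiB .. 16 MiB
    buf = np.zeros(size, dtype=np.uint8)
    ack = np.zeros(1, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for i in range(iters + warmup):
        if i == warmup:
            t0 = MPI.Wtime()  # restart the timer after warmup iterations
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1)
        elif rank == 1:
            comm.Recv([buf, MPI.BYTE], source=0)
    # Rank 1 acknowledges so rank 0's timer covers full delivery of all messages.
    if rank == 0:
        comm.Recv([ack, MPI.BYTE], source=1)
        elapsed = MPI.Wtime() - t0
        print(f"{size:>10d} bytes  {size * iters / elapsed / 1e9:8.2f} GB/s")
    elif rank == 1:
        comm.Send([ack, MPI.BYTE], dest=0)
```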
{"title":"Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR","authors":"Amit Ruhela, Shulei Xu, K. V. Manian, H. Subramoni, D. Panda","doi":"10.1109/IPDPSW50202.2020.00147","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00147","url":null,"abstract":"Communication interfaces of High Performance Computing (HPC) systems, Cloud middleware, and Deep Learning (DL) frameworks have been continually evolving to meet the ever-increasing communication demands being placed on them by HPC, Cloud, and DL applications. Modern high performance interconnects like InfiniBand EDR 100 Gbps, InfiniBand HDR 200 Gbps are capable of delivering 100 Gbps and 200 Gbps speeds. However, no previous study has demonstrated how much benefit an end-user in the HPC, Cloud, and DL computing domain can expect by utilizing newer generations of these interconnects over the older ones. In this paper, we evaluate the InfiniBand EDR and HDR high performance interconnects over the PCIe Gen3 interface with HPC, Cloud, and DL workloads. Our comprehensive analysis, done at different levels, provides a global scope of the impact these modern interconnects have on the performance of HPC, Cloud, and DL applications. The results of our experiments show that the latest InfiniBand HDR interconnect gives the best performance for all three computing domains.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125907614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Kcollections: A Fast and Efficient Library for K-mers
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00041
M. Fujimoto, Cole A. Lyman, M. Clement
K-mers form the backbone of many bioinformatic algorithms. They are, however, difficult to store and use efficiently because the number of k-mers increases exponentially as $k$ increases. Many algorithms exist for compressed storage of k-mers but suffer from slow insert times or are probabilistic, resulting in false-positive k-mers. Furthermore, k-mer libraries usually specialize in associating specific values with k-mers, such as a color in colored de Bruijn graphs or a k-mer count. We present kcollections (https://www.github.com/masakistan/kcollections), a compressed and parallel data structure designed for k-mers generated from whole, assembled genomes. Kcollections is available for C++ and provides set- and map-like structures as well as a k-mer counting data structure, all of which utilize parallel operations designed using a MapReduce paradigm. Additionally, we provide basic Python bindings for rapid prototyping. Kcollections makes developing bioinformatic algorithms simpler by abstracting away the tedious task of storing k-mers.
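For context on the workload such a library handles, the snippet below extracts and counts canonical k-mers with plain Python containers. It is a generic illustration, not the kcollections API, and it shows why memory grows so quickly with k (up to 4^k distinct k-mers).

```python
# Generic illustration of the k-mer workload a library like kcollections handles:
# extract canonical k-mers from a sequence and count them with a plain dict.
# (This is not the kcollections API; it just shows what must be stored.)
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_kmers(seq, k):
    """Yield the lexicographically smaller of each k-mer and its reverse complement."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(COMPLEMENT)[::-1]
        yield min(kmer, rc)

seq = "ACGTACGTGACCTGA"
counts = Counter(canonical_kmers(seq, k=5))
print(len(counts), "distinct canonical 5-mers")
print(counts.most_common(3))
# With k = 31 there are up to 4**31 (~4.6e18) possible k-mers, which is why
# compressed, parallel structures are needed for whole-genome inputs.
```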
{"title":"Kcollections: A Fast and Efficient Library for K-mers","authors":"M. Fujimoto, Cole A. Lyman, M. Clement","doi":"10.1109/IPDPSW50202.2020.00041","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00041","url":null,"abstract":"K-mers form the backbone of many bioinformatic algorithms. They are, however, difficult to store and use efficiently because the number of k-mers increases exponentially as $k$ increases. Many algorithms exist for compressed storage of kmers but suffer from slow insert times or are probabilistic resulting in false-positive k-mers. Furthermore, k-mer libraries usually specialize in associating specific values with k-mers such as a color in colored de Bruijn Graphs or k-mer count. We present kcollections1, a compressed and parallel data structure designed for k-mers generated from whole, assembled genomes. Kcollections is available for $mathrm {C}++$ and provides set-and maplike structures as well as a k-mer counting data structure all of which utilize parallel operations designed using a MapReduce paradigm. Additionally, we provide basic Python bindings for rapid prototyping. Kcollections makes developing bioinformatic algorithms simpler by abstracting away the tedious task of storing k-mers.1https://www.github.com/masakistan/kcollections","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122391858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Real-time Automatic Modulation Classification using RFSoC
Venue: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Pub Date: 2020-05-01  DOI: 10.1109/IPDPSW50202.2020.00021
Stephen Tridgell, D. Boland, P. Leong, R. Kastner, Alireza Khodamoradi, Siddhartha
The computational complexity of deep learning has led to research efforts to reduce the computation required. The use of low precision is particularly effective on FPGAs as they are not restricted to byte-addressable operations. However, very low precision activations and weights can have a significant impact on accuracy. This work demonstrates that, by exploiting throughput matching, higher precision on certain layers can be used to recover this accuracy. This is applied to the domain of automatic modulation classification for radio signals, leveraging the RF capabilities offered by the Xilinx ZCU111 RFSoC platform. The implemented networks achieve high-speed real-time performance with a classification latency of approximately 8 μs and an operational throughput of 488k classifications per second. On the open-source RadioML dataset, we demonstrate how our technique recovers 4.3% in accuracy with the same hardware usage.
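As a generic illustration of why very low precision hurts accuracy (the effect the mixed-precision layers are meant to recover), the NumPy sketch below applies uniform quantization to placeholder activations at a few bit-widths. The bit-widths and value range are arbitrary examples, not the configuration used in the paper's networks.

```python
# Generic illustration (not the paper's design): uniform quantization of
# activations to a few bits, showing the error that motivates keeping
# selected layers at higher precision.
import numpy as np

def quantize(x, bits, x_min=-1.0, x_max=1.0):
    """Uniform affine quantization of x to 2**bits levels over [x_min, x_max]."""
    levels = 2 ** bits - 1
    scale = (x_max - x_min) / levels
    q = np.clip(np.round((x - x_min) / scale), 0, levels)
    return q * scale + x_min

rng = np.random.default_rng(0)
acts = np.tanh(rng.normal(size=10000))  # placeholder activations in (-1, 1)
for bits in (2, 4, 8):
    err = np.mean((acts - quantize(acts, bits)) ** 2)
    print(f"{bits}-bit activations: mean squared error {err:.2e}")
```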
{"title":"Real-time Automatic Modulation Classification using RFSoC","authors":"Stephen Tridgell, D. Boland, P. Leong, R. Kastner, Alireza Khodamoradi, Siddhartha","doi":"10.1109/IPDPSW50202.2020.00021","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00021","url":null,"abstract":"The computational complexity of deep learning has led to research efforts to reduce the computation required. The use of low precision is particularly effective on FPGAs as they are not restricted to byte addressable operations. Very low precision activations and weights can have a significant impact on the accuracy however. This work demonstrates by exploiting throughput matching that higher precision on certain layers can be used to recover this accuracy. This is applied to the domain of automatic modulation classification for radio signals leveraging the RF capabilities offered by the Xilinx ZCU111 RFSoC platform. The implemented networks achieve high-speed real-time performance with a classification latency of $approx8mu$s, and an operational throughput of 488k classifications per second. On the open-source RadioML dataset, we demonstrate how to recover 4.3% in accuracy with the same hardware usage with our technique.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122262462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}