
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW): Latest Publications

Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00171
Quentin G. Anthony, A. Awan, Arpan Jain, H. Subramoni, D. Panda
Deep Learning (DL) models for semantic image segmentation are an emerging trend in response to the explosion of multi-class, high resolution image and video data. However, segmentation models are highly compute-intensive, and even the fastest Volta GPUs cannot train them in a reasonable time frame. In our experiments, we observed just 6.7 images/second on a single Volta GPU for training DeepLab-v3+ (DLv3+), a state-of-the-art Encoder-Decoder model for semantic image segmentation. For comparison, a Volta GPU can process 300 images/second for training ResNet-50, a state-of-the-art model for image classification. In this context, we see a clear opportunity to utilize supercomputers to speed up training of segmentation models. However, most published studies on the performance of novel DL models such as DLv3+ require the user to significantly change Horovod, MPI, and the DL model to improve performance. Our work proposes an alternative tuning method that achieves near-linear scaling without significant changes to Horovod, MPI, or the DL model. In this paper, we select DLv3+ as the candidate TensorFlow model and implement Horovod-based distributed training for DLv3+. We observed poor default scaling performance of DLv3+ on the Summit system at Oak Ridge National Laboratory. To address this, we conducted an in-depth performance tuning of various Horovod/MPI knobs to achieve better performance over the default parameters. We present a comprehensive scaling comparison for Horovod with MVAPICH2-GDR up to 132 GPUs on Summit. Our optimization approach achieves near-linear (92%) scaling with MVAPICH2-GDR. We achieved a “mIOU” accuracy of 80.8% for distributed training, which is on par with published accuracy for this model. Further, we demonstrate an improvement in scaling efficiency by 23.9% over default Horovod training, which translates to a 1.3× speedup in training performance.
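The tuning described above happens at the Horovod/MPI layer, so the model code stays essentially standard. For readers unfamiliar with Horovod, a minimal sketch of data-parallel TensorFlow training follows; the tiny stand-in network, loss, and hyperparameters are illustrative placeholders, not the paper's DLv3+ pipeline, and MVAPICH2-GDR would simply serve as the MPI library underneath.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched via mpirun/jsrun

# Pin each process to its own local GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Tiny stand-in model; the paper trains DeepLab-v3+ instead
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, padding='same', activation='relu',
                           input_shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(21, 1),  # per-pixel class logits
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale LR by worker count

@tf.function
def train_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    tape = hvd.DistributedGradientTape(tape)  # allreduce-averaged gradients
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Start all workers from rank 0's initial weights
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)
    return loss
```

Launched as, e.g., `mpirun -np 132 python train.py`, each rank processes its own shard of the data while Horovod averages gradients; the kinds of Horovod/MPI knobs the paper tunes sit below this API, so tuning them requires no changes to the model code.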
Citations: 5
Tri-Objective Workflow Scheduling and Optimization in Heterogeneous Cloud Environments
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00129
Huda Alrammah, Yi Gu, Zhifeng Liu
Cloud computing has become the most popular distributed computing paradigm, delivering scalable resources for the efficient execution of large-scale scientific workflows. However, the large number of user requests and the limited cloud resources pose significant challenges for resource allocation, scheduling/mapping, power consumption, monetary cost, and so on. Therefore, how to schedule and optimize workflow execution in a cloud environment has become the most critical factor in improving overall performance. Moreover, Multi-objective Optimization Problems (MOPs), together with heterogeneous cloud environments, make resource utilization and workflow scheduling even more challenging. In this work, we propose a novel algorithm, named Multi-objective Optimization for Makespan, Cost and Energy (MOMCE), to efficiently assign tasks to cloud resources so as to reduce the total execution time, monetary cost, and energy consumption of scientific workflows. The experimental results demonstrate the stability and robustness of the MOMCE algorithm, which achieves a better fitness value than other existing algorithms.
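MOMCE's internals are not spelled out in the abstract, but any tri-objective scheduler needs a way to score a candidate task-to-VM mapping against all three objectives at once. A generic illustration follows; the weighted-sum normalization and the absence of DAG dependencies are simplifying assumptions of this sketch, not properties of MOMCE.

```python
import numpy as np

def evaluate(mapping, runtime, price, power):
    """Compute (makespan, cost, energy) for a task->VM mapping.

    mapping[i] = VM index for task i; runtime[i, j] = runtime of task i
    on VM j; price[j] / power[j] = per-second cost / wattage of VM j.
    Ignores DAG dependencies -- purely illustrative.
    """
    vm_busy = np.zeros(runtime.shape[1])
    cost = energy = 0.0
    for task, vm in enumerate(mapping):
        t = runtime[task, vm]
        vm_busy[vm] += t            # serial execution per VM
        cost += t * price[vm]
        energy += t * power[vm]
    return vm_busy.max(), cost, energy

def fitness(objs, refs, weights=(1/3, 1/3, 1/3)):
    """Weighted sum of objectives, each normalized by a reference value
    (e.g. the objective's value under a baseline schedule). Lower is better."""
    return sum(w * o / r for w, o, r in zip(weights, objs, refs))
```

A metaheuristic search (whatever engine MOMCE actually uses) would then iterate over candidate mappings, minimizing this fitness.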
Citations: 1
Improving Collective I/O Performance with Machine Learning Supported Auto-tuning
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00138
Ayse Bagbaba
Collective input/output (I/O) is an essential approach in high performance computing (HPC) applications. Achieving effective collective I/O is a nontrivial job due to the complex interdependencies between the layers of the I/O stack. These layers provide the best possible I/O performance through a number of tunable parameters. Unfortunately, the correct combination of parameters depends on the application and the HPC platform. As the configuration space grows, it becomes difficult for humans to track the interactions between configuration options. Engineers have neither the time nor the experience to explore good configuration parameters for each problem, because the benchmarking phase is long. In most cases, the default settings are used, often leading to poor I/O efficiency. I/O profiling tools cannot identify the optimal default setups without considerable effort spent analyzing the tracing results. An auto-tuning solution that optimizes collective I/O requests and provides system administrators or engineers with statistical information is therefore strongly needed. In this paper, we study machine-learning-supported collective I/O auto-tuning, including the architecture and software stack. A random forest regression model is used to develop a performance predictor that can capture parallel I/O behavior as a function of application and file system characteristics. The modeling approach can provide insights into the metrics that significantly impact I/O performance.
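As a sketch of the modeling step described here, a random forest performance predictor can be set up with scikit-learn as below; the feature names are hypothetical examples of collective-I/O tuning knobs and the data is synthetic, since the paper's actual feature set is not given in the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

FEATURES = ["n_procs", "stripe_count", "stripe_size_MB", "cb_nodes", "request_KB"]

# Synthetic stand-in for a table of benchmarked configurations:
# one row per run, target y = measured I/O bandwidth
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, len(FEATURES)))
y = X @ np.array([2.0, 1.5, 0.5, 1.0, 3.0]) + rng.normal(0, 0.1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out configs:", model.score(X_test, y_test))

# Feature importances hint at which knobs drive I/O performance --
# the kind of insight the abstract says the model provides.
for name, imp in zip(FEATURES, model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```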
Citations: 7
Indirect Deconvolution Algorithm
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00154
Marat Dukhan
Neural network frameworks today commonly implement the Deconvolution operator and the closely related Convolution operator via a combination of GEMM (dense matrix-matrix multiplication) and a memory transformation. The recently proposed Indirect Convolution algorithm suggests a more efficient implementation of Convolution via the Indirect GEMM primitive - a modification of GEMM where pointers to rows are loaded from a buffer rather than computed assuming a constant stride. However, the algorithm is inefficient for Deconvolution with non-unit stride, which is typical in computer vision models. We describe a novel Indirect Deconvolution algorithm for efficient evaluation of the Deconvolution operator with non-unit stride: it splits a Deconvolution with a large kernel into multiple subconvolutions with smaller, variable-size kernels, which can be efficiently implemented on top of the Indirect GEMM primitive.
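A NumPy emulation of the Indirect GEMM primitive may help make the idea concrete (a conceptual sketch, not the production microkernel, which operates on raw pointers rather than index arrays):

```python
import numpy as np

def indirect_gemm(indices, inputs, weights):
    """indices: (M, T) buffer of row indices into `inputs`, standing in
    for the pointer buffer; inputs: (P, C) input pixels x channels;
    weights: (T*C, N). Returns (M, N), one row per output pixel."""
    gathered = inputs[indices]                        # gather rows via the buffer
    patches = gathered.reshape(indices.shape[0], -1)  # (M, T*C), no im2col copy kept around
    return patches @ weights

# Example: 1-D convolution over 6 pixels, kernel size 3 (valid),
# 2 input channels, 4 output filters -> 4 output pixels
P, C, T, N = 6, 2, 3, 4
inputs = np.random.rand(P, C)
weights = np.random.rand(T * C, N)
indices = np.array([[i, i + 1, i + 2] for i in range(P - T + 1)])
out = indirect_gemm(indices, inputs, weights)  # shape (4, 4)
```

For a stride-s deconvolution, the paper's splitting step produces s*s smaller sub-kernels, one per output phase; each resulting subconvolution can then run as an ordinary Indirect GEMM like the one above.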
Citations: 2
Smart Streaming: A High-Throughput Fault-tolerant Online Processing System
Pub Date : 2020-05-01 DOI: 10.1109/ipdpsw50202.2020.00075
Jia Guo, G. Agrawal
In recent years, there has been considerable interest in developing frameworks for processing streaming data. Like the precursor commercial systems for data-intensive processing, these systems have largely not used methods popular within the HPC community (for example, MPI for communication). In this paper, we demonstrate a system for stream processing that offers a high-level API to the users (similar to MapReduce), is fault-tolerant, and is also more efficient and scalable than current solutions. In particular, a cost-efficient MPI/OpenMP-based fault-tolerant scheme is incorporated so that the system can survive node failures with only a modest degradation of performance. We evaluate both the functionality and the efficiency of Smart Streaming using four common applications in machine learning and data analytics. A comparison against state-of-the-art streaming frameworks shows that our system boosts the throughput of test cases by up to 10X and achieves desirable parallelism when scaled out. Additionally, the performance loss upon failures is only proportional to the share of failed resources.
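One common building block behind low-cost fault tolerance in streaming systems is checkpointing the reduction state so that a replacement worker can take over a failed partition. The toy single-process sketch below illustrates that idea only; it is not the paper's MPI/OpenMP scheme.

```python
import pickle

class StreamReducer:
    """Map/combine over a stream with periodic state checkpoints."""

    def __init__(self, map_fn, reduce_fn, init_state, ckpt_path):
        self.map_fn, self.reduce_fn = map_fn, reduce_fn
        self.state, self.ckpt_path = init_state, ckpt_path

    def process_batch(self, batch):
        for record in batch:
            self.state = self.reduce_fn(self.state, self.map_fn(record))

    def checkpoint(self):
        with open(self.ckpt_path, "wb") as f:
            pickle.dump(self.state, f)

    @classmethod
    def recover(cls, map_fn, reduce_fn, ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
        return cls(map_fn, reduce_fn, state, ckpt_path)

# Usage: a count-style job that survives a simulated worker failure
job = StreamReducer(lambda x: 1, lambda s, v: s + v, 0, "/tmp/state.pkl")
job.process_batch(range(100))
job.checkpoint()
# ... worker "fails" here; a replacement resumes from the checkpoint
job = StreamReducer.recover(lambda x: 1, lambda s, v: s + v, "/tmp/state.pkl")
job.process_batch(range(50))
print(job.state)  # 150: the recovered worker resumed from the saved state
```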
Citations: 1
GrAPL 2020 Keynote Speaker: The GraphIt Universal Graph Framework: Achieving High Performance across Algorithms, Graph Types, and Architectures
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00044
Saman P. Amarasinghe
In recent years, large graphs with billions of vertices and trillions of edges have emerged in many domains, such as social network analytics, machine learning, physical simulations, and biology. However, optimizing the performance of graph applications is notoriously difficult due to irregular memory access patterns and load imbalance across cores. The performance of graph programs depends highly on the algorithm, the size, and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or single hardware platform works well across all applications.
Citations: 0
Regression WiSARD application of controller on DC STATCOM converter under fault conditions
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00145
Raphael N. C. B. Rocha, L. L. Filho, M. Aredes, F. França, P. Lima
Capable of supplying local loads, DC microgrids have received much attention in the last decade for alleviating power flow through the main power grid. This has been achieved through the use of edge devices to control the converters, but, among other problems, microgrids have stability issues when Constant Power Loads (CPLs) are present. This problem has already been solved in the literature with the DC STATCOM power converter, which can handle grid operation in normal operating mode. However, in fault cases, the available solutions still fail to reject faults or even contribute to them. The present work explores the potential of a lightweight machine learning algorithm, a Weightless Artificial Neural Network (WANN), for predicting the output of the original controller used in the DC STATCOM on an edge device connected to a converter, and investigates its generalization capability under microgrid fault situations. The WANN used is based on the regression variant of the Wilkes, Stonham, and Aleksander Recognition Device (WiSARD), coined Regression WiSARD (ReW). The evaluation criteria measure the capability of the controller to reject the fault condition. Initial experiments showed surprisingly good results in comparison to the original DC STATCOM controller, indicating that a ReW-based controller fills the role of the DC STATCOM well and is able to cope with fault situations.
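For background on the learning model, a minimal simple-mean Regression WiSARD sketch is shown below: the binarized input (real-valued controller signals would first be encoded, e.g. with a thermometer code) is split into random bit tuples, each tuple addresses a RAM cell holding a (count, target-sum) pair, and prediction averages the stored sums. This is a generic textbook-style sketch, not the authors' controller.

```python
import random
from collections import defaultdict

class RegressionWiSARD:
    """Simple-mean Regression WiSARD (ReW) sketch."""

    def __init__(self, n_bits, tuple_size, seed=0):
        order = list(range(n_bits))
        random.Random(seed).shuffle(order)  # random input-to-tuple mapping
        self.tuples = [order[i:i + tuple_size]
                       for i in range(0, n_bits, tuple_size)]
        # Each RAM cell accumulates [count, sum of training targets]
        self.rams = [defaultdict(lambda: [0, 0.0]) for _ in self.tuples]

    def _addrs(self, bits):
        return [tuple(bits[j] for j in t) for t in self.tuples]

    def train(self, bits, y):
        for ram, addr in zip(self.rams, self._addrs(bits)):
            ram[addr][0] += 1
            ram[addr][1] += y

    def predict(self, bits):
        cnt = tot = 0.0
        for ram, addr in zip(self.rams, self._addrs(bits)):
            if addr in ram:                 # skip unseen addresses
                c, s = ram[addr]
                cnt += c
                tot += s
        return tot / cnt if cnt else 0.0    # mean of stored targets
```

Training is a handful of dictionary updates per sample, which is what makes weightless networks attractive for edge devices.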
Citations: 2
Unified data movement for offloading Charm++ applications
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00085
M. Diener, L. Kalé
Data movement between host and accelerators is one of the most challenging aspects of developing applications for heterogeneous systems. Most existing runtime systems for GPGPU programming require developers to perform data movement manually in the source code, while having to support different hardware and software environments. In this paper, we present a novel way to perform data movement for distributed applications based on the Charm++ programming system. We extend Charm++’s support for migration across memory address spaces to handle accelerator devices by making use of the description of data contained in Charm++’s parallel objects. This allows the Charm++ runtime to handle data movement automatically to a large extent, while supporting different hardware platforms transparently. This increases both developer productivity and the portability of Charm++ applications. We demonstrate our proposal with a Charm++ application that runs offloaded CUDA code on three different hardware platforms with a single data movement specification.
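As a loose analogy in Python (the paper's mechanism is Charm++/C++): once an object describes its migratable data in a single place, the runtime can stage those fields to and from a device around any offloaded kernel without per-call copy code. The classes and staging below are invented for illustration only.

```python
import numpy as np

class Particles:
    def __init__(self, n):
        self.pos = np.random.rand(n, 3)
        self.vel = np.random.rand(n, 3)

    def describe(self):
        # Single description of the object's movable data,
        # analogous to how a Charm++ object describes itself for migration
        return {"pos": self.pos, "vel": self.vel}

def offload(obj, kernel):
    """Stage described fields to the 'device', run the kernel, copy back."""
    device = {k: v.copy() for k, v in obj.describe().items()}  # host -> device
    kernel(device)                                             # offloaded work
    for k, v in device.items():                                # device -> host
        obj.describe()[k][...] = v

def step(dev):
    dev["pos"] += 0.1 * dev["vel"]  # advance positions on the "device"

p = Particles(1000)
offload(p, step)
```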
Citations: 2
Kcollections: A Fast and Efficient Library for K-mers
Pub Date : 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00041
M. Fujimoto, Cole A. Lyman, M. Clement
K-mers form the backbone of many bioinformatic algorithms. They are, however, difficult to store and use efficiently, because the number of k-mers increases exponentially as k increases. Many algorithms exist for the compressed storage of k-mers, but they suffer from slow insert times or are probabilistic, resulting in false-positive k-mers. Furthermore, k-mer libraries usually specialize in associating specific values with k-mers, such as a color in colored de Bruijn Graphs or a k-mer count. We present kcollections [1], a compressed and parallel data structure designed for k-mers generated from whole, assembled genomes. Kcollections is available for C++ and provides set- and map-like structures as well as a k-mer counting data structure, all of which use parallel operations designed with a MapReduce paradigm. Additionally, we provide basic Python bindings for rapid prototyping. Kcollections makes developing bioinformatic algorithms simpler by abstracting away the tedious task of storing k-mers. [1] https://www.github.com/masakistan/kcollections
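The storage pressure mentioned here comes from the 4^k growth in distinct k-mers, which is why compact k-mer containers typically pack each k-mer into an integer at 2 bits per base rather than storing strings. A minimal illustration of that packing follows (not kcollections' actual API, which is documented in the repository):

```python
from collections import Counter

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(kmer):
    """Pack a k-mer into an int, 2 bits per base (k <= 32 fits in 64 bits)."""
    code = 0
    for base in kmer:
        code = (code << 2) | ENC[base]
    return code

def count_kmers(seq, k):
    """Count all k-mers of a sequence using packed integer keys."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts[pack(seq[i:i + k])] += 1
    return counts

print(count_kmers("ACGTACGT", 4))  # packed 'ACGT' -> 0b00011011 = 27, count 2
```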
Citations: 1
Workshop on Resource Arbitration for Dynamic Runtimes (RADR)
Pub Date : 2020-05-01 DOI: 10.1109/ipdpsw50202.2020.00157
P. Beckman, E. Jeannot, Swann Perarnau
The question of efficient dynamic allocation of compute-node resources, such as cores, by independent libraries or runtime systems can be a nightmare. Scientists writing application components have no way to efficiently specify and compose resource-hungry components. As application software stacks become deeper and multiple runtime layers compete for resources from the operating system, it has become clear that intelligent cooperation is needed. Resources such as compute cores, in-package memory, and even electrical power must be orchestrated dynamically across application components, with the ability to query each other and respond appropriately. A more integrated solution would reduce intra-application resource competition and improve performance. Furthermore, application runtime systems could request and allocate specific hardware assets and adjust runtime tuning parameters up and down the software stack. The goal of this workshop is to gather and share the latest scholarly research from the community working on these issues, at all levels of the HPC software stack. This includes thread allocation, resource arbitration and management, containers, and so on, from runtime-system designers to compilers. We will also use panel sessions and keynote talks to discuss these issues, share visions, and present solutions.
Citations: 0