
Latest publications from the International Journal of High Performance Computing Applications

INDIANA—In-Network Distributed Infrastructure for Advanced Network Applications
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-26 | DOI: 10.1177/10943420231179662 | Vol. 37(1), pp. 442-461
Sabra Ossen, Jeremy Musser, Luke Dalessandro, M. Swany
Data volumes are exploding as sensors proliferate and become more capable. Edge computing is envisioned as a path to distribute processing and reduce latency. Many models of edge computing consider small devices running conventional software. Our model includes a more lightweight execution engine for network microservices and a network scheduling framework that configures network processing elements to process streams and directs the appropriate traffic to them. In this article, we describe INDIANA, a complete framework for in-network microservices. We describe how its two components, the INDIANA network Processing Element (InPE) and the Flange Network Operating System (NOS), work together to achieve effective in-network processing and improve performance in edge-to-cloud environments. Our processing elements provide lightweight compute units optimized for efficient stream processing. These elements are customizable and vary in sophistication and resource consumption. The Flange NOS provides first-class flow-based reasoning to drive function placement, network configuration, and load balancing, and can respond dynamically to network conditions. We describe design considerations and discuss our approach and implementations. We evaluate the performance of stream processing and examine several exemplar applications on networks of increasing scale and complexity.
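The processing-element model described above can be pictured as a chain of lightweight stream operators. The sketch below is purely illustrative — the function names and pipeline shape are our own, not the INDIANA API — assuming a processing element that filters a packet stream and reduces it over fixed-size windows:

```python
# Hypothetical sketch of an in-network processing element as a chain of
# stream operators. Illustrative only; not the actual INDIANA/InPE API.

def source(packets):
    """Yield raw packets entering the processing element."""
    yield from packets

def filter_stage(stream, predicate):
    """Drop packets that the predicate rejects."""
    for pkt in stream:
        if predicate(pkt):
            yield pkt

def aggregate_stage(stream, window):
    """Emit a sum over each fixed-size window (a toy reduction)."""
    buf = []
    for pkt in stream:
        buf.append(pkt)
        if len(buf) == window:
            yield sum(buf)
            buf = []

if __name__ == "__main__":
    pkts = range(10)
    pipeline = aggregate_stage(
        filter_stage(source(pkts), lambda p: p % 2 == 0), window=2)
    print(list(pipeline))  # -> [2, 10]
```

Composing generators this way keeps each stage cheap and independently swappable, which mirrors the idea of customizable elements varying in sophistication and resource consumption.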
Citations: 0
Abisko: Deep codesign of an architecture for spiking neural networks using novel neuromorphic materials
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-22 | DOI: 10.1177/10943420231178537 | Vol. 37(1), pp. 351-379
J. Vetter, Prasanna Date, F. Fahim, Shruti R. Kulkarni, P. Maksymovych, A. Talin, Marc Gonzalez Tallada, Pruek Vanna-Iampikul, Aaron R. Young, David Brooks, Yu Cao, Wei Gu-Yeon, S. Lim, Frank Liu, Matthew J. Marinella, B. Sumpter, Narasinga Rao Miniskar
The Abisko project aims to develop an energy-efficient spiking neural network (SNN) computing architecture and software system capable of autonomous learning and operation. The SNN architecture explores novel neuromorphic devices that are based on resistive-switching materials, such as memristors and electrochemical RAM. Equally important, Abisko uses a deep codesign approach to pursue this goal by engaging experts from across the entire range of disciplines: materials, devices and circuits, architectures and integration, software, and algorithms. The key objectives of our Abisko project are threefold. First, we are designing an energy-optimized high-performance neuromorphic accelerator based on SNNs. This architecture is being designed as a chiplet that can be deployed in contemporary computer architectures and we are investigating novel neuromorphic materials to improve its design. Second, we are concurrently developing a productive software stack for the neuromorphic accelerator that will also be portable to other architectures, such as field-programmable gate arrays and GPUs. Third, we are creating a new deep codesign methodology and framework for developing clear interfaces, requirements, and metrics between each level of abstraction to enable the system design to be explored and implemented interchangeably with execution, measurement, a model, or simulation. As a motivating application for this codesign effort, we target the use of SNNs for an analog event detector for a high-energy physics sensor.
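For context on the SNN target, the basic unit such architectures accelerate — a leaky integrate-and-fire neuron — can be simulated in a few lines. This is the generic textbook model, not Abisko's device physics; the leak factor and threshold are arbitrary illustration values:

```python
# Textbook leaky integrate-and-fire (LIF) neuron; constants are
# illustrative and unrelated to Abisko's neuromorphic devices.

def lif_step(v, i_in, leak=0.9, v_th=1.0):
    """One timestep: decay the membrane potential, add the input
    current, and emit a spike (with reset) on reaching threshold."""
    v = leak * v + i_in
    if v >= v_th:
        return 0.0, 1   # reset potential, spike
    return v, 0

def run(currents):
    """Drive one neuron with a current trace; return its spike train."""
    v, spikes = 0.0, []
    for i_in in currents:
        v, s = lif_step(v, i_in)
        spikes.append(s)
    return spikes

if __name__ == "__main__":
    print(run([0.5] * 6))  # -> [0, 0, 1, 0, 0, 1]
```

The event-driven, sparse nature of such spike trains is what makes analog event detection (as in the high-energy physics use case above) a natural fit for SNN hardware.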
Citations: 1
Fast truncated SVD of sparse and dense matrices on graphics processors
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-07 | DOI: 10.1177/10943420231179699 | Vol. 37(1), pp. 380-393
A. Tomás, E. S. Quintana‐Ortí, H. Anzt
We investigate the solution of low-rank matrix approximation problems using the truncated singular value decomposition (SVD). For this purpose, we develop and optimize graphics processing unit (GPU) implementations for the randomized SVD and a blocked variant of the Lanczos approach. Our work takes advantage of the fact that the two methods are composed of very similar linear algebra building blocks, which can be assembled using numerical kernels from existing high-performance linear algebra libraries. Furthermore, the experiments with several sparse matrices arising in representative real-world applications and synthetic dense test matrices reveal a performance advantage of the block Lanczos algorithm when targeting the same approximation accuracy.
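The randomized SVD mentioned above follows the standard sketch-then-solve pattern: sample the range of A with a Gaussian test matrix, orthonormalize, and solve a small dense problem. A minimal NumPy version (the oversampling parameter is an assumed default, and this CPU sketch stands in for the paper's GPU kernels):

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Rank-k truncated SVD via a randomized range finder.
    A sketch of the standard algorithm, not the paper's GPU code."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Sketch the column space of A with a Gaussian test matrix.
    Omega = rng.standard_normal((n, k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)          # orthonormal basis for range(A @ Omega)
    # Project A onto the captured subspace and do a small dense SVD.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]
```

Both this method and block Lanczos reduce to the same building blocks — dense matrix products, QR, and a small SVD — which is exactly why the authors can assemble them from existing high-performance linear algebra kernels.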
Citations: 0
Accelerated dynamic data reduction using spatial and temporal properties
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-06-05 | DOI: 10.1177/10943420231180504 | Vol. 37(1), pp. 539-559
Megan Hickman Fulp, Dakota Fulp, Changfeng Zou, Cooper Sanders, Ayan Biswas, Melissa C. Smith, Jon C. Calhoun
Due to improvements in high-performance computing (HPC) capabilities, many of today’s applications produce petabytes of data, causing bottlenecks within the system. Importance-based sampling methods, including our spatio-temporal hybrid data sampling method, can resolve these bottlenecks. While our hybrid method has been shown to outperform existing methods, its effectiveness relies heavily on user parameters, such as histogram bins, error threshold, or number of regions. Moreover, its throughput must be higher still to avoid becoming a bottleneck itself. In this article, we resolve both of these issues. First, we assess the effects of several user input parameters and detail techniques to help determine optimal parameters. Next, we detail and implement accelerated versions of our method using OpenMP and CUDA. Upon analyzing our implementations, we find 9.8× to 31.5× throughput improvements. Next, we demonstrate how our method can accept different base sampling algorithms and the effects these algorithms have. Finally, we compare our sampling methods to the lossy compressor cuSZ in terms of data preservation and data movement.
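To give a flavor of importance-based sampling, the toy sketch below keeps points from sparsely populated histogram bins with higher probability. The binning and weighting scheme is illustrative only — it is not the paper's spatio-temporal hybrid method, and `budget` (an expected sample count) is an assumed parameter:

```python
import random
from collections import Counter

def histogram_importance_sample(data, n_bins, budget, seed=0):
    """Toy importance sampler: bin the values, then keep each point with
    probability inversely proportional to its bin population, scaled so
    the expected number of kept points is roughly `budget`."""
    rnd = random.Random(seed)
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins or 1.0          # guard against zero range
    bins = [min(int((x - lo) / width), n_bins - 1) for x in data]
    counts = Counter(bins)
    weights = [1.0 / counts[b] for b in bins]  # rare bins weigh more
    total = sum(weights)
    return [x for x, w in zip(data, weights)
            if rnd.random() < budget * w / total]
```

Rare (often scientifically interesting) values survive the reduction almost surely, while the dense background is thinned aggressively — the core idea behind importance-based data reduction.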
Citations: 1
HipBone: A performance-portable graphics processing unit-accelerated C++ version of the NekBone benchmark
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-05-31 | DOI: 10.1177/10943420231178552 | Vol. 37(1), pp. 560-577
N. Chalmers, Abhishek Mishra, Damon McDougall, T. Warburton
We present hipBone, an open-source performance-portable proxy application for the Nek5000 (and NekRS) computational fluid dynamics applications. HipBone is a fully GPU-accelerated C++ implementation of the original NekBone CPU proxy application with several novel algorithmic and implementation improvements which optimize its performance on modern fine-grain parallel GPU accelerators. Our optimizations include a conversion to store the degrees of freedom of the problem in assembled form in order to reduce the amount of data moved during the main iteration and a portable implementation of the main Poisson operator kernel. We demonstrate near-roofline performance of the operator kernel on three different modern GPU accelerators from two different vendors. We present a novel algorithm for splitting the application of the Poisson operator on GPUs which aggressively hides MPI communication required for both halo exchange and assembly. Our implementation of nearest-neighbor MPI communication then leverages several different routing algorithms and GPU-Direct RDMA capabilities, when available, which improves scalability of the benchmark. We demonstrate the performance of hipBone on three different clusters housed at Oak Ridge National Laboratory, namely, the Summit supercomputer and the Frontier early-access clusters, Spock and Crusher. Our tests demonstrate both portability across different clusters and very good scaling efficiency, especially on large problems.
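NekBone-class benchmarks time a conjugate gradient iteration around a Poisson operator. The minimal, matrix-free CPU sketch below shows that inner loop, assuming a 1-D Dirichlet Laplacian stencil in place of hipBone's spectral-element operator and GPU kernels:

```python
# Plain conjugate gradient on an SPD operator: the core iteration that
# NekBone-style benchmarks measure. Pure-Python sketch, not hipBone code.

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=200):
    """Solve A x = b for SPD A given only the matrix-free operator apply_A."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                       # residual for x = 0
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def laplacian_1d(p):
    """Matrix-free 1-D Poisson stencil (2, -1, -1) with Dirichlet ends."""
    n = len(p)
    return [2 * p[i]
            - (p[i - 1] if i > 0 else 0.0)
            - (p[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]
```

Each iteration is one operator application, two dot products, and three vector updates — the operator kernel dominates, which is why the paper focuses on a near-roofline, portable implementation of it.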
Citations: 2
Graph neural networks for detecting anomalies in scientific workflows
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-05-30 | DOI: 10.1177/10943420231172140 | Vol. 37(1), pp. 394-411
Hongwei Jin, Krishnan Raghavan, G. Papadimitriou, Cong Wang, A. Mandal, M. Kiran, E. Deelman, Prasanna Balaprakash
Identifying and addressing anomalies in complex, distributed systems can be challenging for the reliable execution of scientific workflows. We model these workflows as directed acyclic graphs (DAGs), where the nodes and edges of the DAGs represent jobs and their dependencies, respectively. We develop graph neural networks (GNNs) to learn patterns in the DAGs and to detect anomalies at the node (job) and graph (workflow) levels. We investigate workflow-specific GNN models that are trained on a particular workflow and workflow-agnostic GNN models that are trained across workflows. Our GNN models, which incorporate both individual job features and topological information from the workflow, show improved accuracy and efficiency compared to conventional learning methods for detecting anomalies. When jointly trained with multiple scientific workflows, our GNN models reached an accuracy of more than 80% for workflow-level and 75% for job-level anomalies. In addition, we illustrate the importance of hyperparameter tuning, which can significantly improve the metrics used to evaluate the GNN models. Finally, we integrate explainable GNN methods to provide insights into the job features in the workflow that cause an anomaly.
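The essence of GNN-style learning on a workflow DAG is iterative neighbor aggregation: each job's representation absorbs information from its predecessors. The sketch below shows that aggregation step only — an untrained, unweighted average, not the paper's trained models:

```python
# Toy message passing over a workflow DAG: each node's feature is averaged
# with its predecessors' features every round. Illustrative of GNN
# aggregation, not the paper's trained anomaly detectors.

def message_passing(adj, features, rounds=2):
    """adj maps each job to its successor list; features maps job -> scalar."""
    feats = dict(features)
    for _ in range(rounds):
        new = {}
        for node, f in feats.items():
            preds = [p for p, succs in adj.items() if node in succs]
            vals = [feats[p] for p in preds] + [f]   # include self
            new[node] = sum(vals) / len(vals)
        feats = new
    return feats

if __name__ == "__main__":
    dag = {"stage_in": ["compute"], "compute": ["stage_out"], "stage_out": []}
    out = message_passing(dag, {"stage_in": 1.0, "compute": 0.0, "stage_out": 0.0})
    print(out)  # -> {'stage_in': 1.0, 'compute': 0.75, 'stage_out': 0.25}
```

After a few rounds, an anomalous job feature propagates to downstream nodes, which is why topological information improves detection over per-job features alone.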
Citations: 2
Dynamic spawning of MPI processes applied to malleability
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-05-29 | DOI: 10.1177/10943420231176527
Iker Martín-Álvarez, J. Aliaga, M. Castillo, Sergio Iserte, R. Mayo
Malleability allows computing facilities to adapt their workloads through resource management systems to maximize the throughput of the facility and the efficiency of the executed jobs. This technique is based on reconfiguring a job to a different resource amount during execution and then continuing it. One stage of malleability is the dynamic spawning of processes at execution time, where the decisions made in this stage affect how the next stage, data redistribution, is performed, which is the most time-consuming stage. This paper describes different methods and strategies, defining eight alternatives for spawning processes dynamically, and indicates which one should be used depending on whether the application scales strongly or weakly. In addition, it describes, for both types of applications, which strategies most benefit application performance or system productivity. The results show that reducing the number of spawned processes by reusing existing ones can reduce reconfiguration time compared to the classical method by up to 2.6 times when expanding and up to 36 times when shrinking. Furthermore, the asynchronous strategy requires analysing the impact of oversubscription on application performance.
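To make the reuse-versus-replace trade-off concrete, the toy accounting below contrasts a reuse strategy (spawn only the delta, keep existing ranks) with a classical full-restart strategy. It is a schematic model of the strategies compared above, not the paper's MPI implementation (which would use `MPI_Comm_spawn` and real data redistribution):

```python
# Schematic model of malleable reconfiguration strategies: how many
# processes each strategy spawns, keeps, and stops. Illustrative only.

def spawn_plan(current, target, reuse=True):
    """Plan a reconfiguration from `current` to `target` processes."""
    if reuse:
        if target > current:
            # Expanding: keep every existing rank, spawn only the delta.
            return {"spawn": target - current, "keep": current, "stop": 0}
        # Shrinking: no spawning at all, just retire the surplus ranks.
        return {"spawn": 0, "keep": target, "stop": current - target}
    # Classical: tear everything down and spawn the full target set.
    return {"spawn": target, "keep": 0, "stop": current}
```

The reuse plan spawns strictly fewer processes in every case, which is the mechanism behind the reported reconfiguration-time reductions; what it does not capture is the data-redistribution cost that dominates the next stage.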
Citations: 1
Modeling, evaluating, and orchestrating heterogeneous environmental leverages for large-scale data center management
IF 3.1 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2023-05-24 | DOI: 10.1177/10943420231172978 | Vol. 37(1), pp. 328-350
Vladimir Ostapenco, L. Lefèvre, Anne-Cécile Orgerie, Benjamin Fichel
Data centers are very energy-intensive facilities that can generate various environmental impacts. Numerous energy, power, and environmental leverages exist and can help cloud providers and data center managers reduce some of these impacts. But dealing with such heterogeneous leverages can be a challenging task that requires support from a dedicated framework. This article presents a new approach for modeling, evaluating, and orchestrating a large set of technological and logistical leverages. Our framework is based on leverage modeling and Gantt chart leverage mapping. Initial experimental results based on selected scenarios show the pertinence of the proposed approach in terms of management capabilities and potential impact reduction.
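A minimal sketch of Gantt-style leverage mapping: place a list of (leverage, duration) activations sequentially on a shared timeline. The greedy sequential placement and the leverage names are assumptions for illustration, not the paper's orchestration logic:

```python
# Toy Gantt-chart mapping of data center leverages onto a timeline.
# Greedy sequential placement; leverage names are hypothetical examples.

def schedule_leverages(leverages):
    """Map (name, duration) pairs to (name, start, end) Gantt entries."""
    t, chart = 0, []
    for name, duration in leverages:
        chart.append((name, t, t + duration))
        t += duration
    return chart

if __name__ == "__main__":
    plan = schedule_leverages([("dvfs", 2), ("node-shutdown", 3)])
    print(plan)  # -> [('dvfs', 0, 2), ('node-shutdown', 2, 5)]
```

A real orchestrator would also model per-leverage impacts and allow overlapping rows per resource; the point here is only the timeline representation that a Gantt mapping provides.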
Citations: 0
Finding the forest in the trees: Enabling performance optimization on heterogeneous architectures through data science analysis of ensemble performance data
IF 3.1 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-05-23 DOI: 10.1177/10943420231175687
Olga Pearce, S. Brink
In this work, we develop novel data science methodologies for ensemble performance data that have the potential to uncover orders of magnitude of performance that is unknowingly being left on the table. Building on years of successful performance tool design and tool integration into million-line codes at Lawrence Livermore National Laboratory (Caliper (Boehme et al. 2016), Hatchet (Bhatele et al. 2019; Brink et al. 2020))—successes highlighted as key deliverables in meeting LLNL’s L1 and L2 milestones (Rieben and Weiss 2020)—we design a data science methodology for integrating multi-dimensional, multi-scale, multi-architecture, and multi-tool performance data, and provide data analytics and interactive visualization capabilities for further analysis and exploration of the data. Our work provides developers with a comprehensive multi-dimensional performance landscape, enabling enhanced capabilities for pinpointing performance bottlenecks on emerging hardware platforms composed of heterogeneous elements.
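One core step in this kind of ensemble analysis is comparing the same code regions across many runs and architectures to find where time varies most. The sketch below is a hypothetical, stdlib-only illustration of that idea; the run names, regions, and timings are invented, and the paper's actual pipeline builds on Caliper and Hatchet rather than raw dicts.

```python
# Illustrative ensemble analysis: given per-run profiles mapping code
# region -> wall time, rank regions by their cross-run spread (max - min)
# to spot where performance is being "left on the table".

def rank_by_spread(runs):
    """runs: dict run_name -> dict region -> seconds.
    Returns (region, spread) pairs sorted largest spread first."""
    regions = set().union(*(r.keys() for r in runs.values()))
    spread = {}
    for region in regions:
        times = [r[region] for r in runs.values() if region in r]
        spread[region] = max(times) - min(times)
    return sorted(spread.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    # Invented example: one region speeds up dramatically on the GPU run,
    # flagging it as the place to focus optimization effort.
    runs = {
        "cpu-a": {"solve": 12.0, "io": 1.5, "setup": 0.4},
        "cpu-b": {"solve": 11.5, "io": 1.6, "setup": 0.5},
        "gpu":   {"solve":  2.0, "io": 1.4, "setup": 0.6},
    }
    for region, delta in rank_by_spread(runs):
        print(f"{region}: spread {delta:.1f}s")
```

A real ensemble study would add many more dimensions (scale, input, compiler, tool), which is exactly the multi-dimensional landscape the paper targets.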
Olga Pearce, S. Brink. "Finding the forest in the trees: Enabling performance optimization on heterogeneous architectures through data science analysis of ensemble performance data." International Journal of High Performance Computing Applications 37(1): 434–441. Published 2023-05-23. DOI: 10.1177/10943420231175687.
Citations: 0
IO-aware Job-Scheduling: Exploiting the Impacts of Workload Characterizations to select the Mapping Strategy
IF 3.1 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-05-15 DOI: 10.1177/10943420231175854
E. Jeannot, Guillaume Pallez, Nicolas Vidal
In high performance computing, concurrent applications share the same file system. However, the bandwidth providing access to storage is limited, so too many I/O operations performed at the same time lead to conflicts and performance loss due to contention. This scenario will become more common as applications become more data-intensive. To avoid congestion, job schedulers must play an important role in selecting which applications run concurrently. However, I/O-aware mapping strategies need to be simple, robust, and fast. Hence, in this article, we discuss two plain and practical strategies to mitigate I/O congestion. Both are based on the idea of scheduling I/O accesses so as not to exceed a prescribed I/O bandwidth. More precisely, we compare two approaches: one groups applications into packs that are run independently (pack-scheduling); the other schedules applications greedily in a predefined order (list-scheduling). Results show that performance depends heavily on the I/O load and the homogeneity of the underlying workload. Finally, we introduce the notion of characteristic time, which captures the average time between consecutive I/O transfers. We show that it can be important to the design of schedulers and that we expect it to be easily obtained by analysis tools.
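The greedy list-scheduling idea from the abstract — admit applications in a fixed priority order only while their combined I/O demand stays under a prescribed bandwidth cap — can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation; the application names and bandwidth figures are invented.

```python
# Illustrative greedy list-scheduling under an I/O bandwidth cap:
# walk the list in priority order, start an application if its I/O
# demand still fits under the cap, otherwise defer it to a later round.

def list_schedule(apps, bw_cap):
    """apps: list of (name, io_bandwidth) in priority order.
    Returns (started, deferred) for one scheduling round."""
    started, deferred = [], []
    used = 0.0
    for name, bw in apps:
        if used + bw <= bw_cap:
            started.append(name)
            used += bw
        else:
            deferred.append(name)  # revisited once bandwidth frees up
    return started, deferred

if __name__ == "__main__":
    apps = [("A", 4.0), ("B", 3.0), ("C", 2.0)]
    started, deferred = list_schedule(apps, bw_cap=6.0)
    print("started:", started, "deferred:", deferred)
```

Pack-scheduling, the alternative the paper compares against, would instead partition the applications into groups run one group at a time; the trade-off between the two depends on the I/O load and workload homogeneity, as the abstract notes.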
E. Jeannot, Guillaume Pallez, Nicolas Vidal. "IO-aware Job-Scheduling: Exploiting the Impacts of Workload Characterizations to select the Mapping Strategy." International Journal of High Performance Computing Applications 37(1): 213–228. Published 2023-05-15. DOI: 10.1177/10943420231175854.
Citations: 0