The Tapis framework provides APIs for automating job execution on remote resources, including HPC clusters and servers running in the cloud. Tapis can simplify interaction with remote cyberinfrastructure (CI), but the current services require users to specify the exact configuration of a job to run, including the system, queue, node count, and maximum run time, among other attributes. Moreover, the remote resources must be defined and configured in Tapis before a job can be submitted. In this paper, we present our efforts to develop an intelligent job scheduling capability in Tapis, where various attributes of a job configuration can be automatically determined for the user, and computational resources can be dynamically provisioned by Tapis for specific jobs. We develop an overall architecture for such a feature, which suggests a set of core challenges to be solved. We then focus on one specific challenge: predicting queue times for a job on different HPC systems and queues, and we present two sets of results based on machine learning methods. Our first set of results casts the problem as a regression, which can be used to select the best system from a list of existing options. Our second set of results frames the problem as a classification, allowing us to compare the use of an existing system with a dynamically provisioned resource.
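To make the queue-time prediction step concrete, here is a minimal sketch in Python using scikit-learn on synthetic job records; the feature set (node count, requested wall time, queue depth at submission), the random-forest models, and the 60-minute provisioning threshold are illustrative assumptions, not the paper's actual setup.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic job records: [node_count, requested_walltime_min, queue_depth_at_submit]
X = rng.uniform([1, 10, 0], [256, 2880, 500], size=(2000, 3))
# Hypothetical ground-truth queue waits in minutes
wait = 0.5 * X[:, 0] + 0.1 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(0, 30, 2000)
over_threshold = (wait > 60).astype(int)   # assumed cutoff for provisioning a new resource

X_tr, X_te, w_tr, w_te, c_tr, c_te = train_test_split(X, wait, over_threshold, random_state=0)

# Regression view: predict the expected wait per system/queue, then pick the minimum.
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, w_tr)
print("predicted waits (min):", reg.predict(X_te[:3]).round(1))

# Classification view: will the wait exceed the threshold that justifies dynamic provisioning?
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, c_tr)
print("exceeds threshold:", clf.predict(X_te[:3]))

In practice one such model would be trained per system and queue (or with the system as a categorical feature), and the scheduler would compare predictions across the candidates.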
{"title":"Toward Smart Scheduling in Tapis","authors":"Joe Stubbs, Smruti Padhy, Richard Cardone","doi":"arxiv-2408.03349","DOIUrl":"https://doi.org/arxiv-2408.03349","url":null,"abstract":"The Tapis framework provides APIs for automating job execution on remote\u0000resources, including HPC clusters and servers running in the cloud. Tapis can\u0000simplify the interaction with remote cyberinfrastructure (CI), but the current\u0000services require users to specify the exact configuration of a job to run,\u0000including the system, queue, node count, and maximum run time, among other\u0000attributes. Moreover, the remote resources must be defined and configured in\u0000Tapis before a job can be submitted. In this paper, we present our efforts to\u0000develop an intelligent job scheduling capability in Tapis, where various\u0000attributes about a job configuration can be automatically determined for the\u0000user, and computational resources can be dynamically provisioned by Tapis for\u0000specific jobs. We develop an overall architecture for such a feature, which\u0000suggests a set of core challenges to be solved. Then, we focus on one such\u0000specific challenge: predicting queue times for a job on different HPC systems\u0000and queues, and we present two sets of results based on machine learning\u0000methods. Our first set of results cast the problem as a regression, which can\u0000be used to select the best system from a list of existing options. Our second\u0000set of results frames the problem as a classification, allowing us to compare\u0000the use of an existing system with a dynamically provisioned resource.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Solving very large linear systems of equations is a key computational task in science and technology. In many cases, the coefficient matrix of the linear system is rank-deficient, leading to systems that may be underdetermined, inconsistent, or both. In such cases, one generally seeks the least squares solution that minimizes the residual of the problem, which can be further defined as the solution with the smallest norm when the coefficient matrix has a nontrivial nullspace. This work presents several new techniques for solving least squares problems involving coefficient matrices so large that they do not fit in main memory. The implementations include both CPU and GPU variants. All techniques rely on complete orthogonal decompositions, which guarantee that both conditions of a least squares solution are met regardless of the rank properties of the matrix. Specifically, they rely on the recently proposed "randUTV" algorithm, which is particularly effective in strongly communication-constrained environments. A detailed precision and performance study reveals that the new methods, which operate on data stored on disk, are competitive with state-of-the-art methods that store all data in main memory.
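As a small, in-core illustration of the kind of complete orthogonal decomposition these methods rely on, the sketch below computes the minimum-norm least-squares solution of a rank-deficient system with NumPy/SciPy (a column-pivoted QR followed by a second QR). This is only a dense, in-memory analogue; it is not the paper's out-of-core randUTV implementation.

import numpy as np
from scipy.linalg import qr, solve_triangular

def cod_min_norm_lstsq(A, b, tol=1e-10):
    # Rank-revealing (column-pivoted) QR: A[:, piv] = Q @ R
    Q, R, piv = qr(A, mode="economic", pivoting=True)
    r = int(np.sum(np.abs(np.diag(R)) > tol * abs(R[0, 0])))
    Q1, T = Q[:, :r], R[:r, :]                 # A[:, piv] = Q1 @ T, with T of full row rank
    Z, S = np.linalg.qr(T.T)                   # second QR completes the decomposition: T = S.T @ Z.T
    c = Q1.T @ b                               # residual is minimized by any y with T @ y = c
    y = Z @ solve_triangular(S, c, trans='T')  # minimum-norm y = pinv(T) @ c
    x = np.zeros(A.shape[1])
    x[piv] = y                                 # undo the column permutation
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 80)) @ rng.normal(size=(80, 200))   # 500 x 200 matrix of rank 80
b = rng.normal(size=500)
x = cod_min_norm_lstsq(A, b)
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # LAPACK's SVD-based solver agrees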
{"title":"Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures","authors":"Mónica Chillarón, Gregorio Quintana-Ortí, Vicente Vidal, Per-Gunnar Martinsson","doi":"arxiv-2408.05238","DOIUrl":"https://doi.org/arxiv-2408.05238","url":null,"abstract":"Solving very large linear systems of equations is a key computational task in\u0000science and technology. In many cases, the coefficient matrix of the linear\u0000system is rank-deficient, leading to systems that may be underdetermined,\u0000inconsistent, or both. In such cases, one generally seeks to compute the least\u0000squares solution that minimizes the residual of the problem, which can be\u0000further defined as the solution with smallest norm in cases where the\u0000coefficient matrix has a nontrivial nullspace. This work presents several new\u0000techniques for solving least squares problems involving coefficient matrices\u0000that are so large that they do not fit in main memory. The implementations\u0000include both CPU and GPU variants. All techniques rely on complete orthogonal\u0000decompositions that guarantee that both conditions of a least squares solution\u0000are met, regardless of the rank properties of the matrix. Specifically, they\u0000rely on the recently proposed \"randUTV\" algorithm that is particularly\u0000effective in strongly communication-constrained environments. A detailed\u0000precision and performance study reveals that the new methods, that operate on\u0000data stored on disk, are competitive with state-of-the-art methods that store\u0000all data in main memory.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the volume of data being produced grows at an exponential rate and needs to be processed quickly, it is reasonable that the data be available very close to the compute devices to reduce transfer latency. Due to this need, local filesystems are getting close attention to understand their inner workings, performance, and, more importantly, their limitations. This study analyzes a few popular Linux filesystems: EXT4, XFS, BtrFS, ZFS, and F2FS, by creating, storing, and then reading back one billion files from the local filesystem. The study also captures and analyzes read/write throughput, storage block usage, disk space utilization and overheads, and other metrics useful for system designers and integrators. Furthermore, the study explores side effects such as filesystem performance degradation during and after the creation of these large numbers of files and folders.
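A drastically scaled-down sketch of the create/read-back methodology is shown below (Python, one directory, fixed 4 KiB payloads); the study itself covers one billion files per filesystem and many more metrics, and meaningful read numbers additionally require dropping the page cache between the two phases.

import os, time

def small_file_benchmark(root, n_files=100_000, payload=b"x" * 4096):
    os.makedirs(root, exist_ok=True)
    t0 = time.perf_counter()
    for i in range(n_files):                                  # create + write phase
        with open(os.path.join(root, f"f{i:09d}"), "wb") as f:
            f.write(payload)
    t1 = time.perf_counter()
    for i in range(n_files):                                  # read-back phase
        with open(os.path.join(root, f"f{i:09d}"), "rb") as f:
            f.read()
    t2 = time.perf_counter()
    print(f"create: {n_files / (t1 - t0):.0f} files/s, read-back: {n_files / (t2 - t1):.0f} files/s")

# Run once per filesystem under test, e.g. small_file_benchmark("/mnt/ext4/bench")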
{"title":"Billion-files File Systems (BfFS): A Comparison","authors":"Sohail Shaikh","doi":"arxiv-2408.01805","DOIUrl":"https://doi.org/arxiv-2408.01805","url":null,"abstract":"As the volume of data being produced is increasing at an exponential rate\u0000that needs to be processed quickly, it is reasonable that the data needs to be\u0000available very close to the compute devices to reduce transfer latency. Due to\u0000this need, local filesystems are getting close attention to understand their\u0000inner workings, performance, and more importantly their limitations. This study\u0000analyzes few popular Linux filesystems: EXT4, XFS, BtrFS, ZFS, and F2FS by\u0000creating, storing, and then reading back one billion files from the local\u0000filesystem. The study also captured and analyzed read/write throughput, storage\u0000blocks usage, disk space utilization and overheads, and other metrics useful\u0000for system designers and integrators. Furthermore, the study explored other\u0000side effects such as filesystem performance degradation during and after these\u0000large numbers of files and folders are created.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"78 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saman Kazemkhani, Aarav Pandya, Daphne Cornelisse, Brennan Shacklett, Eugene Vinitsky
Multi-agent learning algorithms have been successful at generating superhuman planning in a wide variety of games but have had little impact on the design of deployed multi-agent planners. A key bottleneck in applying these techniques to multi-agent planning is that they require billions of steps of experience. To enable the study of multi-agent planning at this scale, we present GPUDrive, a GPU-accelerated, multi-agent simulator built on top of the Madrona Game Engine that can generate over a million steps of experience per second. Observation, reward, and dynamics functions are written directly in C++, allowing users to define complex, heterogeneous agent behaviors that are lowered to high-performance CUDA. We show that using GPUDrive we are able to effectively train reinforcement learning agents over many scenes in the Waymo Motion dataset, yielding highly effective goal-reaching agents in minutes for individual scenes and generally capable agents in a few hours. We ship these trained agents as part of the code base at https://github.com/Emerge-Lab/gpudrive.
{"title":"GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS","authors":"Saman Kazemkhani, Aarav Pandya, Daphne Cornelisse, Brennan Shacklett, Eugene Vinitsky","doi":"arxiv-2408.01584","DOIUrl":"https://doi.org/arxiv-2408.01584","url":null,"abstract":"Multi-agent learning algorithms have been successful at generating superhuman\u0000planning in a wide variety of games but have had little impact on the design of\u0000deployed multi-agent planners. A key bottleneck in applying these techniques to\u0000multi-agent planning is that they require billions of steps of experience. To\u0000enable the study of multi-agent planning at this scale, we present GPUDrive, a\u0000GPU-accelerated, multi-agent simulator built on top of the Madrona Game Engine\u0000that can generate over a million steps of experience per second. Observation,\u0000reward, and dynamics functions are written directly in C++, allowing users to\u0000define complex, heterogeneous agent behaviors that are lowered to\u0000high-performance CUDA. We show that using GPUDrive we are able to effectively\u0000train reinforcement learning agents over many scenes in the Waymo Motion\u0000dataset, yielding highly effective goal-reaching agents in minutes for\u0000individual scenes and generally capable agents in a few hours. We ship these\u0000trained agents as part of the code base at\u0000https://github.com/Emerge-Lab/gpudrive.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"173 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the number of WiFi devices and their traffic demands continue to rise, the need for a scalable and high-performance wireless infrastructure becomes increasingly essential. Central to this infrastructure are WiFi Access Points (APs), which facilitate packet switching between Ethernet and WiFi interfaces. Despite APs' reliance on the Linux kernel's data plane for packet switching, the detailed operations and complexities of switching packets between Ethernet and WiFi interfaces have not been investigated in existing works. This paper makes the following contributions towards filling this research gap. Through macro and micro-analysis of empirical experiments, our study reveals insights in two distinct categories. Firstly, while the kernel's statistics offer valuable insights into system operations, we identify and discuss potential pitfalls that can severely affect system analysis. For instance, we reveal the implications of device drivers on the meaning and accuracy of the statistics related to packet-switching tasks and processor utilization. Secondly, we analyze the impact of the packet switching path and core configuration on performance and power consumption. Specifically, we identify the differences in Ethernet-to-WiFi and WiFi-to-Ethernet data paths regarding processing components, multi-core utilization, and energy efficiency. We show that the WiFi-to-Ethernet data path leverages better multi-core processing and exhibits lower power consumption.
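For illustration, the sketch below samples the per-interface packet and byte counters that the Linux kernel exposes in /proc/net/dev, which is exactly the kind of statistic the paper warns can be misread (driver-level aggregation and offloads can make "packet" counts diverge from on-air frames). The interface names are placeholders.

import time

def read_counters(iface):
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:                     # skip the two header lines
            name, data = line.split(":", 1)
            if name.strip() == iface:
                v = [int(x) for x in data.split()]
                return {"rx_bytes": v[0], "rx_packets": v[1],
                        "tx_bytes": v[8], "tx_packets": v[9]}
    raise ValueError(f"interface {iface!r} not found")

def rates(iface, interval=1.0):
    before = read_counters(iface)
    time.sleep(interval)
    after = read_counters(iface)
    return {k: (after[k] - before[k]) / interval for k in before}

# e.g. print(rates("eth0"), rates("wlan0")) while switching traffic between the two interfaces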
{"title":"Understanding and Enhancing Linux Kernel-based Packet Switching on WiFi Access Points","authors":"Shiqi Zhang, Mridul Gupta, Behnam Dezfouli","doi":"arxiv-2408.01013","DOIUrl":"https://doi.org/arxiv-2408.01013","url":null,"abstract":"As the number of WiFi devices and their traffic demands continue to rise, the\u0000need for a scalable and high-performance wireless infrastructure becomes\u0000increasingly essential. Central to this infrastructure are WiFi Access Points\u0000(APs), which facilitate packet switching between Ethernet and WiFi interfaces.\u0000Despite APs' reliance on the Linux kernel's data plane for packet switching,\u0000the detailed operations and complexities of switching packets between Ethernet\u0000and WiFi interfaces have not been investigated in existing works. This paper\u0000makes the following contributions towards filling this research gap. Through\u0000macro and micro-analysis of empirical experiments, our study reveals insights\u0000in two distinct categories. Firstly, while the kernel's statistics offer\u0000valuable insights into system operations, we identify and discuss potential\u0000pitfalls that can severely affect system analysis. For instance, we reveal the\u0000implications of device drivers on the meaning and accuracy of the statistics\u0000related to packet-switching tasks and processor utilization. Secondly, we\u0000analyze the impact of the packet switching path and core configuration on\u0000performance and power consumption. Specifically, we identify the differences in\u0000Ethernet-to-WiFi and WiFi-to-Ethernet data paths regarding processing\u0000components, multi-core utilization, and energy efficiency. We show that the\u0000WiFi-to-Ethernet data path leverages better multi-core processing and exhibits\u0000lower power consumption.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"96 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As Internet-of-Vehicles (IoV) technology develops and demand for Intelligent Transportation Systems (ITS) increases, there is a growing need for real-time data and communication by vehicle users. Traditional request-based methods face challenges such as latency and bandwidth limitations. Mode 4 in Connected Vehicle-to-Everything (C-V2X) addresses latency and overhead issues through autonomous resource selection. However, Semi-Persistent Scheduling (SPS) based on distributed sensing may lead to increased collisions. Non-Orthogonal Multiple Access (NOMA) can alleviate the reduced packet reception probability caused by collisions. Moreover, the concept of Age of Information (AoI) is introduced as a comprehensive metric reflecting reliability and latency performance, and is used to analyze the impact of NOMA on the C-V2X communication system. AoI indicates the time a message spends in both the local waiting and transmission processes. In C-V2X, the waiting process can be extended to a queuing process, influenced by the packet generation rate and the Resource Reservation Interval (RRI). The transmission process is mainly affected by transmission delay and success rate. In C-V2X, a smaller selection window (SW) limits the number of resources available to vehicles, resulting in higher collision rates as the number of vehicles increases. The SW is generally equal to the RRI, which affects AoI not only in the queuing process but also in the transmission process. Therefore, this paper proposes an AoI estimation method based on multi-priority data type queues and considers the influence of NOMA on the AoI generated in both processes in the C-V2X system under different RRI conditions. This work aims to achieve better C-V2X system performance compared with some known algorithms.
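The queuing-plus-transmission view of AoI can be illustrated with a toy slot-based model: the receiver-side age grows every slot and resets to the delivered packet's age on a successful transmission opportunity, which occurs once per RRI. The generation probability, success probability, and single-queue discipline below are illustrative assumptions, not the paper's multi-priority, NOMA-aware estimator.

import random

def mean_aoi(slots=200_000, gen_prob=0.2, rri=5, success_prob=0.8, seed=0):
    random.seed(seed)
    age = 0            # receiver-side age of information
    head_age = 0       # age of the packet currently waiting to be sent
    have_pkt = False
    total = 0
    for t in range(slots):
        if random.random() < gen_prob:           # a fresh packet preempts any queued one
            have_pkt, head_age = True, 0
        if have_pkt and t % rri == 0:            # one transmission opportunity per RRI
            if random.random() < success_prob:   # failures stand in for collisions
                age, have_pkt = head_age, False
        total += age
        age += 1
        head_age += 1
    return total / slots

for rri in (2, 5, 10, 20):
    print(f"RRI={rri:2d} slots -> mean AoI {mean_aoi(rri=rri):6.1f} slots")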
{"title":"Age of Information Analysis for Multi-Priority Queue and NOMA Enabled C-V2X in IoV","authors":"Zheng Zhang, Qiong Wu, Pingyi Fan, Ke Xiong","doi":"arxiv-2408.00223","DOIUrl":"https://doi.org/arxiv-2408.00223","url":null,"abstract":"As development Internet-of-Vehicles (IoV) technology and demand for\u0000Intelligent Transportation Systems (ITS) increase, there is a growing need for\u0000real-time data and communication by vehicle users. Traditional request-based\u0000methods face challenges such as latency and bandwidth limitations. Mode 4 in\u0000Connected Vehicle-to-Everything (C-V2X) addresses latency and overhead issues\u0000through autonomous resource selection. However, Semi-Persistent Scheduling\u0000(SPS) based on distributed sensing may lead to increased collision.\u0000Non-Orthogonal Multiple Access (NOMA) can alleviate the problem of reduced\u0000packet reception probability due to collisions. Moreover, the concept of Age of\u0000Information (AoI) is introduced as a comprehensive metric reflecting\u0000reliability and latency performance, analyzing the impact of NOMA on C-V2X\u0000communication system. AoI indicates the time a message spends in both local\u0000waiting and transmission processes. In C-V2X, waiting process can be extended\u0000to queuing process, influenced by packet generation rate and Resource\u0000Reservation Interval (RRI). The transmission process is mainly affected by\u0000transmission delay and success rate. In C-V2X, a smaller selection window (SW)\u0000limits the number of available resources for vehicles, resulting in higher\u0000collision rates with increased number of vehicles. SW is generally equal to\u0000RRI, which not only affects AoI in queuing process but also AoI in the\u0000transmission process. Therefore, this paper proposes an AoI estimation method\u0000based on multi-priority data type queues and considers the influence of NOMA on\u0000the AoI generated in both processes in C-V2X system under different RRI\u0000conditions. This work aims to gain a better performance of C-V2X system\u0000comparing with some known algorithms.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"173 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141880961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Rauter, Lukas Zimmermann, Markus Zeilinger
Direct volume rendering using ray-casting is widely used in practice. By using GPUs and applying acceleration techniques such as empty space skipping, high frame rates are possible on modern hardware. This enables performance-critical use cases such as virtual reality volume rendering. The currently fastest known technique uses volumetric distance maps to skip empty sections of the volume during ray-casting, but requires the distance map to be updated on every transfer function change. In this paper, we demonstrate a technique for subdividing the volume intensity range into partitions and deriving what we call partitioned distance maps. These can be used to accelerate the distance map computation for a newly changed transfer function by a factor of up to 30. This allows the currently fastest known empty space skipping approach to be used while maintaining high frame rates even when the transfer function is changed frequently.
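A minimal CPU-side sketch of the partitioned idea using SciPy: one Euclidean distance map is precomputed per intensity partition, and the distance map for a new transfer function is obtained by a per-voxel minimum over the partitions that the transfer function makes visible. The partition count and data sizes are illustrative, and the paper's GPU implementation and update path are not reproduced here.

import numpy as np
from scipy.ndimage import distance_transform_edt

def build_partition_maps(volume, edges):
    maps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        occupied = (volume >= lo) & (volume < hi)
        # distance from every voxel to the nearest voxel whose intensity falls in this partition
        maps.append(distance_transform_edt(~occupied))
    return np.stack(maps)                       # shape: (num_partitions, *volume.shape)

def distance_map_for_tf(partition_maps, visible):
    # visible[i] is True if the new transfer function gives partition i non-zero opacity
    return partition_maps[np.asarray(visible)].min(axis=0)

volume = np.random.default_rng(0).integers(0, 256, size=(64, 64, 64))
edges = np.linspace(0, 256, 9)                  # 8 intensity partitions (assumed granularity)
pmaps = build_partition_maps(volume, edges)     # done once per volume
dmap = distance_map_for_tf(pmaps, [False, False, True, True, False, False, False, True])

Combining precomputed per-partition maps avoids recomputing a full distance transform on every transfer function edit, which is where the reported speedup comes from.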
{"title":"Accelerating Transfer Function Update for Distance Map based Volume Rendering","authors":"Michael Rauter, Lukas Zimmermann, Markus Zeilinger","doi":"arxiv-2407.21552","DOIUrl":"https://doi.org/arxiv-2407.21552","url":null,"abstract":"Direct volume rendering using ray-casting is widely used in practice. By\u0000using GPUs and applying acceleration techniques as empty space skipping, high\u0000frame rates are possible on modern hardware. This enables performance-critical\u0000use-cases such as virtual reality volume rendering. The currently fastest known\u0000technique uses volumetric distance maps to skip empty sections of the volume\u0000during ray-casting but requires the distance map to be updated per transfer\u0000function change. In this paper, we demonstrate a technique for subdividing the\u0000volume intensity range into partitions and deriving what we call partitioned\u0000distance maps. These can be used to accelerate the distance map computation for\u0000a newly changed transfer function by a factor up to 30. This allows the\u0000currently fastest known empty space skipping approach to be used while\u0000maintaining high frame rates even when the transfer function is changed\u0000frequently.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yi Ju, Mingshuai Li, Adalberto Perez, Laura Bellentani, Niclas Jansson, Stefano Markidis, Philipp Schlatter, Erwin Laure
The computational power of High-Performance Computing (HPC) systems is constantly increasing; however, their input/output (IO) performance grows relatively slowly, and their storage capacity is also limited. This imbalance presents significant challenges for applications such as Molecular Dynamics (MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of data for further visualization or analysis. At the same time, checkpointing is crucial for long runs on HPC clusters, due to limited walltimes and/or failures of system components, and typically requires the storage of large amounts of data. Thus, restricted IO performance and storage capacity can lead to bottlenecks for the performance of full application workflows (as compared to computational kernels without IO). In-situ techniques, where data is further processed while still in memory rather than being written out over the IO subsystem, can help to tackle these problems. In contrast to traditional post-processing methods, in-situ techniques can reduce or avoid the need to write or read data via the IO subsystem. They offer a promising approach for applications aiming to leverage the full power of large-scale HPC systems. In-situ techniques can also be applied to hybrid computational nodes on HPC systems consisting of graphics processing units (GPUs) and central processing units (CPUs). On such a node, the GPUs have significant performance advantages over the CPUs. Therefore, current approaches for GPU-accelerated applications often focus on maximizing GPU usage, leaving CPUs underutilized. In-situ tasks that use CPUs to perform data analysis or preprocess data concurrently with the running simulation offer a possibility to reduce this underutilization.
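A toy sketch of this CPU-side concurrency in Python: a worker thread compresses each snapshot in-situ while the (mocked) simulation produces the next step, so otherwise idle CPU cores do useful work and less raw data has to reach the IO subsystem. The compressor, field size, and timings are placeholders.

import queue, threading, time, zlib
import numpy as np

def cpu_insitu_worker(q):
    while True:
        step, field = q.get()
        if field is None:                        # sentinel: simulation finished
            break
        compressed = zlib.compress(field.tobytes(), level=1)   # stand-in for analysis/compression
        print(f"step {step}: {field.nbytes} -> {len(compressed)} bytes")

q = queue.Queue(maxsize=4)                       # bounded queue applies back-pressure to the producer
worker = threading.Thread(target=cpu_insitu_worker, args=(q,))
worker.start()

for step in range(10):
    time.sleep(0.01)                             # placeholder for the GPU-side simulation step
    field = np.random.default_rng(step).random((128, 128, 128)).astype(np.float32)
    q.put((step, field))                         # hand the snapshot to the CPU-side in-situ task
q.put((None, None))
worker.join()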
{"title":"In-Situ Techniques on GPU-Accelerated Data-Intensive Applications","authors":"Yi Ju, Mingshuai Li, Adalberto Perez, Laura Bellentani, Niclas Jansson, Stefano Markidis, Philipp Schlatter, Erwin Laure","doi":"arxiv-2407.20731","DOIUrl":"https://doi.org/arxiv-2407.20731","url":null,"abstract":"The computational power of High-Performance Computing (HPC) systems is\u0000constantly increasing, however, their input/output (IO) performance grows\u0000relatively slowly, and their storage capacity is also limited. This unbalance\u0000presents significant challenges for applications such as Molecular Dynamics\u0000(MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of\u0000data for further visualization or analysis. At the same time, checkpointing is\u0000crucial for long runs on HPC clusters, due to limited walltimes and/or failures\u0000of system components, and typically requires the storage of large amount of\u0000data. Thus, restricted IO performance and storage capacity can lead to\u0000bottlenecks for the performance of full application workflows (as compared to\u0000computational kernels without IO). In-situ techniques, where data is further\u0000processed while still in memory rather to write it out over the I/O subsystem,\u0000can help to tackle these problems. In contrast to traditional post-processing\u0000methods, in-situ techniques can reduce or avoid the need to write or read data\u0000via the IO subsystem. They offer a promising approach for applications aiming\u0000to leverage the full power of large scale HPC systems. In-situ techniques can\u0000also be applied to hybrid computational nodes on HPC systems consisting of\u0000graphics processing units (GPUs) and central processing units (CPUs). On one\u0000node, the GPUs would have significant performance advantages over the CPUs.\u0000Therefore, current approaches for GPU-accelerated applications often focus on\u0000maximizing GPU usage, leaving CPUs underutilized. In-situ tasks using CPUs to\u0000perform data analysis or preprocess data concurrently to the running\u0000simulation, offer a possibility to improve this underutilization.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141873316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Ahmed, R. Shende, I. Perez, D. Crawl, S. Purawat, I. Altintas
Reliable performance metrics are necessary prerequisites to building large-scale end-to-end integrated workflows for collaborative scientific research, particularly within the context of use-inspired decision-making platforms with many concurrent users and when computing real-time and urgent results using large data. This work is a building block for the National Data Platform, which leverages multiple use cases, including the WIFIRE Data and Model Commons for wildfire behavior modeling and the EarthScope Consortium for collaborative geophysical research. This paper presents an artificial intelligence and machine learning (AI/ML) approach to performance assessment and optimization of scientific workflows. An associated early AI/ML framework spanning performance data collection, prediction, and optimization is applied to wildfire science applications within the WIFIRE BurnPro3D (BP3D) platform for proactive fire management and mitigation.
{"title":"Towards an Integrated Performance Framework for Fire Science and Management Workflows","authors":"H. Ahmed, R. Shende, I. Perez, D. Crawl, S. Purawat, I. Altintas","doi":"arxiv-2407.21231","DOIUrl":"https://doi.org/arxiv-2407.21231","url":null,"abstract":"Reliable performance metrics are necessary prerequisites to building\u0000large-scale end-to-end integrated workflows for collaborative scientific\u0000research, particularly within context of use-inspired decision making platforms\u0000with many concurrent users and when computing real-time and urgent results\u0000using large data. This work is a building block for the National Data Platform,\u0000which leverages multiple use-cases including the WIFIRE Data and Model Commons\u0000for wildfire behavior modeling and the EarthScope Consortium for collaborative\u0000geophysical research. This paper presents an artificial intelligence and\u0000machine learning (AI/ML) approach to performance assessment and optimization of\u0000scientific workflows. An associated early AI/ML framework spanning performance\u0000data collection, prediction and optimization is applied to wildfire science\u0000applications within the WIFIRE BurnPro3D (BP3D) platform for proactive fire\u0000management and mitigation.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"221 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yi Ju, Adalberto Perez, Stefano Markidis, Philipp Schlatter, Erwin Laure
High-Performance Computing (HPC) systems provide input/output (IO) performance that grows relatively slowly compared to peak computational performance, and they have limited storage capacity. Computational Fluid Dynamics (CFD) applications aiming to leverage the full power of Exascale HPC systems, such as the solver Nek5000, will generate massive data for further processing. These data need to be efficiently stored via the IO subsystem. However, limited IO performance and storage capacity may result in performance, and thus scientific discovery, bottlenecks. In comparison to traditional post-processing methods, in-situ techniques can reduce or avoid writing and reading the data through the IO subsystem, promising to be a solution to these problems. In this paper, we study the performance and resource usage of three in-situ use cases: data compression, image generation, and uncertainty quantification. We furthermore analyze three approaches in which these in-situ tasks and the simulation are executed synchronously, asynchronously, or in a hybrid manner. In-situ compression can be used to reduce the IO time and storage requirements while maintaining data accuracy. Furthermore, in-situ visualization and analysis can save terabytes of data from being routed through the IO subsystem to storage. However, the overall efficiency is crucially dependent on the characteristics of both the in-situ task and the simulation. In some cases, the overhead introduced by the in-situ tasks can be substantial. Therefore, it is essential to choose the proper in-situ approach, synchronous, asynchronous, or hybrid, to minimize overhead and maximize the benefits of concurrent execution.
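A back-of-the-envelope model of the trade-off studied here, under simplified assumptions (fixed per-step costs, perfect overlap in the asynchronous case, and an invented coordination overhead); the paper's numbers come from measurements, not from this formula.

# Illustrative per-step cost model (hypothetical values in seconds)
t_sim, t_insitu, t_copy, t_overhead = 2.0, 1.5, 0.2, 0.1

t_sync = t_sim + t_insitu                                        # in-situ task blocks the simulation
t_async = max(t_sim, t_insitu) + t_overhead                      # in-situ task overlaps the next step
t_hybrid = t_copy + max(t_sim, t_insitu - t_copy) + t_overhead   # short synchronous copy, rest overlapped

for name, t in (("synchronous", t_sync), ("asynchronous", t_async), ("hybrid", t_hybrid)):
    print(f"{name:12s} {t:.2f} s/step")
# Once t_insitu exceeds t_sim, the overlapped variants become bound by the in-situ task,
# which is when the concurrency overhead stops paying off.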
{"title":"Understanding the Impact of Synchronous, Asynchronous, and Hybrid In-Situ Techniques in Computational Fluid Dynamics Applications","authors":"Yi Ju, Adalberto Perez, Stefano Markidis, Philipp Schlatter, Erwin Laure","doi":"arxiv-2407.20717","DOIUrl":"https://doi.org/arxiv-2407.20717","url":null,"abstract":"High-Performance Computing (HPC) systems provide input/output (IO)\u0000performance growing relatively slowly compared to peak computational\u0000performance and have limited storage capacity. Computational Fluid Dynamics\u0000(CFD) applications aiming to leverage the full power of Exascale HPC systems,\u0000such as the solver Nek5000, will generate massive data for further processing.\u0000These data need to be efficiently stored via the IO subsystem. However, limited\u0000IO performance and storage capacity may result in performance, and thus\u0000scientific discovery, bottlenecks. In comparison to traditional post-processing\u0000methods, in-situ techniques can reduce or avoid writing and reading the data\u0000through the IO subsystem, promising to be a solution to these problems. In this\u0000paper, we study the performance and resource usage of three in-situ use cases:\u0000data compression, image generation, and uncertainty quantification. We\u0000furthermore analyze three approaches when these in-situ tasks and the\u0000simulation are executed synchronously, asynchronously, or in a hybrid manner.\u0000In-situ compression can be used to reduce the IO time and storage requirements\u0000while maintaining data accuracy. Furthermore, in-situ visualization and\u0000analysis can save Terabytes of data from being routed through the IO subsystem\u0000to storage. However, the overall efficiency is crucially dependent on the\u0000characteristics of both, the in-situ task and the simulation. In some cases,\u0000the overhead introduced by the in-situ tasks can be substantial. Therefore, it\u0000is essential to choose the proper in-situ approach, synchronous, asynchronous,\u0000or hybrid, to minimize overhead and maximize the benefits of concurrent\u0000execution.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}