The Tapis framework provides APIs for automating job execution on remote resources, including HPC clusters and servers running in the cloud. Tapis can simplify interaction with remote cyberinfrastructure (CI), but the current services require users to specify the exact configuration of a job to run, including the system, queue, node count, and maximum run time, among other attributes. Moreover, the remote resources must be defined and configured in Tapis before a job can be submitted. In this paper, we present our efforts to develop an intelligent job scheduling capability in Tapis, where various attributes of a job configuration can be automatically determined for the user, and computational resources can be dynamically provisioned by Tapis for specific jobs. We develop an overall architecture for such a feature, which suggests a set of core challenges to be solved. We then focus on one specific challenge: predicting queue times for a job on different HPC systems and queues, and we present two sets of results based on machine learning methods. Our first set of results casts the problem as a regression, which can be used to select the best system from a list of existing options. Our second set of results frames the problem as a classification, allowing us to compare the use of an existing system with a dynamically provisioned resource.
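To make the queue-time prediction step concrete, here is a minimal sketch in Python using scikit-learn on synthetic job records; the feature set (node count, requested wall time, queue depth at submission), the random-forest models, and the 60-minute provisioning threshold are illustrative assumptions, not the paper's actual setup.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic job records: [node_count, requested_walltime_min, queue_depth_at_submit]
X = rng.uniform([1, 10, 0], [256, 2880, 500], size=(2000, 3))
# Hypothetical ground-truth queue waits in minutes
wait = 0.5 * X[:, 0] + 0.1 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(0, 30, 2000)
over_threshold = (wait > 60).astype(int)   # assumed cutoff for provisioning a new resource

X_tr, X_te, w_tr, w_te, c_tr, c_te = train_test_split(X, wait, over_threshold, random_state=0)

# Regression view: predict the expected wait per system/queue, then pick the minimum.
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, w_tr)
print("predicted waits (min):", reg.predict(X_te[:3]).round(1))

# Classification view: will the wait exceed the threshold that justifies dynamic provisioning?
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, c_tr)
print("exceeds threshold:", clf.predict(X_te[:3]))

In practice one such model would be trained per system and queue (or with the system as a categorical feature), and the scheduler would compare predictions across the candidates.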
{"title":"Toward Smart Scheduling in Tapis","authors":"Joe Stubbs, Smruti Padhy, Richard Cardone","doi":"arxiv-2408.03349","DOIUrl":"https://doi.org/arxiv-2408.03349","url":null,"abstract":"The Tapis framework provides APIs for automating job execution on remote\u0000resources, including HPC clusters and servers running in the cloud. Tapis can\u0000simplify the interaction with remote cyberinfrastructure (CI), but the current\u0000services require users to specify the exact configuration of a job to run,\u0000including the system, queue, node count, and maximum run time, among other\u0000attributes. Moreover, the remote resources must be defined and configured in\u0000Tapis before a job can be submitted. In this paper, we present our efforts to\u0000develop an intelligent job scheduling capability in Tapis, where various\u0000attributes about a job configuration can be automatically determined for the\u0000user, and computational resources can be dynamically provisioned by Tapis for\u0000specific jobs. We develop an overall architecture for such a feature, which\u0000suggests a set of core challenges to be solved. Then, we focus on one such\u0000specific challenge: predicting queue times for a job on different HPC systems\u0000and queues, and we present two sets of results based on machine learning\u0000methods. Our first set of results cast the problem as a regression, which can\u0000be used to select the best system from a list of existing options. Our second\u0000set of results frames the problem as a classification, allowing us to compare\u0000the use of an existing system with a dynamically provisioned resource.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Solving very large linear systems of equations is a key computational task in science and technology. In many cases, the coefficient matrix of the linear system is rank-deficient, leading to systems that may be underdetermined, inconsistent, or both. In such cases, one generally seeks the least squares solution that minimizes the residual of the problem, which can be further defined as the solution with the smallest norm when the coefficient matrix has a nontrivial nullspace. This work presents several new techniques for solving least squares problems involving coefficient matrices so large that they do not fit in main memory. The implementations include both CPU and GPU variants. All techniques rely on complete orthogonal decompositions, which guarantee that both conditions of a least squares solution are met regardless of the rank properties of the matrix. Specifically, they rely on the recently proposed "randUTV" algorithm, which is particularly effective in strongly communication-constrained environments. A detailed precision and performance study reveals that the new methods, which operate on data stored on disk, are competitive with state-of-the-art methods that store all data in main memory.
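As a small, in-core illustration of the kind of complete orthogonal decomposition these methods rely on, the sketch below computes the minimum-norm least-squares solution of a rank-deficient system with NumPy/SciPy (a column-pivoted QR followed by a second QR). This is only a dense, in-memory analogue; it is not the paper's out-of-core randUTV implementation.

import numpy as np
from scipy.linalg import qr, solve_triangular

def cod_min_norm_lstsq(A, b, tol=1e-10):
    # Rank-revealing (column-pivoted) QR: A[:, piv] = Q @ R
    Q, R, piv = qr(A, mode="economic", pivoting=True)
    r = int(np.sum(np.abs(np.diag(R)) > tol * abs(R[0, 0])))
    Q1, T = Q[:, :r], R[:r, :]                 # A[:, piv] = Q1 @ T, with T of full row rank
    Z, S = np.linalg.qr(T.T)                   # second QR completes the decomposition: T = S.T @ Z.T
    c = Q1.T @ b                               # residual is minimized by any y with T @ y = c
    y = Z @ solve_triangular(S, c, trans='T')  # minimum-norm y = pinv(T) @ c
    x = np.zeros(A.shape[1])
    x[piv] = y                                 # undo the column permutation
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 80)) @ rng.normal(size=(80, 200))   # 500 x 200 matrix of rank 80
b = rng.normal(size=500)
x = cod_min_norm_lstsq(A, b)
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # LAPACK's SVD-based solver agrees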
{"title":"Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures","authors":"Mónica Chillarón, Gregorio Quintana-Ortí, Vicente Vidal, Per-Gunnar Martinsson","doi":"arxiv-2408.05238","DOIUrl":"https://doi.org/arxiv-2408.05238","url":null,"abstract":"Solving very large linear systems of equations is a key computational task in\u0000science and technology. In many cases, the coefficient matrix of the linear\u0000system is rank-deficient, leading to systems that may be underdetermined,\u0000inconsistent, or both. In such cases, one generally seeks to compute the least\u0000squares solution that minimizes the residual of the problem, which can be\u0000further defined as the solution with smallest norm in cases where the\u0000coefficient matrix has a nontrivial nullspace. This work presents several new\u0000techniques for solving least squares problems involving coefficient matrices\u0000that are so large that they do not fit in main memory. The implementations\u0000include both CPU and GPU variants. All techniques rely on complete orthogonal\u0000decompositions that guarantee that both conditions of a least squares solution\u0000are met, regardless of the rank properties of the matrix. Specifically, they\u0000rely on the recently proposed \"randUTV\" algorithm that is particularly\u0000effective in strongly communication-constrained environments. A detailed\u0000precision and performance study reveals that the new methods, that operate on\u0000data stored on disk, are competitive with state-of-the-art methods that store\u0000all data in main memory.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the volume of data being produced grows at an exponential rate and needs to be processed quickly, it is reasonable that the data be available very close to the compute devices to reduce transfer latency. Due to this need, local filesystems are getting close attention to understand their inner workings, performance, and, more importantly, their limitations. This study analyzes a few popular Linux filesystems: EXT4, XFS, BtrFS, ZFS, and F2FS, by creating, storing, and then reading back one billion files from the local filesystem. The study also captures and analyzes read/write throughput, storage block usage, disk space utilization and overheads, and other metrics useful for system designers and integrators. Furthermore, the study explores side effects such as filesystem performance degradation during and after the creation of these large numbers of files and folders.
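A drastically scaled-down sketch of the create/read-back methodology is shown below (Python, one directory, fixed 4 KiB payloads); the study itself covers one billion files per filesystem and many more metrics, and meaningful read numbers additionally require dropping the page cache between the two phases.

import os, time

def small_file_benchmark(root, n_files=100_000, payload=b"x" * 4096):
    os.makedirs(root, exist_ok=True)
    t0 = time.perf_counter()
    for i in range(n_files):                                  # create + write phase
        with open(os.path.join(root, f"f{i:09d}"), "wb") as f:
            f.write(payload)
    t1 = time.perf_counter()
    for i in range(n_files):                                  # read-back phase
        with open(os.path.join(root, f"f{i:09d}"), "rb") as f:
            f.read()
    t2 = time.perf_counter()
    print(f"create: {n_files / (t1 - t0):.0f} files/s, read-back: {n_files / (t2 - t1):.0f} files/s")

# Run once per filesystem under test, e.g. small_file_benchmark("/mnt/ext4/bench")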
{"title":"Billion-files File Systems (BfFS): A Comparison","authors":"Sohail Shaikh","doi":"arxiv-2408.01805","DOIUrl":"https://doi.org/arxiv-2408.01805","url":null,"abstract":"As the volume of data being produced is increasing at an exponential rate\u0000that needs to be processed quickly, it is reasonable that the data needs to be\u0000available very close to the compute devices to reduce transfer latency. Due to\u0000this need, local filesystems are getting close attention to understand their\u0000inner workings, performance, and more importantly their limitations. This study\u0000analyzes few popular Linux filesystems: EXT4, XFS, BtrFS, ZFS, and F2FS by\u0000creating, storing, and then reading back one billion files from the local\u0000filesystem. The study also captured and analyzed read/write throughput, storage\u0000blocks usage, disk space utilization and overheads, and other metrics useful\u0000for system designers and integrators. Furthermore, the study explored other\u0000side effects such as filesystem performance degradation during and after these\u0000large numbers of files and folders are created.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"78 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saman Kazemkhani, Aarav Pandya, Daphne Cornelisse, Brennan Shacklett, Eugene Vinitsky
Multi-agent learning algorithms have been successful at generating superhuman planning in a wide variety of games but have had little impact on the design of deployed multi-agent planners. A key bottleneck in applying these techniques to multi-agent planning is that they require billions of steps of experience. To enable the study of multi-agent planning at this scale, we present GPUDrive, a GPU-accelerated, multi-agent simulator built on top of the Madrona Game Engine that can generate over a million steps of experience per second. Observation, reward, and dynamics functions are written directly in C++, allowing users to define complex, heterogeneous agent behaviors that are lowered to high-performance CUDA. We show that using GPUDrive we are able to effectively train reinforcement learning agents over many scenes in the Waymo Motion dataset, yielding highly effective goal-reaching agents in minutes for individual scenes and generally capable agents in a few hours. We ship these trained agents as part of the code base at https://github.com/Emerge-Lab/gpudrive.
{"title":"GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS","authors":"Saman Kazemkhani, Aarav Pandya, Daphne Cornelisse, Brennan Shacklett, Eugene Vinitsky","doi":"arxiv-2408.01584","DOIUrl":"https://doi.org/arxiv-2408.01584","url":null,"abstract":"Multi-agent learning algorithms have been successful at generating superhuman\u0000planning in a wide variety of games but have had little impact on the design of\u0000deployed multi-agent planners. A key bottleneck in applying these techniques to\u0000multi-agent planning is that they require billions of steps of experience. To\u0000enable the study of multi-agent planning at this scale, we present GPUDrive, a\u0000GPU-accelerated, multi-agent simulator built on top of the Madrona Game Engine\u0000that can generate over a million steps of experience per second. Observation,\u0000reward, and dynamics functions are written directly in C++, allowing users to\u0000define complex, heterogeneous agent behaviors that are lowered to\u0000high-performance CUDA. We show that using GPUDrive we are able to effectively\u0000train reinforcement learning agents over many scenes in the Waymo Motion\u0000dataset, yielding highly effective goal-reaching agents in minutes for\u0000individual scenes and generally capable agents in a few hours. We ship these\u0000trained agents as part of the code base at\u0000https://github.com/Emerge-Lab/gpudrive.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"173 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the number of WiFi devices and their traffic demands continue to rise, the need for a scalable and high-performance wireless infrastructure becomes increasingly essential. Central to this infrastructure are WiFi Access Points (APs), which facilitate packet switching between Ethernet and WiFi interfaces. Despite APs' reliance on the Linux kernel's data plane for packet switching, the detailed operations and complexities of switching packets between Ethernet and WiFi interfaces have not been investigated in existing works. This paper makes the following contributions towards filling this research gap. Through macro and micro-analysis of empirical experiments, our study reveals insights in two distinct categories. Firstly, while the kernel's statistics offer valuable insights into system operations, we identify and discuss potential pitfalls that can severely affect system analysis. For instance, we reveal the implications of device drivers on the meaning and accuracy of the statistics related to packet-switching tasks and processor utilization. Secondly, we analyze the impact of the packet switching path and core configuration on performance and power consumption. Specifically, we identify the differences in Ethernet-to-WiFi and WiFi-to-Ethernet data paths regarding processing components, multi-core utilization, and energy efficiency. We show that the WiFi-to-Ethernet data path leverages better multi-core processing and exhibits lower power consumption.
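For illustration, the sketch below samples the per-interface packet and byte counters that the Linux kernel exposes in /proc/net/dev, which is exactly the kind of statistic the paper warns can be misread (driver-level aggregation and offloads can make "packet" counts diverge from on-air frames). The interface names are placeholders.

import time

def read_counters(iface):
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:                     # skip the two header lines
            name, data = line.split(":", 1)
            if name.strip() == iface:
                v = [int(x) for x in data.split()]
                return {"rx_bytes": v[0], "rx_packets": v[1],
                        "tx_bytes": v[8], "tx_packets": v[9]}
    raise ValueError(f"interface {iface!r} not found")

def rates(iface, interval=1.0):
    before = read_counters(iface)
    time.sleep(interval)
    after = read_counters(iface)
    return {k: (after[k] - before[k]) / interval for k in before}

# e.g. print(rates("eth0"), rates("wlan0")) while switching traffic between the two interfaces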
{"title":"Understanding and Enhancing Linux Kernel-based Packet Switching on WiFi Access Points","authors":"Shiqi Zhang, Mridul Gupta, Behnam Dezfouli","doi":"arxiv-2408.01013","DOIUrl":"https://doi.org/arxiv-2408.01013","url":null,"abstract":"As the number of WiFi devices and their traffic demands continue to rise, the\u0000need for a scalable and high-performance wireless infrastructure becomes\u0000increasingly essential. Central to this infrastructure are WiFi Access Points\u0000(APs), which facilitate packet switching between Ethernet and WiFi interfaces.\u0000Despite APs' reliance on the Linux kernel's data plane for packet switching,\u0000the detailed operations and complexities of switching packets between Ethernet\u0000and WiFi interfaces have not been investigated in existing works. This paper\u0000makes the following contributions towards filling this research gap. Through\u0000macro and micro-analysis of empirical experiments, our study reveals insights\u0000in two distinct categories. Firstly, while the kernel's statistics offer\u0000valuable insights into system operations, we identify and discuss potential\u0000pitfalls that can severely affect system analysis. For instance, we reveal the\u0000implications of device drivers on the meaning and accuracy of the statistics\u0000related to packet-switching tasks and processor utilization. Secondly, we\u0000analyze the impact of the packet switching path and core configuration on\u0000performance and power consumption. Specifically, we identify the differences in\u0000Ethernet-to-WiFi and WiFi-to-Ethernet data paths regarding processing\u0000components, multi-core utilization, and energy efficiency. We show that the\u0000WiFi-to-Ethernet data path leverages better multi-core processing and exhibits\u0000lower power consumption.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"96 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As Internet-of-Vehicles (IoV) technology develops and demand for Intelligent Transportation Systems (ITS) increases, there is a growing need for real-time data and communication by vehicle users. Traditional request-based methods face challenges such as latency and bandwidth limitations. Mode 4 in Connected Vehicle-to-Everything (C-V2X) addresses latency and overhead issues through autonomous resource selection. However, Semi-Persistent Scheduling (SPS) based on distributed sensing may lead to increased collisions. Non-Orthogonal Multiple Access (NOMA) can alleviate the reduced packet reception probability caused by collisions. Moreover, the concept of Age of Information (AoI) is introduced as a comprehensive metric reflecting reliability and latency performance, and is used to analyze the impact of NOMA on the C-V2X communication system. AoI indicates the time a message spends in both the local waiting and transmission processes. In C-V2X, the waiting process can be extended to a queuing process, influenced by the packet generation rate and the Resource Reservation Interval (RRI). The transmission process is mainly affected by transmission delay and success rate. In C-V2X, a smaller selection window (SW) limits the number of resources available to vehicles, resulting in higher collision rates as the number of vehicles increases. The SW is generally equal to the RRI, which affects AoI not only in the queuing process but also in the transmission process. Therefore, this paper proposes an AoI estimation method based on multi-priority data type queues and considers the influence of NOMA on the AoI generated in both processes in the C-V2X system under different RRI conditions. This work aims to achieve better C-V2X system performance compared with some known algorithms.
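The queuing-plus-transmission view of AoI can be illustrated with a toy slot-based model: the receiver-side age grows every slot and resets to the delivered packet's age on a successful transmission opportunity, which occurs once per RRI. The generation probability, success probability, and single-queue discipline below are illustrative assumptions, not the paper's multi-priority, NOMA-aware estimator.

import random

def mean_aoi(slots=200_000, gen_prob=0.2, rri=5, success_prob=0.8, seed=0):
    random.seed(seed)
    age = 0            # receiver-side age of information
    head_age = 0       # age of the packet currently waiting to be sent
    have_pkt = False
    total = 0
    for t in range(slots):
        if random.random() < gen_prob:           # a fresh packet preempts any queued one
            have_pkt, head_age = True, 0
        if have_pkt and t % rri == 0:            # one transmission opportunity per RRI
            if random.random() < success_prob:   # failures stand in for collisions
                age, have_pkt = head_age, False
        total += age
        age += 1
        head_age += 1
    return total / slots

for rri in (2, 5, 10, 20):
    print(f"RRI={rri:2d} slots -> mean AoI {mean_aoi(rri=rri):6.1f} slots")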
{"title":"Age of Information Analysis for Multi-Priority Queue and NOMA Enabled C-V2X in IoV","authors":"Zheng Zhang, Qiong Wu, Pingyi Fan, Ke Xiong","doi":"arxiv-2408.00223","DOIUrl":"https://doi.org/arxiv-2408.00223","url":null,"abstract":"As development Internet-of-Vehicles (IoV) technology and demand for\u0000Intelligent Transportation Systems (ITS) increase, there is a growing need for\u0000real-time data and communication by vehicle users. Traditional request-based\u0000methods face challenges such as latency and bandwidth limitations. Mode 4 in\u0000Connected Vehicle-to-Everything (C-V2X) addresses latency and overhead issues\u0000through autonomous resource selection. However, Semi-Persistent Scheduling\u0000(SPS) based on distributed sensing may lead to increased collision.\u0000Non-Orthogonal Multiple Access (NOMA) can alleviate the problem of reduced\u0000packet reception probability due to collisions. Moreover, the concept of Age of\u0000Information (AoI) is introduced as a comprehensive metric reflecting\u0000reliability and latency performance, analyzing the impact of NOMA on C-V2X\u0000communication system. AoI indicates the time a message spends in both local\u0000waiting and transmission processes. In C-V2X, waiting process can be extended\u0000to queuing process, influenced by packet generation rate and Resource\u0000Reservation Interval (RRI). The transmission process is mainly affected by\u0000transmission delay and success rate. In C-V2X, a smaller selection window (SW)\u0000limits the number of available resources for vehicles, resulting in higher\u0000collision rates with increased number of vehicles. SW is generally equal to\u0000RRI, which not only affects AoI in queuing process but also AoI in the\u0000transmission process. Therefore, this paper proposes an AoI estimation method\u0000based on multi-priority data type queues and considers the influence of NOMA on\u0000the AoI generated in both processes in C-V2X system under different RRI\u0000conditions. This work aims to gain a better performance of C-V2X system\u0000comparing with some known algorithms.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"173 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141880961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Rauter, Lukas Zimmermann, Markus Zeilinger
Direct volume rendering using ray-casting is widely used in practice. By using GPUs and applying acceleration techniques such as empty space skipping, high frame rates are possible on modern hardware. This enables performance-critical use cases such as virtual reality volume rendering. The currently fastest known technique uses volumetric distance maps to skip empty sections of the volume during ray-casting, but requires the distance map to be updated on every transfer function change. In this paper, we demonstrate a technique for subdividing the volume intensity range into partitions and deriving what we call partitioned distance maps. These can be used to accelerate the distance map computation for a newly changed transfer function by a factor of up to 30. This allows the currently fastest known empty space skipping approach to be used while maintaining high frame rates even when the transfer function is changed frequently.
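A minimal CPU-side sketch of the partitioned idea using SciPy: one Euclidean distance map is precomputed per intensity partition, and the distance map for a new transfer function is obtained by a per-voxel minimum over the partitions that the transfer function makes visible. The partition count and data sizes are illustrative, and the paper's GPU implementation and update path are not reproduced here.

import numpy as np
from scipy.ndimage import distance_transform_edt

def build_partition_maps(volume, edges):
    maps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        occupied = (volume >= lo) & (volume < hi)
        # distance from every voxel to the nearest voxel whose intensity falls in this partition
        maps.append(distance_transform_edt(~occupied))
    return np.stack(maps)                       # shape: (num_partitions, *volume.shape)

def distance_map_for_tf(partition_maps, visible):
    # visible[i] is True if the new transfer function gives partition i non-zero opacity
    return partition_maps[np.asarray(visible)].min(axis=0)

volume = np.random.default_rng(0).integers(0, 256, size=(64, 64, 64))
edges = np.linspace(0, 256, 9)                  # 8 intensity partitions (assumed granularity)
pmaps = build_partition_maps(volume, edges)     # done once per volume
dmap = distance_map_for_tf(pmaps, [False, False, True, True, False, False, False, True])

Combining precomputed per-partition maps avoids recomputing a full distance transform on every transfer function edit, which is where the reported speedup comes from.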
{"title":"Accelerating Transfer Function Update for Distance Map based Volume Rendering","authors":"Michael Rauter, Lukas Zimmermann, Markus Zeilinger","doi":"arxiv-2407.21552","DOIUrl":"https://doi.org/arxiv-2407.21552","url":null,"abstract":"Direct volume rendering using ray-casting is widely used in practice. By\u0000using GPUs and applying acceleration techniques as empty space skipping, high\u0000frame rates are possible on modern hardware. This enables performance-critical\u0000use-cases such as virtual reality volume rendering. The currently fastest known\u0000technique uses volumetric distance maps to skip empty sections of the volume\u0000during ray-casting but requires the distance map to be updated per transfer\u0000function change. In this paper, we demonstrate a technique for subdividing the\u0000volume intensity range into partitions and deriving what we call partitioned\u0000distance maps. These can be used to accelerate the distance map computation for\u0000a newly changed transfer function by a factor up to 30. This allows the\u0000currently fastest known empty space skipping approach to be used while\u0000maintaining high frame rates even when the transfer function is changed\u0000frequently.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yi Ju, Mingshuai Li, Adalberto Perez, Laura Bellentani, Niclas Jansson, Stefano Markidis, Philipp Schlatter, Erwin Laure
The computational power of High-Performance Computing (HPC) systems is constantly increasing; however, their input/output (IO) performance grows relatively slowly, and their storage capacity is also limited. This imbalance presents significant challenges for applications such as Molecular Dynamics (MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of data for further visualization or analysis. At the same time, checkpointing is crucial for long runs on HPC clusters, due to limited walltimes and/or failures of system components, and typically requires the storage of large amounts of data. Thus, restricted IO performance and storage capacity can lead to bottlenecks for the performance of full application workflows (as compared to computational kernels without IO). In-situ techniques, where data is further processed while still in memory rather than being written out over the IO subsystem, can help to tackle these problems. In contrast to traditional post-processing methods, in-situ techniques can reduce or avoid the need to write or read data via the IO subsystem. They offer a promising approach for applications aiming to leverage the full power of large-scale HPC systems. In-situ techniques can also be applied to hybrid computational nodes on HPC systems consisting of graphics processing units (GPUs) and central processing units (CPUs). On such a node, the GPUs have significant performance advantages over the CPUs. Therefore, current approaches for GPU-accelerated applications often focus on maximizing GPU usage, leaving CPUs underutilized. In-situ tasks that use CPUs to perform data analysis or preprocess data concurrently with the running simulation offer a possibility to reduce this underutilization.
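A toy sketch of this CPU-side concurrency in Python: a worker thread compresses each snapshot in-situ while the (mocked) simulation produces the next step, so otherwise idle CPU cores do useful work and less raw data has to reach the IO subsystem. The compressor, field size, and timings are placeholders.

import queue, threading, time, zlib
import numpy as np

def cpu_insitu_worker(q):
    while True:
        step, field = q.get()
        if field is None:                        # sentinel: simulation finished
            break
        compressed = zlib.compress(field.tobytes(), level=1)   # stand-in for analysis/compression
        print(f"step {step}: {field.nbytes} -> {len(compressed)} bytes")

q = queue.Queue(maxsize=4)                       # bounded queue applies back-pressure to the producer
worker = threading.Thread(target=cpu_insitu_worker, args=(q,))
worker.start()

for step in range(10):
    time.sleep(0.01)                             # placeholder for the GPU-side simulation step
    field = np.random.default_rng(step).random((128, 128, 128)).astype(np.float32)
    q.put((step, field))                         # hand the snapshot to the CPU-side in-situ task
q.put((None, None))
worker.join()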
{"title":"In-Situ Techniques on GPU-Accelerated Data-Intensive Applications","authors":"Yi Ju, Mingshuai Li, Adalberto Perez, Laura Bellentani, Niclas Jansson, Stefano Markidis, Philipp Schlatter, Erwin Laure","doi":"arxiv-2407.20731","DOIUrl":"https://doi.org/arxiv-2407.20731","url":null,"abstract":"The computational power of High-Performance Computing (HPC) systems is\u0000constantly increasing, however, their input/output (IO) performance grows\u0000relatively slowly, and their storage capacity is also limited. This unbalance\u0000presents significant challenges for applications such as Molecular Dynamics\u0000(MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of\u0000data for further visualization or analysis. At the same time, checkpointing is\u0000crucial for long runs on HPC clusters, due to limited walltimes and/or failures\u0000of system components, and typically requires the storage of large amount of\u0000data. Thus, restricted IO performance and storage capacity can lead to\u0000bottlenecks for the performance of full application workflows (as compared to\u0000computational kernels without IO). In-situ techniques, where data is further\u0000processed while still in memory rather to write it out over the I/O subsystem,\u0000can help to tackle these problems. In contrast to traditional post-processing\u0000methods, in-situ techniques can reduce or avoid the need to write or read data\u0000via the IO subsystem. They offer a promising approach for applications aiming\u0000to leverage the full power of large scale HPC systems. In-situ techniques can\u0000also be applied to hybrid computational nodes on HPC systems consisting of\u0000graphics processing units (GPUs) and central processing units (CPUs). On one\u0000node, the GPUs would have significant performance advantages over the CPUs.\u0000Therefore, current approaches for GPU-accelerated applications often focus on\u0000maximizing GPU usage, leaving CPUs underutilized. In-situ tasks using CPUs to\u0000perform data analysis or preprocess data concurrently to the running\u0000simulation, offer a possibility to improve this underutilization.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141873316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Ahmed, R. Shende, I. Perez, D. Crawl, S. Purawat, I. Altintas
Reliable performance metrics are necessary prerequisites to building large-scale end-to-end integrated workflows for collaborative scientific research, particularly within the context of use-inspired decision-making platforms with many concurrent users and when computing real-time and urgent results using large data. This work is a building block for the National Data Platform, which leverages multiple use cases, including the WIFIRE Data and Model Commons for wildfire behavior modeling and the EarthScope Consortium for collaborative geophysical research. This paper presents an artificial intelligence and machine learning (AI/ML) approach to performance assessment and optimization of scientific workflows. An associated early AI/ML framework spanning performance data collection, prediction, and optimization is applied to wildfire science applications within the WIFIRE BurnPro3D (BP3D) platform for proactive fire management and mitigation.
{"title":"Towards an Integrated Performance Framework for Fire Science and Management Workflows","authors":"H. Ahmed, R. Shende, I. Perez, D. Crawl, S. Purawat, I. Altintas","doi":"arxiv-2407.21231","DOIUrl":"https://doi.org/arxiv-2407.21231","url":null,"abstract":"Reliable performance metrics are necessary prerequisites to building\u0000large-scale end-to-end integrated workflows for collaborative scientific\u0000research, particularly within context of use-inspired decision making platforms\u0000with many concurrent users and when computing real-time and urgent results\u0000using large data. This work is a building block for the National Data Platform,\u0000which leverages multiple use-cases including the WIFIRE Data and Model Commons\u0000for wildfire behavior modeling and the EarthScope Consortium for collaborative\u0000geophysical research. This paper presents an artificial intelligence and\u0000machine learning (AI/ML) approach to performance assessment and optimization of\u0000scientific workflows. An associated early AI/ML framework spanning performance\u0000data collection, prediction and optimization is applied to wildfire science\u0000applications within the WIFIRE BurnPro3D (BP3D) platform for proactive fire\u0000management and mitigation.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"221 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yi Ju, Adalberto Perez, Stefano Markidis, Philipp Schlatter, Erwin Laure
High-Performance Computing (HPC) systems provide input/output (IO) performance that grows relatively slowly compared to peak computational performance, and they have limited storage capacity. Computational Fluid Dynamics (CFD) applications aiming to leverage the full power of Exascale HPC systems, such as the solver Nek5000, will generate massive data for further processing. These data need to be efficiently stored via the IO subsystem. However, limited IO performance and storage capacity may result in performance, and thus scientific discovery, bottlenecks. In comparison to traditional post-processing methods, in-situ techniques can reduce or avoid writing and reading the data through the IO subsystem, promising to be a solution to these problems. In this paper, we study the performance and resource usage of three in-situ use cases: data compression, image generation, and uncertainty quantification. We furthermore analyze three approaches in which these in-situ tasks and the simulation are executed synchronously, asynchronously, or in a hybrid manner. In-situ compression can be used to reduce the IO time and storage requirements while maintaining data accuracy. Furthermore, in-situ visualization and analysis can save terabytes of data from being routed through the IO subsystem to storage. However, the overall efficiency is crucially dependent on the characteristics of both the in-situ task and the simulation. In some cases, the overhead introduced by the in-situ tasks can be substantial. Therefore, it is essential to choose the proper in-situ approach, synchronous, asynchronous, or hybrid, to minimize overhead and maximize the benefits of concurrent execution.
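A back-of-the-envelope model of the trade-off studied here, under simplified assumptions (fixed per-step costs, perfect overlap in the asynchronous case, and an invented coordination overhead); the paper's numbers come from measurements, not from this formula.

# Illustrative per-step cost model (hypothetical values in seconds)
t_sim, t_insitu, t_copy, t_overhead = 2.0, 1.5, 0.2, 0.1

t_sync = t_sim + t_insitu                                        # in-situ task blocks the simulation
t_async = max(t_sim, t_insitu) + t_overhead                      # in-situ task overlaps the next step
t_hybrid = t_copy + max(t_sim, t_insitu - t_copy) + t_overhead   # short synchronous copy, rest overlapped

for name, t in (("synchronous", t_sync), ("asynchronous", t_async), ("hybrid", t_hybrid)):
    print(f"{name:12s} {t:.2f} s/step")
# Once t_insitu exceeds t_sim, the overlapped variants become bound by the in-situ task,
# which is when the concurrency overhead stops paying off.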
{"title":"Understanding the Impact of Synchronous, Asynchronous, and Hybrid In-Situ Techniques in Computational Fluid Dynamics Applications","authors":"Yi Ju, Adalberto Perez, Stefano Markidis, Philipp Schlatter, Erwin Laure","doi":"arxiv-2407.20717","DOIUrl":"https://doi.org/arxiv-2407.20717","url":null,"abstract":"High-Performance Computing (HPC) systems provide input/output (IO)\u0000performance growing relatively slowly compared to peak computational\u0000performance and have limited storage capacity. Computational Fluid Dynamics\u0000(CFD) applications aiming to leverage the full power of Exascale HPC systems,\u0000such as the solver Nek5000, will generate massive data for further processing.\u0000These data need to be efficiently stored via the IO subsystem. However, limited\u0000IO performance and storage capacity may result in performance, and thus\u0000scientific discovery, bottlenecks. In comparison to traditional post-processing\u0000methods, in-situ techniques can reduce or avoid writing and reading the data\u0000through the IO subsystem, promising to be a solution to these problems. In this\u0000paper, we study the performance and resource usage of three in-situ use cases:\u0000data compression, image generation, and uncertainty quantification. We\u0000furthermore analyze three approaches when these in-situ tasks and the\u0000simulation are executed synchronously, asynchronously, or in a hybrid manner.\u0000In-situ compression can be used to reduce the IO time and storage requirements\u0000while maintaining data accuracy. Furthermore, in-situ visualization and\u0000analysis can save Terabytes of data from being routed through the IO subsystem\u0000to storage. However, the overall efficiency is crucially dependent on the\u0000characteristics of both, the in-situ task and the simulation. In some cases,\u0000the overhead introduced by the in-situ tasks can be substantial. Therefore, it\u0000is essential to choose the proper in-situ approach, synchronous, asynchronous,\u0000or hybrid, to minimize overhead and maximize the benefits of concurrent\u0000execution.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}