
arXiv - CS - Performance: Latest Publications

Delegation with Trust<T>: A Scalable, Type- and Memory-Safe Alternative to Locks
Pub Date : 2024-08-20 DOI: arxiv-2408.11173
Noaman Ahmad, Ben Baenen, Chen Chen, Jakob Eriksson
We present Trust<T>, a general, type- and memory-safe alternative to locking in concurrent programs. Instead of synchronizing multi-threaded access to an object of type T with a lock, the programmer may place the object in a Trust<T>. The object is then no longer directly accessible. Instead, a designated thread, the object's trustee, is responsible for applying any requested operations to the object, as requested via the Trust<T> API. Locking is often said to offer a limited throughput per lock. Trust<T> is based on delegation, a message-passing technique which does not suffer this per-lock limitation. Instead, per-object throughput is limited by the capacity of the object's trustee, which is typically considerably higher. Our evaluation shows Trust<T> consistently and considerably outperforming locking where lock contention exists, with up to 22x higher throughput in microbenchmarks, and 5-9x for a home-grown key-value store, as well as memcached, in situations with high lock contention. Moreover, Trust<T> is competitive with locks even in the absence of lock contention.
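A minimal sketch of the delegation idea described in this abstract, assuming a queue-based trustee loop; the `Trust` class, `apply` method, and `bump` helper below are hypothetical names, not the paper's actual Trust<T> API (which is type-checked and likely far more efficient):

```python
# Delegation instead of locking: a single "trustee" thread owns the object;
# other threads submit closures through a queue rather than taking a lock.
import threading
import queue

class Trust:
    """Hypothetical stand-in for Trust<T>: serializes all access to `obj`."""
    def __init__(self, obj):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._trustee, args=(obj,), daemon=True)
        self._thread.start()

    def _trustee(self, obj):
        # The trustee applies every requested operation in arrival order,
        # so no other synchronization on `obj` is needed.
        while True:
            op, reply = self._requests.get()
            reply.put(op(obj))

    def apply(self, op):
        reply = queue.Queue(maxsize=1)
        self._requests.put((op, reply))
        return reply.get()  # block until the trustee has run `op`

# Usage: many threads update a shared counter without a lock.
counter = Trust({"n": 0})

def bump(d):
    d["n"] += 1
    return d["n"]

threads = [threading.Thread(target=lambda: counter.apply(bump)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.apply(lambda d: d["n"]))  # 8
```

Because only the trustee ever touches the object, per-object throughput is bounded by the trustee's capacity rather than by lock hand-offs, which is the effect the abstract describes.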
{"title":"Delegation with Trust: A Scalable, Type- and Memory-Safe Alternative to Locks","authors":"Noaman Ahmad, Ben Baenen, Chen Chen, Jakob Eriksson","doi":"arxiv-2408.11173","DOIUrl":"https://doi.org/arxiv-2408.11173","url":null,"abstract":"We present Trust<T>, a general, type- and memory-safe alternative to locking\u0000in concurrent programs. Instead of synchronizing multi-threaded access to an\u0000object of type T with a lock, the programmer may place the object in a\u0000Trust<T>. The object is then no longer directly accessible. Instead a\u0000designated thread, the object's trustee, is responsible for applying any\u0000requested operations to the object, as requested via the Trust<T> API. Locking\u0000is often said to offer a limited throughput per lock. Trust<T> is based on\u0000delegation, a message-passing technique which does not suffer this per-lock\u0000limitation. Instead, per-object throughput is limited by the capacity of the\u0000object's trustee, which is typically considerably higher. Our evaluation shows\u0000Trust<T> consistently and considerably outperforming locking where lock\u0000contention exists, with up to 22x higher throughput in microbenchmarks, and\u00005-9x for a home grown key-value store, as well as memcached, in situations with\u0000high lock contention. Moreover, Trust<T> is competitive with locks even in the\u0000absence of lock contention.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Closer Look at Data Augmentation Strategies for Finetuning-Based Low/Few-Shot Object Detection
Pub Date : 2024-08-20 DOI: arxiv-2408.10940
Vladislav Li, Georgios Tsoumplekas, Ilias Siniosoglou, Vasileios Argyriou, Anastasios Lytos, Eleftherios Fountoukidis, Panagiotis Sarigiannidis
Current methods for low- and few-shot object detection have primarily focused on enhancing model performance for detecting objects. One common approach to achieve this is by combining model finetuning with data augmentation strategies. However, little attention has been given to the energy efficiency of these approaches in data-scarce regimes. This paper seeks to conduct a comprehensive empirical study that examines both model performance and energy efficiency of custom data augmentations and automated data augmentation selection strategies when combined with a lightweight object detector. The methods are evaluated in three different benchmark datasets in terms of their performance and energy consumption, and the Efficiency Factor is employed to gain insights into their effectiveness considering both performance and efficiency. Consequently, it is shown that in many cases, the performance gains of data augmentation strategies are overshadowed by their increased energy usage, necessitating the development of more energy-efficient data augmentation strategies to address data scarcity.
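The abstract names an Efficiency Factor but does not define it; a plausible reading, shown below purely as an assumption, is a ratio of relative performance gain to relative energy increase:

```python
def efficiency_factor(map_aug, energy_aug, map_base, energy_base):
    """Hypothetical performance-per-energy score: values > 1 mean the
    augmentation's accuracy gain outweighs its extra energy cost.
    NOT the paper's definition, which the abstract does not spell out."""
    return (map_aug / map_base) / (energy_aug / energy_base)

# Example: +4% mAP for +30% energy -> EF < 1, i.e., not worthwhile by this metric.
print(efficiency_factor(0.52, 1.30, 0.50, 1.00))  # ~0.8
```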
{"title":"A Closer Look at Data Augmentation Strategies for Finetuning-Based Low/Few-Shot Object Detection","authors":"Vladislav Li, Georgios Tsoumplekas, Ilias Siniosoglou, Vasileios Argyriou, Anastasios Lytos, Eleftherios Fountoukidis, Panagiotis Sarigiannidis","doi":"arxiv-2408.10940","DOIUrl":"https://doi.org/arxiv-2408.10940","url":null,"abstract":"Current methods for low- and few-shot object detection have primarily focused\u0000on enhancing model performance for detecting objects. One common approach to\u0000achieve this is by combining model finetuning with data augmentation\u0000strategies. However, little attention has been given to the energy efficiency\u0000of these approaches in data-scarce regimes. This paper seeks to conduct a\u0000comprehensive empirical study that examines both model performance and energy\u0000efficiency of custom data augmentations and automated data augmentation\u0000selection strategies when combined with a lightweight object detector. The\u0000methods are evaluated in three different benchmark datasets in terms of their\u0000performance and energy consumption, and the Efficiency Factor is employed to\u0000gain insights into their effectiveness considering both performance and\u0000efficiency. Consequently, it is shown that in many cases, the performance gains\u0000of data augmentation strategies are overshadowed by their increased energy\u0000usage, necessitating the development of more energy efficient data augmentation\u0000strategies to address data scarcity.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Data-Driven Analysis to Understand GPU Hardware Resource Usage of Optimizations
Pub Date : 2024-08-19 DOI: arxiv-2408.10143
Tanzima Z. Islam, Aniruddha Marathe, Holland Schutte, Mohammad Zaeed
With heterogeneous systems, the number of GPUs per chip increases to provide computational capabilities for solving science at a nanoscopic scale. However, low utilization of single GPUs defies the need to invest more money in expensive accelerators. While related work develops optimizations for improving application performance, none studies how these optimizations impact hardware resource usage or the average GPU utilization. This paper takes a data-driven analysis approach in addressing this gap by (1) characterizing how hardware resource usage affects device utilization, execution time, or both, (2) presenting a multi-objective metric to identify important application-device interactions that can be optimized to improve device utilization and application performance jointly, (3) studying hardware resource usage behaviors of several optimizations for a benchmark application, and finally (4) identifying optimization opportunities for several scientific proxy applications based on their hardware resource usage behaviors. Furthermore, we demonstrate the applicability of our methodology by applying the identified optimizations to a proxy application, which improves the execution time, device utilization and power consumption by up to 29.6%, 5.3% and 26.5%, respectively.
{"title":"Data-Driven Analysis to Understand GPU Hardware Resource Usage of Optimizations","authors":"Tanzima Z. Islam, Aniruddha Marathe, Holland Schutte, Mohammad Zaeed","doi":"arxiv-2408.10143","DOIUrl":"https://doi.org/arxiv-2408.10143","url":null,"abstract":"With heterogeneous systems, the number of GPUs per chip increases to provide\u0000computational capabilities for solving science at a nanoscopic scale. However,\u0000low utilization for single GPUs defies the need to invest more money for\u0000expensive ccelerators. While related work develops optimizations for improving\u0000application performance, none studies how these optimizations impact hardware\u0000resource usage or the average GPU utilization. This paper takes a data-driven\u0000analysis approach in addressing this gap by (1) characterizing how hardware\u0000resource usage affects device utilization, execution time, or both, (2)\u0000presenting a multi-objective metric to identify important application-device\u0000interactions that can be optimized to improve device utilization and\u0000application performance jointly, (3) studying hardware resource usage behaviors\u0000of several optimizations for a benchmark application, and finally (4)\u0000identifying optimization opportunities for several scientific proxy\u0000applications based on their hardware resource usage behaviors. Furthermore, we\u0000demonstrate the applicability of our methodology by applying the identified\u0000optimizations to a proxy application, which improves the execution time, device\u0000utilization and power consumption by up to 29.6%, 5.3% and 26.5% respectively.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"53 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Scalable Systems and Software Architectures for High-Performance Computing on cloud platforms
Pub Date : 2024-08-18 DOI: arxiv-2408.10281
Risshab Srinivas Ramesh
High-performance computing (HPC) is essential for tackling complex computational problems across various domains. As the scale and complexity of HPC applications continue to grow, the need for scalable systems and software architectures becomes paramount. This paper provides a comprehensive overview of architecture for HPC on premise, focusing on both hardware and software aspects, and details the associated challenges in building the HPC cluster on premise. It explores design principles, challenges, and emerging trends in building scalable HPC systems and software, addressing issues such as parallelism, memory hierarchy, communication overhead, and fault tolerance on various cloud platforms. By synthesizing research findings and technological advancements, this paper aims to provide insights into scalable solutions for meeting the evolving demands of HPC applications on cloud.
{"title":"Scalable Systems and Software Architectures for High-Performance Computing on cloud platforms","authors":"Risshab Srinivas Ramesh","doi":"arxiv-2408.10281","DOIUrl":"https://doi.org/arxiv-2408.10281","url":null,"abstract":"High-performance computing (HPC) is essential for tackling complex\u0000computational problems across various domains. As the scale and complexity of\u0000HPC applications continue to grow, the need for scalable systems and software\u0000architectures becomes paramount. This paper provides a comprehensive overview\u0000of architecture for HPC on premise focusing on both hardware and software\u0000aspects and details the associated challenges in building the HPC cluster on\u0000premise. It explores design principles, challenges, and emerging trends in\u0000building scalable HPC systems and software, addressing issues such as\u0000parallelism, memory hierarchy, communication overhead, and fault tolerance on\u0000various cloud platforms. By synthesizing research findings and technological\u0000advancements, this paper aims to provide insights into scalable solutions for\u0000meeting the evolving demands of HPC applications on cloud.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"51 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Inspection of I/O Operations from System Call Traces using Directly-Follows-Graph
Pub Date : 2024-08-14 DOI: arxiv-2408.07378
Aravind Sankaran, Ilya Zhukov, Wolfgang Frings, Paolo Bientinesi
We aim to identify the differences in Input/Output (I/O) behavior between multiple user programs in terms of contentions for system resources by inspecting the I/O requests made to the operating system. A typical program issues a large number of I/O requests to the operating system, thereby making the process of inspection challenging. In this paper, we address this challenge by presenting a methodology to synthesize I/O system call traces into a specific type of directed graph, known as the Directly-Follows-Graph (DFG). Based on the DFG, we present a technique to compare the traces from multiple programs or different configurations of the same program, such that it is possible to identify the differences in the I/O requests made to the operating system. We apply our methodology to the IOR benchmark, and compare the contentions for file accesses when the benchmark is run with different options for file output and software interface.
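The core construction is compact: treat the trace as an ordered sequence of system calls and count which call directly follows which. The sketch below is a generic directly-follows counter over a made-up trace; the paper's pipeline (trace capture, filtering, graph attributes) is more involved:

```python
# Turn an ordered I/O system-call trace into a Directly-Follows-Graph:
# nodes are syscalls, and edge (a, b) is weighted by how often b
# directly follows a in the trace.
from collections import Counter

trace = ["openat", "read", "read", "write", "read", "write", "close"]

dfg = Counter(zip(trace, trace[1:]))  # consecutive pairs with counts

for (src, dst), count in sorted(dfg.items()):
    print(f"{src} -> {dst}: {count}")
# Comparing two programs then reduces to diffing their edge sets and weights.
```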
{"title":"Inspection of I/O Operations from System Call Traces using Directly-Follows-Graph","authors":"Aravind Sankaran, Ilya Zhukov, Wolfgang Frings, Paolo Bientinesi","doi":"arxiv-2408.07378","DOIUrl":"https://doi.org/arxiv-2408.07378","url":null,"abstract":"We aim to identify the differences in Input/Output(I/O) behavior between\u0000multiple user programs in terms of contentions for system resources by\u0000inspecting the I/O requests made to the operating system. A typical program\u0000issues a large number of I/O requests to the operating system, thereby making\u0000the process of inspection challenging. In this paper, we address this challenge\u0000by presenting a methodology to synthesize I/O system call traces into a\u0000specific type of directed graph, known as the Directly-Follows-Graph (DFG).\u0000Based on the DFG, we present a technique to compare the traces from multiple\u0000programs or different configurations of the same program, such that it is\u0000possible to identify the differences in the I/O requests made to the operating\u0000system. We apply our methodology to the IOR benchmark, and compare the\u0000contentions for file accesses when the benchmark is run with different options\u0000for file output and software interface.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"320 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Architecture Specific Generation of Large Scale Lattice Boltzmann Methods for Sparse Complex Geometries
Pub Date : 2024-08-13 DOI: arxiv-2408.06880
Philipp Suffa, Markus Holzer, Harald Köstler, Ulrich Rüde
We implement and analyse a sparse / indirect-addressing data structure for the Lattice Boltzmann Method to support efficient compute kernels for fluid dynamics problems with a high number of non-fluid nodes in the domain, such as in porous media flows. The data structure is integrated into a code generation pipeline to enable sparse Lattice Boltzmann Methods with a variety of stencils and collision operators and to generate efficient code for kernels for CPU as well as for AMD and NVIDIA accelerator cards. We optimize these sparse kernels with an in-place streaming pattern to save memory accesses and memory consumption, and we implement a communication hiding technique to prove scalability. We present single-GPU performance results with up to 99% of maximal bandwidth utilization. We integrate the optimized generated kernels in the high performance framework WALBERLA and achieve a scaling efficiency of at least 82% on up to 1024 NVIDIA A100 GPUs and up to 4096 AMD MI250X GPUs on modern HPC systems. Further, we set up three different applications to test the sparse data structure for realistic demonstrator problems. We show performance results for flow through porous media, free flow over a particle bed, and blood flow in a coronary artery. We achieve a maximal performance speed-up of 2 and a significantly reduced memory consumption by up to 75% with the sparse / indirect-addressing data structure compared to the direct-addressing data structure for these applications.
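The central data-structure idea, storing only fluid cells and resolving neighbors through an explicit index list, can be illustrated in one dimension. The NumPy sketch below is purely illustrative (a single lattice direction, crude boundary fallback), not the paper's generated kernels:

```python
import numpy as np

# Domain mask: True = fluid, False = solid. Only fluid cells get storage.
mask = np.array([True, True, False, True, True, True, False, True])
fluid_ids = np.flatnonzero(mask)                   # dense cell id per sparse slot
dense_to_sparse = {int(d): s for s, d in enumerate(fluid_ids)}

# Indirect-addressing neighbor list for one lattice direction (+1): each
# fluid cell records the sparse index it pulls from; solid or out-of-bounds
# neighbors fall back to the cell itself (a crude stand-in for bounce-back).
neigh = np.empty(len(fluid_ids), dtype=np.int64)
for s, d in enumerate(fluid_ids):
    upstream = int(d) - 1
    if upstream >= 0 and mask[upstream]:
        neigh[s] = dense_to_sparse[upstream]
    else:
        neigh[s] = s

f = np.arange(len(fluid_ids), dtype=float)  # one population value per fluid cell
f_streamed = f[neigh]                       # pull streaming = a single gather
print(f_streamed)                           # [0. 0. 2. 2. 3. 5.]
```

Solid cells consume no storage, and streaming becomes a single gather through `neigh`, which is where the reported memory savings on sparse geometries come from.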
{"title":"Architecture Specific Generation of Large Scale Lattice Boltzmann Methods for Sparse Complex Geometries","authors":"Philipp Suffa, Markus Holzer, Harald Köstler, Ulrich Rüde","doi":"arxiv-2408.06880","DOIUrl":"https://doi.org/arxiv-2408.06880","url":null,"abstract":"We implement and analyse a sparse / indirect-addressing data structure for\u0000the Lattice Boltzmann Method to support efficient compute kernels for fluid\u0000dynamics problems with a high number of non-fluid nodes in the domain, such as\u0000in porous media flows. The data structure is integrated into a code generation\u0000pipeline to enable sparse Lattice Boltzmann Methods with a variety of stencils\u0000and collision operators and to generate efficient code for kernels for CPU as\u0000well as for AMD and NVIDIA accelerator cards. We optimize these sparse kernels\u0000with an in-place streaming pattern to save memory accesses and memory\u0000consumption and we implement a communication hiding technique to prove\u0000scalability. We present single GPU performance results with up to 99% of\u0000maximal bandwidth utilization. We integrate the optimized generated kernels in\u0000the high performance framework WALBERLA and achieve a scaling efficiency of at\u0000least 82% on up to 1024 NVIDIA A100 GPUs and up to 4096 AMD MI250X GPUs on\u0000modern HPC systems. Further, we set up three different applications to test the\u0000sparse data structure for realistic demonstrator problems. We show performance\u0000results for flow through porous media, free flow over a particle bed, and blood\u0000flow in a coronary artery. We achieve a maximal performance speed-up of 2 and a\u0000significantly reduced memory consumption by up to 75% with the sparse /\u0000indirect-addressing data structure compared to the direct-addressing data\u0000structure for these applications.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"176 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Understanding Power Consumption Metric on Heterogeneous Memory Systems
Pub Date : 2024-08-13 DOI: arxiv-2408.06579
Andrès Rubio Proaño, Kento Sato
Contemporary memory systems contain a variety of memory types, each possessing distinct characteristics. This trend empowers applications to opt for memory types aligning with the developer's desired behavior. As a result, developers gain flexibility to tailor their applications to specific needs, factoring in attributes like latency, bandwidth, and power consumption. Our research centers on the aspect of power consumption within memory systems. We introduce an approach that equips developers with comprehensive insights into the power consumption of individual memory types. Additionally, we propose an ordered hierarchy of memory types. Through this methodology, developers can make informed decisions for efficient memory usage aligned with their unique requirements.
{"title":"Understanding Power Consumption Metric on Heterogeneous Memory Systems","authors":"Andrès Rubio Proaño, Kento Sato","doi":"arxiv-2408.06579","DOIUrl":"https://doi.org/arxiv-2408.06579","url":null,"abstract":"Contemporary memory systems contain a variety of memory types, each\u0000possessing distinct characteristics. This trend empowers applications to opt\u0000for memory types aligning with developer's desired behavior. As a result,\u0000developers gain flexibility to tailor their applications to specific needs,\u0000factoring in attributes like latency, bandwidth, and power consumption. Our\u0000research centers on the aspect of power consumption within memory systems. We\u0000introduce an approach that equips developers with comprehensive insights into\u0000the power consumption of individual memory types. Additionally, we propose an\u0000ordered hierarchy of memory types. Through this methodology, developers can\u0000make informed decisions for efficient memory usage aligned with their unique\u0000requirements.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Automated PMC-based Power Modeling Methodology for Modern Mobile GPUs
Pub Date : 2024-08-09 DOI: arxiv-2408.04886
Pranab Dash (Purdue University), Y. Charlie Hu (Purdue University), Abhilash Jindal (IIT Delhi)
The rise of machine learning workload on smartphones has propelled GPUs into one of the most power-hungry components of modern smartphones and elevates the need for optimizing the GPU power draw by mobile apps. Optimizing the power consumption of mobile GPUs in turn requires accurate estimation of their power draw during app execution. In this paper, we observe that the prior-art, utilization-frequency based GPU models cannot capture the diverse micro-architectural usage of modern mobile GPUs. We show that these models suffer poor modeling accuracy under diverse GPU workload, and study whether performance monitoring counter (PMC)-based models recently proposed for desktop/server GPUs can be applied to accurately model mobile GPU power. Our study shows that the PMCs that come with dominating mobile GPUs used in modern smartphones are sufficient to model mobile GPU power, but exhibit multicollinearity if used altogether. We present APGPM, the mobile GPU power modeling methodology that automatically selects an optimal set of PMCs that maximizes the GPU power model accuracy. Evaluation on two representative mobile GPUs shows that APGPM-generated GPU power models reduce the MAPE modeling error of prior-art by 1.95x to 2.66x (i.e., by 11.3% to 15.4%) while using only 4.66% to 20.41% of the total number of available PMCs.
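A generic way to picture the selection problem: greedily add the counter that most reduces a linear power model's error, which implicitly penalizes collinear counters that contribute no new information. This forward-selection baseline on synthetic data is only an assumption about the flavor of the approach, not APGPM's actual algorithm:

```python
import numpy as np

def forward_select(pmcs, power, k):
    """pmcs: (samples, counters) matrix; power: measured watts; k: budget."""
    chosen, remaining = [], list(range(pmcs.shape[1]))
    for _ in range(k):
        def fit_err(cols):
            # Linear power model with intercept, scored by MAPE.
            X = np.column_stack([np.ones(len(power)), pmcs[:, cols]])
            beta, *_ = np.linalg.lstsq(X, power, rcond=None)
            return np.mean(np.abs((X @ beta - power) / power))
        best = min(remaining, key=lambda c: fit_err(chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic demo: power depends on counters 1 and 7 plus noise.
rng = np.random.default_rng(0)
X = rng.random((200, 12))            # 12 synthetic PMC readings per sample
watts = 2.0 + 3.0 * X[:, 1] + 1.5 * X[:, 7] + 0.05 * rng.standard_normal(200)
print(forward_select(X, watts, k=2))  # expect counters 1 and 7 to surface
```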
{"title":"Automated PMC-based Power Modeling Methodology for Modern Mobile GPUs","authors":"Pranab DashPurdue University, Y. Charlie HuPurdue University, Abhilash JindalIIT Delhi","doi":"arxiv-2408.04886","DOIUrl":"https://doi.org/arxiv-2408.04886","url":null,"abstract":"The rise of machine learning workload on smartphones has propelled GPUs into\u0000one of the most power-hungry components of modern smartphones and elevates the\u0000need for optimizing the GPU power draw by mobile apps. Optimizing the power\u0000consumption of mobile GPUs in turn requires accurate estimation of their power\u0000draw during app execution. In this paper, we observe that the prior-art,\u0000utilization-frequency based GPU models cannot capture the diverse\u0000micro-architectural usage of modern mobile GPUs.We show that these models\u0000suffer poor modeling accuracy under diverse GPU workload, and study whether\u0000performance monitoring counter (PMC)-based models recently proposed for\u0000desktop/server GPUs can be applied to accurately model mobile GPU power. Our\u0000study shows that the PMCs that come with dominating mobile GPUs used in modern\u0000smartphones are sufficient to model mobile GPU power, but exhibit\u0000multicollinearity if used altogether. We present APGPM, the mobile GPU power\u0000modeling methodology that automatically selects an optimal set of PMCs that\u0000maximizes the GPU power model accuracy. Evaluation on two representative mobile\u0000GPUs shows that APGPM-generated GPU power models reduce the MAPE modeling error\u0000of prior-art by 1.95x to 2.66x (i.e., by 11.3% to 15.4%) while using only 4.66%\u0000to 20.41% of the total number of available PMCs.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"98 5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Columbo: Low Level End-to-End System Traces through Modular Full-System Simulation
Pub Date : 2024-08-08 DOI: arxiv-2408.05251
Jakob Görgen, Vaastav Anand, Hejing Li, Jialin Li, Antoine Kaufmann
Fully understanding performance is a growing challenge when building next-generation cloud systems. Often these systems build on next-generation hardware, and evaluation in realistic physical testbeds is out of reach. Even when physical testbeds are available, visibility into essential system aspects is a challenge in modern systems where system performance depends on often sub-$\mu$s interactions between HW and SW components. Existing tools such as performance counters, logging, and distributed tracing provide aggregate or sampled information, but remain insufficient for understanding individual requests in-depth. In this paper, we explore a fundamentally different approach to enable in-depth understanding of cloud system behavior at the software and hardware level, with (almost) arbitrarily fine-grained visibility. Our proposal is to run cloud systems in detailed full-system simulations, configure the simulators to collect detailed events without affecting the system, and finally assemble these events into end-to-end system traces that can be analyzed by existing distributed tracing tools.
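The final assembly step the abstract describes can be pictured as merging timestamped events from several component simulators into per-request spans. Event and field names below are invented for illustration; Columbo's actual formats and tooling are not shown in the abstract:

```python
# Merge events emitted by multiple simulators (NIC, host, ...) into
# end-to-end per-request traces ordered by global simulated time.
from collections import defaultdict

events = [
    {"req": "r1", "ts": 10, "sim": "nic",  "what": "rx"},
    {"req": "r1", "ts": 12, "sim": "host", "what": "syscall_enter"},
    {"req": "r2", "ts": 13, "sim": "nic",  "what": "rx"},
    {"req": "r1", "ts": 25, "sim": "host", "what": "syscall_exit"},
    {"req": "r2", "ts": 30, "sim": "host", "what": "syscall_enter"},
]

spans = defaultdict(list)
for ev in sorted(events, key=lambda e: e["ts"]):  # global time order
    spans[ev["req"]].append(ev)

for req, evs in spans.items():
    latency = evs[-1]["ts"] - evs[0]["ts"]
    print(req, [e["what"] for e in evs], f"end-to-end: {latency} time units")
```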
{"title":"Columbo: Low Level End-to-End System Traces through Modular Full-System Simulation","authors":"Jakob Görgen, Vaastav Anand, Hejing Li, Jialin Li, Antoine Kaufmann","doi":"arxiv-2408.05251","DOIUrl":"https://doi.org/arxiv-2408.05251","url":null,"abstract":"Fully understanding performance is a growing challenge when building\u0000next-generation cloud systems. Often these systems build on next-generation\u0000hardware, and evaluation in realistic physical testbeds is out of reach. Even\u0000when physical testbeds are available, visibility into essential system aspects\u0000is a challenge in modern systems where system performance depends on often\u0000sub-$mu s$ interactions between HW and SW components. Existing tools such as\u0000performance counters, logging, and distributed tracing provide aggregate or\u0000sampled information, but remain insufficient for understanding individual\u0000requests in-depth. In this paper, we explore a fundamentally different approach\u0000to enable in-depth understanding of cloud system behavior at the software and\u0000hardware level, with (almost) arbitrarily fine-grained visibility. Our proposal\u0000is to run cloud systems in detailed full-system simulations, configure the\u0000simulators to collect detailed events without affecting the system, and finally\u0000assemble these events into end-to-end system traces that can be analyzed by\u0000existing distributed tracing tools.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Evaluation of Hash Algorithm Performance for Cryptocurrency Exchanges Based on Blockchain System
Pub Date : 2024-08-08 DOI: arxiv-2408.11950
Abel C. H. Chen
The blockchain system has emerged as one of the focal points of research in recent years, particularly in applications and services such as cryptocurrencies and smart contracts. In this context, the hash value serves as a crucial element in linking blocks within the blockchain, ensuring the integrity of block contents. Therefore, hash algorithms represent a vital security technology for ensuring the integrity and security of blockchain systems. This study primarily focuses on analyzing the security and execution efficiency of mainstream hash algorithms in the Proof of Work (PoW) calculations within blockchain systems. It proposes an evaluation factor and conducts comparative experiments to evaluate each hash algorithm. The experimental results indicate that there are no significant differences in the security aspects among SHA-2, SHA-3, and BLAKE2. However, SHA-2 and BLAKE2 demonstrate shorter computation times, indicating higher efficiency in execution.
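The execution-efficiency comparison is easy to reproduce in spirit with Python's standard `hashlib`, which exposes all three families. Absolute rates depend on the platform and on hashlib's backing implementation, so treat this as a sanity check rather than a reproduction of the paper's experiments:

```python
# Proof-of-work-style throughput comparison of SHA-2, SHA-3, and BLAKE2.
import hashlib
import time

def pow_rate(name, seconds=1.0):
    h = getattr(hashlib, name)
    nonce, deadline = 0, time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        # Hash a fixed header plus an incrementing nonce, as in PoW mining.
        h(b"block-header" + nonce.to_bytes(8, "little")).digest()
        nonce += 1
    return nonce / seconds

for algo in ("sha256", "sha3_256", "blake2b"):
    print(f"{algo}: {pow_rate(algo):,.0f} hashes/s")
```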
{"title":"Evaluation of Hash Algorithm Performance for Cryptocurrency Exchanges Based on Blockchain System","authors":"Abel C. H. Chen","doi":"arxiv-2408.11950","DOIUrl":"https://doi.org/arxiv-2408.11950","url":null,"abstract":"The blockchain system has emerged as one of the focal points of research in\u0000recent years, particularly in applications and services such as\u0000cryptocurrencies and smart contracts. In this context, the hash value serves as\u0000a crucial element in linking blocks within the blockchain, ensuring the\u0000integrity of block contents. Therefore, hash algorithms represent a vital\u0000security technology for ensuring the integrity and security of blockchain\u0000systems. This study primarily focuses on analyzing the security and execution\u0000efficiency of mainstream hash algorithms in the Proof of Work (PoW)\u0000calculations within blockchain systems. It proposes an evaluation factor and\u0000conducts comparative experiments to evaluate each hash algorithm. The\u0000experimental results indicate that there are no significant differences in the\u0000security aspects among SHA-2, SHA-3, and BLAKE2. However, SHA-2 and BLAKE2\u0000demonstrate shorter computation times, indicating higher efficiency in\u0000execution.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"88 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0