
Latest publications from Proc. ACM Meas. Anal. Comput. Syst.

Scalability Limitations of Processing-in-Memory using Real System Evaluations
Pub Date: 2024-02-16 DOI: 10.1145/3639046
Gilbert Jonatan, Haeyoon Cho, Hyojun Son, Xiangyu Wu, Neal Livesay, Evelio Mora, Kaustubh Shivdikar, José L. Abellán, Ajay Joshi, David Kaeli, John Kim
Processing-in-memory (PIM), where the compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM "nodes" or the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while there are benefits of PIM, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative common collective communication patterns -- AllReduce and All-to-All communication. Specifically, we evaluate 1) embedding tables that are commonly used in recommendation systems that require AllReduce and 2) the Number Theoretic Transform (NTT) kernel which is a critical component of Fully Homomorphic Encryption (FHE) that requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, when communication is necessary between PIM nodes (or remote data is needed), communication between the compute units must be done through the host CPU, thereby severely hampering application performance. To increase the scalability (or applicability) of PIM to future workloads, we make the case for how future PIM architectures need efficient communication or interconnection networks between the PIM nodes that require both hardware and software support.
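The host-mediated communication bottleneck described above can be made concrete with a small sketch. The toy model below (plain Python; the buffers and function stand in for PIM nodes and are not the UPMEM SDK API) shows why AllReduce is costly when PIM nodes cannot exchange data directly: every element crosses the CPU-PIM boundary twice, once per direction.

```python
# Toy model of AllReduce on a PIM system whose nodes cannot communicate
# directly: the host gathers every node's partial result, reduces, and
# scatters the sum back. Names are illustrative, not the UPMEM SDK.

def host_mediated_allreduce(node_buffers):
    """node_buffers: one list of partial sums per PIM node."""
    # Step 1: host gathers all partial results (PIM -> host transfers).
    gathered = [buf[:] for buf in node_buffers]
    # Step 2: the host itself performs the reduction.
    reduced = [sum(vals) for vals in zip(*gathered)]
    # Step 3: host scatters the result back (host -> PIM transfers).
    for buf in node_buffers:
        buf[:] = reduced
    # Host-link traffic grows linearly with node count:
    # 2 * num_nodes * vector_length elements cross the CPU-PIM boundary.
    return reduced

nodes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # 3 PIM nodes, partial vectors
print(host_mediated_allreduce(nodes))        # [12, 15, 18] on every node
```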
{"title":"Scalability Limitations of Processing-in-Memory using Real System Evaluations","authors":"Gilbert Jonatan, Haeyoon Cho, Hyojun Son, Xiangyu Wu, Neal Livesay, Evelio Mora, Kaustubh Shivdikar, José L. Abellán, Ajay Joshi, David Kaeli, John Kim","doi":"10.1145/3639046","DOIUrl":"https://doi.org/10.1145/3639046","url":null,"abstract":"Processing-in-memory (PIM), where the compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM \"nodes'' or the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while there are benefits of PIM, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative common collective communication patterns -- AllReduce and All-to-All communication. Specifically, we evaluate 1) embedding tables that are commonly used in recommendation systems that require AllReduce and 2) the Number Theoretic Transform (NTT) kernel which is a critical component of Fully Homomorphic Encryption (FHE) that requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, when communication is necessary between PIM nodes (or remote data is needed), communication between the compute units must be done through the host CPU, thereby severely hampering application performance. To increase the scalability (or applicability) of PIM to future workloads, we make the case for how future PIM architectures need efficient communication or interconnection networks between the PIM nodes that require both hardware and software support.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"292 2","pages":"5:1-5:28"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140453838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
SCADA World: An Exploration of the Diversity in Power Grid Networks
Pub Date: 2024-02-16 DOI: 10.1145/3639036
Neil Ortiz Silva, Alvaro A. Cárdenas, A. Wool
Despite a growing interest in understanding the industrial control networks that monitor and control our critical infrastructures (such as the power grid), to date, SCADA networks have been analyzed in isolation from each other. They have been treated as monolithic networks without taking into consideration their differences. In this paper, we analyze real-world data from different parts of a power grid (generation, transmission, distribution, and end-consumer) and show that these industrial networks exhibit a variety of unique behaviors and configurations that have not been documented before. To the best of our knowledge, our study is the first to tackle the analysis of power grid networks at this level. Our results help us dispel several misconceptions proposed by previous work, and we also provide new insights into the differences and types of SCADA networks.
{"title":"SCADA World: An Exploration of the Diversity in Power Grid Networks","authors":"Neil Ortiz Silva, Alvaro A. Cárdenas, A. Wool","doi":"10.1145/3639036","DOIUrl":"https://doi.org/10.1145/3639036","url":null,"abstract":"Despite a growing interest in understanding the industrial control networks that monitor and control our critical infrastructures (such as the power grid), to date, SCADA networks have been analyzed in isolation from each other. They have been treated as monolithic networks without taking into consideration their differences. In this paper, we analyze real-world data from different parts of a power grid (generation, transmission, distribution, and end-consumer) and show that these industrial networks exhibit a variety of unique behaviors and configurations that have not been documented before. To the best of our knowledge, our study is the first to tackle the analysis of power grid networks at this level. Our results help us dispel several misconceptions proposed by previous work, and we also provide new insights into the differences and types of SCADA networks.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"444 1","pages":"10:1-10:32"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140453926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Deep Dive into NTP Pool's Popularity and Mapping
Pub Date: 2024-02-16 DOI: 10.1145/3639041
G. Moura, Marco Davids, C. Schutijser, Cristian Hesselman, John Heidemann, Georgios Smaragdakis
Time synchronization is of paramount importance on the Internet, with the Network Time Protocol (NTP) serving as the primary synchronization protocol. The NTP Pool, a volunteer-driven initiative launched two decades ago, facilitates connections between clients and NTP servers. Our analysis of root DNS queries reveals that the NTP Pool has consistently been the most popular time service. We further investigate the DNS component (GeoDNS) of the NTP Pool, which is responsible for mapping clients to servers. Our findings indicate that the current algorithm is heavily skewed, leading to the emergence of time monopolies for entire countries. For instance, clients in the US are served by 551 NTP servers, while clients in Cameroon and Nigeria are served by only one and two servers, respectively, out of the 4k+ servers available in the NTP Pool. We examine the underlying assumption behind GeoDNS for these mappings and discover that time servers located far away can still provide accurate clock time information to clients. We have shared our findings with the NTP Pool operators, who acknowledge them and plan to revise their algorithm to enhance security.
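The skew reported above can be illustrated with a toy country-zone mapping; the server counts are taken from the abstract, while the random pick is a guessed stand-in for GeoDNS's actual selection logic.

```python
# Toy illustration of GeoDNS skew: clients are mapped to servers in their
# own country zone, so a large zone offers hundreds of candidates while a
# small zone collapses onto one or two. Server names are hypothetical.

import random

servers_by_country = {
    "US": [f"us-ntp-{i}" for i in range(551)],  # 551 servers (abstract)
    "CM": ["cm-ntp-0"],                         # Cameroon: 1 server
    "NG": ["ng-ntp-0", "ng-ntp-1"],             # Nigeria: 2 servers
}

def resolve(country):
    """Return one server a client in `country` could be mapped to."""
    pool = servers_by_country.get(country, [])
    return random.choice(pool) if pool else None

for cc in ("US", "CM", "NG"):
    print(cc, "candidates:", len(servers_by_country[cc]),
          "-> example pick:", resolve(cc))
```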
{"title":"Deep Dive into NTP Pool's Popularity and Mapping","authors":"G. Moura, Marco Davids, C. Schutijser, Cristian Hesselman, John Heidemann, Georgios Smaragdakis","doi":"10.1145/3639041","DOIUrl":"https://doi.org/10.1145/3639041","url":null,"abstract":"Time synchronization is of paramount importance on the Internet, with the Network Time Protocol (NTP) serving as the primary synchronization protocol. The NTP Pool, a volunteer-driven initiative launched two decades ago, facilitates connections between clients and NTP servers. Our analysis of root DNS queries reveals that the NTP Pool has consistently been the most popular time service. We further investigate the DNS component (GeoDNS) of the NTP Pool, which is responsible for mapping clients to servers. Our findings indicate that the current algorithm is heavily skewed, leading to the emergence of time monopolies for entire countries. For instance, clients in the US are served by 551 NTP servers, while clients in Cameroon and Nigeria are served by only one and two servers, respectively, out of the 4k+ servers available in the NTP Pool. We examine the underlying assumption behind GeoDNS for these mappings and discover that time servers located far away can still provide accurate clock time information to clients. We have shared our findings with the NTP Pool operators, who acknowledge them and plan to revise their algorithm to enhance security.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"193 1","pages":"15:1-15:30"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140454257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
POMACS V8, N1, March 2024 Editorial
Pub Date: 2024-02-16 DOI: 10.1145/3639027
F. Ciucu, Giulia Fanti, Rhonda Righter
The Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) focuses on the measurement and performance evaluation of computer systems and operates in close collaboration with the ACM Special Interest Group SIGMETRICS. All papers in this issue of POMACS will be presented at the ACM SIGMETRICS/Performance 2024 conference on June 10-14, 2024, in Venice, Italy. These papers have been selected during the Fall submission round by the 93 members of the ACM SIGMETRICS/Performance 2024 program committee via a rigorous review process. Each paper was conditionally accepted (and shepherded), allowed a "one-shot" revision (to be resubmitted to one of the subsequent two SIGMETRICS/Performance deadlines), or rejected (with re-submission allowed after a year). For this issue, which represents the Fall deadline, POMACS is publishing 18 papers out of 118 submissions, of which 6 had previously received a one-shot revision decision. All submissions received at least 3 reviews and borderline cases were extensively discussed during the online program committee meeting. Based on the indicated track(s), roughly 33% of the submissions were in the Theory track, 47% were in the Measurement & Applied Modeling track, 39% were in the Systems track, and 19% were in the Learning track (papers could be part of more than one track). Many individuals contributed to the success of this issue of POMACS. First, we would like to thank the authors, who submitted their best work to SIGMETRICS/Performance/POMACS. Second, we would like to thank the program committee members who provided constructive feedback in their reviews to authors and participated in the online discussions and program committee meeting. We also thank the several external reviewers who provided their expert opinions on specific submissions that required additional input. We are also grateful to the SIGMETRICS Board Chair, Mor Harchol-Balter, the IFIP Working Group 7.3 Chair, Mark S. Squillante, the previous SIGMETRICS Board Chair, Giuliano Casale, and the past program committee Chairs, Konstantin Avratchenkov, Phillipa Gill, and Bhuvan Urgaonkar, who provided a wealth of information and guidance. Finally, we are grateful to the Organization Committee and to the SIGMETRICS Board for their ongoing efforts and initiatives for creating an exciting program for ACM SIGMETRICS/Performance 2024.
{"title":"POMACS V8, N1, March 2024 Editorial","authors":"F. Ciucu, Giulia Fanti, Rhonda Righter","doi":"10.1145/3639027","DOIUrl":"https://doi.org/10.1145/3639027","url":null,"abstract":"The Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) focuses on the measurement and performance evaluation of computer systems and operates in close collaboration with the ACM Special Interest Group SIGMETRICS. All papers in this issue of POMACS will be presented at the ACM SIGMETRICS/Performance 2024 conference on June 10-14, 2024, in Venice, Italy. These papers have been selected during the Fall submission round by the 93 members of the ACM SIGMETRICS/Performance 2024 program committee via a rigorous review process. Each paper was conditionally accepted (and shepherded), allowed a \"one-shot\" revision (to be resubmitted to one of the subsequent two SIGMETRICS/Performance deadlines), or rejected (with re-submission allowed after a year). For this issue, which represents the Fall deadline, POMACS is publishing 18 papers out of 118 submissions, of which 6 had previously received a one-shot revision decision. All submissions received at least 3 reviews and borderline cases were extensively discussed during the online program committee meeting. Based on the indicated track(s), roughly 33% of the submissions were in the Theory track, 47% were in the Measurement & Applied Modeling track, 39% were in the Systems track, and 19% were in the Learning track (papers could be part of more than one track). Many individuals contributed to the success of this issue of POMACS. First, we would like to thank the authors, who submitted their best work to SIGMETRICS/Performance/POMACS. Second, we would like to thank the program committee members who provided constructive feedback in their reviews to authors and participated in the online discussions and program committee meeting. We also thank the several external reviewers who provided their expert opinions on specific submissions that required additional input. We are also grateful to the SIGMETRICS Board Chair, Mor Harchol-Balter, the IFIP Working Group 7.3 Chair, Mark S. Squillante, the previous SIGMETRICS Board Chair, Giuliano Casale, and the past program committee Chairs, Konstantin Avratchenkov, Phillipa Gill, and Bhuvan Urgaonkar, who provided a wealth of information and guidance. Finally, we are grateful to the Organization Committee and to the SIGMETRICS Board for their ongoing efforts and initiatives for creating an exciting program for ACM SIGMETRICS/Performance 2024.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"93 6","pages":"1:1"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140454897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fair Resource Allocation in Virtualized O-RAN Platforms
Pub Date: 2024-02-16 DOI: 10.48550/arXiv.2402.11285
Fatih Aslan, G. Iosifidis, J. Ayala-Romero, Andres Garcia-Saavedra, Xavier Pérez Costa
O-RAN systems and their deployment in virtualized general-purpose computing platforms (O-Cloud) constitute a paradigm shift expected to bring unprecedented performance gains. However, these architectures raise new implementation challenges and threaten to worsen the already-high energy consumption of mobile networks. This paper first presents a series of experiments that assess the O-Cloud's energy costs and their dependency on server hardware, capacity, and data-traffic properties, which typically change over time. Next, it proposes a compute policy for assigning base station data loads to O-Cloud servers in an energy-efficient fashion, and a radio policy that determines in near-real-time the minimum transmission block size for each user so as to avoid unnecessary energy costs. The policies balance energy savings with performance, and ensure that both of them are dispersed fairly across the servers and users, respectively. To cater for the unknown and time-varying parameters affecting the policies, we develop a novel online learning framework with fairness guarantees that apply to the entire operation horizon of the system (long-term fairness). The policies are evaluated using trace-driven simulations and are fully implemented in an O-RAN compatible system where we measure the energy costs and throughput in realistic scenarios.
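As a rough sketch of the kind of compute policy described above, the toy below greedily places base-station loads on the O-Cloud server with the lowest marginal energy cost. The linear power model and the greedy rule are assumptions made for illustration, not the paper's actual policy.

```python
# Greedy energy-aware load placement under a linear power model:
# power(server) = idle_w + w_per_load * load. Both the model and the
# numbers are invented for this sketch.

servers = [
    {"name": "s1", "idle_w": 120.0, "w_per_load": 2.0, "cap": 50, "load": 0},
    {"name": "s2", "idle_w": 80.0,  "w_per_load": 3.5, "cap": 40, "load": 0},
]

def assign(loads):
    """Place each load on the feasible server with the lowest marginal cost."""
    for load in loads:
        candidates = [s for s in servers if s["load"] + load <= s["cap"]]
        if not candidates:
            raise RuntimeError("no server has spare capacity")
        best = min(candidates, key=lambda s: s["w_per_load"])
        best["load"] += load

assign([10, 5, 20, 8])
for s in servers:
    power = s["idle_w"] + s["w_per_load"] * s["load"]
    print(s["name"], "load:", s["load"], "power (W):", power)
```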
{"title":"Fair Resource Allocation in Virtualized O-RAN Platforms","authors":"Fatih Aslan, G. Iosifidis, J. Ayala-Romero, Andres Garcia-Saavedra, Xavier Pérez Costa","doi":"10.48550/arXiv.2402.11285","DOIUrl":"https://doi.org/10.48550/arXiv.2402.11285","url":null,"abstract":"O-RAN systems and their deployment in virtualized general-purpose computing platforms (O-Cloud) constitute a paradigm shift expected to bring unprecedented performance gains. However, these architectures raise new implementation challenges and threaten to worsen the already-high energy consumption of mobile networks. This paper presents first a series of experiments which assess the O-Cloud's energy costs and their dependency on the servers' hardware, capacity and data traffic properties which, typically, change over time. Next, it proposes a compute policy for assigning the base station data loads to O-Cloud servers in an energy-efficient fashion; and a radio policy that determines at near-real-time the minimum transmission block size for each user so as to avoid unnecessary energy costs. The policies balance energy savings with performance, and ensure that both of them are dispersed fairly across the servers and users, respectively. To cater for the unknown and time-varying parameters affecting the policies, we develop a novel online learning framework with fairness guarantees that apply to the entire operation horizon of the system (long-term fairness). The policies are evaluated using trace-driven simulations and are fully implemented in an O-RAN compatible system where we measure the energy costs and throughput in realistic scenarios.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"70 7","pages":"17:1-17:34"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140455016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Who's Got My Back? Measuring the Adoption of an Internet-wide BGP RTBH Service
Pub Date: 2024-02-16 DOI: 10.1145/3639029
Radu Anghel, Yury Zhauniarovich, C. Gañán
Distributed Denial-of-Service (DDoS) attacks continue to threaten the availability of Internet-based services. While countermeasures exist to decrease the impact of these attacks, not all operators have the resources or knowledge to deploy them. Alternatively, anti-DDoS services such as DDoS clearing houses and blackholing have emerged. Unwanted Traffic Removal Service (UTRS), being one of the oldest community-based anti-DDoS services, has become a global free collaborative service that aims at mitigating major DDoS attacks through the Border Gateway Protocol (BGP). Once the BGP session with UTRS is established, UTRS members can advertise part of the prefixes belonging to their AS to UTRS. UTRS will forward them to all other participants, who, in turn, should start blocking traffic to the advertised IP addresses. In this paper, we develop and evaluate a methodology to automatically detect UTRS participation in the wild. To this end, we deploy a measurement infrastructure and devise a methodology to detect UTRS-based traffic blocking. Using this methodology, we conducted a longitudinal analysis of UTRS participants over ten weeks. Our results show that at any point in time, there were 562 participants, including multihomed, stub, transit, and IXP ASes. Moreover, we surveyed 245 network operators to understand why they would (not) join UTRS. Results show that threat and coping appraisal significantly influence the intention to participate in UTRS.
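At a high level, the detection methodology can be sketched as: probe a target before and after the covering prefix is advertised to UTRS, and infer blocking from a reachable-then-silent transition. Everything below (the system ping with Linux flags, the advertise/withdraw callables) is a hypothetical stand-in for the paper's measurement infrastructure.

```python
# Skeleton of the before/after reachability test; a reachable-then-silent
# transition suggests the remote AS honors UTRS blackholing.

import subprocess

def reachable(ip, count=3, timeout_s=2):
    """True if `ip` answers ICMP echo (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), ip],
        capture_output=True,
    )
    return result.returncode == 0

def utrs_blocking_observed(ip, advertise, withdraw):
    """advertise/withdraw: callables that announce/retract the prefix."""
    before = reachable(ip)
    advertise()
    after = reachable(ip)
    withdraw()
    # Blocking is only inferable if the target was reachable to begin with.
    return before and not after
```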
{"title":"Who's Got My Back? Measuring the Adoption of an Internet-wide BGP RTBH Service","authors":"Radu Anghel, Yury Zhauniarovich, C. Gañán","doi":"10.1145/3639029","DOIUrl":"https://doi.org/10.1145/3639029","url":null,"abstract":"Distributed Denial-of-Service (DDoS) attacks continue to threaten the availability of Internet-based services. While countermeasures exist to decrease the impact of these attacks, not all operators have the resources or knowledge to deploy them. Alternatively, anti-DDoS services such as DDoS clearing houses and blackholing have emerged. Unwanted Traffic Removal Service (UTRS), being one of the oldest community-based anti-DDoS services, has become a global free collaborative service that aims at mitigating major DDoS attacks through the Border Gateway Protocol (BGP). Once the BGP session with UTRS is established, UTRS members can advertise part of the prefixes belonging to their AS to UTRS. UTRS will forward them to all other participants, who, in turn, should start blocking traffic to the advertised IP addresses. In this paper, we develop and evaluate a methodology to automatically detect UTRS participation in the wild. To this end, we deploy a measurement infrastructure and devise a methodology to detect UTRS-based traffic blocking. Using this methodology, we conducted a longitudinal analysis of UTRS participants over ten weeks. Our results show that at any point in time, there were 562 participants, including multihomed, stub, transit, and IXP ASes. Moreover, we surveyed 245 network operators to understand why they would (not) join UTRS. Results show that threat and coping appraisal significantly influence the intention to participate in UTRS.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"56 9","pages":"3:1-3:25"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140455110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Shrinking VOD Traffic via Rényi-Entropic Optimal Transport
Pub Date: 2024-02-16 DOI: 10.1145/3639033
Chi-Jen (Roger) Lo, Mahesh K. Marina, N. Sastry, Kai Xu, Saeed Fadaei, Yong Li
In response to the exponential surge in Internet Video on Demand (VOD) traffic, numerous research endeavors have concentrated on optimizing and enhancing infrastructure efficiency. In contrast, this paper explores whether users' demand patterns can be shaped to reduce the pressure on infrastructure. Our main idea is to design a mechanism that alters the distribution of user requests to another distribution which is much more cache-efficient, but still remains 'close enough' (in the sense of cost) to fulfil each individual user's preference. To quantify the cache footprint of VOD traffic, we propose a novel application of Rényi entropy as its proxy, capturing the 'richness' (the number of distinct videos or cache size) and the 'evenness' (the relative popularity of video accesses) of the on-demand video distribution. We then demonstrate how to decrease this metric by formulating a problem drawing on the mathematical theory of optimal transport (OT). Additionally, we establish a key equivalence theorem: minimizing Rényi entropy corresponds to maximizing soft cache hit ratio (SCHR) --- a variant of cache hit ratio allowing similarity-based video substitutions. Evaluation on a real-world, city-scale video viewing dataset reveals a remarkable 83% reduction in cache size (associated with VOD caching traffic). Crucially, in alignment with the above-mentioned equivalence theorem, our approach yields a significant uplift to SCHR, achieving close to 100%.
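The proxy metric is simple to compute directly: the order-a Rényi entropy is H_a(p) = log(sum_i p_i^a) / (1 - a). The sketch below evaluates it for a made-up Zipf-like demand versus a uniform one over the same catalog; the skewed distribution has the smaller entropy, i.e., the smaller effective cache footprint that request shaping exploits.

```python
# Rényi entropy of a video-popularity distribution, the paper's proxy for
# cache footprint. The Zipf-like demand below is an invented example.

import math

def renyi_entropy(p, alpha):
    """Order-alpha Rényi entropy; alpha=1 is the Shannon limit."""
    assert alpha > 0 and alpha != 1
    return math.log(sum(pi ** alpha for pi in p)) / (1 - alpha)

n = 1000                                    # catalog of 1000 videos
zipf = [1.0 / (k + 1) for k in range(n)]    # skewed (Zipf-like) demand
total = sum(zipf)
zipf = [w / total for w in zipf]
uniform = [1.0 / n] * n                     # perfectly even demand

print("Zipf    H_2:", renyi_entropy(zipf, 2))     # ~3.5
print("uniform H_2:", renyi_entropy(uniform, 2))  # ~6.9 (= ln 1000)
```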
{"title":"Shrinking VOD Traffic via Rényi-Entropic Optimal Transport","authors":"Chi-Jen (Roger) Lo, Mahesh K. Marina, N. Sastry, Kai Xu, Saeed Fadaei, Yong Li","doi":"10.1145/3639033","DOIUrl":"https://doi.org/10.1145/3639033","url":null,"abstract":"In response to the exponential surge in Internet Video on Demand (VOD) traffic, numerous research endeavors have concentrated on optimizing and enhancing infrastructure efficiency. In contrast, this paper explores whether users' demand patterns can be shaped to reduce the pressure on infrastructure. Our main idea is to design a mechanism that alters the distribution of user requests to another distribution which is much more cache-efficient, but still remains 'close enough' (in the sense of cost) to fulfil each individual user's preference. To quantify the cache footprint of VOD traffic, we propose a novel application of Rényi entropy as its proxy, capturing the 'richness' (the number of distinct videos or cache size) and the 'evenness' (the relative popularity of video accesses) of the on-demand video distribution. We then demonstrate how to decrease this metric by formulating a problem drawing on the mathematical theory of optimal transport (OT). Additionally, we establish a key equivalence theorem: minimizing Rényi entropy corresponds to maximizing soft cache hit ratio (SCHR) --- a variant of cache hit ratio allowing similarity-based video substitutions. Evaluation on a real-world, city-scale video viewing dataset reveals a remarkable 83% reduction in cache size (associated with VOD caching traffic). Crucially, in alignment with the above-mentioned equivalence theorem, our approach yields a significant uplift to SCHR, achieving close to 100%.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"333 1","pages":"7:1-7:34"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140453783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
H3DM: A High-bandwidth High-capacity Hybrid 3D Memory Design for GPUs
Pub Date: 2024-02-16 DOI: 10.1145/3639038
N. Akbarzadeh, Sina Darabi, A. Gheibi-Fetrat, Amir Mirzaei, Mohammad Sadrosadati, H. Sarbazi-Azad
Graphics Processing Units (GPUs) are widely used for modern applications with huge data sizes. However, the performance benefit of GPUs is limited by their memory capacity and bandwidth. Although GPU vendors improve memory capacity and bandwidth using 3D memory technology (HBM), many important workloads with terabytes of data still cannot fit in the provided capacity and are bound by the provided bandwidth. With a limited GPU memory capacity, programmers must handle the data movement between GPU and host memories themselves, causing a significant programming burden. To ease programming, GPUs use a unified address space with the host that allows over-subscribing GPU memory, but this approach is not effective in terms of performance once GPUs encounter memory page faults. Many recent works have tried to remedy capacity and bandwidth bottlenecks using dense non-volatile memories (NVMs) and true-3D stacking. However, these works mainly focus on one bottleneck or do not provide a scalable solution that fits future requirements. In this paper, we investigate true-3D stacking of dense, low-power, and refresh-free non-volatile phase change memory (PCM) on top of state-of-the-art GPU configurations to provide higher capacity and bandwidth within the available area and power budget. The higher density and lower power consumption of PCM provide higher capacity through integrating more cells in each 3D layer and enabling stacking more layers. However, we observe that stacking more than six layers of pure-PCM memory violates the thermal constraint and severely harms the performance and power efficiency due to its higher write latency and energy. Further, it degrades the lifetime of the GPU to less than one year. Hybrid architectures that leverage the benefits of both DRAM and PCM memories have been widely studied in prior proposals; however, true-3D integration of such a hybrid memory architecture, especially on top of state-of-the-art, powerful GPU architectures, has not been investigated yet. We experimentally demonstrate that by covering 80% of write requests in DRAM and eliminating refresh overhead, true-3D stacking of eight 32GB layers of PCM along with two 8GB layers of DRAM is possible, resulting in a total of 272GB memory capacity. Based on the explored design requirements, we propose a 3D high-bandwidth high-capacity hybrid memory (H3DM) system utilizing a hybrid-3D (H3D)-aware remapping scheme to reduce expensive PCM writes to under 20% while avoiding DRAM refresh overhead. H3DM improves performance by up to 291% compared to the baseline GPU architecture while remaining within only 3% of an ideal case with DRAM-like access latency, on average. Moreover, by increasing the dataset size above the baseline GPU memory space, H3DM improves performance and power efficiency by up to 648% and 87%, respectively, compared to the baseline GPU architecture since it avoids expensive data transfers through off-chip communication links.
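A toy model of the remapping goal, keeping write-hot pages in the small DRAM tier so that most writes avoid the slow, wear-prone PCM, is sketched below; the counter-based policy is an illustration of the idea, not the paper's H3D-aware scheme.

```python
# Hot-page tracking for a DRAM/PCM hybrid: the DRAM_SLOTS write-hottest
# pages live in DRAM, everything else falls through to PCM.

from collections import Counter

DRAM_SLOTS = 4          # tiny DRAM tier for the example

write_counts = Counter()
dram_resident = set()

def write(page):
    """Record a write and return the tier that absorbs it."""
    write_counts[page] += 1
    hottest = {p for p, _ in write_counts.most_common(DRAM_SLOTS)}
    dram_resident.clear()
    dram_resident.update(hottest)
    return "DRAM" if page in dram_resident else "PCM"

trace = [1, 1, 2, 1, 3, 2, 4, 1, 2, 5, 1, 2, 3, 1]
hits = sum(write(p) == "DRAM" for p in trace)
print(f"{hits}/{len(trace)} writes served by DRAM")
```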
{"title":"H3DM: A High-bandwidth High-capacity Hybrid 3D Memory Design for GPUs","authors":"N. Akbarzadeh, Sina Darabi, A. Gheibi-Fetrat, Amir Mirzaei, Mohammad Sadrosadati, H. Sarbazi-Azad","doi":"10.1145/3639038","DOIUrl":"https://doi.org/10.1145/3639038","url":null,"abstract":"Graphics Processing Units (GPUs) are widely used for modern applications with huge data sizes. However, the performance benefit of GPUs is limited by their memory capacity and bandwidth. Although GPU vendors improve memory capacity and bandwidth using 3D memory technology (HBM), many important workloads with terabytes of data still cannot fit in the provided capacity and are bound by the provided bandwidth. With a limited GPU memory capacity, programmers should handle the data movement between GPU and host memories by themselves, causing a significant programming burden. To improve programming ease, GPUs use a unified address space with the host that allows over-subscribing GPU memory, but this approach is not effective in terms of performance once GPUs encounter memory page faults. Many recent works have tried to remedy capacity and bandwidth bottlenecks using dense non-volatile memories (NVMs) and true-3D stacking. However, these works mainly focus on one bottleneck or do not provide a scalable solution that fits future requirements. In this paper, we investigate true-3D stacking of dense, low-power, and refresh-free non-volatile phase change memory (PCM) on top of state-of-the-art GPU configurations to provide higher capacity and bandwidth within the available area and power budget. The higher density and lower power consumption of PCM provide higher capacity through integrating more cells in each 3D layer and enabling stacking more layers. However, we observe that stacking more than six layers of pure-PCM memory violates the thermal constraint and severely harms the performance and power efficiency due to its higher write latency and energy. Further, it degrades the lifetime of GPU to less than one year. Utilizing a hybrid architecture that leverages the benefits of both DRAM and PCM memories has been widely studied by prior proposals; however, true-3D integration of such a hybrid memory architecture especially on top of state-of-the-art powerful GPU architecture has not been investigated yet. We experimentally demonstrate that by covering 80% of write requests in DRAM and eliminating refresh overhead, true-3D stacking of eight 32GB layers of PCM along with two 8GB layers of DRAM is possible resulting in a total of 272GB memory capacity. Based on the explored design requirements, We propose a 3D high-bandwidth high-capacity hybrid memory (H3DM) system utilizing a hybrid-3D (H3D)-aware remapping scheme to reduce expensive PCM writes to under 20% while avoiding DRAM refresh overhead. H3DM improves the performance up to 291% compared to the baseline GPU architecture while remaining within only 3% of an ideal case with DRAM-like access latency, on average. Moreover, by increasing the dataset size above the baseline GPU memory space, H3DM improves performance and power up to 648% and 87% compared to the baseline GPU architecture since it avoids expensive data transfers through off-chip communication links.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. 
Syst.","volume":"593 ","pages":"12:1-12:28"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140454162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
StarShip: Mitigating I/O Bottlenecks in Serverless Computing for Scientific Workflows
Pub Date: 2024-02-16 DOI: 10.1145/3639028
Rohan Basu Roy, Devesh Tiwari
This work highlights the significance of I/O bottlenecks that data-intensive HPC workflows face in serverless environments - an issue that has been largely overlooked by prior works. To address this challenge, we propose a novel framework, StarShip, which effectively addresses I/O bottlenecks for HPC workflows executing in serverless environments by leveraging different storage options and multi-tier functions, co-optimizing for service time and service cost. StarShip exploits the Levenberg-Marquardt optimization method to find an effective solution in a large, complex search space. StarShip achieves significantly better performance and cost compared to competing techniques, improving service time by 45% and service cost by 37.6% on average over state-of-the-art solutions.
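As a minimal illustration of co-optimizing service time and cost with Levenberg-Marquardt, the sketch below minimizes a two-term residual over a single continuous knob using SciPy's LM solver. The time and cost models, and the knob itself (the fraction of data placed on fast storage), are invented for the example; StarShip's real search space spans storage options and multi-tier functions.

```python
# Levenberg-Marquardt over a toy time/cost trade-off. The models are
# hypothetical; only the use of LM mirrors the abstract.

import numpy as np
from scipy.optimize import least_squares

def residuals(x):
    frac_fast = x[0]                       # fraction of data on fast storage
    service_time = 10.0 - 8.0 * frac_fast  # seconds: fast storage cuts I/O time
    service_cost = 1.0 + 4.0 * frac_fast   # dollars: fast storage costs more
    weight = 1.0                           # cost-vs-time trade-off knob
    return np.array([service_time, weight * service_cost])

sol = least_squares(residuals, x0=[0.5], method="lm")
print("fraction on fast storage:", sol.x[0])   # ~0.95 for these toy models
```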
{"title":"StarShip: Mitigating I/O Bottlenecks in Serverless Computing for Scientific Workflows","authors":"Rohan Basu Roy, Devesh Tiwari","doi":"10.1145/3639028","DOIUrl":"https://doi.org/10.1145/3639028","url":null,"abstract":"This work highlights the significance of I/O bottlenecks that data-intensive HPC workflows face in serverless environments - an issue that has been largely overlooked by prior works. To address this challenge, we propose a novel framework, StarShip, which effectively addresses I/O bottlenecks for HPC workflows executing in serverless environments by leveraging different storage options and multi-tier functions, co-optimizing for service time and service cost. StarShip exploits the Levenberg-Marquardt optimization method to find an effective solution in a large, complex search space. StarShip achieves significantly better performance and cost compared to competing techniques, improving service time by 45% and service cost by 37.6% on average over state-of-the-art solutions.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"492 4","pages":"2:1-2:29"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140454031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Thorough Characterization and Analysis of Large Transformer Model Training At-Scale
Pub Date: 2024-02-16 DOI: 10.1145/3639034
Scott Cheng, Jun-Liang Lin, M. Emani, Siddhisanket Raskar, Sam Foreman, Zhen Xie, Venkat Vishwanath, M. Kandemir
Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth since a combination of model sharding and multiple parallelism strategies incurs various costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism reveal significantly distinct training profiles on different system bandwidths at scale and, thus, need a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies among six model sizes. We also evaluate three combinations of model parallelism on both high and low bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling, shaping the future development of supercomputing system design.
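The breakdown described above lends itself to a back-of-envelope model: step time is compute time plus the exposed (non-overlapped) communication time, with communication set by gradient volume over interconnect bandwidth. All numbers below are illustrative placeholders.

```python
# Step-time model: compute + exposed communication. Halving bandwidth
# inflates only the communication term, which is why lower-bandwidth
# systems show such different scaling profiles.

def step_time_s(params, flops_per_param, achieved_flops,
                bytes_per_param, bandwidth_bytes_per_s, overlap=0.0):
    compute = params * flops_per_param / achieved_flops
    comm = params * bytes_per_param / bandwidth_bytes_per_s
    # `overlap` is the fraction of communication hidden behind compute.
    return compute + (1.0 - overlap) * comm

# Same model at two bandwidths: communication dominates on the slower link.
for bw in (400e9, 25e9):   # bytes/s: NVLink-class vs. commodity Ethernet
    t = step_time_s(params=7e9, flops_per_param=6.0, achieved_flops=100e12,
                    bytes_per_param=2.0, bandwidth_bytes_per_s=bw)
    print(f"bandwidth {bw:.0e} B/s -> step time {t:.3f} s")
```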
{"title":"Thorough Characterization and Analysis of Large Transformer Model Training At-Scale","authors":"Scott Cheng, Jun-Liang Lin, M. Emani, Siddhisanket Raskar, Sam Foreman, Zhen Xie, Venkat Vishwanath, M. Kandemir","doi":"10.1145/3639034","DOIUrl":"https://doi.org/10.1145/3639034","url":null,"abstract":"Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, a large transformer model training today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth since a combination of model sharding and multiple parallelism strategies incurs various costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism reveal significantly distinct training profiles on different system bandwidths at scale and, thus, need a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies among six model sizes. We also evaluate three combinations of model parallelism on both high and low bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling, shaping the future development of supercomputing system design.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"358 1","pages":"8:1-8:25"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140454467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0