Expert-Calibrated Learning for Online Optimization with Switching Costs. Peng Li, Jianyi Yang, Shaolei Ren. DOI: 10.1145/3530894 (published 2022-04-18).
We study online convex optimization with switching costs, a practically important but extremely challenging problem due to the lack of complete offline information. By tapping into the power of machine learning (ML) based optimizers, ML-augmented online algorithms (also referred to as expert calibration in this paper) have emerged as the state of the art, with provable worst-case performance guarantees. Nonetheless, under the standard practice of training an ML model as a standalone optimizer and plugging it into an ML-augmented algorithm, the average cost performance can be highly unsatisfactory. To address this "how to learn" challenge, we propose EC-L2O (expert-calibrated learning to optimize), which trains an ML-based optimizer by explicitly taking into account the downstream expert calibrator. To accomplish this, we propose a new differentiable expert calibrator that generalizes regularized online balanced descent and offers a provably better competitive ratio than pure ML predictions when the prediction error is large. For training, our loss function is a weighted sum of two losses: one minimizing the average ML prediction error for better robustness, and the other minimizing the post-calibration average cost. We also provide theoretical analysis for EC-L2O, highlighting that expert calibration can even be beneficial for the average cost performance and that the high-percentile tail ratio of the cost achieved by EC-L2O to that of the offline optimal oracle (i.e., the tail cost ratio) can be bounded. Finally, we test EC-L2O by running simulations for sustainable datacenter demand response. Our results demonstrate that EC-L2O empirically achieves both a lower average cost and a lower competitive ratio than existing baseline algorithms.
Mars: Near-Optimal Throughput with Shallow Buffers in Reconfigurable Datacenter Networks. Vamsi Addanki, C. Avin, S. Schmid. DOI: 10.1145/3579312 (published 2022-04-06).
The performance of large-scale computing systems often critically depends on high-performance communication networks. Dynamically reconfigurable topologies, e.g., based on optical circuit switches, are emerging as an innovative technology to deal with the explosive growth of datacenter traffic. Specifically, periodic reconfigurable datacenter networks (RDCNs) such as RotorNet (SIGCOMM 2017), Opera (NSDI 2020) and Sirius (SIGCOMM 2020) have been shown to provide high throughput by emulating a complete graph through fast periodic circuit switch scheduling. However, to achieve such high throughput, existing reconfigurable network designs pay a high price: in terms of potentially high delays, and also, as we show as a first contribution of this paper, in terms of high buffer requirements. In particular, we show that under buffer constraints, emulating the high-throughput complete graph is infeasible at scale, and we uncover a spectrum of unexplored and attractive alternative RDCNs, which emulate regular graphs with lower node degree than the complete graph. We present Mars, a periodic reconfigurable topology which emulates a d-regular graph with near-optimal throughput. In particular, we systematically analyze how the degree d can be optimized for throughput given the available buffer and delay tolerance of the datacenter. We further show empirically that Mars achieves higher throughput compared to existing systems when buffer sizes are bounded.
Switching in the Rain: Predictive Wireless x-haul Network Reconfiguration. I. Kadota, Dror Jacoby, H. Messer, G. Zussman, J. Ostrometzky. DOI: 10.1145/3570616 (published 2022-03-07).
4G, 5G, and smart city networks often rely on microwave and millimeter-wave x-haul links. A major challenge associated with these high-frequency links is their susceptibility to weather conditions. In particular, precipitation may cause severe signal attenuation, which significantly degrades the network performance. In this paper, we develop a Predictive Network Reconfiguration (PNR) framework that uses historical data to predict the future condition of each link and then prepares the network ahead of time for imminent disturbances. The PNR framework has two components: (i) an Attenuation Prediction (AP) mechanism; and (ii) a Multi-Step Network Reconfiguration (MSNR) algorithm. The AP mechanism employs an encoder-decoder Long Short-Term Memory (LSTM) model to predict the sequence of future attenuation levels of each link. The MSNR algorithm leverages these predictions to dynamically optimize routing and admission control decisions, aiming to maximize network utilization while preserving max-min fairness among the nodes using the network (e.g., base stations) and preventing transient congestion that may be caused by switching routes. We train, validate, and evaluate the PNR framework using a dataset containing over 2 million measurements collected from a real-world city-scale backhaul network. The results show that the framework: (i) predicts attenuation with high accuracy, with an RMSE of less than 0.4 dB for a prediction horizon of 50 seconds; and (ii) can improve the instantaneous network utilization by more than 200% when compared to reactive network reconfiguration algorithms that cannot leverage information about future disturbances.
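A minimal encoder-decoder LSTM of the kind described above might look as follows; the layer sizes, the 50-step horizon, and the single-feature input are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AttenuationSeq2Seq(nn.Module):
    """Encode a window of past attenuation samples, decode a multi-step forecast."""
    def __init__(self, n_features=1, hidden=64, horizon=50):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, past):                       # past: (batch, T_in, n_features)
        _, state = self.encoder(past)              # summarize history into (h, c)
        step = past[:, -1:, :]                     # seed decoder with the last observation
        outputs = []
        for _ in range(self.horizon):              # autoregressive multi-step decoding
            out, state = self.decoder(step, state)
            step = self.head(out)                  # predicted attenuation for the next step
            outputs.append(step)
        return torch.cat(outputs, dim=1)           # (batch, horizon, n_features)

model = AttenuationSeq2Seq()
history = torch.randn(8, 120, 1)                   # 8 links, 120 past samples each
forecast = model(history)                          # (8, 50, 1) future attenuation levels
```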
{"title":"Switching in the Rain: Predictive Wireless x-haul Network Reconfiguration","authors":"I. Kadota, Dror Jacoby, H. Messer, G. Zussman, J. Ostrometzky","doi":"10.1145/3570616","DOIUrl":"https://doi.org/10.1145/3570616","url":null,"abstract":"4G, 5G, and smart city networks often rely on microwave and millimeter-wave x-haul links. A major challenge associated with these high frequency links is their susceptibility to weather conditions. In particular, precipitation may cause severe signal attenuation, which significantly degrades the network performance. In this paper, we develop a Predictive Network Reconfiguration (PNR) framework that uses historical data to predict the future condition of each link and then prepares the network ahead of time for imminent disturbances. The PNR framework has two components: (i) an Attenuation Prediction (AP) mechanism; and (ii) a Multi-Step Network Reconfiguration (MSNR) algorithm. The AP mechanism employs an encoder-decoder Long Short-Term Memory (LSTM) model to predict the sequence of future attenuation levels of each link. The MSNR algorithm leverages these predictions to dynamically optimize routing and admission control decisions aiming to maximize network utilization, while preserving max-min fairness among the nodes using the network (e.g., base-stations) and preventing transient congestion that may be caused by switching routes. We train, validate, and evaluate the PNR framework using a dataset containing over 2 million measurements collected from a real-world city-scale backhaul network. The results show that the framework: (i) predicts attenuation with high accuracy, with an RMSE of less than 0.4 dB for a prediction horizon of 50 seconds; and (ii) can improve the instantaneous network utilization by more than 200% when compared to reactive network reconfiguration algorithms that cannot leverage information about future disturbances.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125450208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust Multi-Agent Bandits Over Undirected Graphs. Daniel Vial, S. Shakkottai, R. Srikant. DOI: 10.48550/arXiv.2203.00076 (published 2022-02-28).
We consider a multi-agent multi-armed bandit setting in which n honest agents collaborate over a network to minimize regret but m malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur O((m + K/n) log(T)/Δ) regret in this setting, where K is the number of arms and Δ is the arm gap. For m ≪ K, this improves over the single-agent baseline regret of O(K log(T)/Δ). In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in K and n. In light of this negative result, we propose a new algorithm for which the i-th agent has regret O((d_mal(i) + K/n) log(T)/Δ) on any connected and undirected graph, where d_mal(i) is the number of i's neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where d_mal(i) = m), and show that the effect of malicious agents is entirely local (in the sense that only the d_mal(i) malicious agents directly connected to i affect its long-term regret).
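Restated side by side, the three regret bounds mentioned in the abstract read as follows, where R_i(T) is assumed here to denote agent i's cumulative regret over horizon T:

```latex
\begin{align*}
  \text{single agent:}                    &\quad R_i(T) = O\!\Big(\tfrac{K \log T}{\Delta}\Big) \\
  \text{complete graph (prior work):}     &\quad R_i(T) = O\!\Big(\tfrac{(m + K/n)\,\log T}{\Delta}\Big) \\
  \text{any connected graph (this work):} &\quad R_i(T) = O\!\Big(\tfrac{(d_{\mathrm{mal}}(i) + K/n)\,\log T}{\Delta}\Big)
\end{align*}
```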
{"title":"Robust Multi-Agent Bandits Over Undirected Graphs","authors":"Daniel Vial, S. Shakkottai, R. Srikant","doi":"10.48550/arXiv.2203.00076","DOIUrl":"https://doi.org/10.48550/arXiv.2203.00076","url":null,"abstract":"We consider a multi-agent multi-armed bandit setting in which n honest agents collaborate over a network to minimize regret but m malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur O((m + K/n) łog (T) / Δ ) regret in this setting, where K is the number of arms and Δ is the arm gap. For m łl K, this improves over the single-agent baseline regret of O(Kłog(T)/Δ). In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in K and n. In light of this negative result, we propose a new algorithm for which the i-th agent has regret O(( dmal (i) + K/n) łog(T)/Δ) on any connected and undirected graph, where dmal(i) is the number of i's neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where dmal(i) = m), and show the effect of malicious agents is entirely local (in the sense that only the dmal (i) malicious agents directly connected to i affect its long-term regret).","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127506180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Comprehensive Empirical Study of Query Performance Across GPU DBMSes. Young-Kyoon Suh, Jun Young An, Byungchul Tak, Gap-Joo Na. DOI: 10.1145/3508024 (published 2022-02-24).
In recent years, GPU database management systems (DBMSes) have rapidly become popular, largely due to their remarkable acceleration capability obtained through extreme parallelism in query evaluation. However, there has been relatively little study of the characteristics of these GPU DBMSes toward a better understanding of their query performance in various contexts. Likewise, little is known about which potential factors affect query processing within GPU DBMSes. To fill this gap, we have conducted a study to identify such factors and to propose a structural causal model, including key factors and their relationships, to explain the variance of query execution times on GPU DBMSes. We have also established a set of hypotheses drawn from the model that explain the observed performance characteristics. To test the model, we designed and ran comprehensive experiments and conducted in-depth statistical analyses of the resulting empirical data. Our model explains about 77% of the variance in query time and indicates that reducing kernel time and data transfer time are the key levers for improving query time. Our results also show that the studied systems should resolve several concerns, such as processing bounded by GPU memory capacity, a lack of rich query evaluation operators, limited scalability, and GPU under-utilization.
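As a rough illustration of the "variance explained" figure, one can regress query time on candidate factors such as kernel time and data transfer time and report R²; the snippet below does exactly that on synthetic numbers. It is not the paper's structural causal model, which additionally encodes relationships among the factors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-query measurements (seconds): kernel time, transfer time, noise.
kernel_time = rng.uniform(0.05, 2.0, size=200)
transfer_time = rng.uniform(0.01, 1.0, size=200)
query_time = 1.1 * kernel_time + 0.9 * transfer_time + rng.normal(0, 0.15, size=200)

# Ordinary least squares: query_time ~ kernel_time + transfer_time + intercept.
X = np.column_stack([kernel_time, transfer_time, np.ones_like(kernel_time)])
coef, *_ = np.linalg.lstsq(X, query_time, rcond=None)

pred = X @ coef
ss_res = np.sum((query_time - pred) ** 2)
ss_tot = np.sum((query_time - query_time.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"variance explained (R^2): {r2:.2f}")  # high here by construction
```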
{"title":"A Comprehensive Empirical Study of Query Performance Across GPU DBMSes","authors":"Young-Kyoon Suh, Jun Young An, Byungchul Tak, Gap-Joo Na","doi":"10.1145/3508024","DOIUrl":"https://doi.org/10.1145/3508024","url":null,"abstract":"In recent years, GPU database management systems (DBMSes) have rapidly become popular largely due to their remarkable acceleration capability obtained through extreme parallelism in query evaluations. However, there has been relatively little study on the characteristics of these GPU DBMSes for a better understanding of their query performance in various contexts. Also, little has been known about what the potential factors could be that affect the query processing jobs within the GPU DBMSes. To fill this gap, we have conducted a study to identify such factors and to propose a structural causal model, including key factors and their relationships, to explicate the variances of the query execution times on the GPU DBMSes. We have also established a set of hypotheses drawn from the model that explained the performance characteristics. To test the model, we have designed and run comprehensive experiments and conducted in-depth statistical analyses on the obtained empirical data. As a result, our model achieves about 77% amount of variance explained on the query time and indicates that reducing kernel time and data transfer time are the key factors to improve the query time. Also, our results show that the studied systems should resolve several concerns such as bounded processing within GPU memory, lack of rich query evaluation operators, limited scalability, and GPU under-utilization.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116905432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Metamorphic Testing of Deep Learning Compilers. Dongwei Xiao, Zhibo Liu, Yuanyuan Yuan, Qi Pang, Shuai Wang. DOI: 10.1145/3508035 (published 2022-02-24).
The prosperous trend of deploying deep neural network (DNN) models to diverse hardware platforms has boosted the development of deep learning (DL) compilers. DL compilers take high-level DNN model specifications as input and generate optimized DNN executables for diverse hardware architectures like CPUs, GPUs, and various hardware accelerators. Compiling DNN models into high-efficiency executables is not easy: the compilation procedure often involves converting high-level model specifications into several different intermediate representations (IRs), e.g., graph IR and operator IR, and performing rule-based or learning-based optimizations from both platform-independent and platform-dependent perspectives. Despite the widespread adoption of DL compilers in real-world scenarios, a principled and systematic understanding of their correctness does not yet exist. To fill this critical gap, this paper introduces MT-DLComp, a metamorphic testing framework specifically designed for DL compilers to effectively uncover erroneous compilations. Our approach leverages deliberately designed metamorphic relations (MRs) to apply semantics-preserving mutations to DNN models and generate variants. This way, DL compilers can be automatically examined for compilation correctness using DNN models and their variants, without requiring manual intervention. We also develop a set of practical techniques to realize an effective workflow and to localize the identified error-revealing inputs. Real-world DL compilers exhibit a high level of engineering quality. Nevertheless, we detected over 435 inputs that can result in erroneous compilations in four popular DL compilers, all of which are industry-strength products maintained by Amazon, Facebook, Microsoft, and Google. While the discovered error-triggering inputs do not cause the DL compilers to crash directly, they can lead to the generation of incorrect DNN executables. With substantial manual effort and help from the DL compiler developers, we uncovered four bugs in these DL compilers by debugging them with the error-triggering inputs. Our proposed testing framework and findings can be used to guide developers in their efforts to improve DL compilers.
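The core metamorphic check can be sketched as follows. The "compiler outputs" are stood in for by plain Python functions so the example is self-contained, and the specific mutation (inserting multiply-by-one and add-zero operations) is only one hypothetical example of a semantics-preserving MR, not necessarily one of the paper's.

```python
import numpy as np

def model(x, w):
    # A stand-in "DNN": one dense layer followed by ReLU.
    return np.maximum(x @ w, 0.0)

def mutated_model(x, w):
    # Semantics-preserving mutation: multiply by one and add zero, which any
    # correct compilation pipeline must not change numerically.
    return np.maximum((x @ w) * 1.0 + 0.0, 0.0)

def metamorphic_check(f, g, n_inputs=100, tol=1e-6):
    rng = np.random.default_rng(1)
    w = rng.standard_normal((16, 8))
    for _ in range(n_inputs):
        x = rng.standard_normal((4, 16))
        if not np.allclose(f(x, w), g(x, w), atol=tol):
            return False   # an MR violation: x is an error-revealing input
    return True

# In MT-DLComp's setting, f and g would be executables emitted by a DL compiler
# for the original and mutated models; disagreement flags a suspicious compilation.
print(metamorphic_check(model, mutated_model))
```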
{"title":"Metamorphic Testing of Deep Learning Compilers","authors":"Dongwei Xiao, Zhibo Liu, Yuanyuan Yuan, Qi Pang, Shuai Wang","doi":"10.1145/3508035","DOIUrl":"https://doi.org/10.1145/3508035","url":null,"abstract":"The prosperous trend of deploying deep neural network (DNN) models to diverse hardware platforms has boosted the development of deep learning (DL) compilers. DL compilers take the high-level DNN model specifications as input and generate optimized DNN executables for diverse hardware architectures like CPUs, GPUs, and various hardware accelerators. Compiling DNN models into high-efficiency executables is not easy: the compilation procedure often involves converting high-level model specifications into several different intermediate representations (IR), e.g., graph IR and operator IR, and performing rule-based or learning-based optimizations from both platform-independent and platform-dependent perspectives. Despite the prosperous adoption of DL compilers in real-world scenarios, principled and systematic understanding toward the correctness of DL compilers does not yet exist. To fill this critical gap, this paper introduces MT-DLComp, a metamorphic testing framework specifically designed for DL compilers to effectively uncover erroneous compilations. Our approach leverages deliberately-designed metamorphic relations (MRs) to launch semantics-preserving mutations toward DNN models to generate their variants. This way, DL compilers can be automatically examined for compilation correctness utilizing DNN models and their variants without requiring manual intervention. We also develop a set of practical techniques to realize an effective workflow and localize identified error-revealing inputs. Real-world DL compilers exhibit a high level of engineering quality. Nevertheless, we detected over 435 inputs that can result in erroneous compilations in four popular DL compilers, all of which are industry-strength products maintained by Amazon, Facebook, Microsoft, and Google. While the discovered error-triggering inputs do not cause the DL compilers to crash directly, they can lead to the generation of incorrect DNN executables. With substantial manual effort and help from the DL compiler developers, we uncovered four bugs in these DL compilers by debugging them using the error-triggering inputs. Our proposed testing frameworks and findings can be used to guide developers in their efforts to improve DL compilers.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127181047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Convection. Soheil Khadirsharbiyani, Jagadish B. Kotra, Karthik Rao, M. Kandemir. DOI: 10.1145/3508027 (published 2022-02-24).
Stacked DRAMs have been studied, evaluated in multiple scenarios, and even productized in the last decade. The large available bandwidth they offer makes them an attractive choice, particularly in high-performance computing (HPC) environments. Consequently, many prior research efforts have studied and evaluated 3D stacked DRAM-based designs. Despite offering high bandwidth, stacked DRAMs are severely constrained by the overall memory capacity offered. In this paper, we study and evaluate integrating stacked DRAM on top of a GPU in a 3D manner, which, in tandem with the 2.5D stacked DRAM, increases the capacity and the bandwidth without increasing the package size. This integration of 3D stacked DRAMs aids in satisfying the capacity requirements of emerging workloads like deep learning. Although this vertical 3D integration of stacked DRAMs also increases the total available bandwidth, we observe that the bandwidth offered by these 3D stacked DRAMs is severely limited by the heat generated on the GPU. Based on our experiments on a cycle-level simulator, we make a key observation: the sections of the 3D stacked DRAM that are closer to the GPU have lower retention times than the farther layers of stacked DRAM. These thermal-induced variable retention times cause certain sections of the 3D stacked DRAM to be refreshed more frequently than others, thereby resulting in thermal-induced NUMA behavior. To alleviate this behavior, we propose and experimentally evaluate three different incarnations of Data Convection, i.e., Intra-layer, Inter-layer, and Intra + Inter-layer, that aim at placing the most frequently accessed data in a thermal-induced, retention-aware fashion, taking into account both bank-level and channel-level parallelism. Our evaluations on a cycle-level GPU simulator indicate that, in a multi-application scenario, our Intra-layer, Inter-layer and Intra + Inter-layer algorithms improve the overall performance by 1.8%, 11.7%, and 14.4%, respectively, over a baseline that already encompasses 3D+2.5D stacked DRAMs.
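A greatly simplified, hypothetical version of retention-aware placement is sketched below: the hottest pages are steered toward DRAM regions with longer retention times (i.e., lower refresh overhead), spilling to the next-best region as each fills up. Bank- and channel-level parallelism is ignored here for brevity; this only illustrates the flavor of the idea, not the paper's Intra-layer, Inter-layer, or Intra + Inter-layer algorithms.

```python
def place_pages(page_hotness, region_retention_ms, region_capacity=2):
    """Fill the longest-retention regions first with the hottest pages.

    page_hotness        : {page_id: access_count}
    region_retention_ms : {region_id: retention_time_ms} (e.g., per stacked layer)
    region_capacity     : pages each region can hold (toy value)
    """
    pages = sorted(page_hotness, key=page_hotness.get, reverse=True)
    regions = sorted(region_retention_ms, key=region_retention_ms.get, reverse=True)
    placement = {r: [] for r in regions}
    r = 0
    for page in pages:
        if len(placement[regions[r]]) >= region_capacity and r < len(regions) - 1:
            r += 1                      # current region full, spill to the next best
        placement[regions[r]].append(page)
    return placement

hotness = {"p0": 900, "p1": 850, "p2": 40, "p3": 10}
retention = {"layer0_near_gpu": 16.0, "layer1": 32.0, "layer2_far": 64.0}
print(place_pages(hotness, retention))   # hottest pages land in the far, long-retention layer
```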
{"title":"Data Convection","authors":"Soheil Khadirsharbiyani, Jagadish B. Kotra, Karthik Rao, M. Kandemir","doi":"10.1145/3508027","DOIUrl":"https://doi.org/10.1145/3508027","url":null,"abstract":"Stacked DRAMs have been studied, evaluated in multiple scenarios, and even productized in the last decade. The large available bandwidth they offer make them an attractive choice, particularly, in high-performance computing (HPC) environments. Consequently, many prior research efforts have studied and evaluated 3D stacked DRAM-based designs. Despite offering high bandwidth, stacked DRAMs are severely constrained by the overall memory capacity offered. In this paper, we study and evaluate integrating stacked DRAM on top of a GPU in a 3D manner which in tandem with the 2.5D stacked DRAM increases the capacity and the bandwidth without increasing the package size. This integration of 3D stacked DRAMs aids in satisfying the capacity requirements of emerging workloads like deep learning. Though this vertical 3D integration of stacked DRAMs also increases the total available bandwidth, we observe that the bandwidth offered by these 3D stacked DRAMs is severely limited by the heat generated on the GPU. Based on our experiments on a cycle-level simulator, we make a key observation that the sections of the 3D stacked DRAM that are closer to the GPU have lower retention-times compared to the farther layers of stacked DRAM. This thermal-induced variable retention-times causes certain sections of 3D stacked DRAM to be refreshed more frequently compared to the others, thereby resulting in thermal-induced NUMA paradigms. To alleviate such thermal-induced NUMA behavior, we propose and experimentally evaluate three different incarnations of Data Convection, i.e., Intra-layer, Inter-layer, and Intra + Inter-layer, that aim at placing the most-frequently accessed data in a thermal-induced retention-aware fashion, taking into account both bank-level and channel-level parallelism. Our evaluations on a cycle-level GPU simulator indicate that, in a multi-application scenario, our Intra-layer, Inter-layer and Intra + Inter-layer algorithms improve the overall performance by 1.8%, 11.7%, and 14.4%, respectively, over a baseline that already encompasses 3D+2.5D stacked DRAMs.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125534505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding I/O Direct Cache Access Performance for End Host Networking. Minhu Wang, Mingwei Xu, Jianping Wu. DOI: 10.1145/3508042 (published 2022-02-24).
Direct Cache Access (DCA) enables a network interface card (NIC) to load and store data directly in the processor cache, as conventional Direct Memory Access (DMA) is no longer suitable as the bridge between NIC and CPU in the era of 100 Gigabit Ethernet. As numerous I/O devices and cores compete for scarce cache resources, making the most of DCA for networking applications with varied objectives and constraints is a challenge, especially given the increasing complexity of modern cache hardware and I/O stacks. In this paper, we reverse engineer details of one commercial implementation of DCA, Intel's Data Direct I/O (DDIO), to explicate the importance of hardware-level investigation into DCA. Based on the learned knowledge of DCA and network I/O stacks, we (1) develop an analytical framework to predict the effectiveness of DCA (i.e., its hit rate) under certain hardware specifications, system configurations, and application properties; (2) measure the penalties of ineffective use of DCA (i.e., its miss penalty) to characterize its benefits; and (3) show that our reverse engineering, measurement, and model contribute to a deeper understanding of DCA, which in turn helps diagnose, optimize, and design end-host networking.
The First 5G-LTE Comparative Study in Extreme Mobility. Yueyang Pan, Ruihan Li, Chenren Xu. DOI: 10.1145/3508040 (published 2022-02-24).
5G claims to support mobility of up to 500 km/h according to the 3GPP standard. However, its field performance in high-speed scenarios remains a mystery. In this paper, we conduct the first large-scale measurement campaign on a high-speed railway route operating at a maximum speed of 350 km/h, with full coverage of LTE and 5G (NSA and SA) along the track. Our study consumed 1788.8 GiB of cellular data over six months, covering the three major carriers in China and the recently standardized QUIC protocol. Based on our dataset, we reveal the key characteristics of 5G and LTE in extreme mobility in terms of throughput, RTT, loss rate, signal quality, and physical resource utilization. We further develop a taxonomy of handovers in both LTE and 5G and carry out a link-layer latency breakdown analysis. Our study pinpoints deficiencies in the user equipment, radio access network, and core network that hinder seamless connectivity and better utilization of 5G's high bandwidth. Our findings highlight directions for the next step in the 5G evolution.
NURA. Sina Darabi, Negin Mahani, Hazhir Baxishi, Ehsan Yousefzadeh-Asl-Miandoab, Mohammad Sadrosadati, H. Sarbazi-Azad. DOI: 10.1145/3508036 (published 2022-02-24).
Multi-application execution in Graphics Processing Units (GPUs), a promising way to utilize GPU resources, is still challenging. Some prior work (e.g., spatial multitasking) has limited opportunity to improve resource utilization, while other work, e.g., simultaneous multi-kernel, provides fine-grained resource sharing at the price of unfair execution. This paper proposes a new multi-application paradigm for GPUs, called NURA, that provides high potential to improve resource utilization while ensuring fairness and Quality-of-Service (QoS). The key idea is that each streaming multiprocessor (SM) executes Cooperative Thread Arrays (CTAs) belonging to only one application (similar to spatial multitasking) and shares its unused resources with SMs running other applications that demand more resources. NURA handles the resource-sharing process mainly in software to provide simplicity, low hardware cost, and flexibility. We also perform some hardware modifications as architectural support for our software-based proposal. We conservatively analyze the hardware cost of our proposal and observe less than 1.07% area overhead with respect to the whole GPU die. Our experimental results over various mixes of GPU workloads show that NURA improves GPU system throughput by 26% compared to state-of-the-art spatial multitasking, on average, while meeting the QoS target. In terms of fairness, NURA performs similarly to spatial multitasking, while outperforming simultaneous multi-kernel by an average of 76%.
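The spirit of the key idea (each SM primarily runs CTAs of a single application but lends leftover capacity to applications that need more) can be pictured with the hypothetical helper below. It collapses all SM resources into a single unit count and greedily places waiting CTAs into other SMs' leftover capacity; it is not NURA's actual allocation algorithm.

```python
def share_unused_resources(sm_owner, sm_leftover, cta_demand, pending_ctas):
    """Let SMs lend leftover capacity to CTAs of other applications.

    sm_owner     : {sm_id: app_id}  (each SM primarily serves one application)
    sm_leftover  : {sm_id: resource_units_unused_by_the_owner_app}
    cta_demand   : {app_id: resource_units_per_cta}
    pending_ctas : {app_id: number_of_ctas_still_waiting}
    Returns {sm_id: [(app_id, num_borrowed_ctas), ...]}.
    """
    placement = {sm: [] for sm in sm_owner}
    for sm, free in sm_leftover.items():
        for app, waiting in pending_ctas.items():
            if app == sm_owner[sm] or waiting == 0:
                continue                       # owner CTAs were already scheduled
            fit = min(waiting, free // cta_demand[app])
            if fit > 0:
                placement[sm].append((app, fit))
                pending_ctas[app] -= fit
                free -= fit * cta_demand[app]
    return placement

owners   = {"sm0": "A", "sm1": "B"}
leftover = {"sm0": 6, "sm1": 1}                # units not used by each SM's owner app
demand   = {"A": 4, "B": 3}                    # units needed per CTA of each app
waiting  = {"A": 0, "B": 5}                    # app B has CTAs it cannot place on its own SMs
print(share_unused_resources(owners, leftover, demand, waiting))
```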
{"title":"NURA","authors":"Sina Darabi, Negin Mahani, Hazhir Baxishi, Ehsan Yousefzadeh-Asl-Miandoab, Mohammad Sadrosadati, H. Sarbazi-Azad","doi":"10.1145/3508036","DOIUrl":"https://doi.org/10.1145/3508036","url":null,"abstract":"Multi-application execution in Graphics Processing Units (GPUs), a promising way to utilize GPU resources, is still challenging. Some pieces of prior work (e.g., spatial multitasking) have limited opportunity to improve resource utilization, while other works, e.g., simultaneous multi-kernel, provide fine-grained resource sharing at the price of unfair execution. This paper proposes a new multi-application paradigm for GPUs, called NURA, that provides high potential to improve resource utilization and ensures fairness and Quality-of-Service (QoS). The key idea is that each streaming multiprocessor (SM) executes Cooperative Thread Arrays (CTAs) belong to only one application (similar to the spatial multi-tasking) and shares its unused resources with the SMs running other applications demanding more resources. NURA handles resource sharing process mainly using a software approach to provide simplicity, low hardware cost, and flexibility. We also perform some hardware modifications as an architectural support for our software-based proposal. We conservatively analyze the hardware cost of our proposal, and observe less than 1.07% area overhead with respect to the whole GPU die. Our experimental results over various mixes of GPU workloads show that NURA improves GPU system throughput by 26% compared to state-of-the-art spatial multi-tasking, on average, while meeting the QoS target. In terms of fairness, NURA has almost similar results to spatial multitasking, while it outperforms simultaneous multi-kernel by an average of 76%.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122227532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}