Hierarchical Learning Algorithms for Multi-scale Expert Problems
L. Yang, Y. Chen, M. Hajiesmaili, M. Herbster, D. Towsley
DOI: 10.1145/3530900 (2022-05-26)
In this paper, we study the multi-scale expert problem, where the rewards of different experts lie in different ranges. The performance of existing algorithms for this problem degrades linearly with the maximum reward range of any expert (or of the best expert) and fails to capture the non-uniform heterogeneity of reward ranges among experts. In this work, we propose learning algorithms that construct a hierarchical tree structure based on the heterogeneity of the experts' reward ranges and then determine differentiated learning rates from the reward upper bounds and the cumulative empirical feedback over time. We characterize the regret of the proposed algorithms as a function of the non-uniform reward ranges and show that they outperform prior algorithms when the experts' rewards exhibit non-uniform heterogeneity. Finally, our numerical experiments verify the efficiency of our algorithms compared to previous ones.
{"title":"Hierarchical Learning Algorithms for Multi-scale Expert Problems","authors":"L. Yang, Y. Chen, M. Hajiesmaili, M. Herbster, D. Towsley","doi":"10.1145/3530900","DOIUrl":"https://doi.org/10.1145/3530900","url":null,"abstract":"In this paper, we study the multi-scale expert problem, where the rewards of different experts vary in different reward ranges. The performance of existing algorithms for the multi-scale expert problem degrades linearly proportional to the maximum reward range of any expert or the best expert and does not capture the non-uniform heterogeneity in the reward ranges among experts. In this work, we propose learning algorithms that construct a hierarchical tree structure based on the heterogeneity of the reward range of experts and then determine differentiated learning rates based on the reward upper bounds and cumulative empirical feedback over time. We then characterize the regret of the proposed algorithms as a function of non-uniform reward ranges and show that their regrets outperform prior algorithms when the rewards of experts exhibit non-uniform heterogeneity in different ranges. Last, our numerical experiments verify our algorithms' efficiency compared to previous algorithms.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114519869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WISEFUSE
Ashraf Y. Mahgoub, E. Yi, Karthick Shankar, Eshaan Minocha, S. Elnikety, S. Bagchi, S. Chaterji
DOI: 10.1145/3530892 (2022-05-26)
We characterize production workloads of serverless DAGs at a major cloud provider. Our analysis highlights two major factors that limit performance: (a) the lack of efficient communication methods between the serverless functions in a DAG, and (b) stragglers when a DAG stage invokes a set of parallel functions that must all complete before the next stage can start. To address these limitations, we propose WISEFUSE, an automated approach that generates an optimized execution plan for a serverless DAG given a user-specified latency objective or budget. We introduce three optimizations: (1) Fusion combines functions that execute in series into a single VM to reduce the communication overhead between cascaded functions. (2) Bundling executes a group of parallel invocations of a function in one VM to improve resource sharing among the parallel workers and reduce skew. (3) Resource Allocation assigns the right VM size to each function or function bundle in the DAG to reduce end-to-end (E2E) latency and cost. We implement WISEFUSE and evaluate it experimentally on three popular serverless applications with different DAG structures, memory footprints, and intermediate data sizes. Compared to competing approaches, WISEFUSE shows significant improvements in E2E latency and cost. Specifically, for a machine learning pipeline, WISEFUSE achieves a P95 latency that is 67% lower than Photons, 39% lower than Faastlane, and 90% lower than SONIC, without increasing cost.
{"title":"WISEFUSE","authors":"Ashraf Y. Mahgoub, E. Yi, Karthick Shankar, Eshaan Minocha, S. Elnikety, S. Bagchi, S. Chaterji","doi":"10.1145/3530892","DOIUrl":"https://doi.org/10.1145/3530892","url":null,"abstract":"We characterize production workloads of serverless DAGs at a major cloud provider. Our analysis highlights two major factors that limit performance: (a) lack of efficient communication methods between the serverless functions in the DAG, and (b) stragglers when a DAG stage invokes a set of parallel functions that must complete before starting the next DAG stage. To address these limitations, we propose WISEFUSE, an automated approach to generate an optimized execution plan for serverless DAGs for a user-specified latency objective or budget. We introduce three optimizations: (1) Fusion combines in-series functions together in a single VM to reduce the communication overhead between cascaded functions. (2) Bundling executes a group of parallel invocations of a function in one VM to improve resource sharing among the parallel workers to reduce skew. (3) Resource Allocation assigns the right VM size to each function or function bundle in the DAG to reduce the E2E latency and cost. We implement WISEFUSE to evaluate it experimentally using three popular serverless applications with different DAG structures, memory footprints, and intermediate data sizes. Compared to competing approaches and other alternatives, WISEFUSE shows significant improvements in E2E latency and cost. Specifically, for a machine learning pipeline, WISEFUSE achieves P95 latency that is 67% lower than Photons, 39% lower than Faastlane, and 90% lower than SONIC without increasing the cost.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127221094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tensor Completion with Nearly Linear Samples Given Weak Side Information
C. Yu, Xumei Xi
DOI: 10.1145/3530905 (2022-05-26)
Tensor completion exhibits an interesting computational-statistical gap in terms of the number of samples needed to perform tensor estimation. While there are only Θ(tn) degrees of freedom in a t-order tensor with n^t entries, the best known polynomial-time algorithm requires O(n^(t/2)) samples in order to guarantee consistent estimation. In this paper, we show that weak side information is sufficient to reduce the sample complexity to O(n). The side information consists of a weight vector for each of the modes which is not orthogonal to any of the latent factors along that mode; this is significantly weaker than assuming noisy knowledge of the subspaces. We provide an algorithm that utilizes this side information to produce a consistent estimator with O(n^(1+κ)) samples for any small constant κ > 0. We also provide experiments on both synthetic and real-world datasets that validate our theoretical insights.
{"title":"Tensor Completion with Nearly Linear Samples Given Weak Side Information","authors":"C. Yu, Xumei Xi","doi":"10.1145/3530905","DOIUrl":"https://doi.org/10.1145/3530905","url":null,"abstract":"Tensor completion exhibits an interesting computational-statistical gap in terms of the number of samples needed to perform tensor estimation. While there are only Θ(tn) degrees of freedom in a t-order tensor with n^t entries, the best known polynomial time algorithm requires O(n^t/2 ) samples in order to guarantee consistent estimation. In this paper, we show that weak side information is sufficient to reduce the sample complexity to O(n). The side information consists of a weight vector for each of the modes which is not orthogonal to any of the latent factors along that mode; this is significantly weaker than assuming noisy knowledge of the subspaces. We provide an algorithm that utilizes this side information to produce a consistent estimator with O(n^1+κ ) samples for any small constant κ > 0. We also provide experiments on both synthetic and real-world datasets that validate our theoretical insights.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116998971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MalRadar
Liu Wang, Haoyu Wang, Ren He, Ran Tao, Guozhu Meng, Xiapu Luo, Xuanzhe Liu
DOI: 10.1145/3530906 (2022-05-26)
Mobile malware detection has attracted massive research effort in our community. A reliable and up-to-date malware dataset is critical for evaluating the effectiveness of malware detection approaches. Ideally, the malware ground truth should be manually verified by security experts, and the malicious behaviors should be carefully labelled. Although several malware benchmarks are widely used in our community (e.g., MalGenome, Drebin, Piggybacking, and AMD), they suffer from limitations including staleness, limited size and coverage, and reliability issues. In this paper, we create MalRadar, a growing and up-to-date Android malware dataset built in the most reliable way, i.e., by collecting malware based on the analysis reports of security experts. We crawled all mobile-security-related reports released by ten leading security companies and used an automated approach to extract and label those describing new Android malware and containing Indicators of Compromise (IoC). The resulting dataset contains 4,534 unique Android malware samples (including both APKs and metadata) released from 2014 to April 2021, all of which were manually verified by security experts with detailed behavior analysis. We then characterize the MalRadar dataset in terms of malware distribution channels, app installation methods, malware activation, malicious behaviors, and anti-analysis techniques. We further investigate malware evolution over the last decade. Finally, we measure the effectiveness of commercial anti-virus engines and malware detection techniques at detecting the malware in MalRadar. Our dataset can serve as a representative Android malware benchmark for the new era, and our observations can positively contribute to the community and support a series of research studies on mobile security.
{"title":"MalRadar","authors":"Liu Wang, Haoyu Wang, Ren He, Ran Tao, Guozhu Meng, Xiapu Luo, Xuanzhe Liu","doi":"10.1145/3530906","DOIUrl":"https://doi.org/10.1145/3530906","url":null,"abstract":"Mobile malware detection has attracted massive research effort in our community. A reliable and up-to-date malware dataset is critical to evaluate the effectiveness of malware detection approaches. Essentially, the malware ground truth should be manually verified by security experts, and their malicious behaviors should be carefully labelled. Although there are several widely-used malware benchmarks in our community (e.g., MalGenome, Drebin, Piggybacking and AMD, etc.), these benchmarks face several limitations including out-of-date, size, coverage, and reliability issues, etc. In this paper, we first make efforts to create MalRadar, a growing and up-to-date Android malware dataset using the most reliable way, i.e., by collecting malware based on the analysis reports of security experts. We have crawled all the mobile security related reports released by ten leading security companies, and used an automated approach to extract and label the useful ones describing new Android malware and containing Indicators of Compromise (IoC) information. We have successfully compiled MalRadar, a dataset that contains 4,534 unique Android malware samples (including both apks and metadata) released from 2014 to April 2021 by the time of this paper, all of which were manually verified by security experts with detailed behavior analysis. Then we characterize the MalRadar dataset from malware distribution channels, app installation methods, malware activation, malicious behaviors and anti-analysis techniques. We further investigate the malware evolution over the last decade. At last, we measure the effectiveness of commercial anti-virus engines and malware detection techniques on detecting malware in MalRadar. Our dataset can be served as the representative Android malware benchmark in the new era, and our observations can positively contribute to the community and boost a series of research studies on mobile security.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123782397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning To Maximize Welfare with a Reusable Resource
Matthew Faw, O. Papadigenopoulos, C. Caramanis, S. Shakkottai
DOI: 10.1145/3530893 (2022-05-26)
Considerable work has focused on optimal stopping problems in which random i.i.d. offers arrive sequentially for a single resource controlled by the decision-maker. After viewing the realization of an offer, the decision-maker either irrevocably rejects it or accepts it, collecting the reward and ending the game. We consider an important extension of this model to a dynamic setting where the resource is "renewable" (a rental, a work assignment, or a temporary position) and can be allocated again after a delay period d. When the reward distribution is known a priori, we design an (asymptotically optimal) 1/2-competitive prophet inequality, namely, a policy that collects in expectation at least half of the expected reward collected by a prophet who knows all the realizations in advance. This policy has a particularly simple characterization as a thresholding rule, which depends on the reward distribution and the blocking period d and arises naturally from an LP relaxation of the prophet's optimal solution. Moreover, it provides the key to extending to the case of unknown distributions: there, we construct a dynamic threshold rule using the reward samples collected while the resource is not blocked. We provide a regret guarantee for our algorithm against the best policy in hindsight and prove a complementary minimax lower bound on the best achievable regret, establishing that our policy achieves, up to poly-logarithmic factors, the best possible regret in this setting.
{"title":"Learning To Maximize Welfare with a Reusable Resource","authors":"Matthew Faw, O. Papadigenopoulos, C. Caramanis, S. Shakkottai","doi":"10.1145/3530893","DOIUrl":"https://doi.org/10.1145/3530893","url":null,"abstract":"Considerable work has focused on optimal stopping problems where random IID offers arrive sequentially for a single available resource which is controlled by the decision-maker. After viewing the realization of the offer, the decision-maker irrevocably rejects it, or accepts it, collecting the reward and ending the game. We consider an important extension of this model to a dynamic setting where the resource is \"renewable'' (a rental, a work assignment, or a temporary position) and can be allocated again after a delay period d. In the case where the reward distribution is known a priori, we design an (asymptotically optimal) 1/2-competitive Prophet Inequality, namely, a policy that collects in expectation at least half of the expected reward collected by a prophet who a priori knows all the realizations. This policy has a particularly simple characterization as a thresholding rule which depends on the reward distribution and the blocking period d, and arises naturally from an LP-relaxation of the prophet's optimal solution. Moreover, it gives the key for extending to the case of unknown distributions; here, we construct a dynamic threshold rule using the reward samples collected when the resource is not blocked. We provide a regret guarantee for our algorithm against the best policy in hindsight, and prove a complementing minimax lower bound on the best achievable regret, establishing that our policy achieves, up to poly-logarithmic factors, the best possible regret in this setting.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"427 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132630987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Enterprise-Grade Open-Source Data Reduction Architecture for All-Flash Storage Systems
M. Ajdari, Patrick Raaf, Mostafa Kishani, Reza Salkhordeh, H. Asadi, A. Brinkmann
DOI: 10.1145/3530896 (2022-05-26)
All-flash storage (AFS) systems have become an essential infrastructure component to support enterprise applications, where sub-millisecond latency and very high throughput are required. Nevertheless, the price per capacity of solid-state drives (SSDs) is relatively high, which has encouraged system architects to adopt data reduction techniques, mainly deduplication and compression, in enterprise storage solutions. To provide higher reliability and performance, SSDs are typically grouped using redundant array of independent disks (RAID) configurations. Data reduction on top of RAID arrays, however, adds I/O overheads and also complicates the I/O patterns redirected to the underlying backend SSDs, which invalidates the best-practice configurations used in AFS. Unfortunately, existing works on the performance of data reduction do not consider its interaction and I/O overheads with other enterprise storage components, including SSD arrays and RAID controllers. In this paper, using a real setup with enterprise-grade components and the open-source data reduction module RedHat VDO, we reveal novel observations on the performance gap between the state-of-the-art and the optimal all-flash storage stack with integrated data reduction. We then explore the I/O patterns at the storage entry point and compare them with those at the disk subsystem. Our analysis shows a significant amount of I/O overhead for guaranteeing consistency and avoiding data loss through data journaling, frequent small-sized metadata updates, and duplicate content verification. We accompany these observations with cross-layer optimizations to enhance the performance of AFS, ranging from deriving new optimal hardware RAID configurations to introducing changes to the enterprise storage stack. By analyzing the characteristics of the I/O types and their overheads, we propose three techniques: (a) application-aware lazy persistence, (b) a fast, read-only I/O cache for duplicate verification, and (c) disaggregation of block maps and data by offloading block maps to a very fast persistent memory device. By consolidating all proposed optimizations and implementing them in an enterprise AFS, we show a 1.3× to 12.5× speedup over the baseline AFS with 90% data reduction, and a 7.8× to 57× performance/cost improvement over an optimized AFS (with no data reduction) running applications ranging from 100% read-only to 100% write-only accesses.
{"title":"An Enterprise-Grade Open-Source Data Reduction Architecture for All-Flash Storage Systems","authors":"M. Ajdari, Patrick Raaf, Mostafa Kishani, Reza Salkhordeh, H. Asadi, A. Brinkmann","doi":"10.1145/3530896","DOIUrl":"https://doi.org/10.1145/3530896","url":null,"abstract":"All-flash storage (AFS) systems have become an essential infrastructure component to support enterprise applications, where sub-millisecond latency and very high throughput are required. Nevertheless, the price per capacity ofsolid-state drives (SSDs) is relatively high, which has encouraged system architects to adoptdata reduction techniques, mainlydeduplication andcompression, in enterprise storage solutions. To provide higher reliability and performance, SSDs are typically grouped usingredundant array of independent disk (RAID) configurations. Data reduction on top of RAID arrays, however, adds I/O overheads and also complicates the I/O patterns redirected to the underlying backend SSDs, which invalidates the best-practice configurations used in AFS. Unfortunately, existing works on the performance of data reduction do not consider its interaction and I/O overheads with other enterprise storage components including SSD arrays and RAID controllers. In this paper, using a real setup with enterprise-grade components and based on the open-source data reduction module RedHat VDO, we reveal novel observations on the performance gap between the state-of-the-art and the optimal all-flash storage stack with integrated data reduction. We therefore explore the I/O patterns at the storage entry point and compare them with those at the disk subsystem. Our analysis shows a significant amount of I/O overheads for guaranteeing consistency and avoiding data loss through data journaling, frequent small-sized metadata updates, and duplicate content verification. We accompany these observations with cross-layer optimizations to enhance the performance of AFS, which range from deriving new optimal hardware RAID configurations up to introducing changes to the enterprise storage stack. By analyzing the characteristics of I/O types and their overheads, we propose three techniques: (a) application-aware lazy persistence, (b) a fast, read-only I/O cache for duplicate verification, and (c) disaggregation of block maps and data by offloading block maps to a very fast persistent memory device. By consolidating all proposed optimizations and implementing them in an enterprise AFS, we show 1.3× to 12.5× speedup over the baseline AFS with 90% data reduction, and from 7.8× up to 57× performance/cost improvement over an optimized AFS (with no data reduction) running applications ranging from 100% read-only to 100% write-only accesses.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120953241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Detailed Look at MIMO Performance in 60 GHz WLANs
Shivang Aggarwal, Srisai Karthik Neelamraju, Ajit Bhat, Dimitrios Koutsonikolas
DOI: 10.1145/3530904 (2022-05-26)
One of the key enhancements in the upcoming 802.11ay standard for 60 GHz WLANs is the support for simultaneous transmission of up to 8 data streams via SU- and MU-MIMO, which has the potential to enable data rates up to 100 Gbps. However, in spite of the key role MIMO is expected to play in 802.11ay, experimental evaluation of MIMO performance in 60 GHz WLANs has been limited to date, primarily due to the lack of hardware supporting MIMO transmissions at millimeter wave frequencies. In this work, we fill this gap by conducting the first large-scale experimental evaluation of SU- and MU-MIMO performance in 60 GHz WLANs. Unlike previous studies, our study spans multiple environments with very different multipath characteristics. We analyze the performance in each environment, identify the factors that affect it, and compare it against the performance of SISO. Further, we seek to identify factors that can guide beam and user selection to limit the (often prohibitive in practice) overhead of exhaustive search. Finally, we propose two heuristics that perform both user and beam selection with low overhead, and show that they perform close to an oracle solution and outperform previously proposed approaches in both static and mobile scenarios, regardless of the environment and the number of users.
{"title":"A Detailed Look at MIMO Performance in 60 GHz WLANs","authors":"Shivang Aggarwal, Srisai Karthik Neelamraju, Ajit Bhat, Dimitrios Koutsonikolas","doi":"10.1145/3530904","DOIUrl":"https://doi.org/10.1145/3530904","url":null,"abstract":"One of the key enhancements in the upcoming 802.11ay standard for 60 GHz WLANs is the support for simultaneous transmissions of up to 8 data streams via SU- and MU-MIMO, which has the potential to enable data rates up to 100 Gbps. However, in spite of the key role MIMO is expected to play in 802.11ay, experimental evaluation of MIMO performance in 60 GHz WLANs has been limited to date, primarily due to lack of hardware supporting MIMO transmissions at millimeter wave frequencies. In this work, we fill this gap by conducting the first large-scale experimental evaluation of SU- and MU-MIMO performance in 60 GHz WLANs. Unlike previous studies, our study involves multiple environments with very different multipath characteristics. We analyze the performance in each environment, identify the factors that affect it, and compare it against the performance of SISO. Further, we seek to identify factors that can guide beam and user selection to limit the (often prohibitive in practice) overhead of exhaustive search. Finally, we propose two heuristics that perform both user and beam selection with low overhead, and show that they perform close to an Oracle solution and outperform previously proposed approaches in both static and mobile scenarios, regardless of the environment and number of users.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114361484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Comparative Analysis of Ookla Speedtest and Measurement Labs Network Diagnostic Test (NDT7)
Kyle MacMillan, Tarun Mangla, J. Saxon, Nicole P. Marwell, N. Feamster
DOI: 10.1145/3579448 (2022-05-24)
Consumers, regulators, and ISPs all use client-based "speed tests" to measure network performance, both in single-user settings and in aggregate. Two prevalent speed tests, Ookla's Speedtest and Measurement Lab's Network Diagnostic Test (NDT), are often used for similar purposes, despite having significant differences in both the test design and implementation, and in the infrastructure used to perform measurements. In this paper, we present the first-ever comparative evaluation of Ookla and NDT7 (the latest version of NDT), both in controlled and wide-area settings. Our goal is to characterize when and to what extent these two speed tests yield different results, as well as the factors that contribute to the differences. To study the effects of the test design, we conduct a series of controlled, in-lab experiments under a comprehensive set of network conditions and usage modes (e.g., TCP congestion control, native vs. browser client). Our results show that Ookla and NDT7 report similar speeds under most in-lab conditions, with the exception of networks that experience high latency, where Ookla consistently reports higher throughput. To characterize the behavior of these tools in wide-area deployment, we collect more than 80,000 pairs of Ookla and NDT7 measurements across nine months and 126 households, with a range of ISPs and speed tiers. This first-of-its-kind paired-test analysis reveals many previously unknown systemic issues, including high variability in NDT7 test results and systematically under-performing servers in the Ookla network.
{"title":"A Comparative Analysis of Ookla Speedtest and Measurement Labs Network Diagnostic Test (NDT7)","authors":"Kyle MacMillan, Tarun Mangla, J. Saxon, Nicole P. Marwell, N. Feamster","doi":"10.1145/3579448","DOIUrl":"https://doi.org/10.1145/3579448","url":null,"abstract":"Consumers, regulators, and ISPs all use client-based \"speed tests\" to measure network performance, both in single-user settings and in aggregate. Two prevalent speed tests, Ookla's Speedtest and Measurement Lab's Network Diagnostic Test (NDT), are often used for similar purposes, despite having significant differences in both the test design and implementation, and in the infrastructure used to perform measurements. In this paper, we present the first-ever comparative evaluation of Ookla and NDT7 (the latest version of NDT), both in controlled and wide-area settings. Our goal is to characterize when and to what extent these two speed tests yield different results, as well as the factors that contribute to the differences. To study the effects of the test design, we conduct a series of controlled, in-lab experiments under a comprehensive set of network conditions and usage modes (e.g., TCP congestion control, native vs. browser client). Our results show that Ookla and NDT7 report similar speeds under most in-lab conditions, with the exception of networks that experience high latency, where Ookla consistently reports higher throughput. To characterize the behavior of these tools in wide-area deployment, we collect more than 80,000 pairs of Ookla and NDT7 measurements across nine months and 126 households, with a range of ISPs and speed tiers. This first-of-its-kind paired-test analysis reveals many previously unknown systemic issues, including high variability in NDT7 test results and systematically under-performing servers in the Ookla network.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130160122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Strategic Latency Reduction in Blockchain Peer-to-Peer Networks
Weizhao Tang, Lucianna Kiffer, G. Fanti, A. Juels
DOI: 10.1145/3589976 (2022-05-13)
Most permissionless blockchain networks run on peer-to-peer (P2P) networks, which offer flexibility and decentralization at the expense of performance (e.g., network latency). Historically, this tradeoff has not been a bottleneck for most blockchains. However, an emerging host of blockchain-based applications (e.g., decentralized finance) are increasingly sensitive to latency; users who can reduce their network latency relative to other users can accrue (sometimes significant) financial gains. In this work, we initiate the study of strategic latency reduction in blockchain P2P networks. We first define two classes of latency that are of interest in blockchain applications. We then show empirically that a strategic agent who controls only their local peering decisions can manipulate both types of latency, achieving 60% of the global latency gains provided by the centralized, paid service bloXroute, or, in targeted scenarios, comparable gains. Finally, we show that our results are not due to the poor design of existing P2P networks. Under a simple network model, we theoretically prove that an adversary can always manipulate the P2P network's latency to their advantage, provided the network experiences sufficient peer churn and transaction activity.
{"title":"Strategic Latency Reduction in Blockchain Peer-to-Peer Networks","authors":"Weizhao Tang, Lucianna Kiffer, G. Fanti, A. Juels","doi":"10.1145/3589976","DOIUrl":"https://doi.org/10.1145/3589976","url":null,"abstract":"Most permissionless blockchain networks run on peer-to-peer (P2P) networks, which offer flexibility and decentralization at the expense of performance (e.g., network latency). Historically, this tradeoff has not been a bottleneck for most blockchains. However, an emerging host of blockchain-based applications (e.g., decentralized finance) are increasingly sensitive to latency; users who can reduce their network latency relative to other users can accrue (sometimes significant) financial gains. In this work, we initiate the study of strategic latency reduction in blockchain P2P networks. We first define two classes of latency that are of interest in blockchain applications. We then show empirically that a strategic agent who controls only their local peering decisions can manipulate both types of latency, achieving 60% of the global latency gains provided by the centralized, paid service bloXroute, or, in targeted scenarios, comparable gains. Finally, we show that our results are not due to the poor design of existing P2P networks. Under a simple network model, we theoretically prove that an adversary can always manipulate the P2P network's latency to their advantage, provided the network experiences sufficient peer churn and transaction activity.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"460 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123022920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FuncPipe: A Pipelined Serverless Framework for Fast and Cost-Efficient Training of Deep Learning Models
Yunzhuo Liu, Bo Jiang, Tian Guo, Zimeng Huang, Wen-ping Ma, Xinbing Wang, Chenghu Zhou
DOI: 10.1145/3570607 (2022-04-28)
Training deep learning (DL) models in the cloud has become the norm. With the emergence of serverless computing and its benefits of true pay-as-you-go pricing and scalability, systems researchers have recently started to provide support for serverless-based training. However, the ability to train DL models on serverless platforms is hindered by the resource limitations of today's serverless infrastructure and by DL models' rapidly growing memory and bandwidth requirements. This paper describes FuncPipe, a novel pipelined training framework specifically designed for serverless platforms that enables fast and low-cost training of DL models. FuncPipe is built on the key insight that model partitioning can be leveraged to bridge both the memory and bandwidth gaps between the capacity of serverless functions and the requirements of DL training. While conceptually simple, this design requires answering several questions, including how to partition the model, how to configure each serverless function, and how to exploit each function's uplink/downlink bandwidth. In particular, we tailor a micro-batch scheduling policy to the serverless environment, which serves as the basis for the subsequent optimization. Our mixed-integer quadratic programming formulation automatically and simultaneously configures serverless resources and partitions models to fit within the resource constraints. Lastly, we improve the bandwidth efficiency of storage-based synchronization with a novel pipelined scatter-reduce algorithm. We implement FuncPipe on two popular cloud serverless platforms and show that it achieves 7%-77% cost savings and a 1.3X-2.2X speedup compared to state-of-the-art serverless-based frameworks.
{"title":"FuncPipe: A Pipelined Serverless Framework for Fast and Cost-Efficient Training of Deep Learning Models","authors":"Yunzhuo Liu, Bo Jiang, Tian Guo, Zimeng Huang, Wen-ping Ma, Xinbing Wang, Chenghu Zhou","doi":"10.1145/3570607","DOIUrl":"https://doi.org/10.1145/3570607","url":null,"abstract":"Training deep learning (DL) models in the cloud has become a norm. With the emergence of serverless computing and its benefits of true pay-as-you-go pricing and scalability, systems researchers have recently started to provide support for serverless-based training. However, the ability to train DL models on serverless platforms is hindered by the resource limitations of today's serverless infrastructure and DL models' explosive requirement for memory and bandwidth. This paper describes FuncPipe, a novel pipelined training framework specifically designed for serverless platforms that enable fast and low-cost training of DL models. FuncPipe is designed with the key insight that model partitioning can be leveraged to bridge both memory and bandwidth gaps between the capacity of serverless functions and the requirement of DL training. Conceptually simple, we have to answer several design questions, including how to partition the model, configure each serverless function, and exploit each function's uplink/downlink bandwidth. In particular, we tailor a micro-batch scheduling policy for the serverless environment, which serves as the basis for the subsequent optimization. Our Mixed-Integer Quadratic Programming formulation automatically and simultaneously configures serverless resources and partitions models to fit within the resource constraints. Lastly, we improve the bandwidth efficiency of storage-based synchronization with a novel pipelined scatter-reduce algorithm. We implement FuncPipe on two popular cloud serverless platforms and show that it achieves 7%-77% cost savings and 1.3X-2.2X speedup compared to state-of-the-art serverless-based frameworks.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130279989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}