
ACM SIGOPS Oper. Syst. Rev.: Latest Publications

Dependability issues in cloud computing: extended papers from the 1st international workshop on dependability issues in cloud computing -- DISCCO
Pub Date: 2013-07-23 DOI: 10.1145/2506164.2506169
M. Correia, N. Mittal
Cloud computing has recently emerged as a popular paradigm for deploying, managing and delivering a variety of services using a shared infrastructure [1]. The services offered through clouds range from simple data storage to end-to-end management of business processes. Many companies and even governments are adopting the cloud as a solution to reduce costs and improve the quality of service. The present issue of the Operating Systems Review is dedicated to the dependability – reliability, fault tolerance, availability, security – of cloud computing. The issue contains extended papers from the First International Workshop on Dependability Issues in Cloud Computing (DISCCO), held in Irvine, California, USA, in October 2012, in conjunction with the 31st IEEE International Symposium on Reliable Distributed Systems. Three of the seven papers presented were selected based on the timeliness of their subjects and the comments and scores of the reviewers.
Citations: 0
Multi-core systems modeling for formal verification of parallel algorithms
Pub Date: 2013-07-23 DOI: 10.1145/2506164.2506174
M. Desnoyers, P. McKenney, M. Dagenais
Modeling parallel algorithms at the architecture level enables exploring side-effects of the weakly ordered nature of modern processors. Formal verification of such models with model checking can ensure that algorithm guarantees will hold even in the presence of the most aggressive compiler and processor optimizations. This paper proposes a virtual architecture to model the effects of such optimizations. It first presents the OoOmem framework to model out-of-order memory accesses. It then presents the OoOisched framework to model the effects of out-of-order instruction scheduling. These two frameworks are explained and tested using weakly-ordered memory interaction scenarios known to be affected by weak ordering. Then, modeling of user-level RCU (Read-Copy Update) synchronization algorithms is presented. It uses the proposed virtual architecture to verify that the RCU guarantees are indeed respected.
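The weak-ordering effects these frameworks target can be illustrated with a small exhaustive-enumeration sketch (a hypothetical Python analogue of model checking, not the paper's actual frameworks): in the classic store-buffering litmus test, the outcome r0 == r1 == 0 is impossible under sequential consistency but becomes reachable once a store may be delayed past the same thread's subsequent load, as happens with store buffers.

```python
from itertools import permutations

# Classic store-buffering litmus test:
#   Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
OPS = [(0, "st_x"), (0, "ld_y"), (1, "st_y"), (1, "ld_x")]

def outcomes(reorder_allowed):
    """Enumerate (r0, r1) results over all interleavings of OPS.
    With reorder_allowed=False each thread's store must precede its
    own load (sequential consistency); with True a store may be
    delayed past the load, as with a store buffer."""
    results = set()
    for seq in set(permutations(OPS)):
        if not reorder_allowed:
            # reject interleavings that violate per-thread program order
            if any([op for th, op in seq if th == t][0].startswith("ld")
                   for t in (0, 1)):
                continue
        x = y = 0
        regs = {}
        for th, op in seq:
            if op == "st_x":
                x = 1
            elif op == "st_y":
                y = 1
            elif op == "ld_y":
                regs["r0"] = y
            elif op == "ld_x":
                regs["r1"] = x
        results.add((regs["r0"], regs["r1"]))
    return results

sc = outcomes(reorder_allowed=False)    # sequentially consistent outcomes
weak = outcomes(reorder_allowed=True)   # outcomes once stores may be delayed
```

Under sequential consistency the only reachable outcomes are (0,1), (1,0) and (1,1); allowing the reorder adds (0,0), which is exactly the kind of surprise a formal model must capture.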
Citations: 14
Our troubles with Linux Kernel upgrades and why you should care
Pub Date: 2013-07-23 DOI: 10.1145/2506164.2506175
Ashif S. Harji, P. Buhr, Tim Brecht
Linux and other open-source Unix variants (and their distributors) provide researchers with full-fledged, widely used operating systems. However, due to their complexity and rapid development, care should be exercised when using these operating systems for performance experiments, especially in systems research. In particular, the size and continual evolution of the Linux code base make it difficult to understand and, as a result, to decipher and explain the reasons for performance improvements. In addition, the rapid kernel development cycle means that experimental results can be viewed as out of date, or meaningless, very quickly. We demonstrate that this viewpoint is incorrect because kernel changes can and have introduced both bugs and performance degradations. This paper describes some of our experiences using Linux and FreeBSD as platforms for conducting performance evaluations, and some performance regressions we have found. Our results show that these performance regressions can be serious (e.g., repeating identical experiments yields large variability in results) and long-lived despite having a large negative effect on performance (one problem was present for more than three years). Based on these experiences, we argue that it is sometimes reasonable to use an older kernel version, that experimental results need careful analysis to explain why a performance effect occurs, and that publishing papers validating prior research is essential.
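The run-to-run variability the authors warn about can be checked mechanically before comparing kernel versions. A minimal sketch (hypothetical throughput numbers and threshold, not from the paper) flags configurations whose repeated identical runs disagree by more than a few percent:

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """Sample standard deviation relative to the mean."""
    return stdev(samples) / mean(samples)

def flag_unstable(runs, threshold=0.05):
    """Return configurations whose repeated runs vary by more than
    `threshold`; comparing such numbers across kernel versions is
    meaningless without further analysis."""
    flagged = {}
    for name, samples in runs.items():
        cv = coefficient_of_variation(samples)
        if cv > threshold:
            flagged[name] = cv
    return flagged

# Hypothetical throughput numbers (requests/s) from repeated, identical runs.
runs = {
    "kernel-A": [10000, 10050, 9980, 10020],
    "kernel-B": [10100, 8200, 11900, 9500],
}
unstable = flag_unstable(runs)
```

Here kernel-B's runs vary by roughly 15%, so any single-run comparison against kernel-A would be unreliable; kernel-A's sub-1% spread passes.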
Citations: 9
Dagstuhl seminar report: security and dependability for federated cloud platforms, 2012
Pub Date: 2013-07-23 DOI: 10.1145/2506164.2506166
A. Shraer, R. Kapitza
The Security and Dependability for Federated Cloud Platforms seminar [3] was held in Schloss Dagstuhl, July 8-13, 2012. Schloss Dagstuhl, also known as the Leibniz-Zentrum für Informatik, is a renovated castle located in the scenic countryside of Saarland, Germany. Dagstuhl offers a unique concept: 30-45 participants, all of whom receive invitations from Dagstuhl on behalf of the organizers, stay in the castle during the seminar (typically 3-5 days), enjoying all that the castle has to offer. Amongst other things, this includes an impressive library, a music room full of musical instruments, an excellent restaurant, as well as a wine cellar where a variety of cheese, wine and local beer is available daily for the late-evening social meetings. The organizers of our seminar, Matthias Schunter, Marc Shapiro, Paulo Verissimo and Michael Waidner, targeted a four-day event and gathered a mixed group of senior, established and promising young researchers from all over the world. The program of the seminar was not set in advance, but most participants provided an abstract [3] and gave short talks on recent or ongoing work. The main purpose of these talks was generating discussion and collaboration among the participants.
Citations: 0
Boosting energy efficiency with mirrored data block replication policy and energy scheduler
Pub Date: 2013-07-23 DOI: 10.1145/2506164.2506171
Sara Arbab Yazd, S. Venkatesan, N. Mittal
Energy efficiency is one of the major challenges in big datacenters. To facilitate processing of large data sets in a distributed fashion, the MapReduce programming model is employed in these datacenters. Hadoop is an open-source implementation of MapReduce that contains a distributed file system. The Hadoop Distributed File System provides a data block replication scheme to preserve reliability and data availability. The replicas of each data block are distributed over the nodes randomly, subject to some constraints (e.g., two replicas of a data block are never stored on a single node). This study exploits the flexibility in the data block placement policy to increase energy efficiency in datacenters. Furthermore, inspired by Zaharia et al.'s delay scheduling algorithm, a scheduling algorithm is introduced that takes energy efficiency into account in addition to fairness and data locality. Computer simulations of the proposed method suggest its superiority over Hadoop's standard settings.
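A minimal sketch of the placement idea described above (hypothetical helper names; a simplification, not the paper's actual policy): replicas of a block always go to distinct nodes, and placement prefers a designated active subset so the remaining nodes can be powered down.

```python
import random

def place_replicas(block_id, nodes, active_nodes, k=3):
    """Pick k distinct nodes for a block's replicas. The hard HDFS-style
    constraint is kept (no node stores two replicas of the same block,
    hence `random.sample`); on top of it, placement prefers a
    designated active subset so the remaining nodes can be put into a
    low-power state. Hypothetical simplification of an energy-aware,
    mirrored placement policy."""
    active = set(active_nodes)
    preferred = [n for n in nodes if n in active]
    pool = preferred if len(preferred) >= k else list(nodes)
    return random.sample(pool, k)   # k distinct nodes, never a duplicate

nodes = [f"node{i}" for i in range(10)]
active = nodes[:4]                  # concentrate replicas on 4 nodes
placement = place_replicas("blk-42", nodes, active)
```

With enough active nodes, every replica lands in the active subset, so the other six nodes hold no copy of this block and remain candidates for power-down.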
Citations: 16
Banking on decoupling: budget-driven sustainability for HPC applications on auction-based clouds
Pub Date: 2013-07-23 DOI: 10.1145/2506164.2506172
Moussa Taifi
Cloud providers are auctioning their excess capacity using dynamically priced virtual instances. These spot instances provide significant savings compared to on-demand or fixed-price instances. Users willing to use these resources are asked to provide a maximum bid price per hour, and the cloud provider runs the instances as long as the market price is below the user's bid price. By using such resources, users are explicitly exposed to failures and need to adapt their applications to provide some level of fault tolerance. In this paper, we expose the effect of bidding in the case of virtual HPC clusters composed of spot instances. We describe the interesting effect of uniform versus non-uniform bidding on both the failure rate and the failure model. We propose an initial attempt to deal with the problem of predicting the runtime of a parallel application under various bidding strategies and system parameters. We describe the relationship between bidding strategies and programming models, and we build a preliminary optimization model that uses real price traces from Amazon Web Services as inputs, as well as instrumented values related to the processing and network capacities of cluster instances on the EC2 service. Our results show preliminary insights into the relationship between non-uniform bidding and application scaling strategies.
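The bidding dynamics can be made concrete with a toy simulator (made-up prices and a simplified billing model, not the paper's optimization model; real spot billing has additional rules): the instance runs and is billed at the market price while that price stays at or below the bid, and every price spike above the bid terminates it.

```python
def simulate_spot(price_trace, bid):
    """Replay an hourly spot-price trace against a fixed bid: the
    instance runs, billed at the market price, whenever that price is
    at or below the bid; each spike above the bid kills it.
    Returns (hours_run, total_cost, interruptions)."""
    hours = interruptions = 0
    cost = 0.0
    running = False
    for price in price_trace:
        if price <= bid:
            hours += 1
            cost += price
            running = True
        else:
            if running:
                interruptions += 1
            running = False
    return hours, cost, interruptions

# Made-up price trace ($/hour) with two spikes above a $0.15 bid.
trace = [0.10, 0.12, 0.35, 0.11, 0.11, 0.40, 0.09]
hours, cost, interruptions = simulate_spot(trace, bid=0.15)
```

On this trace a $0.15 bid buys five billable hours at well under the on-demand rate, but the application must survive two interruptions, which is exactly the fault-tolerance trade-off the paper studies.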
Citations: 6
Bridging the gap between applications and networks in data centers
Pub Date: 2013-01-29 DOI: 10.1145/2433140.2433143
Paolo Costa
Modern data centers host tens (if not hundreds) of thousands of servers and are used by companies such as Amazon, Google, and Microsoft to provide online services to millions of individuals distributed across the Internet. They use commodity hardware, and their network infrastructure adopts principles evolved from enterprise and Internet networking. Applications use UDP datagrams or TCP sockets as the primary interface to other applications running inside the data center. This effectively isolates the network from the end-systems, which then have little control over how the network handles packets. Likewise, the network has limited visibility into the application logic. An application injects a packet with a destination address, and the network just delivers the packet. Network and applications effectively treat each other as black boxes. This strict separation between applications and networks (also referred to as the dumb network) is a direct outcome of the so-called end-to-end argument [49] and has arguably been one of the main reasons why the Internet has been capable of evolving from a small research project to planetary scale, supporting a multitude of different hardware and network technologies as well as a slew of very diverse applications, and using networks owned by competing ISPs. Despite being so instrumental in the success of the Internet, this black-box design is also one of the root causes of inefficiencies in large-scale data centers. Given the little control and visibility over network resources, applications need to use low-level hacks, e.g., to extract network properties (e.g., using traceroute and IP addresses to infer the network topology) and to prioritize traffic (e.g., increasing the number of TCP flows used by an application to increase its bandwidth share). Further, even simple functionality like multicast or anycast routing is not available, and developers must resort to application-level overlays. This, however, leads to inefficiencies, as multiple logical links are typically mapped to the same physical link, significantly reducing application throughput. Even with perfect knowledge of the underlying topology, there is still the constraint that servers
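The overlay inefficiency described above can be quantified with a toy "link stress" computation (hypothetical star topology and multicast tree, purely illustrative): counting how many overlay edges cross each physical link shows the source's uplink carrying multiple copies of the same data.

```python
from collections import Counter

def physical_links(src, dst):
    """Toy star topology: every server hangs off one core switch, so
    the physical path between two servers is an uplink plus a downlink."""
    return [(src, "core"), ("core", dst)]

# Application-level multicast tree: s1 forwards to s2 and s3, s2 to s4.
overlay_edges = [("s1", "s2"), ("s1", "s3"), ("s2", "s4")]

# Link stress: how many overlay edges cross each physical link.
stress = Counter(link
                 for edge in overlay_edges
                 for link in physical_links(*edge))
```

Here s1's uplink has stress 2: the same payload crosses it twice, once per overlay child, which is precisely the throughput loss that native in-network multicast would avoid.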
Citations: 14
Toward accurate and practical network tomography
Pub Date: 2013-01-29 DOI: 10.1145/2433140.2433146
Denisa Ghita, K. Argyraki, Patrick Thiran
Troubleshooting large networks is hard; when an end-user complains that she has "network problems," there is typically a large number of possible causes. For example, the end-user's own machine may be damaged, misconfigured, or compromised; a network element that handles her traffic may be congested or malfunctioning; or the destination she is trying to reach may be filtering her traffic. To diagnose such problems, a network operator normally has to probe the network's elements to collect relevant statistics, like packet loss or bandwidth utilization. The challenge, though, is that the network operator often does not have direct access to all the suspected network elements, hence cannot probe them: e.g., the operator of an edge network does not have access to the equipment of her Internet service provider (ISP). Network tomography is an elegant approach to network troubleshooting: just as medical tomography observes an organ from different vantage points and combines the observations to get knowledge of the organ's internals (without dissecting it), so does network tomography observe the characteristics of different end-to-end network paths and combine the observations to infer the characteristics of individual network links (without probing them). This approach is applicable in scenarios where one needs to monitor the behavior and performance of a network without having direct access to its elements. For instance, the operators of edge networks could use network tomography to monitor the behavior and performance of their ISPs; an ISP operator could use it to monitor the behavior and performance of its peers. However, there are reasons to be skeptical about the usefulness of network tomography in practice. Even though it was invented more than 10 years ago and is still a topic of active research, it has not seen any real deployment. We believe the reason is that existing tomography algorithms make certain simplifying assumptions that do not always hold in a real network, which means that the algorithms' results may be inaccurate. Most importantly, there is no way to determine the extent of this inaccuracy. In other words, today there is no way for a network operator who employs tomography for network troubleshooting to compute the certainty of its diagnosis.
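The core tomography idea, combining path measurements to infer per-link characteristics, reduces for independent link losses to a linear system in log space, since a path's delivery rate is the product of its links' rates. A toy sketch with three links and three paths (made-up rates; exact path rates stand in for noisy measurements, which is itself one of the simplifying assumptions the abstract criticizes):

```python
import math

# Toy topology: each path is the set of links it traverses.
paths = {"p1": ("l1", "l2"), "p2": ("l1", "l3"), "p3": ("l2", "l3")}
true_rate = {"l1": 0.99, "l2": 0.90, "l3": 0.95}   # per-link delivery rates

# End-to-end delivery rate of a path = product of its link rates
# (independence assumption), so logarithms add along the path.
path_rate = {p: math.prod(true_rate[l] for l in links)
             for p, links in paths.items()}

# Three equations, three unknowns in log space:
#   log p1 = x1 + x2,  log p2 = x1 + x3,  log p3 = x2 + x3
b1, b2, b3 = (math.log(path_rate[p]) for p in ("p1", "p2", "p3"))
x1 = (b1 + b2 - b3) / 2
inferred = {"l1": math.exp(x1),
            "l2": math.exp(b1 - x1),
            "l3": math.exp(b2 - x1)}
```

With exact measurements the system recovers each link rate; the practical difficulty the paper points to is that real measurements are noisy and real topologies make the system under-determined, so the inferred rates carry an unquantified error.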
Citations: 15
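The core inference the abstract describes — combining end-to-end path observations to recover per-link characteristics — can be sketched as a linear system: assuming independent link losses, the log of a path's success rate is the sum of the logs of its links' success rates. This is a minimal sketch under those assumptions; the topology, rates, and function names are illustrative, not taken from the paper.

```python
import math

def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def infer_link_success(paths, path_success):
    """paths: list of link-index lists; path_success: measured end-to-end
    success rates. Returns per-link success rates, assuming independent
    losses and an invertible routing matrix."""
    n_links = 1 + max(l for p in paths for l in p)
    A = [[1.0 if l in p else 0.0 for l in range(n_links)] for p in paths]
    b = [math.log(s) for s in path_success]  # products become sums in log space
    return [math.exp(v) for v in solve_linear(A, b)]

# Three vantage points whose paths cover three shared links.
paths = [[0, 1], [0, 2], [1, 2]]
measured = [0.99 * 0.95, 0.99 * 0.90, 0.95 * 0.90]  # simulated measurements
print(infer_link_success(paths, measured))  # ≈ [0.99, 0.95, 0.90]
```

Real deployments face noisy measurements and underdetermined routing matrices, which is exactly where the inaccuracy the authors discuss creeps in; the abstract's point is that today an operator has no way to quantify that uncertainty.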
A framework to compute statistics of system parameters from very large trace files
Pub Date : 2013-01-29 DOI: 10.1145/2433140.2433151
Naser Ezzati-Jivan, M. Dagenais
In this paper, we present a framework to compute, store and retrieve statistics of various system metrics from large traces in an efficient way. The proposed framework allows for rapid interactive queries about system metrics values for any given time interval. In the proposed framework, efficient data structures and algorithms are designed to achieve a reasonable query time while utilizing less disk space. A parameter termed granularity degree (GD) is defined to determine the threshold of how often it is required to store the precomputed statistics on disk. The solution supports the hierarchy of system resources and also different granularities of time ranges. We explain the architecture of the framework and show how it can be used to efficiently compute and extract the CPU usage and other system metrics. The importance of the framework and its different applications are shown and evaluated in this paper.
Citations: 17
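The checkpoint idea in the abstract — store precomputed statistics periodically, controlled by the granularity degree, then answer arbitrary interval queries quickly — can be sketched roughly as follows. The class name, event format, and the reading of `gd` as "events per checkpoint" are assumptions for illustration; the paper's framework operates on trace files and a richer hierarchy of system metrics.

```python
from bisect import bisect_right

class IntervalStats:
    """Checkpoint-based statistics over a stream of (timestamp, value) events.

    Every `gd` events we store a cumulative sum (a loose reading of the
    paper's "granularity degree"); a range query combines the two nearest
    checkpoints and replays at most `gd` events past each of them."""

    def __init__(self, events, gd=4):
        self.events = sorted(events)          # (timestamp, value) pairs
        self.gd = gd
        self.cp_times, self.cp_sums = [], []  # checkpoint timestamps / cumulative sums
        total = 0.0
        for i, (t, v) in enumerate(self.events):
            total += v
            if (i + 1) % gd == 0:
                self.cp_times.append(t)
                self.cp_sums.append(total)

    def _prefix(self, t):
        """Cumulative sum of values with timestamp <= t."""
        k = bisect_right(self.cp_times, t)          # last checkpoint at or before t
        total = self.cp_sums[k - 1] if k else 0.0
        start = k * self.gd                          # first event not covered by it
        for ts, v in self.events[start:]:
            if ts > t:
                break
            total += v
        return total

    def query(self, t0, t1):
        """Sum of values in the half-open interval (t0, t1]."""
        return self._prefix(t1) - self._prefix(t0)

# One event per time unit, each contributing 1.0 (e.g., a CPU-busy tick).
stats = IntervalStats([(i, 1.0) for i in range(20)], gd=4)
print(stats.query(4, 10))  # 6 events fall in (4, 10]
```

A query thus touches two checkpoints plus at most `gd` replayed events on each side, trading disk space (more checkpoints for smaller `gd`) against query time, as the abstract describes.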
Workshop report on LADIS 2012
Pub Date : 2013-01-29 DOI: 10.1145/2433140.2433142
D. Malkhi, R. V. Renesse
The 6th Workshop on Large-Scale Distributed Systems and Middleware was held July 18 and 19 on the island of Madeira, Portugal, co-located with the ACM Symposium on Principles of Distributed Computing (PODC). LADIS brings together researchers and professionals to discuss new trends and techniques in the distributed systems and middleware that surface in large-scale data centers, cloud computing, web services, and other important systems. This year, all LADIS contributions were by invitation only and underwent one round of reviews to assure quality and provide constructive feedback to the authors. Each paper received five reviews. As is tradition for LADIS, we also invited keynote speakers from academia and industry, and the keynote speakers were invited to provide abstracts. As in previous years, we invited the authors of four of the abstracts to provide full papers for a special ACM SIGOPS Operating Systems Review issue. These abstracts were selected based on rankings provided by the reviewers. The selected papers received three more detailed reviews each, and you see before you the revisions that resulted. Below, we provide a short report on the workshop itself. Scott Shenker (UC Berkeley and ICSI) opened the workshop with a keynote presentation on Software Defined Networking (SDN), held before a joint audience of LADIS and PODC participants. Scott described the current lack of natural abstractions in the network control plane and how SDN tries to address this shortcoming: the concept is to bring modularity and standardization to network control, simplifying management and encouraging experimentation. OpenFlow is a well-known instantiation of SDN. The keynote was followed by two SDN-related presentations on cloud networking. Paulo Costa of Imperial College London argued that the traditional separation between applications and networks has to be revisited for modern datacenters. He described his CamCube project, which has developed a programmable torus-shaped network for a datacenter, and he is now proposing a research agenda called NetworkAs-A-Service. The full paper is included in this issue. Theo
Citations: 1