Cloud computing has recently emerged as a popular paradigm for deploying, managing and delivering a variety of services using a shared infrastructure [1]. The services offered through clouds range from simple data storage to end-to-end management of business processes. Many companies and even governments are adopting the cloud as a solution to reduce costs and improve the quality of service. The present issue of the Operating Systems Review is dedicated to the dependability – reliability, fault tolerance, availability, security – of cloud computing. The issue contains extended papers from the First International Workshop on Dependability Issues in Cloud Computing (DISCCO), held in Irvine, California, USA, in October 2012, in conjunction with the 31st IEEE International Symposium on Reliable Distributed Systems. Three of the seven papers presented were selected based on the timeliness of their subject and the comments and scores of the reviewers.
{"title":"Dependability issues in cloud computing: extended papers from the 1st international workshop on dependability issues in cloud computing -- DISCCO","authors":"M. Correia, N. Mittal","doi":"10.1145/2506164.2506169","DOIUrl":"https://doi.org/10.1145/2506164.2506169","PeriodicalId":7046,"journal":{"name":"ACM SIGOPS Oper. Syst. Rev.","volume":"24 1","pages":"20-22"},"publicationDate":"2013-07-23","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"90325596"}
Modeling parallel algorithms at the architecture level enables exploring side-effects of the weakly ordered nature of modern processors. Formal verification of such models with model-checking can ensure that algorithm guarantees will hold even in the presence of the most aggressive compiler and processor optimizations. This paper proposes a virtual architecture to model the effects of such optimizations. It first presents the OoOmem framework to model out-of-order memory accesses. It then presents the OoOisched framework to model the effects of out-of-order instruction scheduling. These two frameworks are explained and tested using memory interaction scenarios known to be affected by weak ordering. Then, modeling of user-level RCU (Read-Copy Update) synchronization algorithms is presented. This modeling uses the proposed virtual architecture to verify that the RCU guarantees are indeed respected.
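To make the effect of weak ordering concrete, the following is a minimal brute-force sketch in Python, not the paper's OoOmem or OoOisched frameworks. It enumerates every schedule of the classic "store buffering" scenario (thread 0 runs `x = 1; r0 = y`, thread 1 runs `y = 1; r1 = x`) and collects the reachable register values. The reordering rule used here (a thread's two independent accesses may swap) is a deliberately simplified assumption, and all names are hypothetical.

```python
# Thread 0: x = 1; r0 = y      Thread 1: y = 1; r1 = x
PROGRAM = {
    0: [("store", "x"), ("load", "y")],
    1: [("store", "y"), ("load", "x")],
}


def thread_orders(ops, allow_reorder):
    """Orders in which one thread's two operations may take effect."""
    orders = [list(ops)]
    if allow_reorder:
        # Simplified weak model (assumption): independent accesses may swap.
        orders.append(list(reversed(ops)))
    return orders


def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(allow_reorder):
    """Set of reachable (r0, r1) pairs over all schedules."""
    results = set()
    for order0 in thread_orders(PROGRAM[0], allow_reorder):
        for order1 in thread_orders(PROGRAM[1], allow_reorder):
            tagged0 = [(0, op) for op in order0]
            tagged1 = [(1, op) for op in order1]
            for schedule in interleavings(tagged0, tagged1):
                memory = {"x": 0, "y": 0}
                regs = {}
                for tid, (kind, var) in schedule:
                    if kind == "store":
                        memory[var] = 1
                    else:
                        regs[tid] = memory[var]
                results.add((regs[0], regs[1]))
    return results


if __name__ == "__main__":
    print("in order: ", sorted(outcomes(False)))
    print("reordered:", sorted(outcomes(True)))
```

With in-order execution the pair `(0, 0)` is unreachable; once the accesses may be reordered it appears. A model checker explores the same kind of state space exhaustively, but over a far richer machine model than this toy.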
{"title":"Multi-core systems modeling for formal verification of parallel algorithms","authors":"M. Desnoyers, P. McKenney, M. Dagenais","doi":"10.1145/2506164.2506174","DOIUrl":"https://doi.org/10.1145/2506164.2506174","PeriodicalId":7046,"journal":{"name":"ACM SIGOPS Oper. Syst. Rev.","volume":"18 1","pages":"51-65"},"publicationDate":"2013-07-23","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"83630616"}
Linux and other open-source Unix variants (and their distributors) provide researchers with full-fledged operating systems that are widely used. However, due to their complexity and rapid development, care should be exercised when using these operating systems for performance experiments, especially in systems research. In particular, the size and continual evolution of the Linux code base make it difficult to understand and, as a result, to decipher and explain the reasons for performance improvements. In addition, the rapid kernel development cycle means that experimental results can be viewed as out of date, or meaningless, very quickly. We demonstrate that this viewpoint is incorrect because kernel changes can introduce, and have introduced, both bugs and performance degradations. This paper describes some of our experiences using Linux and FreeBSD as platforms for conducting performance evaluations and some performance regressions we have found. Our results show that these performance regressions can be serious (e.g., repeating identical experiments results in large variability in results) and long-lived despite having a large negative effect on performance (one problem was present for more than 3 years). Based on these experiences, we argue that it is sometimes reasonable to use an older kernel version, that experimental results need careful analysis to explain why a performance effect occurs, and that publishing papers validating prior research is essential.
{"title":"Our troubles with Linux Kernel upgrades and why you should care","authors":"Ashif S. Harji, P. Buhr, Tim Brecht","doi":"10.1145/2506164.2506175","DOIUrl":"https://doi.org/10.1145/2506164.2506175","PeriodicalId":7046,"journal":{"name":"ACM SIGOPS Oper. Syst. Rev.","volume":"12 11","pages":"66-72"},"publicationDate":"2013-07-23","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"72569576"}
The Security and Dependability for Federated Cloud Platforms seminar [3] was held in Schloss Dagstuhl, July 8-13, 2012. Schloss Dagstuhl, also known as the Leibniz-Zentrum für Informatik, is a renovated castle located in the scenic countryside of Saarland, Germany. Dagstuhl offers a unique concept: 30-45 participants, all of whom receive invitations from Dagstuhl on behalf of the organizers, stay in the castle during the seminar (typically 3-5 days), enjoying all that the castle has to offer. Among other things, this includes an impressive library, a music room full of musical instruments, an excellent restaurant, and a wine cellar where a variety of cheese, wine and local beer is available daily for the late-evening social meetings. The organizers of our seminar, Matthias Schunter, Marc Shapiro, Paulo Verissimo and Michael Waidner, targeted a four-day event and gathered a mixed group of senior, established and promising young researchers from all over the world. The program of the seminar was not set in advance, but most participants provided an abstract [3] and gave short talks on recent or ongoing work. The main purpose of these talks was to generate discussion and collaboration among the participants. During some
{"title":"Dagstuhl seminar report: security and dependability for federated cloud platforms, 2012","authors":"A. Shraer, R. Kapitza","doi":"10.1145/2506164.2506166","DOIUrl":"https://doi.org/10.1145/2506164.2506166","PeriodicalId":7046,"journal":{"name":"ACM SIGOPS Oper. Syst. Rev.","volume":"215 1","pages":"4-5"},"publicationDate":"2013-07-23","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"75591651"}
Energy efficiency is one of the major challenges in big datacenters. To facilitate processing of large data sets in a distributed fashion, the MapReduce programming model is employed in these datacenters. Hadoop is an open-source implementation of MapReduce that includes a distributed file system. The Hadoop Distributed File System provides a data block replication scheme to preserve reliability and data availability. The data block replicas are distributed over the nodes randomly, subject to some constraints (e.g., preventing storage of two replicas of a data block on a single node). This study makes use of the flexibility in the data block placement policy to increase energy efficiency in datacenters. Furthermore, inspired by Zaharia et al.'s delay scheduling algorithm, a scheduling algorithm is introduced that takes into account energy efficiency in addition to fairness and data locality. Computer simulations of the proposed method suggest its superiority over Hadoop's standard settings.
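The placement constraint mentioned above (no node holds two replicas of the same block) can be sketched in a few lines of Python. This is a simplified illustration only: it is neither Hadoop's actual placement code nor the mirrored policy proposed in the paper, and the function name, node names and replication factor are assumptions made for the example.

```python
import random


def place_replicas(block_ids, nodes, replication=3, seed=0):
    """Randomly assign each block to `replication` distinct nodes.

    Simplified sketch: real placement policies apply further
    constraints that are not modeled here.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    rng = random.Random(seed)  # fixed seed keeps the example repeatable
    placement = {}
    for block in block_ids:
        # sample() draws without replacement, so a node can never
        # receive two replicas of the same block.
        placement[block] = rng.sample(nodes, replication)
    return placement


if __name__ == "__main__":
    nodes = ["node%d" % i for i in range(6)]
    for block, holders in place_replicas(["b0", "b1", "b2"], nodes).items():
        print(block, "->", holders)
```

Because the choice among valid nodes is otherwise free, a policy can bias it (for example toward a subset of nodes) without violating the constraint, which is the kind of flexibility the study exploits.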
{"title":"Boosting energy efficiency with mirrored data block replication policy and energy scheduler","authors":"Sara Arbab Yazd, S. Venkatesan, N. Mittal","doi":"10.1145/2506164.2506171","DOIUrl":"https://doi.org/10.1145/2506164.2506171","PeriodicalId":7046,"journal":{"name":"ACM SIGOPS Oper. Syst. Rev.","volume":"541 1","pages":"33-40"},"publicationDate":"2013-07-23","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"75352112"}
Cloud providers are auctioning their excess capacity using dynamically priced virtual instances. These spot instances provide significant savings compared to on-demand or fixed price instances. The users willing to use these resources are asked to provide a maximum bid price per hour, and the cloud provider runs the instances as long as the market price is below the user's bid price. By using such resources, the users are exposed explicitly to failures, and need to adapt their applications to provide some level of fault tolerance. In this paper, we expose the effect of bidding in the case of virtual HPC clusters composed of spot instances. We describe the interesting effect of uniform versus non-uniform bidding in terms of both the failure rate and the failure model. We propose an initial attempt to deal with the problem of predicting the runtime of a parallel application under various bidding strategies and various system parameters. We describe the relationship between bidding strategies and programming models, and we build a preliminary optimization model that uses real price traces from Amazon Web Services as inputs, as well as instrumented values related to the processing and network capacities of cluster instances on the EC2 services. Our results show preliminary insights into the relationship between non-uniform bidding and application scaling strategies.
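The bidding rule described above (an instance runs only while the market price stays below its bid) lends itself to a small Python sketch that contrasts uniform and non-uniform bids across a virtual cluster. The hourly price trace and bid values below are invented for illustration and are not taken from the paper or from Amazon Web Services; the function names are hypothetical.

```python
def first_failure(prices, bid):
    """Return the first hour at which the market price is no longer
    below the bid (the instance is revoked), or None if it survives."""
    for hour, price in enumerate(prices):
        if price >= bid:
            return hour
    return None


def cluster_failures(prices, bids):
    """Revocation hour of each instance in a cluster, one bid per node."""
    return [first_failure(prices, bid) for bid in bids]


if __name__ == "__main__":
    trace = [0.10, 0.12, 0.30, 0.11, 0.50]  # made-up hourly spot prices
    print("uniform:    ", cluster_failures(trace, [0.25] * 4))
    print("non-uniform:", cluster_failures(trace, [0.20, 0.25, 0.40, 0.60]))
```

With a uniform bid every node is lost at the same hour, so the cluster fails as a whole; with non-uniform bids the nodes are lost at different times, or not at all, which changes both the failure rate and the kind of failure the application has to tolerate.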
{"title":"Banking on decoupling: budget-driven sustainability for HPC applications on auction-based clouds","authors":"Moussa Taifi","doi":"10.1145/2506164.2506172","DOIUrl":"https://doi.org/10.1145/2506164.2506172","PeriodicalId":7046,"journal":{"name":"ACM SIGOPS Oper. Syst. Rev.","volume":"13 1","pages":"41-50"},"publicationDate":"2013-07-23","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"90113245"}
Modern data centers host tens (if not hundreds) of thousands of servers and are used by companies such as Amazon, Google, and Microsoft to provide online services to millions of individuals distributed across the Internet. They use commodity hardware and their network infrastructure adopts principles evolved from enterprise and Internet networking. Applications use UDP datagrams or TCP sockets as the primary interface to other applications running inside the data center. This effectively isolates the network from the end-systems, which then have little control over how the network handles packets. Likewise, the network has limited visibility into the application logic. An application injects a packet with a destination address and the network just delivers the packet. Network and applications effectively treat each other as black boxes. This strict separation between applications and networks (also referred to as the dumb network) is a direct outcome of the so-called end-to-end argument [49] and has arguably been one of the main reasons why the Internet has been capable of evolving from a small research project to planetary scale, supporting a multitude of different hardware and network technologies as well as a slew of very diverse applications, and using networks owned by competing ISPs. Despite being so instrumental in the success of the Internet, this black-box design is also one of the root causes of inefficiencies in large-scale data centers. Given their limited control and visibility over network resources, applications need to use low-level hacks to extract network properties (e.g., using traceroute and IP addresses to infer the network topology) and to prioritize traffic (e.g., increasing the number of TCP flows used by an application to increase its bandwidth share). Further, simple functionality like multicast or anycast routing is not available and developers must resort to application-level overlays. This, however, leads to inefficiencies, as typically multiple logical links are mapped to the same physical link, significantly reducing application throughput. Even with perfect knowledge of the underlying topology, there is still the constraint that servers
{"title":"Bridging the gap between applications and networks in data centers","authors":"Paolo Costa","doi":"10.1145/2433140.2433143","DOIUrl":"https://doi.org/10.1145/2433140.2433143","PeriodicalId":7046,"journal":{"name":"ACM SIGOPS Oper. Syst. Rev.","volume":"28 1","pages":"3-8"},"publicationDate":"2013-01-29","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"77996551"}
Troubleshooting large networks is hard; when an end-user complains that she has “network problems,” there is typically a large number of possible causes. For example, the end-user’s own machine may be damaged, misconfigured, or compromised, a network element that handles her traffic may be congested or malfunctioning, or the destination she is trying to reach may be filtering her traffic. To diagnose such problems, a network operator normally has to probe the network’s elements to collect relevant statistics, like packet loss or bandwidth utilization. The challenge, though, is that the network operator often does not have direct access to all the suspected network elements and hence cannot probe them; e.g., the operator of an edge network does not have access to the equipment of her Internet service provider (ISP). Network tomography is an elegant approach to network troubleshooting: just as medical tomography observes an organ from different vantage points and combines the observations to get knowledge of the organ’s internals (without dissecting it), so does network tomography observe the characteristics of different end-to-end network paths and combine the observations to infer the characteristics of individual network links (without probing them). This approach is applicable in scenarios where one needs to monitor the behavior and performance of a network without having direct access to its elements. For instance, the operators of edge networks could use network tomography to monitor the behavior and performance of their ISPs; an ISP operator could use it to monitor the behavior and performance of its peers. However, there are reasons to be skeptical about the usefulness of network tomography in practice. Even though it was invented more than 10 years ago and is still a topic of active research, it has not seen any real deployment. 
We believe the reason is that existing tomography algorithms make certain simplifying assumptions that do not always hold in a real network, which means that the algorithms’ results may be inaccurate. Most importantly, there is no way to determine the extent of this inaccuracy. In other words, today there is no way for a network operator who employs tomography for network troubleshooting to compute the certainty of its diagnosis.
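As a minimal sketch of the idea, and of the kind of simplifying assumption at issue, consider a hypothetical star topology in which three end hosts each attach to a central element through one link (a, b, c), so that every end-to-end path crosses exactly two links. If losses on different links are assumed independent, a path's delivery rate is the product of its links' delivery rates, and taking logarithms turns the three path measurements into a linear system that can be solved for the links. The topology, names and numbers below are invented for illustration; this is not an algorithm from the paper.

```python
import math


def infer_link_delivery(path_delivery):
    """Infer per-link delivery rates from three end-to-end measurements.

    path_delivery maps "ab", "ac", "bc" to the measured delivery rate of
    the path crossing those two links. Assumes (simplification) that
    losses on different links are independent.
    """
    log_ab = math.log(path_delivery["ab"])
    log_ac = math.log(path_delivery["ac"])
    log_bc = math.log(path_delivery["bc"])
    # Solve: log_ab = a + b, log_ac = a + c, log_bc = b + c
    a = (log_ab + log_ac - log_bc) / 2
    b = (log_ab + log_bc - log_ac) / 2
    c = (log_ac + log_bc - log_ab) / 2
    return {"a": math.exp(a), "b": math.exp(b), "c": math.exp(c)}


if __name__ == "__main__":
    # Made-up ground truth, used only to synthesize path measurements.
    true_links = {"a": 0.99, "b": 0.90, "c": 0.95}
    paths = {
        "ab": true_links["a"] * true_links["b"],
        "ac": true_links["a"] * true_links["c"],
        "bc": true_links["b"] * true_links["c"],
    }
    print(infer_link_delivery(paths))
```

When the independence assumption fails (for instance, if two links share a congested resource), the same arithmetic still returns numbers, but nothing in the output signals that they are wrong, which is the gap the abstract points to.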
{"title":"Toward accurate and practical network tomography","authors":"Denisa Ghita, K. Argyraki, Patrick Thiran","doi":"10.1145/2433140.2433146","DOIUrl":"https://doi.org/10.1145/2433140.2433146","PeriodicalId":7046,"journal":{"name":"ACM SIGOPS Oper. Syst. Rev.","volume":"143 1","pages":"22-26"},"publicationDate":"2013-01-29","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"85436865"}
In this paper, we present a framework to compute, store and retrieve statistics of various system metrics from large traces in an efficient way. The proposed framework allows for rapid interactive queries about system metric values for any given time interval. In the proposed framework, efficient data structures and algorithms are designed to achieve a reasonable query time while using less disk space. A parameter termed granularity degree (GD) is defined to set the threshold for how often precomputed statistics must be stored on disk. The solution supports the hierarchy of system resources and also different granularities of time ranges. We explain the architecture of the framework and show how it can be used to efficiently compute and extract the CPU usage and other system metrics. The importance of the framework and its different applications are shown and evaluated in this paper.
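The trade-off between query time and stored data can be illustrated with a small Python sketch: running totals are saved only every `gd` events, and a query replays at most `gd - 1` events past the nearest saved total. This is a simplified stand-in, not the paper's data structure; in particular, treating the granularity degree as an event count is an assumption, and the class and method names are hypothetical.

```python
from bisect import bisect_right


class CheckpointedCounter:
    """Interval sums over a trace using sparse precomputed totals."""

    def __init__(self, events, gd):
        # events: list of (timestamp, amount) pairs sorted by timestamp
        self.times = [t for t, _ in events]
        self.amounts = [a for _, a in events]
        self.gd = gd
        # checkpoints[k] holds the total of the first k * gd events
        self.checkpoints = [0]
        running = 0
        for count, amount in enumerate(self.amounts, start=1):
            running += amount
            if count % gd == 0:
                self.checkpoints.append(running)

    def total_until(self, t):
        """Sum of amounts of all events with timestamp <= t."""
        n = bisect_right(self.times, t)  # events at or before t
        k = n // self.gd                 # last checkpoint at or before n
        total = self.checkpoints[k]
        for i in range(k * self.gd, n):  # replay fewer than gd events
            total += self.amounts[i]
        return total

    def total_between(self, t1, t2):
        """Sum of amounts of events with t1 < timestamp <= t2."""
        return self.total_until(t2) - self.total_until(t1)


if __name__ == "__main__":
    trace = [(1, 5), (2, 3), (4, 7), (7, 1), (9, 4)]  # made-up CPU time slices
    counter = CheckpointedCounter(trace, gd=2)
    print(counter.total_between(2, 9))
```

A smaller `gd` stores more checkpoints and shortens the replay; a larger `gd` saves space at the cost of slower queries.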
{"title":"A framework to compute statistics of system parameters from very large trace files","authors":"Naser Ezzati-Jivan, M. Dagenais","doi":"10.1145/2433140.2433151","DOIUrl":"https://doi.org/10.1145/2433140.2433151","PeriodicalId":7046,"journal":{"name":"ACM SIGOPS Oper. Syst. Rev.","volume":"768 1","pages":"43-54"},"publicationDate":"2013-01-29","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"78860303"}
The 6th Workshop on Large-Scale Distributed Systems and Middleware (LADIS) was held July 18 and 19 on the island of Madeira, Portugal, co-located with the ACM Symposium on Principles of Distributed Computing (PODC). LADIS brings together researchers and professionals to discuss new trends and techniques in distributed systems and middleware that surface in large-scale data centers, cloud computing, web services, and other important systems. This year, all LADIS contributions were by invitation only and underwent one round of reviews to assure quality and provide constructive feedback to the authors. Each paper received five reviews. As is tradition for LADIS, we also invited keynote speakers from academia and industry. The keynote speakers were invited to provide abstracts. As in previous years, we invited the authors of four of the abstracts to provide full papers for a special ACM SIGOPS Operating Systems Review issue. These abstracts were selected based on rankings provided by the reviewers. The selected papers received three more detailed reviews, and you see before you the revisions that resulted. Below, we provide a short report of the workshop itself. Scott Shenker (UC Berkeley and ICSI) started the workshop with a keynote presentation on Software Defined Networking (SDN), which was held before a joint audience of LADIS and PODC participants. Scott described the current lack of natural abstractions in the network control plane and how SDN tries to address this shortcoming. The concept is to provide modularity and standardization to network control in order to simplify management and encourage experimentation. OpenFlow is a well-known instantiation of SDN. The keynote was followed by two SDN-related presentations on cloud networking. Paulo Costa of Imperial College London argued that the traditional separation between applications and networks has to be revisited for modern datacenters. He described his CamCube project, which has developed a programmable torus-shaped network for a datacenter, and is now proposing a research agenda called Network-as-a-Service. The full paper is included in this issue. Theo
{"title":"Workshop report on LADIS 2012","authors":"D. Malkhi, R. V. Renesse","doi":"10.1145/2433140.2433142","DOIUrl":"https://doi.org/10.1145/2433140.2433142","url":null,"abstract":"The 6th Workshop on Large-Scale Distributed Systems and Middleware was held July 18 and 19 on the island of Madeira, Portugal, co-located with the ACM Symposium on Principles Of Distributed Computing (PODC). LADIS brings together researchers and professionals to discuss new trends and techniques in distributed systems and middlewares which surface in large scale data centers, cloud computing, web services, and other important systems. This year, all LADIS contributions were by invitation only and underwent one round of reviews for quality assurance and providing constructive feedback to the authors. Each paper received five reviews. As is tradition for LADIS, we also invited keynote speakers from academia and industry. The keynote speakers were invited to provide abstracts. As in previous years, we invited the authors of four of the abstracts to provide full papers for a special ACM SIGOPS Operating Systems Review issue. These abstracts were selected based on rankings provided by the reviewers. The selected papers received three more detailed reviews and you see before you the revisions that resulted. Below, we provide a short report of the workshop itself. Scott Shenker (UC Berkeley and ICSI) started the workshop with a keynote presentation on Software Defined Networking (SDN), which was held before a joint audience of LADIS and PODC participants. Scott described the current lack of natural abstractions in the network control plane and how SDN tries to address this shortcoming. The concept is to provide modularity and standardization to network control to simplify management and encourage experimentation. OpenFlow is a well-known instantiation of SDN. The keynote was followed by two SDN-related presentations on cloud networking. 
Paulo Costa of Imperial College London argued that the traditional separation between applications and networks has to be revisited for modern datacenters. He described his CamCube project that has developed a programmable torus-shaped network for a datacenter, and is now proposing a research agenda called NetworkAs-A-Service. The full paper is included in this issue. Theo","PeriodicalId":7046,"journal":{"name":"ACM SIGOPS Oper. Syst. Rev.","volume":"57 1","pages":"1-2"},"PeriodicalIF":0.0,"publicationDate":"2013-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81497934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}