Sébastien Monnet, Ramsés Morales, Gabriel Antoniu, Indranil Gupta
Peer-to-peer overlays allow distributed applications to work in a wide-area, scalable, and fault-tolerant manner. However, most structured and unstructured overlays present in literature today are inflexible from the application viewpoint. In other words, the application has no control over the structure of the overlay itself. This paper proposes the concept of an application-malleable overlay, and the design of the first malleable overlay which we call MOve. In MOve, the communication characteristics of the distributed application using the overlay can influence the overlay's structure itself, with the twin goals of (1) optimizing the application performance by adapting the overlay, while also (2) retaining the large scale and fault tolerance of the overlay approach. The influence could either be explicitly specified by the application or implicitly gleaned by our algorithms. Besides neighbor list membership management, MOve also contains algorithms for resource discovery, update propagation, and churn-resistance. The emergent behavior of the implicit mechanisms used in MOve manifests in the following way: when application communication is low, most overlay links keep their default configuration; however, as application communication characteristics become more evident, the overlay gracefully adapts itself to the application
{"title":"MOve: Design of An Application-Malleable Overlay","authors":"Sébastien Monnet, Ramsés Morales, Gabriel Antoniu, Indranil Gupta","doi":"10.1109/SRDS.2006.33","DOIUrl":"https://doi.org/10.1109/SRDS.2006.33","url":null,"abstract":"Peer-to-peer overlays allow distributed applications to work in a wide-area, scalable, and fault-tolerant manner. However, most structured and unstructured overlays present in literature today are inflexible from the application viewpoint. In other words, the application has no control over the structure of the overlay itself. This paper proposes the concept of an application-malleable overlay, and the design of the first malleable overlay which we call MOve. In MOve, the communication characteristics of the distributed application using the overlay can influence the overlay's structure itself, with the twin goals of (1) optimizing the application performance by adapting the overlay, while also (2) retaining the large scale and fault tolerance of the overlay approach. The influence could either be explicitly specified by the application or implicitly gleaned by our algorithms. Besides neighbor list membership management, MOve also contains algorithms for resource discovery, update propagation, and churn-resistance. The emergent behavior of the implicit mechanisms used in MOve manifests in the following way: when application communication is low, most overlay links keep their default configuration; however, as application communication characteristics become more evident, the overlay gracefully adapts itself to the application","PeriodicalId":164765,"journal":{"name":"2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117232493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the context of clients accessing a read/write shared object, persistency of a written value is a property stating that a value written into the object is always available unless overwritten by a successive write operation. This property can be easily guaranteed in a static distributed system provided that either a subset of processes implementing the object does not crash or processes can crash and then recover being able to retrieve their last state. Unfortunately the enforcing of this property in a potentially large scale and dynamic distributed system (e.g. a P2P system) is far from being trivial when considering the case in which processes implementing the object may fail or leave at any time without notifying any other process (i.e., the last state might not be retrievable). The paper introduces the notion of weak persistency that guarantees persistency of values when a system becomes quiescent (arrivals and departures subside). An implementation of a weakly-persistent object ensuring causal consistency is provided along with its correctness proof. The interest of causal consistency lies in the fact that, contrarily to atomic consistency, it can be maintained even during non-quiescent periods of the distributed system (i.e., when persistency is not guaranteed)
{"title":"Weakly-Persistent Causal Objects in Dynamic Distributed Systems","authors":"R. Baldoni, M. Malek, A. Milani, S. Piergiovanni","doi":"10.1109/SRDS.2006.47","DOIUrl":"https://doi.org/10.1109/SRDS.2006.47","url":null,"abstract":"In the context of clients accessing a read/write shared object, persistency of a written value is a property stating that a value written into the object is always available unless overwritten by a successive write operation. This property can be easily guaranteed in a static distributed system provided that either a subset of processes implementing the object does not crash or processes can crash and then recover being able to retrieve their last state. Unfortunately the enforcing of this property in a potentially large scale and dynamic distributed system (e.g. a P2P system) is far from being trivial when considering the case in which processes implementing the object may fail or leave at any time without notifying any other process (i.e., the last state might not be retrievable). The paper introduces the notion of weak persistency that guarantees persistency of values when a system becomes quiescent (arrivals and departures subside). An implementation of a weakly-persistent object ensuring causal consistency is provided along with its correctness proof. The interest of causal consistency lies in the fact that, contrarily to atomic consistency, it can be maintained even during non-quiescent periods of the distributed system (i.e., when persistency is not guaranteed)","PeriodicalId":164765,"journal":{"name":"2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129709141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PLATO is a predictive total ordering protocol designed for low-latency multicast in datacenters. It predicts out-of-order arrival of multicast packets by observing their inter-arrival times, and delays packets before passing them up to the application only if it believes the packets to have arrived in the wrong order. We show through experimentation on real datacenter-style networks that the inter-arrival time of consecutive packet pairs is an excellent predictor of out-of-order delivery. We evaluate an implementation of PLATO on the Emulab testbed, and show that it drives down delivery latencies by more than a factor of 2 compared to the fixed-sequencer protocol
{"title":"PLATO: Predictive Latency-Aware Total Ordering","authors":"M. Balakrishnan, K. Birman, Amar Phanishayee","doi":"10.1109/SRDS.2006.36","DOIUrl":"https://doi.org/10.1109/SRDS.2006.36","url":null,"abstract":"PLATO is a predictive total ordering protocol designed for low-latency multicast in datacenters. It predicts out-of-order arrival of multicast packets by observing their inter-arrival times, and delays packets before passing them up to the application only if it believes the packets to have arrived in the wrong order. We show through experimentation on real datacenter-style networks that the inter-arrival time of consecutive packet pairs is an excellent predictor of out-of-order delivery. We evaluate an implementation of PLATO on the Emulab testbed, and show that it drives down delivery latencies by more than a factor of 2 compared to the fixed-sequencer protocol","PeriodicalId":164765,"journal":{"name":"2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124134405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stefan Pleisch, Thomas Clouser, Mikhail Nesterenko, A. Schiper
We present DRIFT - a total order multicast algorithm for ad hoc networks with mobile or static nodes. Due to the ad hoc nature of the network, DRIFT uses flooding for message propagation. The key idea of DRIFT is virtual flooding - a way of using unrelated message streams to propagate message causality information in order to accelerate message delivery. We describe DRIFT in detail. We evaluate its performance in a simulator and in a wireless sensor network. In both cases our results demonstrate that the performance of DRIFT exceeds that of the simple total order multicast algorithm designed for wired networks, on which it is based. In simulation at scale, for certain experiment settings, DRIFT achieved speedup of several orders of magnitude
{"title":"DRIFT: Efficient Message Ordering in Ad Hoc Networks Using Virtual Flooding","authors":"Stefan Pleisch, Thomas Clouser, Mikhail Nesterenko, A. Schiper","doi":"10.1109/SRDS.2006.18","DOIUrl":"https://doi.org/10.1109/SRDS.2006.18","url":null,"abstract":"We present DRIFT - a total order multicast algorithm for ad hoc networks with mobile or static nodes. Due to the ad hoc nature of the network, DRIFT uses flooding for message propagation. The key idea of DRIFT is virtual flooding - a way of using unrelated message streams to propagate message causality information in order to accelerate message delivery. We describe DRIFT in detail. We evaluate its performance in a simulator and in a wireless sensor network. In both cases our results demonstrate that the performance of DRIFT exceeds that of the simple total order multicast algorithm designed for wired networks, on which it is based. In simulation at scale, for certain experiment settings, DRIFT achieved speedup of several orders of magnitude","PeriodicalId":164765,"journal":{"name":"2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130734054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the past decade, the application field of networked embedded computing systems (NECSs) has been showing gradual acceleration in its growth. A great majority of NECSs are subject to some fault tolerance requirements. In other words, in their application environments, the fault rates of some components, e.g., communication links, batteries, some chips, etc., are not negligible. So, research interests in fault-tolerant (FT) NECSs have been showing growing trends in the past decade in good contrast to the relatively stagnant research interests in more traditional branches of FTC such as FT data servers, etc. In particular, object-/component-oriented (OO/CO) structuring techniques for RT distributed computing systems have been established in highly promising forms with convincing demonstrations although they are still very slow in spreading through industry. Such OO/CO structuring techniques not only lead to efficient design of RTC application systems with considerably reduced labor and improved maintainability but also enable design of complex systems yielding analyzable and yet tight response time bounds. On the basis of such foundation for designing RTC systems, research in FT NECS can now proceed with much improved effectiveness and confidence
{"title":"Systematic composition and analyzability of dependable networked embedded computing systems","authors":"K. Kim","doi":"10.1109/SRDS.2006.45","DOIUrl":"https://doi.org/10.1109/SRDS.2006.45","url":null,"abstract":"In the past decade, the application field of networked embedded computing systems (NECSs) has been showing gradual acceleration in its growth. A great majority of NECSs are subject to some fault tolerance requirements. In other words, in their application environments, the fault rates of some components, e.g., communication links, batteries, some chips, etc., are not negligible. So, research interests in fault-tolerant (FT) NECSs have been showing growing trends in the past decade in good contrast to the relatively stagnant research interests in more traditional branches of FTC such as FT data servers, etc. In particular, object-/component-oriented (OO/CO) structuring techniques for RT distributed computing systems have been established in highly promising forms with convincing demonstrations although they are still very slow in spreading through industry. Such OO/CO structuring techniques not only lead to efficient design of RTC application systems with considerably reduced labor and improved maintainability but also enable design of complex systems yielding analyzable and yet tight response time bounds. On the basis of such foundation for designing RTC systems, research in FT NECS can now proceed with much improved effectiveness and confidence","PeriodicalId":164765,"journal":{"name":"2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06)","volume":"377 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131724803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Grolimund, Luzius Meisser, S. Schmid, Roger Wattenhofer
We present Cryptree, a cryptographic tree structure which facilitates access control in file systems operating on untrusted storage. Cryptree leverages the file system's folder hierarchy to achieve efficient and intuitive, yet simple, access control. The highlights are its ability to recursively grant access to a folder and all its subfolders in constant time, the dynamic inheritance of access rights which inherently prevents scattering of access rights, and the possibility to grant someone access to a file or folder without revealing the identities of other accessors. To reason about and to visualize Cryptree, we introduce the notion of cryptographic links. We describe the Cryptrees we have used to enforce read and write access in our own file system. Finally, we measure the performance of the Cryptree and compare it to other approaches
{"title":"Cryptree: A Folder Tree Structure for Cryptographic File Systems","authors":"D. Grolimund, Luzius Meisser, S. Schmid, Roger Wattenhofer","doi":"10.1109/SRDS.2006.15","DOIUrl":"https://doi.org/10.1109/SRDS.2006.15","url":null,"abstract":"We present Cryptree, a cryptographic tree structure which facilitates access control in file systems operating on untrusted storage. Cryptree leverages the file system's folder hierarchy to achieve efficient and intuitive, yet simple, access control. The highlights are its ability to recursively grant access to a folder and all its subfolders in constant time, the dynamic inheritance of access rights which inherently prevents scattering of access rights, and the possibility to grant someone access to a file or folder without revealing the identities of other accessors. To reason about and to visualize Cryptree, we introduce the notion of cryptographic links. We describe the Cryptrees we have used to enforce read and write access in our own file system. Finally, we measure the performance of the Cryptree and compare it to other approaches","PeriodicalId":164765,"journal":{"name":"2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131455172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Antonio Fernández, Luis López, Agustín Santos, Chryssis Georgiou
In this work we consider a distributed system formed by a master processor and a collection of n processors (workers) that can execute tasks; worker processors are untrusted and might act maliciously. The master assigns tasks to workers to be executed. Each task returns a binary value, and we want the master to accept only correct values with high probability. Furthermore, we assume that the service provided by the workers is not free; for each task that a worker is assigned, the master is charged with a work-unit. Therefore, considering a single task assigned to several workers, our goal is to have the master computer to accept the correct value of the task with high probability, with the smallest possible amount of work (number of workers the master assigns the task). We explore two ways of bounding the number of faulty processors: (a) we consider a fixed bound f < n/2 on the maximum number of workers that may fail, and (b) a probability p < 1/2 of any processor to be faulty (all processors are faulty with probability p, independently of the rest of processors). Our work demonstrates that it is possible to obtain high probability of correct acceptance with low work. In particular, by considering both mechanisms of bounding the number of malicious workers, we first show lower bounds on the minimum amount of (expected) work required, so that any algorithm accepts the correct value with probability of success 1 - epsiv, where epsiv Lt 1 (e.g., 1/n). Then we develop and analyze two algorithms, each using a different decision strategy, and show that both algorithms obtain the same probability of success 1 - epsiv, and in doing so, they require similar upper bounds on the (expected) work. Furthermore, under certain conditions, these upper bounds are asymptotically optimal with respect to our lower bounds
{"title":"Reliably Executing Tasks in the Presence of Untrusted Entities","authors":"Antonio Fernández, Luis López, Agustín Santos, Chryssis Georgiou","doi":"10.1109/SRDS.2006.40","DOIUrl":"https://doi.org/10.1109/SRDS.2006.40","url":null,"abstract":"In this work we consider a distributed system formed by a master processor and a collection of n processors (workers) that can execute tasks; worker processors are untrusted and might act maliciously. The master assigns tasks to workers to be executed. Each task returns a binary value, and we want the master to accept only correct values with high probability. Furthermore, we assume that the service provided by the workers is not free; for each task that a worker is assigned, the master is charged with a work-unit. Therefore, considering a single task assigned to several workers, our goal is to have the master computer to accept the correct value of the task with high probability, with the smallest possible amount of work (number of workers the master assigns the task). We explore two ways of bounding the number of faulty processors: (a) we consider a fixed bound f < n/2 on the maximum number of workers that may fail, and (b) a probability p < 1/2 of any processor to be faulty (all processors are faulty with probability p, independently of the rest of processors). Our work demonstrates that it is possible to obtain high probability of correct acceptance with low work. In particular, by considering both mechanisms of bounding the number of malicious workers, we first show lower bounds on the minimum amount of (expected) work required, so that any algorithm accepts the correct value with probability of success 1 - epsiv, where epsiv Lt 1 (e.g., 1/n). Then we develop and analyze two algorithms, each using a different decision strategy, and show that both algorithms obtain the same probability of success 1 - epsiv, and in doing so, they require similar upper bounds on the (expected) work. Furthermore, under certain conditions, these upper bounds are asymptotically optimal with respect to our lower bounds","PeriodicalId":164765,"journal":{"name":"2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123572791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Availability prediction in a telecommunication system plays a crucial role in its management, either by alerting the operator to potential failures or by proactively initiating preventive measures. In this paper, we apply linear (ARMA, multivariate, random walk) and nonlinear (Radial and Universal Basis Functions) regression techniques to recognize system failures and to predict the system's call availability up to 15 minutes in advance. Secondly we introduce a novel nonlinear modeling technique for call availability prediction. We benchmark all five techniques against each other. The applied modeling methods are data driven rather than analytical and can handle large amounts of data. We apply the modeling techniques to real data of a commercial telecommunication platform. The data used for modeling includes: a) time stamped event-based log files; and b) continuously measured system states. Results are given in terms of a) receiver operator characteristics (AUC) for classification into classes of failure and non-failure states and b) as a cost-benefit analysis. Our findings suggest: a) high degree of nonlinearity in the data; b) statistically significant improved forecasting performance and cost-benefit ratio of nonlinear modeling techniques; and finally finding that c) log file data does not contribute to improve model performance with any modeling technique
{"title":"Call Availability Prediction in a Telecommunication System: A Data Driven Empirical Approach","authors":"G. A. Hoffmann, M. Malek","doi":"10.1109/SRDS.2006.12","DOIUrl":"https://doi.org/10.1109/SRDS.2006.12","url":null,"abstract":"Availability prediction in a telecommunication system plays a crucial role in its management, either by alerting the operator to potential failures or by proactively initiating preventive measures. In this paper, we apply linear (ARMA, multivariate, random walk) and nonlinear (Radial and Universal Basis Functions) regression techniques to recognize system failures and to predict the system's call availability up to 15 minutes in advance. Secondly we introduce a novel nonlinear modeling technique for call availability prediction. We benchmark all five techniques against each other. The applied modeling methods are data driven rather than analytical and can handle large amounts of data. We apply the modeling techniques to real data of a commercial telecommunication platform. The data used for modeling includes: a) time stamped event-based log files; and b) continuously measured system states. Results are given in terms of a) receiver operator characteristics (AUC) for classification into classes of failure and non-failure states and b) as a cost-benefit analysis. Our findings suggest: a) high degree of nonlinearity in the data; b) statistically significant improved forecasting performance and cost-benefit ratio of nonlinear modeling techniques; and finally finding that c) log file data does not contribute to improve model performance with any modeling technique","PeriodicalId":164765,"journal":{"name":"2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124018309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In a recent paper, we presented proactive resilience as a new approach to proactive recovery, based on architectural hybridization. We showed that, with appropriate assumptions about fault rate, proactive resilience makes it possible to build distributed intrusion-tolerant systems guaranteed not to suffer more than the assumed number of faults during their lifetime. In this paper, we explore the impact of these assumptions in asynchronous systems, and derive conditions that should be met by practical systems in order to guarantee long-lived, i.e., available, intrusion-tolerant operation. Our conclusions are based on analytical and simulation results as implemented in Mobius, and we use the same modeling environment to show that our approach offers higher resilience in comparison with other proactive intrusion-tolerant system models
{"title":"Proactive Resilience Revisited: The Delicate Balance Between Resisting Intrusions and Remaining Available","authors":"Paulo Sousa, N. Neves, P. Veríssimo, W. Sanders","doi":"10.1109/SRDS.2006.37","DOIUrl":"https://doi.org/10.1109/SRDS.2006.37","url":null,"abstract":"In a recent paper, we presented proactive resilience as a new approach to proactive recovery, based on architectural hybridization. We showed that, with appropriate assumptions about fault rate, proactive resilience makes it possible to build distributed intrusion-tolerant systems guaranteed not to suffer more than the assumed number of faults during their lifetime. In this paper, we explore the impact of these assumptions in asynchronous systems, and derive conditions that should be met by practical systems in order to guarantee long-lived, i.e., available, intrusion-tolerant operation. Our conclusions are based on analytical and simulation results as implemented in Mobius, and we use the same modeling environment to show that our approach offers higher resilience in comparison with other proactive intrusion-tolerant system models","PeriodicalId":164765,"journal":{"name":"2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131133410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christopher Peery, Thu D. Nguyen, Francisco Matias Cuenca-Acuna
We consider the problem of ensuring high data availability in federated content sharing systems. Ideally, such a system would provide high data availability in a device transparent manner so that users are not faced with the time-consuming and error-prone task of managing data replicas across the constituent devices of the system. We propose a novel unified availability model and a decentralized replication algorithm to approximate this ideal. Our availability model addresses three different concerns: availability during connected operation (online), availability during disconnected operation (offline), and availability after permanent disconnection from the federated system (ownership). Our replication algorithm centers around the intuition that devices should selfishly use their local storage to ensure offline and ownership availability for their individual owners. Excess storage, however, is used communally to ensure high online availability for all shared content. Evaluation of an implementation shows that our algorithm rapidly reaches stable and communally desirable configurations when there is sufficient space. Consistent with the fact that devices in a federated system are owned by different users, however, as space becomes highly constrained, the system approaches a non-cooperative configuration where devices only hoard content to serve their individual owners' needs
{"title":"Reducing the Availability Management Overheads of Federated Content Sharing Systems","authors":"Christopher Peery, Thu D. Nguyen, Francisco Matias Cuenca-Acuna","doi":"10.1109/SRDS.2006.39","DOIUrl":"https://doi.org/10.1109/SRDS.2006.39","url":null,"abstract":"We consider the problem of ensuring high data availability in federated content sharing systems. Ideally, such a system would provide high data availability in a device transparent manner so that users are not faced with the time-consuming and error-prone task of managing data replicas across the constituent devices of the system. We propose a novel unified availability model and a decentralized replication algorithm to approximate this ideal. Our availability model addresses three different concerns: availability during connected operation (online), availability during disconnected operation (offline), and availability after permanent disconnection from the federated system (ownership). Our replication algorithm centers around the intuition that devices should selfishly use their local storage to ensure offline and ownership availability for their individual owners. Excess storage, however, is used communally to ensure high online availability for all shared content. Evaluation of an implementation shows that our algorithm rapidly reaches stable and communally desirable configurations when there is sufficient space. Consistent with the fact that devices in a federated system are owned by different users, however, as space becomes highly constrained, the system approaches a non-cooperative configuration where devices only hoard content to serve their individual owners' needs","PeriodicalId":164765,"journal":{"name":"2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133335336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}