{"title":"覆盖网络故障检测算法研究","authors":"S. Zhuang, Dennis Geels, I. Stoica, R. Katz","doi":"10.1109/INFCOM.2005.1498487","DOIUrl":null,"url":null,"abstract":"One of the key reasons overlay networks are seen as an excellent platform for large scale distributed systems is their resilience in the presence of node failures. This resilience rely on accurate and timely detection of node failures. Despite the prevalent use of keep-alive algorithms in overlay networks to detect node failures, their tradeoffs and the circumstances in which they might best he suited is not well understood. In this paper, we study how the design of various keep-alive approaches affect their performance in node failure detection time, probability of false positive, control overhead, and packet loss rate via analysis, simulation, and implementation. We find that among the class of keep-alive algorithms that share information, the maintenance of backpointer state substantially improves detection time and packet loss rate. The improvement in detection time between baseline and sharing algorithms becomes more pronounced as the size of neighbor set increases. Finally, sharing of information allows a network to tolerate a higher churn rate than baseline.","PeriodicalId":20482,"journal":{"name":"Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies.","volume":"43 1","pages":"2112-2123 vol. 3"},"PeriodicalIF":0.0000,"publicationDate":"2005-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"108","resultStr":"{\"title\":\"On failure detection algorithms in overlay networks\",\"authors\":\"S. Zhuang, Dennis Geels, I. Stoica, R. Katz\",\"doi\":\"10.1109/INFCOM.2005.1498487\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"One of the key reasons overlay networks are seen as an excellent platform for large scale distributed systems is their resilience in the presence of node failures. This resilience rely on accurate and timely detection of node failures. Despite the prevalent use of keep-alive algorithms in overlay networks to detect node failures, their tradeoffs and the circumstances in which they might best he suited is not well understood. In this paper, we study how the design of various keep-alive approaches affect their performance in node failure detection time, probability of false positive, control overhead, and packet loss rate via analysis, simulation, and implementation. We find that among the class of keep-alive algorithms that share information, the maintenance of backpointer state substantially improves detection time and packet loss rate. The improvement in detection time between baseline and sharing algorithms becomes more pronounced as the size of neighbor set increases. Finally, sharing of information allows a network to tolerate a higher churn rate than baseline.\",\"PeriodicalId\":20482,\"journal\":{\"name\":\"Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies.\",\"volume\":\"43 1\",\"pages\":\"2112-2123 vol. 3\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2005-03-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"108\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/INFCOM.2005.1498487\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INFCOM.2005.1498487","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
On failure detection algorithms in overlay networks
One of the key reasons overlay networks are seen as an excellent platform for large scale distributed systems is their resilience in the presence of node failures. This resilience rely on accurate and timely detection of node failures. Despite the prevalent use of keep-alive algorithms in overlay networks to detect node failures, their tradeoffs and the circumstances in which they might best he suited is not well understood. In this paper, we study how the design of various keep-alive approaches affect their performance in node failure detection time, probability of false positive, control overhead, and packet loss rate via analysis, simulation, and implementation. We find that among the class of keep-alive algorithms that share information, the maintenance of backpointer state substantially improves detection time and packet loss rate. The improvement in detection time between baseline and sharing algorithms becomes more pronounced as the size of neighbor set increases. Finally, sharing of information allows a network to tolerate a higher churn rate than baseline.