Tolerating network failures in system area networks
Jeffrey Tang, A. Bilas
Proceedings International Conference on Parallel Processing, 2002-08-18
DOI: 10.1109/ICPP.2002.1040866 (https://doi.org/10.1109/ICPP.2002.1040866)
In this paper, we investigate how system area networks can deal with transient and permanent network failures. We design and implement a firmware-level retransmission scheme to tolerate transient failures and an on-demand network mapping scheme to deal with permanent failures. Both schemes are transparent to applications, conceptually simple, and suitable for low-level implementations, e.g. in firmware. We then examine how the retransmission scheme affects system performance and how various protocol parameters impact system behavior. We analyze and evaluate system performance using a real implementation on a state-of-the-art cluster, with both micro-benchmarks and real applications from the SPLASH-2 suite.