Locality matters: Reducing Internet traffic graphs using location analysis
A. Berger, Stefan Rührup, W. Gansterer, O. Jung
2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Pub Date: 2013-06-24, DOI: 10.1109/DSN.2013.6575365
The representation of Internet traffic as connection graphs augments anomaly detection systems by providing insight into structural connection properties, i.e., who talks to whom. However, these graphs are extremely large, and one must decide in advance which aspects to focus on. In the context of malware detection this is difficult, as malware often mimics legitimate traffic. In this paper, we present a statistical approach for extracting the typical traffic destinations for a set of monitored hosts, and derive a reduced graph that contains only connections that are anomalous for each host. This graph can then be analyzed efficiently. Our system is designed to scale to thousands of monitored hosts. We evaluate our approach using a data set from a real network, and show that we can reliably detect injected malware activity.
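The per-host reduction described above can be sketched as follows. This is a minimal illustration, not the paper's actual statistical test: the frequency threshold `min_fraction` and the flow representation are assumptions chosen for clarity.

```python
from collections import Counter, defaultdict

def build_baselines(flows, min_fraction=0.05):
    """For each monitored host, keep destinations accounting for at least
    min_fraction of its historical connections (an illustrative stand-in
    for the paper's statistical extraction of typical destinations)."""
    counts = defaultdict(Counter)
    for src, dst in flows:
        counts[src][dst] += 1
    baselines = {}
    for src, c in counts.items():
        total = sum(c.values())
        baselines[src] = {d for d, n in c.items() if n / total >= min_fraction}
    return baselines

def reduced_graph(flows, baselines):
    """Keep only edges whose destination is atypical for that source host."""
    return [(s, d) for s, d in flows if d not in baselines.get(s, set())]
```

A host that contacts one rare destination among many routine ones thus contributes a single edge to the reduced graph, which is what keeps the structure small enough to analyze.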
Intrusion detection and honeypots in nested virtualization environments
M. Beham, Marius Vlad, Hans P. Reiser
Pub Date: 2013-06-24, DOI: 10.1109/DSN.2013.6575329
Several research projects in the past have built intrusion detection systems and honeypot architectures based on virtual machine introspection (VMI). These systems directly benefit from the use of virtualization technology. The VMI approach, however, requires direct interaction with the virtual machine monitor, and is typically not available to clients of current public clouds. Recently, nested virtualization has gained popularity in research as an approach that could enable cloud customers to use virtualization-based solutions within a cloud by nesting two virtual machine monitors, with the inner one under control of the client. In this paper, we compare the performance of existing nested-virtualization solutions and analyze the impact of the performance overhead on VMI-based intrusion detection and honeypot systems.
Detecting malicious landing pages in Malware Distribution Networks
G. Wang, J. W. Stokes, Cormac Herley, D. Felstead
Pub Date: 2013-06-24, DOI: 10.1109/DSN.2013.6575316
Drive-by download attacks attempt to compromise a victim's computer through browser vulnerabilities. Often they are launched from Malware Distribution Networks (MDNs) consisting of landing pages to attract traffic, intermediate redirection servers, and exploit servers that attempt the compromise. In this paper, we present a novel approach to discovering the landing pages that lead to drive-by downloads. Starting from partial knowledge of a given collection of MDNs, we identify the malicious content on their landing pages using multiclass feature selection. We then query the webpage cache of a commercial search engine to identify landing pages containing the same or similar content. In this way we are able to identify previously unknown landing pages belonging to already identified MDNs, which allows us to expand our understanding of the MDNs. We explore both a rule-based and a classifier-based approach to identifying potentially malicious landing pages. We build both systems and independently verify, using a high-interaction honeypot, that the newly identified landing pages indeed attempt drive-by downloads. For the rule-based system, 57% of the landing pages predicted as malicious are confirmed, and this success rate remains constant in two large trials spaced five months apart. This extends the known footprint of the MDNs studied by 17%. The classifier-based system is less successful, and we explore possible reasons.
Implementing the ADVISE security modeling formalism in Möbius
Michael D. Ford, K. Keefe, E. LeMay, W. Sanders, Carol Muehrcke
Pub Date: 2013-06-24, DOI: 10.1109/DSN.2013.6575362
The ADversary VIew Security Evaluation (ADVISE) model formalism provides a system security model from the perspective of an adversary. An ADVISE atomic model consists of an attack execution graph (AEG) composed of attack steps, system state variables, and attack goals, as well as an adversary profile that defines the abilities and interests of a particular adversary. The ADVISE formalism has been implemented as a Möbius atomic model formalism in order to leverage the existing set of mature modeling formalisms and solution techniques offered by Möbius. This tool paper explains the ADVISE implementation in Möbius and provides technical details for Möbius users who want to use ADVISE either alone or in combination with other modeling formalisms provided by Möbius.
CloudPD: Problem determination and diagnosis in shared dynamic clouds
Bikash Sharma, P. Jayachandran, Akshat Verma, C. Das
Pub Date: 2013-06-24, DOI: 10.1109/DSN.2013.6575298
In this work, we address problem determination in virtualized clouds. We show that high dynamism, resource sharing, frequent reconfiguration, a high propensity to faults, and automated management introduce significant new challenges for fault diagnosis in clouds. To address these, we propose CloudPD, a fault management framework for clouds. CloudPD leverages (i) a canonical representation of the operating environment to quantify the impact of sharing; (ii) an online learning process to tackle dynamism; (iii) correlation-based performance models for higher detection accuracy; and (iv) an integrated end-to-end feedback loop to synergize with a cloud management ecosystem. Using a prototype implementation with representative cloud batch and transactional workloads such as Hadoop, Olio, and RUBiS, it is shown that CloudPD detects and diagnoses faults with low false positives (below 16%) and high accuracy of 88%, 83%, and 83% for the three workloads, respectively. In an enterprise trace-based case study, CloudPD diagnosed anomalies within 30 seconds and with an accuracy of 77%, demonstrating its effectiveness in real-life operations.
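The correlation-based detection idea in (iii) can be illustrated with a small sketch: learn the baseline correlation between two metrics (say, CPU and I/O of a VM) when healthy, and flag a fault when the observed correlation over a recent window drifts away from it. The metric pairing and the drift tolerance `tol` are illustrative assumptions, not CloudPD's actual model.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx and vy else 0.0

def correlation_anomaly(baseline_r, cpu_window, io_window, tol=0.4):
    """Flag an anomaly when the observed metric correlation drifts more
    than tol from the baseline learned during healthy operation."""
    return abs(pearson(cpu_window, io_window) - baseline_r) > tol
```

The appeal of such a check in a shared cloud is that it keys on the *relationship* between metrics, which is more stable across reconfigurations than raw thresholds.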
Crossing the threshold: Detecting network malfeasance via sequential hypothesis testing
Srinivas Krishnan, Teryl Taylor, F. Monrose, J. McHugh
Pub Date: 2013-06-24, DOI: 10.1109/DSN.2013.6575364
The domain name system plays a vital role in the dependability and security of modern networks. Unfortunately, it has also been widely misused for nefarious activities. Recently, attackers have turned their attention to the use of algorithmically generated domain names (AGDs) in an effort to circumvent network defenses. However, because such domain names are increasingly being used in benign applications, this transition has significant implications for techniques that classify AGDs based solely on the format of a domain name. To highlight the challenges they face, we examine contemporary approaches and demonstrate their limitations. We address these shortcomings by proposing an online form of sequential hypothesis testing that classifies clients based solely on the non-existent (NX) responses they elicit. Our evaluations on real-world data show that we outperform existing approaches, and for the vast majority of cases, we detect malware before it is able to successfully rendezvous with its command and control centers.
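The core mechanism here is Wald's sequential probability ratio test: each DNS lookup from a client is an observation (NXDOMAIN or not), and the accumulated log-likelihood ratio is compared against two thresholds derived from target error rates. The sketch below uses illustrative NX rates (`p_benign`, `p_malicious`) and error targets; the paper's calibrated parameters differ.

```python
import math

def sprt(observations, p_benign=0.05, p_malicious=0.6,
         alpha=0.01, beta=0.01):
    """Wald's SPRT over a stream of DNS outcomes (True = NX response).
    alpha/beta are the target false-positive/false-negative rates."""
    upper = math.log((1 - beta) / alpha)   # cross: decide "malicious"
    lower = math.log(beta / (1 - alpha))   # cross: decide "benign"
    llr = 0.0
    for nx in observations:
        if nx:
            llr += math.log(p_malicious / p_benign)
        else:
            llr += math.log((1 - p_malicious) / (1 - p_benign))
        if llr >= upper:
            return "malicious"
        if llr <= lower:
            return "benign"
    return "undecided"   # keep observing
```

Because the test is sequential, a client eliciting a burst of NX responses is flagged after only a handful of lookups, which is what makes detection possible before the malware finds a live rendezvous domain.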
WirelessHART modeling and performance evaluation
Anne Remke, Xian-You Wu
Pub Date: 2013-06-24, DOI: 10.1109/DSN.2013.6575358
In process industries, wired supervisory and control networks are increasingly being replaced by wireless systems. Wireless communication inevitably introduces time delays and message losses, which may degrade system reliability and performance. WirelessHART, as the first international standard for wireless process supervision and control, has received notable academic attention. This paper models WirelessHART networks with link failures using discrete-time Markov chains and evaluates the network performance in a typical WirelessHART environment with respect to delay and reachability. The evaluation shows that although the performance of WirelessHART is influenced by several factors, it is capable of delivering reliable service in typical industrial environments. The proposed model can also be used to predict path performance and to provide routing suggestions.
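As a back-of-the-envelope companion to the DTMC model, end-to-end reachability of a multi-hop path can be estimated from per-link success probabilities and a retry budget. This is a deliberately simplified sketch (independent attempts, no slot scheduling), not the paper's full discrete-time Markov chain.

```python
def link_delivery_prob(q, retries):
    """Probability a link delivers within the retry budget, assuming
    independent per-attempt success probability q."""
    return 1 - (1 - q) ** retries

def path_reachability(link_qs, retries):
    """End-to-end reachability over a multi-hop path: every hop must
    deliver within its retry budget."""
    prob = 1.0
    for q in link_qs:
        prob *= link_delivery_prob(q, retries)
    return prob
```

Even this crude model shows the qualitative trade-off the paper quantifies: retries push per-link reliability up sharply, but each added hop multiplies in another failure opportunity and another slot of delay.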
Dependability models for designing disaster tolerant cloud computing systems
B. Silva, P. Maciel, E. Tavares, A. Zimmermann
Pub Date: 2013-06-24, DOI: 10.1109/DSN.2013.6575323
Hundreds of natural disasters occur in many parts of the world every year, causing billions of dollars in damages. This fact contrasts with the high-availability requirements of cloud computing systems; to protect such systems from unforeseen catastrophes, a recovery plan requires the use of different data centers located far enough apart. However, the time to migrate a VM from one data center to another increases with distance. This work presents dependability models for evaluating distributed cloud computing systems deployed across multiple data centers, considering disaster occurrence. Additionally, we present a case study that evaluates several scenarios with different VM migration times and distances between data centers.
SIDE: Isolated and efficient execution of unmodified device drivers
Yifeng Sun, T. Chiueh
Pub Date: 2013-06-24, DOI: 10.1109/DSN.2013.6575348
Buggy device drivers are a major threat to the reliability of their host operating system. There have been myriad attempts to protect the kernel, but most of them either require driver modifications or incur substantial performance overhead. This paper describes an isolated device driver execution system called SIDE (Streamlined Isolated Driver Execution), which focuses specifically on unmodified device drivers and strives to avoid changing the existing kernel code as much as possible. SIDE exploits virtual memory hardware to set up a device driver execution environment that is compatible with existing device drivers and yet is fully isolated from the kernel. SIDE is able to run an unmodified device driver for a Gigabit Ethernet NIC; when augmented with a set of performance optimizations designed to reduce the number of protection-domain crossings between an isolated device driver and the kernel, the latency and throughput penalty is kept under 1%.
An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance
Joseph Sloan, Rakesh Kumar, G. Bronevetsky
Pub Date: 2013-06-24, DOI: 10.1109/DSN.2013.6575309
The increasing size and complexity of massively parallel systems (e.g., HPC systems) makes it increasingly likely that individual circuits will produce erroneous results. For this reason, novel fault tolerance approaches are increasingly needed. Prior fault tolerance approaches often rely on checkpoint-rollback schemes. Unfortunately, such schemes are primarily limited to rare-error scenarios, as their overheads become prohibitive if faults are common. In this paper, we propose a novel approach for algorithmic correction of faulty application outputs. The key insight behind this approach is that even in high-error scenarios, when the result of an algorithm is erroneous, most of it is still correct. Instead of simply rolling back to the most recent checkpoint and repeating the entire segment of computation, our resilience approach uses algorithmic error localization and partial recomputation to efficiently correct the corrupted results. We evaluate our approach in the specific algorithmic scenario of linear algebra operations, focusing on matrix-vector multiplication (MVM) and iterative linear solvers. We develop a novel technique for localizing errors in MVM and show how to achieve partial recomputation within this algorithm, and demonstrate that this approach both improves the performance of the Conjugate Gradient solver in high-error scenarios by 3x-4x and increases the probability that it completes successfully by up to 60%, in parallel experiments on up to 100 nodes.
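The localize-then-recompute idea can be sketched with a checksum scheme in the spirit of algorithm-based fault tolerance: precompute per-block column sums of A, compare each block of y = Ax against its checksum, and recompute only the rows of mismatching blocks. The block partitioning and plain dot-product check are illustrative simplifications of the paper's technique, and a block with errors that cancel in the sum would be missed by this naive check.

```python
def blocked_checksums(A, block):
    """Precompute, for each row block of A, the vector of column sums.
    For a correct y = A @ x, sum(y[block]) must equal checksum . x."""
    sums = []
    for i in range(0, len(A), block):
        rows = A[i:i + block]
        sums.append([sum(col) for col in zip(*rows)])
    return sums

def correct_mvm(A, x, y, checksums, block, tol=1e-9):
    """Locate corrupted blocks of y via checksum mismatch, then apply
    partial recomputation: redo only the rows of flagged blocks."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    for b, cs in enumerate(checksums):
        lo = b * block
        hi = min(lo + block, len(y))
        if abs(sum(y[lo:hi]) - dot(cs, x)) > tol:
            for i in range(lo, hi):
                y[i] = dot(A[i], x)   # recompute only this block
    return y
```

The payoff mirrors the paper's argument: when only a small fraction of blocks is corrupted, correction costs a fraction of a full recomputation, instead of a rollback of the whole computation.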