Title: Reconfiguring Parallel State Machine Replication
Authors: E. Alchieri, F. Dotti, O. Mendizabal, F. Pedone
In: 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 104-113. DOI: https://doi.org/10.1109/SRDS.2017.23
Abstract: State Machine Replication (SMR) is a well-known technique for implementing fault-tolerant systems. In SMR, servers are replicated and client requests are executed deterministically, in the same order, by all replicas. To improve performance on multi-processor systems, some approaches parallelize the execution of non-conflicting requests. Such approaches perform remarkably well in workloads dominated by non-conflicting requests, but conflicting requests introduce expensive synchronization and cause considerable performance loss. Current approaches to parallel SMR define the degree of parallelism statically. However, it is often difficult to predict the best degree of parallelism for a workload, and workloads experience variations that change their best degree of parallelism. This paper proposes a protocol to reconfigure the degree of parallelism in parallel SMR on the fly. Experiments show the gains due to reconfiguration and shed light on the behavior of parallel and reconfigurable SMR.
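The execution model behind parallel SMR can be sketched as follows. This is an illustrative toy, not the paper's protocol: the request and conflict representations are invented for the example. Non-conflicting requests are dispatched to a worker pool, while a conflicting request acts as a barrier and executes alone, preserving the agreed total order:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

class ParallelSMR:
    """Toy parallel executor: non-conflicting requests run concurrently;
    a conflicting request waits for all in-flight work, then runs alone."""

    def __init__(self, workers=4):
        self.workers = workers
        self.state = {}
        self.lock = threading.Lock()

    def apply(self, op):
        key, val = op
        with self.lock:
            self.state[key] = val

    def execute(self, ordered_requests, conflicts):
        pool = ThreadPoolExecutor(max_workers=self.workers)
        pending = []
        for req in ordered_requests:
            if req in conflicts:
                # Barrier: drain every in-flight non-conflicting request
                # before executing the conflicting one sequentially.
                for f in pending:
                    f.result()
                pending = []
                self.apply(req)
            else:
                pending.append(pool.submit(self.apply, req))
        for f in pending:
            f.result()
        pool.shutdown()
        return self.state
```

The expensive part is exactly the barrier: a workload with many conflicting requests degenerates to sequential execution, which is why the best degree of parallelism depends on the conflict rate.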
Title: Runtime Measurement Architecture for Bytecode Integrity in JVM-Based Cloud
Authors: Haihe Ba, Huaizhe Zhou, Jiangchun Ren, Zhiying Wang
In: 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 262-263. DOI: https://doi.org/10.1109/SRDS.2017.39
Abstract: While the Java Virtual Machine provides applications with memory safety, avoiding memory corruption bugs, it still suffers from security flaws, and real-world exploits show that the current sandbox model can be bypassed. In this paper, we focus on bytecode integrity measurement in clouds to identify malicious execution, and we propose the J-IMA architecture to provide runtime measurement and remote attestation of bytecode integrity. To the best of our knowledge, ours is the first measurement approach for the integrity of dynamically generated bytecode. Moreover, J-IMA requires no modification to host systems and no access to source code.
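The basic integrity-measurement step, digesting a code blob and checking it against an expected whitelist, can be sketched generically as below. This is not J-IMA's actual mechanism; the class names and whitelist shape are invented for the example:

```python
import hashlib

def measure(bytecode: bytes) -> str:
    """Measurement: a cryptographic digest of a (possibly
    dynamically generated) bytecode blob."""
    return hashlib.sha256(bytecode).hexdigest()

def attest(measurements: dict, whitelist: dict) -> list:
    """Attestation check: report every measured class whose digest
    does not match the expected value."""
    return [name for name, digest in measurements.items()
            if whitelist.get(name) != digest]
```

A remote verifier holding the whitelist can then decide whether the running JVM executes only expected bytecode.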
Title: Controlling Cascading Failures in Interdependent Networks under Incomplete Knowledge
Authors: D. Z. Tootaghaj, N. Bartolini, Hana Khamfroush, T. L. Porta
In: 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 54-63. DOI: https://doi.org/10.1109/SRDS.2017.14
Abstract: Vulnerability due to the inter-connectivity of multiple networks has been observed in many complex networks. Previous work mainly focused on robust network design and on recovery strategies after sporadic or massive failures, assuming complete knowledge of failure locations. We focus on cascading failures involving the power grid and its communication network, with the consequent imprecision in damage assessment. We tackle the problem of mitigating an ongoing cascading failure and providing a recovery strategy. We propose a failure mitigation strategy in two steps: 1) once a cascading failure is detected, we limit further propagation by redistributing generator and load power; 2) we formulate a recovery plan that maximizes the total amount of power delivered to the demand loads during the recovery intervention. To cope with insufficient knowledge of damage locations, we introduce a new algorithm that determines consistent failure sets (CFS). We show that, given knowledge of the system state before the disruption, the CFS algorithm can find all consistent sets of unknown failures in polynomial time, provided that each connected component of the disrupted graph has at least one line whose failure status is known to the controller.
Title: DottedDB: Anti-Entropy without Merkle Trees, Deletes without Tombstones
Authors: Ricardo Gonçalves, Paulo Sérgio Almeida, Carlos Baquero, V. Fonte
In: 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 194-203. DOI: https://doi.org/10.1109/SRDS.2017.28
Abstract: To achieve high availability in the face of network partitions, many distributed databases adopt eventual consistency, allow temporary conflicts due to concurrent writes, and use some form of per-key logical clock to detect and resolve such conflicts. Furthermore, nodes synchronize periodically to ensure replica convergence, in a process called anti-entropy that normally uses Merkle trees. We present the design of DottedDB, a Dynamo-like key-value store that uses a novel node-wide logical clock framework to overcome three fundamental limitations of the state of the art: (1) it minimizes the metadata per key necessary to track causality, avoiding its growth even in the face of node churn; (2) it correctly and durably deletes keys, with no need for tombstones; and (3) it offers a lightweight anti-entropy mechanism to converge replicated data, avoiding the need for Merkle trees. We evaluate DottedDB against MerkleDB, an otherwise identical database that uses per-key logical clocks and Merkle trees for anti-entropy, to precisely measure the impact of the novel approach. Results show that causality metadata per object converges rapidly to a single id-counter pair; that distributed deletes are achieved correctly, without global coordination and with constant metadata; and that divergent nodes are synchronized faster, with a smaller memory footprint and less communication overhead than with Merkle trees.
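The per-key causality tracking that DottedDB compacts can be illustrated with the classic version-vector dominance check: a write conflicts with another exactly when neither version vector dominates the other. This is a generic sketch of the building block, not DottedDB's node-wide clock:

```python
def dominates(vv_a, vv_b):
    """True if version vector vv_a causally descends from
    (strictly dominates) vv_b. Vectors map node id -> counter."""
    return vv_a != vv_b and all(vv_a.get(node, 0) >= n
                                for node, n in vv_b.items())

def concurrent(vv_a, vv_b):
    """Two writes conflict when neither version vector dominates
    the other: they happened concurrently on different nodes."""
    return (vv_a != vv_b
            and not dominates(vv_a, vv_b)
            and not dominates(vv_b, vv_a))
```

The metadata problem the paper addresses is that such vectors can grow with node churn; DottedDB's contribution is keeping the per-object metadata at a single id-counter pair.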
Title: Hybrid-RC: Flexible Erasure Codes with Optimized Recovery Performance and Low Storage Overhead
Authors: Liuqing Ye, D. Feng, Yuchong Hu, Qing Liu
In: 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 124-133. DOI: https://doi.org/10.1109/SRDS.2017.17
Abstract: Erasure codes are widely used in practical storage systems to tolerate disk failures and prevent data loss. However, these codes require excessive disk I/O and network traffic to recover unavailable data, so their recovery performance is suboptimal. Among erasure codes, Minimum Storage Regenerating (MSR) codes achieve optimal repair bandwidth at minimum storage during recovery, but several open issues remain before they can be applied in real systems. In this paper, we present Hybrid Regenerating Codes (Hybrid-RC), a new family of erasure codes with optimized recovery performance and low storage overhead. The codes exploit the strengths of MSR codes to compute a subset of the data blocks, while other parity blocks are used for reliability maintenance. As a result, our design is near-optimal with respect to storage and network traffic. We show that Hybrid-RC reduces reconstruction cost by up to 21% compared to Local Reconstruction Codes (LRC) with the same storage overhead. Most importantly, in Hybrid-RC each block contributes only half the amount of data when handling a single block failure; the number of I/Os per block is therefore reduced by 50%, which helps balance the network load and reduce latency.
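The repair-cost problem motivating Hybrid-RC is easy to see with the simplest erasure code, a single XOR parity: recovering one lost block requires reading every surviving block in the stripe. A minimal sketch (Hybrid-RC itself builds on MSR and LRC constructions, not plain parity):

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def encode(data_blocks):
    """Append one XOR parity block to the stripe."""
    return data_blocks + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the lost block by XOR-ing every surviving block:
    note the repair reads k blocks to rebuild one."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

Regenerating codes reduce exactly this per-repair read volume, at the cost of more involved encoding.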
Title: A Horizontally Scalable and Reliable Architecture for Location-Based Publish-Subscribe
Authors: B. Chapuis, B. Garbinato, Lucas Mourot
In: 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 74-83. DOI: https://doi.org/10.1109/SRDS.2017.16
Abstract: With billions of connected users and objects, location-based services face a massive scalability challenge. We propose a horizontally scalable and reliable location-based publish/subscribe architecture that can be deployed on a cluster of commodity hardware. Like many modern location-based publish/subscribe systems, our architecture supports both moving publishers and moving subscribers. When a publication moves into the range of a subscription, the owner of that subscription is instantly notified via a server-initiated event, usually in the form of a push notification. Most existing solutions rely on classic indexing data structures, such as R-trees, and struggle to scale beyond a single computing unit. Our architecture introduces a multi-step routing mechanism that achieves horizontal scalability by efficiently combining range partitioning, consistent hashing, and a min-wise hashing agreement. In case of node failure, an active replication strategy ensures reliable delivery of publications throughout the multi-step routing mechanism. From an algorithmic perspective, we show that the number of messages required to compute a match is optimal in the execution model we consider, and that the number of routing steps is constant. Experimental results show that our method achieves high throughput and low latency and scales horizontally. For example, with a cluster of 200 nodes, our architecture can process up to 190,000 location updates per second for a fleet of nearly 1,900,000 moving entities, producing more than 130,000 matches per second.
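One ingredient of the multi-step routing above, consistent hashing, can be sketched with a virtual-node ring. This is an illustrative sketch of the generic technique, not the paper's exact scheme:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes: a key is owned by the
    first node clockwise from the key's hash, so removing a node only
    remaps the keys it owned."""

    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._h(f"{node}#{v}"), node)
            for node in nodes for v in range(vnodes)
        )
        self._hashes = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def lookup(self, key):
        i = bisect.bisect(self._hashes, self._h(key)) % len(self.ring)
        return self.ring[i][1]
```

The property that matters for horizontal scaling is stability: a membership change disturbs only the arcs adjacent to the joining or leaving node.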
Title: An End-To-End Log Management Framework for Distributed Systems
Authors: Pinjia He
In: 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 266-267. DOI: https://doi.org/10.1109/SRDS.2017.41
Abstract: Logs are widely employed to ensure the reliability of distributed systems, because they are often the only available data recording system runtime information. Compared with logs generated by traditional standalone systems, distributed system logs are often large-scale and highly complex, invalidating many existing log management methods. To address this problem, this paper describes and envisions an end-to-end log management framework for distributed systems. Specifically, the framework includes strategic logging placement, log collection, log parsing, interleaved-log mining, anomaly detection, and problem identification.
Title: Robust Multi-Resource Allocation with Demand Uncertainties in Cloud Scheduler
Authors: Jianguo Yao, Q. Lu, H. Jacobsen, Haibing Guan
In: 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 34-43. DOI: https://doi.org/10.1109/SRDS.2017.12
Abstract: A cloud scheduler manages multiple resources (e.g., CPU, GPU, memory, and storage) in a cloud platform to improve resource utilization and achieve cost efficiency for cloud providers. Optimal multi-resource allocation has become a key technique in cloud computing and has attracted increasing research attention. Existing multi-resource allocation methods assume that a job's resource demands are constant. However, these methods may not apply in a real cloud scheduler, because resource demands vary during a job's execution. In this paper, we study a robust multi-resource allocation problem under the uncertainties introduced by varying resource demands. To this end, the cost function is chosen as one of two multi-resource efficiency-fairness metrics, Fairness on Dominant Shares and Generalized Fairness on Jobs, and we model the resource demand uncertainties with three typical models: scenario demand uncertainty, box demand uncertainty, and ellipsoidal demand uncertainty. By solving an optimization problem, we obtain a robust multi-resource allocation under uncertainty for the cloud scheduler. Extensive simulations show that the proposed approach handles resource demand uncertainties and that the cloud scheduler runs in an optimized and robust manner.
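The "Fairness on Dominant Shares" metric builds on the notion of a job's dominant share, familiar from Dominant Resource Fairness. A minimal sketch of that building block, with invented capacities and demands; the min/max index below is a generic fairness measure, not necessarily the paper's exact formula:

```python
def dominant_share(demand, capacity):
    """A job's dominant share is the largest fraction of any single
    resource that its demand consumes."""
    return max(demand[r] / capacity[r] for r in demand)

def fairness(demands, capacity):
    """Min/max index over the jobs' dominant shares: 1.0 means every
    job receives an equal dominant share."""
    shares = [dominant_share(d, capacity) for d in demands]
    return min(shares) / max(shares)
```

Under demand uncertainty, each job's `demand` becomes a set (scenario, box, or ellipsoid) rather than a point, and the allocation is chosen to keep such a fairness objective robust over the whole set.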
Title: Incremental Elasticity for NoSQL Data Stores
Authors: Antonis Papaioannou, K. Magoutis
In: 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 174-183. DOI: https://doi.org/10.1109/SRDS.2017.26
Abstract: Service elasticity, the ability to rapidly expand or shrink service processing capacity on demand, has become a first-class property in the domain of infrastructure services. Scalable NoSQL data stores are the de facto choice for applications aiming at scalable, highly available data persistence. Elasticity in such data stores remains challenging, due to the complexity and performance impact of moving large amounts of data over the network to take advantage of new resources (servers). In this paper we propose incremental elasticity, a new mechanism that progressively increases processing capacity in a fine-grained manner during an elasticity action, by making sub-sections of the transferred data available for access on the new server before the full transfer completes. In addition, by scheduling the data transfers of an elasticity action in sequence (rather than simultaneously) between each pre-existing server involved and the new server, incremental elasticity leads to smoother elasticity actions, reducing their overall impact on performance.
Title: On the Robustness of a Neural Network
Authors: El Mahdi El Mhamdi, R. Guerraoui, Sébastien Rouault
In: 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pp. 84-93. DOI: https://doi.org/10.1109/SRDS.2017.21
Abstract: With the development of neural-network-based machine learning and its use in mission-critical applications, voices are rising against the black-box nature of neural networks, as it becomes crucial to understand their limits and capabilities. With the rise of neuromorphic hardware, it is even more critical to understand how a neural network, as a distributed system, tolerates failures of its computing nodes (neurons) and its communication channels (synapses). Experimentally assessing the robustness of neural networks would require testing all possible failures on all possible inputs, a quixotic venture that hits a combinatorial explosion for the former and the impossibility of gathering all possible inputs for the latter. In this paper, we prove an upper bound on the expected error of the output when a subset of neurons crashes. This bound involves dependencies on the network parameters that may be seen as too pessimistic in the average case: a polynomial dependency on the Lipschitz coefficient of the neurons' activation function, and an exponential dependency on the depth of the layer where a failure occurs. We back up our theoretical results with experiments illustrating the extent to which our prediction matches the dependencies between the network parameters and robustness. Our results show that the robustness of a neural network to the average crash can be estimated without testing the network on all failure configurations or accessing the training set used to train it, both of which are practically impossible requirements.
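The shape of such a bound, polynomial in the Lipschitz coefficient and exponential in depth, can be illustrated with a simple worst-case forward-error propagation. This is illustrative only: the weight norms and values are invented, and it is not the paper's exact bound:

```python
def error_bound(weight_norms, lipschitz, input_error):
    """Worst-case output perturbation: an error injected at the input of a
    stack of layers grows by at most (lipschitz * ||W_l||) per layer, so a
    failure d layers below the output is amplified exponentially in d."""
    bound = input_error
    for w in weight_norms:
        bound *= lipschitz * w
    return bound
```

With per-layer amplification above 1, the bound doubles (or worse) with every extra layer between the crashed neuron and the output, which is the exponential depth dependency described in the abstract.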