2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS): Latest Articles

Reconfiguring Parallel State Machine Replication

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Pub Date : 2017-09-01 DOI: 10.1109/SRDS.2017.23

E. Alchieri, F. Dotti, O. Mendizabal, F. Pedone

State Machine Replication (SMR) is a well-known technique to implement fault-tolerant systems. In SMR, servers are replicated and client requests are deterministically executed in the same order by all replicas. To improve performance in multi-processor systems, some approaches have proposed to parallelize the execution of non-conflicting requests. Such approaches perform remarkably well in workloads dominated by non-conflicting requests. Conflicting requests introduce expensive synchronization and result in considerable performance loss. Current approaches to parallel SMR define the degree of parallelism statically. However, it is often difficult to predict the best degree of parallelism for a workload and workloads experience variations that change their best degree of parallelism. This paper proposes a protocol to reconfigure the degree of parallelism in parallel SMR on-the-fly. Experiments show the gains due to reconfiguration and shed some light on the behavior of parallel and reconfigurable SMR.
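
The idea of running non-conflicting requests in parallel while keeping conflicting ones in delivery order can be sketched as follows. This is a minimal illustration, not the protocol from the paper: the `Request` model, the key-based conflict rule, and the `schedule` function are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """A client request in the agreed total order (hypothetical model)."""
    rid: int
    key: str
    is_write: bool


def conflicts(a: Request, b: Request) -> bool:
    # Assumed rule: same key and at least one of the two requests writes.
    return a.key == b.key and (a.is_write or b.is_write)


def schedule(ordered: list[Request]) -> list[list[int]]:
    """Group requests into stages; requests in one stage may run in parallel.

    Each request is placed one stage after the latest earlier request it
    conflicts with, so conflicting requests keep their delivery order.
    """
    stage_of: list[int] = []
    for i, req in enumerate(ordered):
        stage = 0
        for j in range(i):
            if conflicts(ordered[j], req):
                stage = max(stage, stage_of[j] + 1)
        stage_of.append(stage)
    stages: list[list[int]] = [[] for _ in range(max(stage_of, default=-1) + 1)]
    for req, stage in zip(ordered, stage_of):
        stages[stage].append(req.rid)
    return stages


log = [
    Request(1, "x", True),
    Request(2, "y", False),
    Request(3, "x", False),
    Request(4, "y", False),
    Request(5, "x", True),
]
print(schedule(log))  # → [[1, 2, 4], [3], [5]]
```

Because the schedule depends only on the delivered order, every replica that runs it derives the same stages. The degree of parallelism that is actually useful depends on how many requests land in each stage, which is the quantity a reconfigurable scheme would adapt to.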

Citations: 22

Runtime Measurement Architecture for Bytecode Integrity in JVM-Based Cloud

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Pub Date : 2017-09-01 DOI: 10.1109/SRDS.2017.39

Haihe Ba, Huaizhe Zhou, Jiangchun Ren, Zhiying Wang

While the Java Virtual Machine provides applications with safety properties that avoid memory corruption bugs, it still suffers from security flaws. Real-world exploits show that the current sandbox model can be bypassed. In this paper, we focus on bytecode integrity measurement in clouds to identify malicious execution, and we propose the J-IMA architecture to provide runtime measurement and remote attestation for bytecode integrity. To the best of our knowledge, our work is the first measurement approach for the integrity of dynamically generated bytecode. Moreover, J-IMA requires neither modification of host systems nor access to source code.

Citations: 2

Controlling Cascading Failures in Interdependent Networks under Incomplete Knowledge

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Pub Date : 2017-09-01 DOI: 10.1109/SRDS.2017.14

D. Z. Tootaghaj, N. Bartolini, Hana Khamfroush, T. L. Porta

Vulnerability due to the interconnectivity of multiple networks has been observed in many complex networks. Previous work has mainly focused on robust network design and on recovery strategies after sporadic or massive failures, assuming complete knowledge of the failure locations. We focus on cascading failures involving the power grid and its communication network, with consequent imprecision in damage assessment. We tackle the problem of mitigating the ongoing cascading failure and providing a recovery strategy. We propose a failure mitigation strategy in two steps: 1) Once a cascading failure is detected, we limit further propagation by redistributing generator and load power. 2) We formulate a recovery plan to maximize the total amount of power delivered to the demand loads during the recovery intervention. Our approach to coping with insufficient knowledge of damage locations is based on a new algorithm to determine consistent failure sets (CFS). We show that, given knowledge of the system state before the disruption, the CFS algorithm can find all consistent sets of unknown failures in polynomial time, provided that each connected component of the disrupted graph has at least one line whose failure status is known to the controller.

Citations: 15

DottedDB: Anti-Entropy without Merkle Trees, Deletes without Tombstones

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Pub Date : 2017-09-01 DOI: 10.1109/SRDS.2017.28

Ricardo Gonçalves, Paulo Sérgio Almeida, Carlos Baquero, V. Fonte

To achieve high availability in the face of network partitions, many distributed databases adopt eventual consistency, allow temporary conflicts due to concurrent writes, and use some form of per-key logical clock to detect and resolve such conflicts. Furthermore, nodes synchronize periodically to ensure replica convergence in a process called anti-entropy, normally using Merkle Trees. We present the design of DottedDB, a Dynamo-like key-value store, which uses a novel node-wide logical clock framework, overcoming three fundamental limitations of the state of the art: (1) minimize the metadata per key necessary to track causality, avoiding its growth even in the face of node churn; (2) correctly and durably delete keys, with no need for tombstones; (3) offer a lightweight anti-entropy mechanism to converge replicated data, avoiding the need for Merkle Trees. We evaluate DottedDB against MerkleDB, an otherwise identical database, but using per-key logical clocks and Merkle Trees for anti-entropy, to precisely measure the impact of the novel approach. Results show that: causality metadata per object always converges rapidly to only one id-counter pair; distributed deletes are correctly achieved without global coordination and with constant metadata; divergent nodes are synchronized faster, with less memory-footprint and with less communication overhead than using Merkle Trees.
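
One way to picture anti-entropy driven by a node-wide clock, rather than by comparing Merkle Trees, is to let each node record which write identifiers it has seen from every origin node and exchange only the difference. The sketch below assumes a simplified representation (a dictionary from node id to a set of counters) and a hypothetical helper `missing_dots`; it does not reproduce DottedDB's actual data structures.

```python
def missing_dots(local_clock, remote_clock):
    """Return the (node_id, counter) pairs the remote has seen but we have not.

    Both clocks map a node id to the set of write counters seen from that node
    (an assumed, uncompressed representation chosen for clarity).
    """
    missing = set()
    for node_id, counters in remote_clock.items():
        seen = local_clock.get(node_id, set())
        missing |= {(node_id, c) for c in counters - seen}
    return missing


local = {"a": {1, 2, 3}, "b": {1}}
remote = {"a": {1, 2}, "b": {1, 2}, "c": {1}}

# The local node only needs to fetch the writes it lacks.
print(sorted(missing_dots(local, remote)))  # → [('b', 2), ('c', 1)]
```

The point of the sketch is that the synchronization cost scales with the number of missing writes rather than with the size of the key space.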

Citations: 5

Hybrid-RC: Flexible Erasure Codes with Optimized Recovery Performance and Low Storage Overhead

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Pub Date : 2017-09-01 DOI: 10.1109/SRDS.2017.17

Liuqing Ye, D. Feng, Yuchong Hu, Qing Liu

Erasure codes are widely used in practical storage systems to protect against disk failures and data loss. However, these codes require excessive disk I/Os and network traffic for recovering unavailable data. As a result, the recovery performance of erasure codes is suboptimal. Among all erasure codes, Minimum Storage Regenerating (MSR) codes can achieve optimal repair bandwidth under the minimum storage during recovery, but some open issues remain to be addressed before applying them in real systems. In this paper, we present Hybrid Regenerating Codes (Hybrid-RC), a new set of erasure codes with optimized recovery performance and low storage overhead. The codes exploit the advantages of MSR codes to compute a subset of data blocks, while some other parity blocks are used for reliability maintenance. As a result, our design is near-optimal with respect to storage and network traffic. We show that Hybrid-RC reduces the reconstruction cost by up to 21% compared to Local Reconstruction Codes (LRC) with the same storage overhead. Most importantly, in Hybrid-RC, each block contributes only half the amount of data when processing a single block failure. Therefore, the number of I/Os consumed per block is reduced by 50%, which helps balance the network load and reduce latency.
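
To make the recovery cost concrete, the sketch below uses the simplest possible erasure code, a single XOR parity block, which is neither an MSR code nor Hybrid-RC. It only illustrates why repairing one lost block requires reading every surviving block in full; the helper name `xor_blocks` and the sample data are assumptions.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data)

# Simulate losing block 1 and rebuild it from the survivors plus the parity.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
print(recovered)  # → b'efgh'
```

Here the repair reads three full blocks to rebuild one. Regenerating codes aim to lower that amount of data read and transferred per repair.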

Citations: 6

A Horizontally Scalable and Reliable Architecture for Location-Based Publish-Subscribe

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Pub Date : 2017-09-01 DOI: 10.1109/SRDS.2017.16

B. Chapuis, B. Garbinato, Lucas Mourot

With billions of connected users and objects, location-based services face a massive scalability challenge. We propose a horizontally scalable and reliable location-based publish/subscribe architecture that can be deployed on a cluster made of commodity hardware. Like many modern location-based publish/subscribe systems, our architecture supports moving publishers as well as moving subscribers. When a publication moves into the range of a subscription, the owner of this subscription is instantly notified via a server-initiated event, usually in the form of a push notification. To achieve this, most existing solutions rely on classic indexing data structures, such as R-trees, and they struggle to scale beyond a single computing unit. Our architecture introduces a multi-step routing mechanism that, to achieve horizontal scalability, efficiently combines range partitioning, consistent hashing and a min-wise hashing agreement. In case of node failure, an active replication strategy ensures reliable delivery of publications throughout the multi-step routing mechanism. From an algorithmic perspective, we show that the number of messages required to compute a match is optimal in the execution model we consider and that the number of routing steps is constant. Using experimental results, we show that our method achieves high throughput and low latency, and scales horizontally. For example, with a cluster made of 200 nodes, our architecture can process up to 190,000 location updates per second for a fleet of nearly 1,900,000 moving entities, producing more than 130,000 matches per second.
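
Of the three routing ingredients named in the abstract, consistent hashing is the easiest to show in isolation. The following is a minimal sketch under assumptions: the `HashRing` class, the number of virtual points per node, and the node and key names are hypothetical, and the paper's range partitioning and min-wise hashing steps are not modelled.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    # Map a string to a ring position using a stable hash.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, replicas=64):
        self._replicas = replicas  # virtual points per node
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._replicas):
            p = _point(f"{node}#{i}")
            self._owner[p] = node
            bisect.insort(self._points, p)

    def lookup(self, key):
        # First ring position clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("vehicle-42")  # one of the three nodes
```

The property that matters for horizontal scaling is that adding a node reassigns only the keys that now fall to the new node, while every other key keeps its owner.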

Citations: 8

An End-To-End Log Management Framework for Distributed Systems

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Pub Date : 2017-09-01 DOI: 10.1109/SRDS.2017.41

Pinjia He

Logs have been widely employed to ensure the reliability of distributed systems, because logs are often the only data available that records system runtime information. Compared with logs generated by traditional standalone systems, distributed system logs are often large-scale and of great complexity, invalidating many existing log management methods. To address this problem, the paper describes and envisions an end-to-end log management framework for distributed systems. Specifically, this framework includes strategic logging placement, log collection, log parsing, interleaved logs mining, anomaly detection, and problem identification.

Citations: 5

Robust Multi-Resource Allocation with Demand Uncertainties in Cloud Scheduler

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Pub Date : 2017-09-01 DOI: 10.1109/SRDS.2017.12

Jianguo Yao, Q. Lu, H. Jacobsen, Haibing Guan

A cloud scheduler manages multiple resources (e.g., CPU, GPU, memory, and storage) in a cloud platform to improve resource utilization and achieve cost-efficiency for cloud providers. Optimal multi-resource allocation has become a key technique in cloud computing and has attracted growing research attention. Existing multi-resource allocation methods assume that a job has constant demands for each resource. However, these methods may not apply in a real cloud scheduler because resource demands change dynamically during job execution. In this paper, we study a robust multi-resource allocation problem with uncertainties brought by varying resource demands. To this end, the cost function is chosen as either of two multi-resource efficiency-fairness metrics, called Fairness on Dominant Shares and Generalized Fairness on Jobs, and we model the resource demand uncertainties through three typical models: scenario demand uncertainty, box demand uncertainty and ellipsoidal demand uncertainty. By solving an optimization problem, we obtain the solution for robust multi-resource allocation with uncertainties for the cloud scheduler. Extensive simulations show that the proposed approach can handle the resource demand uncertainties and that the cloud scheduler runs in an optimized and robust manner.
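
The effect of box demand uncertainty can be illustrated with a small calculation. This sketch is not the paper's optimization problem: it assumes identical tasks, a box model in which each demand may deviate from its nominal value by a fixed amount, and hypothetical helper names and capacities.

```python
def worst_case_demand(nominal, deviation):
    """Under box uncertainty each demand lies in [nominal - dev, nominal + dev];
    the worst case for feasibility is the upper corner of the box."""
    return [n + d for n, d in zip(nominal, deviation)]


def dominant_share(demand, capacity):
    """Largest fraction of any single resource that one task consumes."""
    return max(d / c for d, c in zip(demand, capacity))


def max_tasks(demand, capacity):
    """Number of identical tasks that fit within every resource capacity."""
    return min(int(c // d) for d, c in zip(demand, capacity) if d > 0)


capacity = [9, 18]      # assumed: 9 CPUs, 18 GB of memory
nominal = [1, 4]        # assumed per-task demand: 1 CPU, 4 GB
deviation = [0.5, 2]    # assumed half-widths of the uncertainty box

robust = worst_case_demand(nominal, deviation)   # [1.5, 6]
print(max_tasks(nominal, capacity))              # → 4
print(max_tasks(robust, capacity))               # → 3
```

Planning for the nominal demand admits four tasks, but only three remain feasible for every demand inside the box, which is the gap a robust allocation has to account for.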

Citations: 17

Incremental Elasticity for NoSQL Data Stores

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Pub Date : 2017-09-01 DOI: 10.1109/SRDS.2017.26

Antonis Papaioannou, K. Magoutis

Service elasticity, the ability to rapidly expand or shrink service processing capacity on demand, has become a first-class property in the domain of infrastructure services. Scalable NoSQL data stores are the de-facto choice of applications aiming for scalable, highly available data persistence. The elasticity of such data stores is still challenging, due to the complexity and performance impact of moving large amounts of data over the network to take advantage of new resources (servers). In this paper we propose incremental elasticity, a new mechanism that progressively increases processing capacity in a fine-grain manner during an elasticity action by making sub-sections of the transferred data available for access on the new server, prior to completing the full transfer. In addition, by scheduling data transfers during an elasticity action in sequence (rather than as simultaneous transfers) between each pre-existing server involved and the new server, incremental elasticity leads to smoother elasticity actions, reducing their overall impact on performance.

Citations: 1

On the Robustness of a Neural Network

2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Pub Date : 2017-07-25 DOI: 10.1109/SRDS.2017.21

El Mahdi El Mhamdi, R. Guerraoui, Sébastien Rouault

With the development of neural-network-based machine learning and its usage in mission-critical applications, voices are rising against the black-box aspect of neural networks, as it becomes crucial to understand their limits and capabilities. With the rise of neuromorphic hardware, it is even more critical to understand how a neural network, as a distributed system, tolerates the failures of its computing nodes (neurons) and its communication channels (synapses). Experimentally assessing the robustness of neural networks involves the quixotic venture of testing all the possible failures on all the possible inputs, which ultimately hits a combinatorial explosion for the first, and the impossibility of gathering all the possible inputs for the second. In this paper, we prove an upper bound on the expected error of the output when a subset of neurons crashes. This bound involves dependencies on the network parameters that can be seen as being too pessimistic in the average case. It involves a polynomial dependency on the Lipschitz coefficient of the neurons' activation function, and an exponential dependency on the depth of the layer where a failure occurs. We back up our theoretical results with experiments illustrating the extent to which our prediction matches the dependencies between the network parameters and robustness. Our results show that the robustness of neural networks to the average crash can be estimated without needing either to test the network on all failure configurations or to access the training set used to train the network, both of which are practically impossible requirements.
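
A toy case shows what an output-error bound under neuron crashes looks like. The sketch below is far simpler than the bound proved in the paper: it assumes a single hidden layer of tanh neurons feeding a linear output, models a crash as the neuron emitting 0, and uses only the fact that |tanh| is at most 1. All function names and weights are hypothetical.

```python
import math


def forward(x, hidden_weights, output_weights, crashed=frozenset()):
    """One hidden layer of tanh neurons followed by a linear output.

    Neurons whose index is in `crashed` emit 0, modelling a crash failure.
    """
    total = 0.0
    for i, (w_in, w_out) in enumerate(zip(hidden_weights, output_weights)):
        if i in crashed:
            continue
        activation = math.tanh(sum(w * xi for w, xi in zip(w_in, x)))
        total += w_out * activation
    return total


def crash_error_bound(output_weights, crashed):
    # |tanh| <= 1, so losing neuron i shifts the output by at most |w_out[i]|.
    return sum(abs(output_weights[i]) for i in crashed)


x = [0.5, -1.0]
hidden = [[1.0, 2.0], [-0.5, 0.3], [0.8, 0.8]]
out = [0.7, -1.2, 0.4]
crashed = {1}

error = abs(forward(x, hidden, out) - forward(x, hidden, out, crashed))
bound = crash_error_bound(out, crashed)
print(error <= bound)  # → True
```

The bound holds for every input without enumerating inputs or failure configurations, which is the kind of guarantee the abstract describes. In deeper networks the error of a crashed neuron propagates through later layers, which is where the dependence on depth and on the Lipschitz coefficient enters.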

Citations: 19
