Deploying field-programmable gate arrays (FPGAs) in the cloud to accelerate explosively growing server workloads is becoming a clear trend. However, reducing the cost of accelerator design and deployment remains difficult with conventional development methods and tools. In previous work, we proposed the hCODE platform to simplify the design, sharing, and deployment of FPGA accelerators; it adopted a shell-and-IP design pattern and provided supporting tools to improve the reusability and portability of accelerator designs. In this paper, building on that work, we propose new design methods and tools for FPGA virtualization and scheduling that allow IPs to be deployed at cluster scale at low cost. With the proposed platform, users can easily deploy multiple accelerators on one FPGA to improve the utilization of on-chip resources and communication bandwidth.
"A Study of FPGA Virtualization and Accelerator Scheduling," Qian Zhao, M. Iida, T. Sueyoshi. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129503
Customized architecture is one of the technical roads toward exascale high-performance computing. We will give an overview of FPGA-based customized architectures. Research experiences with accelerators for deep learning algorithms (data analysis), footprint and cipher algorithms (information processing), and matrix processing algorithms (scientific computing) will be discussed.
"Customized Architecture Technology for High Performance Computing," Jingfei Jiang. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129500
The recent adoption of the OpenCL programming model by FPGA vendors has realized function portability of OpenCL workloads on FPGAs. However, poor performance portability prevents its wide adoption. To harness the power of FPGAs under the OpenCL programming model, it is advantageous to design an analytical performance model that estimates the performance of OpenCL workloads on FPGAs and provides insight into the performance bottlenecks of the OpenCL model on FPGA architectures. In the first part of the talk, we present FlexCL, an analytical performance model for OpenCL workloads on flexible FPGAs. FlexCL estimates overall performance by tightly coupling the global-memory and on-chip computation models according to the communication mode. Then, we present an application study of mapping stencil applications onto FPGAs using the OpenCL programming model.
"Programming FPGAs Using OpenCL from Performance Model to Application Study," Yun Liang. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129502
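The coupling of compute and memory models described above can be illustrated with a toy bound-based estimate. This is an illustrative sketch only, not FlexCL's actual model; the function and parameter names are assumptions.

```python
# Hypothetical sketch in the spirit of an analytical FPGA performance
# model: kernel time is bounded by whichever of computation or
# global-memory traffic dominates, depending on whether the two overlap.

def estimate_kernel_time(ops, bytes_moved, compute_throughput,
                         mem_bandwidth, overlapped=True):
    """Return estimated execution time in seconds.

    ops                -- total operations the kernel performs
    bytes_moved        -- bytes read/written to global memory
    compute_throughput -- operations per second the pipeline sustains
    mem_bandwidth      -- bytes per second of the memory interface
    overlapped         -- True if computation hides memory transfers
    """
    t_compute = ops / compute_throughput
    t_memory = bytes_moved / mem_bandwidth
    if overlapped:
        return max(t_compute, t_memory)   # pipelined: slower side dominates
    return t_compute + t_memory           # serialized: phases add up

# Example: 1e9 ops at 100 Gop/s vs. 8 GB moved at 10 GB/s --
# the kernel is memory-bound, so the estimate is 0.8 s when overlapped.
t = estimate_kernel_time(1e9, 8e9, 100e9, 10e9, overlapped=True)
```

Such a bound also reveals the bottleneck: if the memory term dominates, optimizing the compute pipeline further cannot help.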
Huming Zhu, J. Kou, Linyan Qiu, Yuqi Guo, Mingwei Niu, Maoguo Gong, L. Jiao
Distributed processing frameworks have been widely used in the remote-sensing field. Spark, a popular distributed computing framework, has been used to process big remote-sensing data. However, it is inefficient because the application is not only data intensive but also computation intensive. For example, in Synthetic Aperture Radar (SAR) image change detection, clustering analysis consumes substantial computing time and memory when processing big remote-sensing data. Coprocessors (GPU, MIC, etc.) offer high compute power and can handle computation-intensive tasks. In this paper, we propose an OpenCL-enabled Spark framework to accelerate the Kernel Fuzzy C-Means (KFCM) algorithm for SAR image change detection. The computation-intensive operations of KFCM are offloaded to the cluster's coprocessors through the proposed OpenCL-enabled Spark framework. Experimental results on real SAR images indicate that the OpenCL-enabled Spark implementation is efficient and scalable.
"Distributed SAR Image Change Detection with OpenCL-Enabled Spark," Huming Zhu, J. Kou, Linyan Qiu, Yuqi Guo, Mingwei Niu, Maoguo Gong, L. Jiao. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129495
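For readers unfamiliar with KFCM, the dense per-pixel membership and center updates that make the algorithm computation intensive look roughly like the following single-node NumPy sketch. The Gaussian kernel choice, variable names, and default parameters are assumptions for illustration; the paper's OpenCL-enabled Spark framework partitions this work across the cluster's coprocessors.

```python
import numpy as np

def kfcm_step(X, C, m=2.0, sigma=1.0):
    """One KFCM iteration. X: (n, d) samples, C: (c, d) cluster centers."""
    # Gaussian kernel between every sample and every center: shape (n, c)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / sigma ** 2)
    # Kernel-induced distance (1 - K), clipped to avoid division by zero
    dist = np.maximum(1.0 - K, 1e-12)
    # Fuzzy memberships: each row sums to 1 across the clusters
    inv = dist ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)
    # Kernel-weighted center update in input space
    W = (U ** m) * K                                  # (n, c) weights
    C_new = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, C_new
```

The all-pairs kernel evaluation is O(n·c·d) per iteration over millions of pixels, which is exactly the kind of regular, data-parallel arithmetic that maps well to GPU/MIC coprocessors.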
This presentation first points out the dilemma of the traditional FPGA industry, and then argues that flexible, easy-to-use cloud services are a feasible way to resolve FPGA's difficulties. Tencent's architecture tries to solve the puzzle of automatic FPGA cloud-service generation using the idea of API-as-a-service. To achieve this goal, Tencent has released an HDK, an SDK, and the Tencent Computing Service (TCS) platform to help developers automatically convert their APIs into cloud services.
"TCS: FaaS (FPGA as a service)," Jianlin Gao. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129499
Novel rack-level interconnects are urgently required to support frequent inter-server communication in emerging large-scale distributed in-memory applications. In this paper, we introduce DoCE, a memory-semantic fabric built via Direct Extension of On-chip Interconnect (DEOI) over Converged Ethernet. Based on its architectural support for fine-grained remote memory sharing, DoCE provides a 9.6x speedup for a distributed implementation of the PageRank algorithm on our dual-node ARM SoC-FPGA prototype versus a conventional TCP/IP-based solution. To the best of our knowledge, DoCE is the first implementation and prototype of a memory-semantic fabric over existing Ethernet infrastructure in the ARM ecosystem.
"DoCE: Direct Extension of On-Chip Interconnects over Converged Ethernet for Rack-Scale Memory Sharing," Yisong Chang, Ran Zhao, Lei Yu, Ke Zhang. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129504
Recent years have seen a rapidly growing cloud computing market. Massive numbers of enterprise applications, such as social networking, e-commerce, video streaming, email, web search, MapReduce, and Spark, are moving to cloud systems. These applications often require tens or hundreds of tasks or micro-services to complete, and need to handle billions of visits per day while processing unprecedented volumes of data. At the same time, they need to deliver quick and predictable response times to their users. However, performance predictability has always been one of the biggest challenges in cloud computing. Despite many optimizations and improvements in both hardware and software, the distribution of latencies for Google's back-end services shows that while the majority of requests take around 50-60 ms, a significant fraction takes longer than 100 ms, with the largest difference being almost 600x [10]. This large variance hurts the quality of experience (QoE) for users and directly leads to revenue losses as well as increased operational costs. Google's study shows that if response time increases from 0.4 s to 0.9 s, traffic and ad revenues drop by 20% [1]. Amazon also reports that every 100 ms increase in response time cuts sales by 1% [4]. According to Nielsen [14], (i) 0.1 second is about the limit for the user to feel that the system is reacting instantaneously; (ii) 1.0 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay; and (iii) 10 seconds is about the limit for keeping the user's attention focused on the dialogue, beyond which users will want to perform other tasks while waiting for the computer to finish. In this sense, "slow response" and "service unavailable" seem to be the same for cloud users.
Currently, major cloud providers such as Amazon, Microsoft, and Google merely state an uptime availability guarantee in their Service Level Agreements (SLAs), but never guarantee QoE (e.g., response time). Since traditional availability is defined by the failure/repair behavior of cloud services, it clearly cannot satisfy users' requirements for quick response times. The root cause is that the complex and diverse uncertain behaviors in cloud systems make performance predictability very difficult. In general, these uncertainties have two main characteristics:
• Diversity: Uncertainties in cloud systems come from many diverse sources, including the hardware layer (e.g., failures, system resource competition, network resource competition) and the software layer (e.g., scheduling algorithms, software bugs, unexpected workloads, loss of data) [9].
• Transmissibility: An uncertainty may not only affect a single service but also degrade the performance of a chain of services or other co-located applications. For example, the loss of a piece of intermediate data would require the re-generation of that data from its parent tasks.
"Slow or Down?: Seem to Be the Same for Cloud Users," Laiping Zhao, Xiaobo Zhou. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129496
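The fan-out structure described above, where one request triggers tens or hundreds of tasks or micro-services, is what makes per-server tail latency so damaging: the request waits on its slowest sub-request. A back-of-the-envelope sketch (the numbers are illustrative, not from the cited measurements):

```python
# If each back-end server independently has a small chance of a slow
# response, a request fanned out to many servers is slow whenever at
# least one of them is slow.

def p_request_slow(p_server_slow, fanout):
    """Probability that at least one of `fanout` sub-requests is slow,
    assuming independent per-server tail behavior."""
    return 1.0 - (1.0 - p_server_slow) ** fanout

# Suppose 1% of responses per server exceed 100 ms and a page request
# touches 100 servers: roughly 63% of page loads then hit the tail.
p = p_request_slow(0.01, 100)
```

This is why shaving the 99th percentile of individual services matters far more at scale than improving their median latency.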
Microsoft has clearly made the case for using FPGAs at scale in the cloud, and Intel has committed to leveraging the benefits of hardware acceleration with its acquisition of Altera. However, we still cannot use FPGAs with the ease we have with software-based systems, let alone do so easily at cloud scale. High-level synthesis is necessary for making FPGAs accessible, but it is not sufficient. Making FPGAs easy to use for computation requires more than accessible tools for creating hardware targeted at FPGAs. The software computing world has a wealth of taken-for-granted, often invisible, high-quality open-source infrastructure that is missing for FPGAs as computing devices. The problem is compounded when we want to use FPGAs at the scale of the cloud. I will present the need for common infrastructure and abstraction layers to support the use of FPGAs for computing at scale, and describe relevant work at the University of Toronto that can contribute toward an open-source framework for the use and deployment of FPGAs at scale.
"Building the Reconfigurable Cloud Ecosystem," P. Chow. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129501
Software Defined Networking (SDN) greatly simplifies network management and introduces unprecedented flexibility by decoupling control functions from the network data plane. However, this decoupling also opens up a range of questions that are not yet well addressed, e.g., scalability issues and security concerns. This talk first describes the background of SDN and the abstraction SDN offers today, and then presents the scalability and security problems along with our ongoing research progress. Promising future directions will also be discussed.
"Rethinking the SDN Abstraction," Chengchen Hu. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129498
Cloud computing is an important infrastructure for many enterprises. After 10 years of development, cloud computing has achieved great success and has greatly changed the economy, society, science, and industry. In particular, with the rapid development of the mobile Internet and big-data technology, almost all online services and data services are built on top of cloud computing, such as the online banking services provided by banks, the electronic services provided by the news media, the cloud information systems provided by government departments, and the mobile services provided by communications companies. In addition, tens of thousands of start-ups rely on cloud computing services. Therefore, ensuring cloud reliability is very important and essential. However, the reality is that current cloud systems are not reliable enough. On February 28th, 2017, Amazon Web Services, the popular storage and hosting platform used by a huge range of companies, experienced an S3 service interruption for 4 hours in the Northern Virginia (US-EAST-1) Region, and the outage then quickly spread to other online service providers that rely on the S3 service [2]. This failure caused a huge economic loss, because cloud computing service providers typically sign a Service Level Agreement (SLA) with customers. For example, when customers require 99.99% availability, the service must meet that requirement 99.99% of the time, 365 days per year; if downtime exceeds 0.01%, compensation is required. In fact, with the continuous development and maturity of cloud computing, a large number of traditional business systems have been deployed on cloud platforms.
Cloud computing integrates existing hardware resources through virtualization technology to create a shared resource pool that lets applications obtain computing, storage, and network resources on demand, effectively enhancing the scalability and resource utilization of traditional IT infrastructures and significantly reducing the operating cost of traditional business systems. However, with the growing number of applications running on the cloud, the scale of cloud data centers has kept expanding, and current cloud computing systems have become very complex, mainly reflected in: 1) Large scale. A typical data center involves more than 100,000 servers and 10,000 switches, and more nodes usually mean a higher probability of failure. 2) Complex application structure. Web search, e-commerce, and other typical cloud programs exhibit complex interactive behavior; for example, an Amazon page request involves interaction with hundreds of components [7], and an error in any one component will make the whole application anomalous. 3) Shared resource pattern. One of the basic features of cloud computing is resource sharing: a typical server in a Google cloud data center hosts 5 to 18 applications simultaneously, running about 10.69 applications on average [5]. Resource competition will interfere with application performance.
The complexity of these cloud computing systems, the complexity of application interaction structures, and the sharing model inherent to cloud platforms make cloud systems more prone to performance anomalies than traditional platforms. It can be said that in cloud computing, anomalies are the norm [3]. Further analysis shows that resource competition, resource bottlenecks, misconfiguration, software defects, hardware failures, external attacks, and other causes can all lead to cloud system anomalies or failures. A performance anomaly is a sudden performance degradation that deviates from the system's normal behavior. Unlike an outage, which stops the system immediately, a performance anomaly usually manifests as reduced system efficiency; misconfiguration, software defects, and hardware failures often lead to such anomalies. For cloud computing systems, detecting only outages or other functional anomalies is not enough, because such anomalies usually cause service interruptions and can be resolved simply by restarting or replacing hardware. Performance anomalies caused by resource sharing and interference deserve more attention [4], because they can be eliminated before the business is interrupted, keeping services running continuously. If performance anomalies in a cloud computing system are not handled promptly, the consequences can be very serious: they not only affect the normal operation of business systems but also discourage enterprises from deploying their business on cloud systems. Timely elimination of performance anomalies is especially important for latency-sensitive cloud applications. For example, Amazon found that every 100 ms of delay cuts sales by 1%, and Google found that added delay cuts traffic by 20%.
"Anomaly Detection in Clouds: Challenges and Practice," Kejiang Ye. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129497
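The 99.99% availability requirement discussed above translates into a concrete yearly downtime budget, which a one-line computation makes vivid and which shows why a single 4-hour outage like the S3 incident blows through such an SLA. The formula is the standard availability arithmetic, not taken from the paper.

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (non-leap year)

def downtime_budget_minutes(availability):
    """Minutes of downtime per year permitted by an availability SLA."""
    return (1.0 - availability) * MINUTES_PER_YEAR

# 99.99% availability allows only about 52.6 minutes of downtime per
# year; a 4-hour (240-minute) outage exceeds that budget several times over.
budget = downtime_budget_minutes(0.9999)
```

By the same arithmetic, 99.9% allows about 8.8 hours per year and 99.999% only about 5.3 minutes, which is why each extra "nine" is so costly to provide.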