Deploying field-programmable gate arrays (FPGAs) in the cloud to accelerate explosively growing server workloads is becoming a clear trend. However, reducing the cost of accelerator design and deployment remains difficult with conventional development methods and tools. In previous work, we proposed the hCODE platform to simplify the design, sharing, and deployment of FPGA accelerators; it adopted a shell-and-IP design pattern and provided supporting tools to improve the reusability and portability of accelerator designs. In this paper, building on that work, we propose new design methods and tools for FPGA virtualization and scheduling that allow IPs to be deployed at cluster scale at low cost. With the proposed platform, users can easily deploy multiple accelerators on one FPGA to improve the utilization of on-chip resources and communication bandwidth.
"A Study of FPGA Virtualization and Accelerator Scheduling," Qian Zhao, M. Iida, T. Sueyoshi. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129503
Customized architecture is one of the technical roads toward exascale high-performance computing. We will give an overview of FPGA-based customized architectures. Research experiences with accelerators for deep learning algorithms (data analysis), footprint and cipher algorithms (information processing), and matrix processing algorithms (scientific computing) will be discussed.
"Customized Architecture Technology for High Performance Computing," Jingfei Jiang. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129500
The recent adoption of the OpenCL programming model by FPGA vendors has realized function portability of OpenCL workloads on FPGAs. However, poor performance portability prevents its wide adoption. To harness the power of FPGAs under the OpenCL programming model, it is advantageous to design an analytical performance model that estimates the performance of OpenCL workloads on FPGAs and provides insight into the performance bottlenecks of the OpenCL model on FPGA architectures. In the first part of the talk, we present FlexCL, an analytical performance model for OpenCL workloads on flexible FPGAs. FlexCL estimates overall performance by tightly coupling the global-memory and on-chip computation models according to the communication mode. Then, we present an application study of mapping stencil applications onto FPGAs using the OpenCL programming model.
"Programming FPGAs Using OpenCL from Performance Model to Application Study," Yun Liang. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129502
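The coupling of compute and memory models described above can be illustrated with a toy bound-based estimate. This is an illustrative sketch only, not FlexCL's actual model; the function and parameter names are assumptions.

```python
# Hypothetical sketch in the spirit of an analytical FPGA performance
# model: kernel time is bounded by whichever of computation or
# global-memory traffic dominates, depending on whether the two overlap.

def estimate_kernel_time(ops, bytes_moved, compute_throughput,
                         mem_bandwidth, overlapped=True):
    """Return estimated execution time in seconds.

    ops                -- total operations the kernel performs
    bytes_moved        -- bytes read/written to global memory
    compute_throughput -- operations per second the pipeline sustains
    mem_bandwidth      -- bytes per second of the memory interface
    overlapped         -- True if computation hides memory transfers
    """
    t_compute = ops / compute_throughput
    t_memory = bytes_moved / mem_bandwidth
    if overlapped:
        return max(t_compute, t_memory)   # pipelined: slower side dominates
    return t_compute + t_memory           # serialized: phases add up

# Example: 1e9 ops at 100 Gop/s vs. 8 GB moved at 10 GB/s --
# the kernel is memory-bound, so the estimate is 0.8 s when overlapped.
t = estimate_kernel_time(1e9, 8e9, 100e9, 10e9, overlapped=True)
```

Such a bound also reveals the bottleneck: if the memory term dominates, optimizing the compute pipeline further cannot help.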
Huming Zhu, J. Kou, Linyan Qiu, Yuqi Guo, Mingwei Niu, Maoguo Gong, L. Jiao
Distributed processing frameworks have been widely used in the remote-sensing field. Spark, a popular distributed computing framework, has been used to process big remote-sensing data. However, it is inefficient because the application is not only data intensive but also computation intensive. For example, in Synthetic Aperture Radar (SAR) image change detection, clustering analysis consumes substantial computing time and memory when processing big remote-sensing data. Coprocessors (GPU, MIC, etc.) offer high compute power and can handle computation-intensive tasks. In this paper, we propose an OpenCL-enabled Spark framework to accelerate the Kernel Fuzzy C-Means (KFCM) algorithm for SAR image change detection. The computation-intensive operations of KFCM are offloaded to the cluster's coprocessors through the proposed OpenCL-enabled Spark framework. Experimental results on real SAR images indicate that the OpenCL-enabled Spark implementation is efficient and scalable.
"Distributed SAR Image Change Detection with OpenCL-Enabled Spark," Huming Zhu, J. Kou, Linyan Qiu, Yuqi Guo, Mingwei Niu, Maoguo Gong, L. Jiao. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129495
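For readers unfamiliar with KFCM, the dense per-pixel membership and center updates that make the algorithm computation intensive look roughly like the following single-node NumPy sketch. The Gaussian kernel choice, variable names, and default parameters are assumptions for illustration; the paper's OpenCL-enabled Spark framework partitions this work across the cluster's coprocessors.

```python
import numpy as np

def kfcm_step(X, C, m=2.0, sigma=1.0):
    """One KFCM iteration. X: (n, d) samples, C: (c, d) cluster centers."""
    # Gaussian kernel between every sample and every center: shape (n, c)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / sigma ** 2)
    # Kernel-induced distance (1 - K), clipped to avoid division by zero
    dist = np.maximum(1.0 - K, 1e-12)
    # Fuzzy memberships: each row sums to 1 across the clusters
    inv = dist ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)
    # Kernel-weighted center update in input space
    W = (U ** m) * K                                  # (n, c) weights
    C_new = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, C_new
```

The all-pairs kernel evaluation is O(n·c·d) per iteration over millions of pixels, which is exactly the kind of regular, data-parallel arithmetic that maps well to GPU/MIC coprocessors.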
This presentation first points out the dilemma of the traditional FPGA industry, and then argues that flexible, easy-to-use cloud services are a feasible way to resolve FPGA's difficulties. Tencent's architecture tries to solve the puzzle of automatic FPGA cloud-service generation using the idea of API-as-a-service. To achieve this goal, Tencent has released an HDK, an SDK, and the Tencent Computing Service (TCS) platform to help developers automatically convert their APIs into cloud services.
"TCS: FaaS (FPGA as a service)," Jianlin Gao. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129499
Novel rack-level interconnects are urgently required to support frequent inter-server communication in emerging large-scale distributed in-memory applications. In this paper, we introduce DoCE, a memory-semantic fabric built via Direct Extension of On-chip Interconnect (DEOI) over Converged Ethernet. Based on its architectural support for fine-grained remote memory sharing, DoCE provides a 9.6x speedup for a distributed implementation of the PageRank algorithm on our dual-node ARM SoC-FPGA prototype versus a conventional TCP/IP-based solution. To the best of our knowledge, DoCE is the first implementation and prototype of a memory-semantic fabric over existing Ethernet infrastructure in the ARM ecosystem.
"DoCE: Direct Extension of On-Chip Interconnects over Converged Ethernet for Rack-Scale Memory Sharing," Yisong Chang, Ran Zhao, Lei Yu, Ke Zhang. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129504
Recent years have seen a rapidly growing cloud computing market. Massive numbers of enterprise applications, such as social networking, e-commerce, video streaming, email, web search, MapReduce, and Spark, are moving to cloud systems. These applications often require tens or hundreds of tasks or micro-services to complete, and need to handle billions of visits per day while processing unprecedented volumes of data. At the same time, they need to deliver quick and predictable response times to their users. However, performance predictability has always been one of the biggest challenges in cloud computing. Despite many optimizations and improvements in both hardware and software, the distribution of latencies for Google's back-end services shows that while the majority of requests take around 50-60 ms, a significant fraction takes longer than 100 ms, with the largest difference being almost 600x [10]. This large variance hurts the quality of experience (QoE) for users and directly leads to revenue losses as well as increased operational costs. Google's study shows that if response time increases from 0.4 s to 0.9 s, traffic and ad revenues drop by 20% [1]. Amazon also reports that every 100 ms increase in response time cuts sales by 1% [4]. According to Nielsen [14], (i) 0.1 second is about the limit for the user to feel that the system is reacting instantaneously; (ii) 1.0 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay; and (iii) 10 seconds is about the limit for keeping the user's attention focused on the dialogue, beyond which users will want to perform other tasks while waiting for the computer to finish. In this sense, "slow response" and "service unavailable" seem to be the same for cloud users.
Currently, major cloud providers such as Amazon, Microsoft, and Google merely state an uptime availability guarantee in their Service Level Agreements (SLAs), but never guarantee QoE (e.g., response time). Since traditional availability is defined by the failure/repair behavior of cloud services, it clearly cannot satisfy users' requirements for quick response times. The root cause is that the complex and diverse uncertain behaviors in cloud systems make performance predictability very difficult. In general, these uncertainties have two main characteristics:
• Diversity: Uncertainties in cloud systems come from many diverse sources, including the hardware layer (e.g., failures, system resource competition, network resource competition) and the software layer (e.g., scheduling algorithms, software bugs, unexpected workloads, loss of data) [9].
• Transmissibility: An uncertainty may not only affect a single service but also degrade the performance of a chain of services or other co-located applications. For example, the loss of a piece of intermediate data would require the re-generation of that data from its parent tasks.
"Slow or Down?: Seem to Be the Same for Cloud Users," Laiping Zhao, Xiaobo Zhou. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129496
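The fan-out structure described above, where one request triggers tens or hundreds of tasks or micro-services, is what makes per-server tail latency so damaging: the request waits on its slowest sub-request. A back-of-the-envelope sketch (the numbers are illustrative, not from the cited measurements):

```python
# If each back-end server independently has a small chance of a slow
# response, a request fanned out to many servers is slow whenever at
# least one of them is slow.

def p_request_slow(p_server_slow, fanout):
    """Probability that at least one of `fanout` sub-requests is slow,
    assuming independent per-server tail behavior."""
    return 1.0 - (1.0 - p_server_slow) ** fanout

# Suppose 1% of responses per server exceed 100 ms and a page request
# touches 100 servers: roughly 63% of page loads then hit the tail.
p = p_request_slow(0.01, 100)
```

This is why shaving the 99th percentile of individual services matters far more at scale than improving their median latency.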
Microsoft has clearly made the case for using FPGAs at scale in the cloud, and Intel has committed to leveraging the benefits of hardware acceleration with its acquisition of Altera. However, we still cannot use FPGAs with the ease we have with software-based systems, let alone do so easily at cloud scale. High-level synthesis is necessary for making FPGAs accessible, but it is not sufficient. Making FPGAs easy to use for computation requires more than accessible tools for creating hardware targeted at FPGAs. The software computing world has a wealth of taken-for-granted, often invisible, high-quality open-source infrastructure that is missing for FPGAs as computing devices. The problem is compounded when we want to use FPGAs at the scale of the cloud. I will present the need for common infrastructure and abstraction layers to support the use of FPGAs for computing at scale, and describe relevant work at the University of Toronto that can contribute toward an open-source framework for the use and deployment of FPGAs at scale.
"Building the Reconfigurable Cloud Ecosystem," P. Chow. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129501
Software Defined Networking (SDN) greatly simplifies network management and introduces unprecedented flexibility by decoupling control functions from the network data plane. However, this decoupling also opens up a range of questions that are not yet well addressed, e.g., scalability issues and security concerns. This talk first describes the background of SDN and the abstraction SDN offers today, and then presents the scalability and security problems along with our ongoing research progress. Promising future directions will also be discussed.
"Rethinking the SDN Abstraction," Chengchen Hu. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129498
Cloud computing is an important infrastructure for many enterprises. After 10 years of development, cloud computing has achieved great success and has greatly changed the economy, society, science, and industry. In particular, with the rapid development of the mobile Internet and big-data technology, almost all online services and data services are built on top of cloud computing, such as the online banking services provided by banks, the electronic services provided by the news media, the cloud information systems provided by government departments, and the mobile services provided by communications companies. In addition, tens of thousands of start-ups rely on cloud computing services. Therefore, ensuring cloud reliability is very important and essential. However, the reality is that current cloud systems are not reliable enough. On February 28th, 2017, Amazon Web Services, the popular storage and hosting platform used by a huge range of companies, experienced an S3 service interruption for 4 hours in the Northern Virginia (US-EAST-1) Region, and the outage then quickly spread to other online service providers that rely on the S3 service [2]. This failure caused a huge economic loss, because cloud computing service providers typically sign a Service Level Agreement (SLA) with customers. For example, when customers require 99.99% availability, the service must meet that requirement 99.99% of the time, 365 days per year; if downtime exceeds 0.01%, compensation is required. In fact, with the continuous development and maturity of cloud computing, a large number of traditional business systems have been deployed on cloud platforms.
Cloud computing integrates existing hardware resources through virtualization technology to create a shared resource pool that lets applications obtain computing, storage, and network resources on demand, effectively enhancing the scalability and resource utilization of traditional IT infrastructures and significantly reducing the operating cost of traditional business systems. However, with the growing number of applications running on the cloud, the scale of cloud data centers has kept expanding, and current cloud computing systems have become very complex, mainly reflected in: 1) Large scale. A typical data center involves more than 100,000 servers and 10,000 switches, and more nodes usually mean a higher probability of failure. 2) Complex application structure. Web search, e-commerce, and other typical cloud programs exhibit complex interactive behavior; for example, an Amazon page request involves interaction with hundreds of components [7], and an error in any one component will make the whole application anomalous. 3) Shared resource pattern. One of the basic features of cloud computing is resource sharing: a typical server in a Google cloud data center hosts 5 to 18 applications simultaneously, running about 10.69 applications on average [5]. Resource competition will interfere with application performance.
The complexity of these cloud computing systems, the complexity of application interaction structures, and the sharing model inherent to cloud platforms make cloud systems more prone to performance anomalies than traditional platforms. It can be said that in cloud computing, anomalies are the norm [3]. Further analysis shows that resource competition, resource bottlenecks, misconfiguration, software defects, hardware failures, external attacks, and other causes can all lead to cloud system anomalies or failures. A performance anomaly is a sudden performance degradation that deviates from the system's normal behavior. Unlike an outage, which stops the system immediately, a performance anomaly usually manifests as reduced system efficiency; misconfiguration, software defects, and hardware failures often lead to such anomalies. For cloud computing systems, detecting only outages or other functional anomalies is not enough, because such anomalies usually cause service interruptions and can be resolved simply by restarting or replacing hardware. Performance anomalies caused by resource sharing and interference deserve more attention [4], because they can be eliminated before the business is interrupted, keeping services running continuously. If performance anomalies in a cloud computing system are not handled promptly, the consequences can be very serious: they not only affect the normal operation of business systems but also discourage enterprises from deploying their business on cloud systems. Timely elimination of performance anomalies is especially important for latency-sensitive cloud applications. For example, Amazon found that every 100 ms of delay cuts sales by 1%, and Google found that added delay cuts traffic by 20%.
"Anomaly Detection in Clouds: Challenges and Practice," Kejiang Ye. Proceedings of the First Workshop on Emerging Technologies for Software-Defined and Reconfigurable Hardware-Accelerated Cloud Datacenters, April 8, 2017. doi:10.1145/3129457.3129497
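The 99.99% availability requirement discussed above translates into a concrete yearly downtime budget, which a one-line computation makes vivid and which shows why a single 4-hour outage like the S3 incident blows through such an SLA. The formula is the standard availability arithmetic, not taken from the paper.

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (non-leap year)

def downtime_budget_minutes(availability):
    """Minutes of downtime per year permitted by an availability SLA."""
    return (1.0 - availability) * MINUTES_PER_YEAR

# 99.99% availability allows only about 52.6 minutes of downtime per
# year; a 4-hour (240-minute) outage exceeds that budget several times over.
budget = downtime_budget_minutes(0.9999)
```

By the same arithmetic, 99.9% allows about 8.8 hours per year and 99.999% only about 5.3 minutes, which is why each extra "nine" is so costly to provide.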