TcpRT: Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI:10.1145/3183713.3190659

Wei Cao, Yusong Gao, Bingchen Lin, Xiaojie Feng, Yu Xie, Xiao Lou, Peng Wang


Citations: 13

Abstract

Smooth end-to-end performance of mission-critical database systems is essential to the stability of applications deployed on the cloud. It is a challenge for cloud database vendors to detect performance degradation in real time and to locate the root cause quickly in a sophisticated network environment. Cloud database vendors tend to favor a multi-tier distributed architecture to achieve multi-tenant management, scalability, and high availability, which may further complicate the problem. This paper presents TcpRT, the instrumentation and diagnosis infrastructure in Alibaba Cloud RDS that achieves real-time anomaly detection. We wrote a Linux kernel module to collect trace data for each SQL query. Designed to be efficient with minimal overhead, it adds tracepoints in the callbacks of the TCP congestion control kernel module, which makes it totally transparent to database processes. To significantly reduce the amount of data sent to the backend, raw trace data is aggregated first. The aggregated trace data is then processed, grouped, and analyzed on a distributed stream computing platform. By fitting a self-adjustable Cauchy distribution statistical model to the historical performance data of each DB instance, anomalous events can be detected automatically, which eliminates the need to configure thresholds manually from experience. A fault or hiccup in any network component that is shared among multiple DB instances (e.g., instances hosted on the same physical machine or uplinked to the same pair of ToR switches) may cause large-scale service quality degradation. The ratio of anomalous DB instances to network components is calculated, which helps pinpoint the faulty component. TcpRT has been deployed in production at Alibaba Cloud for the past 3 years, collects over 20 million raw traces per second, processes over 10 billion locally aggregated results in the backend per day, and keeps its performance impact on the DB system within 1%. We present case studies of typical scenarios where TcpRT helps solve various problems that occurred in production systems.
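
The abstract does not give the details of the statistical model or of the component-level ratio, so the following Python sketch is only an illustration of the two ideas under stated assumptions, not the TcpRT implementation. It assumes that the Cauchy location and scale are estimated robustly from a window of historical response times (median and half the interquartile range), that a sample is flagged when its upper-tail probability falls below a hypothetical cutoff `alpha`, and that the ratio is the fraction of anomalous instances per shared component. All function names and the sample data are made up for the example.

```python
import math
from collections import defaultdict
from statistics import median, quantiles


def fit_cauchy(history):
    """Robust Cauchy fit (assumed method): location = median, scale = IQR / 2."""
    q1, _, q3 = quantiles(history, n=4)
    scale = max((q3 - q1) / 2.0, 1e-9)  # guard against a zero scale on flat history
    return median(history), scale


def upper_tail(x, location, scale):
    """P(X > x) for a Cauchy(location, scale) variable, from its closed-form CDF."""
    return 0.5 - math.atan((x - location) / scale) / math.pi


def is_anomalous(x, history, alpha=0.01):
    """Flag x when it is improbably large given the instance's own history."""
    location, scale = fit_cauchy(history)
    return upper_tail(x, location, scale) < alpha


def anomalous_ratio_by_component(instance_to_component, anomalous_instances):
    """Fraction of anomalous DB instances on each shared component (host, switch)."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for instance, component in instance_to_component.items():
        totals[component] += 1
        if instance in anomalous_instances:
            bad[component] += 1
    return {component: bad[component] / totals[component] for component in totals}


if __name__ == "__main__":
    # Hypothetical response times (ms) for one DB instance.
    history = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
    print(is_anomalous(100, history), is_anomalous(11, history))

    # Hypothetical placement of three instances on two hosts.
    placement = {"db1": "host-a", "db2": "host-a", "db3": "host-b"}
    print(anomalous_ratio_by_component(placement, {"db1", "db2"}))
```

Because the threshold is derived from each instance's own history, no per-instance limit has to be configured by hand, and a component whose ratio is close to 1 while its neighbours stay near 0 is a natural first suspect.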


