Real-Time Cleaning of Time-Series Data for a Floating System Digital Twin

Day 1 Mon, May 06, 2019
Pub Date: 2019-04-26
DOI: 10.4043/29642-MS

P. Agarwal, S. McNeill


Citations: 4

Abstract

Accurate, high-quality data is critical for any application that relies heavily on data, be it machine learning, artificial intelligence, or a digital twin. Poor-quality or erroneous data can produce inaccurate predictions even when the model is otherwise robust. Data quality matters even more in real-time applications, where no human is in the loop to sense-check the data or the results. A real-time digital twin implementation for a floating system uses time-series data from numerous measurements, such as wind, waves, GPS, vessel motions, mooring tensions, and draft. Statistics computed from these data feed the digital twin. An extensive data checking and cleaning routine was written to perform quality checks and corrections on the time-series data before the statistics are computed.

Errors that typically occur in a time series include noise, flat-lined data, clipped data, outliers, and discontinuities. Statistical procedures were developed to check the raw time series for all of these errors. The procedures are generic and robust, so they can be applied to different types of data. Some data types are slowly varying (e.g., GPS), while others are fast-varying random processes. A measurement classified as an error in one type of data is not necessarily an error in another. For example, GPS data can be discontinuous by nature, but a discontinuity in wave data indicates an error. Likewise, checking for white noise in mooring tension data is not very meaningful. We developed parametric procedures so that the same routine can handle different types of data and their errors. Outlier removal routines use the standard deviation of the time series, which can itself be biased by errors. Therefore, a method to compute unbiased statistics from the raw data was developed and implemented for robust outlier removal.

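The abstract does not say which estimator the authors use for unbiased statistics, nor how their parametric procedures are configured. As a minimal sketch only, the Python below substitutes a common robust technique (the median and the median absolute deviation, MAD) for the mean and standard deviation, and uses a hypothetical per-channel configuration to switch checks on or off. `ChannelConfig`, its fields, the run length of 5, and the threshold of 5.0 are all illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class ChannelConfig:
    """Hypothetical per-channel parameters; names and defaults are assumptions."""
    check_flatline: bool = True
    flatline_run: int = 5            # identical consecutive samples treated as flat-lined
    check_outliers: bool = True
    outlier_threshold: float = 5.0   # robust z-score above which a sample is rejected


def flag_flatline(values: List[float], run_length: int) -> List[bool]:
    """Flag every sample that belongs to a run of >= run_length identical values."""
    flags = [False] * len(values)
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the series or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= run_length:
                for j in range(start, i):
                    flags[j] = True
            start = i
    return flags


def flag_outliers(values: List[float], threshold: float) -> List[bool]:
    """Flag samples far from the median, scaled by MAD instead of the standard deviation."""
    if not values:
        return []
    center = median(values)
    mad = median([abs(v - center) for v in values])
    scale = 1.4826 * mad  # MAD rescaled to estimate sigma for Gaussian data
    if scale == 0.0:
        return [False] * len(values)  # no spread to measure against
    return [abs(v - center) / scale > threshold for v in values]


def clean(values: List[float], config: ChannelConfig) -> List[Optional[float]]:
    """Return a copy of the series with flagged samples replaced by None."""
    bad = [False] * len(values)
    if config.check_flatline:
        bad = [b or f for b, f in zip(bad, flag_flatline(values, config.flatline_run))]
    if config.check_outliers:
        bad = [b or f for b, f in zip(bad, flag_outliers(values, config.outlier_threshold))]
    return [None if b else v for v, b in zip(values, bad)]


# Fast-varying channel: the spike at index 3 barely moves the median or MAD,
# so it is rejected even though it would inflate an ordinary standard deviation.
wave = [0.4, -0.3, 0.1, 25.0, -0.2, 0.3, -0.4, 0.2]
cleaned_wave = clean(wave, ChannelConfig())
```

Because the median and MAD are computed from the raw series yet are largely insensitive to a few gross errors, the rejection threshold does not depend on the very samples being rejected. A slowly varying channel could be given a different `ChannelConfig`, for example with a check disabled, without changing the routine itself.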
The data cleaning procedures were tested extensively on years of measured data from hundreds of data channels to confirm that they function as intended. Statistics (mean, standard deviation, maximum, and minimum) were computed from both the raw and the cleaned data. The comparison showed significant differences between the raw and cleaned statistics, with the latter being more accurate.

Data cleaning may not sound as high-tech as other analytics algorithms, but it is a critical foundation of any data science application. Using cleaned time-series data and the corresponding statistics ensures that a data analytics model provides actionable results. Clean data and statistics help achieve the intended purpose of the digital twin, which is to inform operators of the health and condition of the asset and to flag any anomalous events.
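To illustrate why raw and cleaned statistics can differ so much, the sketch below computes the four statistics named in the abstract for a short made-up series before and after one spike is rejected. The sample values, the `summarize` helper, the use of `None` to mark rejected samples, and the choice of the population standard deviation are assumptions for illustration; none of them come from the paper's data or code.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Optional


def summarize(samples: List[Optional[float]]) -> Dict[str, float]:
    """Mean, standard deviation, maximum, and minimum, skipping rejected (None) samples."""
    valid = [s for s in samples if s is not None]
    if not valid:
        raise ValueError("no valid samples to summarize")
    return {
        "mean": fmean(valid),
        "std": pstdev(valid),  # population standard deviation (an assumption)
        "max": max(valid),
        "min": min(valid),
    }


# Made-up series: one spike in otherwise small-amplitude data.
raw = [0.4, -0.3, 0.1, 25.0, -0.2, 0.3, -0.4, 0.2]
cleaned = [0.4, -0.3, 0.1, None, -0.2, 0.3, -0.4, 0.2]

raw_stats = summarize(raw)
clean_stats = summarize(cleaned)

for name in ("mean", "std", "max", "min"):
    print(f"{name:>4}: raw={raw_stats[name]:8.3f}  cleaned={clean_stats[name]:8.3f}")
```

In this toy case a single bad sample changes the maximum from 0.4 to 25.0 and inflates the standard deviation by more than an order of magnitude, which is the kind of distortion a downstream model would otherwise inherit.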


