TemporalDedup: Domain-Independent Deduplication of Redundant and Errant Temporal Data

Int. J. Semantic Comput. Pub Date: 2023-02-17. DOI: 10.1142/s1793351x23500010

Jon Rogers, R. S. Aygün, L. Etzkorn


Citations: 1

Abstract

Deduplication is a key component of the data preparation process, a bottleneck in the machine learning (ML) and data mining pipeline that is very time-consuming and often relies on domain expertise and manual involvement. Further, temporal data is increasingly prevalent and is not well suited to traditional similarity and distance-based deduplication techniques. We establish a fully automated, domain-independent deduplication model for temporal data domains, known as TemporalDedup, that infers the key attribute(s), applies a base set of deduplication techniques focused on value matches for key, non-key, and elapsed time, and further detects duplicates through inference of temporal ordering requirements using Longest Common Subsequence (LCS) for records of a shared type. Using LCS, we split each record's temporal sequence into constrained and unconstrained sequences. We flag suspicious (errant) records that are non-adherent to the inferred constrained order and we flag a record as a duplicate if its unconstrained order, of sufficient length, matches that of another record. TemporalDedup was compared against a similarity-based Adaptive Sorted Neighborhood Method (ASNM) in evaluating duplicates for two disparate datasets: (1) 22,794 records from Sony's PlayStation Network (PSN) trophy data, where duplication may be indicative of cheating, and (2) emergency declarations and government responses related to COVID-19 for all U.S. states and territories. TemporalDedup (F1-scores of 0.971 and 0.954) exhibited combined sensitivities above 0.9 for all duplicate classes whereas ASNM (0.705 and 0.732) exhibited combined sensitivities below 0.2 for all time and order duplicate classes. © 2023 World Scientific Publishing Company.
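The LCS-based ordering step described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not say how the constrained order is inferred across many records, so the sketch assumes it is the pairwise LCS of two reference records taken to be clean. The event labels, record IDs, the helper names (`lcs`, `split_record`, `is_errant`, `order_duplicates`), and the `min_len` threshold are all hypothetical.

```python
from typing import Hashable, Sequence


def lcs(a: Sequence[Hashable], b: Sequence[Hashable]) -> list:
    """Classic dynamic-programming Longest Common Subsequence of two sequences."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def split_record(events, constrained_order):
    """Split a record's time-ordered events into (constrained, unconstrained)."""
    fixed = set(constrained_order)
    constrained = [e for e in events if e in fixed]
    unconstrained = [e for e in events if e not in fixed]
    return constrained, unconstrained


def is_errant(events, constrained_order):
    """Flag a record whose constrained events deviate from the inferred order.

    Assumption: a record missing a constrained event is also treated as errant.
    """
    constrained, _ = split_record(events, constrained_order)
    return constrained != list(constrained_order)


def order_duplicates(records, constrained_order, min_len=3):
    """Pair records whose unconstrained order matches and is at least min_len long."""
    first_seen, pairs = {}, []
    for record_id, events in records.items():
        _, free = split_record(events, constrained_order)
        if len(free) < min_len:
            continue  # too short to be meaningful evidence of duplication
        key = tuple(free)
        if key in first_seen:
            pairs.append((first_seen[key], record_id))
        else:
            first_seen[key] = record_id
    return pairs


# Hypothetical records of one shared type, each already sorted by timestamp.
records = {
    "r1": ["A", "x", "B", "y", "z", "C", "D"],
    "r2": ["z", "y", "A", "B", "C", "x", "D"],
    "r3": ["y", "A", "C", "B", "z", "x", "D"],  # C occurs before B
    "r4": ["A", "B", "x", "y", "z", "C", "D"],  # same free order as r1
}

# Assumption: r1 and r2 serve as clean reference records for inference.
constrained_order = lcs(records["r1"], records["r2"])
errant = [rid for rid, ev in records.items() if is_errant(ev, constrained_order)]
duplicates = order_duplicates(records, constrained_order, min_len=3)

print(constrained_order)  # ['A', 'B', 'C', 'D']
print(errant)             # ['r3']
print(duplicates)         # [('r1', 'r4')]
```

In this toy data, events A through D appear in the same relative order in both reference records, so they form the inferred constrained order, while x, y, and z float freely. Record r3 violates that order and is flagged as suspicious; r1 and r4 share the same unconstrained order of length 3 and are paired as order duplicates. Inferring the order over a full dataset would need a strategy that is robust to errant records, which this sketch does not attempt.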


