TemporalDedup: Domain-Independent Deduplication of Redundant and Errant Temporal Data

Jon Rogers, R. S. Aygün, L. Etzkorn
Journal article, Int. J. Semantic Comput. Published 2023-02-17. DOI: https://doi.org/10.1142/s1793351x23500010
Citations: 1

Abstract

Deduplication is a key component of data preparation, a time-consuming bottleneck in the machine learning (ML) and data mining pipeline that often relies on domain expertise and manual involvement. Further, temporal data is increasingly prevalent and is not well suited to traditional similarity- and distance-based deduplication techniques. We establish a fully automated, domain-independent deduplication model for temporal data domains, known as TemporalDedup, that infers the key attribute(s), applies a base set of deduplication techniques focused on value matches for key, non-key, and elapsed time, and further detects duplicates by inferring temporal ordering requirements using Longest Common Subsequence (LCS) for records of a shared type. Using LCS, we split each record's temporal sequence into constrained and unconstrained sequences. We flag as suspicious (errant) any record that does not adhere to the inferred constrained order, and we flag a record as a duplicate if its unconstrained order, of sufficient length, matches that of another record. TemporalDedup was compared against a similarity-based Adaptive Sorted Neighborhood Method (ASNM) in evaluating duplicates for two disparate datasets: (1) 22,794 records from Sony's PlayStation Network (PSN) trophy data, where duplication may be indicative of cheating, and (2) emergency declarations and government responses related to COVID-19 for all U.S. states and territories. TemporalDedup (F1-scores of 0.971 and 0.954) exhibited combined sensitivities above 0.9 for all duplicate classes, whereas ASNM (F1-scores of 0.705 and 0.732) exhibited combined sensitivities below 0.2 for all time and order duplicate classes. © 2023 World Scientific Publishing Company.
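The LCS-based ordering step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the constrained order is inferred across many records, so the sketch assumes it is the LCS folded over a small set of records taken to be trustworthy. It also assumes each event label appears at most once per record. The function names, the `trusted_ids` argument, the `min_len` threshold, and the sample data are all hypothetical.

```python
from functools import reduce
from itertools import combinations


def lcs(a, b):
    """Return one longest common subsequence of two lists (classic DP)."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def split_sequence(seq, constrained_order):
    """Split a record's events into (constrained, unconstrained) parts."""
    members = set(constrained_order)
    constrained = [e for e in seq if e in members]
    unconstrained = [e for e in seq if e not in members]
    return constrained, unconstrained


def analyze(records, trusted_ids, min_len=3):
    """Infer a constrained order, then flag errant records and duplicate pairs.

    Assumption: the constrained order is the LCS folded pairwise over the
    trusted records (a heuristic, not an exact multi-sequence LCS).
    """
    constrained_order = reduce(lcs, (records[r] for r in trusted_ids))
    parts = {rid: split_sequence(seq, constrained_order)
             for rid, seq in records.items()}
    # Errant: the record's constrained events deviate from the inferred order.
    errant = sorted(rid for rid, (c, _) in parts.items()
                    if c != constrained_order)
    # Duplicate: two records share an unconstrained order of sufficient length.
    duplicates = [(r1, r2) for r1, r2 in combinations(sorted(parts), 2)
                  if len(parts[r1][1]) >= min_len
                  and parts[r1][1] == parts[r2][1]]
    return constrained_order, errant, duplicates


# Hypothetical data: T* events must occur in order; A, B, C may occur anytime.
records = {
    "p1": ["A", "B", "C", "T1", "T2", "T3"],
    "p2": ["T1", "T2", "T3", "C", "B", "A"],
    "p3": ["T1", "A", "B", "T2", "C", "T3"],
    "p4": ["T2", "T1", "T3", "B", "A", "C"],
}
order, errant, duplicates = analyze(records, trusted_ids=["p1", "p2"])
print(order, errant, duplicates)
```

With this sample data, the inferred constrained order is `T1, T2, T3`; record `p4` is flagged as errant because it earns `T2` before `T1`, and `p1` and `p3` are flagged as a duplicate pair because their unconstrained events occur in the same order (`A, B, C`). Folding pairwise LCS over several sequences yields a common subsequence but not necessarily the longest one, so a production system would need a more careful inference step.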