Understanding Cloud Data Using Approximate String Matching and Edit Distance

2012 SC Companion: High Performance Computing, Networking Storage and Analysis. Pub Date: 2012-11-10. DOI: 10.1109/SC.Companion.2012.149

Joseph Jupin, Justin Y. Shi, Z. Obradovic

Pages: 1234-1243

Citations: 7

Abstract

For health and human services, fraud detection, and other security services, identity resolution is a core requirement for understanding big data in the cloud. Because there is no globally unique identifier and records capture typographic variations of the same identity, identity resolution has high space and time complexity. We propose a filter-and-verify method that substantially speeds up approximate string matching using edit distance. The method was found to be almost 80 times faster (130 times when combined with other optimizations) than Damerau-Levenshtein edit distance while preserving all approximate matches. It creates compressed signatures for data fields and uses Boolean operations and an enhanced bit counter to quickly compare the distance between fields. It is intended for data records whose fields contain relatively short strings, such as those found in most demographic data. Without loss of accuracy, the proposed Fast Bitwise Filter will provide a substantial performance gain for approximate string comparison in database, record linkage, and deduplication systems.
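The filter-and-verify idea can be illustrated with a minimal sketch. The abstract does not describe the signature layout or the enhanced bit counter, so everything here is an assumption made for illustration: the signature is a plain character-presence bitmask, the bit counter is an ordinary population count, and the names `signature`, `popcount`, `passes_filter`, `osa_distance`, and `approx_match` are hypothetical. The filter relies on one observation: a single insertion, deletion, substitution, or adjacent transposition changes at most two bits of a presence mask, so more than `2 * k` differing bits proves that the edit distance exceeds `k`. The verification step uses the restricted (optimal string alignment) variant of Damerau-Levenshtein distance.

```python
def signature(s: str) -> int:
    """Assumed signature: set bit (ord(c) % 64) for every character c in s."""
    sig = 0
    for c in s:
        sig |= 1 << (ord(c) % 64)
    return sig


def popcount(x: int) -> int:
    """Count the set bits of a non-negative integer."""
    return bin(x).count("1")


def passes_filter(sig_a: int, sig_b: int, k: int) -> bool:
    """Cheap test: one edit flips at most two signature bits, so a pair
    with more than 2*k differing bits cannot be within distance k."""
    return popcount(sig_a ^ sig_b) <= 2 * k


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # match or substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # adjacent transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def approx_match(a: str, b: str, k: int) -> bool:
    """Filter first, then verify only the surviving pairs."""
    if not passes_filter(signature(a), signature(b), k):
        return False
    return osa_distance(a, b) <= k
```

In a record-linkage setting the signatures would be computed once per field and stored, so that most candidate pairs are rejected with one XOR and one bit count, and the quadratic-time distance computation runs only on the pairs that pass. Because the filter is a lower bound on the distance, it never rejects a pair that the verification step would accept.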

