SIMILARITY ALGORITHMS FOR FUZZY JOIN COMPUTATION IN BIG DATA PROCESSING ENVIRONMENT

Anh-Cang Phan, Thuong-Cang Phan
{"title":"SIMILARITY ALGORITHMS FOR FUZZY JOIN COMPUTATION IN BIG DATA PROCESSING ENVIRONMENT","authors":"Anh-Cang Phan, Thuong-Cang Phan","doi":"10.15625/1813-9663/17589","DOIUrl":null,"url":null,"abstract":"Big data processing is attracting the interest of many researchers to process large-scale datasets and extract useful information for supporting and providing decisions. One of the biggest challenges is the problem of querying large datasets. It becomes even more complicated with similarity queries instead of exact match queries. A fuzzy join operation is a typical operation frequently used in similarity queries and big data analysis. Currently, there is very little research on this issue, thus it poses significant barriers to the efforts of improving query operations on big data efficiently. As a result, this study overviews the similarity algorithms for fuzzy joins, in which the data at the join key attributes may have slight differences within a fuzzy threshold. We analyze six similarity algorithms including Hamming, Levenshtein, LCS, Jaccard, Jaro, and Jaro - Winkler, to show the difference between these algorithms through the three criteria: output enrichment, false positives/negatives, and the processing time of the algorithms. Experiments of fuzzy joins algorithms are implemented in the Spark environment, a popular big data processing platform. The algorithms are divided into two groups for evaluation: group 1 (Hamming, Levenshtein, and LCS) and group 2 (Jaccard, Jaro, and Jaro - Winkler). For the former, Levenshtein has an advantage over the other two algorithms in terms of output enrichment, high accuracy in the result set (false positives/negatives), and acceptable processing time. In the letter, Jaccard is considered the worst algorithm considering all three criteria mean while Jaro - Winkler algorithm has more output richness and higher accuracy in the result set. The overview of the similarity algorithms in this study will help users to choose the most suitable algorithm for their problems.","PeriodicalId":15444,"journal":{"name":"Journal of Computer Science and Cybernetics","volume":"187 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Science and Cybernetics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.15625/1813-9663/17589","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Big data processing is attracting the interest of many researchers to process large-scale datasets and extract useful information for supporting and providing decisions. One of the biggest challenges is the problem of querying large datasets. It becomes even more complicated with similarity queries instead of exact match queries. A fuzzy join operation is a typical operation frequently used in similarity queries and big data analysis. Currently, there is very little research on this issue, thus it poses significant barriers to the efforts of improving query operations on big data efficiently. As a result, this study overviews the similarity algorithms for fuzzy joins, in which the data at the join key attributes may have slight differences within a fuzzy threshold. We analyze six similarity algorithms including Hamming, Levenshtein, LCS, Jaccard, Jaro, and Jaro - Winkler, to show the difference between these algorithms through the three criteria: output enrichment, false positives/negatives, and the processing time of the algorithms. Experiments of fuzzy joins algorithms are implemented in the Spark environment, a popular big data processing platform. The algorithms are divided into two groups for evaluation: group 1 (Hamming, Levenshtein, and LCS) and group 2 (Jaccard, Jaro, and Jaro - Winkler). For the former, Levenshtein has an advantage over the other two algorithms in terms of output enrichment, high accuracy in the result set (false positives/negatives), and acceptable processing time. In the letter, Jaccard is considered the worst algorithm considering all three criteria mean while Jaro - Winkler algorithm has more output richness and higher accuracy in the result set. The overview of the similarity algorithms in this study will help users to choose the most suitable algorithm for their problems.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
大数据处理环境下模糊连接计算的相似度算法
大数据处理吸引了许多研究人员的兴趣,以处理大规模数据集并提取有用的信息来支持和提供决策。最大的挑战之一是查询大型数据集的问题。如果使用相似查询而不是精确匹配查询,情况会变得更加复杂。模糊连接操作是相似度查询和大数据分析中常用的一种典型操作。目前,关于该问题的研究很少,这对有效提高大数据查询操作的努力造成了很大的障碍。因此,本研究概述了模糊连接的相似度算法,其中连接键属性处的数据在模糊阈值内可能存在细微差异。我们分析了Hamming、Levenshtein、LCS、Jaccard、Jaro和Jaro - Winkler等6种相似度算法,通过输出富集、假阳性/阴性和算法的处理时间三个标准来展示这些算法之间的差异。在流行的大数据处理平台Spark环境中,实现了模糊连接算法的实验。算法分为两组进行评估:第1组(Hamming, Levenshtein和LCS)和第2组(Jaccard, Jaro和Jaro - Winkler)。对于前者,Levenshtein在输出丰富性、结果集的高准确性(假阳性/假阴性)和可接受的处理时间方面比其他两种算法有优势。在这封信中,Jaccard算法被认为是同时考虑三个标准的最差算法,而Jaro - Winkler算法在结果集中具有更丰富的输出和更高的精度。本研究对相似度算法的概述将有助于用户选择最适合自己问题的算法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
PROVING THE SECURITY OF AES BLOCK CIPHER BASED ON MODIFIED MIXCOLUMN AN IMPROVED INDEXING METHOD FOR QUERYING BIG XML FILES OHYEAH AT VLSP2022-EVJVQA CHALLENGE: A JOINTLY LANGUAGE-IMAGE MODEL FOR MULTILINGUAL VISUAL QUESTION ANSWERING THE VNPT-IT EMOTION TRANSPLANTATION APPROACH FOR VLSP 2022 TAEKWONDO POSE ESTIMATION WITH DEEP LEARNING ARCHITECTURES ON ONE-DIMENSIONAL AND TWO-DIMENSIONAL DATA
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1