Wenhong Liu, Zhiyuan Peng, Shuang Zhao, Jiawei Liu
Title: Similarity Analysis in Data Element Matching based on Word2vec
Published in: 2022 IEEE 22nd International Conference on Software Quality, Reliability, and Security Companion (QRS-C), 2022-12-01
DOI: 10.1109/QRS-C57518.2022.00054
Citations: 0
Abstract
With the increasing demand for computer-aided big data processing, deep learning has gradually become an effective aid to it. Databases maintained by different departments often contain many redundant fields. These fields are often completely equivalent but differ in their field names, which complicates data element matching. To this end, we propose a more targeted approach, 'MetaMatch', to handle database fields, combining Word2vec with a high-performance database. To measure the effectiveness of the proposed method, we propose a Word2vec-based data element matching method. The method performs semantic segmentation on key fields of the database and trains word vectors. Each training case is then tokenized, and the corresponding word vector is constructed from the word segmentation result. In our experiments, we use this method to implement data element matching for big data systems and design a validation experiment to evaluate the matching accuracy. The matching accuracy reached 79.3%.
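The matching step described in the abstract (tokenize a field name, build a vector from its word vectors, compare fields) can be illustrated with a minimal sketch. The abstract does not say how a field vector is built from word vectors or which similarity measure is used, so averaging the token vectors and comparing by cosine similarity are assumptions here. The `TOY_VECTORS` table is a hypothetical stand-in for trained Word2vec embeddings, and the function names are illustrative only, not taken from the paper.

```python
import math
import re

# Hypothetical toy embeddings standing in for trained Word2vec vectors.
TOY_VECTORS = {
    "user": [0.9, 0.1, 0.0],
    "customer": [0.8, 0.2, 0.1],
    "name": [0.1, 0.9, 0.0],
    "id": [0.0, 0.1, 0.9],
    "address": [0.5, 0.5, 0.5],
}


def tokenize(field_name):
    """Split a field name such as 'userName' or 'cust_id' into lowercase tokens."""
    # Insert a space at lower-to-upper camelCase boundaries, then split on
    # anything that is not a letter or digit.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", field_name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def field_vector(field_name, vectors):
    """Average the vectors of known tokens (assumed pooling); None if none are known."""
    known = [vectors[t] for t in tokenize(field_name) if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(field, candidates, vectors):
    """Return (candidate, score) with the highest cosine similarity, or (None, 0.0)."""
    source = field_vector(field, vectors)
    best, best_score = None, 0.0
    if source is None:
        return best, best_score
    for candidate in candidates:
        vec = field_vector(candidate, vectors)
        if vec is None:
            continue
        score = cosine(source, vec)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


match, score = best_match(
    "userName", ["customer_name", "user_id", "address"], TOY_VECTORS
)
print(match, round(score, 3))
```

With these toy vectors, `userName` is paired with `customer_name` even though the two names share only one token, which is the kind of equivalence plain string comparison would miss. In a real setting the vectors would come from a Word2vec model trained on the segmented field names.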