{"title":"Near-duplicate detection using GPU-based simhash scheme","authors":"Xiaowen Feng, Hai Jin, Ran Zheng, Lei Zhu","doi":"10.1109/SMARTCOMP.2014.7043862","DOIUrl":null,"url":null,"abstract":"With the rapid growth of data, near-duplicate documents bearing high similarity are abundant. Elimination of near-duplicates can reduce storage cost and improve the quality of search indexes in data mining. A challenging problem is to find near-duplicate records in large-scale collections efficiently. There have already been several efforts on implementing near-duplicate detection on different architectures. In this paper, a new implementation, using a special hash function namely simhash, is proposed to identify near-duplicate documents on CUDA enabled devices. Two mechanisms are designed to achieve higher performance, including swapping and dynamic allocating. Experimental results show that our parallel implementation outperforms the serial CPU version, achieving up to 18 times.","PeriodicalId":169858,"journal":{"name":"2014 International Conference on Smart Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 International Conference on Smart Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SMARTCOMP.2014.7043862","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4
Abstract
With the rapid growth of data, near-duplicate documents bearing high similarity are abundant. Elimination of near-duplicates can reduce storage cost and improve the quality of search indexes in data mining. A challenging problem is to find near-duplicate records in large-scale collections efficiently. There have already been several efforts on implementing near-duplicate detection on different architectures. In this paper, a new implementation, using a special hash function namely simhash, is proposed to identify near-duplicate documents on CUDA enabled devices. Two mechanisms are designed to achieve higher performance, including swapping and dynamic allocating. Experimental results show that our parallel implementation outperforms the serial CPU version, achieving up to 18 times.