Near-duplicate detection using GPU-based simhash scheme

2014 International Conference on Smart Computing, Pub Date: 2014-11-01, DOI: 10.1109/SMARTCOMP.2014.7043862

Xiaowen Feng, Hai Jin, Ran Zheng, Lei Zhu

Citations: 4

Abstract

With the rapid growth of data, near-duplicate documents that bear high similarity to one another have become abundant. Eliminating near-duplicates can reduce storage cost and improve the quality of search indexes in data mining. A challenging problem is finding near-duplicate records efficiently in large-scale collections. Several efforts have already implemented near-duplicate detection on different architectures. In this paper, a new implementation is proposed that uses a special hash function, simhash, to identify near-duplicate documents on CUDA-enabled devices. Two mechanisms, swapping and dynamic allocation, are designed to achieve higher performance. Experimental results show that our parallel implementation outperforms the serial CPU version, achieving a speedup of up to 18 times.
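The abstract does not spell out the simhash scheme itself, so the following is only a minimal CPU-side sketch of the general technique, not the authors' GPU implementation. It assumes 64-bit fingerprints, whitespace tokenization with unit weights, MD5 as the per-token hash, and a Hamming-distance threshold `k = 3`; the helper names (`token_hash`, `simhash`, `hamming_distance`, `is_near_duplicate`) are all hypothetical choices made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width; an assumption, not taken from the paper


def token_hash(token: str) -> int:
    """Map a token to a 64-bit integer (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Compute a simhash fingerprint over whitespace tokens, each with weight 1."""
    counts = [0] * BITS
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(BITS):
            # Each token votes +1 for bits set in its hash, -1 otherwise.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:  # keep bits where the positive votes win
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(doc_a: str, doc_b: str, k: int = 3) -> bool:
    """Treat two documents as near-duplicates if their fingerprints differ in at most k bits."""
    return hamming_distance(simhash(doc_a), simhash(doc_b)) <= k
```

Each fingerprint depends only on its own document, and each pairwise Hamming comparison is independent of the others, which is the kind of workload that can be spread across GPU threads. How the paper partitions this work, and how its swapping and dynamic allocation mechanisms operate, is not described in the abstract.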
