{"title":"具有最小丢失率的文本重复数据删除","authors":"Youming Ge, Jiefeng Wu, Genan Dai, Yubao Liu","doi":"10.1145/3318299.3318369","DOIUrl":null,"url":null,"abstract":"Text deduplication is an important operation for text document analysis applications. Given a set of text documents, we often need to remove the text documents whose similarity values are not less than the specified threshold. However, if the set of similar text documents to be removed is too large, the remaining set of text documents may be not enough for text analysis. In this paper, we consider the problem on how to balance the removed set and the remaining set of text documents. We try to reduce the duplication information as much as possible with the minimum number of text documents to be removed. We propose a greedy algorithm for our problem based on the concept of similarity graph which can represent the similar relationship for a set of text documents. We also consider the incremental algorithm for the dynamic settings. The experimental results based on the real news document datasets show the efficiency of the proposed algorithms.","PeriodicalId":164987,"journal":{"name":"International Conference on Machine Learning and Computing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Text Deduplication with Minimum Loss Ratio\",\"authors\":\"Youming Ge, Jiefeng Wu, Genan Dai, Yubao Liu\",\"doi\":\"10.1145/3318299.3318369\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Text deduplication is an important operation for text document analysis applications. Given a set of text documents, we often need to remove the text documents whose similarity values are not less than the specified threshold. However, if the set of similar text documents to be removed is too large, the remaining set of text documents may be not enough for text analysis. In this paper, we consider the problem on how to balance the removed set and the remaining set of text documents. We try to reduce the duplication information as much as possible with the minimum number of text documents to be removed. We propose a greedy algorithm for our problem based on the concept of similarity graph which can represent the similar relationship for a set of text documents. We also consider the incremental algorithm for the dynamic settings. 
The experimental results based on the real news document datasets show the efficiency of the proposed algorithms.\",\"PeriodicalId\":164987,\"journal\":{\"name\":\"International Conference on Machine Learning and Computing\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Machine Learning and Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3318299.3318369\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Machine Learning and Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3318299.3318369","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Text deduplication is an important operation in text document analysis applications. Given a set of text documents, we often need to remove documents whose pairwise similarity values are not less than a specified threshold. However, if the set of similar documents to be removed is too large, the remaining documents may be insufficient for text analysis. In this paper, we consider the problem of balancing the removed set against the remaining set of text documents: we aim to reduce duplicated information as much as possible while removing the minimum number of documents. We propose a greedy algorithm based on the concept of a similarity graph, which represents the similarity relationships among a set of text documents. We also consider an incremental algorithm for dynamic settings. Experimental results on real news document datasets show the efficiency of the proposed algorithms.
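As a rough illustration of the greedy idea sketched in the abstract (the paper's actual algorithm is not reproduced here), the snippet below builds a similarity graph over documents and greedily removes the document incident to the most remaining similar pairs until no two remaining documents meet the threshold. The similarity measure (Jaccard over token sets), the threshold value, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of greedy deduplication over a similarity graph.
# Similarity measure, threshold, and names are assumptions for illustration only.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def greedy_dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep documents so that no remaining pair has similarity >= threshold,
    greedily removing as few documents as possible."""
    tokens = [set(d.lower().split()) for d in docs]
    n = len(docs)
    # Build the similarity graph: an edge joins two documents whose
    # similarity is not less than the threshold.
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(tokens[i], tokens[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    removed = set()
    # Repeatedly remove the document covering the most remaining edges
    # (highest residual degree) until no similar pair remains.
    while True:
        v = max((i for i in range(n) if i not in removed),
                key=lambda i: len(adj[i] - removed), default=None)
        if v is None or not (adj[v] - removed):
            break
        removed.add(v)
    return [docs[i] for i in range(n) if i not in removed]

if __name__ == "__main__":
    corpus = ["the cat sat on the mat",
              "the cat sat on a mat",
              "stock prices rose sharply today"]
    print(greedy_dedup(corpus, threshold=0.6))
```

Removing the highest-degree document first is a standard greedy heuristic for covering all edges of the similarity graph with few removals; the incremental variant mentioned in the abstract would update this graph as documents arrive rather than rebuilding it.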