{"title":"Finding Event-Relevant Content from the Web Using a Near-Duplicate Detection Approach","authors":"Hung-Chi Chang, Jenq-Haur Wang, Chih-Yi Chiu","doi":"10.1109/WI.2007.58","DOIUrl":null,"url":null,"PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WI.2007.58","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Finding Event-Relevant Content from the Web Using a Near-Duplicate Detection Approach
In online resources, such as news and weblogs, authors often extract articles, embed content, and comment on existing articles related to a popular event. It is therefore useful if authors can check whether two or more articles share common parts for further analysis, such as co-citation analysis and search result improvement. If articles do have parts in common, we say the content of such articles is event-relevant. Conventional text classification methods classify a complete document into categories, but they cannot represent the semantics precisely or extract meaningful event-relevant content. To resolve these problems, we propose a near-duplicate detection approach for finding event-relevant content in Web documents. The efficiency of the approach and the proposed duplicate set generation algorithms make it suitable for identifying event-relevant content. The experimental results demonstrate the potential of the proposed approach for use in weblogs.
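The abstract does not describe the detection algorithm itself, so the following is only a minimal sketch of one common way to find parts shared between two articles: word shingling with Jaccard similarity, applied sentence by sentence. It is not the authors' method. The function names (`shingles`, `jaccard`, `split_sentences`, `shared_parts`), the shingle size `k=3`, and the `threshold=0.5` cutoff are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def shared_parts(doc_a, doc_b, k=3, threshold=0.5):
    """Return (sentence_a, sentence_b) pairs whose shingle sets are near-duplicates.

    k and threshold are illustrative assumptions, not values from the paper.
    """
    sents_b = [(s, shingles(s, k)) for s in split_sentences(doc_b)]
    pairs = []
    for sent_a in split_sentences(doc_a):
        sh_a = shingles(sent_a, k)
        for sent_b, sh_b in sents_b:
            if jaccard(sh_a, sh_b) >= threshold:
                pairs.append((sent_a, sent_b))
    return pairs


if __name__ == "__main__":
    news = "The mayor opened the new bridge on Monday. Traffic was light."
    blog = "I read that the mayor opened the new bridge on Monday. My cat slept all day."
    for a, b in shared_parts(news, blog):
        print(repr(a), "<->", repr(b))
```

Comparing every sentence pair is quadratic in the number of sentences, which is acceptable for a pair of articles but not for a Web-scale collection. A practical system would need an indexing or hashing step, which this sketch leaves out.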