The Classification Power of Web Features
M. Erdélyi, A. Benczúr, B. Daróczy, A. Garzó, Tamás Kiss, Dávid Siklósi
Internet Mathematics, published 2014-04-18. DOI: https://doi.org/10.1080/15427951.2013.850456
Abstract: In this article, we give a comprehensive overview of features devised for web spam detection and investigate how much various feature classes, some requiring very high computational effort, add to the classification accuracy. We collect and handle a large number of features based on recent advances in web spam filtering, including temporal ones; in particular, we analyze the strength and sensitivity of linkage change. We propose new temporal link-similarity-based features and show how to compute them efficiently on large graphs. We show that machine learning techniques, including ensemble selection, LogitBoost, and random forest, significantly improve accuracy. We conclude that, with appropriate learning techniques, a simple and computationally inexpensive feature subset outperforms all results previously published on our dataset and can be improved only slightly by computationally expensive features. We test our method on three major publicly available datasets: the Web Spam Challenge 2008 dataset WEBSPAM-UK2007, the ECML/PKDD Discovery Challenge dataset DC2010, and the Waterloo Spam Rankings for ClueWeb09. Our classifier ensemble sets the strongest classification benchmark when compared with participants of the Web Spam and ECML/PKDD Discovery Challenges as well as the TREC Web track. To foster research in the area, we make several feature sets and source code public (https://datamining.sztaki.hu/en/download/web-spam-resources), including the temporal features of eight .uk crawl snapshots that include WEBSPAM-UK2007 as well as the Web Spam Challenge features for the labeled part of ClueWeb09.
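
As a rough illustration of what a linkage-change feature might look like, the sketch below compares each host's set of out-links across two crawl snapshots using Jaccard similarity. This is an assumption made for illustration only: the abstract does not define the authors' features, and the names `jaccard`, `link_change_features`, and the toy host graphs are hypothetical.

```python
from typing import Dict, Set

# Hypothetical representation: host -> set of hosts it links to.
Graph = Dict[str, Set[str]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; taken to be 1.0 when both are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def link_change_features(old: Graph, new: Graph) -> Dict[str, float]:
    """Compare every host's out-links between an older and a newer snapshot.

    A value near 1.0 means the host's out-links barely changed;
    a value near 0.0 means they were almost entirely replaced.
    """
    hosts = set(old) | set(new)
    return {h: jaccard(old.get(h, set()), new.get(h, set())) for h in hosts}


# Two made-up snapshots of a tiny host graph (not real crawl data).
snapshot_old: Graph = {"a.uk": {"b.uk", "c.uk"}, "b.uk": {"c.uk"}}
snapshot_new: Graph = {"a.uk": {"b.uk", "d.uk"}, "b.uk": {"c.uk"}, "d.uk": set()}

features = link_change_features(snapshot_old, snapshot_new)
for host in sorted(features):
    print(host, round(features[host], 3))
```

A per-host score like this could be appended to a feature vector alongside content and link features before training a classifier; whether it helps on a given dataset would have to be measured.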