Web Robot Detection: A Semantic Approach
Athanasios Lagopoulos, Grigorios Tsoumakas, Georgios Papadopoulos
2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), 2018-11-01
DOI: 10.1109/ICTAI.2018.00150 (https://doi.org/10.1109/ICTAI.2018.00150)
Citations: 10
Abstract
Web robots now account for more than half of all web traffic. Malicious robots threaten the security, privacy and performance of the web, while non-malicious ones skew analytics. The latter is an important problem for large websites with unique content, as it can give a false impression of the popularity and impact of a piece of information. To deal with this problem, we present a novel web robot detection approach for content-rich websites, based on the assumption that human web users are interested in specific topics, while web robots crawl the web randomly. Our approach extends the typical representation of user sessions with a novel set of features that capture the semantics of the content of the requested resources. Empirical results on real-world data from the web portal of an academic publisher show that the proposed semantic features improve web robot detection accuracy.
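
The abstract does not list the semantic features themselves, so the following is only an illustrative sketch of the underlying assumption, not the authors' method. It assumes each requested page has already been mapped to a topic distribution (for example by a topic model), and it computes two hypothetical session-level features: the number of distinct dominant topics and the mean pairwise cosine similarity between pages. The names `semantic_features`, `unique_topics` and `mean_similarity` are invented for this example.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_features(session_topics):
    """Summarise how topically focused a session is.

    session_topics: one topic distribution per requested page, given as
    equal-length lists of floats (an assumed input format).
    """
    if not session_topics:
        return {"unique_topics": 0, "mean_similarity": 0.0}

    # Index of the highest-weight topic for each page.
    dominant = {max(range(len(vec)), key=vec.__getitem__) for vec in session_topics}

    # Average similarity over all unordered page pairs.
    pairs = list(combinations(session_topics, 2))
    if pairs:
        mean_sim = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    else:
        # Convention chosen here: a single-page session counts as fully coherent.
        mean_sim = 1.0

    return {"unique_topics": len(dominant), "mean_similarity": mean_sim}


# Hypothetical sessions: one topically focused, one scattered across topics.
focused = semantic_features([[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]])
scattered = semantic_features([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

Under the stated assumption, a human-like session would tend to show few dominant topics and high similarity, while a randomly crawling robot would show the opposite. In practice such values would be appended to conventional session features before training a classifier.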