A semantic information retrieval model for focused crawling
Daniel Osuna-Ontiveros, I. Lopez-Arevalo, V. Sosa-Sosa
2011 7th International Conference on Next Generation Web Services Practices, 2011-12-01
DOI: 10.1109/NWESP.2011.6088192 (https://doi.org/10.1109/NWESP.2011.6088192)
Citations: 1
Abstract
Computer users now store a large amount of information on the Web, which makes the Internet a good place to search for information on any subject. Because of the sheer volume of information, some users prefer to search within specific websites that they consider interesting (e.g. www.wikipedia.com, news sites, etc.). Traditional models represent webpages by using the frequency of terms or the structure of links to assign weights to the terms of each webpage. This paper presents a semantic information retrieval model for representing specific websites. The proposal integrates text mining algorithms based on natural language processing with traditional representation models, with the aim of improving the quality of the webpages retrieved by a search. Each webpage of the website is represented as a vector of topics instead of a vector of terms, and the query is represented as a vector of topics in the same way. A similarity measure can then be applied between the query vector and the document vectors to retrieve the most relevant documents.
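The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the topic-extraction method nor the similarity measure, so the sketch assumes a small hand-written topic lexicon (`TOPICS`) in place of the text mining stage and cosine similarity as the measure. The function names `topic_vector`, `cosine`, and `rank` are hypothetical.

```python
import math
import re

# ASSUMPTION: a hand-written topic lexicon stands in for the paper's
# text mining stage, which the abstract does not describe in detail.
TOPICS = {
    "sports": {"football", "match", "team", "goal"},
    "finance": {"bank", "market", "stock", "loan"},
    "health": {"doctor", "vaccine", "hospital", "diet"},
}
TOPIC_ORDER = sorted(TOPICS)  # fixed order of vector components


def topic_vector(text):
    """Map a text to a vector with one component per topic.

    Each component counts the tokens that belong to that topic's lexicon.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [sum(1 for t in tokens if t in TOPICS[name]) for name in TOPIC_ORDER]


def cosine(u, v):
    """Cosine similarity (an assumed choice of similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a text with no known topic matches nothing
    return dot / (norm_u * norm_v)


def rank(query, pages):
    """Return page ids with non-zero similarity to the query, best first."""
    q = topic_vector(query)
    scored = [(cosine(q, topic_vector(text)), page_id)
              for page_id, text in pages.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [page_id for score, page_id in scored if score > 0]


if __name__ == "__main__":
    pages = {
        "a": "The team scored a goal in the match",
        "b": "The stock market fell sharply today",
    }
    # The query shares no term with page "b", but both fall under "finance".
    print(rank("bank loan", pages))
```

In this sketch the query "bank loan" and page "b" have no term in common, yet they are matched because both map to the same topic. That is the effect a topic-based representation is meant to have compared with a plain vector of terms.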