
WebQuality '12: Latest Publications

Content-based trust and bias classification via biclustering
Pub Date : 2012-04-16 DOI: 10.1145/2184305.2184314
Dávid Siklósi, B. Daróczy, A. Benczúr
In this paper we improve trust, bias and factuality classification over Web data on the domain level. Unlike the majority of literature in this area that aims at extracting opinion and handling short text on the micro level, we aim to aid a researcher or an archivist in obtaining a large collection that, on the high level, originates from unbiased and trustworthy sources. Our method generates features as Jensen-Shannon distances from centers in a host-term biclustering. On top of the distance features, we apply kernel methods and also combine with baseline text classifiers. We test our method on the ECML/PKDD Discovery Challenge data set DC2010. Our method improves over the best achieved text classification NDCG results by over 3--10% for neutrality, bias and trustworthiness. The fact that the ECML/PKDD Discovery Challenge 2010 participants reached an AUC only slightly above 0.5 indicates the hardness of the task.
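The feature-generation step can be illustrated with a minimal sketch: assuming a hypothetical hosts-by-terms count matrix and precomputed bicluster centers, each host is mapped to its Jensen-Shannon distances from those centers. The paper's actual biclustering, kernel methods, and combination with baseline text classifiers are not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon  # JS distance = sqrt of JS divergence

def js_distance_features(host_term_counts, bicluster_centers):
    """Map each host's term distribution to a vector of Jensen-Shannon
    distances from the bicluster centers. Inputs are hypothetical: a
    hosts-by-terms count matrix and a centers-by-terms matrix."""
    hosts = host_term_counts / host_term_counts.sum(axis=1, keepdims=True)
    centers = bicluster_centers / bicluster_centers.sum(axis=1, keepdims=True)
    return np.array([[jensenshannon(h, c) for c in centers] for h in hosts])

# Toy usage: 3 hosts, 5 terms, 2 bicluster centers.
counts = np.array([[4, 1, 0, 2, 3],
                   [0, 5, 1, 1, 0],
                   [2, 2, 2, 2, 2]], dtype=float)
centers = np.array([[3, 1, 1, 2, 3],
                    [1, 4, 2, 1, 0]], dtype=float)
print(js_distance_features(counts, centers))  # shape (3, 2) feature matrix
```

These distance features would then be fed, together with standard text features, into an SVM-style classifier.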
Citations: 12
A breakdown of quality flaws in Wikipedia
Pub Date : 2012-04-16 DOI: 10.1145/2184305.2184309
Maik Anderka, Benno Stein
The online encyclopedia Wikipedia is a successful example of the increasing popularity of user generated content on the Web. Despite its success, Wikipedia is often criticized for containing low-quality information, which is mainly attributed to its core policy of being open for editing by everyone. The identification of low-quality information is an important task since Wikipedia has become the primary source of knowledge for a huge number of people around the world. Previous research on quality assessment in Wikipedia either investigates only small samples of articles, or else focuses on single quality aspects, like accuracy or formality. This paper targets the investigation of quality flaws, and presents the first complete breakdown of Wikipedia's quality flaw structure. We conduct an extensive exploratory analysis, which reveals (1) the quality flaws that actually exist, (2) the distribution of flaws in Wikipedia, and (3) the extent of flawed content. An important finding is that more than one in four English Wikipedia articles contains at least one quality flaw, 70% of which concern article verifiability.
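One common way to operationalize such flaws is through Wikipedia's cleanup templates. The sketch below, which assumes a tiny illustrative template list rather than the paper's full flaw taxonomy, tallies flaw markers across articles and the fraction of articles carrying at least one.

```python
import re
from collections import Counter

# Illustrative subset of cleanup-template names used as flaw markers.
FLAW_TEMPLATES = ["citation needed", "unreferenced", "advert", "orphan", "wikify"]

def flaws_in(wikitext):
    """Return the flaw templates found in one article's wikitext."""
    return [name for name in FLAW_TEMPLATES
            if re.search(r"\{\{\s*" + re.escape(name), wikitext, re.IGNORECASE)]

articles = {
    "A": "Some claim.{{Citation needed|date=March 2012}} More text.",
    "B": "{{Unreferenced|date=2011}} Short stub about a band. {{Orphan}}",
    "C": "A well-sourced article with references and no cleanup tags.",
}
flaw_counts = Counter(f for text in articles.values() for f in flaws_in(text))
flawed = sum(1 for text in articles.values() if flaws_in(text))
print(flaw_counts, f"-- {flawed}/{len(articles)} articles contain at least one flaw")
```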
Citations: 34
On measuring the lexical quality of the web
Pub Date : 2012-04-16 DOI: 10.1145/2184305.2184307
R. Baeza-Yates, Luz Rello
In this paper we propose a measure for estimating the lexical quality of the Web, that is, the representational aspect of the textual web content. Our lexical quality measure is based on a small corpus of spelling errors, and we apply it to English and Spanish. We first compute the correlation of our measure with web popularity measures to show that it gives independent information, and then we apply it to different web segments, including social media. Our results shed light on the lexical quality of the Web and show that authoritative websites have several orders of magnitude fewer misspellings than the overall Web. We also present an analysis of the geographical distribution of lexical quality throughout English- and Spanish-speaking countries, as well as how this measure changes over about one year.
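A minimal sketch of a misspelling-rate measure in this spirit, using a tiny hand-picked list of common English misspellings as a stand-in for the paper's actual error corpus (and omitting the popularity correlations and geographic breakdown):

```python
import re

# Hypothetical mini-corpus of frequent misspellings; the paper uses its own list.
MISSPELLINGS = {"recieve", "teh", "occured", "definately", "seperate"}

def misspelling_rate(text):
    """Fraction of tokens matching a known misspelling; lower values
    suggest higher lexical quality."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in MISSPELLINGS) / len(tokens)

pages = {
    "example-news.org": "You will recieve teh news definately on time.",
    "example-edu.org": "Authoritative pages tend to separate facts from opinion.",
}
for domain, text in pages.items():
    print(domain, round(misspelling_rate(text), 3))
```

Aggregating such rates per domain or per country gives the kind of segment-level comparison the paper reports.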
Citations: 15
kaPoW plugins: protecting web applications using reputation-based proof-of-work
Pub Date : 2012-04-16 DOI: 10.1145/2184305.2184318
Tien Le, A. Dua, Wu-chang Feng
Comment spam is a fact of life if you have a blog or forum. Tools like Akismet and CAPTCHA help prevent spam in applications like WordPress or phpBB. However, they are not devoid of shortcomings. CAPTCHAs are getting easier for automated adversaries like bots to solve and pose usability issues. Akismet strives to detect spam, but can't do much to reduce it. This paper presents the kaPoW plugin and reputation service that can complement existing antispam tools. kaPoW creates disincentives for sending spam by slowing down spammers. It uses a web-based proof-of-work approach wherein a client is given a computational puzzle to solve before accessing a service (e.g. comment posting). The idea is to set puzzle difficulties based on a client's reputation, thereby issuing "harder" puzzles to spammers. The more time spammers spend solving puzzles, the less time they have to send spam. Unlike CAPTCHAs, kaPoW requires no additional user interaction since all the puzzles are issued and solved in software. kaPoW can be used by any web application that supports an extension framework (e.g. plugins) and is concerned about spam.
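A minimal sketch of the reputation-scaled proof-of-work idea, assuming a simple hash-preimage puzzle whose difficulty (in leading zero bits) grows as a client's reputation drops; kaPoW's actual puzzle construction, reputation service, and plugin API are not reproduced here.

```python
import hashlib
import os

def difficulty_for(reputation):
    """Map a reputation score in [0, 1] to a difficulty in leading zero bits:
    trusted clients get trivial puzzles, suspected spammers hard ones.
    (Hypothetical mapping, not kaPoW's actual policy.)"""
    return int(round((1.0 - reputation) * 20))

def issue_puzzle(reputation):
    return os.urandom(8).hex(), difficulty_for(reputation)

def solve(challenge, difficulty):
    """Find a nonce so that sha256(challenge || nonce) starts with
    `difficulty` zero bits -- the work a client does before posting."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if int(digest, 16) >> (256 - difficulty) == 0:
            return nonce
        nonce += 1

def verify(challenge, difficulty, nonce):
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return int(digest, 16) >> (256 - difficulty) == 0

challenge, difficulty = issue_puzzle(reputation=0.3)  # low-reputation client
nonce = solve(challenge, difficulty)
assert verify(challenge, difficulty, nonce)
print(f"difficulty={difficulty} bits, nonce={nonce}")
```

Verification costs a single hash, so the server-side overhead stays negligible no matter how hard the issued puzzle was.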
Citations: 3
A deformation analysis method for artificial maps based on geographical accuracy and its applications
Pub Date : 2012-04-16 DOI: 10.1145/2184305.2184310
D. Kitayama, K. Sumiya
Artificial maps are widely used for a variety of purposes, including as tourist guides that help people find geographical objects using simple figures. We aim to develop an editing system and a navigation system for artificial maps. Artificial maps made for tourists show objects suitable for traveling users. Therefore, if an artificial map has a navigation system, users can obtain geographical information such as object positions and routes without performing any operations. However, artificial maps might contain incorrect or superfluous information, such as objects that are intentionally enlarged or omitted. Two problems must be solved to develop such a system: (1) extracting geographical information from the raster graphics of the artificial map, and (2) revising inaccurate geographical information on the artificial map. We propose a deformation-analysis method based on geographical accuracy that uses optical character recognition techniques and compares against gazetteer information. That is, our proposed method detects the tolerance level for deformation according to the purpose of the artificial map. Then, we detect positions on the artificial map using deformation analysis. In this paper, we develop a prototype system and evaluate the accuracy of extracting information from the artificial map and detecting positions.
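One way to picture the deformation analysis is the least-squares sketch below: given hypothetical matched pairs of map pixel positions and gazetteer coordinates (e.g. obtained by OCR of place labels), it fits an affine transform and reports per-object residuals as a rough deformation score. This is an assumption-laden illustration, not the paper's actual method.

```python
import numpy as np

def deformation_residuals(map_xy, geo_xy):
    """Fit an affine map from gazetteer coordinates to map pixel positions
    and return per-object residuals; unusually large residuals point at
    objects that were intentionally displaced or enlarged."""
    geo = np.hstack([geo_xy, np.ones((len(geo_xy), 1))])   # homogeneous coords
    params, *_ = np.linalg.lstsq(geo, map_xy, rcond=None)  # 3x2 affine matrix
    predicted = geo @ params
    return np.linalg.norm(predicted - map_xy, axis=1)

# Toy landmarks on a regular grid; the last one is drawn far from where the
# fitted affine transform would place it.
geo_xy = np.array([[135.0, 34.5], [135.1, 34.5], [135.2, 34.5],
                   [135.0, 34.6], [135.1, 34.6], [135.2, 34.6]])
map_xy = np.array([[100, 250], [200, 250], [300, 250],
                   [100, 150], [200, 150], [480, 150]], dtype=float)
print(deformation_residuals(map_xy, geo_xy).round(1))
```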
Citations: 5
Game-theoretic models of web credibility
Pub Date : 2012-04-16 DOI: 10.1145/2184305.2184312
Thanasis G. Papaioannou, K. Aberer, Katarzyna Abramczuk, P. Adamska, A. Wierzbicki
Research on Web credibility assessment can significantly benefit from new models that are better suited to the evaluation and study of adversary strategies. Currently employed models lack several important aspects, such as explicit modeling of Web content properties (e.g. presentation quality), users' economic incentives, and assessment capabilities. In this paper, we introduce a new, game-theoretic model of credibility, referred to as the Credibility Game. We perform equilibrium and stability analysis of a simple variant of the game and then study it as a signaling game against naïve and expert information consumers. Using a generic economic model of the player payoffs, we study, via simulation experiments, more complex variants of the Credibility Game and demonstrate the effect of consumer expertise and of the credibility-evaluation signal on the evolutionarily stable strategies of the information producers and consumers.
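The flavor of the signaling setup can be conveyed with a toy Monte Carlo simulation: a producer chooses whether to pay a cost for credible content, a consumer with a given expertise level reads a noisy credibility signal and decides whether to accept, and expected payoffs are estimated by repeated play. The payoffs and signal model below are hypothetical, not the parameters of the paper's Credibility Game.

```python
import itertools
import random

random.seed(0)

def consumer_signal(honest, expertise):
    """Expert consumers read the true quality more often than naive ones."""
    return honest if random.random() < expertise else random.random() < 0.5

def play_round(producer_honest, expertise, cost_of_quality=0.6):
    accept = consumer_signal(producer_honest, expertise)  # accept what looks credible
    producer_payoff = (1.0 if accept else 0.0) - (cost_of_quality if producer_honest else 0.0)
    consumer_payoff = 1.0 if (accept and producer_honest) else -1.0 if accept else 0.0
    return producer_payoff, consumer_payoff

def expected_payoffs(producer_honest, expertise, rounds=20000):
    totals = [0.0, 0.0]
    for _ in range(rounds):
        p, c = play_round(producer_honest, expertise)
        totals[0] += p
        totals[1] += c
    return totals[0] / rounds, totals[1] / rounds

for honest, expertise in itertools.product([True, False], [0.5, 0.9]):
    p, c = expected_payoffs(honest, expertise)
    print(f"honest={honest!s:<5} expertise={expertise}: producer={p:+.2f} consumer={c:+.2f}")
```

With these toy numbers, a dishonest producer out-earns an honest one against naive consumers (expertise 0.5) but not against experts (expertise 0.9), illustrating the kind of dependence on consumer expertise the paper studies.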
Citations: 10
Measuring the quality of web content using factual information
Pub Date : 2012-04-16 DOI: 10.1145/2184305.2184308
E. Lex, Michael Völske, M. Errecalde, Edgardo Ferretti, L. Cagnina, Christopher Horn, Benno Stein, M. Granitzer
Nowadays, many decisions are based on information found on the Web. For the most part, the disseminating sources are not certified, and hence an assessment of the quality and credibility of Web content has become more important than ever. With factual density we present a simple statistical quality measure that is based on facts extracted from Web content using Open Information Extraction. In a first case study, we use this measure to identify featured/good articles in Wikipedia. We compare the factual density measure with word count, a measure that has successfully been applied to this task in the past. Our evaluation corroborates the good performance of word count in Wikipedia, since featured/good articles are often longer than non-featured ones. However, for articles of similar length the word count measure fails, while factual density can separate them with an F-measure of 90.4%. We also investigate the use of relational features for categorizing Wikipedia articles into featured/good versus non-featured ones. If articles have similar lengths, we achieve an F-measure of 86.7%, and 84% otherwise.
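A minimal sketch of the factual-density measure, assuming the (subject, relation, object) triples come from an external Open Information Extraction system; the triples below are hand-written for illustration.

```python
def factual_density(fact_triples, text):
    """Facts per unit of content length: OpenIE-style triples divided by the
    token count of the text they were extracted from."""
    tokens = text.split()
    return len(fact_triples) / len(tokens) if tokens else 0.0

article = ("Graz is the capital of Styria. "
           "The city lies on the Mur river and hosts several universities.")
# Hypothetical triples an Open Information Extraction system might return.
triples = [("Graz", "is the capital of", "Styria"),
           ("The city", "lies on", "the Mur river"),
           ("The city", "hosts", "several universities")]
print(round(factual_density(triples, article), 3))  # facts per token
```

Articles could then be classified (featured/good versus non-featured) by thresholding or learning on this score, optionally alongside word count.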
Citations: 47
An information theoretic approach to sentiment polarity classification
Pub Date : 2012-04-16 DOI: 10.1145/2184305.2184313
Yuming Lin, Jingwei Zhang, Xiaoling Wang, Aoying Zhou
Sentiment classification is the task of classifying documents according to their overall sentiment inclination. It is very important and popular in many web applications, such as credibility analysis of news sites on the Web, recommendation systems, and mining online discussions. The vector space model is widely applied to modeling documents in supervised sentiment classification, in which the feature presentation (including feature types and weight functions) is crucial for classification accuracy. The traditional feature presentation methods of text categorization do not perform well in sentiment classification, because sentiment is expressed in subtler ways. We analyze the relationships of terms with sentiment labels based on information theory, and propose a method that applies an information-theoretic approach to the sentiment classification of documents. In this paper, we first adopt mutual information to quantify the sentiment polarities of terms in a document. Then the terms are weighted in the vector space based on both their sentiment scores and their contribution to the document. We perform extensive experiments with SVMs on several sets of product reviews, and the experimental results show that our approach is more effective than the traditional ones.
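A minimal sketch of the mutual-information weighting step on a toy labeled corpus; the data, tokenization, and the downstream SVM training are placeholders for the paper's actual setup.

```python
import math
from collections import Counter, defaultdict

# Toy labeled reviews (hypothetical data). Label 1 = positive, 0 = negative.
docs = [("great battery and great screen", 1),
        ("terrible battery and awful support", 0),
        ("great value works as expected", 1),
        ("awful screen and terrible value", 0)]

def term_label_mi(docs):
    """Mutual information between term presence/absence and the sentiment
    label, used as a per-term polarity weight."""
    n = len(docs)
    labels = [label for _, label in docs]
    label_count = Counter(labels)
    term_docs = defaultdict(set)
    for i, (text, _) in enumerate(docs):
        for term in set(text.split()):
            term_docs[term].add(i)
    mi = {}
    for term, present in term_docs.items():
        score = 0.0
        for doc_ids in (present, set(range(n)) - present):
            p_t = len(doc_ids) / n
            if p_t == 0:
                continue
            for label, n_l in label_count.items():
                n_tl = sum(1 for i in doc_ids if labels[i] == label)
                if n_tl:
                    p_tl = n_tl / n
                    score += p_tl * math.log2(p_tl / (p_t * n_l / n))
        mi[term] = score
    return mi

weights = term_label_mi(docs)
# Weight raw term frequencies by the MI scores before handing the vectors
# to a classifier such as an SVM (the training step is omitted here).
first_doc = {t: tf * weights[t] for t, tf in Counter(docs[0][0].split()).items()}
print(sorted(weights.items(), key=lambda kv: -kv[1])[:4])
```

Discriminative sentiment words such as "great" and "terrible" receive the highest weights, while label-neutral terms like "battery" score near zero.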
Citations: 51
Detecting collective attention spam
Pub Date : 2012-04-16 DOI: 10.1145/2184305.2184316
Kyumin Lee, James Caverlee, K. Kamath, Zhiyuan Cheng
We examine the problem of collective attention spam, in which spammers target social media where user attention quickly coalesces and then collectively focuses around a phenomenon. Compared to many existing spam types, collective attention spam relies on the users themselves to seek out the content -- like breaking news, viral videos, and popular memes -- where the spam will be encountered, potentially increasing its effectiveness and reach. We study the presence of collective attention spam in one popular service, Twitter, and we develop spam classifiers to detect spam messages generated by collective attention spammers. Since many instances of collective attention are bursty and unexpected, it is difficult to build spam detectors to pre-screen them before they arise; hence, we examine the effectiveness of quickly learning a classifier based on the first moments of a bursting phenomenon. Through initial experiments over a small set of trending topics on Twitter, we find encouraging results, suggesting that collective attention spam may be identified early in its life cycle and shielded from the view of unsuspecting social media users.
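A minimal sketch of the early-detection idea: train a classifier on features observable in the first moments of a burst. The feature set and tiny training sample below are hypothetical stand-ins for the study's Twitter collection.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical early-window features per message in a bursting topic:
# [fraction of the author's recent posts containing URLs, account age in days,
#  near-duplicate score against other messages in the burst].
# Labels: 1 = collective-attention spam, 0 = legitimate.
X = np.array([[0.90,    3, 0.80],
              [0.80,   10, 0.90],
              [0.10,  900, 0.10],
              [0.20,  400, 0.20],
              [0.70,    5, 0.70],
              [0.05, 1200, 0.05]])
y = np.array([1, 1, 0, 0, 1, 0])

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)  # trained on messages seen in the first moments of the burst
new_messages = np.array([[0.85, 7, 0.75], [0.10, 600, 0.15]])
print(clf.predict(new_messages))  # expected: [1 0]
```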
Citations: 32
Identifying spam in the iOS app store
Pub Date : 2012-04-16 DOI: 10.1145/2184305.2184317
Rishi Chandy, Haijie Gu
Popular apps on the Apple iOS App Store can generate millions of dollars in profit and collect valuable personal user information. Fraudulent reviews could deceive users into downloading potentially harmful spam apps or unfairly ignoring apps that are victims of review spam. Thus, automatically identifying spam in the App Store is an important problem. This paper aims to introduce and characterize novel datasets acquired through crawling the iOS App Store, compare a baseline Decision Tree model with a novel Latent Class graphical model for classification of app spam, and analyze preliminary results for clustering reviews.
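A minimal sketch of a Decision Tree baseline over hypothetical per-review features; the crawled App Store datasets and the Latent Class graphical model are not reproduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-review features: [star rating, reviewer's total review count,
# review length in characters, reviews by the same account for this developer].
# Labels: 1 = spam review, 0 = genuine.
X = np.array([[5,  1,  20, 6],
              [5,  2,  15, 8],
              [4, 40, 300, 1],
              [2, 25, 450, 1],
              [5,  1,  10, 9],
              [3, 60, 220, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[5, 1, 18, 7], [3, 30, 380, 1]]))  # expected: [1 0]
```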
Citations: 71