
Proceedings of the 21st ACM international conference on Information and knowledge management: latest papers

Towards an effective and unbiased ranking of scientific literature through mutual reinforcement
Xiaorui Jiang, Xiaoping Sun, H. Zhuge
It is important to help researchers find valuable scientific papers in a large literature collection containing information about authors, papers and venues. Graph-based algorithms have been proposed to rank papers based on networks formed by citation and co-author relationships. This paper proposes a new graph-based ranking framework, MutualRank, that integrates mutual reinforcement relationships among networks of papers, researchers and venues to achieve a more synthetic, accurate and fair ranking result than previous graph-based methods. MutualRank leverages the network structure information among papers, authors, and their venues available from a literature collection dataset and sets up a unified mutual reinforcement model that involves both intra- and inter-network information for ranking papers, authors and venues simultaneously. To evaluate, we collect a set of recommended papers from websites of graduate-level computational linguistics courses of 15 top universities as the benchmark and apply different methods to estimate paper importance. The results show that MutualRank greatly outperforms the competitors, including PageRank, HITS and CoRank, in ranking papers as well as researchers. The experimental results also demonstrate that venues ranked by MutualRank are reasonable.
DOI: https://doi.org/10.1145/2396761.2396853. Published 2012-10-29.
Citations: 39
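The abstract above names PageRank as one of the baselines that MutualRank is compared against. As a minimal sketch of that baseline only (it is not MutualRank, and the damping factor, iteration count and toy paper identifiers are assumptions chosen for illustration), a citation graph can be ranked by power iteration as follows:

```python
def pagerank(citations, damping=0.85, iterations=100):
    """Rank papers in a citation graph by PageRank power iteration.

    `citations` maps a paper id to the list of paper ids it cites.
    Papers with no outgoing citations spread their score uniformly.
    """
    nodes = sorted(set(citations) | {t for ts in citations.values() for t in ts})
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Score held by papers that cite nothing is shared by everyone.
        dangling = sum(score[v] for v in nodes if not citations.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            targets = citations.get(v, [])
            for t in targets:
                new[t] += damping * score[v] / len(targets)
        score = new
    return score


# Hypothetical toy graph: p1 and p2 both cite p3.
toy = {"p1": ["p3"], "p2": ["p3"], "p3": []}
scores = pagerank(toy)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy graph the cited paper p3 ends up with the highest score, and the scores sum to 1. A mutual reinforcement model in the sense of the abstract would additionally couple such paper scores with author and venue scores; that coupling is not shown here.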
gSCorr: modeling geo-social correlations for new check-ins on location-based social networks
Huiji Gao, Jiliang Tang, Huan Liu
Location-based social networks (LBSNs) have attracted an increasing number of users in recent years. The availability of geographical and social information of online LBSNs provides an unprecedented opportunity to study human movement through users' socio-spatial behavior, enabling a variety of location-based services. Previous work on LBSNs reported limited improvements from using social network information for location prediction. Because users can check in at new places, traditional work on location prediction that relies on mining a user's historical trajectories is not designed for this "cold start" problem of predicting new check-ins. In this paper, we propose to utilize the social network information for solving the "cold start" location prediction problem, with a geo-social correlation model to capture social correlations on LBSNs considering social networks and geographical distance. The experimental results on a real-world LBSN demonstrate that our approach properly models the social correlations of a user's new check-ins by considering various correlation strengths and correlation measures.
DOI: https://doi.org/10.1145/2396761.2398477. Published 2012-10-29.
Citations: 231
Reconciling ontologies and the web of data
Ziawasch Abedjan, Johannes Lorey, Felix Naumann
To integrate Linked Open Data, which originates from various and heterogeneous sources, the use of well-defined ontologies is essential. However, oftentimes the utilization of these ontologies by data publishers differs from the intended application envisioned by ontology engineers. This may lead to unspecified properties being used ad-hoc as predicates in RDF triples or it may result in infrequent usage of specified properties. These mismatches impede the goals and propagation of the Web of Data as data consumers face difficulties when trying to discover and integrate domain-specific information. In this work, we identify and classify common misusage patterns by employing frequency analysis and rule mining. Based on this analysis, we introduce an algorithm to propose suggestions for a data-driven ontology re-engineering workflow, which we evaluate on two large-scale RDF datasets.
DOI: https://doi.org/10.1145/2396761.2398467. Published 2012-10-29.
Citations: 25
Monochromatic and bichromatic reverse nearest neighbor queries on land surfaces
D. Yan, Zhou Zhao, Wilfred Ng
Finding reverse nearest neighbors (RNNs) is an important operation in spatial databases. The problem of evaluating RNN queries has already received considerable attention due to its importance in many real-world applications, such as resource allocation and disaster response. While RNN query processing has been extensively studied in Euclidean space, no prior work has studied this problem on land surfaces. However, practical applications of RNN queries involve terrain surfaces that constrain object movements, which renders the existing algorithms inapplicable. In this paper, we investigate the evaluation of two types of RNN queries on land surfaces: monochromatic RNN (MRNN) queries and bichromatic RNN (BRNN) queries. On a land surface, the distance between two points is calculated as the length of the shortest path along the surface. However, the computational cost of the state-of-the-art shortest path algorithm on a land surface is quadratic in the size of the surface model, which is usually very large. As a result, surface RNN query processing is a challenging problem. Leveraging some newly discovered properties of Voronoi cell approximation structures, we make use of standard index structures such as an R-tree to design efficient algorithms that accelerate the evaluation of MRNN and BRNN queries on land surfaces. Our proposed algorithms are able to localize query evaluation by accessing just a small fraction of the surface data near the query point, which helps avoid shortest path evaluation on a large surface. Extensive experiments are conducted on large real-world datasets to demonstrate the efficiency of our algorithms.
DOI: https://doi.org/10.1145/2396761.2396880. Published 2012-10-29.
Citations: 17
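To make the two query types in the abstract above concrete, here is a brute-force sketch of their usual definitions. It uses straight-line Euclidean distance in the plane as a stand-in, which is an assumption made for brevity: the paper concerns shortest-path distance along a land surface and index-based evaluation, neither of which is reproduced here. Function names and the toy coordinates are hypothetical.

```python
import math


def monochromatic_rnn(query, points):
    """Indices of points that have `query` as their nearest neighbor.

    A point p qualifies when no other point of the same set is
    strictly closer to p than the query is.
    """
    result = []
    for i, p in enumerate(points):
        d_query = math.dist(p, query)
        others = (o for j, o in enumerate(points) if j != i)
        if all(d_query <= math.dist(p, o) for o in others):
            result.append(i)
    return result


def bichromatic_rnn(query, sites, objects):
    """Indices of objects that are at least as close to `query` as to any site.

    Sites and objects are two different point sets; the query plays
    the role of an additional candidate site.
    """
    result = []
    for i, obj in enumerate(objects):
        d_query = math.dist(obj, query)
        if all(d_query <= math.dist(obj, s) for s in sites):
            result.append(i)
    return result


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.5, 0.0)]
print(monochromatic_rnn((0.4, 0.0), points))
print(bichromatic_rnn((0.0, 0.0), [(10.0, 0.0)], [(1.0, 0.0), (9.0, 0.0)]))
```

The brute-force loops cost one distance computation per pair of points, which is exactly what becomes impractical once each distance is a shortest path over a large surface model.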
Efficient influence-based processing of market research queries
Anastasios Arvanitis, Antonios Deligiannakis, Y. Vassiliou
The rapid growth of the social web has contributed vast amounts of user preference data. Analyzing this data and its relationships with products could have several practical applications, such as personalized advertising, market segmentation, product feature promotion, etc. In this work we develop novel algorithms for efficiently processing two important classes of queries involving user preferences, i.e. potential customer identification and product positioning. With regard to the first problem, we formulate product attractiveness based on the notion of reverse skyline queries. We then present a new algorithm, termed RSA, that significantly reduces the I/O cost, as well as the computation cost, when compared to the state-of-the-art reverse skyline algorithm, while at the same time being able to quickly report the first results. Several real-world applications require processing of a large number of queries, in order to identify the product characteristics that maximize the number of potential customers. Motivated by this problem, we also develop a batched extension of our RSA algorithm that significantly improves upon processing multiple queries individually, by grouping contiguous candidates, exploiting I/O commonalities and enabling shared processing. Our experimental study using both real and synthetic data sets demonstrates the superiority of our proposed algorithms for the studied classes of queries.
DOI: https://doi.org/10.1145/2396761.2398420. Published 2012-10-29.
Citations: 24
PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce
Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, E. Upfal
Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (measured as the number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, of size independent of the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output will be a very close approximation of the collection of Frequent Itemsets (FIs) or Association Rules (ARs) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. The sizes of the random samples are independent of the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.
DOI: https://doi.org/10.1145/2396761.2396776. Published 2012-10-29.
Citations: 144
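The sample, mine, filter and aggregate flow described in the abstract above can be illustrated on a single machine. This is a rough sketch rather than PARMA itself: the lowered per-sample threshold, the majority-vote filter, the use of the median frequency, the brute-force miner and all parameter values are assumptions introduced for illustration, and nothing here carries the probabilistic guarantees the abstract refers to.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import median


def mine_sample(sample, min_freq, max_size=2):
    """Brute-force miner: frequency of every itemset up to `max_size` items."""
    counts = Counter()
    for transaction in sample:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    n = len(sample)
    return {s: c / n for s, c in counts.items() if c / n >= min_freq}


def approximate_frequent_itemsets(transactions, min_freq, eps,
                                  num_samples, sample_size, seed=0):
    """Mine several fixed-size random samples and combine the results."""
    rng = random.Random(seed)
    per_sample = []
    for _ in range(num_samples):  # each iteration stands in for one machine
        sample = [rng.choice(transactions) for _ in range(sample_size)]
        # Assumption: mine at a slightly lowered threshold to tolerate sampling error.
        per_sample.append(mine_sample(sample, min_freq - eps / 2))
    votes = Counter()
    for result in per_sample:
        votes.update(result.keys())
    combined = {}
    for itemset, seen in votes.items():
        if seen > num_samples / 2:  # assumption: keep majority-supported itemsets
            combined[itemset] = median(r[itemset] for r in per_sample if itemset in r)
    return combined


# Hypothetical toy dataset: every transaction contains a and b, few contain c.
data = [("a", "b")] * 90 + [("a", "b", "c")] * 10
found = approximate_frequent_itemsets(data, min_freq=0.5, eps=0.1,
                                      num_samples=5, sample_size=40)
print(found[frozenset({"a", "b"})])
```

Note that `sample_size` and `num_samples` are passed in independently of `len(data)`, which mirrors the point in the abstract that the sample sizes do not grow with the dataset.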
Supporting temporal analytics for health-related events in microblogs
Nattiya Kanhabua, Sara Romano, Avare Stewart, W. Nejdl
Microblogging services, such as Twitter, are gaining interest as a means of sharing information in social networks. Numerous works have shown the potential of using Twitter posts (or tweets) in order to infer the existence and magnitude of real-world events. In the medical domain, there has been a surge in detecting public health related tweets for early warning so that a rapid response from health authorities can take place. In this paper, we present a temporal analytics tool for supporting a comparative, temporal analysis of disease outbreaks between Twitter and official sources, such as the World Health Organization (WHO) and ProMED-mail. We automatically extract and aggregate outbreak events from official outbreak reports, producing time series data. Our tool can support a correlation analysis and an understanding of the temporal developments of outbreak mentions in Twitter, based on comparisons with official sources.
DOI: https://doi.org/10.1145/2396761.2398726. Published 2012-10-29.
Citations: 24
Detecting offensive tweets via topical feature discovery over a large scale twitter corpus
Guang Xiang, Bin Fan, Ling Wang, Jason I. Hong, C. Rosé
In this paper, we propose a novel semi-supervised approach for detecting profanity-related offensive content on Twitter. Our approach exploits linguistic regularities in profane language via statistical topic modeling on a huge Twitter corpus, and detects offensive tweets using these automatically generated features. Our approach performs competitively with a variety of machine learning (ML) algorithms. For instance, our approach achieves a true positive rate (TP) of 75.1% over 4029 test tweets using Logistic Regression, significantly outperforming the popular keyword matching baseline, which has a TP of 69.7%, while keeping the false positive rate (FP) at the same level as the baseline at about 3.77%. Our approach provides an alternative to the large scale hand annotation efforts required by fully supervised learning approaches.
DOI: https://doi.org/10.1145/2396761.2398556. Published 2012-10-29.
Citations: 254
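The abstract above compares methods by true positive rate and false positive rate against a keyword matching baseline. The sketch below shows how those two rates are computed and what a keyword matcher of that general kind might look like. The keyword list, the punctuation handling and the four toy tweets are hypothetical and are not taken from the paper.

```python
def keyword_baseline(tweet, keywords):
    """Flag a tweet as offensive if any token matches a listed keyword."""
    tokens = (tok.strip(".,!?") for tok in tweet.lower().split())
    return any(tok in keywords for tok in tokens)


def rates(labels, predictions):
    """Return (true positive rate, false positive rate).

    TPR = TP / (TP + FN): share of offensive tweets that were flagged.
    FPR = FP / (FP + TN): share of harmless tweets that were flagged.
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# Hypothetical labeled examples: (text, is_offensive).
examples = [
    ("you are a darn fool", True),
    ("what a lovely day", False),
    ("that was darn good coffee", False),
    ("such a blasted nuisance you are", True),
]
keywords = {"darn"}
labels = [label for _, label in examples]
predictions = [keyword_baseline(text, keywords) for text, _ in examples]
print(rates(labels, predictions))
```

On these four toy tweets the matcher misses one offensive tweet whose wording is not in the list and flags one harmless tweet that happens to contain a listed word, giving a TPR and an FPR of 0.5 each. Both failure modes are the kind of thing that learned features aim to reduce.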
Trust prediction via aggregating heterogeneous social networks
Jin Huang, F. Nie, Heng Huang, Yi-Cheng Tu
Along with the increasing popularity of social web sites, users rely more on trustworthiness information for many online activities. However, such social network data often suffer from severe sparsity and cannot provide users with enough information. Therefore, trust prediction has emerged as an important topic in social network research. Traditional approaches explore the topology of the trust graph. Previous research in sociology and our life experience suggest that people who are in the same social circle often exhibit similar behavior and tastes. Such ancillary information is often accessible and could therefore help trust prediction. In this paper, we address the link prediction problem by aggregating heterogeneous social networks and propose a novel joint manifold factorization (JMF) method. Our new joint learning model explores the user group level similarity between correlated graphs and simultaneously learns the individual graph structure, so the shared structures and patterns from multiple social networks can be utilized to enhance the prediction tasks. As a result, we not only improve the trust prediction in the target graph, but also facilitate other information retrieval tasks in the auxiliary graphs. To optimize the objective function, we break it down into several manageable sub-problems, then establish theoretical convergence with the aid of an auxiliary function. Extensive experiments were conducted on real-world data sets and all empirical results demonstrated the effectiveness of our method.
DOI: https://doi.org/10.1145/2396761.2398515 (published 2012-10-29)
Citations: 33
An evaluation and enhancement of densitometric fragmentation for content slicing reuse
Killian Levacher, S. Lawless, V. Wade
Content slicing addresses the need of adaptive systems to reuse open corpus material by converting it into re-composable information objects. However, this conversion is highly dependent upon the ability to correctly fragment pages into structurally sound atomic pieces. A recently suggested approach to fragmentation, which relies on a densitometric page representation, claims to achieve high accuracy and time performance. Although it has been well received within the research community, a full evaluation of this approach, identifying its strengths and weaknesses across a range of characteristics, has not been performed. This paper proposes an independent evaluation of the approach with respect to granularity control, accuracy, time performance, content diversity and linguistic dependency. Moreover, this paper makes a significant contribution by addressing important weaknesses discovered during the analysis, in order to improve the suitability and impact of the original algorithm within the context of content slicing.
DOI: https://doi.org/10.1145/2396761.2398652 (published 2012-10-29)
Citations: 3