
Proceedings of the 22nd ACM international conference on Information & Knowledge Management: latest papers

Data management & analytics for healthcare (DARE 2013)
Ullas Nambiar, T. Niranjan
Reducing healthcare costs and improving quality of outcomes is a challenge even in developed economies. Much information remains in paper form and lacks common standards; sharing is uncommon and is frequently hampered by the lack of foolproof de-identification for patient privacy. All of these issues impede opportunities for data mining and analysis that would enable better predictive and preventive medicine. These issues are compounded in emerging economies by geopolitical constraints, transportation and geographic barriers, a much more limited clinical workforce, and infrastructural challenges to delivery. Thus, simple, high-impact deliverable interventions such as universal childhood immunization and maternal childcare are hampered by poor monitoring and reporting systems. This workshop focuses on identifying the challenges to be overcome in order to deliver efficient healthcare to the masses effectively. Specifically, we will provide a forum to discuss research directions and to share experience and insights from both academia and industry. The anticipated outcome of the workshop is an assessment of the state of the art in the area, as well as identification of critical next steps to pursue in this topic.
DOI: https://doi.org/10.1145/2505515.2505820 (published 2013-10-27)
Citations: 0
A generic front-stage for semi-stream processing
M. Naeem, Gerald Weber, G. Dobbie, C. Lutteroth
Recently, a number of semi-stream join algorithms have been published. The typical system setup for these consists of one fast stream input that has to be joined with a disk-based relation R. These semi-stream join approaches typically perform the join with a limited main memory partition assigned to them, which is generally not large enough to hold the whole relation R. We propose a caching approach that can be used as a front-stage for different semi-stream join algorithms, resulting in significant performance gains for common applications. We analyze our approach in the context of a seminal semi-stream join, MESHJOIN (Mesh Join), and provide a cost model for the resulting semi-stream join algorithm, which we call CMESHJOIN (Cached Mesh Join). The algorithm takes advantage of skewed distributions; this article presents results for Zipfian distributions of the type that appears in many applications.
DOI: https://doi.org/10.1145/2505515.2505734 (published 2013-10-27)
Citations: 3
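The caching idea in the semi-stream abstract above can be illustrated with a minimal sketch. It assumes a fixed-size LRU cache in front of a slower back-end join; the abstract does not state the cache policy, so the class name `CachedFrontStage`, the LRU eviction, and the dictionary standing in for the disk-based relation R are all hypothetical choices made for illustration, not the CMESHJOIN algorithm itself.

```python
from collections import OrderedDict


class CachedFrontStage:
    """Hypothetical front stage: an LRU cache in front of a slower join."""

    def __init__(self, disk_relation, capacity):
        self.disk_relation = disk_relation  # stands in for the disk-based relation R
        self.capacity = capacity            # maximum number of cached tuples of R
        self.cache = OrderedDict()          # join key -> tuple of R, in LRU order
        self.hits = 0
        self.misses = 0

    def join(self, key):
        """Return the tuple of R matching a stream tuple's join key, or None."""
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        row = self.disk_relation.get(key)   # stands in for the back-end join
        if row is not None:
            self.cache[key] = row
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used key
        return row


def demo():
    relation = {k: f"row-{k}" for k in range(100)}
    stage = CachedFrontStage(relation, capacity=2)
    stream = [1, 1, 2, 1, 3, 1, 2]          # skewed stream: key 1 dominates
    results = [stage.join(k) for k in stream]
    return results, stage.hits, stage.misses
```

On a skewed stream such as the one in `demo`, the frequent key stays in the cache, so only the rarer keys reach the slower back end. That is the effect the abstract attributes to skewed (for example Zipfian) distributions.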
An unsupervised transfer learning approach to discover topics for online reputation management
Tamara Martín-Wanton, Julio Gonzalo, Enrique Amigó
Microblogs play an important role in Online Reputation Management. Companies and organizations in general have an increasing interest in obtaining up-to-the-minute information about the emerging topics that concern their reputation. In this paper, we present a new technique to cluster a collection of tweets emitted within a short time span about a specific entity. Our approach relies on transfer learning by contextualizing a target collection of tweets with a large set of unlabeled "background" tweets that help improve the clustering of the target collection. We include background tweets together with target tweets in a TwitterLDA process, and we set the total number of clusters. In practice, this means that the system can adapt to find the right number of clusters for the target data, overcoming one of the limitations of LDA-based approaches (the need to establish the number of clusters a priori). Our experiments using RepLab 2012 data show that using the background collection gives a 20% improvement over a direct application of TwitterLDA using only the target collection. Our data also confirm that the approach can effectively predict the right number of target clusters in a way that is robust with respect to the total number of clusters established a priori.
DOI: https://doi.org/10.1145/2505515.2507845 (published 2013-10-27)
Citations: 14
On sparsity and drift for effective real-time filtering in microblogs
M. Albakour, C. Macdonald, I. Ounis
In this paper, we approach the problem of real-time filtering in the Twitter Microblogging platform. We adapt an effective traditional news filtering technique, which uses a text classifier inspired by Rocchio's relevance feedback algorithm, to build and dynamically update a profile of the user's interests in real-time. In our adaptation, we tackle two challenges that are particularly prevalent in Twitter: sparsity and drift. In particular, sparsity stems from the brevity of tweets, while drift occurs as events related to the topic develop or the interests of the user change. First, to tackle the acute sparsity problem, we apply query expansion to derive terms or related tweets for a richer initialisation of the user interests within the profile. Second, to deal with drift, we modify the user profile to balance between the importance of the short-term interests, i.e. emerging subtopics, and the long-term interests in the overall topic. Moreover, we investigate an event detection method from Twitter and newswire streams to predict times at which drift may happen. Through experiments using the TREC Microblog track 2012, we show that our approach is effective for a number of common filtering metrics such as the user's utility, and that it compares favourably with state-of-the-art news filtering baselines. Our results also uncover the impact of different factors on handling topic drifting.
DOI: https://doi.org/10.1145/2505515.2505709 (published 2013-10-27)
Citations: 39
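The profile update described in the microblog filtering abstract above can be sketched with a textbook Rocchio-style feedback step. The abstract does not give the paper's actual update rule or parameter values, so the functions `rocchio_update` and `cosine` and the default weights `alpha`, `beta` and `gamma` below are assumptions chosen for illustration.

```python
import math
from collections import Counter


def rocchio_update(profile, tweet_terms, relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a new term-weight profile after one feedback step (Rocchio-style).

    alpha scales the existing profile (values below 1 let old interests fade),
    beta rewards terms of a relevant tweet, gamma penalises terms of a
    non-relevant tweet. All three defaults are illustrative assumptions.
    """
    updated = Counter({term: alpha * weight for term, weight in profile.items()})
    sign = beta if relevant else -gamma
    for term, tf in Counter(tweet_terms).items():
        updated[term] += sign * tf
    # Keep only positive weights, as is common in Rocchio implementations.
    return {term: weight for term, weight in updated.items() if weight > 0}


def cosine(profile, tweet_terms):
    """Cosine similarity between the profile and a tweet's term-frequency vector."""
    tweet = Counter(tweet_terms)
    dot = sum(profile.get(term, 0.0) * tf for term, tf in tweet.items())
    norm_p = math.sqrt(sum(w * w for w in profile.values()))
    norm_t = math.sqrt(sum(tf * tf for tf in tweet.values()))
    return dot / (norm_p * norm_t) if norm_p and norm_t else 0.0
```

Starting from the profile `{"earthquake": 1.0}`, feeding back the relevant tweet `["earthquake", "tsunami"]` adds weight to "tsunami", so a later tweet about a tsunami warning scores above zero against the updated profile. Choosing `alpha` below 1 is one simple way to favour emerging subtopics over long-term interests.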
Leveraging data to change industry paradigms
C. Farmer
Much of the conversation on "big data" is centered on data technologies and analytics platforms and how established companies apply them. While those technologies and platforms are certainly very important for industry incumbents, data analytics is also often a key building block for new start-up entrants looking to disrupt industry verticals. In many cases, the best examples of novel applications of data to create new services and competitive advantage require a complete rethinking of organizational design in order to create feedback loops and rethink cost structures. The company I founded, SignalFire, is applying data for competitive advantage in my own industry, venture capital, but there are myriad examples of this trend across industries such as transportation, financial services, retail, media and many other markets. In this talk, I will discuss how we analyze these trends as venture capitalists and will look at a few case studies of specific companies leveraging data to innovate in their industries.
DOI: https://doi.org/10.1145/2505515.2514694 (published 2013-10-27)
Citations: 0
Personalized point-of-interest recommendation by mining users' preference transition
Xin Liu, Yong Liu, K. Aberer, C. Miao
Location-based social networks (LBSNs) offer researchers rich data to study people's online activities and mobility patterns. One important application of such studies is to provide personalized point-of-interest (POI) recommendations to enhance user experience in LBSNs. Previous solutions directly predict users' preferences for locations but fail to provide insights into users' preference transitions among locations. In this work, we propose a novel category-aware POI recommendation model, which exploits the transition patterns of users' preferences over location categories to improve location recommendation accuracy. Our approach consists of two stages: (1) preference transition (over location categories) prediction, and (2) category-aware POI recommendation. Matrix factorization is employed to predict a user's preference transitions over categories and then her preference for locations in the corresponding categories. Experiments on real data demonstrate that our approach outperforms the state-of-the-art POI recommendation models by at least 39.75% in terms of recall.
DOI: https://doi.org/10.1145/2505515.2505639 (published 2013-10-27)
Citations: 260
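The POI abstract above names matrix factorization as its prediction tool without giving the model details. As a rough illustration only, the sketch below shows generic matrix factorization trained by stochastic gradient descent on (user, category, preference) triples. The function names, the learning rate, the regularisation weight and the toy data are assumptions, and the two-stage, transition-aware model of the paper is not reproduced.

```python
import random


def factorize(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn user factors P and item factors Q from (user, item, value) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][k] * Q[i][k] for k in range(rank))
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on the regularised squared error.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted preference of user u for item (here: location category) i."""
    return sum(p * q for p, q in zip(P[u], Q[i]))


def squared_error(P, Q, ratings):
    """Total squared reconstruction error over the observed triples."""
    return sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
```

In this toy setting the rows stand for users and the columns for location categories. Predicted scores for unobserved pairs could then be ranked to pick candidate categories before ranking the locations inside them.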
Flexible and adaptive subspace search for outlier analysis
F. Keller, Emmanuel Müller, Andreas Wixler, Klemens Böhm
There exists a variety of traditional outlier models, which measure the deviation of outliers with respect to the full attribute space. However, these techniques fail to detect outliers that deviate only w.r.t. an attribute subset. To address this problem, recent techniques focus on a selection of subspaces that allow: (1) a clear distinction between clustered objects and outliers; (2) a description of outlier reasons by the selected subspaces. However, depending on the outlier model used, different objects in different subspaces have the highest deviation. It is an open research issue to make subspace selection adaptive to the outlier score of each object and flexible w.r.t. the use of different outlier models. In this work, we propose such a flexible and adaptive subspace selection scheme. Our generic processing allows instantiations with different outlier models. We utilize the differences of outlier scores in random subspaces to perform a combinatorial refinement of relevant subspaces. Our refinement allows an individual selection of subspaces for each outlier, which is tailored to the underlying outlier model. In the experiments, we show the flexibility of our subspace search w.r.t. various outlier models such as distance-based, angle-based, and local-density-based outlier detection.
DOI: https://doi.org/10.1145/2505515.2505560 (published 2013-10-27)
Citations: 21
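To make the subspace idea in the outlier abstract above concrete, the sketch below scores one object in several attribute subsets with a simple distance-based model and picks the subset where the object deviates most. It is a simplification under stated assumptions: the k-nearest-neighbour distance stands in for "an outlier model", exhaustive enumeration of two-attribute subspaces stands in for the random subspaces and combinatorial refinement of the paper, and the names `knn_score` and `best_subspace` are hypothetical.

```python
import math
from itertools import combinations


def knn_score(data, idx, dims, k=2):
    """Distance-based outlier score: distance from object idx to its k-th
    nearest neighbour, measured only on the attributes listed in dims."""
    dists = sorted(
        math.sqrt(sum((data[idx][d] - other[d]) ** 2 for d in dims))
        for j, other in enumerate(data)
        if j != idx
    )
    return dists[k - 1]


def best_subspace(data, idx, n_dims, size=2, k=2):
    """Return the attribute subset of the given size in which object idx
    has the highest outlier score (exhaustive search, for illustration)."""
    return max(
        combinations(range(n_dims), size),
        key=lambda dims: knn_score(data, idx, dims, k),
    )


DATA = [
    (0.0, 0.0, 0.0, 0.0),
    (0.1, 0.0, 0.1, 0.0),
    (0.0, 0.1, 0.0, 0.1),
    (0.1, 0.1, 0.1, 0.1),
    (5.0, 5.0, 0.0, 0.0),  # deviates only in attributes 0 and 1
]
```

For the last object in `DATA`, the score is large in the subspace of attributes 0 and 1 and small in the subspace of attributes 2 and 3, which is the kind of per-object difference the abstract proposes to exploit. Swapping `knn_score` for another scoring function would mirror the flexibility with respect to outlier models that the abstract describes.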
Instant foodie: predicting expert ratings from grassroots
Chenhao Tan, Ed H. Chi, David A. Huffaker, Gueorgi Kossinets, Alex Smola
Consumer review sites and recommender systems typically rely on a large volume of user-contributed ratings, which makes rating acquisition an essential component in the design of such systems. User ratings are then summarized to provide an aggregate score representing a popular evaluation of an item. An inherent problem in such summarization is potential bias due to raters' self-selection and heterogeneity in terms of experience, tastes and rating scale interpretation. There are two major approaches to collecting ratings, which have different advantages and disadvantages. One is to allow a large number of volunteers to choose and rate items directly (a method employed by e.g. Yelp and Google Places). Alternatively, a panel of raters may be maintained and invited to rate a predefined set of items at regular intervals (such as in Zagat Survey). The latter approach arguably results in more consistent reviews and reduced selection bias, but at the expense of much smaller coverage (fewer rated items). In this paper, we examine the two different approaches to collecting user ratings of restaurants and explore the question of whether it is possible to reconcile them. Specifically, we study the problem of inferring the more calibrated Zagat Survey ratings (which we dub 'expert ratings') from the user-generated ratings ('grassroots') in Google Places. To that effect, we employ latent factor models and provide a probabilistic treatment of the ordinal rankings. We can predict Zagat Survey ratings accurately from ad hoc user-generated ratings by joint optimization on two datasets. We analyze the resulting model, and find that users become more discerning as they submit more ratings. We also describe an approach towards cross-city recommendations, answering questions such as 'What is the equivalent of the Per Se restaurant in Chicago?'
DOI: https://doi.org/10.1145/2505515.2505712 (published 2013-10-27)
Citations: 13
Timeline adaptation for text classification
Fumiyo Fukumoto, Yoshimi Suzuki, A. Takasu
In this paper, we address the text classification problem in which the test data were created in a different period of time from the training data, and present a method for text classification based on temporal adaptation. We first applied lexical chains to the training data to collect terms with semantic relatedness, and created sets (we call these Sem sets). Semantically related terms in the documents are replaced with their representative term. From the results, we identified short terms, i.e. terms that are salient for a specific period of time. Finally, we trained SVM classifiers by applying a temporal weighting function to each selected short term within the training data, and classified the test data. The temporal weighting function weights each short term in the training data according to the temporal distance between the training and test data. The results using MedLine data showed that the method was comparable to the current state-of-the-art biased-SVM method, and that it is especially effective when the test data are temporally far from the training data.
DOI: https://doi.org/10.1145/2505515.2507833 · Published 2013-10-27
Citations: 1
Question routing to user communities
Aditya Pal, Fei Wang, Michelle X. Zhou, Jeffrey Nichols, Barton A. Smith
An online community consists of a group of users who share a common interest, background, or experience, and whose collective goal is to contribute to the welfare of the community members. Question answering is an important feature that enables community members to exchange knowledge within the community boundary. The overwhelming number of communities necessitates a good question routing strategy, so that new questions get routed to the appropriately focused community and thus get resolved. In this paper, we consider the novel problem of routing questions to the right community and propose a framework to select the right set of communities for a question. We begin by using several previously proposed features for users and add further features, namely language attributes and inclination to respond, for community modeling. We then introduce two k-nearest-neighbor-based aggregation algorithms for computing community scores. We show how these scores can be combined to recommend communities, and we test the effectiveness of the recommendations on a large real-world dataset.
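The abstract does not describe the two aggregation algorithms, so the sketch below is an illustration under assumptions rather than the authors' method: users are represented as term-weight vectors, the k users most similar to the question are found by cosine similarity, and their similarities are aggregated per community either by sum or by mean. The function names, the `mode` parameter, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_community_scores(question, users, k=2, mode="sum"):
    """Score communities from the k users nearest to the question.

    users is a list of (community, term-weight dict) pairs.
    mode="sum" adds the neighbors' similarities per community;
    mode="mean" averages them (both are assumed aggregation rules).
    """
    ranked = sorted(((cosine(question, vec), community)
                     for community, vec in users), reverse=True)[:k]
    totals, counts = defaultdict(float), defaultdict(int)
    for sim, community in ranked:
        totals[community] += sim
        counts[community] += 1
    if mode == "mean":
        return {c: totals[c] / counts[c] for c in totals}
    return dict(totals)

# Hypothetical toy data: three users in two communities.
users = [
    ("cooking", {"pasta": 1.0, "sauce": 1.0}),
    ("cooking", {"pasta": 1.0}),
    ("cycling", {"bike": 1.0, "gear": 1.0}),
]
question = {"pasta": 1.0}
scores = knn_community_scores(question, users, k=2, mode="sum")
best = max(scores, key=scores.get)
```

Summing favors communities with many relevant members, while averaging favors communities whose nearest members are individually close to the question; scores from both rules could be combined when ranking communities.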
DOI: https://doi.org/10.1145/2505515.2505669 · Published 2013-10-27
Citations: 18