Balanced edge partition has emerged as a new approach to partitioning an input graph for the purpose of scaling out parallel computations, which is of interest for several modern data analytics computation platforms, including platforms for iterative computations, machine learning problems, and graph databases. This new approach stands in stark contrast to the traditional approach of balanced vertex partition, where, for a given number of partitions, the problem is to minimize the number of edges cut subject to balancing the vertex cardinality of the partitions. In this paper, we first characterize the expected costs of vertex and edge partitions with and without aggregation of messages, for the commonly deployed policy of placing a vertex or an edge uniformly at random into one of the partitions. We then obtain the first approximation algorithms for the balanced edge-partition problem, which for the case of no aggregation match the best known approximation ratio for the balanced vertex-partition problem, and show that this continues to hold for the case with aggregation up to a factor equal to the maximum in-degree of a vertex. We report results of an extensive empirical evaluation on a set of real-world graphs, which quantifies the benefits of edge- vs. vertex-partition, and demonstrates the efficiency of natural greedy online assignments for the balanced edge-partition problem with and without aggregation.
{"title":"Balanced graph edge partition","authors":"F. Bourse, M. Lelarge, M. Vojnović","doi":"10.1145/2623330.2623660","DOIUrl":"https://doi.org/10.1145/2623330.2623660","url":null,"abstract":"Balanced edge partition has emerged as a new approach to partition an input graph data for the purpose of scaling out parallel computations, which is of interest for several modern data analytics computation platforms, including platforms for iterative computations, machine learning problems, and graph databases. This new approach stands in a stark contrast to the traditional approach of balanced vertex partition, where for given number of partitions, the problem is to minimize the number of edges cut subject to balancing the vertex cardinality of partitions. In this paper, we first characterize the expected costs of vertex and edge partitions with and without aggregation of messages, for the commonly deployed policy of placing a vertex or an edge uniformly at random to one of the partitions. We then obtain the first approximation algorithms for the balanced edge-partition problem which for the case of no aggregation matches the best known approximation ratio for the balanced vertex-partition problem, and show that this remains to hold for the case with aggregation up to factor that is equal to the maximum in-degree of a vertex. We report results of an extensive empirical evaluation on a set of real-world graphs, which quantifies the benefits of edge- vs. vertex-partition, and demonstrates efficiency of natural greedy online assignments for the balanced edge-partition problem with and with no aggregation.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83164796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce a novel graphical model, the collaborative score topic model (CSTM), for personal recommendations of textual documents. CSTM's chief novelty lies in its learned model of individual libraries, or sets of documents, associated with each user. Overall, CSTM is a joint directed probabilistic model of user-item scores (ratings), and the textual side information in the user libraries and the items. Creating a generative description of scores and the text allows CSTM to perform well in a wide variety of data regimes, smoothly combining the side information with observed ratings as the number of ratings available for a given user ranges from none to many. Experiments on real-world datasets demonstrate CSTM's performance. We further demonstrate its utility in an application for personal recommendations of posters which we deployed at the NIPS 2013 conference.
{"title":"Leveraging user libraries to bootstrap collaborative filtering","authors":"Laurent Charlin, R. Zemel, H. Larochelle","doi":"10.1145/2623330.2623663","DOIUrl":"https://doi.org/10.1145/2623330.2623663","url":null,"abstract":"We introduce a novel graphical model, the collaborative score topic model (CSTM), for personal recommendations of textual documents. CSTM's chief novelty lies in its learned model of individual libraries, or sets of documents, associated with each user. Overall, CSTM is a joint directed probabilistic model of user-item scores (ratings), and the textual side information in the user libraries and the items. Creating a generative description of scores and the text allows CSTM to perform well in a wide variety of data regimes, smoothly combining the side information with observed ratings as the number of ratings available for a given user ranges from none to many. Experiments on real-world datasets demonstrate CSTM's performance. We further demonstrate its utility in an application for personal recommendations of posters which we deployed at the NIPS 2013 conference.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80278453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study the problem of determining if an input matrix A ∈ ℝ^{m×n} can be well-approximated by a low rank matrix. Specifically, we study the problem of quickly estimating the rank or stable rank of A, the latter often providing a more robust measure of the rank. Since we seek significantly sublinear time algorithms, we cast these problems in the property testing framework. In this framework, A either has low rank or stable rank, or is far from having this property. The algorithm should read only a small number of entries or rows of A and decide which case A is in with high probability. If neither case occurs, the output is allowed to be arbitrary. We consider two notions of being far: (1) A requires changing at least an ε-fraction of its entries, or (2) A requires changing at least an ε-fraction of its rows. We call the former the "entry model" and the latter the "row model". We show: For testing if a matrix has rank at most d in the entry model, we improve the previous number of entries of A that need to be read from O(d²/ε²) (Krauthgamer and Sasson, SODA 2003) to O(d²/ε). Our algorithm is the first to adaptively query the entries of A, which for constant d we show is necessary to achieve O(1/ε) queries. For the important case of d = 1 we also give a new non-adaptive algorithm, improving the previous O(1/ε²) queries to O(log²(1/ε)/ε). For testing if a matrix has rank at most d in the row model, we prove an Ω(d/ε) lower bound on the number of rows that need to be read, even for adaptive algorithms. Our lower bound matches a non-adaptive upper bound of Krauthgamer and Sasson. For testing if a matrix has stable rank at most d in the row model, or requires changing an ε/d-fraction of its rows in order to have stable rank at most d, we prove that reading Θ(d/ε²) rows is necessary and sufficient. We also give an empirical evaluation of our rank and stable rank algorithms on real and synthetic datasets.
{"title":"Improved testing of low rank matrices","authors":"Yi Li, Zhengyu Wang, David P. Woodruff","doi":"10.1145/2623330.2623736","DOIUrl":"https://doi.org/10.1145/2623330.2623736","url":null,"abstract":"We study the problem of determining if an input matrix A εRm x n can be well-approximated by a low rank matrix. Specifically, we study the problem of quickly estimating the rank or stable rank of A, the latter often providing a more robust measure of the rank. Since we seek significantly sublinear time algorithms, we cast these problems in the property testing framework. In this framework, A either has low rank or stable rank, or is far from having this property. The algorithm should read only a small number of entries or rows of A and decide which case A is in with high probability. If neither case occurs, the output is allowed to be arbitrary. We consider two notions of being far: (1) A requires changing at least an ε-fraction of its entries, or (2) A requires changing at least an ε-fraction of its rows. We call the former the \"entry model\" and the latter the \"row model\". We show: For testing if a matrix has rank at most d in the entry model, we improve the previous number of entries of A that need to be read from O(d2/ε2) (Krauthgamer and Sasson, SODA 2003) to O(d2/ε). Our algorithm is the first to adaptively query the entries of A, which for constant d we show is necessary to achieve O(1/ε) queries. For the important case of d = 1 we also give a new non-adaptive algorithm, improving the previous O(1/ε2) queries to O(log2(1/ε) / ε). For testing if a matrix has rank at most d in the row model, we prove an Ω(d/ε) lower bound on the number of rows that need to be read, even for adaptive algorithms. Our lower bound matches a non-adaptive upper bound of Krauthgamer and Sasson. For testing if a matrix has stable rank at most d in the row model or requires changing an ε/d-fraction of its rows in order to have stable rank at most d, we prove that reading θ(d/ε2) rows is necessary and sufficient. We also give an empirical evaluation of our rank and stable rank algorithms on real and synthetic datasets.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"78 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83926533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the competitive environment of the internet, retaining and growing one's user base is of major concern to most web services. Furthermore, the economic model of many web services is to allow free access to most content and to generate revenue through advertising. This model requires securing user time on a site rather than the purchase of goods, which makes it crucially important to create new kinds of metrics and solutions for growth and retention efforts for web services. In this work, we address this problem by proposing a new retention metric for web services that concentrates on the rate of user return. We further apply predictive analysis to the proposed retention metric on a service, as a means of characterizing lost customers. Finally, we set up a simple yet effective framework to evaluate a multitude of factors that contribute to user return. Specifically, we define the problem of return time prediction for free web services. Our solution is based on the Cox proportional hazards model from survival analysis. The hazard-based approach offers several benefits, including the ability to work with censored data, to model the dynamics in user return rates, and to easily incorporate different types of covariates in the model. We compare the performance of our hazard-based model in predicting user return time and in categorizing users into buckets based on their predicted return time against several baseline regression and classification methods, and find the hazard-based approach to be superior.
{"title":"A hazard based approach to user return time prediction","authors":"Komal Kapoor, Mingxuan Sun, J. Srivastava, Tao Ye","doi":"10.1145/2623330.2623348","DOIUrl":"https://doi.org/10.1145/2623330.2623348","url":null,"abstract":"In the competitive environment of the internet, retaining and growing one's user base is of major concern to most web services. Furthermore, the economic model of many web services is allowing free access to most content, and generating revenue through advertising. This unique model requires securing user time on a site rather than the purchase of good which makes it crucially important to create new kinds of metrics and solutions for growth and retention efforts for web services. In this work, we address this problem by proposing a new retention metric for web services by concentrating on the rate of user return. We further apply predictive analysis to the proposed retention metric on a service, as a means for characterizing lost customers. Finally, we set up a simple yet effective framework to evaluate a multitude of factors that contribute to user return. Specifically, we define the problem of return time prediction for free web services. Our solution is based on the Cox's proportional hazard model from survival analysis. The hazard based approach offers several benefits including the ability to work with censored data, to model the dynamics in user return rates, and to easily incorporate different types of covariates in the model. We compare the performance of our hazard based model in predicting the user return time and in categorizing users into buckets based on their predicted return time, against several baseline regression and classification methods and find the hazard based approach to be superior.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82703924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alex Smola, Jing Jiang, Chong Wang
Recommendation and review sites offer a wealth of information beyond ratings. For instance, on IMDb users leave reviews, commenting on different aspects of a movie (e.g. actors, plot, visual effects), and expressing their sentiments (positive or negative) on these aspects in their reviews. This suggests that uncovering aspects and sentiments will allow us to gain a better understanding of users, movies, and the process involved in generating ratings. The ability to answer questions such as "Does this user care more about the plot or about the special effects?" or "What is the quality of the movie in terms of acting?" helps us to understand why certain ratings are generated. This can be used to provide more meaningful recommendations. In this work we propose a probabilistic model based on collaborative filtering and topic modeling. It allows us to capture the interest distribution of users and the content distribution for movies; it provides a link between interest and relevance on a per-aspect basis and it allows us to differentiate between positive and negative sentiments on a per-aspect basis. Unlike prior work our approach is entirely unsupervised and does not require knowledge of the aspect specific ratings or genres for inference. We evaluate our model on a live copy crawled from IMDb. Our model offers superior performance by joint modeling. Moreover, we are able to address the cold start problem -- by utilizing the information inherent in reviews our model demonstrates improvement for new users and movies.
{"title":"Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS)","authors":"Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alex Smola, Jing Jiang, Chong Wang","doi":"10.1145/2623330.2623758","DOIUrl":"https://doi.org/10.1145/2623330.2623758","url":null,"abstract":"Recommendation and review sites offer a wealth of information beyond ratings. For instance, on IMDb users leave reviews, commenting on different aspects of a movie (e.g. actors, plot, visual effects), and expressing their sentiments (positive or negative) on these aspects in their reviews. This suggests that uncovering aspects and sentiments will allow us to gain a better understanding of users, movies, and the process involved in generating ratings. The ability to answer questions such as \"Does this user care more about the plot or about the special effects?\" or \"What is the quality of the movie in terms of acting?\" helps us to understand why certain ratings are generated. This can be used to provide more meaningful recommendations. In this work we propose a probabilistic model based on collaborative filtering and topic modeling. It allows us to capture the interest distribution of users and the content distribution for movies; it provides a link between interest and relevance on a per-aspect basis and it allows us to differentiate between positive and negative sentiments on a per-aspect basis. Unlike prior work our approach is entirely unsupervised and does not require knowledge of the aspect specific ratings or genres for inference. We evaluate our model on a live copy crawled from IMDb. Our model offers superior performance by joint modeling. Moreover, we are able to address the cold start problem -- by utilizing the information inherent in reviews our model demonstrates improvement for new users and movies.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"6 16","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91418298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chen Luo, Jian-Guang Lou, Qingwei Lin, Qiang Fu, Rui Ding, D. Zhang, Zhe Wang
As online services have become more and more popular, incident diagnosis has emerged as a critical task in minimizing service downtime and ensuring the high quality of the services provided. For most online services, incident diagnosis is mainly conducted by analyzing a large amount of telemetry data collected from the services at runtime. Time series data and event sequence data are two major types of telemetry data. Correlation analysis techniques are important tools that are widely used by engineers for data-driven incident diagnosis. Despite their importance, there has been little previous work addressing the correlation between two types of heterogeneous data for incident diagnosis: continuous time series data and temporal event data. In this paper, we propose an approach to evaluate the correlation between time series data and event data. Our approach is capable of discovering three important aspects of event-time series correlation in the context of incident diagnosis: existence of correlation, temporal order, and monotonic effect. Our experimental results on simulated data sets and two real data sets demonstrate the effectiveness of the algorithm.
{"title":"Correlating events with time series for incident diagnosis","authors":"Chen Luo, Jian-Guang Lou, Qingwei Lin, Qiang Fu, Rui Ding, D. Zhang, Zhe Wang","doi":"10.1145/2623330.2623374","DOIUrl":"https://doi.org/10.1145/2623330.2623374","url":null,"abstract":"As online services have more and more popular, incident diagnosis has emerged as a critical task in minimizing the service downtime and ensuring high quality of the services provided. For most online services, incident diagnosis is mainly conducted by analyzing a large amount of telemetry data collected from the services at runtime. Time series data and event sequence data are two major types of telemetry data. Techniques of correlation analysis are important tools that are widely used by engineers for data-driven incident diagnosis. Despite their importance, there has been little previous work addressing the correlation between two types of heterogeneous data for incident diagnosis: continuous time series data and temporal event data. In this paper, we propose an approach to evaluate the correlation between time series data and event data. Our approach is capable of discovering three important aspects of event-timeseries correlation in the context of incident diagnosis: existence of correlation, temporal order, and monotonic effect. Our experimental results on simulation data sets and two real data sets demonstrate the effectiveness of the algorithm.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82301218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to the phenotypes, or medical concepts, that clinical researchers need or use. Existing phenotyping approaches typically require labor-intensive supervision from medical experts. We propose Marble, a novel sparse non-negative tensor factorization method to derive phenotype candidates with virtually no human supervision. Marble decomposes the observed tensor into two terms, a bias tensor and an interaction tensor. The bias tensor represents the baseline characteristics common amongst the overall population, and the interaction tensor defines the phenotypes. We demonstrate the capability of our proposed model on both simulated and patient data from a publicly available clinical database. Our results show that Marble-derived phenotypes provide at least a 42.8% reduction in the number of non-zero elements while retaining predictive power for classification purposes. Furthermore, the resulting phenotypes and baseline characteristics from real EHR data are consistent with known characteristics of the patient population. Marble can thus potentially be used to rapidly characterize, predict, and manage a large number of diseases, thereby promising a novel, data-driven solution that can benefit very large segments of the population.
{"title":"Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization","authors":"Joyce Ho, Joydeep Ghosh, Jimeng Sun","doi":"10.1145/2623330.2623658","DOIUrl":"https://doi.org/10.1145/2623330.2623658","url":null,"abstract":"The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to phenotypes, or medical concepts, that clinical researchers need or use. Existing phenotyping approaches typically require labor intensive supervision from medical experts. We propose Marble, a novel sparse non-negative tensor factorization method to derive phenotype candidates with virtually no human supervision. Marble decomposes the observed tensor into two terms, a bias tensor and an interaction tensor. The bias tensor represents the baseline characteristics common amongst the overall population and the interaction tensor defines the phenotypes. We demonstrate the capability of our proposed model on both simulated and patient data from a publicly available clinical database. Our results show that Marble derived phenotypes provide at least a 42.8% reduction in the number of non-zero element and also retains predictive power for classification purposes. Furthermore, the resulting phenotypes and baseline characteristics from real EHR data are consistent with known characteristics of the patient population. Thus it can potentially be used to rapidly characterize, predict, and manage a large number of diseases, thereby promising a novel, data-driven solution that can benefit very large segments of the population.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76934206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the problem of open-domain question answering (Open QA) over massive knowledge bases (KBs). Existing approaches use either manually curated KBs like Freebase or KBs automatically extracted from unstructured text. In this paper, we present OQA, the first approach to leverage both curated and extracted KBs. A key technical challenge is designing systems that are robust to the high variability in both natural language questions and massive KBs. OQA achieves robustness by decomposing the full Open QA problem into smaller sub-problems including question paraphrasing and query reformulation. OQA solves these sub-problems by mining millions of rules from an unlabeled question corpus and across multiple KBs. OQA then learns to integrate these rules by performing discriminative training on question-answer pairs using a latent-variable structured perceptron algorithm. We evaluate OQA on three benchmark question sets and demonstrate that it achieves up to twice the precision and recall of a state-of-the-art Open QA system.
{"title":"Open question answering over curated and extracted knowledge bases","authors":"Anthony Fader, Luke Zettlemoyer, Oren Etzioni","doi":"10.1145/2623330.2623677","DOIUrl":"https://doi.org/10.1145/2623330.2623677","url":null,"abstract":"We consider the problem of open-domain question answering (Open QA) over massive knowledge bases (KBs). Existing approaches use either manually curated KBs like Freebase or KBs automatically extracted from unstructured text. In this paper, we present OQA, the first approach to leverage both curated and extracted KBs. A key technical challenge is designing systems that are robust to the high variability in both natural language questions and massive KBs. OQA achieves robustness by decomposing the full Open QA problem into smaller sub-problems including question paraphrasing and query reformulation. OQA solves these sub-problems by mining millions of rules from an unlabeled question corpus and across multiple KBs. OQA then learns to integrate these rules by performing discriminative training on question-answer pairs using a latent-variable structured perceptron algorithm. We evaluate OQA on three benchmark question sets and demonstrate that it achieves up to twice the precision and recall of a state-of-the-art Open QA system.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76330508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bo Zong, Yinghui Wu, Jie Song, Ambuj K. Singh, H. Çam, Jiawei Han, Xifeng Yan
Performance monitor software for data centers typically generates a great number of alert sequences. These alert sequences indicate abnormal network events. Given a set of observed alert sequences, it is important to identify the most critical alerts that are potentially the causes of others. While the need for mining critical alerts over large scale alert sequences is evident, most alert analysis techniques stop at modeling and mining the causal relations among the alerts. This paper studies the critical alert mining problem: Given a set of alert sequences, we aim to find a set of k critical alerts such that the number of alerts potentially triggered by them is maximized. We show that the problem is intractable; therefore, we resort to approximation and heuristic algorithms. First, we develop an approximation algorithm that obtains a near-optimal alert set in quadratic time, and propose pruning techniques to improve its runtime performance. Moreover, we show a faster approximation exists, when the alerts follow certain causal structure. Second, we propose two fast heuristic algorithms based on tree sampling techniques. On real-life data, these algorithms identify a critical alert from up to 270,000 mined causal relations in 5 seconds; meanwhile, they preserve more than 80% of solution quality, and are up to 5,000 times faster than their approximation counterparts.
{"title":"Towards scalable critical alert mining","authors":"Bo Zong, Yinghui Wu, Jie Song, Ambuj K. Singh, H. Çam, Jiawei Han, Xifeng Yan","doi":"10.1145/2623330.2623729","DOIUrl":"https://doi.org/10.1145/2623330.2623729","url":null,"abstract":"Performance monitor software for data centers typically generates a great number of alert sequences. These alert sequences indicate abnormal network events. Given a set of observed alert sequences, it is important to identify the most critical alerts that are potentially the causes of others. While the need for mining critical alerts over large scale alert sequences is evident, most alert analysis techniques stop at modeling and mining the causal relations among the alerts. This paper studies the critical alert mining problem: Given a set of alert sequences, we aim to find a set of k critical alerts such that the number of alerts potentially triggered by them is maximized. We show that the problem is intractable; therefore, we resort to approximation and heuristic algorithms. First, we develop an approximation algorithm that obtains a near-optimal alert set in quadratic time, and propose pruning techniques to improve its runtime performance. Moreover, we show a faster approximation exists, when the alerts follow certain causal structure. Second, we propose two fast heuristic algorithms based on tree sampling techniques. On real-life data, these algorithms identify a critical alert from up to 270,000 mined causal relations in 5 seconds; meanwhile, they preserve more than 80% of solution quality, and are up to 5,000 times faster than their approximation counterparts.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79931152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We show how to programmatically model the processes that humans use when extracting answers to queries (e.g., "Who invented typewriter?", "List of Washington national parks") from semi-structured Web pages returned by a search engine. This modeling enables various applications, including automating repetitive search tasks and helping search engine developers design micro-segments of factoid questions. We describe the design and implementation of a domain-specific language that enables extracting data from a webpage based on its structure, visual layout, and linguistic patterns. We also describe an algorithm to rank multiple answers extracted from multiple webpages. On 100,000+ queries (across 7 micro-segments) obtained from Bing logs, our system LaSEWeb answered queries with an average recall of 71%. Moreover, the desired answer(s) were present among the top-3 suggestions in 95%+ of the cases.
{"title":"LaSEWeb","authors":"Oleksandr Polozov, Sumit Gulwani","doi":"10.1145/2623330.2623761","DOIUrl":"https://doi.org/10.1145/2623330.2623761","url":null,"abstract":"We show how to programmatically model processes that humans use when extracting answers to queries (e.g., \"Who invented typewriter?\", \"List of Washington national parks\") from semi-structured Web pages returned by a search engine. This modeling enables various applications including automating repetitive search tasks, and helping search engine developers design micro-segments of factoid questions. We describe the design and implementation of a domain-specific language that enables extracting data from a webpage based on its structure, visual layout, and linguistic patterns. We also describe an algorithm to rank multiple answers extracted from multiple webpages. On 100,000+ queries (across 7 micro-segments) obtained from Bing logs, our system LaSEWeb answered queries with an average recall of 71%. Also, the desired answer(s) were present in top-3 suggestions for 95%+ cases.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74392397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}