Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining: latest articles

The online revolution: education for everyone

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pub Date : 2013-08-11 DOI: 10.1145/2487575.2492148

A. Ng, D. Koller

In 2011, Stanford University offered three online courses, which anyone in the world could enroll in and take for free. Together, these three courses had enrollments of around 350,000 students, making this one of the largest experiments in online education ever performed. Since the beginning of 2012, we have transitioned this effort into a new venture, Coursera, a social entrepreneurship company whose mission is to make high-quality education accessible to everyone by allowing the best universities to offer courses to everyone around the world, for free. Coursera classes provide a real course experience to students, including video content, interactive exercises with meaningful feedback, using both auto-grading and peer-grading, and a rich peer-to-peer interaction around the course materials. Currently, Coursera has 62 university partners, and over 3 million students enrolled in its over 300 courses. These courses span a range of topics including computer science, business, medicine, science, humanities, social sciences, and more. In this talk, I'll report on this far-reaching experiment in education, and why we believe this model can provide both an improved classroom experience for our on-campus students, via a flipped classroom model, as well as a meaningful learning experience for the millions of students around the world who would otherwise never have access to education of this quality.

Citations: 11

The business impact of deep learning

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pub Date : 2013-08-11 DOI: 10.1145/2487575.2491127

Jeremy Howard

In the last year deep learning has gone from being a special purpose machine learning technique used mainly for image and speech recognition, to becoming a general purpose machine learning tool. This has broad implications for all organizations that rely on data analysis. It represents the latest development in a general trend towards more automated algorithms, and away from domain specific knowledge. For organizations that rely on domain expertise for their competitive advantage, this trend could be extremely disruptive. For start-ups interested in entering established markets, this trend could be a major opportunity. This talk will be a non-technical introduction to general-purpose deep learning, and its potential business impact.

Citations: 16

Constrained stochastic gradient descent for large-scale least squares problem

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pub Date : 2013-08-11 DOI: 10.1145/2487575.2487635

Yang Mu, W. Ding, Tianyi Zhou, D. Tao

The least squares problem is one of the most important regression problems in statistics, machine learning and data mining. In this paper, we present the Constrained Stochastic Gradient Descent (CSGD) algorithm to solve the large-scale least squares problem. CSGD improves on Stochastic Gradient Descent (SGD) by imposing a provable constraint that the linear regression line passes through the mean point of all the data points. It results in the best regret bound, $O(\log T)$, and the fastest convergence speed among all first-order approaches. Empirical studies justify the effectiveness of CSGD by comparing it with SGD and other state-of-the-art approaches. An example is also given to show how to use CSGD to optimize SGD-based least squares problems to achieve better performance.
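
The constraint described in this abstract can be illustrated with a small sketch. This is not the authors' CSGD algorithm: it is plain SGD followed by a projection onto the hyperplane of weight vectors whose regression line passes through the mean point, and it assumes the data means are computed up front from the full dataset. The function names, the step-size schedule, and the convention of appending a constant feature for the intercept are all assumptions made for illustration.

```python
import random


def project_onto_mean_constraint(w, x_mean, y_mean):
    """Project w onto the hyperplane {w : w . x_mean == y_mean}."""
    norm_sq = sum(v * v for v in x_mean)
    if norm_sq == 0.0:
        return list(w)  # degenerate mean vector: nothing to project onto
    gap = (sum(wi * xi for wi, xi in zip(w, x_mean)) - y_mean) / norm_sq
    return [wi - gap * xi for wi, xi in zip(w, x_mean)]


def constrained_sgd(xs, ys, epochs=50, lr0=0.1, seed=0):
    """SGD on squared error, projecting back onto the constraint after each step.

    Each row of xs is a feature vector; append a constant 1.0 feature
    if an intercept is wanted (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    x_mean = [sum(x[j] for x in xs) / n for j in range(d)]
    y_mean = sum(ys) / n
    # Start from a point that already satisfies the constraint.
    w = project_onto_mean_constraint([0.0] * d, x_mean, y_mean)
    t = 0
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            t += 1
            lr = lr0 / (1.0 + 0.01 * t)  # hypothetical decaying step size
            err = sum(wj * xj for wj, xj in zip(w, xs[i])) - ys[i]
            w = [wj - lr * err * xj for wj, xj in zip(w, xs[i])]
            w = project_onto_mean_constraint(w, x_mean, y_mean)
    return w, x_mean, y_mean
```

Because every iterate is projected, the fitted line passes through the mean point at all times, which shrinks the search to a lower-dimensional set; the regret and convergence claims in the abstract refer to the authors' method, not to this sketch.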

Citations: 6

Towards long-lead forecasting of extreme flood events: a data mining framework for precipitation cluster precursors identification

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pub Date : 2013-08-11 DOI: 10.1145/2487575.2488220

Dawei Wang, W. Ding, Kui Yu, Xindong Wu, Ping Chen, D. Small, S. Islam

The development of techniques for forecasting disastrous floods that can provide warnings at a long lead time (5-15 days) is of great importance to society. Extreme floods are usually the consequence of a sequence of precipitation events occurring over several days to several weeks. Though precise short-term forecasting of the magnitude and extent of individual precipitation events is still beyond our reach, long-term forecasting of precipitation clusters can be attempted by identifying persistent atmospheric regimes that are conducive to precipitation clusters. However, such forecasting suffers from an overwhelming number of relevant features and a high imbalance in the sample sets. In this paper, we propose an integrated data mining framework for identifying the precursors to precipitation event clusters and use this information to predict extended periods of extreme precipitation and subsequent floods. We synthesize a representative feature set that describes atmospheric motion, and apply a streaming feature selection algorithm to identify the precipitation precursors online from the enormous feature space. A hierarchical re-sampling approach is embedded in the framework to deal with the imbalance problem. An extensive empirical study is conducted on historical precipitation and associated flood data collected in the State of Iowa. Using our framework, a few physically meaningful precipitation cluster precursor sets are identified from millions of features. More than 90% of extreme precipitation events are captured by the proposed prediction model using precipitation cluster precursors with a lead time of more than 5 days.

Citations: 18

Flexible and robust co-regularized multi-domain graph clustering

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pub Date : 2013-08-11 DOI: 10.1145/2487575.2487582

Wei Cheng, Xiang Zhang, Zhishan Guo, Yubao Wu, P. Sullivan, Wei Wang

Multi-view graph clustering aims to enhance clustering performance by integrating heterogeneous information collected in different domains. Each domain provides a different view of the data instances. Leveraging cross-domain information has been demonstrated to be an effective way to achieve better clustering results. Despite this previous success, existing multi-view graph clustering methods usually assume that different views are available for the same set of instances. Thus instances in different domains can be treated as having a strict one-to-one relationship. In many real-life applications, however, data instances in one domain may correspond to multiple instances in another domain. Moreover, relationships between instances in different domains may be associated with weights based on prior (partial) knowledge. In this paper, we propose a flexible and robust framework, CGC (Co-regularized Graph Clustering), based on non-negative matrix factorization (NMF), to tackle these challenges. CGC has several advantages over the existing methods. First, it supports many-to-many cross-domain instance relationships. Second, it incorporates weights on cross-domain relationships. Third, it allows partial cross-domain mapping so that graphs in different domains may have different sizes. Finally, it provides users with the extent to which the cross-domain instance relationship violates the in-domain clustering structure, and thus enables users to re-evaluate the consistency of the relationship. Extensive experimental results on UCI benchmark data sets, newsgroup data sets and biological interaction networks demonstrate the effectiveness of our approach.

Citations: 76

Confluence: conformity influence in large social networks

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pub Date : 2013-08-11 DOI: 10.1145/2487575.2487691

Jie Tang, Sen Wu, Jimeng Sun

Conformity is a type of social influence involving a change in opinion or behavior in order to fit in with a group. Employing several social networks as the source for our experimental data, we study how the effect of conformity plays a role in changing users' online behavior. We formally define several major types of conformity at the individual, peer, and group levels. We propose the Confluence model to formalize the effects of social conformity in a probabilistic model. Confluence can distinguish and quantify the effects of the different types of conformity. To scale up to large social networks, we propose a distributed learning method that can construct the Confluence model efficiently with near-linear speedup. Our experimental results on four different types of large social networks, i.e., Flickr, Gowalla, Weibo and Co-Author, verify the existence of the conformity phenomena. Leveraging the conformity information, Confluence can accurately predict actions of users. Our experiments show that Confluence significantly improves the prediction accuracy by up to 5-10% compared with several alternative methods.

Citations: 133

Nonparametric hierarchal bayesian modeling in non-contractual heterogeneous survival data

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pub Date : 2013-08-11 DOI: 10.1145/2487575.2487590

Shouichi Nagano, Yusuke Ichikawa, Noriko Takaya, Tadasu Uchiyama, M. Abe

An important problem in the non-contractual marketing domain is discovering the customer lifetime and assessing the impact of customer's characteristic variables on the lifetime. Unfortunately, the conventional hierarchical Bayes model cannot discern the impact of customer's characteristic variables for each customer. To overcome this problem, we present a new survival model using a non-parametric Bayes paradigm with MCMC. The assumption of a conventional model, logarithm of purchase rate and dropout rate with linear regression, is extended to include our assumption of the Dirichlet Process Mixture of regression. The extension assumes that each customer belongs probabilistically to different mixtures of regression, thereby permitting us to estimate a different impact of customer characteristic variables for each customer. Our model creates several customer groups to mirror the structure of the target data set. The effectiveness of our proposal is confirmed by a comparison involving a real e-commerce transaction dataset and an artificial dataset; it generally achieves higher predictive performance. In addition, we show that preselecting the actual number of customer groups does not always lead to higher predictive performance.

Citations: 3

Collaborative matrix factorization with multiple similarities for predicting drug-target interactions

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pub Date : 2013-08-11 DOI: 10.1145/2487575.2487670

Xiaodong Zheng, Hao Ding, Hiroshi Mamitsuka, Shanfeng Zhu

We address the problem of predicting new drug-target interactions from three inputs: known interactions, similarities over drugs, and similarities over targets. This setting has been considered by many methods, which, however, share a common limitation: they allow only one similarity matrix over drugs and one over targets. The key idea of our approach is to use more than one similarity matrix over drugs as well as over targets, where the weights over the multiple similarity matrices are estimated from data to automatically select the similarities that are effective for improving the performance of predicting drug-target interactions. We propose a factor model, named Multiple Similarities Collaborative Matrix Factorization (MSCMF), which projects drugs and targets into a common low-rank feature space that is further consistent with the weighted similarity matrices over drugs and over targets. These two low-rank matrices and the weights over similarity matrices are estimated by an alternating least squares algorithm. Our approach makes it possible to predict drug-target interactions collaboratively from the two low-rank matrices and to detect the similarities that are important for predicting drug-target interactions. This approach is general and applicable to any binary relation with similarities over elements, as found in many applications, such as recommender systems. In fact, MSCMF is an extension of weighted low-rank approximation for one-class collaborative filtering. We extensively evaluated the performance of MSCMF by using both synthetic and real datasets. Experimental results showed that MSCMF selects similarities that are useful for improving predictive performance, and that it has a performance advantage over six state-of-the-art methods for predicting drug-target interactions.
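
One way to write down an objective consistent with this description is sketched below. The notation is an assumption introduced here for illustration and is not taken from the abstract: $Y$ is the known interaction matrix, $W$ a matrix of observation weights, $A$ and $B$ the low-rank drug and target factor matrices, $S_d^{k}$ and $S_t^{k}$ the $k$-th drug and target similarity matrices with weights $\omega_d^{k}$ and $\omega_t^{k}$, and $\lambda_l, \lambda_d, \lambda_t, \lambda_\omega$ regularization coefficients. The sum-to-one normalization of the weights is likewise only one plausible choice.

$$
\begin{aligned}
\min_{A,\,B,\,\omega_d,\,\omega_t}\quad
& \bigl\lVert W \odot (Y - A B^{\top}) \bigr\rVert_F^2
 + \lambda_l \bigl( \lVert A \rVert_F^2 + \lVert B \rVert_F^2 \bigr) \\
& + \lambda_d \Bigl\lVert \sum_{k} \omega_d^{k} S_d^{k} - A A^{\top} \Bigr\rVert_F^2
 + \lambda_t \Bigl\lVert \sum_{k} \omega_t^{k} S_t^{k} - B B^{\top} \Bigr\rVert_F^2 \\
& + \lambda_\omega \bigl( \lVert \omega_d \rVert_2^2 + \lVert \omega_t \rVert_2^2 \bigr) \\
\text{subject to}\quad
& \sum_{k} \omega_d^{k} = 1, \qquad \sum_{k} \omega_t^{k} = 1 .
\end{aligned}
$$

In this reading, the first term fits the known interactions, the two similarity terms pull the inner products of the factors toward the weighted combinations of similarity matrices, and a similarity that helps little would receive a small weight.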

Citations: 270

One theme in all views: modeling consensus topics in multiple contexts

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pub Date : 2013-08-11 DOI: 10.1145/2487575.2487682

Jian Tang, Ming Zhang, Q. Mei

New challenges have been presented to classical topic models when applied to social media, as user-generated content suffers from significant problems of data sparseness. A variety of heuristic adjustments to these models have been proposed, many of which are based on the use of context information to improve the performance of topic modeling. Existing contextualized topic models rely on arbitrary manipulation of the model structure, by incorporating various context variables into the generative process of classical topic models in an ad hoc manner. Such manipulations usually result in much more complicated model structures, sophisticated inference procedures, and low generalizability to accommodate arbitrary types or combinations of contexts. In this paper we explore a different direction. We propose a general solution that is able to exploit multiple types of contexts without arbitrary manipulation of the structure of classical topic models. We formulate different types of contexts as multiple views of the partition of the corpus. A co-regularization framework is proposed to let these views collaborate with each other, vote for the consensus topics, and distinguish them from view-specific topics. Experiments with real-world datasets prove that the proposed method is both effective and flexible to handle arbitrary types of contexts.

Citations: 50

Inferring distant-time location in low-sampling-rate trajectories

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pub Date : 2013-08-11 DOI: 10.1145/2487575.2487707

Meng-Fen Chiang, Yung-Hsiang Lin, Wen-Chih Peng, Philip S. Yu

With the growth of location-based services and social services, low-sampling-rate trajectories from check-in data or photos with geo-tag information have become ubiquitous. In general, most of the detailed movement information in low-sampling-rate trajectories is lost. Prior work has addressed distant-time location prediction in high-sampling-rate trajectories. However, existing prediction models are pattern-based and thus not applicable, owing to the sparsity of data points in low-sampling-rate trajectories. To address this sparsity, we develop a Reachability-based prediction model on Time-constrained Mobility Graph (RTMG) to predict locations for distant-time queries. Specifically, we design an adaptive temporal exploration approach to extract effective supporting trajectories that are temporally close to the query time. Based on the supporting trajectories, a Time-constrained mobility Graph (TG) is constructed to capture mobility information at the given query time. In light of TG, we further derive the reachability probabilities among locations in TG. Thus, the location with maximum reachability from the current location among all possible locations in the supporting trajectories is taken as the prediction result. To process queries efficiently, we propose the index structure Sorted Interval-Tree (SOIT) to organize location records. Extensive experiments with real data demonstrate the effectiveness and efficiency of RTMG. First, RTMG with adaptive temporal exploration significantly outperforms the existing pattern-based prediction model HPM [2] over varying data sparsity, with higher accuracy and higher coverage. Also, the proposed index structure SOIT can efficiently speed up RTMG on large-scale trajectory datasets. In the future, we could extend RTMG by considering more factors (e.g., staying durations in locations, application usages in smart phones) to further improve the prediction accuracy.
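The pipeline described in the abstract (select supporting trajectories near the query time, build a time-constrained graph, rank candidate locations by reachability) can be illustrated with a small Python sketch. This is not the authors' implementation: the function names, the window-doubling rule used for "adaptive temporal exploration", and the best-path (max-product) definition of reachability are all assumptions made for illustration, and the SOIT index is omitted entirely.

```python
from collections import defaultdict

def supporting_trajectories(trajectories, query_time, min_support=2,
                            window=1.0, max_window=8.0):
    """Assumed exploration rule: double the time window around query_time
    until at least min_support trajectories have a record inside it."""
    while True:
        support = [t for t in trajectories
                   if any(abs(ts - query_time) <= window for ts, _ in t)]
        if len(support) >= min_support or window >= max_window:
            return support, window
        window *= 2

def build_graph(support, start_time, end_time):
    """Transition probabilities between consecutive records whose
    timestamps fall inside [start_time, end_time]."""
    counts = defaultdict(lambda: defaultdict(int))
    for traj in support:
        inside = [(ts, loc) for ts, loc in traj if start_time <= ts <= end_time]
        for (_, a), (_, b) in zip(inside, inside[1:]):
            if a != b:
                counts[a][b] += 1
    return {a: {b: c / sum(nbrs.values()) for b, c in nbrs.items()}
            for a, nbrs in counts.items()}

def reachability(graph, source):
    """Assumed reachability score: probability of the best path from
    source to each node, found by repeated max-product relaxation."""
    best = {source: 1.0}
    frontier = [source]
    while frontier:
        improved = []
        for a in frontier:
            for b, p in graph.get(a, {}).items():
                cand = best[a] * p
                if cand > best.get(b, 0.0):
                    best[b] = cand
                    improved.append(b)
        frontier = improved
    return best

def predict(trajectories, current_loc, current_time, query_time):
    """Return the candidate location with maximum reachability, or None."""
    support, window = supporting_trajectories(trajectories, query_time)
    graph = build_graph(support, current_time - window, query_time + window)
    reach = reachability(graph, current_loc)
    # Candidates: locations observed near the query time in the support set.
    candidates = {loc for t in support for ts, loc in t
                  if abs(ts - query_time) <= window}
    scored = {loc: reach.get(loc, 0.0) for loc in candidates}
    if not scored or max(scored.values()) == 0.0:
        return None
    return max(scored, key=scored.get)

# Hypothetical toy data: (hour of day, location) records per trajectory.
TRAJECTORIES = [
    [(8, "home"), (9, "cafe"), (12, "office")],
    [(8, "home"), (10, "cafe"), (13, "office")],
    [(8, "home"), (9, "gym"), (12.5, "mall")],
]

print(predict(TRAJECTORIES, "home", 8, 12))  # prints "office"
```

In the toy data, two of three trajectories leave home for the cafe and continue to the office, so "office" scores 2/3 against 1/3 for "mall". A faithful implementation would follow the paper's own definitions of TG and of the reachability probabilities.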



Citations: 18