
2011 IEEE 11th International Conference on Data Mining: Latest Papers

Characterizing Inverse Time Dependency in Multi-class Learning
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.32
Danqi Chen, Weizhu Chen, Qiang Yang
The training time of most learning algorithms increases as the size of the training data increases. Yet recent advances in linear binary SVM and LR challenge this common-sense expectation by proposing an inverse dependency property, where the training time decreases as the size of the training data increases. In this paper, we study the inverse dependency property of the multi-class classification problem. We describe a general framework for the multi-class classification problem with a single objective to achieve inverse dependency and extend it to three popular multi-class algorithms. We present theoretical results demonstrating its convergence and inverse dependency guarantee. We conduct experiments to empirically verify the inverse dependency of all three algorithms on large-scale datasets, as well as to verify their accuracy.
Citations: 2
Nonnegative Matrix Tri-factorization Based High-Order Co-clustering and Its Fast Implementation
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.109
Hua Wang, F. Nie, Heng Huang, C. Ding
The fast growth of the Internet and modern technologies has brought data involving objects of multiple types that are related to each other, called Multi-Type Relational data. Traditional clustering methods for single-type data rarely work well on them, which calls for new clustering techniques, called high-order co-clustering (HOCC), to deal with the multiple types of data at the same time. A major challenge in developing HOCC methods is how to make effective use of all available information contained in a multi-type relational data set, including both inter-type and intra-type relationships. Meanwhile, because many real-world data sets are large, clustering methods with computationally efficient solution algorithms are of great practical interest. In this paper, we first present a general HOCC framework, named Orthogonal Nonnegative Matrix Tri-factorization (O-NMTF), for simultaneous clustering of multi-type relational data. The proposed O-NMTF approach employs Nonnegative Matrix Tri-Factorization (NMTF) to simultaneously cluster different types of data using the inter-type relationships, and incorporates intra-type information through manifold regularization, where, different from existing works, we emphasize the importance of the orthogonalities of the factor matrices of NMTF. Based on O-NMTF, we further develop a novel Fast Nonnegative Matrix Tri-Factorization (F-NMTF) approach to deal with large-scale data. Instead of constraining the factor matrices of NMTF to be nonnegative as in existing methods, F-NMTF constrains them to be cluster indicator matrices, a special type of nonnegative matrix. As a result, the optimization problem of the proposed method can be decoupled, which results in subproblems of much smaller size requiring far fewer matrix multiplications, such that our new algorithm scales well to large real-world data. Extensive experimental evaluations have demonstrated the effectiveness of our new approaches.
Citations: 47
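The decoupling mentioned in the F-NMTF abstract above can be illustrated with a small sketch. It is not the authors' algorithm: it only assumes that, once the factor matrices are cluster indicator matrices (represented here as label lists), the middle factor reduces to per-block means and each row can be reassigned independently of the others. The function names block_means and reassign_rows, the toy matrix, and the omission of manifold regularization are all assumptions made for illustration.

```python
def block_means(X, row_labels, col_labels, k_rows, k_cols):
    """Mean of X over each (row cluster, column cluster) block.

    With indicator factor matrices, this is the least-squares choice
    of the middle factor S in X ~ F S G^T.
    """
    sums = [[0.0] * k_cols for _ in range(k_rows)]
    counts = [[0] * k_cols for _ in range(k_rows)]
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            sums[row_labels[i]][col_labels[j]] += value
            counts[row_labels[i]][col_labels[j]] += 1
    return [
        [sums[p][q] / counts[p][q] if counts[p][q] else 0.0 for q in range(k_cols)]
        for p in range(k_rows)
    ]


def reassign_rows(X, S, col_labels):
    """Assign each row to the row cluster with the smallest squared error.

    Every row is handled on its own, which is the decoupled subproblem.
    """
    new_labels = []
    for row in X:
        errors = [
            sum((value - S_p[col_labels[j]]) ** 2 for j, value in enumerate(row))
            for S_p in S
        ]
        new_labels.append(errors.index(min(errors)))
    return new_labels


# Hypothetical toy data: two row clusters and two column clusters.
X = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
col_labels = [0, 0, 1]
S = block_means(X, [0, 0, 1], col_labels, k_rows=2, k_cols=2)
print(S)
print(reassign_rows(X, S, col_labels))
```

Column labels could be updated the same way by applying the two functions to the transpose of X; alternating the two updates gives a k-means-like loop over both dimensions.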
Helix: Unsupervised Grammar Induction for Structured Activity Recognition
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.74
Huan-Kai Peng, Pang Wu, Jiang Zhu, J. Zhang
The omnipresence of mobile sensors has brought tremendous opportunities to ubiquitous computing systems. In many natural settings, however, their broader applications are hindered by three main challenges: rarity of labels, uncertainty of activity granularities, and the difficulty of multi-dimensional sensor fusion. In this paper, we propose building a grammar to address all these challenges using a language-based approach. The proposed algorithm, called Helix, first generates an initial vocabulary using unlabeled sensor readings, then iteratively combines statistically collocated sub-activities across sensor dimensions and groups similar activities together to discover higher-level activities. Experiments using a 20-minute ping-pong game demonstrate favorable results compared to a Hierarchical Hidden Markov Model (HHMM) baseline. Closer investigation of the learned grammar also shows that it captures the natural structure of the underlying activities.
Citations: 18
Mining Heavy Subgraphs in Time-Evolving Networks
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.101
Petko Bogdanov, M. Mongiovì, Ambuj K. Singh
Networks from different genres are not static entities, but exhibit dynamic behavior. The congestion level of links in transportation networks varies in time depending on the traffic. Similarly, social and communication links are employed at varying rates as information cascades unfold. In recent years there has been an increase of interest in modeling and mining dynamic networks. However, limited attention has been paid to high-scoring subgraph discovery in time-evolving networks. We define the problem of finding the highest-scoring temporal subgraph in a dynamic network, termed Heaviest Dynamic Subgraph (HDS). We show that HDS is NP-hard even with edge weights in {-1,1} and devise an efficient approach for large graph instances that evolve over long time periods. While a naive approach would enumerate all O(t^2) sub-intervals, our solution performs an effective pruning of the sub-interval space by considering O(t*log(t)) groups of sub-intervals and computing an aggregate of each group in logarithmic time. We also define a fast heuristic and a tight upper bound for approximating the static version of HDS, and use them for further pruning the sub-interval space and quickly verifying candidate sub-intervals. We perform an extensive experimental evaluation of our algorithm on transportation, communication and social media networks for discovering subgraphs that correspond to traffic congestion, communication overflow and localized social discussions. Our method is two orders of magnitude faster than a naive approach and scales well with network size and time length.
Citations: 118
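The naive O(t^2) baseline mentioned in the HDS abstract above can be sketched for the temporal dimension only. This is a minimal illustration under stated assumptions, not the authors' pruning method: the edge set is held fixed (the NP-hard subgraph search is left out), and the names interval_scores and best_interval_naive as well as the toy edge weights are hypothetical.

```python
from itertools import accumulate


def interval_scores(edge_series):
    """Sum the per-timestamp weights of a fixed set of edges."""
    return [sum(step) for step in zip(*edge_series.values())]


def best_interval_naive(scores):
    """Check all t*(t+1)/2 sub-intervals and return (score, start, end).

    Prefix sums make each sub-interval score an O(1) lookup, so the
    cost is dominated by the O(t^2) enumeration itself.
    """
    prefix = [0] + list(accumulate(scores))
    best = None
    for start in range(len(scores)):
        for end in range(start, len(scores)):
            total = prefix[end + 1] - prefix[start]
            if best is None or total > best[0]:
                best = (total, start, end)
    return best


# Hypothetical edge weights in {-1, 1} over five timestamps.
edges = {
    ("a", "b"): [1, -1, 1, 1, -1],
    ("b", "c"): [-1, -1, 1, 1, 1],
}
print(best_interval_naive(interval_scores(edges)))
```

Ties are resolved in favor of the earliest interval found, which is an arbitrary choice for this sketch.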
Tensor Fold-in Algorithms for Social Tagging Prediction
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.142
Miao Zhang, C. Ding, Zhifang Liao
Social tagging predictions involve the co-occurrence of users, items and tags. The tremendous growth in users requires the recommender system to produce tag recommendations for millions of users and items at any minute. The triplets of users, items and tags are most naturally described by a 3D tensor, and tensor decomposition-based algorithms can produce high-quality recommendations. However, each day, thousands of new users are added to the system and the decompositions must be updated daily in an online fashion. In this paper, we provide an analysis of the new user problem, and present fold-in algorithms for Tucker, ParaFac, and low-order tensor decompositions. We show that these algorithms can very efficiently compute the needed decompositions. We evaluate the fold-in algorithms experimentally on several datasets and the results demonstrate the effectiveness of these algorithms.
Citations: 7
LPTA: A Probabilistic Model for Latent Periodic Topic Analysis
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.96
Zhijun Yin, Liangliang Cao, Jiawei Han, ChengXiang Zhai, Thomas S. Huang
This paper studies the problem of latent periodic topic analysis from timestamped documents. Examples of timestamped documents include news articles, sales records, financial reports, TV programs, and more recently, posts from social media websites such as Flickr, Twitter, and Facebook. Different from detecting periodic patterns in traditional time series databases, we discover topics with coherent semantics and periodic characteristics, where a topic is represented by a distribution of words. We propose a model called LPTA (Latent Periodic Topic Analysis) that exploits the periodicity of the terms as well as term co-occurrences. To show the effectiveness of our model, we collect several representative datasets including Seminar, DBLP and Flickr. The results show that our model can discover the latent periodic topics effectively and leverage the information from both text and time well.
Citations: 28
Discovering the Intrinsic Cardinality and Dimensionality of Time Series Using MDL
Pub Date : 2011-12-11 DOI: 10.1007/978-3-642-44958-1_14
Bing Hu, T. Rakthanmanon, Yuan Hao, Scott Evans, S. Lonardi, Eamonn J. Keogh
Citations: 61
Multi-task Learning for Bayesian Matrix Factorization
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.107
Chao Yuan
Data sparsity is a big challenge for collaborative filtering. This problem becomes more serious if the dataset is newly created and has even fewer ratings. By sharing knowledge among different datasets, multi-task learning is a promising technique to address this issue. Most prior methods directly share objects (users or items) across different datasets. However, object identities and correspondences may not be known in many cases. We extend the previous work on Bayesian matrix factorization with a Dirichlet process mixture into a multi-task learning approach by sharing latent parameters among different tasks. Our method does not require object identities and thus is more widely applicable. The proposed model is fully non-parametric in that the dimension of latent feature vectors is automatically determined. Inference is performed using the variational Bayesian algorithm, which is much faster than the Gibbs sampling used by most other related Bayesian methods.
Citations: 9
Clustering with Attribute-Level Constraints
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.36
J. Schmidt, Elisabeth Maria Brändle, Stefan Kramer
In many clustering applications the incorporation of background knowledge in the form of constraints is desirable. In this paper, we introduce a new constraint type and the corresponding clustering problem: attribute-constrained clustering. The goal is to induce clusters of binary instances that satisfy constraints on the attribute level. These constraints specify whether instances may or may not be grouped into a cluster, depending on specific attribute values. We show how the well-established instance-level constraints, must-link and cannot-link, can be adapted to the attribute level. A variant of the k-Medoids algorithm that takes attribute-level constraints into account is evaluated on synthetic and real-world data. Experimental results show that such constraints may provide better clustering results at lower specification costs if constraints can be expressed on the attribute level.
Citations: 18
Finding Robust Itemsets under Subsampling
Pub Date : 2011-12-11 DOI: 10.1145/2656261
Nikolaj Tatti, Fabian Moerchen
Mining frequent patterns is plagued by the problem of pattern explosion, making pattern reduction techniques a key challenge in pattern mining. In this paper we propose a novel theoretical framework for pattern reduction. We do this by measuring the robustness of a property of an itemset, such as closedness or non-derivability. The robustness of a property is the probability that this property holds on random subsets of the original data. We study four properties: closed, free, non-derivable and totally shattered itemsets, demonstrating how we can compute the robustness analytically without actually sampling the data. Our concept of robustness has many advantages: unlike statistical approaches for reducing patterns, we do not assume a null hypothesis or any noise model, and the patterns reported are simply a subset of all patterns with this property, as opposed to approximate patterns for which the property does not really hold. If the underlying property is monotonic, then the measure is also monotonic, allowing us to efficiently mine robust itemsets. We further derive a parameter-free technique for ranking itemsets that can be used for top-k approaches. Our experiments demonstrate that we can successfully use the robustness measure to reduce the number of patterns and that ranking yields interesting itemsets.
Citations: 17