
2008 Eighth IEEE International Conference on Data Mining: Latest Publications

Discovering Flow Anomalies: A SWEET Approach
Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.117
James M. Kang, S. Shekhar, Christine Wennen, P. Novak
Given a percentage-threshold and readings from a pair of consecutive upstream and downstream sensors, flow anomaly discovery identifies dominant time intervals where the fraction of time instants of significantly mis-matched sensor readings exceeds the given percentage-threshold. Discovering flow anomalies (FA) is an important problem in environmental flow monitoring networks and early warning detection systems for water quality problems. However, mining FAs is computationally expensive because of the large (potentially infinite) number of time instants of measurement and potentially long delays due to stagnant (e.g. lakes) or slow moving (e.g. wetland) water bodies between consecutive sensors. Traditional outlier detection methods (e.g. t-test) are suited for detecting transient FAs (i.e., time instants of significant mis-matches across consecutive sensors) and cannot detect persistent FAs (i.e., long variable time-windows with a high fraction of transient FAs) due to the lack of a pre-defined window size. In contrast, we propose a Smart Window Enumeration and Evaluation of persistence-Thresholds (SWEET) method to efficiently explore the search space of all possible window lengths. Computation overhead is brought down significantly by restricting the start and end points of a window to coincide with transient FAs, using a smart counter and efficient pruning techniques. Experimental evaluation using a real dataset shows our proposed approach outperforms naïve alternatives.
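As a rough illustration of the windowing idea, the sketch below flags transient FAs with a simple z-score rule (standing in for the paper's statistical test), restricts candidate window endpoints to those transient FAs, and keeps windows whose transient-FA fraction exceeds the percentage-threshold. The smart counter and pruning techniques are replaced by a prefix sum and a plain double loop, and all thresholds and names are illustrative assumptions, not the paper's.

```python
import numpy as np

def transient_fas(upstream, downstream, z_thresh=2.0):
    # Flag time instants whose upstream/downstream mismatch is large
    # (a simple z-score rule stands in for the paper's statistical test).
    diff = np.asarray(upstream, dtype=float) - np.asarray(downstream, dtype=float)
    z = (diff - diff.mean()) / (diff.std() + 1e-12)
    return np.abs(z) > z_thresh

def persistent_fas(upstream, downstream, theta=0.6, z_thresh=2.0):
    # Enumerate windows whose start/end coincide with transient FAs and keep
    # those whose fraction of transient-FA instants exceeds theta.  A prefix
    # sum stands in for the paper's smart counter; no pruning is applied.
    flags = transient_fas(upstream, downstream, z_thresh)
    idx = np.flatnonzero(flags)                       # candidate window endpoints
    prefix = np.concatenate(([0], np.cumsum(flags)))  # running count of transient FAs
    windows = []
    for a in range(len(idx)):
        for b in range(a, len(idx)):
            s, e = idx[a], idx[b]
            frac = (prefix[e + 1] - prefix[s]) / (e - s + 1)
            if frac > theta:
                windows.append((int(s), int(e), frac))
    return windows

# Toy usage: a burst of mismatches between instants 10 and 15.
up = np.random.normal(5.0, 0.1, 50)
down = up.copy()
down[10:15] += 3.0
print(persistent_fas(up, down, theta=0.8))
```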
Citations: 22
Multiplicative Mixture Models for Overlapping Clustering
Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.103
Qiang Fu, A. Banerjee
The problem of overlapping clustering, where a point is allowed to belong to multiple clusters, is becoming increasingly important in a variety of applications. In this paper, we present an overlapping clustering algorithm based on multiplicative mixture models. We analyze a general setting where each component of the multiplicative mixture is from an exponential family, and present an efficient alternating maximization algorithm to learn the model and infer overlapping clusters. We also show that when each component is assumed to be a Gaussian, we can apply the kernel trick leading to non-linear cluster separators and obtain better clustering quality. The efficacy of the proposed algorithms is demonstrated using experiments on both UCI benchmark datasets and a microarray gene expression dataset.
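The following toy sketch illustrates the multiplicative-mixture idea for the Gaussian case only: a point with membership vector z is modelled by the normalized product of its active components, which for a shared variance sigma2 is a Gaussian centred at the mean of the active centres with variance sigma2/|z|. Memberships are found by brute-force subset enumeration and centres by weighted least squares; the paper's algorithm handles general exponential families and scales differently, so treat this purely as an illustration under those assumptions.

```python
import itertools
import numpy as np

def overlapping_gauss_mixture(X, k=2, sigma2=1.0, n_iter=20, seed=0):
    # A point with membership vector z is modelled by the normalized product of
    # its active Gaussian components, i.e. N(mean of active centres, sigma2/|z|).
    # Memberships are chosen by enumerating the 2^k - 1 non-empty subsets
    # (feasible only for small k); centres are refit by weighted least squares.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centres = X[rng.choice(n, k, replace=False)].astype(float)
    subsets = [np.array(c, dtype=float)
               for c in itertools.product([0, 1], repeat=k) if any(c)]
    Z = np.zeros((n, k))
    for _ in range(n_iter):
        # membership step: pick the subset with the highest log-likelihood
        for i in range(n):
            lls = []
            for z in subsets:
                m = z.sum()
                mu = (z @ centres) / m
                lls.append(0.5 * d * np.log(m)
                           - 0.5 * (m / sigma2) * np.sum((X[i] - mu) ** 2))
            Z[i] = subsets[int(np.argmax(lls))]
        # parameter step: weighted least squares for the centres
        m = Z.sum(axis=1)
        A = Z / m[:, None]
        sw = np.sqrt(m)[:, None]
        centres, *_ = np.linalg.lstsq(A * sw, X * sw, rcond=None)
    return Z.astype(bool), centres

# Toy usage: two blobs plus points lying between them (likely to get both labels).
X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + [4, 0],
               np.random.randn(10, 2) + [2, 0]])
Z, C = overlapping_gauss_mixture(X, k=2)
print(Z.sum(axis=0), C)
```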
Citations: 49
A Recommendation System for Preconditioned Iterative Solvers
Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.105
Thomas George, Anshul Gupta, V. Sarin
Preconditioned iterative methods are often used to solve the very large sparse systems of linear equations that arise in many scientific and engineering applications. The performance and robustness of these solvers are extremely sensitive to the choice of multiple preconditioner and solver parameters. Users of iterative methods often encounter an overwhelming number of combinations of choices for solvers, matrix preprocessing steps, preconditioners, and their parameters. The lack of a unified theoretical analysis of preconditioners, coupled with limited knowledge of their interaction with linear systems, makes it highly challenging for practitioners to choose good solver configurations. In this paper, we propose a novel, multi-stage learning based methodology for determining the best solver configurations to optimize the desired performance behavior for any given linear system. Empirical results over real performance data for the hyper iterative solver package demonstrate the efficacy and flexibility of the proposed approach.
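A hedged sketch of the general recipe only (not the paper's multi-stage method): extract cheap structural features from each matrix and train an off-the-shelf classifier to map those features to the configuration that performed best in offline timing runs. The features, the configuration labels, and the random-forest stand-in are all illustrative assumptions; the training labels below are random placeholders, not real measurements.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier

def matrix_features(A):
    # A few cheap structural features; stand-ins for the paper's feature set.
    A = A.tocsr()
    n = A.shape[0]
    diag = np.abs(A.diagonal())
    row_abs_sum = abs(A).sum(axis=1).A1
    return np.array([
        n,
        A.nnz / n,                                   # average nonzeros per row
        float(abs(A - A.T).nnz == 0),                # exactly symmetric or not
        float(np.mean(diag >= row_abs_sum - diag)),  # fraction of diagonally dominant rows
    ])

# Hypothetical training data: matrices whose best configuration (preconditioner
# + solver + parameters) is assumed known from offline timing runs; the labels
# here are random placeholders for illustration only.
train = [sp.random(200, 200, density=0.05, format="csr") + 10 * sp.eye(200)
         for _ in range(20)]
best_config = np.random.choice(["ilu+gmres", "jacobi+cg", "amg+cg"], size=20)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(np.array([matrix_features(A) for A in train]), best_config)

# Recommend a configuration for a new linear system.
A_new = sp.random(300, 300, density=0.03, format="csr") + 10 * sp.eye(300)
print(model.predict([matrix_features(A_new)]))
```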
Citations: 10
Evolutionary Clustering by Hierarchical Dirichlet Process with Hidden Markov State
Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.24
Tianbing Xu, Zhongfei Zhang, Philip S. Yu, Bo Long
This paper studies evolutionary clustering, which has recently become a hot topic with many important applications, notably in social network analysis. Based on the recent literature on the Hierarchical Dirichlet Process (HDP) and the Hidden Markov Model (HMM), we have developed a statistical model, HDP-HTM, that combines HDP with a Hierarchical Transition Matrix (HTM) based on the proposed Infinite Hierarchical Hidden Markov State model (iH2MS) as an effective solution to this problem. The HDP-HTM model substantially advances the literature on evolutionary clustering in the sense that not only does it perform better than the existing literature, but, more importantly, it is capable of automatically learning the cluster numbers and structures while explicitly addressing the correspondence issue during the evolution. Extensive evaluations have demonstrated the effectiveness and promise of this solution against the state-of-the-art literature.
Citations: 66
Temporal-Relational Classifiers for Prediction in Evolving Domains
Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.125
U. Sharan, Jennifer Neville
Many relational domains contain temporal information and dynamics that are important to model (e.g., social networks, protein networks). However, past work in relational learning has focused primarily on modeling static "snapshots" of the data and has largely ignored the temporal dimension of these data. In this work, we extend relational techniques to temporally-evolving domains and outline a representational framework that is capable of modeling both temporal and relational dependencies in the data. We develop efficient learning and inference techniques within the framework by considering a restricted set of temporal-relational dependencies and using parameter-tying methods to generalize across relationships and entities. More specifically, we model dynamic relational data with a two-phase process, first summarizing the temporal-relational information with kernel smoothing, and then moderating attribute dependencies with the summarized relational information. We develop a number of novel temporal-relational models using the framework and then show that the current approaches to modeling static relational data are special cases within the framework. We compare the new models to the competing static relational methods on three real-world datasets and show that the temporal-relational models consistently outperform the relational models that ignore temporal information - achieving significant reductions in error ranging from 15% to 70%.
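A minimal sketch of the summarization phase only, under simplifying assumptions: the fraction of positive neighbours of each node is computed per snapshot and combined with an exponential kernel so that recent snapshots weigh more, and the smoothed feature feeds an ordinary classifier. The feature choice, kernel, and toy graph generator are illustrative, not the paper's full temporal-relational framework.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def smoothed_relational_features(snapshots, node_labels, decay=0.5):
    # Summarize each node's neighbourhood over time with an exponential kernel:
    # the fraction of positive neighbours in each snapshot is weighted so that
    # recent snapshots count more (a stand-in for the paper's summarization phase).
    n = snapshots[0].shape[0]
    T = len(snapshots)
    weights = np.array([decay ** (T - 1 - t) for t in range(T)])
    feats = np.zeros(n)
    for t, A in enumerate(snapshots):
        deg = A.sum(axis=1)
        frac_pos = np.divide(A @ node_labels, deg, out=np.zeros(n), where=deg > 0)
        feats += weights[t] * frac_pos
    return (feats / weights.sum()).reshape(-1, 1)

# Toy usage: 3 snapshots of a 100-node graph whose links favour same-label pairs.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)
snaps = [(rng.random((100, 100)) < 0.05 + 0.1 * (y[:, None] == y[None, :])).astype(float)
         for _ in range(3)]
X = smoothed_relational_features(snaps, y)
clf = LogisticRegression().fit(X[:70], y[:70])
print(clf.score(X[70:], y[70:]))
```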
Citations: 107
A Robust Discriminative Term Weighting Based Linear Discriminant Method for Text Classification
Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.26
K. N. Junejo, Asim Karim
Text classification is widely used in applications ranging from e-mail filtering to review classification. Many of these applications demand that the classification method be efficient and robust, yet produce accurate categorizations by using the terms in the documents only. We present a supervised text classification method based on discriminative term weighting, discrimination information pooling, and linear discrimination. Terms in the documents are assigned weights according to the discrimination information they provide for one category over the others. These weights also serve to partition the terms into two sets. A linear opinion pool is adopted for combining the discrimination information provided by each set of terms yielding a two-dimensional feature space. Subsequently, a linear discriminant function is learned to categorize the documents in the feature space. We provide intuitive and empirical evidence of the robustness of our method with three term weighting strategies. Experimental results are presented for data sets from three different application areas. The results show that our method's accuracy is higher than other popular methods, especially when there is a distribution shift from training to testing sets. Moreover, our method is simple yet robust to different application domains and small training set sizes.
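The sketch below illustrates the two-set, two-dimensional construction with one possible discriminative weighting (a smoothed class-frequency ratio); the paper studies three weighting strategies, so the specific weights, pooling choices, and toy data here are assumptions rather than the authors' exact scheme.

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

def term_weights(docs, labels, smooth=1.0):
    # Weight each term by the smoothed ratio of its relative frequency in the
    # positive class to that in the negative class; ratio > 1 puts the term in
    # the "positive" set, ratio < 1 in the "negative" set.
    pos = Counter(w for d, y in zip(docs, labels) if y == 1 for w in d.split())
    neg = Counter(w for d, y in zip(docs, labels) if y == 0 for w in d.split())
    n_pos, n_neg = sum(pos.values()) + smooth, sum(neg.values()) + smooth
    vocab = set(pos) | set(neg)
    return {w: ((pos[w] + smooth) / n_pos) / ((neg[w] + smooth) / n_neg) for w in vocab}

def two_dim_features(doc, weights):
    # Linear opinion pool: average the log-weights of the document's terms,
    # separately over the positive-set and negative-set terms.
    pos_scores = [np.log(weights[w]) for w in doc.split() if weights.get(w, 1.0) > 1.0]
    neg_scores = [-np.log(weights[w]) for w in doc.split() if weights.get(w, 1.0) < 1.0]
    return [np.mean(pos_scores) if pos_scores else 0.0,
            np.mean(neg_scores) if neg_scores else 0.0]

# Toy usage: the two-dimensional features feed a linear discriminant.
docs = ["good great film", "great plot good acting", "bad awful film", "awful boring plot"]
labels = [1, 1, 0, 0]
w = term_weights(docs, labels)
X = np.array([two_dim_features(d, w) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```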
Citations: 22
Semi-supervised Learning from General Unlabeled Data
Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.61
Kaizhu Huang, Zenglin Xu, Irwin King, Michael R. Lyu
We consider the problem of semi-supervised learning (SSL) from general unlabeled data, which may contain irrelevant samples. Within the binary setting, our model manages to better utilize the information from unlabeled data by formulating them as a three-class (-1, +1, 0) mixture, where class 0 represents the irrelevant data. This distinguishes our work from the traditional SSL problem, where unlabeled data are assumed to contain relevant samples only, either +1 or -1, which are forced to be the same as the given labeled samples. This work is also different from another family of popular models, universum learning (universum means "irrelevant" data), in that the universum need not be specified beforehand. One significant contribution of our proposed framework is that such irrelevant samples can be automatically detected from the available unlabeled data, even though they are mixed with relevant data. This hence presents a general SSL framework that does not force "clean" unlabeled data. More importantly, we formulate this general learning framework as a Semi-definite Programming problem, making it solvable in polynomial time. A series of experiments demonstrate that the proposed framework can outperform traditional SSL on both synthetic and real data.
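The paper solves a semi-definite program; purely to illustrate why an explicit class 0 for irrelevant samples helps, the toy below runs EM on a three-component Gaussian mixture in which two components are anchored by the labeled data and a broad background component absorbs unlabeled points that fit neither class. Every modelling choice here (fixed covariances, fixed priors, Gaussian components) is an assumption of this sketch and not the paper's formulation.

```python
import numpy as np

def three_class_em(X_lab, y_lab, X_unl, n_iter=30):
    # Two Gaussian class components (labels -1 and +1) anchored by the labeled
    # points plus a broad "irrelevant" background component (class 0) that can
    # absorb unlabeled samples fitting neither class.  Covariances and priors
    # are kept fixed to keep the toy short.
    d = X_lab.shape[1]
    means = {c: X_lab[y_lab == c].mean(axis=0) for c in (-1, 1)}
    means[0] = np.vstack([X_lab, X_unl]).mean(axis=0)
    covs = {-1: np.eye(d), 1: np.eye(d), 0: 25.0 * np.eye(d)}
    priors = {-1: 0.4, 1: 0.4, 0: 0.2}

    def logpdf(X, mu, cov):
        diff = X - mu
        inv, logdet = np.linalg.inv(cov), np.linalg.slogdet(cov)[1]
        return -0.5 * (np.einsum("ij,jk,ik->i", diff, inv, diff)
                       + logdet + d * np.log(2 * np.pi))

    for _ in range(n_iter):
        # E-step on the unlabeled data only
        ll = np.column_stack([np.log(priors[c]) + logpdf(X_unl, means[c], covs[c])
                              for c in (-1, 1, 0)])
        resp = np.exp(ll - ll.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: class means use labeled points (weight 1) plus soft unlabeled weights
        for k, c in enumerate((-1, 1)):
            Xc = np.vstack([X_lab[y_lab == c], X_unl])
            wc = np.concatenate([np.ones((y_lab == c).sum()), resp[:, k]])
            means[c] = (wc[:, None] * Xc).sum(axis=0) / wc.sum()
        means[0] = (resp[:, 2][:, None] * X_unl).sum(axis=0) / max(resp[:, 2].sum(), 1e-9)
    return resp    # columns: P(class -1), P(class +1), P(irrelevant)

# Toy usage: two labeled blobs; the unlabeled pool also contains far-away junk.
rng = np.random.default_rng(1)
X_lab = np.vstack([rng.normal(-3, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
y_lab = np.array([-1] * 10 + [1] * 10)
X_unl = np.vstack([rng.normal(-3, 1, (30, 2)), rng.normal(3, 1, (30, 2)),
                   rng.normal(0, 10, (20, 2))])     # last 20 are irrelevant
resp = three_class_em(X_lab, y_lab, X_unl)
print(resp.argmax(axis=1)[-20:])                    # 2 marks the irrelevant class
```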
Citations: 24
Latent Dirichlet Allocation and Singular Value Decomposition Based Multi-document Summarization
Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.55
Rachit Arora, Balaraman Ravindran
Multi-Document Summarization deals with computing a summary for a set of related articles such that they give the user a general view about the events. One of the objectives is that the sentences should cover the different events in the documents with the information covered in as few sentences as possible. Latent Dirichlet Allocation can break down these documents into different topics or events. However, to reduce the common information content, the sentences of the summary need to be orthogonal to each other, since orthogonal vectors have the lowest possible similarity and correlation between them. Using Singular Value Decomposition to obtain orthogonal representations and representing sentences as vectors, we can find the sentences that are orthogonal to each other in the LDA mixture model weighted term domain. Thus, using LDA we find the different topics in the documents, and using SVD we find the sentences that best represent these topics. Finally, we present the evaluation of the algorithms on the DUC 2002 Corpus multi-document summarization tasks using the ROUGE evaluator to evaluate the summaries. Compared to the DUC 2002 winners, our algorithms gave significantly better ROUGE-1 recall measures.
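A loose sketch of the pipeline, assuming scikit-learn's LDA implementation: topics are learned from sentence term counts, sentence vectors are re-weighted by their dominant topic's term distribution, and the SVD's leading right-singular directions serve as near-orthogonal axes from which one representative sentence each is picked. The weighting and selection details are simplifications, not the paper's exact procedure.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_svd_summary(sentences, n_topics=2, n_sentences=2):
    # LDA uncovers topics; sentence term counts are re-weighted by the term
    # distribution of each sentence's dominant topic; the SVD's leading
    # right-singular directions act as near-orthogonal topic axes, and the
    # sentence projecting most strongly on each axis is selected.
    counts = CountVectorizer(stop_words="english").fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    topic_of_sent = lda.transform(counts).argmax(axis=1)
    term_weights = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    weighted = counts.toarray() * term_weights[topic_of_sent]
    _, _, vt = np.linalg.svd(weighted, full_matrices=False)
    chosen = []
    for direction in vt[:n_sentences]:
        scores = np.abs(weighted @ direction)
        order = np.argsort(-scores)
        chosen.append(next(i for i in order if i not in chosen))
    return [sentences[i] for i in sorted(chosen)]

# Toy usage: two events, two sentences each; the summary should pick one per event.
docs = ["The flood damaged hundreds of homes along the river.",
        "Rescue teams evacuated residents from the flooded river district.",
        "The election results were announced late on Sunday night.",
        "Observers praised the peaceful conduct of the election."]
print(lda_svd_summary(docs))
```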
Citations: 58
Collaborative Filtering for Implicit Feedback Datasets
Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.22
Yifan Hu, Y. Koren, C. Volinsky
A common task of recommender systems is to improve customer experience through personalized recommendations based on prior implicit feedback. These systems passively track different sorts of user behavior, such as purchase history, watching habits and browsing activity, in order to model user preferences. Unlike the much more extensively researched explicit feedback, we do not have any direct input from the users regarding their preferences. In particular, we lack substantial evidence on which products consumers dislike. In this work we identify unique properties of implicit feedback datasets. We propose treating the data as indications of positive and negative preference associated with vastly varying confidence levels. This leads to a factor model which is especially tailored for implicit feedback recommenders. We also suggest a scalable optimization procedure, which scales linearly with the data size. The algorithm is used successfully within a recommender system for television shows. It compares favorably with well-tuned implementations of other known methods. In addition, we offer a novel way to give explanations to recommendations given by this factor model.
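A compact sketch of the alternating-least-squares updates the abstract describes, with preferences p_ui = 1[r_ui > 0] and confidences c_ui = 1 + alpha * r_ui. The dense linear algebra is for clarity only; the paper's scalable procedure exploits sparsity in each per-user and per-item solve, and all parameter values below are illustrative.

```python
import numpy as np

def implicit_als(R, factors=8, alpha=40.0, reg=0.1, n_iter=10, seed=0):
    # Observations r_ui become binary preferences p_ui = 1[r_ui > 0] with
    # confidences c_ui = 1 + alpha * r_ui; user and item factors are solved
    # alternately from the regularized weighted least-squares normal equations.
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = 0.1 * rng.standard_normal((n_users, factors))
    Y = 0.1 * rng.standard_normal((n_items, factors))
    P = (R > 0).astype(float)
    C = 1.0 + alpha * R
    I = reg * np.eye(factors)

    for _ in range(n_iter):
        YtY = Y.T @ Y
        for u in range(n_users):
            Cu = C[u]
            A = YtY + (Y.T * (Cu - 1.0)) @ Y + I   # Y^T C^u Y + reg * I
            b = Y.T @ (Cu * P[u])                  # Y^T C^u p(u)
            X[u] = np.linalg.solve(A, b)
        XtX = X.T @ X
        for i in range(n_items):
            Ci = C[:, i]
            A = XtX + (X.T * (Ci - 1.0)) @ X + I
            b = X.T @ (Ci * P[:, i])
            Y[i] = np.linalg.solve(A, b)
    return X, Y

# Toy usage: 20 users x 15 items with sparse implicit counts (e.g. watch counts).
rng = np.random.default_rng(1)
R = rng.poisson(0.3, (20, 15)).astype(float)
X, Y = implicit_als(R)
print((X @ Y.T)[0].round(2))                       # predicted preferences for user 0
```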
Citations: 3068
Unsupervised Face Annotation by Mining the Web
Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.47
Duy-Dinh Le, S. Satoh
Searching for images of people is an essential task for image and video search engines. However, current search engines have limited capabilities for this task since they rely on text associated with images and video, and such text is likely to return many irrelevant results. We propose a method for retrieving relevant faces of one person by learning the visual consistency among results retrieved from text correlation-based search engines. The method consists of two steps. In the first step, each candidate face obtained from a text-based search engine is ranked with a score that measures the distribution of visual similarities among the faces. Faces that are possibly very relevant or irrelevant are ranked at the top or bottom of the list, respectively. The second step improves this ranking by treating the problem as a classification problem in which input faces are classified as 'person-X' or 'non-person-X'; the faces are then re-ranked according to their relevance score inferred from the classifier's probability output. To train this classifier, we use a bagging-based framework to combine results from multiple weak classifiers trained using different subsets. These training subsets are extracted and labeled automatically from the ranked list produced by the classifier trained in the previous step. In this way, the accuracy of the ranked list increases after a number of iterations. Experimental results on various face sets retrieved from captions of news photos show that the retrieval performance improved after each iteration, with the final performance being higher than that of the existing algorithms.
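A small sketch of the two-step idea under stand-in assumptions: generic feature vectors play the role of face descriptors, the first step scores each candidate by its total similarity to the rest of the result set, and the second step repeatedly trains a bagging classifier on the current top and bottom of the ranking and re-ranks by its probability output. Seed sizes and round counts are arbitrary choices, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics.pairwise import cosine_similarity

def rerank_faces(features, n_seed=10, n_rounds=3):
    # Step 1: score each candidate face by its total similarity to the rest of
    # the result set, so the visually consistent majority rises to the top.
    sims = cosine_similarity(features)
    scores = sims.sum(axis=1)
    order = np.argsort(-scores)
    # Step 2: repeatedly train a bagging classifier on the current top
    # (pseudo 'person-X') and bottom (pseudo 'non-person-X') of the ranking
    # and re-rank everything by its probability output.
    for _ in range(n_rounds):
        pos, neg = order[:n_seed], order[-n_seed:]
        X = np.vstack([features[pos], features[neg]])
        y = np.concatenate([np.ones(n_seed), np.zeros(n_seed)])
        clf = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)
        scores = clf.predict_proba(features)[:, 1]
        order = np.argsort(-scores)
    return order, scores

# Toy usage: 40 'relevant' vectors around one prototype plus 20 scattered ones.
rng = np.random.default_rng(2)
feats = np.vstack([rng.normal(1.0, 0.3, (40, 16)), rng.normal(0.0, 1.0, (20, 16))])
order, _ = rerank_faces(feats)
print((order[:40] < 40).mean())                    # fraction of relevant faces in the top 40
```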
Citations: 35