
Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining: Latest Articles

Binary Classifier Calibration Using an Ensemble of Linear Trend Estimation.
Mahdi Pakdaman Naeini, Gregory F Cooper

Learning accurate probabilistic models from data is crucial in many practical tasks in data mining. In this paper we present a new non-parametric calibration method called ensemble of linear trend estimation (ELiTE). ELiTE utilizes the recently proposed ℓ1 trend filtering signal approximation method [22] to find the mapping from uncalibrated classification scores to the calibrated probability estimates. ELiTE is designed to address the key limitations of the histogram binning-based calibration methods, which are (1) the use of a piecewise constant form of the calibration mapping using bins, and (2) the assumption of independence of predicted probabilities for the instances that are located in different bins. The method post-processes the output of a binary classifier to obtain calibrated probabilities. Thus, it can be applied with many existing classification models. We demonstrate the performance of ELiTE on real datasets for commonly used binary classification models. Experimental results show that the method outperforms several common binary-classifier calibration methods. In particular, ELiTE commonly performs statistically significantly better than the other methods, and never worse. Moreover, it is able to improve the calibration power of classifiers, while retaining their discrimination power. The method is also computationally tractable for large-scale datasets, as it is practically O(N log N) time, where N is the number of samples.

Volume 2016, pages 261-269. Published 2016-05-01. DOI: https://doi.org/10.1137/1.9781611974348.30
Citations: 5
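For reference, histogram binning, the baseline whose piecewise-constant mapping the ELiTE abstract sets out to improve on, can be sketched in a few lines. This is a minimal illustration and not the authors' method: the function names `fit_histogram_binning` and `calibrate`, the equal-width bins, the midpoint fallback for empty bins, and the toy data are all assumptions made for the example.

```python
from bisect import bisect_right


def fit_histogram_binning(scores, labels, n_bins=10):
    """Estimate one calibrated probability per equal-width bin on [0, 1]."""
    edges = [i / n_bins for i in range(1, n_bins)]  # interior bin edges
    positives = [0] * n_bins
    totals = [0] * n_bins
    for score, label in zip(scores, labels):
        b = bisect_right(edges, score)  # index of the bin holding this score
        totals[b] += 1
        positives[b] += label
    # Empty bins fall back to the bin midpoint (an arbitrary choice here).
    rates = [
        positives[b] / totals[b] if totals[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]
    return edges, rates


def calibrate(score, edges, rates):
    """Map an uncalibrated score to the positive rate of its bin."""
    return rates[bisect_right(edges, score)]


# Toy data (hypothetical): eight scores with binary labels, two bins.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1, 0, 1]
edges, rates = fit_histogram_binning(scores, labels, n_bins=2)
```

Every score inside a bin receives the same output, so the mapping jumps at the bin edges; this is the piecewise-constant behaviour named as limitation (1) in the abstract.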
DPClass: An Effective but Concise Discriminative Patterns-Based Classification Framework
Jingbo Shang, Wenzhu Tong, Jian Peng, Jiawei Han
Pattern-based classification was originally proposed to improve accuracy using selected frequent patterns, and much effort went into pruning the huge number of non-discriminative frequent patterns. On the other hand, tree-based models have shown strong abilities on many classification tasks, since they can easily build high-order interactions between different features and also handle numerical, categorical, and high-dimensional features. By taking advantage of both modeling methodologies, we propose a natural and effective way to resolve pattern-based classification by adopting discriminative patterns, which are the prefix paths from root to nodes in tree-based models (e.g., random forest). Moreover, we further compress the number of discriminative patterns by selecting the most effective pattern combinations that fit into a generalized linear model. As a result, our discriminative pattern-based classification framework (DPClass) could perform as well as previous state-of-the-art algorithms, provide great interpretability by utilizing only a very limited number of discriminative patterns, and predict new data extremely fast. More specifically, in our experiments, DPClass could gain even better accuracy by using only the top-20 discriminative patterns. The framework so generated is very concise and highly interpretable for human experts.
Pages 567-575. Published 2016-01-01. DOI: https://doi.org/10.1137/1.9781611974348.64
Citations: 7
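The DPClass abstract defines a discriminative pattern as a prefix path from the root to a node of a tree. The sketch below enumerates such paths on a toy tree and turns them into binary features. It is only an illustration under stated assumptions: the nested-dict tree format (`feature`, `threshold`, `left`, `right`), the helper names `prefix_paths` and `matches`, and the toy tree are hypothetical, and the pattern-selection and generalized-linear-model steps from the abstract are omitted.

```python
def prefix_paths(node, path=()):
    """Yield the sequence of split conditions from the root to every non-root node."""
    if path:
        yield path
    if "feature" not in node:  # leaf: nothing further to split on
        return
    feature, threshold = node["feature"], node["threshold"]
    yield from prefix_paths(node["left"], path + ((feature, "<=", threshold),))
    yield from prefix_paths(node["right"], path + ((feature, ">", threshold),))


def matches(pattern, x):
    """Return True if instance x (a dict) satisfies every condition in the pattern."""
    return all(
        x[feature] <= threshold if op == "<=" else x[feature] > threshold
        for feature, op, threshold in pattern
    )


# Toy tree (hypothetical): split on age, then on income.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": 0},
    "right": {
        "feature": "income", "threshold": 50,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}

patterns = list(prefix_paths(tree))
# Each pattern becomes one binary feature for a downstream linear model.
features = [matches(p, {"age": 40, "income": 60}) for p in patterns]
```

Because each feature is a readable conjunction of threshold tests, a linear model over a few of them stays easy to inspect, which is in line with the interpretability claim in the abstract.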
Tensor Spectral Clustering for Partitioning Higher-order Network Structures
Austin R. Benson, D. Gleich, J. Leskovec
Spectral graph theory-based methods represent an important class of tools for studying the structure of networks. Spectral methods are based on a first-order Markov chain derived from a random walk on the graph and thus they cannot take advantage of important higher-order network substructures such as triangles, cycles, and feed-forward loops. Here we propose a Tensor Spectral Clustering (TSC) algorithm that allows for modeling higher-order network structures in a graph partitioning framework. Our TSC algorithm allows the user to specify which higher-order network structures (cycles, feed-forward loops, etc.) should be preserved by the network clustering. Higher-order network structures of interest are represented using a tensor, which we then partition by developing a multilinear spectral method. Our framework can be applied to discovering layered flows in networks as well as graph anomaly detection, which we illustrate on synthetic networks. In directed networks, a higher-order structure of particular interest is the directed 3-cycle, which captures feedback loops in networks. We demonstrate that our TSC algorithm produces large partitions that cut fewer directed 3-cycles than standard spectral clustering algorithms.
Pages 118-126. Published 2015-02-17. DOI: https://doi.org/10.1137/1.9781611974010.14
Citations: 108
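The quantity the TSC abstract evaluates, the number of directed 3-cycles cut by a partition, is easy to compute directly on a small graph. The sketch below only counts cut cycles for a given partition; it does not implement the tensor spectral method itself. The function names `directed_3_cycles` and `cut_cycles` and the toy graph are assumptions for illustration.

```python
def directed_3_cycles(edges):
    """Return every directed 3-cycle u -> v -> w -> u once, anchored at its smallest node."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
    cycles = set()
    for u, outs in successors.items():
        for v in outs:
            for w in successors.get(v, ()):
                # Requiring u to be the smallest node avoids counting rotations twice.
                if u < v and u < w and v != w and u in successors.get(w, ()):
                    cycles.add((u, v, w))
    return cycles


def cut_cycles(cycles, part):
    """Count cycles whose nodes do not all lie in the same block of the partition."""
    return sum(1 for cycle in cycles if len({part[n] for n in cycle}) > 1)


# Toy graph (hypothetical): two directed triangles joined by the edge 2 -> 3.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
cycles = directed_3_cycles(edges)

good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # keeps both triangles intact
bad = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}   # splits the first triangle
```

Both partitions cut few edges, but only the first preserves every feedback loop, which is the distinction an edge-based objective cannot express.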
Binary Classifier Calibration Using a Bayesian Non-Parametric Approach
Mahdi Pakdaman Naeini, G. Cooper, M. Hauskrecht
Learning probabilistic predictive models that are well calibrated is critical for many prediction and decision-making tasks in data mining. This paper presents two new non-parametric methods for calibrating outputs of binary classification models: a method based on the Bayes optimal selection and a method based on the Bayesian model averaging. The advantage of these methods is that they are independent of the algorithm used to learn a predictive model, and they can be applied in a post-processing step, after the model is learned. This makes them applicable to a wide variety of machine learning models and methods. These calibration methods, as well as other methods, are tested on a variety of datasets in terms of both discrimination and calibration performance. The results show that the methods either outperform or are comparable in performance to the state-of-the-art calibration methods.
Pages 208-216. Published 2015-01-01. DOI: https://doi.org/10.1137/1.9781611974010.24
Citations: 20
Graph Regularized Meta-path Based Transductive Regression in Heterogeneous Information Network
Mengting Wan, Yunbo Ouyang, Lance M. Kaplan, Jiawei Han
A number of real-world networks are heterogeneous information networks, which are composed of different types of nodes and links. Numerical prediction in heterogeneous information networks is a challenging but significant area because network-based information for unlabeled objects is usually too limited to make precise estimates. In this paper, we consider a graph regularized meta-path based transductive regression model (Grempt), which combines the principal philosophies of typical graph-based transductive classification methods and transductive regression models designed for homogeneous networks. Our method is time- and space-efficient, and the precision of our model can be verified by numerical experiments.
Pages 918-926. Published 2015-01-01. DOI: https://doi.org/10.1137/1.9781611974010.103
Citations: 18
Classifying Imbalanced Data Streams via Dynamic Feature Group Weighting with Importance Sampling.
Ke Wu, Andrea Edwards, Wei Fan, Jing Gao, Kun Zhang

Data stream classification and imbalanced data learning are two important areas of data mining research. Each has been well studied to date, with many interesting algorithms developed. However, only a few approaches reported in the literature address the intersection of these two fields, due to their complex interplay. In this work, we propose an importance sampling driven, dynamic feature group weighting framework (DFGW-IS) for classifying data streams of imbalanced distribution. Two components are tightly incorporated into the proposed approach to address the intrinsic characteristics of concept-drifting, imbalanced streaming data. Specifically, the ever-evolving concepts are tackled by a weighted ensemble trained on a set of feature groups, with each sub-classifier (i.e., a single classifier or an ensemble) weighted by its discriminative power and stability. The uneven class distribution, on the other hand, is handled by the sub-classifier built in a specific feature group, with the underlying distribution rebalanced by the importance sampling technique. We derive the theoretical upper bound for the generalization error of the proposed algorithm. We also study the empirical performance of our method on a set of benchmark synthetic and real-world data, and significant improvement has been achieved over the competing algorithms in terms of standard evaluation metrics and parallel running time. Algorithm implementations and datasets are available upon request.

Volume 2014, pages 722-730. Published 2014-04-01. DOI: https://doi.org/10.1137/1.9781611973440.83
Citations: 28
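The rebalancing idea mentioned in the DFGW-IS abstract can be illustrated with generic class-level importance weights: each instance is weighted by the ratio of a target class prior to the observed class prior, and a training sample is then drawn in proportion to those weights. This is a textbook sketch and not the DFGW-IS procedure; the function names, the uniform target prior, and the toy chunk of 90 negatives and 10 positives are assumptions.

```python
import random
from collections import Counter


def class_importance_weights(labels, target=None):
    """Weight each class by target_prior / empirical_prior."""
    counts = Counter(labels)
    n = len(labels)
    classes = sorted(counts)
    if target is None:
        # Assumption: aim for a uniform class distribution.
        target = {c: 1 / len(classes) for c in classes}
    return {c: target[c] / (counts[c] / n) for c in classes}


def resample(items, labels, weights, k, seed=0):
    """Draw k instances with replacement, proportionally to their class weight."""
    rng = random.Random(seed)
    per_item = [weights[y] for y in labels]
    picked = rng.choices(range(len(items)), weights=per_item, k=k)
    return [items[i] for i in picked], [labels[i] for i in picked]


# Toy stream chunk (hypothetical): 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
items = list(range(len(labels)))
weights = class_importance_weights(labels)
sample_items, sample_labels = resample(items, labels, weights, k=1000)
minority_share = sum(sample_labels) / len(sample_labels)
```

With these weights the two classes carry equal total mass, so the resampled chunk is roughly balanced even though the original chunk is 9:1.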
DuSK: A Dual Structure-preserving Kernel for Supervised Tensor Learning with Applications to Neuroimages.
Lifang He, Xiangnan Kong, Philip S Yu, Ann B Ragin, Zhifeng Hao, Xiaowei Yang

With advances in data collection technologies, tensor data is assuming increasing prominence in many applications, and the problem of supervised tensor learning has emerged as a topic of critical significance in the data mining and machine learning community. Conventional methods for supervised tensor learning mainly focus on learning kernels by flattening the tensor into vectors or matrices; however, structural information within the tensors is then lost. In this paper, we introduce a new scheme to design structure-preserving kernels for supervised tensor learning. Specifically, we demonstrate how to leverage the naturally available structure within the tensorial representation to encode prior knowledge in the kernel. We propose a tensor kernel that can preserve tensor structures based upon dual-tensorial mapping. The dual-tensorial mapping function can map each tensor instance in the input space to another tensor in the feature space while preserving the tensorial structure. Theoretically, our approach is an extension of the conventional kernels in the vector space to tensor space. We applied our novel kernel in conjunction with SVM to real-world tensor classification problems, including brain fMRI classification for three different diseases (i.e., Alzheimer's disease, ADHD and brain damage by HIV). Extensive empirical studies demonstrate that our proposed approach can effectively boost tensor classification performance, particularly with small sample sizes.

Volume 2014, pages 127-135. Published 2014-01-01. DOI: https://doi.org/10.1137/1.9781611973440.15
Citations: 74
Turbo-SMT: Accelerating Coupled Sparse Matrix-Tensor Factorizations by 200×.
Evangelos E Papalexakis, Christos Faloutsos, Tom M Mitchell, Partha Pratim Talukdar, Nicholas D Sidiropoulos, Brian Murphy

How can we correlate the neural activity in the human brain as it responds to typed words with properties of these terms (like 'edible', 'fits in hand')? In short, we want to find latent variables that jointly explain both the brain activity and the behavioral responses. This is one of many settings of the Coupled Matrix-Tensor Factorization (CMTF) problem. Can we accelerate any CMTF solver, so that it runs within a few minutes instead of tens of hours to a day, while maintaining good accuracy? We introduce TURBO-SMT, a meta-method capable of doing exactly that: it boosts the performance of any CMTF algorithm by up to 200×, along with an up to 65-fold increase in sparsity, with comparable accuracy to the baseline. We apply TURBO-SMT to BRAINQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, with coupling along the nouns dimension. TURBO-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy.

Volume 2014, pages 118-126. Published 2014-01-01. DOI: https://doi.org/10.1137/1.9781611973440.14
Citations: 58
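For readers unfamiliar with the CMTF problem named in the Turbo-SMT abstract, a commonly used least-squares form of the objective is written out below. The notation is an assumption and is not taken from the paper: $\mathcal{X}$ is the (nouns × brain voxels × human subjects) tensor, $Y$ is the (nouns × properties) matrix, $R$ is the number of latent components, and $A = [a_1, \dots, a_R]$ is the factor matrix shared along the nouns dimension.

```latex
\min_{A,B,C,D}\;
\Bigl\lVert \mathcal{X} - \sum_{r=1}^{R} a_r \circ b_r \circ c_r \Bigr\rVert_F^2
\;+\;
\bigl\lVert Y - A D^{\top} \bigr\rVert_F^2
```

Here $\circ$ denotes the vector outer product and $a_r$, $b_r$, $c_r$ are the $r$-th columns of $A$, $B$, $C$. The coupling comes entirely from $A$ appearing in both terms, so each latent component has to explain the brain activity and the behavioral responses at once.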
An Optimization-based Framework to Learn Conditional Random Fields for Multi-label Classification.
Mahdi Pakdaman Naeini, Iyad Batal, Zitao Liu, CharmGil Hong, Milos Hauskrecht

This paper studies the multi-label classification problem, in which data instances are associated with multiple, possibly high-dimensional, label vectors. The problem is especially challenging when labels are dependent and the task cannot be decomposed into a set of independent classification problems. To address it and properly represent label dependencies, we propose and study a pairwise conditional random field (CRF) model. We develop a new approach for learning the structure and parameters of the CRF from data. The approach maximizes the pseudo-likelihood of the observed labels, relying on fast proximal gradient descent to learn the structure and on limited-memory BFGS to learn the parameters of the model. Empirical results on several datasets show that our approach outperforms several multi-label classification baselines, including recently published state-of-the-art methods.

DOI: 10.1137/1.9781611973440.113 (https://doi.org/10.1137/1.9781611973440.113). Proceedings of the ... SIAM International Conference on Data Mining, vol. 2014, pp. 992-1000. Published 2014-01-01.
Citations: 5
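
The CRF abstract above names proximal gradient descent as the tool for structure learning. As a rough illustration of that family of methods, here is a minimal sketch of a plain (non-accelerated) proximal gradient loop with an L1 penalty. The L1 penalty, the function names `soft_threshold` and `proximal_gradient`, and the quadratic stand-in loss are all assumptions made for illustration; the abstract does not specify the regularizer, and the paper's actual objective is the CRF pseudo-likelihood, which is not reproduced here.

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of tau * ||w||_1: shrink every entry toward zero by tau."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def proximal_gradient(grad, w0, step, lam, n_iter=200):
    """Minimize f(w) + lam * ||w||_1, where grad(w) is the gradient of the smooth part f."""
    w = w0.copy()
    for _ in range(n_iter):
        # Gradient step on the smooth part, then the L1 proximal step.
        w = soft_threshold(w - step * grad(w), step * lam)
    return w

# Stand-in smooth loss f(w) = 0.5 * ||w - target||^2 (hypothetical, NOT the CRF objective).
target = np.array([2.0, 0.05, -1.0, 0.0])
w_hat = proximal_gradient(lambda w: w - target, np.zeros(4), step=1.0, lam=0.1)
print(w_hat)  # small entries are driven exactly to zero
```

Entries of `w_hat` that end up exactly zero would correspond, under this assumed setup, to pairwise label interactions dropped from the model structure.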
Discriminative Feature Selection for Uncertain Graph Classification.
Xiangnan Kong, Philip S Yu, Xue Wang, Ann B Ragin
Mining discriminative features for graph data has attracted much attention in recent years due to its important role in constructing graph classifiers, generating graph indices, etc. Most measures of interestingness for discriminative subgraph features are defined on certain graphs, where the structure of the graph objects is certain and the binary edges within each graph represent the "presence" of linkages among the nodes. In many real-world applications, however, the linkage structure of the graphs is inherently uncertain. Existing measures of interestingness based on certain graphs therefore cannot capture the structural uncertainty in these applications effectively. In this paper, we study the problem of discriminative subgraph feature selection from uncertain graphs. This problem is challenging and differs from conventional subgraph mining problems because both the structure of the graph objects and the discrimination score of each subgraph feature are uncertain. To address these challenges, we propose a novel discriminative subgraph feature selection method, Dug, which can find discriminative subgraph features in uncertain graphs based on different statistical measures, including expectation, median, mode and φ-probability. We first compute the probability distribution of the discrimination scores for each subgraph feature using dynamic programming. We then propose a branch-and-bound algorithm to search for discriminative subgraphs efficiently. Extensive experiments on various neuroimaging applications (i.e., Alzheimer's disease, ADHD and HIV) have been performed to analyze the gain in performance obtained by taking structural uncertainty into account when identifying discriminative subgraph features for graph classification.
DOI: 10.1137/1.9781611972832.10 (https://doi.org/10.1137/1.9781611972832.10). Proceedings of the ... SIAM International Conference on Data Mining, vol. 2013, pp. 82-93. Published 2013-01-01.
Citations: 48
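
The uncertain-graph abstract above says that the distribution of a subgraph feature's discrimination score is computed by dynamic programming. The abstract does not give the recurrence, so the following is only a sketch under a simplifying assumption: each uncertain graph contains the feature independently with a known probability, and the quantity of interest is the number of graphs that contain it. The function name `support_distribution` and the example probabilities are hypothetical.

```python
def support_distribution(probs):
    """Return dist, where dist[k] = P(feature occurs in exactly k graphs).

    Assumes independent per-graph containment probabilities `probs`.
    """
    dist = [1.0]  # with zero graphs processed, the count is 0 with probability 1
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this graph does not contain the feature
            nxt[k + 1] += mass * p      # this graph contains the feature
        dist = nxt
    return dist

probs = [0.9, 0.5, 0.2]  # hypothetical containment probabilities
dist = support_distribution(probs)
expected = sum(k * m for k, m in enumerate(dist))   # expectation of the count
mode = max(range(len(dist)), key=dist.__getitem__)  # most probable count
print(dist, expected, mode)
```

Once the full distribution is available, summary statistics such as the expectation, median and mode named in the abstract can be read off directly; the table costs O(n^2) time for n graphs.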