KDD : proceedings. International Conference on Knowledge Discovery & Data Mining: latest articles

Batch Discovery of Recurring Rare Classes toward Identifying Anomalous Samples.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Pub Date : 2014-08-01 Epub Date: 2014-08-24 DOI: 10.1145/2623330.2623695

Murat Dundar, Halid Ziya Yerebakan, Bartek Rajwa

We present a clustering algorithm for discovering rare yet significant recurring classes across a batch of samples in the presence of random effects. We model the data in each sample by an infinite mixture of Dirichlet-process Gaussian-mixture models (DPMs), with each DPM representing the noisy realization of its corresponding class distribution in a given sample. We introduce dependencies across multiple samples by placing a global Dirichlet process prior over individual DPMs. This hierarchical prior introduces a sharing mechanism across samples and allows for identifying local realizations of classes across samples. We use a collapsed Gibbs sampler for inference to recover local DPMs and identify their class associations. We demonstrate the utility of the proposed algorithm by processing a flow cytometry data set containing two extremely rare cell populations, and report results that significantly outperform competing techniques. The source code of the proposed algorithm is available at http://cs.iupui.edu/~dundar/aspire.htm.

Citations: 0

Robust Multi-Task Feature Learning.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Pub Date : 2012-08-12 DOI: 10.1145/2339530.2339672

Pinghua Gong, Jieping Ye, Changshui Zhang

Multi-task learning (MTL) aims to improve the performance of multiple related tasks by exploiting the intrinsic relationships among them. Recently, multi-task feature learning algorithms have received increasing attention, and they have been successfully applied to many applications involving high-dimensional data. However, they assume that all tasks share a common set of features, which is too restrictive and may not hold in real-world applications, since outlier tasks often exist. In this paper, we propose a Robust Multi-Task Feature Learning algorithm (rMTFL) which simultaneously captures a common set of features among relevant tasks and identifies outlier tasks. Specifically, we decompose the weight (model) matrix for all tasks into two components. We impose the well-known group Lasso penalty on row groups of the first component to capture the shared features among relevant tasks. To simultaneously identify the outlier tasks, we impose the same group Lasso penalty on column groups of the second component. We propose to employ accelerated gradient descent to efficiently solve the optimization problem in rMTFL, and show that the proposed algorithm is scalable to large-size problems. In addition, we provide a detailed theoretical analysis of the proposed rMTFL formulation. Specifically, we present a theoretical bound to measure how well our proposed rMTFL approximates the true evaluation, and provide bounds to measure the error between the estimated weights of rMTFL and the underlying true weights. Moreover, by assuming that the underlying true weights are above the noise level, we present a sound theoretical result to show how to obtain the underlying true shared features and outlier tasks (sparsity patterns). Empirical studies on both synthetic and real-world data demonstrate that our proposed rMTFL is capable of simultaneously capturing shared features among tasks and identifying outlier tasks.

Citations: 0
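The two-component decomposition described in the rMTFL abstract can be illustrated with a minimal sketch. It assumes a weight matrix with features as rows and tasks as columns, split as W = P + Q; the names `P`, `Q`, `lam1`, `lam2` and the helper functions are hypothetical and are not taken from the paper. The block shows only the group Lasso penalty and its standard proximal step (block soft-thresholding), which is the kind of building block an accelerated gradient solver would call; it is not the authors' implementation.

```python
import numpy as np

def prox_row_groups(M, tau):
    """Block soft-thresholding on rows: shrink each row's l2 norm by tau.

    Rows whose norm is below tau become exactly zero, which is how a
    row-group penalty discards a feature for all tasks at once.
    """
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return M * scale

def prox_col_groups(M, tau):
    """Same operator applied to columns (one column per task)."""
    return prox_row_groups(M.T, tau).T

def rmtfl_penalty(P, Q, lam1, lam2):
    """Assumed penalty: row-group norm of P plus column-group norm of Q."""
    row_term = np.linalg.norm(P, axis=1).sum()   # shared-feature component
    col_term = np.linalg.norm(Q, axis=0).sum()   # outlier-task component
    return lam1 * row_term + lam2 * col_term
```

A non-zero column left in `Q` after thresholding marks a task that does not fit the shared feature pattern, while a zero row in `P` marks a feature that no relevant task uses.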

Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Pub Date : 2012-08-01 DOI: 10.1145/2339530.2339576

Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at considering a few million time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine truly massive time series for the first time. We demonstrate the following extremely unintuitive fact: in large datasets we can search exactly under dynamic time warping (DTW) much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We show that our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. In addition to mining massive datasets, we show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower-powered devices than are currently possible.

Citations: 0
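For readers unfamiliar with the distance measure named in the abstract above, the following is a minimal sketch of textbook DTW with a Sakoe-Chiba warping window. It does not reproduce the paper's four speed-up ideas, which the abstract does not enumerate; the function name `dtw_distance` and the `window` parameter are illustrative assumptions.

```python
import math

def dtw_distance(a, b, window):
    """DTW distance between sequences a and b (illustrative sketch).

    Uses squared point-wise cost and a Sakoe-Chiba band of half-width
    `window`; returns the square root of the accumulated cost.
    """
    n, m = len(a), len(b)
    w = max(window, abs(n - m))          # band must reach the last cell
    inf = math.inf
    prev = [inf] * (m + 1)
    prev[0] = 0.0                        # empty prefixes align at zero cost
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        lo, hi = max(1, i - w), min(m, i + w)
        for j in range(lo, hi + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of: insertion, deletion, match
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])
```

With `window=0` and equal-length inputs the band collapses to the diagonal, so the function reduces to the Euclidean distance; widening the window lets the alignment absorb local shifts in time at a higher computational cost.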

Multi-Source Learning for Joint Analysis of Incomplete Multi-Modality Neuroimaging Data.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Pub Date : 2012-01-01 DOI: 10.1145/2339530.2339710

Lei Yuan, Yalin Wang, Paul M Thompson, Vaibhav A Narayan, Jieping Ye

Incomplete data present serious problems when integrating large-scale brain imaging data sets from different imaging modalities. In the Alzheimer's Disease Neuroimaging Initiative (ADNI), for example, over half of the subjects lack cerebrospinal fluid (CSF) measurements; an independent half of the subjects do not have fluorodeoxyglucose positron emission tomography (FDG-PET) scans; many lack proteomics measurements. Traditionally, subjects with missing measures are discarded, resulting in a severe loss of available information. We address this problem by proposing two novel learning methods in which all the samples (with at least one available data source) can be used. In the first method, we divide our samples according to the availability of data sources, and we learn shared sets of features with state-of-the-art sparse learning methods. Our second method learns a base classifier for each data source independently, based on which we represent each source using a single column of prediction scores; we then estimate the missing prediction scores, which, combined with the existing prediction scores, are used to build a multi-source fusion model. To illustrate the proposed approaches, we classify patients from the ADNI study into groups with Alzheimer's disease (AD), mild cognitive impairment (MCI) and normal controls, based on the multi-modality data. At baseline, ADNI's 780 participants (172 AD, 397 MCI, 211 Normal) have at least one of four data types: magnetic resonance imaging (MRI), FDG-PET, CSF and proteomics. These data are used to test our algorithms. Comprehensive experiments show that our proposed methods yield stable and promising results.

Citations: 46

Modeling Disease Progression via Fused Sparse Group Lasso.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Pub Date : 2012-01-01 DOI: 10.1145/2339530.2339702

Jiayu Zhou, Jun Liu, Vaibhav A Narayan, Jieping Ye

Alzheimer's Disease (AD) is the most common neurodegenerative disorder associated with aging. Understanding how the disease progresses and identifying related pathological biomarkers for the progression are of primary importance in the clinical diagnosis and prognosis of Alzheimer's disease. In this paper, we develop novel multi-task learning techniques to predict the disease progression measured by cognitive scores and select biomarkers predictive of the progression. In multi-task learning, the prediction of cognitive scores at each time point is considered as a task, and multiple prediction tasks at different time points are performed simultaneously to capture the temporal smoothness of the prediction models across different time points. Specifically, we propose a novel convex fused sparse group Lasso (cFSGL) formulation that allows the simultaneous selection of a common set of biomarkers for multiple time points and specific sets of biomarkers for different time points using the sparse group Lasso penalty, and at the same time incorporates the temporal smoothness using the fused Lasso penalty. The proposed formulation is challenging to solve due to the use of several non-smooth penalties. One of the main technical contributions of this paper is to show that the proximal operator associated with the proposed formulation exhibits a certain decomposition property and can be computed efficiently; thus cFSGL can be solved efficiently using the accelerated gradient method. To further improve the model, we propose two non-convex formulations to reduce the shrinkage bias inherent in the convex formulation. We employ the difference of convex (DC) programming technique to solve the non-convex formulations. We have performed extensive experiments using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Results demonstrate the effectiveness of the proposed progression models in comparison with existing methods for disease progression. We also perform longitudinal stability selection to identify and analyze the temporal patterns of biomarkers in disease progression.

Citations: 0
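The three penalties that the cFSGL abstract combines can be made concrete with a small sketch. It assumes a weight matrix `W` with one row per biomarker and one column per time point, and it assumes the usual textbook forms of the terms: an element-wise l1 term, a fused term on differences between adjacent time points, and a row-wise group term. The exact weighting and arrangement in the paper may differ; `cfsgl_penalty`, `lam1`, `lam2` and `lam3` are hypothetical names.

```python
import numpy as np

def cfsgl_penalty(W, lam1, lam2, lam3):
    """Assumed fused sparse group Lasso penalty for W (features x time points)."""
    sparse = np.abs(W).sum()                      # l1: time-point-specific sparsity
    fused = np.abs(np.diff(W, axis=1)).sum()      # fused Lasso: temporal smoothness
    group = np.linalg.norm(W, axis=1).sum()       # group Lasso: select a biomarker at all time points
    return lam1 * sparse + lam2 * fused + lam3 * group
```

Each term is non-smooth at zero, which is why the abstract singles out the proximal operator: a gradient-based solver needs a cheap way to handle all three at once.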

Optimal Exact Least Squares Rank Minimization.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Pub Date : 2012-01-01 DOI: 10.1145/2339530.2339609

Shuo Xiang, Yunzhang Zhu, Xiaotong Shen, Jieping Ye

In multivariate analysis, rank minimization emerges when a low-rank structure of matrices is desired as well as a small estimation error. Rank minimization is nonconvex and generally NP-hard, which poses a major challenge. In this paper, we consider a nonconvex least squares formulation, which seeks to minimize the least squares loss function subject to a rank constraint. Computationally, we develop efficient algorithms to compute a global solution as well as an entire regularization solution path. Theoretically, we show that our method reconstructs the oracle estimator exactly from noisy data. As a result, it recovers the true rank optimally against any method and leads to sharper parameter estimation than its counterpart. Finally, the utility of the proposed method is demonstrated by simulations and image reconstruction from a noisy background.

Citations: 0
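As a point of reference for the rank-constrained least squares problem in the abstract above, the simplest special case has a well-known closed form: when the goal is to approximate an observed matrix `Y` directly (an identity design, which is an assumption made here and not the paper's general setting), the best approximation of rank at most `r` in Frobenius norm is the truncated SVD (Eckart-Young theorem). The helper name `best_rank_r` is hypothetical, and this sketch is not the algorithm proposed in the paper.

```python
import numpy as np

def best_rank_r(Y, r):
    """Minimize ||Y - B||_F subject to rank(B) <= r via truncated SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    # keep the r largest singular values and their singular vectors
    return (U[:, :r] * s[:r]) @ Vt[:r, :]
```

Sweeping `r` from 1 up to the full rank gives one solution per rank level, which is a simple analogue of the regularization solution path mentioned in the abstract.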

Feature Grouping and Selection Over an Undirected Graph.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Pub Date : 2012-01-01 DOI: 10.1145/2339530.2339675

Sen Yang, Lei Yuan, Ying-Cheng Lai, Xiaotong Shen, Peter Wonka, Jieping Ye

High-dimensional regression/classification continues to be an important and challenging problem, especially when features are highly correlated. Feature selection, combined with additional structure information on the features, has been considered promising for improving regression/classification performance. Graph-guided fused lasso (GFlasso) has recently been proposed to facilitate feature selection and graph structure exploitation when features exhibit certain graph structures. However, the formulation in GFlasso relies on pairwise sample correlations to perform feature grouping, which could introduce additional estimation bias. In this paper, we propose three new feature grouping and selection methods to resolve this issue. The first method employs a convex function to penalize the pairwise ℓ_∞ norm of connected regression/classification coefficients, achieving simultaneous feature grouping and selection. The second method improves the first one by utilizing a non-convex function to reduce the estimation bias. The third one is an extension of the second method that uses a truncated ℓ_1 regularization to further reduce the estimation bias. The proposed methods combine feature grouping and feature selection to enhance estimation accuracy. We employ the alternating direction method of multipliers (ADMM) and difference of convex functions (DC) programming to solve the proposed formulations. Our experimental results on synthetic data and two real datasets demonstrate the effectiveness of the proposed methods.

Citations: 0
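The pairwise penalty used by the first method in the abstract above can be sketched in a few lines. The sketch assumes the graph is given as a list of index pairs and that the pairwise infinity norm of two connected coefficients is the larger of their absolute values; `pairwise_linf_penalty`, `beta` and `edges` are illustrative names, and any additional terms or weights in the paper's objective are omitted.

```python
def pairwise_linf_penalty(beta, edges):
    """Sum over graph edges (i, j) of max(|beta[i]|, |beta[j]|)."""
    return sum(max(abs(beta[i]), abs(beta[j])) for i, j in edges)
```

Because each edge is charged only for the larger magnitude, raising the smaller coefficient up to that magnitude costs nothing extra, so connected features are pushed toward equal absolute values regardless of sign.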

Batch Mode Active Sampling based on Marginal Probability Distribution Matching.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Pub Date : 2012-01-01 DOI: 10.1145/2339530.2339647

Rita Chattopadhyay, Zheng Wang, Wei Fan, Ian Davidson, Sethuraman Panchanathan, Jieping Ye

Active learning is a machine learning and data mining technique that selects the most informative samples for labeling and uses them as training data; it is especially useful when there is a large amount of unlabeled data and labeling it is expensive. Recently, batch-mode active learning, where a set of samples is selected concurrently for labeling based on their collective merit, has attracted a lot of attention. The objective of batch-mode active learning is to select a set of informative samples so that a classifier learned on these samples has good generalization performance on the unlabeled data. Most existing batch-mode active learning methodologies try to achieve this by selecting samples based on varied criteria. In this paper we propose a novel criterion which achieves good generalization performance of a classifier by specifically selecting a set of query samples that minimizes the difference in distribution between the labeled and the unlabeled data, after annotation. We explicitly measure this difference based on all candidate subsets of the unlabeled data and select the best subset. The proposed objective is an NP-hard integer programming optimization problem. We provide two optimization techniques to solve this problem. In the first, the problem is transformed into a convex quadratic programming problem; in the second, it is transformed into a linear programming problem. Our empirical studies using publicly available UCI datasets and a biomedical image dataset demonstrate the effectiveness of the proposed approach in comparison with the state-of-the-art batch-mode active learning methods. We also present two extensions of the proposed approach, which incorporate uncertainty of the predicted labels of the unlabeled data and transfer learning into the proposed formulation. Our empirical studies on UCI datasets show that incorporating uncertainty information improves performance at later iterations, while our studies on the 20 Newsgroups dataset show that transfer learning improves the performance of the classifier during initial iterations.
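
The selection criterion described in this abstract can be illustrated with a small sketch. It assumes the distribution difference is measured with a kernel maximum mean discrepancy (MMD) under an RBF kernel, and it swaps in a plain greedy search for the quadratic and linear programming relaxations the abstract mentions. The function names (`rbf_kernel`, `mmd2`, `greedy_batch`) and the `gamma` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared MMD between samples x and y (assumed measure)."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())


def greedy_batch(labeled, unlabeled, batch_size, gamma=1.0):
    """Greedily pick indices into `unlabeled` so that (labeled + picked) matches
    the distribution of the points that stay unlabeled as closely as possible.

    This is a greedy stand-in, not the QP/LP relaxation from the abstract.
    """
    if batch_size >= len(unlabeled):
        raise ValueError("batch_size must be smaller than the unlabeled pool")
    picked = []
    remaining = list(range(len(unlabeled)))
    for _ in range(batch_size):
        best_idx, best_score = None, np.inf
        for i in remaining:
            chosen = picked + [i]
            rest = [j for j in remaining if j != i]
            train = np.vstack([labeled, unlabeled[chosen]])
            score = mmd2(train, unlabeled[rest], gamma)
            if score < best_score:
                best_idx, best_score = i, score
        picked.append(best_idx)
        remaining.remove(best_idx)
    return picked
```

When the labeled set covers only one region of the input space, this sketch tends to query points from regions the labeled set does not yet cover, because adding them shrinks the mismatch with the remaining pool.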

Citations: 0

Mining Recent Temporal Patterns for Event Detection in Multivariate Time Series Data.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Pub Date : 2012-01-01 DOI: 10.1145/2339530.2339578

Iyad Batal, Dmitriy Fradkin, James Harrison, Fabian Moerchen, Milos Hauskrecht

Improving the performance of classifiers using pattern mining techniques has been an active topic of data mining research. In this work we introduce the recent temporal pattern mining framework for finding predictive patterns for monitoring and event detection problems in complex multivariate time series data. This framework first converts time series into time-interval sequences of temporal abstractions. It then constructs more complex temporal patterns backwards in time using temporal operators. We apply our framework to health care data of 13,558 diabetic patients and show its benefits by efficiently finding useful patterns for detecting and diagnosing adverse medical conditions that are associated with diabetes.
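
The first step named in this abstract, converting a time series into time-interval sequences of temporal abstractions, can be sketched as follows. This is a minimal illustration under assumptions: the three-state value abstraction (`"L"`, `"N"`, `"H"`), the thresholds `low` and `high`, and the `before` operator are hypothetical choices, not details given in the abstract.

```python
from typing import List, Sequence, Tuple

# (state, start_time, end_time); the layout is an assumption for this sketch
Interval = Tuple[str, float, float]


def abstract_state(value: float, low: float, high: float) -> str:
    """Map a numeric value to a coarse state: Low, Normal, or High."""
    if value < low:
        return "L"
    if value > high:
        return "H"
    return "N"


def to_state_intervals(times: Sequence[float], values: Sequence[float],
                       low: float, high: float) -> List[Interval]:
    """Merge consecutive observations that share a state into one interval."""
    intervals: List[Interval] = []
    for t, v in zip(times, values):
        state = abstract_state(v, low, high)
        if intervals and intervals[-1][0] == state:
            # Same state as the previous observation: extend the interval.
            _, start, _ = intervals[-1]
            intervals[-1] = (state, start, t)
        else:
            # State changed (or first observation): open a new interval.
            intervals.append((state, t, t))
    return intervals


def before(a: Interval, b: Interval) -> bool:
    """Hypothetical temporal operator: interval a ends before interval b starts."""
    return a[2] < b[1]
```

More complex patterns could then be assembled by combining such intervals with temporal operators, starting from the most recent interval and extending backwards in time.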

Citations: 0

Brain Effective Connectivity Modeling for Alzheimer's Disease by Sparse Gaussian Bayesian Network.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Pub Date : 2011-01-01 DOI: 10.1145/2020408.2020562

Shuai Huang, Jing Li, Jieping Ye, Adam Fleisher, Kewei Chen, Teresa Wu, Eric Reiman

Recent studies have shown that Alzheimer's disease (AD) is related to alterations in brain connectivity networks. One type of connectivity, called effective connectivity, defined as the directional relationship between brain regions, is essential to brain function. However, there have been few studies on modeling the effective connectivity of AD and characterizing its difference from normal controls (NC). In this paper, we investigate the sparse Bayesian Network (BN) for effective connectivity modeling. Specifically, we propose a novel formulation for the structure learning of BNs, which involves one L1-norm penalty term to impose sparsity and another penalty to ensure that the learned BN is a directed acyclic graph, a required property of BNs. We show, through both theoretical analysis and extensive experiments on eleven moderate and large benchmark networks with various sample sizes, that the proposed method has much improved learning accuracy and scalability compared with ten competing algorithms. We apply the proposed method to FDG-PET images of 42 AD and 67 NC subjects, and identify the effective connectivity models for AD and NC, respectively. Our study reveals that the effective connectivity of AD is different from that of NC in many ways, including the global-scale effective connectivity, intra-lobe, inter-lobe, and inter-hemispheric effective connectivity distributions, as well as the effective connectivity associated with specific brain regions. These findings are consistent with known pathology and clinical progression of AD, and will contribute to AD knowledge discovery.
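
The two ingredients named in this abstract, an L1-norm sparsity penalty and a directed-acyclic-graph requirement, can be illustrated with a short sketch. It assumes a linear Gaussian BN in which `B[j, i]` is the weight of parent `j` on child `i`, and it replaces the paper's DAG-enforcing penalty with a plain acyclicity check (Kahn's algorithm); the function names and the `lam` and `tol` parameters are hypothetical.

```python
import numpy as np


def l1_penalized_loss(X, B, lam):
    """Squared reconstruction error of X from its parents plus an L1 penalty.

    X: (n_samples, n_nodes) data matrix.
    B: (n_nodes, n_nodes) weights, B[j, i] != 0 meaning an edge j -> i (assumed layout).
    """
    residual = X - X @ B
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(B))


def is_dag(B, tol=1e-8):
    """Check acyclicity of the graph induced by the nonzero entries of B.

    Uses Kahn's algorithm: repeatedly remove nodes with no incoming edges.
    The graph is acyclic exactly when every node can be removed this way.
    """
    adj = np.abs(B) > tol
    indegree = adj.sum(axis=0).astype(int)
    queue = [i for i in range(adj.shape[0]) if indegree[i] == 0]
    visited = 0
    while queue:
        j = queue.pop()
        visited += 1
        for i in np.flatnonzero(adj[j]):
            indegree[i] -= 1
            if indegree[i] == 0:
                queue.append(int(i))
    return visited == adj.shape[0]
```

A structure-learning routine would search for a `B` that keeps `l1_penalized_loss` small while `is_dag(B)` holds; how the paper couples the two through its second penalty term is not spelled out in the abstract and is not reproduced here.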

Citations: 0