Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining: latest publications

MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation.

Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining

Pub Date : 2024-01-01 DOI: 10.1137/1.9781611978032.58

Yuan Zhong, Suhan Cui, Jiaqi Wang, Xiaochen Wang, Ziyi Yin, Yaqing Wang, Houping Xiao, Mengdi Huai, Ting Wang, Fenglong Ma

Health risk prediction aims to forecast the potential health risks that patients may face using their historical Electronic Health Records (EHR). Although several effective models have been developed, data insufficiency is a key issue undermining their effectiveness. Various data generation and augmentation methods have been introduced to mitigate this issue by expanding the size of the training data set through learning underlying data distributions. However, the performance of these methods is often limited due to their task-unrelated design. To address these shortcomings, this paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion. It enhances risk prediction performance by creating synthetic patient data during training to enlarge the sample space. Furthermore, MedDiffusion discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data. Experimental evaluation on four real-world medical datasets demonstrates that MedDiffusion outperforms 14 cutting-edge baselines in terms of PR-AUC, F1, and Cohen's Kappa. We also conduct ablation studies and benchmark our model against GAN-based alternatives to further validate the rationality and adaptability of our model design. Additionally, we analyze generated data to offer fresh insights into the model's interpretability. The source code is available via https://shorturl.at/aerT0.

Volume 2024, pages 499-507. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11469648/pdf/

Citations: 0

Automated Fusion of Multimodal Electronic Health Records for Better Medical Predictions.

Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining

Pub Date : 2024-01-01 DOI: 10.1137/1.9781611978032.41

Suhan Cui, Jiaqi Wang, Yuan Zhong, Han Liu, Ting Wang, Fenglong Ma

The widespread adoption of Electronic Health Record (EHR) systems in healthcare institutions has generated vast amounts of medical data, offering significant opportunities for improving healthcare services through deep learning techniques. However, the complex and diverse modalities and feature structures in real-world EHR data pose great challenges for deep learning model design. To address the multi-modality challenge in EHR data, current approaches primarily rely on hand-crafted model architectures based on intuition and empirical experience, leading to sub-optimal model architectures and limited performance. Therefore, to automate the process of model design for mining EHR data, we propose a novel neural architecture search (NAS) framework named AutoFM, which can automatically search for the optimal model architectures for encoding diverse input modalities and fusion strategies. We conduct thorough experiments on real-world multi-modal EHR data and prediction tasks, and the results demonstrate that our framework not only achieves significant performance improvement over existing state-of-the-art methods but also discovers meaningful network architectures effectively.

Volume 2024, pages 361-369. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11469647/pdf/

Citations: 0

FAME: Fragment-based Conditional Molecular Generation for Phenotypic Drug Discovery.

Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining

Pub Date : 2022-01-01 DOI: 10.1137/1.9781611977172.81

Thai-Hoang Pham, Lei Xie, Ping Zhang

De novo molecular design is a key challenge in drug discovery due to the complexity of chemical space. With the availability of molecular datasets and advances in machine learning, many deep generative models have been proposed for generating novel molecules with desired properties. However, most of the existing models focus only on molecular distribution learning and target-based molecular design, thereby hindering their potential in real-world applications. In drug discovery, phenotypic molecular design has advantages over target-based molecular design, especially in first-in-class drug discovery. In this work, we propose the first deep graph generative model (FAME) targeting phenotypic molecular design, in particular gene expression-based molecular design. FAME leverages a conditional variational autoencoder framework to learn the conditional distribution generating molecules from gene expression profiles. However, this distribution is difficult to learn due to the complexity of the molecular space and the noise in gene expression data. To tackle these issues, a gene expression denoising (GED) model that employs a contrastive objective function is first proposed to reduce noise from gene expression data. FAME is then designed to treat molecules as sequences of fragments and learn to generate these fragments in an autoregressive manner. By leveraging this fragment-based generation strategy and the denoised gene expression profiles, FAME can generate novel molecules with a high validity rate and desired biological activity. The experimental results show that FAME outperforms existing methods including both SMILES-based and graph-based deep generative models for phenotypic molecular design. Furthermore, the effective mechanism for reducing noise in gene expression data proposed in our study can be applied to omics data modeling in general for facilitating phenotypic drug discovery.

Volume 2022, pages 720-728. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9061137/pdf/nihms-1801466.pdf

Citations: 0

Harmonic Alignment.

Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining

Pub Date : 2020-01-01 DOI: 10.1137/1.9781611976236.36

Jay S Stanley, Scott Gigante, Guy Wolf, Smita Krishnaswamy

We propose a novel framework for combining datasets via alignment of their intrinsic geometry. This alignment can be used to fuse data originating from disparate modalities, or to correct batch effects while preserving intrinsic data structure. Importantly, we do not assume any pointwise correspondence between datasets, but instead rely on correspondence between a (possibly unknown) subset of data features. We leverage this assumption to construct an isometric alignment between the data. This alignment is obtained by relating the expansion of data features in harmonics derived from diffusion operators defined over each dataset. These expansions encode each feature as a function of the data geometry. We use this to relate the diffusion coordinates of each dataset through our assumption of partial feature correspondence. Then, a unified diffusion geometry is constructed over the aligned data, which can also be used to correct the original data measurements. We demonstrate our method on several datasets, showing in particular its effectiveness in biological applications including fusion of single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq) data measured on the same population of cells, and removal of batch effect between biological samples.

Volume 2020, pages 316-324.

Citations: 14

GRIA: Graphical Regularization for Integrative Analysis.

Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining

Pub Date : 2020-01-01 DOI: 10.1137/1.9781611976236.68

Changgee Chang, Jihwan Oh, Qi Long

Integrative analysis jointly analyzes multiple data sets to overcome the curse of dimensionality. It can detect important but weak signals by jointly selecting features for all data sets, but unfortunately the sets of important features are not always the same for all data sets. Variations that allow a heterogeneous sparsity structure (a subset of data sets can have a zero coefficient for a selected feature) have been proposed, but they compromise the effect of integrative analysis, reintroducing the problem of losing weak important signals. We propose a new integrative analysis approach which not only aggregates weak important signals well in the homogeneity setting but also substantially alleviates the problem of losing weak important signals in the heterogeneity setting. Our approach exploits a priori known graphical structure of features by forcing joint selection of adjacent features, and integrating such information over multiple data sets can increase the power while taking into account the heterogeneity across data sets. We confirm the problem of existing approaches and demonstrate the superiority of our method through a simulation study and an application to gene expression data from ADNI.

Volume 2020, pages 604-612.

Citations: 3

Region-Based Active Learning with Hierarchical and Adaptive Region Construction.

Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining

Pub Date : 2019-05-01 DOI: 10.1137/1.9781611975673.50

Zhipeng Luo, Milos Hauskrecht

Learning of classification models in practice often relies on human annotation effort in which humans assign class labels to data instances. As this process can be very time-consuming and costly, finding effective ways to reduce the annotation cost becomes critical for building such models. To solve this problem, instead of soliciting instance-based annotation we explore region-based annotation as the human feedback. A region is defined as a hyper-cubic subspace of the input space X and it covers a subpopulation of data instances that fall into this region. Each region is labeled with a number in [0,1] (in binary classification setting), representing a human estimate of the positive (or negative) class proportion in the subpopulation. To quickly discover pure regions (in terms of class proportion) in the data, we have developed a novel active learning framework that constructs regions in a hierarchical and adaptive way. Hierarchical means that regions are incrementally built into a hierarchical tree, which is done by repeatedly splitting the input space. Adaptive means that our framework can adaptively choose the best heuristic for each of the region splits. Through experiments on numerous datasets we demonstrate that our framework can identify pure regions in very few region queries. Thus our approach is shown to be effective in learning classification models from very limited human feedback.
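
A minimal Python sketch of the region idea described in the abstract above, under stated assumptions: the `Region` class, its half-open bounds, the midpoint `split`, and the `positive_proportion` helper are hypothetical names introduced here for illustration, and the class proportion is computed from known labels as a stand-in for the human estimate. It is not the authors' framework, which chooses a split heuristic adaptively for each region.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Axis-aligned hyper-cubic region with half-open bounds [lower, upper)."""
    lower: list
    upper: list

    def contains(self, x):
        # A point belongs to the region if every coordinate is inside its bounds.
        return all(lo <= v < hi for lo, v, hi in zip(self.lower, x, self.upper))

    def split(self, dim):
        # Assumed heuristic: cut the region in half along one dimension.
        mid = (self.lower[dim] + self.upper[dim]) / 2
        left_upper = list(self.upper)
        left_upper[dim] = mid
        right_lower = list(self.lower)
        right_lower[dim] = mid
        return Region(list(self.lower), left_upper), Region(right_lower, list(self.upper))


def positive_proportion(region, points, labels):
    """Fraction of positive labels among points in the region (simulated annotator)."""
    inside = [y for x, y in zip(points, labels) if region.contains(x)]
    return sum(inside) / len(inside) if inside else None


# Toy data: positives sit on the left half of the unit square.
points = [(0.1, 0.2), (0.2, 0.8), (0.7, 0.3), (0.9, 0.9)]
labels = [1, 1, 0, 0]

root = Region([0.0, 0.0], [1.0, 1.0])
left, right = root.split(0)

print(positive_proportion(root, points, labels))   # mixed region
print(positive_proportion(left, points, labels))   # pure positive region
print(positive_proportion(right, points, labels))  # pure negative region
```

In this toy case a single split along the first dimension turns one mixed region (proportion 0.5) into two pure regions (proportions 1.0 and 0.0), which is the kind of outcome the hierarchical construction aims for with few region queries.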

Volume 2019, pages 441-449.

Citations: 3

CP Tensor Decomposition with Cannot-Link Intermode Constraints.

Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining

Pub Date : 2019-05-01 DOI: 10.1137/1.9781611975673.80

Jette Henderson, Bradley A Malin, Joshua C Denny, Abel N Kho, Jimeng Sun, Joydeep Ghosh, Joyce C Ho

Tensor factorization is a methodology that is applied in a variety of fields, ranging from climate modeling to medical informatics. A tensor is an n-way array that captures the relationship between n objects. These multiway arrays can be factored to study the underlying bases present in the data. Two challenges arising in tensor factorization are 1) the resulting factors can be noisy and highly overlapping with one another and 2) they may not map to insights within a domain. However, incorporating supervision to increase the number of insightful factors can be costly in terms of the time and domain expertise necessary for gathering labels or domain-specific constraints. To meet these challenges, we introduce CANDECOMP/PARAFAC (CP) tensor factorization with Cannot-Link Intermode Constraints (CP-CLIC), a framework that achieves succinct, diverse, interpretable factors. This is accomplished by gradually learning constraints that are verified with auxiliary information during the decomposition process. We demonstrate CP-CLIC's potential to extract sparse, diverse, and interpretable factors through experiments on simulated data and a real-world application in medical informatics.
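
To make the "n-way array" and "factored" wording in the abstract above concrete, here is a small pure-Python sketch of how a plain CP model rebuilds a 3-way tensor from three factor matrices. The function name `cp_reconstruct` and the toy factor values are assumptions for illustration; the sketch shows unconstrained CP reconstruction only and does not implement the cannot-link constraints of CP-CLIC.

```python
def cp_reconstruct(A, B, C):
    """Rebuild X[i][j][k] = sum over r of A[i][r] * B[j][r] * C[k][r]."""
    rank = len(A[0])  # number of components (columns shared by all factors)
    return [
        [
            [sum(A[i][r] * B[j][r] * C[k][r] for r in range(rank))
             for k in range(len(C))]
            for j in range(len(B))
        ]
        for i in range(len(A))
    ]


# Toy rank-2 factors for a 2 x 2 x 2 tensor (illustrative values only).
A = [[1, 0], [0, 1]]
B = [[1, 2], [3, 4]]
C = [[1, 1], [0, 2]]

X = cp_reconstruct(A, B, C)
print(X)  # [[[1, 0], [3, 0]], [[2, 4], [4, 8]]]
```

Each column index `r` corresponds to one factor (component); overlapping or noisy components are what the abstract describes as the first challenge of tensor factorization.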

Volume 2019, pages 711-719.

Citations: 2

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks.

Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining

Pub Date : 2018-01-01 DOI: 10.1137/1.9781611975321.16

Yu Shi, Huan Gui, Qi Zhu, Lance Kaplan, Jiawei Han

Heterogeneous information networks (HINs) are ubiquitous in real-world applications. Due to the heterogeneity in HINs, the typed edges may not fully align with each other. In order to capture the semantic subtlety, we propose the concept of aspects, with each aspect being a unit representing one underlying semantic facet. Meanwhile, network embedding has emerged as a powerful method for learning network representation, where the learned embedding can be used as features in various downstream applications. Therefore, we are motivated to propose a novel embedding learning framework, ASPEM, to preserve the semantic information in HINs based on multiple aspects. Instead of preserving information of the network in one semantic space, ASPEM encapsulates information regarding each aspect individually. In order to select aspects for embedding purposes, we further devise a solution for ASPEM based on dataset-wide statistics. To corroborate the efficacy of ASPEM, we conducted experiments on two real-world datasets with two types of applications: classification and link prediction. Experimental results demonstrate that ASPEM can outperform baseline network embedding learning methods by considering multiple aspects, where the aspects can be selected from the given HIN in an unsupervised manner.
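
The idea of embedding each aspect in its own space can be sketched roughly as follows. This is only an illustration under stated assumptions: the aspects are supplied by hand (the abstract selects them from dataset-wide statistics), a truncated SVD of each aspect's adjacency matrix stands in for the actual embedding objective, and concatenating the per-aspect vectors is an assumed way of combining them. The function name `aspect_embeddings` and the edge-type labels are hypothetical.

```python
import numpy as np

def aspect_embeddings(num_nodes, typed_edges, aspects, dim):
    """Embed each aspect separately and concatenate the results per node.

    typed_edges: list of (u, v, edge_type) tuples.
    aspects: dict mapping an aspect name to the set of edge types it covers.
    """
    parts = []
    for name in sorted(aspects):  # fixed order so columns are reproducible
        adj = np.zeros((num_nodes, num_nodes))
        for u, v, edge_type in typed_edges:
            if edge_type in aspects[name]:
                adj[u, v] = adj[v, u] = 1.0
        # Stand-in embedding: top singular vectors scaled by sqrt(singular values).
        U, s, _ = np.linalg.svd(adj)
        parts.append(U[:, :dim] * np.sqrt(s[:dim]))
    return np.hstack(parts)

# Toy network: node 0 = author, 1 = paper, 2 = venue, 3 = term (all hypothetical).
edges = [(0, 1, "author-paper"), (1, 2, "paper-venue"), (1, 3, "paper-term")]
aspects = {
    "topic": {"author-paper", "paper-term"},
    "venue": {"author-paper", "paper-venue"},
}
emb = aspect_embeddings(4, edges, aspects, dim=2)
```

Here the term node takes no part in the "venue" aspect, so its coordinates in that aspect's block are zero, while its "topic" block carries its information.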



Citations: 81

Active Learning of Classification Models with Likert-Scale Feedback.

Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining

Pub Date : 2017-01-01 DOI: 10.1137/1.9781611974973.4

Yanbing Xue, Milos Hauskrecht

Annotation of classification data by humans can be a time-consuming and tedious process. Finding ways to reduce the annotation effort is critical for building classification models in practice and for applying them to a variety of classification tasks. In this paper, we develop a new active learning framework that combines two strategies to reduce the annotation effort. First, it relies on label uncertainty information obtained from human annotators in the form of Likert-scale feedback. Second, it uses active learning to annotate examples with the greatest expected change. We propose a Bayesian approach to calculate the expectation and an incremental SVM solver to reduce the time complexity of the solvers. We show that the combination of our active learning strategy and Likert-scale feedback learns classification models more rapidly and with fewer labeled instances than methods that rely on either Likert-scale labels or active learning alone.
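
Selecting the example with the greatest expected change can be illustrated with a simpler, well-known relative of the approach described above: expected gradient length for a binary logistic regression model. This sketch is not the paper's Bayesian, SVM-based method and ignores Likert-scale feedback; the function names `expected_gradient_length` and `select_query` are hypothetical. For a candidate x with predicted probability p, the log-loss gradient has norm |p - y|·||x||, so its expectation over the unknown label y is 2p(1 - p)·||x||.

```python
import numpy as np

def expected_gradient_length(w, X):
    """Expected norm of the log-loss gradient for each row of X.

    The expectation is taken over the model's own label distribution,
    which gives 2 * p * (1 - p) * ||x|| for logistic regression.
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return 2.0 * p * (1.0 - p) * np.linalg.norm(X, axis=1)

def select_query(w, X_pool):
    """Index of the unlabeled example expected to change the model most."""
    return int(np.argmax(expected_gradient_length(w, X_pool)))

w = np.array([1.0, 0.0])                  # current model weights (toy values)
X_pool = np.array([[3.0, 0.0],            # confidently positive
                   [0.0, 2.0],            # on the decision boundary
                   [-3.0, 0.0]])          # confidently negative
query_index = select_query(w, X_pool)
```

In this toy pool the boundary point is selected, since confidently classified points contribute little expected change regardless of their label.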


Citations: 0

Learning Linear Dynamical Systems from Multivariate Time Series: A Matrix Factorization Based Framework

Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining

Pub Date : 2016-05-01 DOI: 10.1137/1.9781611974348.91

Zitao Liu, M. Hauskrecht

The linear dynamical system (LDS) model is arguably the most commonly used time series model for real-world engineering and financial applications due to its relative simplicity, mathematically predictable behavior, and the fact that exact inference and predictions for the model can be done efficiently. In this work, we propose a new generalized LDS framework, gLDS, for learning LDS models from a collection of multivariate time series (MTS) data based on matrix factorization, which is different from traditional EM learning and spectral learning algorithms. In gLDS, each MTS sequence is factorized as a product of a shared emission matrix and sequence-specific (hidden) state dynamics, where an individual hidden state sequence is represented with the help of a shared transition matrix. One advantage of our generalized formulation is that various types of constraints can be easily incorporated into the learning process. Furthermore, we propose a novel temporal smoothing regularization approach for learning the LDS model, which stabilizes the model, its learning algorithm, and the predictions it makes. Experiments on several real-world MTS datasets show that (1) regular LDS models learned from gLDS are able to achieve better time series predictive performance than other LDS learning algorithms; (2) constraints can be directly integrated into the learning process to achieve special properties such as stability and low-rankness; and (3) the proposed temporal smoothing regularization encourages more stable and accurate predictions.
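
The matrix-factorization view described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's exact formulation: each sequence Y_m is approximated by C Z_m with a shared emission matrix C, and a penalty weighted by `lam` ties each hidden state to its predecessor through a shared transition matrix A. The objective, the weight `lam`, and the names `glds_loss` and `update_shared` are hypothetical, and only the least-squares updates of C and A with the hidden states held fixed are shown.

```python
import numpy as np

def glds_loss(Ys, Zs, C, A, lam):
    """Reconstruction error plus a penalty on deviations from z[t+1] = A z[t]."""
    loss = 0.0
    for Y, Z in zip(Ys, Zs):
        loss += np.linalg.norm(Y - C @ Z) ** 2
        loss += lam * np.linalg.norm(Z[:, 1:] - A @ Z[:, :-1]) ** 2
    return loss

def update_shared(Ys, Zs):
    """Least-squares updates of the shared matrices, hidden states held fixed."""
    Y = np.hstack(Ys)
    Z = np.hstack(Zs)
    # Pair each state with its successor within a sequence, never across sequences.
    Z_next = np.hstack([Z_m[:, 1:] for Z_m in Zs])
    Z_prev = np.hstack([Z_m[:, :-1] for Z_m in Zs])
    C = Y @ np.linalg.pinv(Z)            # minimizes ||Y - C Z||_F
    A = Z_next @ np.linalg.pinv(Z_prev)  # minimizes ||Z_next - A Z_prev||_F
    return C, A

rng = np.random.default_rng(0)
lengths = [10, 12, 8]                                  # three toy sequences
Ys = [rng.normal(size=(4, T)) for T in lengths]        # 4 observed variables
Zs = [rng.normal(size=(2, T)) for T in lengths]        # 2 hidden dimensions
C0, A0 = rng.normal(size=(4, 2)), rng.normal(size=(2, 2))

before = glds_loss(Ys, Zs, C0, A0, lam=0.5)
C1, A1 = update_shared(Ys, Zs)
after = glds_loss(Ys, Zs, C1, A1, lam=0.5)
```

Because both updates are exact least-squares solutions, `after` cannot exceed `before`. A full alternating scheme would also update each Z_m in turn, and constraints or extra regularizers would enter as additional terms in the loss.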



Citations: 23