
Latest publications: Proceedings of the ... SIAM International Conference on Data Mining

FAME: Fragment-based Conditional Molecular Generation for Phenotypic Drug Discovery.
Thai-Hoang Pham, Lei Xie, Ping Zhang

De novo molecular design is a key challenge in drug discovery due to the complexity of chemical space. With the availability of molecular datasets and advances in machine learning, many deep generative models have been proposed for generating novel molecules with desired properties. However, most existing models focus only on molecular distribution learning and target-based molecular design, which limits their potential in real-world applications. In drug discovery, phenotypic molecular design has advantages over target-based molecular design, especially in first-in-class drug discovery. In this work, we propose the first deep graph generative model (FAME) targeting phenotypic molecular design, in particular gene expression-based molecular design. FAME leverages a conditional variational autoencoder framework to learn the conditional distribution that generates molecules from gene expression profiles. However, this distribution is difficult to learn due to the complexity of the molecular space and the noise in gene expression data. To tackle these issues, a gene expression denoising (GED) model that employs a contrastive objective function is first proposed to reduce noise in the gene expression data. FAME is then designed to treat molecules as sequences of fragments and learns to generate these fragments in an autoregressive manner. By leveraging this fragment-based generation strategy and the denoised gene expression profiles, FAME can generate novel molecules with a high validity rate and the desired biological activity. The experimental results show that FAME outperforms existing methods, including both SMILES-based and graph-based deep generative models, for phenotypic molecular design. Furthermore, the mechanism for reducing noise in gene expression data proposed in our study can be applied to omics data modeling in general to facilitate phenotypic drug discovery.
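As a rough illustration of the conditional-generation idea in this abstract, the sketch below implements a tiny conditional variational autoencoder in PyTorch that encodes a molecule as a sequence of fragment tokens and decodes it autoregressively while conditioning on a gene expression profile. The architecture, vocabulary, dimensions, and random toy data are assumptions for illustration only; this is not the released FAME model.

```python
# Minimal conditional fragment-sequence VAE sketch (assumed architecture, not FAME itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EXPR_DIM, EMB, HID, LATENT = 100, 978, 64, 128, 32  # 978 ~ LINCS landmark genes (assumption)

class ConditionalFragmentVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.enc_rnn = nn.GRU(EMB, HID, batch_first=True)
        self.to_mu = nn.Linear(HID + EXPR_DIM, LATENT)
        self.to_logvar = nn.Linear(HID + EXPR_DIM, LATENT)
        self.dec_init = nn.Linear(LATENT + EXPR_DIM, HID)
        self.dec_rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, frags, expr):
        # Encode the fragment sequence together with the expression profile.
        h = self.enc_rnn(self.emb(frags))[1].squeeze(0)            # (B, HID)
        hc = torch.cat([h, expr], dim=-1)
        mu, logvar = self.to_mu(hc), self.to_logvar(hc)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization trick
        # Decode autoregressively with teacher forcing: predict token t+1 from tokens <= t.
        h0 = torch.tanh(self.dec_init(torch.cat([z, expr], dim=-1))).unsqueeze(0)
        dec_h, _ = self.dec_rnn(self.emb(frags[:, :-1]), h0)
        logits = self.out(dec_h)                                    # (B, T-1, VOCAB)
        recon = F.cross_entropy(logits.reshape(-1, VOCAB), frags[:, 1:].reshape(-1))
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return recon + kl

# Toy usage with random data, only to show the tensor shapes involved.
model = ConditionalFragmentVAE()
frags = torch.randint(0, VOCAB, (8, 12))   # batch of fragment-token sequences
expr = torch.randn(8, EXPR_DIM)            # (denoised) gene expression profiles
loss = model(frags, expr)
loss.backward()
```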

{"title":"FAME: Fragment-based Conditional Molecular Generation for Phenotypic Drug Discovery.","authors":"Thai-Hoang Pham,&nbsp;Lei Xie,&nbsp;Ping Zhang","doi":"10.1137/1.9781611977172.81","DOIUrl":"https://doi.org/10.1137/1.9781611977172.81","url":null,"abstract":"<p><p><i>De novo</i> molecular design is a key challenge in drug discovery due to the complexity of chemical space. With the availability of molecular datasets and advances in machine learning, many deep generative models are proposed for generating novel molecules with desired properties. However, most of the existing models focus only on molecular distribution learning and target-based molecular design, thereby hindering their potentials in real-world applications. In drug discovery, phenotypic molecular design has advantages over target-based molecular design, especially in first-in-class drug discovery. In this work, we propose the first deep graph generative model (FAME) targeting phenotypic molecular design, in particular gene expression-based molecular design. FAME leverages a conditional variational autoencoder framework to learn the conditional distribution generating molecules from gene expression profiles. However, this distribution is difficult to learn due to the complexity of the molecular space and the noisy phenomenon in gene expression data. To tackle these issues, a gene expression denoising (GED) model that employs contrastive objective function is first proposed to reduce noise from gene expression data. FAME is then designed to treat molecules as the sequences of fragments and learn to generate these fragments in autoregressive manner. By leveraging this fragment-based generation strategy and the denoised gene expression profiles, FAME can generate novel molecules with a high validity rate and desired biological activity. The experimental results show that FAME outperforms existing methods including both SMILES-based and graph-based deep generative models for phenotypic molecular design. Furthermore, the effective mechanism for reducing noise in gene expression data proposed in our study can be applied to omics data modeling in general for facilitating phenotypic drug discovery.</p>","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9061137/pdf/nihms-1801466.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9664973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Harmonic Alignment.
Jay S Stanley, Scott Gigante, Guy Wolf, Smita Krishnaswamy

We propose a novel framework for combining datasets via alignment of their intrinsic geometry. This alignment can be used to fuse data originating from disparate modalities, or to correct batch effects while preserving intrinsic data structure. Importantly, we do not assume any pointwise correspondence between datasets, but instead rely on correspondence between a (possibly unknown) subset of data features. We leverage this assumption to construct an isometric alignment between the data. This alignment is obtained by relating the expansion of data features in harmonics derived from diffusion operators defined over each dataset. These expansions encode each feature as a function of the data geometry. We use this to relate the diffusion coordinates of each dataset through our assumption of partial feature correspondence. Then, a unified diffusion geometry is constructed over the aligned data, which can also be used to correct the original data measurements. We demonstrate our method on several datasets, showing in particular its effectiveness in biological applications including fusion of single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq) data measured on the same population of cells, and removal of batch effect between biological samples.
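The following numpy sketch conveys the flavor of the approach: compute diffusion harmonics on each dataset, expand the (assumed shared) features in each harmonic basis, and use the correlations of those expansions to estimate an orthogonal map between the two bases. The bandwidth choice, number of harmonics, and synthetic data are placeholders, and the published method is considerably more careful, so treat this as a simplified illustration rather than the authors' implementation.

```python
# Simplified diffusion-harmonic alignment sketch (assumptions throughout).
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_harmonics(X, k=20):
    """Top-k eigenvectors of a simple symmetric diffusion operator built on X."""
    D2 = cdist(X, X, "sqeuclidean")
    sigma2 = np.median(D2)                        # crude bandwidth choice
    K = np.exp(-D2 / sigma2)                      # Gaussian affinity
    d = K.sum(axis=1)
    A = K / np.sqrt(np.outer(d, d))               # symmetrically normalized affinity
    vals, vecs = np.linalg.eigh(A)
    return vecs[:, np.argsort(-vals)[:k]]         # (n_points, k) harmonics

rng = np.random.default_rng(0)
base = rng.normal(size=(450, 30))                 # two batches from the same population
X1 = base[:200]
X2 = base[200:] + 0.5                             # second batch with an additive batch effect

Phi1, Phi2 = diffusion_harmonics(X1), diffusion_harmonics(X2)

# Expand each (assumed shared) feature as a function over each dataset's geometry,
# then relate the two harmonic bases through the correlations of these expansions.
C1 = Phi1.T @ X1                                  # (k, n_features) coefficients in basis 1
C2 = Phi2.T @ X2                                  # (k, n_features) coefficients in basis 2
U, _, Vt = np.linalg.svd(C1 @ C2.T)
R = U @ Vt                                        # orthogonal map from basis 1 to basis 2

aligned_coords_1 = Phi1 @ R                       # dataset 1 in dataset 2's harmonic frame
aligned_coords_2 = Phi2
print(aligned_coords_1.shape, aligned_coords_2.shape)
```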

{"title":"Harmonic Alignment.","authors":"Jay S Stanley,&nbsp;Scott Gigante,&nbsp;Guy Wolf,&nbsp;Smita Krishnaswamy","doi":"10.1137/1.9781611976236.36","DOIUrl":"https://doi.org/10.1137/1.9781611976236.36","url":null,"abstract":"<p><p>We propose a novel framework for combining datasets via alignment of their intrinsic geometry. This alignment can be used to fuse data originating from disparate modalities, or to correct batch effects while preserving intrinsic data structure. Importantly, we do not assume any pointwise correspondence between datasets, but instead rely on correspondence between a (possibly unknown) subset of data features. We leverage this assumption to construct an isometric alignment between the data. This alignment is obtained by relating the expansion of data features in harmonics derived from diffusion operators defined over each dataset. These expansions encode each feature as a function of the data geometry. We use this to relate the diffusion coordinates of each dataset through our assumption of partial feature correspondence. Then, a unified diffusion geometry is constructed over the aligned data, which can also be used to correct the original data measurements. We demonstrate our method on several datasets, showing in particular its effectiveness in biological applications including fusion of single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq) data measured on the same population of cells, and removal of batch effect between biological samples.</p>","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1137/1.9781611976236.36","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25481751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
GRIA: Graphical Regularization for Integrative Analysis.
Changgee Chang, Jihwan Oh, Qi Long

Integrative analysis jointly analyzes multiple data sets to overcome the curse of dimensionality. It can detect important but weak signals by jointly selecting features for all data sets, but unfortunately the sets of important features are not always the same across data sets. Variants that allow a heterogeneous sparsity structure (a subset of data sets can have a zero coefficient for a selected feature) have been proposed, but they compromise the effect of integrative analysis, reintroducing the problem of losing weak important signals. We propose a new integrative analysis approach that not only aggregates weak important signals well in the homogeneous setting but also substantially alleviates the problem of losing weak important signals in the heterogeneous setting. Our approach exploits an a priori known graphical structure of the features by forcing joint selection of adjacent features, and integrating such information over multiple data sets can increase power while taking into account the heterogeneity across data sets. We confirm the problem of existing approaches and demonstrate the superiority of our method through a simulation study and an application to gene expression data from ADNI.
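A minimal numpy sketch of the kind of graph-guided regularization described here appears below: a known feature graph contributes a Laplacian penalty that encourages adjacent features to receive similar coefficients, and two data sets are pooled so that weak signals accumulate. The chain graph, the ridge-style solver, and the simulated data are assumptions for illustration only and do not reproduce the GRIA estimator.

```python
# Graph-Laplacian-regularized regression over pooled data sets (illustrative, not GRIA).
import numpy as np

rng = np.random.default_rng(1)
p = 50
# Hypothetical feature graph: a chain, so feature j is adjacent to feature j+1.
edges = [(j, j + 1) for j in range(p - 1)]
L = np.zeros((p, p))
for i, j in edges:                      # graph Laplacian L = D - A
    L[i, i] += 1
    L[j, j] += 1
    L[i, j] -= 1
    L[j, i] -= 1

def laplacian_ridge(X, y, L, lam=5.0, eps=1e-3):
    """Solve min ||y - Xb||^2 + lam * b'Lb + eps * ||b||^2 in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * L + eps * np.eye(p), X.T @ y)

# Two data sets measuring the same features (e.g., two studies); coefficients are
# estimated jointly by stacking, so weak signals supported by the graph accumulate.
X1, X2 = rng.normal(size=(80, p)), rng.normal(size=(60, p))
beta_true = np.zeros(p)
beta_true[10:15] = 0.5                  # a small block of weak, adjacent signals
y1 = X1 @ beta_true + rng.normal(size=80)
y2 = X2 @ beta_true + rng.normal(size=60)

beta_joint = laplacian_ridge(np.vstack([X1, X2]), np.concatenate([y1, y2]), L)
print(np.round(beta_joint[5:20], 2))
```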

{"title":"GRIA: Graphical Regularization for Integrative Analysis.","authors":"Changgee Chang,&nbsp;Jihwan Oh,&nbsp;Qi Long","doi":"10.1137/1.9781611976236.68","DOIUrl":"https://doi.org/10.1137/1.9781611976236.68","url":null,"abstract":"<p><p>Integrative analysis jointly analyzes multiple data sets to overcome curse of dimensionality. It can detect important but weak signals by jointly selecting features for all data sets, but unfortunately the sets of important features are not always the same for all data sets. Variations which allows heterogeneous sparsity structure-a subset of data sets can have a zero coefficient for a selected feature-have been proposed, but it compromises the effect of integrative analysis recalling the problem of losing weak important signals. We propose a new integrative analysis approach which not only aggregates weak important signals well in homogeneity setting but also substantially alleviates the problem of losing weak important signals in heterogeneity setting. Our approach exploits a priori known graphical structure of features by forcing joint selection of adjacent features, and integrating such information over multiple data sets can increase the power while taking into account the heterogeneity across data sets. We confirm the problem of existing approaches and demonstrate the superiority of our method through a simulation study and an application to gene expression data from ADNI.</p>","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1137/1.9781611976236.68","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37962526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Region-Based Active Learning with Hierarchical and Adaptive Region Construction.
Zhipeng Luo, Milos Hauskrecht

Learning of classification models in practice often relies on human annotation effort in which humans assign class labels to data instances. As this process can be very time-consuming and costly, finding effective ways to reduce the annotation cost becomes critical for building such models. To solve this problem, instead of soliciting instance-based annotation we explore region-based annotation as the human feedback. A region is defined as a hyper-cubic subspace of the input space X, and it covers the subpopulation of data instances that fall into this region. Each region is labeled with a number in [0,1] (in the binary classification setting), representing a human estimate of the positive (or negative) class proportion in the subpopulation. To quickly discover pure regions (in terms of class proportion) in the data, we have developed a novel active learning framework that constructs regions in a hierarchical and adaptive way. Hierarchical means that regions are incrementally built into a hierarchical tree by repeatedly splitting the input space. Adaptive means that our framework can adaptively choose the best heuristic for each of the region splits. Through experiments on numerous datasets we demonstrate that our framework can identify pure regions with very few region queries. Thus our approach is shown to be effective in learning classification models from very limited human feedback.
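The toy sketch below illustrates the region-query idea: regions are axis-aligned boxes, a simulated annotator returns the positive-class proportion inside a box, and the most impure leaf is repeatedly split to grow a region tree. It uses one fixed splitting heuristic (halving the widest dimension), whereas the paper's framework chooses heuristics adaptively, so this is only an assumed simplification.

```python
# Toy region-based annotation loop (simulated feedback, fixed splitting heuristic).
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)      # hidden labels, used only to simulate feedback

def region_feedback(lo, hi):
    """Simulated annotator: positive-class proportion and count inside the box [lo, hi]."""
    inside = np.all((X >= lo) & (X <= hi), axis=1)
    return (y[inside].mean() if inside.any() else 0.5), inside.sum()

def split_region(lo, hi):
    """Split the box in half along its widest dimension."""
    dim = int(np.argmax(hi - lo))
    mid = (lo[dim] + hi[dim]) / 2
    lo2, hi1 = lo.copy(), hi.copy()
    hi1[dim], lo2[dim] = mid, mid
    return (lo, hi1), (lo2, hi)

# Greedy loop: always refine the most impure (closest to 50/50) leaf region.
leaves = [(np.zeros(2), np.ones(2))]
for _ in range(15):                             # budget of 15 region queries
    purities = [abs(region_feedback(lo, hi)[0] - 0.5) for lo, hi in leaves]
    lo, hi = leaves.pop(int(np.argmin(purities)))
    leaves.extend(split_region(lo, hi))

for lo, hi in leaves:
    prop, n = region_feedback(lo, hi)
    print(f"region {np.round(lo, 2)} to {np.round(hi, 2)}: {n:4d} pts, positive rate {prop:.2f}")
```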

{"title":"Region-Based Active Learning with Hierarchical and Adaptive Region Construction.","authors":"Zhipeng Luo,&nbsp;Milos Hauskrecht","doi":"10.1137/1.9781611975673.50","DOIUrl":"https://doi.org/10.1137/1.9781611975673.50","url":null,"abstract":"<p><p>Learning of classification models in practice often relies on human annotation effort in which humans assign class labels to data instances. As this process can be very time-consuming and costly, finding effective ways to reduce the annotation cost becomes critical for building such models. To solve this problem, instead of soliciting instance-based annotation we explore <i>region</i>-based annotation as the human feedback. A region is defined as a hyper-cubic subspace of the input space <i>X</i> and it covers a subpopulation of data instances that fall into this region. Each region is labeled with a number in [0,1] (in binary classification setting), representing a human estimate of the positive (or negative) class proportion in the subpopulation. To quickly discover pure regions (in terms of class proportion) in the data, we have developed a novel active learning framework that constructs regions in a <i>hierarchical</i> and <i>adaptive</i> way. <i>Hierarchical</i> means that regions are incrementally built into a hierarchical tree, which is done by repeatedly splitting the input space. <i>Adaptive</i> means that our framework can adaptively choose the best heuristic for each of the region splits. Through experiments on numerous datasets we demonstrate that our framework can identify pure regions in very few region queries. Thus our approach is shown to be effective in learning classification models from very limited human feedback.</p>","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1137/1.9781611975673.50","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37534776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
CP Tensor Decomposition with Cannot-Link Intermode Constraints.
Jette Henderson, Bradley A Malin, Joshua C Denny, Abel N Kho, Jimeng Sun, Joydeep Ghosh, Joyce C Ho

Tensor factorization is a methodology that is applied in a variety of fields, ranging from climate modeling to medical informatics. A tensor is an n-way array that captures the relationship between n objects. These multiway arrays can be factored to study the underlying bases present in the data. Two challenges arising in tensor factorization are 1) the resulting factors can be noisy and highly overlapping with one another and 2) they may not map to insights within a domain. However, incorporating supervision to increase the number of insightful factors can be costly in terms of the time and domain expertise necessary for gathering labels or domain-specific constraints. To meet these challenges, we introduce CANDECOMP/PARAFAC (CP) tensor factorization with Cannot-Link Intermode Constraints (CP-CLIC), a framework that achieves succinct, diverse, interpretable factors. This is accomplished by gradually learning constraints that are verified with auxiliary information during the decomposition process. We demonstrate CP-CLIC's potential to extract sparse, diverse, and interpretable factors through experiments on simulated data and a real-world application in medical informatics.
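For readers unfamiliar with CP factorization, the numpy sketch below implements plain CP decomposition of a 3-way tensor by alternating least squares, the unconstrained building block that CP-CLIC augments. The cannot-link intermode constraints and the auxiliary-information checks described in the abstract are not included, and the toy tensor is synthetic.

```python
# Plain CP decomposition by alternating least squares (ALS); no CP-CLIC constraints.
import numpy as np

def cp_als(X, rank, n_iter=50, seed=0):
    """Rank-`rank` CP decomposition of a 3-way tensor X, returning factors A, B, C."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.normal(size=(I, rank))
    B = rng.normal(size=(J, rank))
    C = rng.normal(size=(K, rank))
    for _ in range(n_iter):
        # Each update solves the least-squares problem for one factor with the others fixed.
        A = np.einsum("ijk,jr,kr->ir", X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum("ijk,ir,kr->jr", X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum("ijk,ir,jr->kr", X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Toy tensor, e.g. patients x diagnoses x medications in the medical-informatics setting.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((30, 4)), rng.random((20, 4)), rng.random((10, 4))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0) + 0.01 * rng.normal(size=(30, 20, 10))

A, B, C = cp_als(X, rank=4)
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
print("relative reconstruction error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```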

{"title":"CP Tensor Decomposition with Cannot-Link Intermode Constraints.","authors":"Jette Henderson,&nbsp;Bradley A Malin,&nbsp;Joshua C Denny,&nbsp;Abel N Kho,&nbsp;Jimeng Sun,&nbsp;Joydeep Ghosh,&nbsp;Joyce C Ho","doi":"10.1137/1.9781611975673.80","DOIUrl":"10.1137/1.9781611975673.80","url":null,"abstract":"<p><p>Tensor factorization is a methodology that is applied in a variety of fields, ranging from climate modeling to medical informatics. A tensor is an <i>n</i>-way array that captures the relationship between <i>n</i> objects. These multiway arrays can be factored to study the underlying bases present in the data. Two challenges arising in tensor factorization are 1) the resulting factors can be noisy and highly overlapping with one another and 2) they may not map to insights within a domain. However, incorporating supervision to increase the number of insightful factors can be costly in terms of the time and domain expertise necessary for gathering labels or domain-specific constraints. To meet these challenges, we introduce CANDECOMP/PARAFAC (CP) tensor factorization with Cannot-Link Intermode Constraints (CP-CLIC), a framework that achieves succinct, diverse, interpretable factors. This is accomplished by gradually learning constraints that are verified with auxiliary information during the decomposition process. We demonstrate CP-CLIC's potential to extract sparse, diverse, and interpretable factors through experiments on simulated data and a real-world application in medical informatics.</p>","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1137/1.9781611975673.80","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37328173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks.
Yu Shi, Huan Gui, Qi Zhu, Lance Kaplan, Jiawei Han

Heterogeneous information networks (HINs) are ubiquitous in real-world applications. Due to the heterogeneity in HINs, the typed edges may not fully align with each other. In order to capture the semantic subtlety, we propose the concept of aspects, with each aspect being a unit representing one underlying semantic facet. Meanwhile, network embedding has emerged as a powerful method for learning network representation, where the learned embedding can be used as features in various downstream applications. Therefore, we are motivated to propose a novel embedding learning framework, ASPEM, to preserve the semantic information in HINs based on multiple aspects. Instead of preserving information of the network in one semantic space, ASPEM encapsulates information regarding each aspect individually. In order to select aspects for the embedding purpose, we further devise a solution for ASPEM based on dataset-wide statistics. To corroborate the efficacy of ASPEM, we conducted experiments on two real-world datasets with two types of applications: classification and link prediction. Experiment results demonstrate that ASPEM can outperform baseline network embedding learning methods by considering multiple aspects, where the aspects can be selected from the given HIN in an unsupervised manner.
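A loose sketch of the per-aspect embedding idea is shown below: typed edge lists of a toy bibliographic network are grouped into hand-picked aspects, and the papers in each aspect are embedded separately via truncated SVD of the corresponding bipartite adjacency. ASPEM instead selects aspects from dataset-wide statistics and uses a dedicated embedding objective, so every name, aspect choice, and dataset here is an assumption for illustration.

```python
# Per-aspect node embeddings on a toy heterogeneous network (not the ASPEM algorithm).
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(3)
n_papers, n_authors, n_venues = 100, 40, 8
# Toy typed edge lists: (paper, author) and (paper, venue) links.
pa = [(p, int(rng.integers(n_authors))) for p in range(n_papers) for _ in range(2)]
pv = [(p, int(rng.integers(n_venues))) for p in range(n_papers)]

# Two hand-picked aspects, each covering one semantic facet of the network;
# ASPEM would select aspects automatically instead.
aspects = {
    "authorship": [(pa, n_authors)],
    "venue": [(pv, n_venues)],
}

def embed_papers(edge_lists, dim=16):
    """Embed papers for one aspect via truncated SVD of each typed bipartite adjacency."""
    parts = []
    for edges, n_cols in edge_lists:
        rows, cols = zip(*edges)
        A = coo_matrix((np.ones(len(edges)), (rows, cols)),
                       shape=(n_papers, n_cols)).tocsr()
        k = min(dim, n_cols - 1, n_papers - 1)
        U, s, _ = svds(A, k=k)
        parts.append(U * s)               # paper coordinates from this edge type
    return np.hstack(parts)

per_aspect = {name: embed_papers(e) for name, e in aspects.items()}
# Each paper now has one embedding per aspect; downstream tasks can use them
# separately or concatenated as features.
print({name: emb.shape for name, emb in per_aspect.items()})
```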

{"title":"AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks.","authors":"Yu Shi,&nbsp;Huan Gui,&nbsp;Qi Zhu,&nbsp;Lance Kaplan,&nbsp;Jiawei Han","doi":"10.1137/1.9781611975321.16","DOIUrl":"10.1137/1.9781611975321.16","url":null,"abstract":"<p><p>Heterogeneous information networks (HINs) are ubiquitous in real-world applications. Due to the heterogeneity in HINs, the typed edges may not fully align with each other. In order to capture the semantic subtlety, we propose the concept of aspects with each aspect being a unit representing one underlying semantic facet. Meanwhile, network embedding has emerged as a powerful method for learning network representation, where the learned embedding can be used as features in various downstream applications. Therefore, we are motivated to propose a novel embedding learning framework-ASPEM-to preserve the semantic information in HINs based on multiple aspects. Instead of preserving information of the network in one semantic space, ASPEM encapsulates information regarding each aspect individually. In order to select aspects for embedding purpose, we further devise a solution for ASPEM based on dataset-wide statistics. To corroborate the efficacy of ASPEM, we conducted experiments on two real-words datasets with two types of applications-classification and link prediction. Experiment results demonstrate that ASPEM can outperform baseline network embedding learning methods by considering multiple aspects, where the aspects can be selected from the given HIN in an unsupervised manner.</p>","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1137/1.9781611975321.16","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36496991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 81
Active Learning of Classification Models with Likert-Scale Feedback.
Yanbing Xue, Milos Hauskrecht

Annotation of classification data by humans can be a time-consuming and tedious process. Finding ways of reducing the annotation effort is critical for building classification models in practice and for applying them to a variety of classification tasks. In this paper, we develop a new active learning framework that combines two strategies to reduce the annotation effort. First, it relies on label uncertainty information obtained from the human in terms of Likert-scale feedback. Second, it uses active learning to annotate examples with the greatest expected change. We propose a Bayesian approach to calculate the expectation and an incremental SVM solver to reduce the time complexity of the solver. We show that the combination of our active learning strategy and the Likert-scale feedback can learn classification models more rapidly and with a smaller number of labeled instances than methods that rely on either Likert-scale labels or active learning alone.
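The sketch below shows one simple way to combine the two ingredients named in the abstract: Likert ratings are mapped to soft supervision (a hard label plus a confidence weight), and the next query is chosen by uncertainty sampling with a scikit-learn logistic regression. The mapping tables, the simulated annotator, and the uncertainty criterion are our own assumptions; the paper uses an expected-model-change criterion and an incremental SVM solver instead.

```python
# Active learning with simulated Likert-scale feedback (simplified stand-in, not the paper's method).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
w_true = rng.normal(size=5)
p_true = 1 / (1 + np.exp(-X @ w_true))         # hidden ground truth, used only by the oracle

def likert_feedback(i):
    """Simulated annotator: a 1..5 rating of how positive instance i looks."""
    return 1 + int(round(4 * p_true[i]))

# Likert rating -> (hard label, confidence weight); uncertain ratings get low weight.
RATING_TO_LABEL = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
RATING_TO_WEIGHT = {1: 1.0, 2: 0.5, 3: 0.1, 4: 0.5, 5: 1.0}

# Seed with one clearly positive and one clearly negative instance plus a few random ones.
labeled = [int(np.argmax(p_true)), int(np.argmin(p_true))] + list(rng.choice(500, 8, replace=False))
ratings = {i: likert_feedback(i) for i in labeled}

model = LogisticRegression()
for _ in range(20):                             # budget of 20 instance queries
    ys = [RATING_TO_LABEL[ratings[i]] for i in labeled]
    ws = [RATING_TO_WEIGHT[ratings[i]] for i in labeled]
    model.fit(X[labeled], ys, sample_weight=ws)
    proba = model.predict_proba(X)[:, 1]
    pool = np.setdiff1d(np.arange(500), labeled)
    query = int(pool[np.argmin(np.abs(proba[pool] - 0.5))])   # most uncertain unlabeled point
    labeled.append(query)
    ratings[query] = likert_feedback(query)

print("labeled instances used:", len(labeled))
```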

{"title":"Active Learning of Classification Models with Likert-Scale Feedback.","authors":"Yanbing Xue, Milos Hauskrecht","doi":"10.1137/1.9781611974973.4","DOIUrl":"10.1137/1.9781611974973.4","url":null,"abstract":"<p><p>Annotation of classification data by humans can be a time-consuming and tedious process. Finding ways of reducing the annotation effort is critical for building the classification models in practice and for applying them to a variety of classification tasks. In this paper, we develop a new active learning framework that combines two strategies to reduce the annotation effort. First, it relies on label uncertainty information obtained from the human in terms of the Likert-scale feedback. Second, it uses active learning to annotate examples with the greatest expected change. We propose a Bayesian approach to calculate the expectation and an incremental SVM solver to reduce the time complexity of the solvers. We show the combination of our active learning strategy and the Likert-scale feedback can learn classification models more rapidly and with a smaller number of labeled instances than methods that rely on either Likert-scale labels or active learning alone.</p>","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5624557/pdf/nihms857286.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35417827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Learning Linear Dynamical Systems from Multivariate Time Series: A Matrix Factorization Based Framework
Zitao Liu, M. Hauskrecht
The linear dynamical system (LDS) model is arguably the most commonly used time series model for real-world engineering and financial applications due to its relative simplicity, mathematically predictable behavior, and the fact that exact inference and predictions for the model can be done efficiently. In this work, we propose a new generalized LDS framework, gLDS, for learning LDS models from a collection of multivariate time series (MTS) data based on matrix factorization, which is different from traditional EM learning and spectral learning algorithms. In gLDS, each MTS sequence is factorized as the product of a shared emission matrix and a sequence-specific (hidden) state dynamics, where an individual hidden state sequence is represented with the help of a shared transition matrix. One advantage of our generalized formulation is that various types of constraints can be easily incorporated into the learning process. Furthermore, we propose a novel temporal smoothing regularization approach for learning the LDS model, which stabilizes the model, its learning algorithm, and the predictions it makes. Experiments on several real-world MTS datasets show that (1) regular LDS models learned from gLDS are able to achieve better time series predictive performance than other LDS learning algorithms; (2) constraints can be directly integrated into the learning process to achieve special properties such as stability and low-rankness; and (3) the proposed temporal smoothing regularization encourages more stable and accurate predictions.
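The numpy sketch below factorizes a small collection of synthetic multivariate time series into a shared emission matrix and per-sequence hidden state trajectories by alternating least squares, then fits a shared transition matrix to consecutive states. It omits the temporal smoothing regularization and other details of gLDS, so it should be read as a simplified stand-in, not the paper's algorithm.

```python
# Matrix-factorization view of LDS learning on synthetic sequences (simplified, not gLDS).
import numpy as np

rng = np.random.default_rng(5)
d, k, T, n_seq = 10, 3, 60, 5                  # observed dim, hidden dim, length, #sequences
C_true = rng.normal(size=(d, k))
A_true = 0.9 * np.eye(k)                       # slowly decaying hidden dynamics
seqs = []
for _ in range(n_seq):
    z = np.zeros((k, T))
    z[:, 0] = rng.normal(size=k)
    for t in range(1, T):
        z[:, t] = A_true @ z[:, t - 1] + 0.1 * rng.normal(size=k)
    seqs.append(C_true @ z + 0.05 * rng.normal(size=(d, T)))

# Alternating least squares: a shared emission matrix C and per-sequence hidden states Z_n.
C = rng.normal(size=(d, k))
for _ in range(30):
    Zs = [np.linalg.lstsq(C, Y, rcond=None)[0] for Y in seqs]      # states given C
    Y_all, Z_all = np.hstack(seqs), np.hstack(Zs)
    C = Y_all @ Z_all.T @ np.linalg.inv(Z_all @ Z_all.T)           # emission given states

# A shared transition matrix fit to consecutive hidden states of every sequence.
Z_prev = np.hstack([Z[:, :-1] for Z in Zs])
Z_next = np.hstack([Z[:, 1:] for Z in Zs])
A = Z_next @ Z_prev.T @ np.linalg.inv(Z_prev @ Z_prev.T)

# One-step-ahead prediction of the last observation of the first sequence.
Y, Z = seqs[0], Zs[0]
print("one-step prediction error:", np.linalg.norm(C @ (A @ Z[:, -2]) - Y[:, -1]))
```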
{"title":"Learning Linear Dynamical Systems from Multivariate Time Series: A Matrix Factorization Based Framework","authors":"Zitao Liu, M. Hauskrecht","doi":"10.1137/1.9781611974348.91","DOIUrl":"https://doi.org/10.1137/1.9781611974348.91","url":null,"abstract":"The linear dynamical system (LDS) model is arguably the most commonly used time series model for real-world engineering and financial applications due to its relative simplicity, mathematically predictable behavior, and the fact that exact inference and predictions for the model can be done efficiently. In this work, we propose a new generalized LDS framework, gLDS, for learning LDS models from a collection of multivariate time series (MTS) data based on matrix factorization, which is different from traditional EM learning and spectral learning algorithms. In gLDS, each MTS sequence is factorized as a product of a shared emission matrix and a sequence-specific (hidden) state dynamics, where an individual hidden state sequence is represented with the help of a shared transition matrix. One advantage of our generalized formulation is that various types of constraints can be easily incorporated into the learning process. Furthermore, we propose a novel temporal smoothing regularization approach for learning the LDS model, which stabilizes the model, its learning algorithm and predictions it makes. Experiments on several real-world MTS data show that (1) regular LDS models learned from gLDS are able to achieve better time series predictive performance than other LDS learning algorithms; (2) constraints can be directly integrated into the learning process to achieve special properties such as stability, low-rankness; and (3) the proposed temporal smoothing regularization encourages more stable and accurate predictions.","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75317811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 23
MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance.
Jingbo Shang, Jian Peng, Jiawei Han

Consecutive pattern mining, which aims at finding sequential pattern substrings, is a special case of frequent pattern mining and has played a crucial role in many real-world applications, especially in biological sequence analysis, time series analysis, and network log mining. Approximations between strings, including insertions, deletions, and substitutions, are widely used in biological sequence comparisons. However, most existing string pattern mining methods only consider Hamming distance without insertions/deletions (indels). Little attention has been paid to general approximate consecutive frequent pattern mining under edit distance, potentially due to the high computational complexity, particularly on DNA sequences with billions of base pairs. In this paper, we introduce an efficient solution to this problem. We first formulate the Maximal Approximate Consecutive Frequent Pattern Mining (MACFP) problem, which identifies substring patterns under edit distance in a long query sequence. Then, we propose a novel algorithm with linear time complexity to check whether the support of a substring pattern is above a predefined threshold in the query sequence, thus greatly reducing the computational complexity of MACFP. With this fast decision algorithm, we can efficiently solve the original pattern discovery problem with several indexing and searching techniques. Comprehensive experiments on sequence pattern analysis and a study on a cancer genomics application demonstrate the effectiveness and efficiency of our algorithm compared to several existing methods.
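To make the support notion concrete, the snippet below counts approximate occurrences of a pattern in a sequence under edit distance with a standard semi-global dynamic program whose first row is zero, so a match may end at any position. This brute-force check runs in O(nm) time; the paper's contribution is precisely a much faster, linear-time support check and a search for maximal frequent patterns, neither of which is attempted here.

```python
# Brute-force approximate-support counting under edit distance (baseline, not the MACFP algorithm).
def approx_support(pattern, text, max_edits):
    """Count positions in `text` where some substring ending there is within
    `max_edits` edits (insertions/deletions/substitutions) of `pattern`."""
    m = len(pattern)
    prev = list(range(m + 1))        # DP column before any text is read: distance to prefix i
    count = 0
    for ch in text:
        cur = [0] * (m + 1)          # cur[0] = 0: a match may start anywhere in the text
        for i in range(1, m + 1):
            cur[i] = min(prev[i] + 1,                              # skip the text char ch
                         cur[i - 1] + 1,                           # skip pattern[i-1]
                         prev[i - 1] + (pattern[i - 1] != ch))     # match or substitute
        if cur[m] <= max_edits:
            count += 1               # an approximate occurrence of the pattern ends here
        prev = cur
    return count

# A pattern would count as frequent if this support clears a threshold; finding *maximal*
# frequent patterns, as MACFP does, is not attempted in this snippet.
seq = "ACGTACGTTACGAACGT" * 10
print(approx_support("ACGTACG", seq, max_edits=1))
```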

{"title":"MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance.","authors":"Jingbo Shang, Jian Peng, Jiawei Han","doi":"10.1137/1.9781611974348.63","DOIUrl":"10.1137/1.9781611974348.63","url":null,"abstract":"<p><p>Consecutive pattern mining aiming at finding sequential patterns substrings, is a special case of frequent pattern mining and has been played a crucial role in many real world applications, especially in biological sequence analysis, time series analysis, and network log mining. Approximations, including insertions, deletions, and substitutions, between strings are widely used in biological sequence comparisons. However, most existing string pattern mining methods only consider hamming distance without insertions/deletions (indels). Little attention has been paid to the general approximate consecutive frequent pattern mining under edit distance, potentially due to the high computational complexity, particularly on DNA sequences with billions of base pairs. In this paper, we introduce an efficient solution to this problem. We first formulate the Maximal Approximate Consecutive Frequent Pattern Mining (MACFP) problem that identifies substring patterns under edit distance in a long query sequence. Then, we propose a novel algorithm with linear time complexity to check whether the support of a substring pattern is above a predefined threshold in the query sequence, thus greatly reducing the computational complexity of MACFP. With this fast decision algorithm, we can efficiently solve the original pattern discovery problem with several indexing and searching techniques. Comprehensive experiments on sequence pattern analysis and a study on cancer genomics application demonstrate the effectiveness and efficiency of our algorithm, compared to several existing methods.</p>","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5292242/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84912855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Binary Classifier Calibration Using an Ensemble of Linear Trend Estimation.
Mahdi Pakdaman Naeini, Gregory F Cooper

Learning accurate probabilistic models from data is crucial in many practical tasks in data mining. In this paper we present a new non-parametric calibration method called ensemble of linear trend estimation (ELiTE). ELiTE utilizes the recently proposed ℓ1 trend filtering signal approximation method [22] to find the mapping from uncalibrated classification scores to calibrated probability estimates. ELiTE is designed to address the key limitations of histogram binning-based calibration methods, which are (1) the use of a piecewise-constant form of the calibration mapping defined by bins, and (2) the assumption that predicted probabilities are independent for instances located in different bins. The method post-processes the output of a binary classifier to obtain calibrated probabilities, so it can be applied with many existing classification models. We demonstrate the performance of ELiTE on real datasets for commonly used binary classification models. Experimental results show that the method outperforms several common binary-classifier calibration methods. In particular, ELiTE commonly performs statistically significantly better than the other methods, and never worse. Moreover, it is able to improve the calibration power of classifiers while retaining their discrimination power. The method is also computationally tractable for large-scale datasets, as it runs in practically O(N log N) time, where N is the number of samples.
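The cvxpy sketch below shows a single ℓ1 trend filtering fit used as a calibration map: sorted classifier scores are paired with their 0/1 labels, and a piecewise-linear probability curve is obtained by penalizing second differences. ELiTE is an ensemble of such fits with additional machinery, so the penalty weight, the synthetic scores, and the bare single fit here are assumptions for illustration.

```python
# Single l1-trend-filtering calibration fit (illustrative; ELiTE is an ensemble of such fits).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
scores = np.sort(rng.uniform(size=300))                    # uncalibrated scores, sorted
true_prob = 1 / (1 + np.exp(-6 * (scores - 0.5)))          # assumed true calibration curve
labels = (rng.uniform(size=300) < true_prob).astype(float)

# l1 trend filtering: squared loss to the 0/1 labels plus an l1 penalty on second
# differences, which yields a piecewise-linear calibration function.
n = len(scores)
D = np.diff(np.eye(n), n=2, axis=0)                        # (n-2, n) second-difference matrix
p = cp.Variable(n)
lam = 20.0
objective = cp.Minimize(0.5 * cp.sum_squares(labels - p) + lam * cp.norm1(D @ p))
problem = cp.Problem(objective, [p >= 0, p <= 1])          # probabilities stay in [0, 1]
problem.solve()

calibrated = p.value                                        # calibrated estimate per score
print("mean absolute calibration error:", np.mean(np.abs(calibrated - true_prob)))
```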

{"title":"Binary Classifier Calibration Using an Ensemble of Linear Trend Estimation.","authors":"Mahdi Pakdaman Naeini,&nbsp;Gregory F Cooper","doi":"10.1137/1.9781611974348.30","DOIUrl":"https://doi.org/10.1137/1.9781611974348.30","url":null,"abstract":"<p><p>Learning accurate probabilistic models from data is crucial in many practical tasks in data mining. In this paper we present a new non-parametric calibration method called <i>ensemble of linear trend estimation</i> (ELiTE). ELiTE utilizes the recently proposed <i>ℓ</i><sub>1</sub> trend ltering signal approximation method [22] to find the mapping from uncalibrated classification scores to the calibrated probability estimates. ELiTE is designed to address the key limitations of the histogram binning-based calibration methods which are (1) the use of a piecewise constant form of the calibration mapping using bins, and (2) the assumption of independence of predicted probabilities for the instances that are located in different bins. The method post-processes the output of a binary classifier to obtain calibrated probabilities. Thus, it can be applied with many existing classification models. We demonstrate the performance of ELiTE on real datasets for commonly used binary classification models. Experimental results show that the method outperforms several common binary-classifier calibration methods. In particular, ELiTE commonly performs statistically significantly better than the other methods, and never worse. Moreover, it is able to improve the calibration power of classifiers, while retaining their discrimination power. The method is also computationally tractable for large scale datasets, as it is practically <i>O</i>(<i>N</i> log <i>N</i>) time, where <i>N</i> is the number of samples.</p>","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1137/1.9781611974348.30","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34868574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5