KDD : proceedings. International Conference on Knowledge Discovery & Data Mining: Latest Articles

Interpretable Representation Learning for Healthcare via Capturing Disease Progression through Time.
Tian Bai, Brian L Egleston, Shanshan Zhang, Slobodan Vucetic

Various deep learning models have recently been applied to predictive modeling of Electronic Health Records (EHR). In medical claims data, which is a particular type of EHR data, each patient is represented as a sequence of temporally ordered, irregularly sampled visits to health providers, where each visit is recorded as an unordered set of medical codes specifying the patient's diagnosis and the treatment provided during the visit. Based on the observation that different patient conditions have different temporal progression patterns, in this paper we propose a novel interpretable deep learning model, called Timeline. The main novelty of Timeline is that it has a mechanism that learns time decay factors for every medical code. This allows Timeline to learn that chronic conditions have a longer-lasting impact on future visits than acute conditions. Timeline also has an attention mechanism that improves vector embeddings of visits. By analyzing the attention weights and disease progression functions of Timeline, it is possible to interpret the predictions and understand how risks of future visits change over time. We evaluated Timeline on two large-scale real-world data sets. The specific task was to predict the primary diagnosis category for the next hospital visit given previous visits. Our results show that Timeline has higher accuracy than state-of-the-art deep learning models based on RNN. In addition, we demonstrate that the time decay factors and attention weights learned by Timeline are in accord with medical knowledge and that Timeline can provide useful insight into its predictions.
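
The per-code time decay idea in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' model: the exponential decay form, the rate values, and the names `DECAY_RATE`, `code_weight`, and `visit_weights` are all assumptions made for this example. In Timeline the decay factors are learned from data rather than fixed by hand.

```python
import math

# Hypothetical per-code decay rates (per day). In Timeline these would be
# learned; here they are hard-coded so that a chronic condition decays slowly
# and an acute condition decays quickly.
DECAY_RATE = {"diabetes": 0.001, "flu": 0.05}


def code_weight(code, days_since_visit):
    """Weight of a past medical code, using an assumed exponential decay."""
    return math.exp(-DECAY_RATE[code] * days_since_visit)


def visit_weights(visits, now_day):
    """Map each past visit (day, codes) to a dict of code -> decayed weight."""
    return [
        {code: code_weight(code, now_day - day) for code in codes}
        for day, codes in visits
    ]


# Two irregularly spaced visits, each an unordered set of codes.
history = [(0, ["diabetes", "flu"]), (300, ["flu"])]
weights = visit_weights(history, now_day=365)
```

With these assumed rates, the year-old diabetes code keeps most of its weight while the year-old flu code contributes almost nothing, which is the kind of chronic-versus-acute difference the abstract describes.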

DOI: 10.1145/3219819.3219904; published 2018-08-01; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6484836/pdf/nihms-1019542.pdf
Citations: 0
Generalized Score Functions for Causal Discovery.
Biwei Huang, Kun Zhang, Yizhu Lin, Bernhard Schölkopf, Clark Glymour

Discovery of causal relationships from observational data is a fundamental problem. Roughly speaking, there are two types of methods for causal discovery, constraint-based ones and score-based ones. Score-based methods avoid the multiple testing problem and enjoy certain advantages compared to constraint-based ones. However, most of them need strong assumptions on the functional forms of causal mechanisms, as well as on data distributions, which limit their applicability. In practice the precise information of the underlying model class is usually unknown. If the above assumptions are violated, both spurious and missing edges may result. In this paper, we introduce generalized score functions for causal discovery based on the characterization of general (conditional) independence relationships between random variables, without assuming particular model classes. In particular, we exploit regression in RKHS to capture the dependence in a non-parametric way. The resulting causal discovery approach produces asymptotically correct results in rather general cases, which may have nonlinear causal mechanisms, a wide class of data distributions, mixed continuous and discrete data, and multidimensional variables. Experimental results on both synthetic and real-world data demonstrate the efficacy of our proposed approach.

DOI: 10.1145/3219819.3220104; published 2018-08-01
Citations: 99
SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping.
Ioakeim Perros, Evangelos E Papalexakis, Haesun Park, Richard Vuduc, Xiaowei Yan, Christopher Defilippi, Walter F Stewart, Jimeng Sun

This paper presents a new method, which we call SUSTain, that extends real-valued matrix and tensor factorizations to data where values are integers. Such data are common when the values correspond to event counts or ordinal measures. The conventional approach is to treat integer data as real, and then apply real-valued factorizations. However, doing so fails to preserve important characteristics of the original data, thereby making it hard to interpret the results. Instead, our approach extracts factor values from integer datasets as scores that are constrained to take values from a small integer set. These scores are easy to interpret: a score of zero indicates no feature contribution and higher scores indicate distinct levels of feature importance. At its core, SUSTain relies on: a) partitioning the problem into integer-constrained subproblems, so that they can be optimally solved in an efficient manner; and b) organizing the order of the subproblems' solution, to promote reuse of shared intermediate results. We propose two variants, SUSTain_M and SUSTain_T, to handle matrix and tensor inputs, respectively. We evaluate SUSTain against several state-of-the-art baselines on both synthetic and real Electronic Health Record (EHR) datasets. Compared to those baselines, SUSTain shows either significantly better fit or orders of magnitude speedups that achieve a comparable fit (up to 425× faster). We apply SUSTain to EHR datasets to extract patient phenotypes (i.e., clinically meaningful patient clusters). Furthermore, 87% of the extracted phenotypes were validated by a cardiologist as clinically meaningful phenotypes related to heart failure.

DOI: 10.1145/3219819.3219999; published 2018-07-01
Citations: 613
Network Inference via the Time-Varying Graphical Lasso.
David Hallac, Youngsuk Park, Stephen Boyd, Jure Leskovec

Many important problems can be modeled as a system of interconnected entities, where each entity is recording time-dependent observations or measurements. In order to spot trends, detect anomalies, and interpret the temporal dynamics of such data, it is essential to understand the relationships between the different entities and how these relationships evolve over time. In this paper, we introduce the time-varying graphical lasso (TVGL), a method of inferring time-varying networks from raw time series data. We cast the problem in terms of estimating a sparse time-varying inverse covariance matrix, which reveals a dynamic network of interdependencies between the entities. Since dynamic network inference is a computationally expensive task, we derive a scalable message-passing algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in an efficient way. We also discuss several extensions, including a streaming algorithm to update the model and incorporate new observations in real time. Finally, we evaluate our TVGL algorithm on both real and synthetic datasets, obtaining interpretable results and outperforming state-of-the-art baselines in terms of both accuracy and scalability.
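
The estimation problem described in the abstract above, a sparse inverse covariance matrix that varies over time, is commonly written as an optimization problem of the following shape. The notation is an assumption introduced here and does not appear in the abstract: \Theta_i is the inverse covariance matrix at time step i, S_i and n_i are the empirical covariance and the number of observations at that step, \lambda controls sparsity, \beta controls how strongly consecutive networks are tied together, and \psi is a convex penalty on the change between them.

```latex
\min_{\Theta_1,\dots,\Theta_T \succ 0}\;
  \sum_{i=1}^{T} \Bigl[ -\ell_i(\Theta_i) + \lambda \,\lVert \Theta_i \rVert_{\mathrm{od},1} \Bigr]
  \;+\; \beta \sum_{i=2}^{T} \psi\bigl(\Theta_i - \Theta_{i-1}\bigr),
\qquad
\ell_i(\Theta_i) = n_i \bigl( \log\det \Theta_i - \operatorname{Tr}(S_i \Theta_i) \bigr)
```

Here \lVert \cdot \rVert_{\mathrm{od},1} denotes the sum of absolute values of the off-diagonal entries. Setting \beta = 0 decouples the time steps into independent graphical lasso problems, which is one way to see what the temporal term adds.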

DOI: 10.1145/3097983.3098037; published 2017-08-01
Citations: 164
Learning Tree-Structured Detection Cascades for Heterogeneous Networks of Embedded Devices.
Hamid Dadkhahi, Benjamin M Marlin

In this paper, we present a new approach to learning cascaded classifiers for use in computing environments that involve networks of heterogeneous and resource-constrained, low-power embedded compute and sensing nodes. We present a generalization of the classical linear detection cascade to the case of tree-structured cascades where different branches of the tree execute on different physical compute nodes in the network. Different nodes have access to different features, as well as access to potentially different computation and energy resources. We concentrate on the problem of jointly learning the parameters for all of the classifiers in the cascade given a fixed cascade architecture and a known set of costs required to carry out the computation at each node. To accomplish the objective of joint learning of all detectors, we propose a novel approach to combining classifier outputs during training that better matches the hard cascade setting in which the learned system will be deployed. This work is motivated by research in the area of mobile health where energy efficient real time detectors integrating information from multiple wireless on-body sensors and a smart phone are needed for real-time monitoring and the delivery of just-in-time adaptive interventions. We evaluate our framework on mobile sensor-based human activity recognition and mobile health detector learning problems.
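
For orientation, the classical linear hard cascade that the abstract above generalizes can be sketched as below. This shows only the baseline behavior (early rejection with a known per-stage cost), not the tree-structured model or the joint training procedure from the paper. The function name `run_cascade`, the stage tuples, and the toy scoring functions are hypothetical.

```python
def run_cascade(stages, x):
    """Evaluate a linear hard cascade on input x.

    stages: list of (score_fn, threshold, cost) tuples, ordered cheap to costly.
    Returns (is_positive, total_cost). A stage that scores below its threshold
    rejects immediately, so later (more expensive) stages are never run.
    """
    total_cost = 0.0
    for score_fn, threshold, cost in stages:
        total_cost += cost
        if score_fn(x) < threshold:
            return False, total_cost
    return True, total_cost


# Toy two-stage cascade: a cheap first detector, then an expensive second one.
stages = [
    (lambda x: x[0], 0.5, 1.0),  # hypothetical low-power sensor-node detector
    (lambda x: x[1], 0.5, 5.0),  # hypothetical costlier detector
]

rejected_early = run_cascade(stages, [0.2, 0.9])
accepted = run_cascade(stages, [0.9, 0.9])
```

The first input is rejected by the cheap stage and only pays its cost; the second passes both stages and pays the sum of both costs.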

DOI: 10.1145/3097983.3098169; published 2017-08-01; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5765542/pdf/nihms928860.pdf
Citations: 0
Federated Tensor Factorization for Computational Phenotyping.
Yejin Kim, Jimeng Sun, Hwanjo Yu, Xiaoqian Jiang

Tensor factorization models offer an effective approach to convert massive electronic health records into meaningful clinical concepts (phenotypes) for data analysis. These models need a large amount of diverse samples to avoid population bias. An open challenge is how to derive phenotypes jointly across multiple hospitals, in which direct patient-level data sharing is not possible (e.g., due to institutional policies). In this paper, we developed a novel solution to enable federated tensor factorization for computational phenotyping without sharing patient-level data. We developed secure data harmonization and federated computation procedures based on alternating direction method of multipliers (ADMM). Using this method, the multiple hospitals iteratively update tensors and transfer secure summarized information to a central server, and the server aggregates the information to generate phenotypes. We demonstrated with real medical datasets that our method resembles the centralized training model (based on combined datasets) in terms of accuracy and phenotypes discovery while respecting privacy.

DOI: 10.1145/3097983.3098118; published 2017-08-01; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5652331/pdf/nihms880922.pdf
Citations: 0
GRAM: Graph-based Attention Model for Healthcare Representation Learning.
Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F Stewart, Jimeng Sun

Deep learning methods exhibit promising performance for predictive modeling in healthcare, but two important challenges remain: (1) Data insufficiency: often in healthcare predictive modeling, the sample size is insufficient for deep learning methods to achieve satisfactory results. (2) Interpretation: the representations learned by deep learning methods should align with medical knowledge. To address these challenges, we propose the GRaph-based Attention Model (GRAM), which supplements electronic health records (EHR) with hierarchical information inherent to medical ontologies. Based on the data volume and the ontology structure, GRAM represents a medical concept as a combination of its ancestors in the ontology via an attention mechanism. We compared the predictive performance (i.e., accuracy, data needs, interpretability) of GRAM to various methods including the recurrent neural network (RNN) in two sequential diagnoses prediction tasks and one heart failure prediction task. Compared to the basic RNN, GRAM achieved 10% higher accuracy for predicting diseases rarely observed in the training data and 3% improved area under the ROC curve for predicting heart failure using an order of magnitude less training data. Additionally, unlike other methods, the medical concept representations learned by GRAM are well aligned with the medical ontology. Finally, GRAM exhibits intuitive attention behaviors by adaptively generalizing to higher-level concepts when facing data insufficiency at the lower-level concepts.
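
The idea of representing a medical concept as an attention-weighted combination of itself and its ontology ancestors can be sketched as follows. This is a simplified illustration under stated assumptions, not GRAM's implementation: the attention scores are passed in directly rather than computed by a learned network, and the names `softmax`, `concept_representation`, and the toy embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concept_representation(embeddings, scores):
    """Attention-weighted sum of a concept's embedding and its ancestors'.

    embeddings: list of equal-length vectors, the concept first, then ancestors.
    scores: one unnormalized attention score per embedding (assumed given).
    Returns (combined_vector, attention_weights).
    """
    weights = softmax(scores)
    dim = len(embeddings[0])
    combined = [
        sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)
    ]
    return combined, weights


# Toy ontology path: leaf concept, its parent, and the root.
embs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
rep, attn = concept_representation(embs, [0.0, 0.0, 0.0])
```

With equal scores the three embeddings are averaged. Raising an ancestor's score shifts the representation toward that ancestor, which mirrors the abstract's point about falling back on higher-level concepts when a specific concept has little data.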

DOI: 10.1145/3097983.3098126; published 2017-08-01; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7954122/pdf/nihms-1675242.pdf
Citations: 0
MOLIERE: Automatic Biomedical Hypothesis Generation System.
Justin Sybrandt, Michael Shtutman, Ilya Safro

Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.

DOI: 10.1145/3097983.3098057. Published: 2017-08-01.
Citations: 49
A Data-driven Process Recommender Framework.
Sen Yang, Xin Dong, Leilei Sun, Yichen Zhou, Richard A Farneth, Hui Xiong, Randall S Burd, Ivan Marsic

We present an approach for improving the performance of complex knowledge-based processes by providing data-driven step-by-step recommendations. Our framework uses the associations between similar historic process performances and contextual information to determine the prototypical way of enacting the process. We introduce a novel similarity metric for grouping traces into clusters that incorporates temporal information about activity performance and handles concurrent activities. Our data-driven recommender system selects the appropriate prototype performance of the process based on user-provided context attributes. Our approach for determining the prototypes discovers the commonly performed activities and their temporal relationships. We tested our system on data from three real-world medical processes and achieved recommendation accuracy up to an F1 score of 0.77 (compared to an F1 score of 0.37 using ZeroR) with 63.2% of recommended enactments being within the first five neighbors of the actual historic enactments in a set of 87 cases. Our framework works as an interactive visual analytic tool for process mining. This work shows the feasibility of data-driven decision support system for complex knowledge-based processes.
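
The abstract above compares its F1 score against ZeroR, a baseline that always predicts the most frequent class seen in training. The sketch below shows how such a baseline and an F1 score could be computed on made-up labels. It is not the authors' evaluation code: the labels are hypothetical, and support-weighted averaging of per-class F1 is an assumption, since the abstract does not say which averaging was used.

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR baseline: return the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]

def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating it as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class support (an assumed averaging scheme)."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * n for c, n in support.items()) / total

# Hypothetical labels: each label stands for a recommended process prototype.
train_labels = ["A", "A", "B", "C"]
test_labels = ["A", "A", "B", "C", "A", "B"]

majority = zero_r(train_labels)
baseline_pred = [majority] * len(test_labels)
baseline_f1 = weighted_f1(test_labels, baseline_pred)
print(majority, round(baseline_f1, 4))
```

A recommender is only informative to the extent that its score exceeds this kind of constant-prediction baseline on the same test cases.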

DOI: 10.1145/3097983.3098174. Published: 2017-08-01.
Citations: 28
The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables.
Himabindu Lakkaraju, Jon Kleinberg, Jure Leskovec, Jens Ludwig, Sendhil Mullainathan

Evaluating whether machines improve on human performance is one of the central questions of machine learning. However, there are many domains where the data is selectively labeled in the sense that the observed outcomes are themselves a consequence of the existing choices of the human decision-makers. For instance, in the context of judicial bail decisions, we observe the outcome of whether a defendant fails to return for their court appearance only if the human judge decides to release the defendant on bail. This selective labeling makes it harder to evaluate predictive models as the instances for which outcomes are observed do not represent a random sample of the population. Here we propose a novel framework for evaluating the performance of predictive models on selectively labeled data. We develop an approach called contraction which allows us to compare the performance of predictive models and human decision-makers without resorting to counterfactual inference. Our methodology harnesses the heterogeneity of human decision-makers and facilitates effective evaluation of predictive models even in the presence of unmeasured confounders (unobservables) which influence both human decisions and the resulting outcomes. Experimental results on real world datasets spanning diverse domains such as health care, insurance, and criminal justice demonstrate the utility of our evaluation metric in comparing human decisions and machine predictions.
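
The abstract above names the contraction approach but does not spell out a procedure. The sketch below is one plausible reading, stated as an assumption rather than as the authors' algorithm: take the most lenient decision-maker, start from the cases that decision-maker released (the only ones with observed outcomes), drop the cases the model scores as riskiest until the target release rate is reached, and report the failures among the remaining cases as a fraction of that decision-maker's whole caseload. The field names (`maker`, `released`, `risk`, `failed`) and all data are hypothetical.

```python
from collections import defaultdict

def contraction_failure_rate(cases, target_rate):
    """Estimate a model's failure rate at target_rate using observed outcomes only."""
    by_maker = defaultdict(list)
    for case in cases:
        by_maker[case["maker"]].append(case)

    def acceptance_rate(rows):
        return sum(r["released"] for r in rows) / len(rows)

    # Caseload of the most lenient decision-maker.
    d_q = max(by_maker.values(), key=acceptance_rate)
    if target_rate > acceptance_rate(d_q):
        raise ValueError("target_rate exceeds the most lenient acceptance rate")

    # Released cases, ordered from lowest to highest model-predicted risk.
    r_q = sorted((r for r in d_q if r["released"]), key=lambda r: r["risk"])

    # Keep only as many low-risk cases as the target rate allows.
    keep = round(target_rate * len(d_q))
    kept = r_q[:keep]
    return sum(r["failed"] for r in kept) / len(d_q)

# Hypothetical data: maker "q" releases 4 of 5 cases, maker "p" releases 2 of 5.
cases = [
    {"maker": "q", "released": True, "risk": 0.10, "failed": False},
    {"maker": "q", "released": True, "risk": 0.20, "failed": True},
    {"maker": "q", "released": True, "risk": 0.70, "failed": False},
    {"maker": "q", "released": True, "risk": 0.90, "failed": True},
    {"maker": "q", "released": False, "risk": 0.95, "failed": None},
    {"maker": "p", "released": True, "risk": 0.30, "failed": False},
    {"maker": "p", "released": True, "risk": 0.40, "failed": False},
    {"maker": "p", "released": False, "risk": 0.80, "failed": None},
    {"maker": "p", "released": False, "risk": 0.60, "failed": None},
    {"maker": "p", "released": False, "risk": 0.85, "failed": None},
]

rate_at_40 = contraction_failure_rate(cases, 0.4)
rate_at_80 = contraction_failure_rate(cases, 0.8)
print(rate_at_40, rate_at_80)
```

Under this reading, the estimate never needs outcomes for cases that were not released, which is why it can only be computed for target rates at or below the most lenient decision-maker's own release rate.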

DOI: 10.1145/3097983.3098066. Published: 2017-08-01.
Citations: 125