SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping.

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining
Pub Date: 2018-07-01
DOI: 10.1145/3219819.3219999

Ioakeim Perros, Evangelos E Papalexakis, Haesun Park, Richard Vuduc, Xiaowei Yan, Christopher Defilippi, Walter F Stewart, Jimeng Sun


Citations: 613

Abstract

This paper presents a new method, which we call SUSTain, that extends real-valued matrix and tensor factorizations to data whose values are integers. Such data are common when the values correspond to event counts or ordinal measures. The conventional approach is to treat integer data as real and then apply real-valued factorizations. However, doing so fails to preserve important characteristics of the original data, which makes the results hard to interpret. Instead, our approach extracts factor values from integer datasets as scores that are constrained to take values from a small integer set. These scores are easy to interpret: a score of zero indicates no feature contribution, and higher scores indicate distinct levels of feature importance. At its core, SUSTain relies on: a) partitioning the problem into integer-constrained subproblems, so that they can be solved optimally and efficiently; and b) ordering the solution of the subproblems to promote reuse of shared intermediate results. We propose two variants, SUSTain_M and SUSTain_T, to handle matrix and tensor inputs, respectively. We evaluate SUSTain against several state-of-the-art baselines on both synthetic and real Electronic Health Record (EHR) datasets. Compared to those baselines, SUSTain shows either significantly better fit or orders-of-magnitude speedups at a comparable fit (up to 425× faster). We apply SUSTain to EHR datasets to extract patient phenotypes (i.e., clinically meaningful patient clusters). Furthermore, 87% of the extracted phenotypes were validated by a cardiologist as clinically meaningful phenotypes related to heart failure.
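To make the idea of an integer-constrained subproblem that can be solved optimally more concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm: the function name `solve_scalar_integer_ls`, the single-variable least-squares setup, and the upper bound `tau` are assumptions chosen only for illustration. It shows that when one score is fitted at a time, the best integer in {0, ..., tau} is the real-valued least-squares solution rounded to the nearest integer and clipped to the allowed range, because the objective is a convex quadratic in that one variable.

```python
import math


def solve_scalar_integer_ls(a, b, tau):
    """Minimize sum_i (b[i] - a[i] * x)^2 over integers x in {0, ..., tau}.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    denom = sum(ai * ai for ai in a)
    if denom == 0:
        # x does not affect the fit, so return the "no contribution" score.
        return 0
    # Unconstrained real-valued least-squares minimizer.
    x_real = sum(ai * bi for ai, bi in zip(a, b)) / denom
    # A one-dimensional convex quadratic is symmetric around its minimizer,
    # so the best integer is the nearest one, clipped to the allowed range.
    x_int = int(math.floor(x_real + 0.5))
    return max(0, min(tau, x_int))


def brute_force(a, b, tau):
    """Reference answer: try every allowed integer score."""
    def loss(x):
        return sum((bi - ai * x) ** 2 for ai, bi in zip(a, b))
    return min(range(tau + 1), key=loss)


a = [1, 2, 0, 1]
b = [3, 7, 0, 2]
print(solve_scalar_integer_ls(a, b, tau=5))  # 3
print(solve_scalar_integer_ls(a, b, tau=2))  # 2 (clipped to the upper bound)
```

The brute-force helper is included only to check the closed-form answer on small inputs; it returns the same score as the rounding rule in both calls above.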




Source journal

KDD : proceedings. International Conference on Knowledge Discovery & Data Mining

Self-citation rate

0.00%

Number of publications