Pharmacovigilance via Baseline Regularization with Large-Scale Longitudinal Observational Data
Zhaobin Kuang, Peggy Peissig, Vítor Santos Costa, Richard Maclin, David Page
KDD 2017. DOI: 10.1145/3097983.3097998
Several prominent public health hazards [29] that occurred at the beginning of this century due to adverse drug events (ADEs) have raised international awareness among governments and industry of pharmacovigilance (PhV) [6,7], the science and activities devoted to monitoring and preventing adverse events caused by pharmaceutical products after they are introduced to the market. A major data source for PhV is large-scale longitudinal observational databases (LODs) [6] such as electronic health records (EHRs) and medical insurance claim databases. Inspired by the Self-Controlled Case Series (SCCS) model [27], arguably the leading method for ADE discovery from LODs, we propose baseline regularization, a regularized generalized linear model that leverages the diverse health profiles available in LODs across different individuals at different times. We apply the proposed method, as well as SCCS, to the Marshfield Clinic EHR. Experimental results suggest that the proposed method outperforms SCCS under various settings in identifying benchmark ADEs from the Observational Medical Outcomes Partnership ground truth [26].
{"title":"Pharmacovigilance via Baseline Regularization with Large-Scale Longitudinal Observational Data.","authors":"Zhaobin Kuang, Peggy Peissig, Vítor Santos Costa, Richard Maclin, David Page","doi":"10.1145/3097983.3097998","DOIUrl":"10.1145/3097983.3097998","url":null,"abstract":"<p><p>Several prominent public health hazards [29] that occurred at the beginning of this century due to adverse drug events (ADEs) have raised international awareness of governments and industries about pharmacovigilance (PhV) [6,7], the science and activities to monitor and prevent adverse events caused by pharmaceutical products after they are introduced to the market. A major data source for PhV is large-scale longitudinal observational databases (LODs) [6] such as electronic health records (EHRs) and medical insurance claim databases. Inspired by the Self-Controlled Case Series (SCCS) model [27], arguably the leading method for ADE discovery from LODs, we propose baseline regularization, a regularized generalized linear model that leverages the diverse health profiles available in LODs across different <i>individuals</i> at different <i>times</i>. We apply the proposed method as well as SCCS to the Marshfield Clinic EHR. Experimental results suggest that the proposed method outperforms SCCS under various settings in identifying <i>benchmark</i> ADEs from the Observational Medical Outcomes Partnership ground truth [26].</p>","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3097983.3097998","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36094259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data
David Hallac, Sagar Vare, Stephen Boyd, Jure Leskovec
KDD 2017. DOI: 10.1145/3097983.3098060
Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or clusters. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (e.g., walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call Toeplitz Inverse Covariance-based Clustering (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios.
{"title":"Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data.","authors":"David Hallac, Sagar Vare, Stephen Boyd, Jure Leskovec","doi":"10.1145/3097983.3098060","DOIUrl":"10.1145/3097983.3098060","url":null,"abstract":"<p><p>Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or <i>clusters</i>. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (<i>i.e.</i>, walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call <i>Toeplitz Inverse Covariance-based Clustering</i> (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios.</p>","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5951184/pdf/nihms933926.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36106210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PReP: Path-Based Relevance from a Probabilistic Perspective in Heterogeneous Information Networks
Yu Shi, Po-Wei Chan, Honglei Zhuang, Huan Gui, Jiawei Han
KDD 2017. DOI: 10.1145/3097983.3097990
As a powerful representation paradigm for networked and multi-typed data, the heterogeneous information network (HIN) is ubiquitous. Meanwhile, defining proper relevance measures has always been a fundamental problem of great pragmatic importance for network mining tasks. Inspired by our probabilistic interpretation of existing path-based relevance measures, we propose to study HIN relevance from a probabilistic perspective. From real-world data, we also identify, and propose to model, cross-meta-path synergy, a characteristic that is important for defining path-based HIN relevance and has not been modeled by existing methods. A generative model is established to derive a novel path-based relevance measure, which is data-driven and tailored to each HIN. We develop an inference algorithm to find the maximum a posteriori (MAP) estimate of the model parameters, which entails non-trivial tricks. Experiments on two real-world datasets demonstrate the effectiveness of the proposed model and relevance measure.
{"title":"PReP: Path-Based Relevance from a Probabilistic Perspective in Heterogeneous Information Networks.","authors":"Yu Shi, Po-Wei Chan, Honglei Zhuang, Huan Gui, Jiawei Han","doi":"10.1145/3097983.3097990","DOIUrl":"https://doi.org/10.1145/3097983.3097990","url":null,"abstract":"<p><p>As a powerful representation paradigm for networked and multi-typed data, the heterogeneous information network (HIN) is ubiquitous. Meanwhile, defining proper relevance measures has always been a fundamental problem and of great pragmatic importance for network mining tasks. Inspired by our probabilistic interpretation of existing path-based relevance measures, we propose to study HIN relevance from a probabilistic perspective. We also identify, from real-world data, and propose to model <i>cross-meta-path synergy</i>, which is a characteristic important for defining path-based HIN relevance and has not been modeled by existing methods. A generative model is established to derive a novel path-based relevance measure, which is data-driven and tailored for each HIN. We develop an inference algorithm to find the maximum a posteriori (MAP) estimate of the model parameters, which entails non-trivial tricks. Experiments on two real-world datasets demonstrate the effectiveness of the proposed model and relevance measure.</p>","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3097983.3097990","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36496372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Local Higher-Order Graph Clustering
Hao Yin, Austin R Benson, Jure Leskovec, David F Gleich
KDD 2017. DOI: 10.1145/3097983.3098069
Local graph clustering methods aim to find a cluster of nodes by exploring a small region of the graph. These methods are attractive because they enable targeted clustering around a given seed node and are faster than traditional global graph clustering methods because their runtime does not depend on the size of the input graph. However, current local graph partitioning methods are not designed to account for the higher-order structures crucial to the network, nor can they effectively handle directed networks. Here we introduce a new class of local graph clustering methods that address these issues by incorporating higher-order network information captured by small subgraphs, also called network motifs. We develop the Motif-based Approximate Personalized PageRank (MAPPR) algorithm that finds clusters containing a seed node with minimal motif conductance, a generalization of the conductance metric for network motifs. We generalize existing theory to prove the fast running time (independent of the size of the graph) and obtain theoretical guarantees on the cluster quality (in terms of motif conductance). We also develop a theory of node neighborhoods for finding sets that have small motif conductance, and apply these results to the case of finding good seed nodes to use as input to the MAPPR algorithm. Experimental validation on community detection tasks in both synthetic and real-world networks shows that our new framework, MAPPR, outperforms the current edge-based personalized PageRank methodology.
{"title":"Local Higher-Order Graph Clustering.","authors":"Hao Yin, Austin R Benson, Jure Leskovec, David F Gleich","doi":"10.1145/3097983.3098069","DOIUrl":"10.1145/3097983.3098069","url":null,"abstract":"<p><p>Local graph clustering methods aim to find a cluster of nodes by exploring a small region of the graph. These methods are attractive because they enable targeted clustering around a given seed node and are faster than traditional global graph clustering methods because their runtime does not depend on the size of the input graph. However, current local graph partitioning methods are not designed to account for the higher-order structures crucial to the network, nor can they effectively handle directed networks. Here we introduce a new class of local graph clustering methods that address these issues by incorporating higher-order network information captured by small subgraphs, also called network motifs. We develop the Motif-based Approximate Personalized PageRank (MAPPR) algorithm that finds clusters containing a seed node with minimal <i>motif conductance</i>, a generalization of the conductance metric for network motifs. We generalize existing theory to prove the fast running time (independent of the size of the graph) and obtain theoretical guarantees on the cluster quality (in terms of motif conductance). We also develop a theory of node neighborhoods for finding sets that have small motif conductance, and apply these results to the case of finding good seed nodes to use as input to the MAPPR algorithm. Experimental validation on community detection tasks in both synthetic and real-world networks, shows that our new framework MAPPR outperforms the current edge-based personalized PageRank methodology.</p>","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5951164/pdf/nihms933928.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36106211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computational Drug Repositioning Using Continuous Self-Controlled Case Series
Zhaobin Kuang, James Thomson, Michael Caldwell, Peggy Peissig, Ron Stewart, David Page
KDD 2016. DOI: 10.1145/2939672.2939715
Computational Drug Repositioning (CDR) is the task of discovering potential new indications for existing drugs by mining large-scale heterogeneous drug-related data sources. Leveraging the patient-level temporal ordering information between numeric physiological measurements and various drug prescriptions provided in Electronic Health Records (EHRs), we propose a Continuous Self-controlled Case Series (CSCCS) model for CDR. As an initial evaluation, we look for drugs that can control Fasting Blood Glucose (FBG) levels in our experiments. Applying CSCCS to the Marshfield Clinic EHR, well-known drugs that are indicated for controlling blood glucose level are rediscovered. Furthermore, some drugs with recent literature support for a potential effect on blood glucose control are also identified.
{"title":"Computational Drug Repositioning Using Continuous Self-Controlled Case Series.","authors":"Zhaobin Kuang, James Thomson, Michael Caldwell, Peggy Peissig, Ron Stewart, David Page","doi":"10.1145/2939672.2939715","DOIUrl":"https://doi.org/10.1145/2939672.2939715","url":null,"abstract":"<p><p>Computational Drug Repositioning (CDR) is the task of discovering potential new indications for existing drugs by mining large-scale heterogeneous drug-related data sources. Leveraging the patient-level temporal ordering information between numeric physiological measurements and various drug prescriptions provided in Electronic Health Records (EHRs), we propose a Continuous Self-controlled Case Series (CSCCS) model for CDR. As an initial evaluation, we look for drugs that can control Fasting Blood Glucose (FBG) level in our experiments. Applying CSCCS to the Marshfield Clinic EHR, well-known drugs that are indicated for controlling blood glucose level are rediscovered. Furthermore, some drugs with recent literature support for the potential effect of blood glucose level control are also identified.</p>","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2939672.2939715","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34832807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generalized Hierarchical Sparse Model for Arbitrary-Order Interactive Antigenic Sites Identification in Flu Virus Data
Lei Han, Yu Zhang, Xiu-Feng Wan, Tong Zhang
KDD 2016. DOI: 10.1145/2939672.2939786
Recent statistical evidence has shown that incorporating interactions among the original covariates/features into a regression model can significantly improve interpretability for biological data. One major challenge is the exponentially expanded feature space when high-order feature interactions are added to the model. To tackle the huge dimensionality, hierarchical sparse models (HSM) have been developed by enforcing sparsity under heredity structures in the interactions among the covariates. However, existing methods only consider pairwise interactions, making the discovery of important high-order interactions a non-trivial open problem. In this paper, we propose a generalized hierarchical sparse model (GHSM) as a generalization of HSM that tackles arbitrary-order interactions. The GHSM applies the ℓ1 penalty to all the model coefficients under the constraint that, given any covariate, if none of its associated kth-order interactions contribute to the regression model, then neither do its associated higher-order interactions. The resulting objective function is non-convex, with the challenge lying in the variables coupled by the arbitrary-order hierarchical constraints, and we devise an efficient optimization algorithm to solve it directly. Specifically, we use both the general iterative shrinkage and thresholding (GIST) and the alternating direction method of multipliers (ADMM) methods to decouple the constrained variables into three subproblems, each of which is proved to admit an efficient analytical solution. We evaluate the GHSM method on both synthetic problems and the antigenic site identification problem for influenza virus data, where we expand the feature space up to 5th-order interactions. Empirical results demonstrate the effectiveness and efficiency of the proposed methods, and the learned high-order interactions exhibit meaningful synergistic covariate patterns in influenza virus antigenicity.
{"title":"Generalized Hierarchical Sparse Model for Arbitrary-Order Interactive Antigenic Sites Identification in Flu Virus Data.","authors":"Lei Han, Yu Zhang, Xiu-Feng Wan, Tong Zhang","doi":"10.1145/2939672.2939786","DOIUrl":"https://doi.org/10.1145/2939672.2939786","url":null,"abstract":"<p><p>Recent statistical evidence has shown that a regression model by incorporating the interactions among the original covariates/features can significantly improve the interpretability for biological data. One major challenge is the exponentially expanded feature space when adding high-order feature interactions to the model. To tackle the huge dimensionality, hierarchical sparse models (HSM) are developed by enforcing sparsity under heredity structures in the interactions among the covariates. However, existing methods only consider pairwise interactions, making the discovery of important high-order interactions a non-trivial open problem. In this paper, we propose a generalized hierarchical sparse model (GHSM) as a generalization of the HSM models to tackle arbitrary-order interactions. The GHSM applies the ℓ<sub>1</sub> penalty to all the model coefficients under a constraint that given any covariate, if none of its associated <i>k</i>th-order interactions contribute to the regression model, then neither do its associated higher-order interactions. The resulting objective function is non-convex with a challenge lying in the coupled variables appearing in the arbitrary-order hierarchical constraints and we devise an efficient optimization algorithm to directly solve it. Specifically, we decouple the variables in the constraints via both the general iterative shrinkage and thresholding (GIST) and the alternating direction method of multipliers (ADMM) methods into three subproblems, each of which is proved to admit an efficiently analytical solution. We evaluate the GHSM method in both synthetic problem and the antigenic sites identification problem for the influenza virus data, where we expand the feature space up to the 5th-order interactions. Empirical results demonstrate the effectiveness and efficiency of the proposed methods and the learned high-order interactions have meaningful synergistic covariate patterns in the influenza virus antigenicity.</p>","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2939672.2939786","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34898415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ranking Causal Anomalies via Temporal and Dynamical Analysis on Vanishing Correlations
Wei Cheng, Kai Zhang, Haifeng Chen, Guofei Jiang, Zhengzhang Chen, Wei Wang
KDD 2016. DOI: 10.1145/2939672.2939765
The modern world has witnessed a dramatic increase in our ability to collect, transmit, and distribute real-time monitoring and surveillance data from large-scale information systems and cyber-physical systems. Detecting system anomalies thus attracts a significant amount of interest in many fields such as security, fault management, and industrial optimization. Recently, the invariant network has been shown to be a powerful way of characterizing complex system behaviours. In an invariant network, a node represents a system component and an edge indicates a stable, significant interaction between two components. Structures and evolutions of the invariant network, in particular the vanishing correlations, can shed important light on locating causal anomalies and performing diagnosis. However, existing approaches that detect causal anomalies with the invariant network often use the percentage of vanishing correlations to rank possible causal components, which has several limitations: 1) fault propagation in the network is ignored; 2) the root causal anomalies may not always be the nodes with a high percentage of vanishing correlations; 3) temporal patterns of vanishing correlations are not exploited for robust detection. To address these limitations, in this paper we propose a network-diffusion-based framework to identify significant causal anomalies and rank them. Our approach can effectively model fault propagation over the entire invariant network and can perform joint inference on both the structural and the time-evolving broken invariance patterns. As a result, it can locate high-confidence anomalies that are truly responsible for the vanishing correlations and can compensate for unstructured measurement noise in the system. Extensive experiments on synthetic datasets, bank information system datasets, and coal plant cyber-physical system datasets demonstrate the effectiveness of our approach.
{"title":"Ranking Causal Anomalies via Temporal and Dynamical Analysis on Vanishing Correlations.","authors":"Wei Cheng, Kai Zhang, Haifeng Chen, Guofei Jiang, Zhengzhang Chen, Wei Wang","doi":"10.1145/2939672.2939765","DOIUrl":"https://doi.org/10.1145/2939672.2939765","url":null,"abstract":"<p><p>Modern world has witnessed a dramatic increase in our ability to collect, transmit and distribute real-time monitoring and surveillance data from large-scale information systems and cyber-physical systems. Detecting system anomalies thus attracts significant amount of interest in many fields such as security, fault management, and industrial optimization. Recently, invariant network has shown to be a powerful way in characterizing complex system behaviours. In the invariant network, a node represents a system component and an edge indicates a stable, significant interaction between two components. Structures and evolutions of the invariance network, in particular the vanishing correlations, can shed important light on locating causal anomalies and performing diagnosis. However, existing approaches to detect causal anomalies with the invariant network often use the percentage of vanishing correlations to rank possible casual components, which have several limitations: 1) fault propagation in the network is ignored; 2) the root casual anomalies may not always be the nodes with a high-percentage of vanishing correlations; 3) temporal patterns of vanishing correlations are not exploited for robust detection. To address these limitations, in this paper we propose a network diffusion based framework to identify significant causal anomalies and rank them. Our approach can effectively model fault propagation over the entire invariant network, and can perform joint inference on both the structural, and the time-evolving broken invariance patterns. As a result, it can locate high-confidence anomalies that are truly responsible for the vanishing correlations, and can compensate for unstructured measurement noise in the system. Extensive experiments on synthetic datasets, bank information system datasets, and coal plant cyber-physical system datasets demonstrate the effectiveness of our approach.</p>","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2939672.2939765","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35174806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Component Pursuit for Large-Scale Inverse Covariance Estimation
Lei Han, Yu Zhang, Tong Zhang
KDD 2016. DOI: 10.1145/2939672.2939851
Maximum likelihood estimation (MLE) for the Gaussian graphical model, also known as the inverse covariance estimation problem, has gained increasing interest recently. Most existing works assume that the inverse covariance estimator has sparse structure and then construct models with ℓ1 regularization. In this paper, unlike existing works, we study the inverse covariance estimation problem from another perspective by efficiently modeling low-rank structure in the inverse covariance, which is assumed to be a combination of a low-rank part and a diagonal matrix. One motivation for this assumption is that low-rank structure is common in many applications, including climate and financial analysis; another is that such an assumption can reduce the computational complexity of computing the inverse. Specifically, we propose an efficient COmponent Pursuit (COP) method to obtain the low-rank part, where each component can be sparse. For optimization, the COP method greedily learns a rank-one component in each iteration by maximizing the log-likelihood. Moreover, the COP algorithm enjoys several appealing properties, including an efficient solution in each iteration and a theoretical guarantee on the convergence of the greedy approach. Experiments on large-scale synthetic and real-world datasets with thousands to millions of variables show that the COP method is faster than state-of-the-art techniques for the inverse covariance estimation problem while achieving comparable log-likelihood on test data.
{"title":"Fast Component Pursuit for Large-Scale Inverse Covariance Estimation.","authors":"Lei Han, Yu Zhang, Tong Zhang","doi":"10.1145/2939672.2939851","DOIUrl":"https://doi.org/10.1145/2939672.2939851","url":null,"abstract":"<p><p>The maximum likelihood estimation (MLE) for the Gaussian graphical model, which is also known as the inverse covariance estimation problem, has gained increasing interest recently. Most existing works assume that inverse covariance estimators contain sparse structure and then construct models with the <i>ℓ</i><sub>1</sub> regularization. In this paper, different from existing works, we study the inverse covariance estimation problem from another perspective by efficiently modeling the low-rank structure in the inverse covariance, which is assumed to be a combination of a low-rank part and a diagonal matrix. One motivation for this assumption is that the low-rank structure is common in many applications including the climate and financial analysis, and another one is that such assumption can reduce the computational complexity when computing its inverse. Specifically, we propose an efficient COmponent Pursuit (COP) method to obtain the low-rank part, where each component can be sparse. For optimization, the COP method greedily learns a rank-one component in each iteration by maximizing the log-likelihood. Moreover, the COP algorithm enjoys several appealing properties including the existence of an efficient solution in each iteration and the theoretical guarantee on the convergence of this greedy approach. Experiments on large-scale synthetic and real-world datasets including thousands of millions variables show that the COP method is faster than the state-of-the-art techniques for the inverse covariance estimation problem when achieving comparable log-likelihood on test data.</p>","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2939672.2939851","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35060775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network Lasso: Clustering and Optimization in Large Graphs
David Hallac, Jure Leskovec, Stephen Boyd
KDD 2015. DOI: 10.1145/2783258.2783313
Convex optimization is an essential tool for modern data analysis, as it provides a framework to formulate and solve many problems in machine learning and data mining. However, general convex optimization solvers do not scale well, and scalable solvers are often specialized to work only on a narrow class of problems. Therefore, there is a need for simple, scalable algorithms that can solve many common optimization problems. In this paper, we introduce the network lasso, a generalization of the group lasso to a network setting that allows for simultaneous clustering and optimization on graphs. We develop an algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in a distributed and scalable manner, with guaranteed global convergence even on large graphs. We also examine a non-convex extension of this approach. We then demonstrate that many types of problems can be expressed in our framework. We focus on three in particular - binary classification, predicting housing prices, and event detection in time series data - comparing the network lasso to baseline approaches and showing that it is both a fast and an accurate method for solving large optimization problems.
{"title":"Network Lasso: Clustering and Optimization in Large Graphs.","authors":"David Hallac, Jure Leskovec, Stephen Boyd","doi":"10.1145/2783258.2783313","DOIUrl":"10.1145/2783258.2783313","url":null,"abstract":"<p><p>Convex optimization is an essential tool for modern data analysis, as it provides a framework to formulate and solve many problems in machine learning and data mining. However, general convex optimization solvers do not scale well, and scalable solvers are often specialized to only work on a narrow class of problems. Therefore, there is a need for simple, scalable algorithms that can solve many common optimization problems. In this paper, we introduce the <i>network lasso</i>, a generalization of the group lasso to a network setting that allows for simultaneous clustering and optimization on graphs. We develop an algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in a distributed and scalable manner, which allows for guaranteed global convergence even on large graphs. We also examine a non-convex extension of this approach. We then demonstrate that many types of problems can be expressed in our framework. We focus on three in particular - binary classification, predicting housing prices, and event detection in time series data - comparing the network lasso to baseline approaches and showing that it is both a fast and accurate method of solving large optimization problems.</p>","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2015-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2783258.2783313","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34655442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unfolding Physiological State: Mortality Modelling in Intensive Care Units
Marzyeh Ghassemi, Tristan Naumann, Finale Doshi-Velez, Nicole Brimmer, Rohit Joshi, Anna Rumshisky, Peter Szolovits
KDD 2014. DOI: 10.1145/2623330.2623742
Accurate knowledge of a patient's disease state and trajectory is critical in a clinical setting. Modern electronic healthcare records contain an increasingly large amount of data, and the ability to automatically identify the factors that influence patient outcomes stands to greatly improve the efficiency and quality of care. We examined the use of latent variable models (viz. Latent Dirichlet Allocation) to decompose free-text hospital notes into meaningful features, and the predictive power of these features for patient mortality. We considered three prediction regimes: (1) baseline prediction, (2) dynamic (time-varying) outcome prediction, and (3) retrospective outcome prediction. In each, our prediction task differs from the familiar time-varying situation in which data accumulate; since fewer patients have long ICU stays, as we move forward in time fewer patients are available and the prediction task becomes increasingly difficult. We found that latent topic-derived features were effective in determining patient mortality under three timelines: in-hospital, 30-day post-discharge, and 1-year post-discharge mortality. Our results demonstrated that the latent topic features important in predicting hospital mortality are very different from those that are important in post-discharge mortality. In general, latent topic features were more predictive than structured features, and a combination of the two performed best. The time-varying models that combined latent topic features and baseline features had AUCs that reached 0.85, 0.80, and 0.77 for in-hospital, 30-day post-discharge, and 1-year post-discharge mortality, respectively. Our results agreed with other work suggesting that the first 24 hours of patient information are often the most predictive of hospital mortality. Retrospective models that used a combination of latent topic features and structured features achieved AUCs of 0.96, 0.82, and 0.81 for in-hospital, 30-day, and 1-year mortality prediction. Our work focuses on the dynamic (time-varying) setting because models from this regime could facilitate an ongoing severity stratification system that helps direct care-staff resources and inform treatment strategies.
{"title":"Unfolding Physiological State: Mortality Modelling in Intensive Care Units.","authors":"Marzyeh Ghassemi, Tristan Naumann, Finale Doshi-Velez, Nicole Brimmer, Rohit Joshi, Anna Rumshisky, Peter Szolovits","doi":"10.1145/2623330.2623742","DOIUrl":"https://doi.org/10.1145/2623330.2623742","url":null,"abstract":"<p><p>Accurate knowledge of a patient's disease state and trajectory is critical in a clinical setting. Modern electronic healthcare records contain an increasingly large amount of data, and the ability to automatically identify the factors that influence patient outcomes stand to greatly improve the efficiency and quality of care. We examined the use of latent variable models (viz. Latent Dirichlet Allocation) to decompose free-text hospital notes into meaningful features, and the predictive power of these features for patient mortality. We considered three prediction regimes: (1) baseline prediction, (2) dynamic (time-varying) outcome prediction, and (3) retrospective outcome prediction. In each, our prediction task differs from the familiar time-varying situation whereby data accumulates; since fewer patients have long ICU stays, as we move forward in time fewer patients are available and the prediction task becomes increasingly difficult. We found that latent topic-derived features were effective in determining patient mortality under three timelines: inhospital, 30 day post-discharge, and 1 year post-discharge mortality. Our results demonstrated that the latent topic features important in predicting hospital mortality are very different from those that are important in post-discharge mortality. In general, latent topic features were more predictive than structured features, and a combination of the two performed best. The time-varying models that combined latent topic features and baseline features had AUCs that reached 0.85, 0.80, and 0.77 for in-hospital, 30 day post-discharge and 1 year post-discharge mortality respectively. Our results agreed with other work suggesting that the first 24 hours of patient information are often the most predictive of hospital mortality. Retrospective models that used a combination of latent topic features and structured features achieved AUCs of 0.96, 0.82, and 0.81 for in-hospital, 30 day, and 1-year mortality prediction. Our work focuses on the dynamic (time-varying) setting because models from this regime could facilitate an on-going severity stratification system that helps direct care-staff resources and inform treatment strategies.</p>","PeriodicalId":74037,"journal":{"name":"KDD : proceedings. International Conference on Knowledge Discovery & Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2623330.2623742","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32724871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}