Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Latest Papers

Effective and Real-time In-App Activity Analysis in Encrypted Internet Traffic Streams

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-04 DOI: 10.1145/3097983.3098049

Junming Liu, Yanjie Fu, Jingci Ming, Y. Ren, Leilei Sun, Hui Xiong

Mobile in-App service analysis, which aims to classify mobile internet traffic into different types of service usage, has become a challenging and emergent task for mobile service providers due to the increasing adoption of secure protocols for in-App services. While some efforts have been made to classify mobile internet traffic, existing methods rely on complex feature construction and a large storage cache, which lead to low processing speed and are thus not practical for online real-time scenarios. To this end, we develop an iterative analyzer for classifying encrypted mobile traffic in real time. Specifically, we first select an optimal set of the most discriminative features from raw features extracted from traffic packet sequences by a novel Maximizing Inner activity similarity and Minimizing Different activity similarity (MIMD) measurement. To develop the online analyzer, we first represent a traffic flow with a series of time windows, which are described by the optimal feature vector and are updated iteratively at the packet level. Instead of extracting feature elements from a series of raw traffic packets, our feature elements are updated when a new traffic packet is observed, and the storage of raw traffic packets is not required. The time windows generated from the same service usage activity are grouped by our proposed method, namely, recursive time continuity constrained KMeans clustering (rCKC). The feature vectors of cluster centers are then fed into a random forest classifier to identify the corresponding service usages. Finally, we provide extensive experiments on real-world Internet traffic data from WeChat, WhatsApp, and Facebook to demonstrate the effectiveness and efficiency of our approach. The results show that the proposed analyzer provides high accuracy in real-world scenarios and has a low storage cache requirement as well as fast processing speed.
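
As a rough illustration of the packet-level update described in this abstract, the sketch below keeps running statistics for one time window and updates them as each packet arrives, so no raw packets need to be stored. The feature set (packet count, mean and variance of packet length), the names `WindowStats` and `windows_from_stream`, the fixed one-second window, and the use of Welford's online update are all assumptions made for illustration; the abstract does not list the features that MIMD actually selects.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Running statistics for one time window (hypothetical feature set)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, packet_length: float) -> None:
        # Welford's online update: constant memory, no raw packets kept.
        self.count += 1
        delta = packet_length - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (packet_length - self.mean)

    def features(self) -> tuple:
        # Population variance of the packet lengths seen so far.
        variance = self.m2 / self.count if self.count else 0.0
        return (self.count, self.mean, variance)


def windows_from_stream(packets, window_seconds=1.0):
    """Turn a stream of (timestamp, length) packets into per-window features."""
    stats, current = None, None
    for timestamp, length in packets:
        index = int(timestamp // window_seconds)
        if index != current:
            # A packet from a new window arrived: emit the finished window.
            if stats is not None:
                yield stats.features()
            stats, current = WindowStats(), index
        stats.update(length)
    if stats is not None:
        yield stats.features()
```

In a pipeline like the one the abstract outlines, the per-window feature vectors would then be clustered and the cluster centers passed to a classifier; those stages are not sketched here.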

Citations: 47

metapath2vec: Scalable Representation Learning for Heterogeneous Networks

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-04 DOI: 10.1145/3097983.3098036

Yuxiao Dong, N. Chawla, A. Swami

We study the problem of representation learning in heterogeneous networks. Its unique challenges come from the existence of multiple types of nodes and links, which limit the feasibility of the conventional network embedding techniques. We develop two scalable representation learning models, namely metapath2vec and metapath2vec++. The metapath2vec model formalizes meta-path-based random walks to construct the heterogeneous neighborhood of a node and then leverages a heterogeneous skip-gram model to perform node embeddings. The metapath2vec++ model further enables the simultaneous modeling of structural and semantic correlations in heterogeneous networks. Extensive experiments show that metapath2vec and metapath2vec++ are able to not only outperform state-of-the-art embedding models in various heterogeneous network mining tasks, such as node classification, clustering, and similarity search, but also discern the structural and semantic correlations between diverse network objects.
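
The meta-path-based random walk mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `metapath_walk`, the dictionary-based graph representation, and the uniform choice among type-matching neighbors are assumptions.

```python
import random


def metapath_walk(neighbors, node_type, start, metapath, walk_length, rng=None):
    """Random walk that only steps to neighbors of the type the meta-path expects.

    neighbors: dict mapping node -> list of adjacent nodes
    node_type: dict mapping node -> type label, e.g. "A" (author) or "P" (paper)
    metapath:  cyclic type sequence such as ["A", "P", "A"] (first type == last type)
    """
    rng = rng or random.Random()
    if node_type[start] != metapath[0]:
        raise ValueError("start node must match the first type of the meta-path")
    cycle = metapath[:-1]  # drop the repeated last type so the pattern can loop
    walk = [start]
    for step in range(1, walk_length):
        wanted = cycle[step % len(cycle)]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk
```

Walks produced this way play the role of sentences for a skip-gram model; the heterogeneous skip-gram objective itself is not shown.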

Citations: 1732

On Finding Socially Tenuous Groups for Online Social Networks

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-04 DOI: 10.1145/3097983.3097995

Chih-Ya Shen, Liang-Hao Huang, De-Nian Yang, Hong-Han Shuai, Wang-Chien Lee, Ming-Syan Chen

Existing research on finding social groups mostly focuses on dense subgraphs in social networks. However, finding socially tenuous groups also has many important applications. In this paper, we introduce the notion of k-triangles to measure the tenuity of a group. We then formulate a new research problem, Minimum k-Triangle Disconnected Group (MkTG), to find a socially tenuous group from online social networks. We prove that MkTG is NP-hard and inapproximable within any ratio in arbitrary graphs but polynomial-time tractable in threshold graphs. Two algorithms, namely TERA and TERA-ADV, are designed to exploit graph-theoretical approaches for solving MkTG on general graphs effectively and efficiently. Experimental results on seven real datasets show that the proposed algorithms outperform existing approaches in both efficiency and solution quality.

Citations: 23

It Takes More than Math and Engineering to Hit the Bullseye with Data

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-04 DOI: 10.1145/3097983.3105816

P. Desai

Adopting algorithmic decision-making in a large and complex enterprise such as a Fortune 50 retailer like Target takes much more than clean, reliable data and great data mining capabilities. Yet data practitioners too often start with advanced math and fancy algorithms, rather than working hand-in-hand with business partners to identify and understand the biggest business problems. (Then teams should move on to how algorithms can be applied to those problems.) Another key step for data scientists at large organizations: ensuring that their business partners -- the merchants, marketers and supply chain experts -- have a baseline understanding of advanced models as well as the proper analytical support tools. Obtaining widespread buy-in and enthusiasm also requires providing a user-friendly interface for business partners with optionality and flexibility that allows the intelligence to be applied to the many varied issues facing a modern retailer, from personalization to supply chain transformation to decisions on assortment and pricing. This talk will explore effective practices and processes -- the do's and don'ts -- for data scientists to succeed in large, complex organizations like a retailer with 1,800+ stores, major marketing campaigns across multiple channels and a fast-growing online business.

Citations: 0

Functional Annotation of Human Protein Coding Isoforms via Non-convex Multi-Instance Learning

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-04 DOI: 10.1145/3097983.3097984

Tingjin Luo, Weizhong Zhang, Shang Qiu, Yang Yang, Dong-yun Yi, Guangtao Wang, Jieping Ye, Jie Wang

Functional annotation of human genes is fundamentally important for understanding the molecular basis of various genetic diseases. A major challenge in determining the functions of human genes lies in the functional diversity of proteins, that is, a gene can perform different functions as it may consist of multiple protein coding isoforms (PCIs). Therefore, differentiating the functions of PCIs can significantly deepen our understanding of the functions of genes. However, due to the lack of isoform-level gold standards (ground-truth annotation), many existing functional annotation approaches are developed at the gene level. In this paper, we propose a novel approach to differentiate the functions of PCIs by integrating sparse simplex projection (that is, a non-convex sparsity-inducing regularizer) with the framework of multi-instance learning (MIL). Specifically, we label the genes that are annotated to the function under consideration as positive bags and the genes without the function as negative bags. Then, by sparse projections onto the simplex, we learn a mapping that embeds the original bag space into a discriminative feature space. Our framework is flexible enough to incorporate various smooth and non-smooth loss functions such as logistic loss and hinge loss. To solve the resulting highly nontrivial non-convex and non-smooth optimization problem, we further develop an efficient block coordinate descent algorithm. Extensive experiments on human genome data demonstrate that the proposed approaches significantly outperform the state-of-the-art methods in terms of both efficiency and the functional annotation accuracy of human PCIs.
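
To give a feel for what a sparse projection onto the simplex can look like, here is a minimal sketch. It assumes one common construction (keep the `k` largest entries, then apply the standard sort-based Euclidean projection onto the probability simplex); the abstract does not spell out the exact operator used in the paper, so the function names and the choice of `k` are hypothetical.

```python
def project_onto_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) == 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:
            theta = t  # the last j that passes this check gives the threshold
    return [max(x - theta, 0.0) for x in v]


def sparse_simplex_projection(v, k):
    """Keep the k largest entries of v, project them onto the simplex, zero the rest."""
    top = sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k]
    projected = project_onto_simplex([v[i] for i in top])
    result = [0.0] * len(v)
    for i, value in zip(top, projected):
        result[i] = value
    return result
```

The output is non-negative, sums to one, and has at most `k` non-zero entries, which is the sparsity-inducing effect the abstract refers to.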

Citations: 17

Learning from Multiple Teacher Networks

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-04 DOI: 10.1145/3097983.3098135

Shan You, Chang Xu, Chao Xu, D. Tao

Training thin deep networks following the student-teacher learning paradigm has received intensive attention because of its excellent performance. However, to the best of our knowledge, most existing work considers only a single teacher network. In practice, a student may access multiple teachers, and multiple teacher networks together provide comprehensive guidance that is beneficial for training the student network. In this paper, we present a method to train a thin deep network by incorporating multiple teacher networks not only in the output layer, by averaging the softened outputs (dark knowledge) from different networks, but also in the intermediate layers, by imposing a constraint on the dissimilarity among examples. We suggest that the relative dissimilarity between intermediate representations of different examples serves as more flexible and appropriate guidance from teacher networks. Triplets are then utilized to encourage the consistency of these relative dissimilarity relationships between the student network and the teacher networks. Moreover, we leverage a voting strategy to unify the relative dissimilarity information provided by multiple teacher networks, which realizes their incorporation in the intermediate layers. Extensive experimental results demonstrate that our method is capable of generating a well-performing student network, with classification accuracy comparable or even superior to all teacher networks, yet with far fewer parameters and much faster running time.
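
The output-layer part of this idea, averaging the softened outputs of several teachers, can be sketched as below. This is an illustrative sketch only: the temperature value of 4.0, the function names, and the plain cross-entropy loss are assumptions, and the intermediate-layer triplet constraint and voting strategy are not covered.

```python
import math


def softened_softmax(logits, temperature):
    """Softmax of logits / temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def averaged_teacher_targets(teacher_logits, temperature=4.0):
    """Average the softened class distributions of several teachers for one example."""
    softened = [softened_softmax(logits, temperature) for logits in teacher_logits]
    n_teachers = len(softened)
    return [sum(column) / n_teachers for column in zip(*softened)]


def soft_cross_entropy(student_logits, targets, temperature=4.0):
    """Cross-entropy between the averaged targets and the softened student output."""
    student = softened_softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(targets, student))
```

For example, `averaged_teacher_targets([[2.0, 1.0, 0.0], [1.0, 2.0, 0.0]])` gives a distribution in which the first two classes share the same probability, since the two teachers disagree symmetrically.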

Citations: 266

Patient Subtyping via Time-Aware LSTM Networks

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-04 DOI: 10.1145/3097983.3097997

Inci M. Baytas, Cao Xiao, Xi Sheryl Zhang, Fei Wang, Anil K. Jain, Jiayu Zhou

In the study of various diseases, heterogeneity among patients usually leads to different progression patterns and may require different types of therapeutic intervention. Therefore, it is important to study patient subtyping, that is, the grouping of patients into disease-characterizing subtypes. Subtyping from complex patient data is challenging because of the information heterogeneity and temporal dynamics. Long Short-Term Memory (LSTM) has been successfully used in many domains for processing sequential data, and has recently been applied to analyzing longitudinal patient records. The LSTM units are designed to handle data with constant elapsed times between consecutive elements of a sequence. Given that the time lapse between successive elements in patient records can vary from days to months, the design of traditional LSTM may lead to suboptimal performance. In this paper, we propose a novel LSTM unit called Time-Aware LSTM (T-LSTM) to handle irregular time intervals in longitudinal patient records. We learn a subspace decomposition of the cell memory which enables time decay to discount the memory content according to the elapsed time. We propose a patient subtyping model that leverages the proposed T-LSTM in an auto-encoder to learn a powerful single representation for the sequential records of patients, which is then used to cluster patients into clinical subtypes. Experiments on synthetic and real-world datasets show that the proposed T-LSTM architecture captures the underlying structures in sequences with time irregularities.
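
The memory-discounting step described in this abstract can be sketched as follows: the previous cell memory is split into a short-term and a long-term part, and only the short-term part is discounted by a decreasing function of the elapsed time. The specific decay function `1 / log(e + elapsed)`, the tanh decomposition layer, and all names here are assumptions for illustration; the abstract itself does not give the formulas.

```python
import math


def time_decay(elapsed):
    """Decreasing weight for the elapsed time between two records (assumed form)."""
    return 1.0 / math.log(math.e + elapsed)


def adjust_cell_memory(prev_cell, weights, bias, elapsed):
    """Discount the short-term part of the previous cell memory by elapsed time.

    prev_cell: list of floats (previous cell state)
    weights:   square matrix (list of rows) of the learned decomposition layer
    bias:      list of floats, one per cell dimension
    """
    # Short-term component extracted by a (normally learned) tanh layer.
    short_term = [
        math.tanh(sum(w * c for w, c in zip(row, prev_cell)) + b)
        for row, b in zip(weights, bias)
    ]
    # Long-term component is whatever remains of the cell memory.
    long_term = [c - s for c, s in zip(prev_cell, short_term)]
    decay = time_decay(elapsed)
    # Only the short-term part is discounted before the usual LSTM gates run.
    return [l + decay * s for l, s in zip(long_term, short_term)]
```

With zero elapsed time the decay weight is 1 and the cell memory passes through unchanged; longer gaps shrink the short-term contribution while leaving the long-term part intact.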

Citations: 475

Meta-Graph Based Recommendation Fusion over Heterogeneous Information Networks

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-04 DOI: 10.1145/3097983.3098063

Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, Lee

A heterogeneous information network (HIN) is a natural and general representation of data in modern large commercial recommender systems, which involve heterogeneous types of data. HIN-based recommenders face two problems: how to represent the high-level semantics of recommendations and how to fuse the heterogeneous information to make recommendations. In this paper, we solve the two problems by first introducing the concept of meta-graph to HIN-based recommendation, and then solving the information fusion problem with a "matrix factorization (MF) + factorization machine (FM)" approach. For the similarities generated by each meta-graph, we perform standard MF to generate latent features for both users and items. With different meta-graph-based features, we propose to use FM with group lasso (FMG) to automatically learn from the observed ratings and effectively select useful meta-graph-based features. Experimental results on two real-world datasets, Amazon and Yelp, show the effectiveness of our approach compared to state-of-the-art FM and other HIN-based recommendation algorithms.
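
The factorization machine half of the "MF + FM" approach can be illustrated with the standard second-order FM scoring function. This is a generic sketch rather than the paper's FMG model: the function name is hypothetical, the input `x` is assumed to be the concatenated user and item latent features produced by MF for each meta-graph, and the group lasso regularizer used during training is not shown.

```python
def fm_predict(x, w0, w, v):
    """Second-order factorization machine score for one feature vector.

    x:  list of n feature values (assumed: concatenated user/item latent features)
    w0: global bias
    w:  list of n linear weights
    v:  n x k list of latent factor rows, one per feature
    """
    n = len(x)
    linear = sum(wi * xi for wi, xi in zip(w, x))
    k = len(v[0])
    pairwise = 0.0
    for f in range(k):
        # Identity: sum_{i<j} v_if v_jf x_i x_j = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise
```

The rewritten pairwise term costs O(nk) per prediction instead of the O(n²k) of the naive double loop over feature pairs.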

Citations: 436

Three Principles of Data Science: Predictability, Stability and Computability

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-04 DOI: 10.1145/3097983.3105808

Bin Yu

In this talk, I'd like to discuss the intertwining importance and connections of the three principles of data science named in the title for data-driven decisions. With prediction as its central task and computation at its core, machine learning has enabled wide-ranging data-driven successes. Prediction is a useful way to check against reality. Good prediction implicitly assumes stability between past and future. Stability (relative to data and model perturbations) is also a minimum requirement for interpretability and reproducibility of data-driven results (cf. Yu, 2013). It is closely related to uncertainty assessment. Obviously, neither the prediction nor the stability principle can be employed without feasible computational algorithms, hence the importance of computability. The three principles will be demonstrated in the context of two neuroscience projects and through analytical connections. In particular, the first project adds stability to predictive modeling used for reconstruction of movies from fMRI brain signals for interpretable models. The second project uses predictive transfer learning that combines AlexNet, GoogLeNet and VGG with single V4 neuron data for state-of-the-art prediction performance. Our results lend support, to a certain extent, to the resemblance of these CNNs to the brain and at the same time provide stable pattern interpretations of neurons in the difficult primate visual cortex V4.

Citations: 8

Groups-Keeping Solution Path Algorithm for Sparse Regression with Automatic Feature Grouping

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-04 DOI: 10.1145/3097983.3098010

Bin Gu, Guodong Liu, Heng Huang

Feature selection is one of the most important data mining research topics, with many applications. In practical problems, features often have a group structure that affects the outcomes. Thus, it is crucial to automatically identify homogeneous groups of features for high-dimensional data analysis. The octagonal shrinkage and clustering algorithm for regression (OSCAR) is an important sparse regression approach that performs automatic feature grouping and selection via an ℓ1 norm and a pairwise ℓ∞ norm. However, due to the overly complex representation of the penalty (especially the pairwise ℓ∞ norm), OSCAR has so far had no solution path algorithm, which is the kind of algorithm most useful for tuning the model. To address this challenge, in this paper, we propose a groups-keeping solution path algorithm to solve the OSCAR model (OscarGKPath). Given a set of homogeneous groups of features and an accuracy bound ε, OscarGKPath can fit the solutions in an interval of regularization parameters while keeping the feature groups. The entire solution path can be obtained by combining multiple such intervals. We prove that all solutions in the solution path produced by OscarGKPath strictly satisfy the given accuracy bound ε. The experimental results on benchmark datasets not only confirm the effectiveness of our OscarGKPath algorithm, but also show its superiority in cross validation compared with the existing batch algorithm.
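
As an illustration of the penalty mentioned in the abstract, the following minimal Python sketch evaluates the standard OSCAR regularizer, an ℓ1 term plus a sum of pairwise ℓ∞ terms. It is not the OscarGKPath algorithm and is not taken from the paper; the function names and the weights `lam1` and `lam2` are assumptions chosen for this example. The second function uses the well-known equivalent sorted form, in which the j-th smallest absolute coefficient (counting from zero) is the maximum in exactly j pairs.

```python
from itertools import combinations


def oscar_penalty(beta, lam1, lam2):
    """Direct form: lam1 * sum_j |b_j| + lam2 * sum_{j<k} max(|b_j|, |b_k|).

    Names and parameterization are illustrative assumptions.
    """
    abs_beta = [abs(b) for b in beta]
    l1_term = sum(abs_beta)
    # Pairwise l-infinity norm of each coefficient pair.
    pairwise_linf_term = sum(max(a, b) for a, b in combinations(abs_beta, 2))
    return lam1 * l1_term + lam2 * pairwise_linf_term


def oscar_penalty_sorted(beta, lam1, lam2):
    """Equivalent sorted form, O(p log p) instead of O(p^2).

    After sorting |b| in ascending order, the element at 0-based rank j
    is the larger member of exactly j pairs.
    """
    abs_sorted = sorted(abs(b) for b in beta)
    return sum((lam1 + lam2 * j) * a for j, a in enumerate(abs_sorted))


beta = [0.5, -2.0, 2.0, 0.0]
print(oscar_penalty(beta, 1.0, 0.5))         # 9.75
print(oscar_penalty_sorted(beta, 1.0, 0.5))  # 9.75
```

The pairwise max term is what encourages coefficients of equal magnitude, such as -2.0 and 2.0 above, to be grouped together; it is also the part the abstract identifies as making a solution path algorithm hard to derive.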

Citations: 10