Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Latest Articles

Deep Choice Model Using Pointer Networks for Airline Itinerary Prediction

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098005

Alejandro Mottini, Rodrigo Acuna-Agost

Travel providers such as airlines and online travel agents are increasingly interested in understanding how passengers choose among alternative itineraries when searching for flights. This knowledge helps them better display and adapt their offer, taking into account market conditions and customer needs. Common applications include not only filtering and sorting alternatives, but also changing certain attributes in real time (e.g., changing the price). In this paper, we concentrate on the problem of modeling air passenger choices of flight itineraries. This problem has historically been tackled using classical Discrete Choice Modelling techniques. Traditional statistical approaches, in particular the Multinomial Logit model (MNL), are widely used in industrial applications due to their simplicity and generally good performance. However, MNL models have several shortcomings and rely on assumptions that might not hold in real applications. To overcome these difficulties, we present a new choice model based on Pointer Networks. Given an input sequence, this type of deep neural architecture combines Recurrent Neural Networks with the Attention Mechanism to learn the conditional probability of an output whose values correspond to positions in an input sequence. Therefore, given a sequence of different alternatives presented to a customer, the model can learn to point to the one most likely to be chosen by the customer. The proposed method was evaluated on a real dataset that combines online user search logs and airline flight bookings. Experimental results show that the proposed model outperforms the traditional MNL model on several metrics.
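
To make the contrast in this abstract concrete, here is a minimal sketch of the two scoring schemes it mentions. The MNL part is the standard softmax over linear utilities; the pointer part uses the additive attention score from the general Pointer Networks formulation, u_i = v^T tanh(W1 e_i + W2 d). The feature columns, coefficients, and random weights below are hypothetical placeholders, and the encoder/decoder states would come from trained recurrent networks in a real model; this is not the authors' implementation.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def mnl_choice_probabilities(features, beta):
    """MNL: utility is linear in the features; P(i) is a softmax of utilities."""
    utilities = features @ beta
    return softmax(utilities)

def pointer_choice_probabilities(encoder_states, decoder_state, W1, W2, v):
    """Pointer-style attention: u_i = v^T tanh(W1 e_i + W2 d), P = softmax(u)."""
    scores = np.tanh(encoder_states @ W1.T + decoder_state @ W2.T) @ v
    return softmax(scores)

# Hypothetical itineraries: columns are [price in hundreds, number of stops].
itineraries = np.array([[3.0, 0.0], [2.5, 1.0], [4.0, 0.0]])
beta = np.array([-1.0, -0.5])  # assumed coefficients, not estimated from data
mnl_probs = mnl_choice_probabilities(itineraries, beta)

# Random placeholders standing in for learned states and weights.
rng = np.random.default_rng(0)
hidden, attn = 4, 5
encoder_states = rng.normal(size=(3, hidden))  # one state per alternative
decoder_state = rng.normal(size=hidden)
W1 = rng.normal(size=(attn, hidden))
W2 = rng.normal(size=(attn, hidden))
v = rng.normal(size=attn)
pointer_probs = pointer_choice_probabilities(encoder_states, decoder_state, W1, W2, v)
```

In both cases the output is a probability distribution over the positions of the presented alternatives. The difference is that the MNL score of an alternative depends only on its own features, while the pointer score depends on states that can encode the whole sequence of alternatives.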

Citations: 49

A Temporally Heterogeneous Survival Framework with Application to Social Behavior Dynamics

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098189

Linyun Yu, Peng Cui, Chaoming Song, T. Zhang, Shiqiang Yang

Social behavior dynamics is one of the central building blocks in understanding and modeling complex social dynamic phenomena, such as information spreading, opinion formation, and social mobilization. While a wide range of models for social behavior dynamics have been proposed in recent years, the essential ingredients and the minimum model for social behavior dynamics remain largely unknown. Here, by exploring a large-scale social communication dataset, we find that human interaction behavior dynamics exhibit rich complexities over the response time dimension and the natural time dimension. To tackle this challenge, we develop a temporal Heterogeneous Survival framework in which the regularities in the response time dimension and the natural time dimension can be organically integrated. We apply our model to two online social communication datasets. Our model successfully regenerates the interaction patterns in the social communication datasets, and the results demonstrate that the proposed method significantly outperforms other state-of-the-art baselines. Meanwhile, the learnt parameters and discovered statistical regularities can lead to multiple potential applications.

Citations: 13

A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098024

Pavel A. Dmitriev, Somit Gupta, Dong Woo Kim, G. J. Vaz

Online controlled experiments (e.g., A/B tests) are now regularly used to guide product development and accelerate innovation in software. Product ideas are evaluated as scientific hypotheses, and tested in web sites, mobile applications, desktop applications, services, and operating systems. One of the key challenges for organizations that run controlled experiments is to come up with the right set of metrics [1] [2] [3]. Having good metrics, however, is not enough. In our experience of running thousands of experiments with many teams across Microsoft, we observed again and again how incorrect interpretations of metric movements may lead to wrong conclusions about the experiment's outcome, which if deployed could hurt the business by millions of dollars. Inspired by Steven Goodman's twelve p-value misconceptions [4], in this paper, we share twelve common metric interpretation pitfalls which we observed repeatedly in our experiments. We illustrate each pitfall with a puzzling example from a real experiment, and describe processes, metric design principles, and guidelines that can be used to detect and avoid the pitfall. With this paper, we aim to increase the experimenters' awareness of metric interpretation issues, leading to improved quality and trustworthiness of experiment results and better data-driven decisions.

Citations: 91

Supporting Employer Name Normalization at both Entity and Cluster Level

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098093

Qiaoling Liu, F. Javed, Vachik S. Dave, Ankita Joshi

In the recruitment domain, the employer name normalization task, which links employer names in job postings or resumes to entities in an employer knowledge base (KB), is important to many business applications. In previous work, we proposed the CompanyDepot system, which used machine learning techniques to address the problem. After applying it to several applications at CareerBuilder, we faced several new challenges: 1) how to avoid duplicate normalization results when the KB is noisy and contains many duplicate entities; 2) how to address the vocabulary gap between query names and entity names in the KB; and 3) how to use the context available in jobs and resumes to improve normalization quality. To address these challenges, in this paper we extend the previous CompanyDepot system to normalize employer names not only at the entity level, but also at the cluster level, by mapping a query to the cluster in the KB that best matches it. We also propose a new metric based on success rate and diversity reduction ratio for evaluating cluster-level normalization. Moreover, we perform query expansion based on five data sources to address the vocabulary gap challenge, and leverage the URL context available for the employer names in many jobs and resumes to improve normalization quality. We show that the proposed CompanyDepot-V2 system outperforms the previous CompanyDepot system and several other baseline systems over multiple real-world datasets. We also demonstrate a large improvement in normalization quality when moving from entity-level to cluster-level normalization.

Citations: 9

Quick Access: Building a Smart Experience for Google Drive

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098048

Sandeep Tata, Alexandrin Popescul, Marc Najork, Mike Colagrosso, Julian Gibbons, Alan Green, Alexandre Mah, Michael Smith, Divanshu Garg, Cayden Meyer, Reuben Kan

Google Drive is a cloud storage and collaboration service used by hundreds of millions of users around the world. Quick Access is a new feature in Google Drive that surfaces the most relevant documents when a user visits the home screen. Our metrics show that users locate their documents in half the time with this feature compared to previous approaches. The development of Quick Access illustrates many general challenges and constraints associated with practical machine learning such as protecting user privacy, working with data services that are not designed with machine learning in mind, and evolving product definitions. We believe that the lessons learned from this experience will be useful to practitioners tackling a wide range of applied machine learning problems.

Citations: 24

HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098026

Shifu Hou, Yanfang Ye, Yangqiu Song, Melih Abdulhayoglu

With the explosive growth of Android malware and the severity of the damage it causes to smartphone users, the detection of Android malware has become increasingly important in cybersecurity. The increasing sophistication of Android malware calls for new defensive techniques that are effective against novel threats and harder to evade. In this paper, to detect Android malware, instead of using Application Programming Interface (API) calls only, we further analyze the different relationships between them and create higher-level semantics that require more effort for attackers to evade detection. We represent the Android applications (apps), related APIs, and their rich relationships as a structured heterogeneous information network (HIN). Then we use a meta-path based approach to characterize the semantic relatedness of apps and APIs. We use each meta-path to formulate a similarity measure over Android apps, and aggregate different similarities using multi-kernel learning. Then each meta-path is automatically weighted by the learning algorithm to make predictions. To the best of our knowledge, this is the first work to use a structured HIN for Android malware detection. Comprehensive experiments on real sample collections from Comodo Cloud Security Center are conducted to compare various malware detection approaches. Promising experimental results demonstrate that our developed system HinDroid outperforms other alternative Android malware detection techniques.
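
The meta-path similarity idea can be sketched with small matrices. Below, a meta-path is turned into an app-to-app similarity by multiplying the relation matrices along the path (a commuting matrix), and two such similarities are combined with fixed weights. The toy matrices, the generic "related APIs" relation, and the equal weights are all hypothetical; the abstract states that the weights are learned by multi-kernel learning, which this sketch does not implement.

```python
import numpy as np

# Hypothetical toy data: 3 apps x 4 APIs; a 1 means the app calls the API.
app_api = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
])

# Hypothetical symmetric API-API relation (APIs 0,1 related; APIs 2,3 related).
api_api = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

def commuting_matrix(*relations):
    """Multiply the relation matrices along a meta-path, left to right."""
    result = relations[0]
    for relation in relations[1:]:
        result = result @ relation
    return result

# Meta-path app -> API -> app: counts the APIs two apps share.
sim_direct = commuting_matrix(app_api, app_api.T)

# Meta-path app -> API -> related API -> app: also links apps whose
# APIs are merely related, not identical.
sim_related = commuting_matrix(app_api, api_api, app_api.T)

# Fixed equal weights as a stand-in for learned per-meta-path weights.
weights = [0.5, 0.5]
combined = weights[0] * sim_direct + weights[1] * sim_related
```

In this toy data, apps 1 and 2 share no API, so the direct meta-path gives them similarity 0, while the longer meta-path connects them through related APIs. That is the sense in which richer relationships add signal beyond raw API calls.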

Citations: 223

Predicting Clinical Outcomes Across Changing Electronic Health Record Systems

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098064

Jen J. Gong, Tristan Naumann, Peter Szolovits, J. Guttag

Existing machine learning methods typically assume consistency in how semantically equivalent information is encoded. However, the way information is recorded in databases differs across institutions and over time, often rendering potentially useful data obsolescent. To address this problem, we map database-specific representations of information to a shared set of semantic concepts, thus allowing models to be built from or transition across different databases. We demonstrate our method on machine learning models developed in a healthcare setting. In particular, we evaluate our method using two different intensive care unit (ICU) databases and on two clinically relevant tasks, in-hospital mortality and prolonged length of stay. For both outcomes, a feature representation mapping EHR-specific events to a shared set of clinical concepts yields better results than using EHR-specific events alone.
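
A small illustration of the mapping step described above: two databases record the same clinical events under different codes, and translating both to a shared concept vocabulary yields comparable features. The database names, event codes, and concept labels are invented for illustration, and the count-based features are an assumed simplification rather than the representation used in the paper.

```python
from collections import Counter

# Hypothetical mappings from database-specific event codes to shared concepts.
CONCEPT_MAP = {
    "db_a": {"A-0001": "heart_rate", "A-0002": "respiratory_rate"},
    "db_b": {"B-9001": "heart_rate", "B-9002": "respiratory_rate"},
}

def to_concept_counts(database, event_codes):
    """Translate database-specific codes to shared concepts and count them.

    Codes without a known mapping are skipped.
    """
    mapping = CONCEPT_MAP[database]
    return Counter(mapping[code] for code in event_codes if code in mapping)

# The same clinical picture recorded in two different systems.
patient_a = to_concept_counts("db_a", ["A-0001", "A-0001", "A-0002"])
patient_b = to_concept_counts("db_b", ["B-9001", "B-9001", "B-9002", "B-unknown"])
```

Both patients end up with identical concept counts, so a model trained on features from one database can in principle be applied to the other. With the raw codes, the two feature spaces would not overlap at all.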

Citations: 35

Dispatch with Confidence: Integration of Machine Learning, Optimization and Simulation for Open Pit Mines

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098178

Kosta Ristovski, Chetan Gupta, Kunihiko Harada, Hsiu-Khuern Tang

Open pit mining operations require the use of extremely expensive equipment such as large trucks, shovels and loaders. To remain competitive, mining companies are under pressure to increase equipment utilization and reduce operational costs. The key to this in mining operations is to have sophisticated truck assignment strategies that ensure equipment is utilized efficiently at minimum operating cost. To address this problem, we have implemented a truck assignment approach that integrates machine learning, linear/integer programming and simulation. Our truck assignment approach takes into consideration the number of trucks and their sizes, shovels and dump locations, as well as stochastic activity times during the operations. Machine learning is used to predict probability distributions of equipment activity duration. We have validated the approach using data collected from two open pit mines. Our experimental results show that our approach offers an increase of 10% in efficiency. The presented results demonstrate that machine learning can bring significant value to the mining industry.

Citations: 12

Stock Price Prediction via Discovering Multi-Frequency Trading Patterns

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098117

Liheng Zhang, C. Aggarwal, Guo-Jun Qi

Stock prices are formed based on short and/or long-term commercial and trading activities that reflect different frequencies of trading patterns. However, these patterns are often elusive as they are affected by many uncertain political-economic factors in the real world, such as corporate performance, government policies, and even breaking news circulated across markets. Moreover, time series of stock prices are non-stationary and non-linear, making the prediction of future price trends very challenging. To address these challenges, we propose a novel State Frequency Memory (SFM) recurrent network that captures multi-frequency trading patterns from past market data to make long- and short-term predictions over time. Inspired by the Discrete Fourier Transform (DFT), the SFM decomposes the hidden states of memory cells into multiple frequency components, each of which models a particular frequency of latent trading pattern underlying the fluctuation of stock prices. The future stock prices are then predicted as a nonlinear mapping of the combination of these components, in an Inverse Fourier Transform (IFT) fashion. Modeling multi-frequency trading patterns can enable more accurate predictions for various time ranges: while a short-term prediction usually depends on high-frequency trading patterns, a long-term prediction should focus more on the low-frequency trading patterns that target long-term returns. Unfortunately, no existing model in the literature explicitly distinguishes between various frequencies of trading patterns to make dynamic predictions. Experiments on real market data also show that the SFM performs competitively compared with state-of-the-art methods.
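
The frequency-decomposition intuition behind this abstract can be shown with a plain DFT on a fixed series. The sketch below splits a synthetic signal into low- and high-frequency parts and recombines them with the inverse transform. The signal, its length, and the cutoff bin are arbitrary assumptions, and this is only the classical transform, not the SFM recurrent cell, which applies the idea to hidden states inside a recurrent network.

```python
import numpy as np

n = 64
t = np.arange(n)

# Synthetic series: a slow component (2 cycles) plus a fast one (16 cycles).
slow = np.sin(2 * np.pi * 2 * t / n)
fast = 0.5 * np.sin(2 * np.pi * 16 * t / n)
series = slow + fast

# Forward DFT of a real-valued signal: n // 2 + 1 frequency bins.
spectrum = np.fft.rfft(series)

# Split the spectrum at an assumed cutoff bin.
cutoff = 8
low_spectrum = spectrum.copy()
low_spectrum[cutoff:] = 0
high_spectrum = spectrum.copy()
high_spectrum[:cutoff] = 0

# The inverse DFT of each band gives one time-domain component per band.
low_part = np.fft.irfft(low_spectrum, n=n)
high_part = np.fft.irfft(high_spectrum, n=n)

# Because the transform is linear, the bands sum back to the original series.
reconstructed = low_part + high_part
```

Here `low_part` recovers the slow component and `high_part` the fast one. A predictor for a long horizon could weight the former more heavily and a short-horizon predictor the latter, which is the trade-off the abstract describes.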

Citations: 265

Internet Device Graphs

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098114

Matthew Malloy, P. Barford, Enis Ceyhun Alp, Jonathan Koller, Adria Jewell

Internet device graphs identify relationships between user-centric internet-connected devices such as desktops, laptops, smartphones, tablets, gaming consoles, and TVs. The ability to create such graphs is compelling for online advertising, content customization, recommendation systems, security, and operations. We begin by describing an algorithm for generating a device graph based on IP-colocation, and then apply the algorithm to a corpus of over 2.5 trillion internet events collected over a period of six weeks in the United States. The resulting graph exhibits immense scale, with more than 7.3 billion edges (pair-wise relationships) between more than 1.2 billion nodes (devices), accounting for the vast majority of internet-connected devices in the US. Next, we apply community detection algorithms to the graph, which partitions the internet devices into 100 million small communities representing physical households. We validate this partition with a unique ground truth dataset. We report on the characteristics of the graph and the communities. Lastly, we discuss the important issues of ethics and privacy that must be considered when creating and studying device graphs, and suggest further opportunities for device graph enrichment and application.
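
The two steps named in the abstract, linking devices by IP-colocation and then grouping them into household-sized communities, can be sketched roughly as follows. The abstract does not specify the paper's actual algorithm, so everything here is an assumption: the `(device, ip, bucket)` event format, the `min_shared` threshold, and the use of connected components as a simple stand-in for a real community detection method.

```python
from collections import defaultdict
from itertools import combinations

def build_device_graph(events, min_shared=2):
    """Link two devices when they are seen on the same IP in the same time bucket
    at least `min_shared` times. `events` is an iterable of (device, ip, bucket)."""
    seen = defaultdict(set)              # (ip, bucket) -> devices observed there
    for device, ip, bucket in events:
        seen[(ip, bucket)].add(device)

    weight = defaultdict(int)            # (device_a, device_b) -> co-location count
    for devices in seen.values():
        for a, b in combinations(sorted(devices), 2):
            weight[(a, b)] += 1

    return {pair: w for pair, w in weight.items() if w >= min_shared}

def connected_components(edges):
    """Group devices by connected component using union-find
    (a stand-in for community detection, not the paper's method)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical events using documentation-reserved IP addresses.
events = [
    ("laptop-1", "203.0.113.5", 0), ("phone-1", "203.0.113.5", 0),
    ("laptop-1", "203.0.113.5", 1), ("phone-1", "203.0.113.5", 1),
    ("tv-2", "198.51.100.7", 0), ("tablet-2", "198.51.100.7", 0),
    ("tv-2", "198.51.100.7", 1), ("tablet-2", "198.51.100.7", 1),
    ("phone-1", "192.0.2.9", 2), ("tablet-2", "192.0.2.9", 2),  # one-off shared network
]
graph = build_device_graph(events)
households = connected_components(graph)
```

In this toy input the single co-location of `phone-1` and `tablet-2` on a shared network falls below the threshold, so the two households stay separate. At the scale the abstract describes, a thresholded edge weight alone would likely be too crude, which is presumably why the authors rely on dedicated community detection algorithms.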

Citations: 10