Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Latest Articles

KunPeng: Parameter Server based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098029

Jun Zhou, Xiaolong Li, P. Zhao, Chaochao Chen, Longfei Li, Xinxing Yang, Qing Cui, Jin Yu, Xu Chen, Yi Ding, Yuan Qi

In recent years, due to the emergence of Big Data (terabytes or petabytes) and Big Model (tens of billions of parameters), there has been an ever-increasing need to parallelize machine learning (ML) algorithms in both academia and industry. Although existing distributed computing systems such as Hadoop and Spark can parallelize ML algorithms, they only provide synchronous and coarse-grained operators (e.g., Map, Reduce, and Join), which may hinder developers from implementing more efficient algorithms. This motivated us to design a universal distributed platform, termed KunPeng, that combines distributed systems and parallel optimization algorithms to deal with the complexities that arise from large-scale ML. Specifically, KunPeng not only encapsulates data/model parallelism, load balancing, model sync-up, sparse representation, industrial fault tolerance, etc., but also provides an easy-to-use interface that lets users focus on the core ML logic. Empirical results on terabytes of real data with billions of samples and features demonstrate that this design brings compelling performance improvements to ML programs ranging from the Follow-the-Regularized-Leader Proximal algorithm to Sparse Logistic Regression and Multiple Additive Regression Trees. Furthermore, KunPeng's encouraging performance is also shown in several real-world applications, including Alibaba's Double 11 Online Shopping Festival and Ant Financial's transaction risk estimation.
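
For readers unfamiliar with the Follow-the-Regularized-Leader Proximal (FTRL-Proximal) algorithm named in the abstract, the following is a minimal single-machine sketch of its standard per-coordinate update for sparse logistic regression. It is not KunPeng's distributed implementation: the class name `FTRLProximal`, the hyperparameter defaults, and the dictionary-based sparse feature format are all assumptions made for illustration.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression (illustrative sketch)."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.1, l2=1.0):
        # alpha, beta: learning-rate parameters; l1, l2: regularization strengths.
        # These defaults are arbitrary choices for the example.
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # accumulated adjusted gradients, per feature
        self.n = {}  # accumulated squared gradients, per feature

    def _weight(self, i):
        # Weights are derived lazily from z and n; L1 keeps small ones at exactly 0.
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        n = self.n.get(i, 0.0)
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        # x is a sparse example: {feature_id: value}.
        wx = sum(self._weight(i) * v for i, v in x.items())
        wx = max(min(wx, 35.0), -35.0)  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-wx))

    def update(self, x, y):
        # One online step on example x with label y in {0, 1}.
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of the log loss for coordinate i
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g
        return p
```

Only the features present in an example are touched on each step, which is what makes the method attractive for models with billions of sparse features; a parameter-server system would shard the `z` and `n` tables across servers, a detail this sketch leaves out.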

Citations: 44

An Intelligent Customer Care Assistant System for Large-Scale Cellular Network Diagnosis

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098120

Lujia Pan, Jianfeng Zhang, P. Lee, Hong Cheng, Cheng He, Caifeng He, Keli Zhang

With the advent of cellular network technologies, mobile Internet access has become the norm in everyday life. Meanwhile, subscriber complaints about unsatisfactory cellular network access have also become increasingly frequent. From a network operator's perspective, accurate and timely diagnosis of the causes of these complaints is critical both for improving subscriber-perceived experience and for maintaining network robustness. We present the Intelligent Customer Care Assistant (ICCA), a distributed fault classification system that uses a data-driven approach to perform large-scale cellular network diagnosis. ICCA takes massive network data as input and performs both offline model training and online feature computation to distinguish between user and network faults in real time. ICCA is currently deployed in a metropolitan LTE network in China that serves around 50 million subscribers. Our evaluation shows that ICCA achieves high classification accuracy (85.3%) and fast query response time (less than 2.3 seconds). We also report the lessons learned from the deployment.

Citations: 7

Learning Temporal State of Diabetes Patients via Combining Behavioral and Demographic Data

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098100

Houping Xiao, Jing Gao, Long H. Vu, D. Turaga

Diabetes is a serious disease affecting a large number of people. Although there is no cure for diabetes, it can be managed. In particular, with advances in sensor technology, the large amounts of data now available may improve diabetes management if properly mined. However, the observed behavioral data usually contain noise or errors, which poses challenges in extracting meaningful knowledge. To overcome this challenge, we learn the latent state that represents the patient's condition. Such states are unknown a priori and must be inferred from the behavioral data. In this paper, we propose a novel framework to capture the trajectory of latent states for patients from behavioral data while exploiting their demographic differences and similarities to other patients. We conduct a hypothesis test to illustrate the importance of the demographic data in diabetes management, and validate that each behavioral feature follows an exponential or a Gaussian distribution. Integrating these aspects, we use a Demographic feature restricted hidden Markov model (DfrHMM) to estimate the trajectory of latent states from the combined demographic and behavioral data. In DfrHMM, the latent state is mainly determined by the previous state and the demographic features in a nonlinear way. Markov Chain Monte Carlo techniques are used for model parameter estimation. Experiments on synthetic and real datasets show that DfrHMM is effective in diabetes management.

Citations: 18

AESOP: Automatic Policy Learning for Predicting and Mitigating Network Service Impairments

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098157

S. Deb, Zihui Ge, S. Isukapalli, S. Puthenpura, Shobha Venkataraman, He Yan, J. Yates

Efficient management and control of modern and next-gen networks is of paramount importance as networks have to maintain highly reliable service quality whilst supporting rapid growth in traffic demand and new application services. Rapid mitigation of network service degradations is a key factor in delivering high service quality. Automation is vital to achieving rapid mitigation of issues, particularly at the network edge where the scale and diversity are greatest. This automation involves the rapid detection, localization and (where possible) repair of service-impacting faults and performance impairments. However, the most significant challenge here is knowing what events to detect, how to correlate events to localize an issue and what mitigation actions should be performed in response to the identified issues. These are defined as policies to systems such as ECOMP. In this paper, we present AESOP, a data-driven intelligent system to facilitate automatic learning of policies and rules for triggering remedial actions in networks. AESOP combines best operational practices (domain knowledge) with a variety of measurement data to learn and validate operational policies to mitigate service issues in networks. AESOP's design addresses the following key challenges: (i) learning from high-dimensional noisy data, (ii) capturing multiple fault models, (iii) modeling the high service-cost of false positives, and (iv) accounting for the evolving network infrastructure. We present the design of our system and results from our ongoing experiments that demonstrate the effectiveness of our policy learning framework.

Citations: 12

Inductive Semi-supervised Multi-Label Learning with Co-Training

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098141

Wang Zhan, Min-Ling Zhang

In multi-label learning, each training example is associated with multiple class labels and the task is to learn a mapping from the feature space to the power set of the label space. It is generally demanding and time-consuming to obtain labels for training examples, especially for multi-label learning tasks where a number of class labels need to be annotated for each instance. To circumvent this difficulty, semi-supervised multi-label learning aims to exploit readily available unlabeled data to help build multi-label predictive models. Nonetheless, most semi-supervised solutions to multi-label learning work in a transductive setting, which focuses only on making predictions on existing unlabeled data and cannot generalize to unseen instances. In this paper, a novel approach named COINS is proposed to learn from labeled and unlabeled data by adapting the well-known co-training strategy, which naturally works in an inductive setting. In each co-training round, a dichotomy over the feature space is learned by maximizing the diversity between the two classifiers induced on the two dichotomized feature subsets. After that, pairwise ranking predictions on unlabeled data are communicated between the two classifiers for model refinement. Extensive experiments on a number of benchmark data sets show that COINS performs favorably against state-of-the-art multi-label learning approaches.

Citations: 61

No Longer Sleeping with a Bomb: A Duet System for Protecting Urban Safety from Dangerous Goods

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3097985

Jingyuan Wang, C. Chen, Junjie Wu, Z. Xiong

Recent years have witnessed the continuous growth of megalopolises worldwide, which makes urban safety a top priority in modern city life. Among various threats, dangerous goods such as gas and hazardous chemicals transported through and around cities have increasingly become the deadly "bomb" we sleep with every day. In both academia and government, tremendous efforts have been dedicated to dealing with dangerous goods transportation (DGT) issues, but further study is still greatly needed to quantify the problem and explore its intrinsic dynamics from a big data perspective. In this paper, we present a novel system called DGeye, which features a "duet" between DGT trajectory data and human mobility data for risky-zone identification. Moreover, DGeye innovatively takes risky patterns as the keystones in DGT management, and builds causality networks among them for pain point identification, attribution and prediction. Experiments on both Beijing and Tianjin demonstrate the effectiveness of DGeye. In particular, the report generated by DGeye drove the Beijing government to lay down gas pipelines for the famous Guijie food street.

Citations: 26

Real-Time Optimization of Web Publisher RTB Revenues

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098150

Pedro Chahuara, Nicolas Grislain, Grégoire Jauvion, J. Renders

This paper describes an engine to optimize web publisher revenues from second-price auctions. These auctions are widely used to sell online ad spaces in a mechanism called real-time bidding (RTB). Optimization within these auctions is crucial for web publishers, because setting appropriate reserve prices can significantly increase revenue. We consider a practical real-world setting where the only available information before an auction occurs consists of a user identifier and an ad placement identifier. The real-world challenges we had to tackle consist mainly of tracking the dependencies on both the user and placement in a highly non-stationary environment and of dealing with censored bid observations. These challenges led us to make the following design choices: (i) we adopted a relatively simple non-parametric regression model of auction revenue based on an incremental time-weighted matrix factorization which implicitly builds adaptive users' and placements' profiles; (ii) we jointly used a non-parametric model to estimate the first and second bids' distribution when they are censored, based on an on-line extension of Aalen's additive model. Our engine is a component of a deployed system handling hundreds of web publishers across the world, serving billions of ads a day to hundreds of millions of visitors. The engine is able to predict, for each auction, an optimal reserve price in approximately one millisecond and yields a significant revenue increase for the web publishers.
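
To make concrete why the reserve price matters in a second-price auction, here is a minimal sketch of the publisher's revenue for a single auction under the textbook rules. The function name `publisher_revenue` and the example bids are hypothetical; this shows only the auction mechanics, not the paper's prediction engine.

```python
def publisher_revenue(bids, reserve):
    """Revenue of one second-price auction with a reserve price.

    Textbook rule: the highest bidder wins only if the bid meets the
    reserve, and then pays the larger of the second-highest bid and
    the reserve. Otherwise the impression goes unsold.
    """
    ranked = sorted(bids, reverse=True)
    if not ranked or ranked[0] < reserve:
        return 0.0  # no bid reaches the reserve: nothing is sold
    second = ranked[1] if len(ranked) > 1 else 0.0
    return max(second, reserve)


# With bids of 5.0 and 3.0, a reserve between the two bids raises revenue,
# while a reserve above the top bid loses the sale entirely.
print(publisher_revenue([5.0, 3.0], reserve=0.0))  # 3.0
print(publisher_revenue([5.0, 3.0], reserve=4.0))  # 4.0
print(publisher_revenue([5.0, 3.0], reserve=6.0))  # 0.0
```

The example also hints at the censoring problem the abstract mentions: when the reserve exceeds a bid, the publisher may never observe that bid's true value, so the bid distribution has to be estimated from incomplete observations.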

Citations: 10

Predicting Optimal Facility Location without Customer Locations

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098198

Emre Yilmaz, Sanem Elbasi, H. Ferhatosmanoğlu

Deriving meaningful insights from location data helps businesses make better decisions. One critical decision made by a business is choosing a location for its new facility. Optimal location queries ask for a location to build a new facility that optimizes an objective function. Most existing work on optimal location queries proposes solutions that return the best location when the set of existing facilities and the set of customers are given. However, most businesses do not know the locations of their customers. In this paper, we introduce a new problem setting for optimal location queries by removing the assumption that the customer locations are known. We propose an optimal location predictor which accepts partial information about customer locations and returns a location for the new facility. The predictor generates synthetic customer locations from the given partial information and runs optimal location queries on the generated location data. Experiments with real data show that the predictor can find the optimal location when sufficient information is provided.

Citations: 11

MARAS: Signaling Multi-Drug Adverse Reactions

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3097986

X. Qin, T. Kakar, Susmitha Wunnava, Elke A. Rundensteiner, Lei Cao

There is a growing need for computing-supported methods that facilitate the automated signaling of Adverse Drug Reactions (ADRs) that would otherwise remain undiscovered in the exploding number of ADR reports filed by patients, medical professionals and drug manufacturers. In this research, we design a Multi-Drug Adverse Reaction Analytics Strategy, called MARAS, to signal severe unknown ADRs triggered by the use of a combination of drugs, also known as Multi-Drug Adverse Reactions (MDARs). First, MARAS features an efficient signal generation algorithm based on association rule learning that extracts non-spurious MDAR associations. Second, MARAS incorporates contextual information to detect drug combinations that are strongly associated with a set of ADRs. It groups related associations into Contextual Association Clusters (CACs), which then use contextual information to evaluate the significance of the discovered MDAR associations. Lastly, we use this contextual significance to rank discoveries by interestingness and so signal the most compelling MDARs. To demonstrate the utility of MARAS, we compare it with state-of-the-art techniques and evaluate it via case studies on datasets collected by the U.S. Food and Drug Administration Adverse Event Reporting System (FAERS).

Citations: 14

Deep Design: Product Aesthetics for Heterogeneous Markets

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pub Date : 2017-08-13 DOI: 10.1145/3097983.3098176

Yanxin Pan, Alex Burnap, J. Hartley, Rich Gonzalez, P. Papalambros

Aesthetic appeal is a primary driver of customer consideration for products such as automobiles. Product designers must accordingly convey design attributes (e.g., 'Sportiness'), a challenging proposition given the subjective nature of aesthetics and heterogeneous market segments with potentially different aesthetic preferences. We introduce a scalable deep learning approach that predicts how customers across different market segments perceive aesthetic designs and provides a visualization that can aid in product design. We tested this approach using a large-scale product design and crowdsourced customer data set with a Siamese neural network architecture containing a pair of conditional generative adversarial networks. The results show that the model predicts aesthetic design attributes of customers in heterogeneous market segments and provides a visualization of these aesthetic perceptions. This suggests that the proposed deep learning approach provides a scalable method for understanding customer aesthetic perceptions.

Citations: 19