
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Latest Papers

TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal, M. Ispir, Vihan Jain, L. Koc, C. Koo, Lukasz Lew, Clemens Mewald, A. Modi, N. Polyzotis, Sukriti Ramesh, Sudip Roy, Steven Euijong Whang, M. Wicke, Jarek Wilkiewicz, Xin Zhang, Martin A. Zinkevich
Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components: a learner for generating models based on training data, modules for analyzing and validating both data and models, and infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt. We present TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions. We present the case study of one deployment of TFX in the Google Play app store, where the machine learning models are refreshed continuously as new data arrive. Deploying TFX led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.
DOI: https://doi.org/10.1145/3097983.3098021 (published 2017-08-13)
Citations: 355
A Taxi Order Dispatch Model based On Combinatorial Optimization
Lingyu Zhang, Tao Hu, Yue Min, Guobin Wu, Junying Zhang, Pengcheng Feng, Pinghua Gong, Jieping Ye
Taxi-booking apps are popular all over the world because they offer conveniences such as fast response times. The key component of a taxi-booking app is the dispatch system, which aims to provide optimal matches between drivers and riders. Traditional dispatch systems sequentially dispatch taxis to riders and aim to maximize the driver acceptance rate for each individual order. However, such systems may lead to a low global success rate, which degrades the rider experience when using the app. In this paper, we propose a novel system that attempts to optimally dispatch taxis to serve multiple bookings. The proposed system aims to maximize the global success rate and thus optimizes the overall travel efficiency, leading to an enhanced user experience. To further enhance the user experience, we also propose a method to predict a user's destination once the taxi-booking app is started. The proposed method employs the Bayesian framework to model the distribution of a user's destination based on his/her travel histories. We use rigorous A/B tests to compare our new taxi dispatch method with state-of-the-art models using data collected in Beijing. Experimental results show that the proposed method is significantly better than other state-of-the-art models in terms of global success rate (increased from 80% to 84%). Moreover, we have also achieved significant improvements on other metrics such as user waiting time and pick-up distance. For our destination prediction algorithm, we show that our proposed model is superior to the baseline model, improving the top-3 accuracy from 89% to 93%. The proposed taxi dispatch and destination prediction algorithms are both deployed in our online systems and serve tens of millions of users every day.
DOI: https://doi.org/10.1145/3097983.3098138 (published 2017-08-13)
Citations: 180
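The contrast the taxi dispatch abstract draws between sequential per-order dispatch and a global objective can be illustrated with a small sketch. This is not the paper's algorithm: it assumes exactly one driver is offered each order, at least as many drivers as orders, and a made-up matrix `ACCEPT` of acceptance probabilities. All function names are hypothetical, and the exhaustive search only works for tiny inputs.

```python
from itertools import permutations


def expected_successes(accept, assignment):
    """Expected number of accepted orders when order o is offered to assignment[o]."""
    return sum(accept[o][d] for o, d in enumerate(assignment))


def greedy_dispatch(accept):
    """Sequential dispatch: each order in turn takes the free driver most likely to accept."""
    free = list(range(len(accept[0])))
    assignment = []
    for probs in accept:
        best = max(free, key=lambda d: probs[d])
        assignment.append(best)
        free.remove(best)
    return assignment


def global_dispatch(accept):
    """Exhaustive search for the one-to-one assignment with the most expected successes."""
    drivers = range(len(accept[0]))
    best = max(permutations(drivers, len(accept)),
               key=lambda perm: expected_successes(accept, perm))
    return list(best)


# ACCEPT[o][d]: hypothetical probability that driver d accepts order o.
ACCEPT = [
    [0.9, 0.8],
    [0.7, 0.1],
]

greedy = greedy_dispatch(ACCEPT)
optimal = global_dispatch(ACCEPT)
print(greedy, expected_successes(ACCEPT, greedy))
print(optimal, expected_successes(ACCEPT, optimal))
```

In this toy case the greedy pass gives order 0 its best driver and leaves order 1 with a driver who will almost certainly decline, while the global search accepts a slightly worse match for order 0 to raise the total. A production system would replace the exhaustive search with a scalable combinatorial solver.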
FLAP: An End-to-End Event Log Analysis Platform for System Management
Tao Li, Yexi Jiang, Chunqiu Zeng, Bin Xia, Zheng Liu, Wubai Zhou, Xiaolong Zhu, Wentao Wang, L. Zhang, Junying Wu, Li Xue, Dewei Bao
Many systems, such as distributed operating systems, complex networks, and high-throughput web-based applications, continuously generate large volumes of event logs. These logs contain useful information that helps system administrators understand the system's running status and pinpoint system failures. Generally, due to the scale and complexity of modern systems, the generated logs are beyond the analytic power of human beings. Therefore, it is imperative to develop a comprehensive log analysis system to support effective system management. Although a number of log mining techniques have been proposed to address specific log analysis use cases, few research or industrial efforts have been devoted to providing integrated systems with an end-to-end solution that facilitates log analysis routines. In this paper, we design and implement an integrated system, called FIU Log Analysis Platform (a.k.a. FLAP), that aims to facilitate data analytics for system event logs. FLAP provides an end-to-end solution that utilizes advanced data mining techniques to help log analysts conduct event log knowledge discovery, system status investigation, and system failure diagnosis conveniently, promptly, and accurately. Specifically, in FLAP, state-of-the-art template learning techniques are used to extract useful information from unstructured raw logs; advanced data transformation techniques are proposed and leveraged for event transformation and storage; effective event pattern mining, event summarization, event querying, and failure prediction techniques are designed and integrated for log analytics; and user-friendly interfaces are utilized to present the informative analysis results intuitively and vividly. Since 2016, FLAP has been used by Huawei Technologies Co. Ltd for internal event log analysis, and has provided effective support in its system operation and workflow optimization.
DOI: https://doi.org/10.1145/3097983.3098022 (published 2017-08-13)
Citations: 42
Inferring the Strength of Social Ties: A Community-Driven Approach
Polina Rozenshtein, Nikolaj Tatti, A. Gionis
Online social networks are growing and becoming denser. The social connections of a given person may vary widely: from close friends and relatives to acquaintances to people they hardly know. Inferring the strength of social ties is an important ingredient for modeling the interaction of users in a network and understanding their behavior. Furthermore, the problem has applications in computational social science, viral marketing, and people recommendation. In this paper we study the problem of inferring the strength of social ties in a given network. Our work is motivated by a recent approach by Sintos et al. [24], which leverages the Strong Triadic Closure (STC) principle, a hypothesis rooted in social psychology. To guide our inference process, in addition to the network structure, we also consider as input a collection of tight communities. Those are sets of vertices that we expect to be connected via strong ties. Such communities appear in different situations, e.g., when being part of a community implies a strong connection to one of the existing members. We consider two related problem formalizations that reflect the assumptions of our setting: a small number of STC violations and strong-tie connectivity in the input communities. We show that both problem formulations are NP-hard. We also show that one problem formulation is hard to approximate, while for the second we develop an algorithm with an approximation guarantee. We validate the proposed method on real-world datasets by comparing with baselines that optimize STC violations and community connectivity separately.
DOI: https://doi.org/10.1145/3097983.3098199 (published 2017-08-13)
Citations: 18
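To make the notion of an STC violation in the social ties abstract concrete: under Strong Triadic Closure, if a vertex has strong ties to two neighbors, those two neighbors should themselves be connected. The sketch below counts the open triangles that break this rule for a given strong/weak labeling. It only evaluates a labeling and does not attempt the NP-hard optimization the paper studies; the function name and the toy graph `EDGES` are illustrative assumptions.

```python
from itertools import combinations


def stc_violations(edges, strong):
    """Count open triangles (a - center - b, with a and b not adjacent)
    in which both edges at the center are labeled strong."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Undirected edges: store each strong edge as an unordered pair.
    strong_set = {frozenset(e) for e in strong}

    count = 0
    for center, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            both_strong = (frozenset((center, a)) in strong_set
                           and frozenset((center, b)) in strong_set)
            if both_strong and b not in adj[a]:
                count += 1
    return count


# Hypothetical graph: a square a-b-d-c-a plus the diagonal a-d.
EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("a", "d")]

print(stc_violations(EDGES, EDGES))                                  # every edge strong
print(stc_violations(EDGES, [("a", "b"), ("a", "d"), ("b", "d")]))   # one closed triangle strong
```

Labeling every edge strong leaves the two wedges through the missing b-c edge in violation, whereas labeling only the closed triangle a-b-d strong produces none.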
Not All Passes Are Created Equal: Objectively Measuring the Risk and Reward of Passes in Soccer from Tracking Data
P. Power, Héctor Ruiz, Xinyu Wei, P. Lucey
In soccer, the most frequent event is a pass. For a trained eye, there are a myriad of adjectives which could describe this event (e.g., "majestic pass", "conservative" to "poor-ball"). However, as these events need to be coded live and in real time (most often by human annotators), the current method of grading passes is restricted to the binary labels 0 (unsuccessful) or 1 (successful). Obviously, this is sub-optimal because the quality of a pass needs to be measured on a continuous spectrum (i.e., 0 to 100%) and not as a binary value. Additionally, a pass can be measured across multiple dimensions, namely: i) risk, the likelihood of executing a pass in a given situation, and ii) reward, the likelihood of a pass creating a chance. In this paper, we show how we estimate both the risk and reward of a pass across two seasons of tracking data captured from a recent professional soccer league with state-of-the-art performance, then showcase various use cases of our deployed passing system.
DOI: https://doi.org/10.1145/3097983.3098051 (published 2017-08-13)
Citations: 82
A Practical Exploration System for Search Advertising
P. Shah, Ming Yang, Sachidanand Alle, A. Ratnaparkhi, B. Shahshahani, Rohit Chandra
In this paper, we describe an exploration system that was implemented by the search-advertising team of a prominent web portal to address the cold ads problem. The cold ads problem refers to the situation where, when new ads are injected into the system by advertisers, the system is unable to assign an accurate quality to the ad (in our case, the click probability). As a consequence, the advertiser may suffer from low impression volumes for these cold ads, and the overall system may perform sub-optimally if the click probabilities for new ads are not learnt rapidly. We designed a new exploration system that was adapted to search advertising and the serving constraints of the system. In this paper, we define the problem, discuss the design details of the exploration system, introduce new evaluation criteria, and present the performance metrics we observed.
DOI: https://doi.org/10.1145/3097983.3098041 (published 2017-08-13)
Citations: 11
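The search advertising abstract does not say how its exploration system scores cold ads, so the following is only a generic illustration of why they are hard to rank and how an uncertainty bonus can give them impressions. It assumes a Beta prior over click probability and an optimism-style score (posterior mean plus a multiple of the posterior standard deviation). The prior parameters, the `bonus` weight, and both function names are hypothetical choices, not details from the paper.

```python
from math import sqrt


def click_posterior(clicks, impressions, prior_a=1.0, prior_b=1.0):
    """Mean and standard deviation of the Beta posterior over an ad's click probability."""
    a = prior_a + clicks
    b = prior_b + (impressions - clicks)
    mean = a / (a + b)
    std = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std


def exploration_score(clicks, impressions, bonus=1.0):
    """Optimistic score: ads with few impressions get a larger uncertainty bonus."""
    mean, std = click_posterior(clicks, impressions)
    return mean + bonus * std


cold_mean, cold_std = click_posterior(clicks=0, impressions=0)
warm_mean, warm_std = click_posterior(clicks=50, impressions=1000)
print(cold_mean, cold_std)
print(warm_mean, warm_std)
```

With no impressions the posterior is just the prior and its spread is wide, so the estimate says little about the ad; after a thousand impressions the spread shrinks and the bonus mostly vanishes. A real system would use a prior fitted to historical click rates rather than the uniform one assumed here.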
Multi-view Learning over Retinal Thickness and Visual Sensitivity on Glaucomatous Eyes
Toshimitsu Uesaka, K. Morino, Hiroki Sugiura, Taichi Kiwaki, Hiroshi Murata, R. Asaoka, K. Yamanishi
Dense measurement of the visual field, which is necessary to detect glaucoma, is known to be very costly and labor intensive. Measuring retinal thickness, by contrast, can now be less costly than measuring the visual field. It is therefore highly desirable to transform retinal-thickness data into visual-sensitivity data. In this paper, we propose two novel methods to estimate the sensitivity of the visual field at SITA-Standard mode 10-2 resolution using retinal-thickness data measured with optical coherence tomography (OCT). The first method, called Affine-Structured Non-negative Matrix Factorization (ASNMF), is able to cope with both the estimation of the visual field and the discovery of deep glaucoma knowledge. The second is based on Convolutional Neural Networks (CNNs) and demonstrates very high estimation performance. Both are multi-view learning methods because they utilize visual-field and retinal-thickness data simultaneously. We experimentally tested the performance of our methods from several perspectives. We found that ASNMF worked better for relatively small data sizes, while CNNs worked better for relatively large data sizes. In addition, some clinical knowledge was discovered via ASNMF. To the best of our knowledge, this is the first paper to address the dense estimation of the visual field based on retinal-thickness data.
DOI: https://doi.org/10.1145/3097983.3098194 (published 2017-08-13)
Citations: 7
Resolving the Bias in Electronic Medical Records
Kaiping Zheng, Jinyang Gao, K. Ngiam, B. Ooi, J. Yip
Electronic Medical Records (EMR) are the most fundamental resources used in healthcare data analytics. Since people visit the hospital more frequently when they feel sick and doctors prescribe lab examinations when they deem them necessary, we argue that there could be a strong bias in EMR observations compared with the hidden conditions of patients. Directly using such EMR for analytical tasks without considering the bias may lead to misinterpretation. To this end, we propose a general method to resolve the bias by transforming EMR into regular patient hidden condition series using a Hidden Markov Model (HMM) variant. Compared with the biased EMR series with irregular time stamps, the unbiased regular time series is much easier for most analytical models to process and yields better results. Extensive experimental results demonstrate that our bias resolving method imputes missing data more accurately than baselines and improves the performance of the state-of-the-art methods on typical medical data analytics.
DOI: https://doi.org/10.1145/3097983.3098149 · Published: 2017-08-13
Citations: 45
A Quasi-experimental Estimate of the Impact of P2P Transportation Platforms on Urban Consumer Patterns
Zhe Zhang, Beibei Li
With the pervasiveness of mobile technology and location-based computing, new forms of smart urban transportation, such as Uber and Lyft, have become increasingly popular. These new forms of urban infrastructure can influence individuals' movement frictions and patterns, in turn influencing local consumption patterns and the economic performance of local businesses. To gain insight into the future impact of urban transportation changes, we use a novel dataset and econometric analysis methods to present a quasi-experimental examination of how the emerging growth of peer-to-peer car-sharing services may have affected local consumer mobility and consumption patterns.
DOI: https://doi.org/10.1145/3097983.3098058 · Published: 2017-08-13
Citations: 15
Estimation of Recent Ancestral Origins of Individuals on a Large Scale
Ross E. Curtis, A. Girshick
The last ten years have seen exponential growth in direct-to-consumer genomics. One popular feature of these tests is the report of a distant ancestral inference profile: a breakdown of the regions of the world where the test-taker's ancestors may have lived. While current methods and products generally focus on the more distant past (e.g., thousands of years ago), we have recently demonstrated that by leveraging network analysis tools such as community detection, more recent ancestry can be identified. However, running a network analysis tool like community detection on a large network with potentially millions of nodes is not feasible in a live production environment where hundreds or thousands of new genotypes are processed every day. In this study, we describe a classification method that leverages network features to assign individuals to communities in a large network corresponding to recent ancestry. We recently launched a beta version of this research as a new product feature at AncestryDNA.
DOI: https://doi.org/10.1145/3097983.3098042 · Published: 2017-08-13
Citations: 2