
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Latest Publications

Multi-Modality Disease Modeling via Collective Deep Matrix Factorization
Qi Wang, Mengying Sun, L. Zhan, P. Thompson, Shuiwang Ji, Jiayu Zhou
Alzheimer's disease (AD), one of the most common causes of dementia, is a severe irreversible neurodegenerative disease that results in loss of mental functions. The transitional stage between the expected cognitive decline of normal aging and AD, mild cognitive impairment (MCI), has been widely regarded as a suitable time for possible therapeutic intervention. The challenging task of MCI detection is therefore of great clinical importance, where the key is to effectively fuse predictive information from multiple heterogeneous data sources collected from the patients. In this paper, we propose a framework to fuse multiple data modalities for predictive modeling using deep matrix factorization, which explores the non-linear interactions among the modalities and exploits such interactions to transfer knowledge and enable high performance prediction. Specifically, the proposed collective deep matrix factorization decomposes all modalities simultaneously to capture non-linear structures of the modalities in a supervised manner, and learns a modality specific component for each modality and a modality invariant component across all modalities. The modality invariant component serves as a compact feature representation of patients that has high predictive power. The modality specific components provide an effective means to explore imaging genetics, yielding insights into how imaging and genotype interact with each other non-linearly in the AD pathology. Extensive empirical studies using various data modalities provided by Alzheimer's Disease Neuroimaging Initiative (ADNI) demonstrate the effectiveness of the proposed method for fusing heterogeneous modalities.
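The shared/specific factorization described above can be pictured with a much simpler linear stand-in: each modality X_m is approximated as U @ V_m, where U is a shared (modality-invariant) patient factor and V_m a modality-specific loading, fitted by alternating ridge-regularized least squares on synthetic data. The modality names, sizes, and regularization below are assumptions for illustration; the paper's model is deep, non-linear, and supervised, which this sketch does not attempt.

import numpy as np

rng = np.random.default_rng(6)
n_patients, k = 50, 5
dims = {"imaging": 30, "genotype": 40}            # modality names/sizes are made up
X = {m: rng.normal(size=(n_patients, p)) for m, p in dims.items()}

U = rng.normal(size=(n_patients, k))              # shared, modality-invariant factor
V = {m: rng.normal(size=(k, p)) for m, p in dims.items()}  # modality-specific loadings
lam = 1e-2                                        # ridge regularization strength (assumed)

for _ in range(50):
    # update each modality-specific loading with the shared factor U fixed
    for m in X:
        V[m] = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ X[m])
    # update the shared factor with all loadings fixed
    lhs = sum(V[m] @ V[m].T for m in X) + lam * np.eye(k)
    rhs = sum(X[m] @ V[m].T for m in X)
    U = np.linalg.solve(lhs, rhs.T).T

print({m: round(float(np.linalg.norm(X[m] - U @ V[m])), 2) for m in X})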
{"title":"Multi-Modality Disease Modeling via Collective Deep Matrix Factorization","authors":"Qi Wang, Mengying Sun, L. Zhan, P. Thompson, Shuiwang Ji, Jiayu Zhou","doi":"10.1145/3097983.3098164","DOIUrl":"https://doi.org/10.1145/3097983.3098164","url":null,"abstract":"Alzheimer's disease (AD), one of the most common causes of dementia, is a severe irreversible neurodegenerative disease that results in loss of mental functions. The transitional stage between the expected cognitive decline of normal aging and AD, mild cognitive impairment (MCI), has been widely regarded as a suitable time for possible therapeutic intervention. The challenging task of MCI detection is therefore of great clinical importance, where the key is to effectively fuse predictive information from multiple heterogeneous data sources collected from the patients. In this paper, we propose a framework to fuse multiple data modalities for predictive modeling using deep matrix factorization, which explores the non-linear interactions among the modalities and exploits such interactions to transfer knowledge and enable high performance prediction. Specifically, the proposed collective deep matrix factorization decomposes all modalities simultaneously to capture non-linear structures of the modalities in a supervised manner, and learns a modality specific component for each modality and a modality invariant component across all modalities. The modality invariant component serves as a compact feature representation of patients that has high predictive power. The modality specific components provide an effective means to explore imaging genetics, yielding insights into how imaging and genotype interact with each other non-linearly in the AD pathology. Extensive empirical studies using various data modalities provided by Alzheimer's Disease Neuroimaging Initiative (ADNI) demonstrate the effectiveness of the proposed method for fusing heterogeneous modalities.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115978355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Embedding-based News Recommendation for Millions of Users
Shumpei Okura, Yukihiro Tagami, Shingo Ono, Akira Tajima
It is necessary to understand the content of articles and user preferences to make effective news recommendations. While ID-based methods, such as collaborative filtering and low-rank factorization, are well known for making recommendations, they are not suitable for news recommendations because candidate articles expire quickly and are replaced with new ones within short spans of time. Word-based methods, which are often used in information retrieval settings, are good candidates in terms of system performance, but they struggle to cope with synonyms and orthographical variants and to define "queries" from users' historical activities. This paper proposes an embedding-based method that uses distributed representations in a three-step end-to-end manner: (i) start with distributed representations of articles based on a variant of a denoising autoencoder, (ii) generate user representations by using a recurrent neural network (RNN) with browsing histories as input sequences, and (iii) match and list articles for users based on inner-product operations, taking system performance into consideration. The proposed method performed well in an experimental offline evaluation using past access data on Yahoo! JAPAN's homepage. We implemented it in our actual news distribution system based on these experimental results and compared its online performance with a method that was conventionally incorporated into the system. As a result, the click-through rate (CTR) improved by 23% and the total duration improved by 10%, compared with the conventionally incorporated method. Services that incorporate the method we propose are already open to all users and provide recommendations to over ten million individual users per day who make billions of accesses per month.
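The three-step pipeline in the abstract (article embeddings, a recurrent user representation over the browsing history, inner-product ranking) can be mocked up in a few lines of numpy. Everything below is an illustrative assumption: the embeddings are random stand-ins for the denoising-autoencoder representations, and the recurrent cell is a plain tanh RNN rather than the authors' architecture.

import numpy as np

rng = np.random.default_rng(0)
d, n_articles = 16, 100                      # embedding size and catalogue size (assumed)

# (i) article embeddings: random stand-ins for the denoising-autoencoder representations
article_emb = rng.normal(size=(n_articles, d))

# (ii) a plain tanh RNN cell that folds a browsing history into one user vector
W_h = rng.normal(scale=0.1, size=(d, d))
W_x = rng.normal(scale=0.1, size=(d, d))

def user_representation(history_ids):
    h = np.zeros(d)
    for aid in history_ids:
        h = np.tanh(W_h @ h + W_x @ article_emb[aid])
    return h

# (iii) rank candidate articles by inner product with the user vector
def recommend(history_ids, candidate_ids, top_k=5):
    u = user_representation(history_ids)
    scores = article_emb[candidate_ids] @ u
    order = np.argsort(-scores)[:top_k]
    return [candidate_ids[i] for i in order]

print(recommend(history_ids=[3, 17, 42], candidate_ids=list(range(50, 100))))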
{"title":"Embedding-based News Recommendation for Millions of Users","authors":"Shumpei Okura, Yukihiro Tagami, Shingo Ono, Akira Tajima","doi":"10.1145/3097983.3098108","DOIUrl":"https://doi.org/10.1145/3097983.3098108","url":null,"abstract":"It is necessary to understand the content of articles and user preferences to make effective news recommendations. While ID-based methods, such as collaborative filtering and low-rank factorization, are well known for making recommendations, they are not suitable for news recommendations because candidate articles expire quickly and are replaced with new ones within short spans of time. Word-based methods, which are often used in information retrieval settings, are good candidates in terms of system performance but have issues such as their ability to cope with synonyms and orthographical variants and define \"queries\" from users' historical activities. This paper proposes an embedding-based method to use distributed representations in a three step end-to-end manner: (i) start with distributed representations of articles based on a variant of a denoising autoencoder, (ii) generate user representations by using a recurrent neural network (RNN) with browsing histories as input sequences, and (iii) match and list articles for users based on inner-product operations by taking system performance into consideration. The proposed method performed well in an experimental offline evaluation using past access data on Yahoo! JAPAN's homepage. We implemented it on our actual news distribution system based on these experimental results and compared its online performance with a method that was conventionally incorporated into the system. As a result, the click-through rate (CTR) improved by 23% and the total duration improved by 10%, compared with the conventionally incorporated method. Services that incorporated the method we propose are already open to all users and provide recommendations to over ten million individual users per day who make billions of accesses per month.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127529618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 370
Scalable Top-n Local Outlier Detection
Yizhou Yan, Lei Cao, Elke A. Rundensteiner
The Local Outlier Factor (LOF) method, which labels all points with their respective LOF scores to indicate their status, is known to be very effective for identifying outliers in datasets with a skewed distribution. Since outliers are by definition the absolute minority in a dataset, the concept of the Top-N local outlier was proposed to discover the n points with the largest LOF scores. Detecting the Top-N local outliers is prohibitively expensive, since it requires a huge number of high-complexity k-nearest neighbor (kNN) searches. In this work, we present the first scalable Top-N local outlier detection approach, called TOLF. The key innovation of TOLF is a multi-granularity pruning strategy that quickly prunes most points from the set of potential outlier candidates without computing their exact LOF scores or even conducting any kNN search for them. Our customized density-aware indexing structure not only effectively supports the pruning strategy but also accelerates the kNN search. Our extensive experimental evaluation on OpenStreetMap, SDSS, and TIGER datasets demonstrates the effectiveness of TOLF, which is up to 35 times faster than the state-of-the-art methods.
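For reference, a brute-force Top-N LOF baseline (the quantity TOLF computes, but without the multi-granularity pruning or the density-aware index) can be written directly from the LOF definition. The toy data and parameter choices below are illustrative.

import numpy as np

def top_n_lof(X, k=5, n=3):
    # Brute-force Top-N LOF: O(N^2) distances and no pruning, for illustration only.
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn_idx = np.argsort(D, axis=1)[:, :k]                # k nearest neighbours of each point
    k_dist = D[np.arange(N), knn_idx[:, -1]]              # k-distance of each point
    # reachability distances and local reachability density (lrd)
    reach = np.maximum(D[np.arange(N)[:, None], knn_idx], k_dist[knn_idx])
    lrd = 1.0 / reach.mean(axis=1)
    lof = lrd[knn_idx].mean(axis=1) / lrd                 # LOF score of each point
    top = np.argsort(-lof)[:n]
    return top, lof[top]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), rng.uniform(-6, 6, size=(5, 2))])
print(top_n_lof(X, k=10, n=5))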
{"title":"Scalable Top-n Local Outlier Detection","authors":"Yizhou Yan, Lei Cao, Elke A. Rundensteiner","doi":"10.1145/3097983.3098191","DOIUrl":"https://doi.org/10.1145/3097983.3098191","url":null,"abstract":"Local Outlier Factor (LOF) method that labels all points with their respective LOF scores to indicate their status is known to be very effective for identifying outliers in datasets with a skewed distribution. Since outliers by definition are the absolute minority in a dataset, the concept of Top-N local outlier was proposed to discover the n points with the largest LOF scores. The detection of the Top-N local outliers is prohibitively expensive, since it requires huge number of high complexity k-nearest neighbor (kNN) searches. In this work, we present the first scalable Top-N local outlier detection approach called TOLF. The key innovation of TOLF is a multi-granularity pruning strategy that quickly prunes most points from the set of potential outlier candidates without computing their exact LOF scores or even without conducting any kNN search for them. Our customized density-aware indexing structure not only effectively supports the pruning strategy, but also accelerates the $k$NN search. Our extensive experimental evaluation on OpenStreetMap, SDSS, and TIGER datasets demonstrates the effectiveness of TOLF - up to 35 times faster than the state-of-the-art methods.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127556078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Functional Zone Based Hierarchical Demand Prediction For Bike System Expansion
Junming Liu, Leilei Sun, Qiao Li, Jingci Ming, Yanchi Liu, Hui Xiong
Bike sharing systems, which aim to provide the missing links in public transportation systems, are becoming popular in urban cities. Many providers of bike sharing systems are ready to expand their bike stations from the existing service area to surrounding regions. A key to success for a bike sharing system expansion is bike demand prediction for the expansion areas. There are two major challenges in this demand prediction problem: first, bike transition records are not available for the expansion area; second, station-level bike demand varies widely across the city. Previous research efforts mainly focus on discovering global features, assuming that station bike demands react equally to the global features, which brings large prediction errors when the urban area is large and highly diversified. To address these challenges, in this paper, we develop a hierarchical station bike demand predictor which analyzes bike demands from the functional zone level down to the station level. Specifically, we first divide the studied bike stations into functional zones by a novel bi-clustering algorithm which is designed to cluster bike stations with similar POI characteristics and close geographical distances together. Then, the hourly bike check-ins and check-outs of functional zones are predicted by integrating three influential factors: distance preference, zone-to-zone preference, and zone characteristics. The station demand is estimated by studying the demand distributions among the stations within the same functional zone. Finally, extensive experimental results on the NYC Citi Bike system with two expansion stages show the advantages of our approach for station demand and balance prediction in bike sharing system expansions.
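A minimal sketch of the zone-then-station idea: group stations by POI profile and location (plain k-means standing in for the paper's bi-clustering), forecast demand at the zone level, and redistribute it to stations by their historical shares. All data, features, and the zone-level "forecast" below are synthetic placeholders, not the paper's model.

import numpy as np

rng = np.random.default_rng(2)
n_stations, n_zones = 30, 5
station_poi = rng.random((n_stations, 4))                 # synthetic POI profile per station
station_xy = rng.random((n_stations, 2))                  # synthetic coordinates
hist_demand = rng.poisson(20, size=n_stations).astype(float)  # historical hourly check-outs

def kmeans(X, k, iters=20):
    # plain k-means, standing in for the paper's bi-clustering of POI + geography
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels

zone = kmeans(np.hstack([station_poi, station_xy]), n_zones)

# zone-level forecast: historical zone total times a random drift, a placeholder for the
# model that combines distance, zone-to-zone and zone-characteristic factors
zone_pred = np.array([hist_demand[zone == z].sum() for z in range(n_zones)])
zone_pred *= rng.uniform(0.8, 1.2, n_zones)

# redistribute each zone's forecast to its stations by their historical demand shares
station_pred = np.array([zone_pred[zone[i]] * hist_demand[i] / hist_demand[zone == zone[i]].sum()
                         for i in range(n_stations)])
print(station_pred.round(1))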
{"title":"Functional Zone Based Hierarchical Demand Prediction For Bike System Expansion","authors":"Junming Liu, Leilei Sun, Qiao Li, Jingci Ming, Yanchi Liu, Hui Xiong","doi":"10.1145/3097983.3098180","DOIUrl":"https://doi.org/10.1145/3097983.3098180","url":null,"abstract":"Bike sharing systems, aiming at providing the missing links in public transportation systems, are becoming popular in urban cities. Many providers of bike sharing systems are ready to expand their bike stations from the existing service area to surrounding regions. A key to success for a bike sharing systems expansion is the bike demand prediction for expansion areas. There are two major challenges in this demand prediction problem: First. the bike transition records are not available for the expansion area and second. station level bike demand have big variances across the urban city. Previous research efforts mainly focus on discovering global features, assuming the station bike demands react equally to the global features, which brings large prediction error when the urban area is large and highly diversified. To address these challenges, in this paper, we develop a hierarchical station bike demand predictor which analyzes bike demands from functional zone level to station level. Specifically, we first divide the studied bike stations into functional zones by a novel Bi-clustering algorithm which is designed to cluster bike stations with similar POI characteristics and close geographical distances together. Then, the hourly bike check-ins and check-outs of functional zones are predicted by integrating three influential factors: distance preference, zone-to-zone preference, and zone characteristics. The station demand is estimated by studying the demand distributions among the stations within the same functional zone. Finally, the extensive experimental results on the NYC Citi Bike system with two expansion stages show the advantages of our approach on station demand and balance prediction for bike sharing system expansions.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126826088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 55
Formative Essay Feedback Using Predictive Scoring Models
Bronwyn Woods, David Adamson, Shayne Miel, Elijah Mayfield
A major component of secondary education is learning to write effectively, a skill which is bolstered by repeated practice with formative guidance. However, providing focused feedback to every student on multiple drafts of each essay throughout the school year is a challenge for even the most dedicated of teachers. This paper first establishes a new ordinal essay scoring model and demonstrates its state-of-the-art performance relative to recent results in the Automated Essay Scoring field. Extending this model, we describe a method for using prediction on realistic essay variants to give rubric-specific formative feedback to writers. This method is used in Revision Assistant, a deployed data-driven educational product that provides immediate, rubric-specific, sentence-level feedback to students to supplement teacher guidance. We present initial evaluations of this feedback generation, both offline and in deployment.
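One way to picture feedback from prediction on variants: score the essay, re-score variants with one sentence removed, and surface the sentences whose removal moves the score most. The scoring function below is a fake word-count heuristic used only to show the shape of the loop; it is not the paper's ordinal model or rubric.

def score(sentences):
    # stand-in for the ordinal scoring model: unique-word count, capped at the top score
    text = " ".join(sentences)
    return min(4.0, len(set(text.lower().split())) / 10)

essay = [
    "The industrial revolution transformed labour.",
    "Stuff happened.",
    "Factories concentrated production and reshaped cities across Europe.",
]
base = score(essay)
# re-score variants with one sentence removed; big drops mark the most valuable sentences
impact = {s: base - score([t for t in essay if t is not s]) for s in essay}
for sentence, delta in sorted(impact.items(), key=lambda kv: -kv[1]):
    print(f"{delta:+.2f}  {sentence}")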
{"title":"Formative Essay Feedback Using Predictive Scoring Models","authors":"Bronwyn Woods, David Adamson, Shayne Miel, Elijah Mayfield","doi":"10.1145/3097983.3098160","DOIUrl":"https://doi.org/10.1145/3097983.3098160","url":null,"abstract":"A major component of secondary education is learning to write effectively, a skill which is bolstered by repeated practice with formative guidance. However, providing focused feedback to every student on multiple drafts of each essay throughout the school year is a challenge for even the most dedicated of teachers. This paper first establishes a new ordinal essay scoring model and its state of the art performance compared to recent results in the Automated Essay Scoring field. Extending this model, we describe a method for using prediction on realistic essay variants to give rubric-specific formative feedback to writers. This method is used in Revision Assistant, a deployed data-driven educational product that provides immediate, rubric-specific, sentence-level feedback to students to supplement teacher guidance. We present initial evaluations of this feedback generation, both offline and in deployment.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126293404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 50
Google Vizier: A Service for Black-Box Optimization
D. Golovin, Benjamin Solnik, Subhodeep Moitra, G. Kochanski, J. Karro, D. Sculley
Any sufficiently complex system acts as a black box when it becomes easier to experiment with than to understand. Hence, black-box optimization has become increasingly important as systems have become more complex. In this paper we describe Google Vizier, a Google-internal service for performing black-box optimization that has become the de facto parameter tuning engine at Google. Google Vizier is used to optimize many of our machine learning models and other systems, and also provides core capabilities to Google's Cloud Machine Learning HyperTune subsystem. We discuss our requirements, infrastructure design, underlying algorithms, and advanced features such as transfer learning and automated early stopping that the service provides.
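The service pattern described (clients repeatedly ask for a suggestion, run the black-box evaluation, and report the result) can be illustrated with a toy suggest/report loop. The class, its method names, and the random-search strategy below are made up for illustration and are not the Vizier API or its algorithms.

import random

class ToyTuner:
    # A suggest/report loop in the spirit of a black-box tuning service. This is NOT the
    # Vizier API: the class, method names, and random-search strategy are illustrative.
    def __init__(self, space, seed=0):
        self.space = space                     # {"param": (low, high)}
        self.trials = []                       # list of (params, objective)
        self.rng = random.Random(seed)

    def suggest(self):
        return {k: self.rng.uniform(lo, hi) for k, (lo, hi) in self.space.items()}

    def report(self, params, objective):
        self.trials.append((params, objective))

    def best(self):
        return min(self.trials, key=lambda t: t[1])

def train_and_eval(p):                         # the "black box", unknown to the tuner
    return (p["lr"] - 0.1) ** 2 + (p["momentum"] - 0.9) ** 2

tuner = ToyTuner({"lr": (1e-4, 1.0), "momentum": (0.0, 1.0)})
for _ in range(200):
    params = tuner.suggest()
    tuner.report(params, train_and_eval(params))
print(tuner.best())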
{"title":"Google Vizier: A Service for Black-Box Optimization","authors":"D. Golovin, Benjamin Solnik, Subhodeep Moitra, G. Kochanski, J. Karro, D. Sculley","doi":"10.1145/3097983.3098043","DOIUrl":"https://doi.org/10.1145/3097983.3098043","url":null,"abstract":"Any sufficiently complex system acts as a black box when it becomes easier to experiment with than to understand. Hence, black-box optimization has become increasingly important as systems have become more complex. In this paper we describe Google Vizier, a Google-internal service for performing black-box optimization that has become the de facto parameter tuning engine at Google. Google Vizier is used to optimize many of our machine learning models and other systems, and also provides core capabilities to Google's Cloud Machine Learning HyperTune subsystem. We discuss our requirements, infrastructure design, underlying algorithms, and advanced features such as transfer learning and automated early stopping that the service provides.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120822566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 663
Aspect Based Recommendations: Recommending Items with the Most Valuable Aspects Based on User Reviews
Konstantin Bauman, B. Liu, A. Tuzhilin
In this paper, we propose a recommendation technique that not only recommends items of interest to the user, as traditional recommendation systems do, but also recommends specific aspects of consuming those items to further enhance the user experience with them. For example, it can recommend that the user go to a specific restaurant (item) and also order some specific foods there, e.g., seafood (an aspect of consumption). Our method is called the Sentiment Utility Logistic Model (SULM). As its name suggests, SULM uses sentiment analysis of user reviews. It first predicts the sentiment that the user may have about the item based on what he/she might express about the aspects of the item, and then identifies the most valuable aspects of the user's potential experience with that item. Furthermore, the method can recommend items together with the most important aspects over which the user has control and can potentially select, such as the time to go to a restaurant, e.g., lunch vs. dinner, and what to order there, e.g., seafood. We tested the proposed method on three applications (restaurant, hotel, and beauty & spa) and experimentally showed that users who followed our recommendations of the most valuable aspects while consuming the items had better experiences, as defined by the overall rating.
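A rough stand-in for the sentiment-utility idea: fit a logistic model that maps per-aspect sentiments to the overall like/dislike outcome, then rank aspects by their learned contribution. The aspect set, synthetic data, and plain gradient-descent fit below are illustrative assumptions, not the paper's SULM formulation.

import numpy as np

aspects = ["food", "service", "seafood", "price"]        # illustrative aspect set
rng = np.random.default_rng(3)
n = 500
# each review mentions some aspects with sentiment -1/+1 (0 = not mentioned)
X = rng.choice([-1.0, 0.0, 1.0], size=(n, len(aspects)))
true_w = np.array([1.5, 1.0, 2.0, 0.5])
y = (1 / (1 + np.exp(-(X @ true_w))) > rng.random(n)).astype(float)   # liked the item?

# fit a logistic model of the overall outcome from aspect sentiments by gradient descent
w = np.zeros(len(aspects))
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * (X.T @ (p - y)) / n

# "most valuable aspects": those whose learned utility pushes hardest toward a positive outcome
print(sorted(zip(aspects, w.round(2)), key=lambda t: -t[1]))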
{"title":"Aspect Based Recommendations: Recommending Items with the Most Valuable Aspects Based on User Reviews","authors":"Konstantin Bauman, B. Liu, A. Tuzhilin","doi":"10.1145/3097983.3098170","DOIUrl":"https://doi.org/10.1145/3097983.3098170","url":null,"abstract":"In this paper, we propose a recommendation technique that not only can recommend items of interest to the user as traditional recommendation systems do but also specific aspects of consumption of the items to further enhance the user experience with those items. For example, it can recommend the user to go to a specific restaurant (item) and also order some specific foods there, e.g., seafood (an aspect of consumption). Our method is called Sentiment Utility Logistic Model (SULM). As its name suggests, SULM uses sentiment analysis of user reviews. It first predicts the sentiment that the user may have about the item based on what he/she might express about the aspects of the item and then identifies the most valuable aspects of the user's potential experience with that item. Furthermore, the method can recommend items together with those most important aspects over which the user has control and can potentially select them, such as the time to go to a restaurant, e.g. lunch vs. dinner, and what to order there, e.g., seafood. We tested the proposed method on three applications (restaurant, hotel, and beauty & spa) and experimentally showed that those users who followed our recommendations of the most valuable aspects while consuming the items, had better experiences, as defined by the overall rating.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134583411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 160
Small Batch or Large Batch?: Gaussian Walk with Rebound Can Teach
Peifeng Yin, Ping Luo, Taiga Nakamura
The efficiency of large-scale learning is a hot topic in both academia and industry. The stochastic gradient descent (SGD) algorithm, and its extension mini-batch SGD, allow the model to be updated without scanning the whole data set. However, the use of an approximate gradient leads to uncertainty, slowing the decrease of the objective function. Furthermore, such uncertainty may result in a high frequency of meaningless updates to the model, causing a communication issue in parallel learning environments. In this work, we develop a batch-adaptive stochastic gradient descent (BA-SGD) algorithm, which can dynamically choose a proper batch size as learning proceeds. In particular, on the basis of the Taylor expansion and the central limit theorem, it models the decrease of the objective value as a Gaussian random walk game with rebound. In this game, a heuristic strategy for determining the batch size is adopted to maximize the utility of each incremental sampling. By evaluation on multiple real data sets, we demonstrate that by smartly choosing the batch size, BA-SGD not only preserves the fast convergence of the SGD algorithm but also avoids overly frequent model updates.
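The batch-size question can be illustrated with a simple heuristic on a least-squares problem: grow the mini-batch whenever the sampled gradients look noisier than the gradient itself. This variance test is only in the spirit of batch-adaptive SGD; it is not the paper's Gaussian-random-walk-with-rebound utility model, and the problem and learning rate are toy assumptions.

import numpy as np

rng = np.random.default_rng(4)
m, d = 5000, 10
A = rng.normal(size=(m, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=m)                # least-squares toy problem

x = np.zeros(d)
batch, lr = 8, 0.05
for step in range(300):
    idx = rng.choice(m, size=batch, replace=False)
    per_sample = A[idx] * (A[idx] @ x - b[idx])[:, None]  # per-sample gradients
    g = per_sample.mean(axis=0)
    # grow the batch when gradient noise dominates the gradient signal
    noise = per_sample.var(axis=0).sum() / batch
    if noise > g @ g:
        batch = min(2 * batch, m)
    x -= lr * g
print(batch, float(np.linalg.norm(x - x_true)))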
{"title":"Small Batch or Large Batch?: Gaussian Walk with Rebound Can Teach","authors":"Peifeng Yin, Ping Luo, Taiga Nakamura","doi":"10.1145/3097983.3098147","DOIUrl":"https://doi.org/10.1145/3097983.3098147","url":null,"abstract":"Efficiency of large-scale learning is a hot topic in both academic and industry. The stochastic gradient descent (SGD) algorithm, and its extension mini-batch SGD, allow the model to be updated without scanning the whole data set. However, the use of approximate gradient leads to the uncertainty issue, slowing down the decreasing of objective function. Furthermore, such uncertainty may result in a high frequency of meaningless update on the model, causing a communication issue in parallel learning environment. In this work, we develop a batch-adaptive stochastic gradient descent (BA-SGD) algorithm, which can dynamically choose a proper batch size as learning proceeds. Particularly on the basis of Taylor extension and central limit theorem, it models the decrease of objective value as a Gaussian random walk game with rebound. In this game, a heuristic strategy of determining batch size is adopted to maximize the utility of each incremental sampling. By evaluation on multiple real data sets, we demonstrate that by smartly choosing the batch size, the BA-SGD not only conserves the fast convergence of SGD algorithm but also avoids too frequent model updates.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130772789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Robust Spectral Clustering for Noisy Data: Modeling Sparse Corruptions Improves Latent Embeddings
Aleksandar Bojchevski, Yves Matkovic, Stephan Günnemann
Spectral clustering is one of the most prominent clustering approaches. However, it is highly sensitive to noisy input data. In this work, we propose a robust spectral clustering technique able to handle such scenarios. To achieve this goal, we propose a sparse and latent decomposition of the similarity graph used in spectral clustering. In our model, we jointly learn the spectral embedding as well as the corrupted data - thus, enhancing the clustering performance overall. We propose algorithmic solutions to all three established variants of spectral clustering, each showing linear complexity in the number of edges. Our experimental analysis confirms the significant potential of our approach for robust spectral clustering. Supplementary material is available at www.kdd.in.tum.de/RSC.
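A crude illustration of the "clean graph plus sparse corruptions" intuition: drop the weakest edges of the similarity graph as if they were corruptions, then run a standard spectral bipartition on what remains. The RBF graph, the quantile threshold, and the two-cluster Fiedler split below are illustrative simplifications; the paper instead learns the sparse decomposition jointly with the embedding.

import numpy as np

def robust_spectral_bipartition(W, drop_frac=0.05):
    # drop the weakest edges as crude "corruptions", then do a standard spectral split
    W = W.copy()
    edges = W[np.triu_indices_from(W, k=1)]
    thresh = np.quantile(edges, drop_frac)
    W[W < thresh] = 0.0
    L = np.diag(W.sum(axis=1)) - W                 # unnormalised graph Laplacian
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                           # eigenvector of the 2nd smallest eigenvalue
    return (fiedler > 0).astype(int)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
W = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1))   # RBF similarity graph
print(robust_spectral_bipartition(W))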
{"title":"Robust Spectral Clustering for Noisy Data: Modeling Sparse Corruptions Improves Latent Embeddings","authors":"Aleksandar Bojchevski, Yves Matkovic, Stephan Günnemann","doi":"10.1145/3097983.3098156","DOIUrl":"https://doi.org/10.1145/3097983.3098156","url":null,"abstract":"Spectral clustering is one of the most prominent clustering approaches. However, it is highly sensitive to noisy input data. In this work, we propose a robust spectral clustering technique able to handle such scenarios. To achieve this goal, we propose a sparse and latent decomposition of the similarity graph used in spectral clustering. In our model, we jointly learn the spectral embedding as well as the corrupted data - thus, enhancing the clustering performance overall. We propose algorithmic solutions to all three established variants of spectral clustering, each showing linear complexity in the number of edges. Our experimental analysis confirms the significant potential of our approach for robust spectral clustering. Supplementary material is available at www.kdd.in.tum.de/RSC.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133371318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 60
Evaluating U.S. Electoral Representation with a Joint Statistical Model of Congressional Roll-Calls, Legislative Text, and Voter Registration Data
Zhengming Xing, Sunshine Hillygus, L. Carin
Extensive information on 3 million randomly sampled United States citizens is used to construct a statistical model of constituent preferences for each U.S. congressional district. This model is linked to the legislative voting record of the legislator from each district, yielding an integrated model for constituency data, legislative roll-call votes, and the text of the legislation. The model is used to examine the extent to which legislators' voting records are aligned with constituent preferences, and the implications of that alignment (or lack thereof) on subsequent election outcomes. The analysis is based on a Bayesian formalism, with fast inference via a stochastic variational Bayesian analysis.
{"title":"Evaluating U.S. Electoral Representation with a Joint Statistical Model of Congressional Roll-Calls, Legislative Text, and Voter Registration Data","authors":"Zhengming Xing, Sunshine Hillygus, L. Carin","doi":"10.1145/3097983.3098151","DOIUrl":"https://doi.org/10.1145/3097983.3098151","url":null,"abstract":"Extensive information on 3 million randomly sampled United States citizens is used to construct a statistical model of constituent preferences for each U.S. congressional district. This model is linked to the legislative voting record of the legislator from each district, yielding an integrated model for constituency data, legislative roll-call votes, and the text of the legislation. The model is used to examine the extent to which legislators' voting records are aligned with constituent preferences, and the implications of that alignment (or lack thereof) on subsequent election outcomes. The analysis is based on a Bayesian formalism, with fast inference via a stochastic variational Bayesian analysis.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115089281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2