
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining: Latest Papers

Discovering Non-Redundant K-means Clusterings in Optimal Subspaces
Dominik Mautz, Wei Ye, C. Plant, C. Böhm
A huge object collection in high-dimensional space can often be clustered in more than one way; for instance, objects could be clustered by their shape or alternatively by their color. Each grouping represents a different view of the data set. The new research field of non-redundant clustering addresses this class of problems. In this paper, we follow the approach that different, non-redundant k-means-like clusterings may exist in different, arbitrarily oriented subspaces of the high-dimensional space. We assume that these subspaces (and optionally a further noise space without any cluster structure) are orthogonal to each other. This assumption enables a particularly rigorous mathematical treatment of the non-redundant clustering problem and thus a particularly efficient algorithm, which we call Nr-Kmeans (for non-redundant k-means). The superiority of our algorithm is demonstrated both theoretically and in extensive experiments.
DOI: https://doi.org/10.1145/3219819.3219945 | Published: 2018-07-19
Citations: 24
Adaptive Paywall Mechanism for Digital News Media
Heydar Davoudi, Aijun An, Morteza Zihayat, Gordon Edall
Many online news agencies utilize the paywall mechanism to increase reader subscriptions. This method offers a non-subscribed reader a fixed number of free articles in a period of time (e.g., a month), and then directs the user to the subscription page for further reading. We argue that there is no direct relationship between the number of paywalls presented to readers and the number of subscriptions, and that this artificial barrier, if not used well, may disengage potential subscribers and thus may not serve its purpose of increasing revenue. Moreover, the current paywall mechanism considers neither the user's browsing history nor the articles that the user may visit in the future. Thus, it treats all readers equally and does not consider a reader's potential to become a subscriber. In this paper, we propose an adaptive paywall mechanism to balance the benefit of showing an article against that of displaying the paywall (i.e., terminating the session). We first define the notions of cost and utility, which are used to define an objective function for optimal paywall decision making. Then, we model the problem as a stochastic sequential decision process. Finally, we propose an efficient policy function for paywall decision making. The experimental results on a real dataset from a major newspaper in Canada show that the proposed model outperforms the traditional paywall mechanism as well as the other baselines.
DOI: https://doi.org/10.1145/3219819.3219892 | Published: 2018-07-19
Citations: 7
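For orientation, the traditional fixed-quota paywall that the paywall abstract above uses as its baseline can be sketched as follows. This is only an illustration of the baseline as the abstract describes it (a fixed number of free articles per period, then the subscription page), not the adaptive mechanism the authors propose. The class name `MeteredPaywall`, the `free_quota` parameter, and the string return values are hypothetical.

```python
from collections import defaultdict


class MeteredPaywall:
    """Fixed-quota baseline: every non-subscriber gets the same free allowance.

    Hypothetical sketch; names and the default quota are assumptions.
    """

    def __init__(self, free_quota: int = 5):
        self.free_quota = free_quota
        # (reader_id, period) -> number of free articles already served
        self._reads = defaultdict(int)

    def request(self, reader_id: str, period: str, subscribed: bool = False) -> str:
        """Return "article" if the article is served, otherwise "paywall"."""
        if subscribed:
            return "article"
        key = (reader_id, period)
        if self._reads[key] < self.free_quota:
            self._reads[key] += 1
            return "article"
        # Quota exhausted: the reader is sent to the subscription page.
        return "paywall"
```

Note that the decision depends only on a per-reader counter, which is exactly the limitation the abstract points out: browsing history and likely future visits play no role.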
On Interpretation of Network Embedding via Taxonomy Induction
Ninghao Liu, Xiao Huang, Jundong Li, Xia Hu
Network embedding has been increasingly used in many network analytics applications to generate low-dimensional vector representations, so that many off-the-shelf models can be applied to solve a wide variety of data mining tasks. However, as with many other machine learning methods, network embedding results remain hard for users to understand. Each dimension in the embedding space usually does not have any specific meaning, so it is difficult to comprehend how the embedding instances are distributed in the reconstructed space. In addition, heterogeneous content information may be incorporated into network embedding, so it is challenging to specify which source of information is effective in generating the embedding results. In this paper, we investigate the interpretation of network embedding, aiming to understand how instances are distributed in embedding space, as well as to explore the factors that lead to the embedding results. We resort to the post-hoc interpretation scheme, so that our approach can be applied to different types of embedding methods. Specifically, the interpretation of network embedding is presented in the form of a taxonomy. Effective objectives and corresponding algorithms are developed towards building the taxonomy. We also design several metrics to evaluate interpretation results. Experiments on real-world datasets from different domains demonstrate that, compared with the state-of-the-art alternatives, our approach produces effective and meaningful interpretations of embedding results.
DOI: https://doi.org/10.1145/3219819.3220001 | Published: 2018-07-19
Citations: 47
Learning Adversarial Networks for Semi-Supervised Text Classification via Policy Gradient
Yan Li, Jieping Ye
Semi-supervised learning is a branch of machine learning techniques that aims to make full use of both labeled and unlabeled instances to improve the prediction performance. The size of modern real-world datasets is ever-growing, so acquiring label information for them is extraordinarily difficult and costly. Therefore, deep semi-supervised learning is becoming more and more popular. Most of the existing deep semi-supervised learning methods are built under the generative model based scheme, where the data distribution is approximated via input data reconstruction. However, this scheme does not naturally work on discrete data, e.g., text; in addition, learning a good data representation is sometimes directly opposed to the goal of learning a high performance prediction model. To address the issues of this type of method, we reformulate semi-supervised learning as a model-based reinforcement learning problem and propose an adversarial networks based framework. The proposed framework contains two networks: a predictor network for target estimation and a judge network for evaluation. The judge network iteratively generates a proper reward to guide the training of the predictor network, and the predictor network is trained via policy gradient. Based on the aforementioned framework, we propose a recurrent neural network based model for semi-supervised text classification. We conduct comprehensive experimental analysis on several real-world benchmark text datasets, and the results from our evaluations show that our method outperforms other competing state-of-the-art methods.
DOI: https://doi.org/10.1145/3219819.3219956 | Published: 2018-07-19
Citations: 45
Count-Min: Optimal Estimation and Tight Error Bounds using Empirical Error Distributions
Daniel Ting
The Count-Min sketch is an important and well-studied data summarization method. It can estimate the count of any item in a stream using a small, fixed size data sketch. However, the accuracy of the Count-Min sketch depends on characteristics of the underlying data. This has led to a number of count estimation procedures which work well in one scenario but perform poorly in others. A practitioner is faced with two basic, unanswered questions. Given an estimate, what is its error? Which estimation procedure should be chosen when the data is unknown? We provide answers to these questions. We derive new count estimators, including a provably optimal estimator, which best or match previous estimators in all scenarios. We also provide practical, tight error bounds at query time for all estimators and methods to tune sketch parameters using these bounds. The key observation is that the full distribution of errors in each counter can be empirically estimated from the sketch itself. By first estimating this distribution, count estimation becomes a statistical estimation and inference problem with a known error distribution. This provides both a principled way to derive new and optimal estimators as well as a way to study the error and properties of existing estimators.
DOI: https://doi.org/10.1145/3219819.3219975 | Published: 2018-07-19
Citations: 27
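As background for the Count-Min entry above, here is a minimal sketch of the data structure with its classic minimum estimator. It does not implement the new or optimal estimators the abstract announces. The class name, the default `width` and `depth`, and the use of SHA-256 with a row prefix as the hash family are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Classic Count-Min sketch: `depth` rows of `width` counters each."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by prefixing the row number
        # (an illustrative choice, not a requirement of the sketch).
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        """Add `count` occurrences of `item` to one counter in every row."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        """Minimum over the item's counters; never below the true count
        when all added counts are non-negative."""
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

Each counter holds the item's true count plus whatever collided with it, so the error of the minimum estimator depends on the rest of the stream. That data dependence is the issue the abstract raises.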
Fatigue Prediction in Outdoor Runners Via Machine Learning and Sensor Fusion
Tim Op De Beéck, Wannes Meert, K. Schütte, B. Vanwanseele, Jesse Davis
Running is extremely popular, and around 10.6 million people run regularly in the United States alone. Unfortunately, estimates indicate that between 29% and 79% of runners sustain an overuse injury every year. One contributing factor to such injuries is excessive fatigue, which can result in alterations in how someone runs that increase the risk of an overuse injury. Thus, being able to detect during a running session when excessive fatigue sets in, and hence when these alterations are prone to arise, could be of great practical importance. In this paper, we explore whether we can use machine learning to predict the rating of perceived exertion (RPE), a validated subjective measure of fatigue, from inertial sensor data of individuals running outdoors. We describe how both the subjective target label and the realistic outdoor running environment introduce several interesting data science challenges. We collected a longitudinal dataset of runners, and demonstrate that machine learning can be used to learn accurate models for predicting RPE.
DOI: https://doi.org/10.1145/3219819.3219864 | Published: 2018-07-19
Citations: 38
Finding Similar Exercises in Online Education Systems
Qi Liu, Zai Huang, Zhenya Huang, Chuanren Liu, Enhong Chen, Yu-Ho Su, Guoping Hu
In online education systems, finding similar exercises is a fundamental task of many applications, such as exercise retrieval and student modeling. Several approaches have been proposed for this task by simply using the specific textual content (e.g., the same knowledge concepts or similar words) in exercises. However, the problem of how to systematically exploit the rich semantic information embedded in multiple heterogeneous data (e.g., texts and images) to precisely retrieve similar exercises remains largely open. To this end, in this paper, we develop a novel Multimodal Attention-based Neural Network (MANN) framework for finding similar exercises in large-scale online education systems by learning a unified semantic representation from the heterogeneous data. In MANN, given exercises with texts, images and knowledge concepts, we first apply a convolutional neural network to extract image representations and use an embedding layer for representing concepts. Then, we design an attention-based long short-term memory network to learn a unified semantic representation of each exercise in a multimodal way. Here, two attention strategies are proposed to capture the associations of texts and images, and of texts and knowledge concepts, respectively. Moreover, with a Similarity Attention, the similar parts in each exercise pair are also measured. Finally, we develop a pairwise training strategy for returning similar exercises. Extensive experimental results on real-world data clearly validate the effectiveness and the interpretation power of MANN.
DOI: https://doi.org/10.1145/3219819.3219960 | Published: 2018-07-19
Citations: 73
Large-Scale Order Dispatch in On-Demand Ride-Hailing Platforms: A Learning and Planning Approach
Zhe Xu, Zhixin Li, Qingwen Guan, Dingshui Zhang, Qiang Li, Junxiao Nan, Chunyang Liu, Wei Bian, Jieping Ye
We present a novel order dispatch algorithm in large-scale on-demand ride-hailing platforms. While traditional order dispatch approaches usually focus on immediate customer satisfaction, the proposed algorithm is designed to provide a more efficient way to optimize resource utilization and user experience in a global and more farsighted view. In particular, we model order dispatch as a large-scale sequential decision-making problem, where the decision of assigning an order to a driver is determined by a centralized algorithm in a coordinated way. The problem is solved in a learning and planning manner: 1) based on historical data, we first summarize demand and supply patterns into a spatiotemporal quantization, each of which indicates the expected value of a driver being in a particular state; 2) a planning step is conducted in real-time, where each driver-order-pair is valued in consideration of both immediate rewards and future gains, and then dispatch is solved using a combinatorial optimizing algorithm. Through extensive offline experiments and online AB tests, the proposed approach delivers remarkable improvement on the platform's efficiency and has been successfully deployed in the production system of Didi Chuxing.
提出了一种新的大规模按需网约车平台的订单调度算法。传统的订单调度方法通常关注客户的即时满意度,而本文提出的算法旨在提供一种更有效的方法,以全局和更有远见的视角优化资源利用率和用户体验。特别地,我们将订单分配建模为一个大规模的顺序决策问题,其中将订单分配给驾驶员的决策由集中算法以协调的方式确定。采用学习和规划的方式解决问题:1)基于历史数据,首先将需求和供给模式总结为时空量化,每个模式都表示驾驶员处于特定状态的期望值;2)实时规划步骤,考虑当前收益和未来收益,对每个司机-订单对进行估值,然后使用组合优化算法求解调度问题。通过大量的线下实验和线上AB测试,该方法显著提高了平台的效率,并已成功部署在滴滴出行的生产系统中。
DOI: https://doi.org/10.1145/3219819.3219824 (published 2018-07-19)
Citations: 300
Efficient Attribute Recommendation with Probabilistic Guarantee
Chi Wang, K. Chakrabarti
We study how to efficiently solve a primitive data exploration problem: given two ad-hoc predicates that define two subsets of a relational table, find the top-K attributes whose distributions in the two subsets deviate most from each other. The deviation is measured by $\ell_1$ or $\ell_2$ distance. The exact approach is to query the full table to calculate the deviation for each attribute and then sort the attributes by deviation, which is too expensive for large tables. Researchers have proposed heuristic sampling solutions to avoid accessing the entire table for all attributes. However, these solutions have no theoretical guarantee of correctness, and their speedup over the exact approach is limited. In this paper, we develop an adaptive querying solution with a probabilistic guarantee of correctness and near-optimal sample complexity. We perform experiments on both synthetic and real-world datasets. Compared to the exact approach implemented with a commercial DBMS, previous sampling solutions achieve up to 2× speedup with erroneous answers. Our solution can produce 25× speedup with near-zero error in the answer.
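As a point of reference, the exact approach mentioned in this abstract (scan the full table, compute each attribute's deviation, sort) might look like the sketch below. It is not the paper's adaptive sampling method. The function names, the row-as-dictionary table layout, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def distribution(rows, attr):
    """Empirical distribution of attr over rows, as {value: probability}."""
    counts = Counter(row[attr] for row in rows)
    n = len(rows)
    return {value: count / n for value, count in counts.items()}


def deviation(rows_a, rows_b, attr, norm=1):
    """l1 (norm=1) or l2 (norm=2) distance between the two distributions."""
    p = distribution(rows_a, attr)
    q = distribution(rows_b, attr)
    diffs = [abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q)]
    if norm == 1:
        return sum(diffs)
    return sum(d * d for d in diffs) ** 0.5


def top_k_attributes(table, pred_a, pred_b, attrs, k, norm=1):
    """Exact top-K: full scan of both subsets, then sort by deviation."""
    rows_a = [row for row in table if pred_a(row)]
    rows_b = [row for row in table if pred_b(row)]
    scored = [(deviation(rows_a, rows_b, a, norm), a) for a in attrs]
    # Largest deviation first; ties broken by attribute name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [attr for _, attr in scored[:k]]


# Hypothetical four-row table.
TABLE = [
    {"region": "east", "device": "mobile", "plan": "free"},
    {"region": "east", "device": "mobile", "plan": "paid"},
    {"region": "west", "device": "desktop", "plan": "free"},
    {"region": "west", "device": "mobile", "plan": "paid"},
]


def is_east(row):
    return row["region"] == "east"


def is_west(row):
    return row["region"] == "west"


TOP = top_k_attributes(TABLE, is_east, is_west, ["device", "plan"], k=1)
```

Here `device` has an $\ell_1$ deviation of 1.0 between the two subsets while `plan` has 0.0, so `TOP` is `["device"]`. Every call reads all rows of both subsets for every attribute, which is the cost the abstract says sampling is meant to avoid.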
DOI: https://doi.org/10.1145/3219819.3219984 (published 2018-07-19)
Citations: 6
Computational Advertising at Scale
Suju Rajan
Machine learning literature on computational advertising tends to focus on the simplistic CTR prediction problem, which, while relevant, is only the tip of the iceberg in terms of the challenges in the field. There is also very little appreciation for the scale at which real-time-bidding systems operate (200B bid requests/day) or for the increasingly adversarial ecosystem, both of which add many constraints on feasible solutions. In this talk, I'll highlight some recent efforts in developing models that try to better capture the journey of an ad, from its first display to a user through to its effect on an actual purchase.
DOI: https://doi.org/10.1145/3219819.3219932 (published 2018-07-19)
Citations: 0