
Latest publications: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Scalable diffusion-aware optimization of network topology
Elias Boutros Khalil, B. Dilkina, Le Song
How can we optimize the topology of a networked system to bring a flu under control, propel a video to popularity, or stifle a network malware in its infancy? Previous work on information diffusion has focused on modeling the diffusion dynamics and selecting nodes to maximize/minimize influence. Only a few recent studies have attempted to address the network modification problems, where the goal is to either facilitate desirable spreads or curtail undesirable ones by adding or deleting a small subset of network nodes or edges. In this paper, we focus on the widely studied linear threshold diffusion model, and prove, for the first time, that the network modification problems under this model have supermodular objective functions. This surprising property allows us to design efficient data structures and scalable algorithms with provable approximation guarantees, despite the hardness of the problems in question. Both the time and space complexities of our algorithms are linear in the size of the network, which allows us to experiment with millions of nodes and edges. We show that our algorithms outperform an array of heuristics in terms of their effectiveness in controlling diffusion processes, often beating the next best by a significant margin.
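The supermodularity result is what makes greedy selection with lazy evaluation practical at this scale. Below is a minimal sketch of the generic lazy-greedy pattern such methods build on, not the paper's actual algorithm; `marginal_gain` is a hypothetical callback standing in for a diffusion objective, and the coverage-style toy at the bottom is illustrative only.

```python
import heapq

def lazy_greedy(candidates, marginal_gain, k):
    """Pick k items greedily, recomputing gains only when stale.

    Diminishing returns guarantees that a previously computed gain is
    an upper bound on the current one, so only the heap top ever needs
    refreshing. `marginal_gain(item, chosen)` is a hypothetical
    stand-in for a diffusion objective.
    """
    chosen = []
    heap = [(-marginal_gain(c, chosen), i, c, 0) for i, c in enumerate(candidates)]
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        neg_gain, i, item, stamp = heapq.heappop(heap)
        if stamp == len(chosen):            # bound is fresh: take the item
            chosen.append(item)
        else:                               # bound is stale: refresh, reinsert
            heapq.heappush(heap, (-marginal_gain(item, chosen), i, item, len(chosen)))
    return chosen

# Toy usage with a coverage-style objective (illustrative only).
sets = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
gain = lambda x, S: len(sets[x] - set().union(*(sets[s] for s in S)))
print(lazy_greedy(list(sets), gain, k=2))   # -> ['a', 'b']
```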
Citations: 133
Deep learning
R. Salakhutdinov
Building intelligent systems that are capable of extracting high-level representations from high-dimensional data lies at the core of solving many AI-related tasks, including visual object or pattern recognition, speech perception, and language understanding. Theoretical and biological arguments strongly suggest that building such systems requires deep architectures that involve many layers of nonlinear processing. Many existing learning algorithms use shallow architectures, including neural networks with only one hidden layer, support vector machines, kernel logistic regression, and many others. The internal representations learned by such systems are necessarily simple and are incapable of extracting some types of complex structure from high-dimensional input. In the past few years, researchers across many different communities, from applied statistics to engineering, computer science, and neuroscience, have proposed several deep (hierarchical) models that are capable of extracting meaningful, high-level representations. An important property of these models is that they can extract complex statistical dependencies from data and efficiently learn high-level representations by re-using and combining intermediate concepts, allowing these models to generalize well across a wide variety of tasks. The learned high-level representations have been shown to give state-of-the-art results in many challenging learning problems and have been successfully applied in a wide variety of application domains, including visual object recognition, information retrieval, natural language processing, and speech perception. A few notable examples of such models include Deep Belief Networks, Deep Boltzmann Machines, Deep Autoencoders, and sparse coding-based methods. The goal of the tutorial is to introduce recent developments in various deep learning methods to the KDD community. The core focus will be placed on algorithms that can learn multi-layer hierarchies of representations, emphasizing their applications in information retrieval, object recognition, and speech perception.
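As a concrete illustration of the layer-wise idea behind deep autoencoders, here is a toy sketch, assuming plain NumPy, sigmoid activations, squared-error reconstruction loss, and full-batch gradient descent; real systems use far better optimizers, regularization, and scale.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, hidden, epochs=200, lr=0.5, seed=0):
    """One sigmoid autoencoder layer, trained by batch gradient descent.

    A toy illustration of greedy layer-wise pretraining, nothing more.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=(hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)               # encode
        R = sigmoid(H @ W2 + b2)               # reconstruct
        dR = (R - X) * R * (1 - R)             # output-layer delta
        dH = (dR @ W2.T) * H * (1 - H)         # hidden-layer delta
        W2 -= lr * H.T @ dR / n; b2 -= lr * dR.mean(axis=0)
        W1 -= lr * X.T @ dH / n; b1 -= lr * dH.mean(axis=0)
    return W1, b1

# Stack greedily: the second layer learns to reconstruct the codes of
# the first, building a deeper representation of the input.
X = np.random.default_rng(1).random((100, 20))
W1, b1 = train_autoencoder(X, hidden=10)
W2, b2 = train_autoencoder(sigmoid(X @ W1 + b1), hidden=5)
```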
Citations: 4
Product selection problem: improve market share by learning consumer behavior
Silei Xu, John C.S. Lui
Deciding which products to produce is often crucial for manufacturers seeking to increase their market share in an increasingly fierce market. To decide which products to produce, manufacturers need to analyze the consumers' requirements and how consumers make their purchase decisions so that the new products will be competitive in the market. In this paper, we first present a general distance-based product adoption model to capture consumers' purchase behavior. Using this model, various distance metrics can be used to describe different real-life purchase behaviors. We then provide a learning algorithm to decide which set of distance metrics one should use when we are given some historical purchase data. Based on the product adoption model, we formalize the k most marketable products (or k-MMP) selection problem and formally prove that the problem is NP-hard. To tackle this problem, we propose an efficient greedy-based approximation algorithm with a provable solution guarantee. Using submodularity analysis, we prove that our approximation algorithm can achieve at least 63% of the optimal solution. We apply our algorithm on both synthetic datasets and real-world datasets (TripAdvisor.com), and show that our algorithm can easily achieve a speedup of five or more orders of magnitude over exhaustive search while achieving about 96% of the optimal solution on average. Our experiments also show the significant impact of different distance metrics on the results, and how proper distance metrics can improve the accuracy of product selection.
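A minimal sketch of how such a greedy selection could look, assuming a plain Euclidean distance in place of the paper's learned distance metrics and a nearest-product adoption rule; all data and names here are hypothetical.

```python
import numpy as np

def market_share(products, consumers, competitors):
    """Fraction of consumers whose closest offering is one of ours.

    A Euclidean instance of a distance-based adoption model: each
    consumer adopts the single nearest product in feature space.
    """
    wins = 0
    for c in consumers:
        ours = min(np.linalg.norm(c - p) for p in products) if products else np.inf
        theirs = min(np.linalg.norm(c - q) for q in competitors)
        wins += ours < theirs
    return wins / len(consumers)

def greedy_kmmp(candidates, consumers, competitors, k):
    """Greedily pick the k most marketable products."""
    chosen = []
    for _ in range(k):
        best = max(candidates,
                   key=lambda p: market_share(chosen + [p], consumers, competitors))
        chosen.append(best)
        candidates = [p for p in candidates if p is not best]
    return chosen

rng = np.random.default_rng(0)
consumers = list(rng.random((50, 2)))       # consumer requirement points
competitors = list(rng.random((5, 2)))      # products already on the market
candidates = list(rng.random((10, 2)))      # products we could launch
print(greedy_kmmp(candidates, consumers, competitors, k=3))
```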
Citations: 8
LASTA: large scale topic assignment on multiple social networks
Nemanja Spasojevic, Jinyun Yan, Adithya Rao, Prantik Bhattacharyya
Millions of people use social networks every day to talk about a variety of subjects, publish opinions, and share information. Understanding this data to infer users' topical interests is a challenging problem with applications in various data-powered products. In this paper, we present 'LASTA' (Large Scale Topic Assignment), a full production system used at Klout, Inc., which mines topical interests from five social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis. The system continuously collects streams of user data and is reactive to fresh information, updating topics for users as interests shift. LASTA generates over 50 distinct features derived from signals such as user-generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags, and endorsements, as well as signals based on social graph connections. We show that using this diverse set of features leads to a better representation of a user's topical interests as compared to using only generated text or only graph-based features. We also show that using cross-network information for a user leads to a more complete and accurate understanding of the user's topics, as compared to using any single network. We evaluate LASTA's topic assignment system on an internal labeled corpus of 32,264 user-topic labels generated from real users.
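A simplified sketch of the cross-network aggregation idea, assuming hypothetical per-network topic counts and hand-picked network weights; the real system derives over 50 features and learns how to combine them.

```python
from collections import Counter, defaultdict

def assign_topics(user_signals, network_weights, top_n=10):
    """Fuse per-network topic evidence into one ranked topic list.

    `user_signals` maps network name -> Counter of topic mentions
    mined from posts, reactions, and attributions on that network;
    all names and weights here are hypothetical.
    """
    scores = defaultdict(float)
    for network, counts in user_signals.items():
        weight = network_weights.get(network, 1.0)
        total = sum(counts.values()) or 1
        for topic, count in counts.items():
            scores[topic] += weight * count / total   # per-network normalization
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

signals = {"twitter": Counter({"machine-learning": 5, "nba": 2}),
           "linkedin": Counter({"machine-learning": 3})}
print(assign_topics(signals, {"twitter": 0.6, "linkedin": 1.0}))
```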
Citations: 34
FoodSIS: a text mining system to improve the state of food safety in Singapore
K. Kate, S. Chaudhari, A. Prapanca, J. Kalagnanam
Food safety is an important health issue in Singapore, as the number of food poisoning cases has increased significantly over the past few decades. The National Environment Agency of Singapore (NEA) is the primary government agency responsible for monitoring and mitigating food safety risks. In an effort to proactively monitor emerging food safety issues and to stay abreast of developments related to food safety around the world, NEA tracks the World Wide Web as a source of news feeds to identify food safety related articles. However, such information gathering is a difficult and time-consuming process due to information overload. In this paper, we present FoodSIS, a system for end-to-end web information gathering for food safety. FoodSIS improves the efficiency of this focused information-gathering process by using machine learning techniques to identify and rank relevant content. We discuss the challenges in building such a system and describe how thoughtful system design and recent advances in machine learning provide a framework that synthesizes interactive learning with classification to yield a system that is used in daily operations. We conduct experiments and demonstrate that our classification approach improves efficiency by 35% on average compared to a conventional approach, and that the ranking approach yields an average 16% improvement in elevating the ranks of relevant articles.
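The abstract does not spell out the system's feature set or learner, but a minimal relevance classifier of the kind such a pipeline could build on might look as follows, assuming scikit-learn, TF-IDF features, and logistic regression; the documents and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set: 1 = food-safety relevant, 0 = not.
docs = ["salmonella outbreak traced to raw eggs",
        "new hawker centre opens downtown",
        "listeria recall of packaged salads",
        "football match ends in a draw"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(docs, labels)

# Rank unseen articles by predicted relevance, highest first.
new_docs = ["e. coli found in imported beef", "weather stays sunny"]
ranked = sorted(new_docs, key=lambda d: clf.predict_proba([d])[0, 1], reverse=True)
print(ranked)
```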
Citations: 12
Activity-edge centric multi-label classification for mining heterogeneous information networks
Yang Zhou, Ling Liu
Multi-label classification of heterogeneous information networks has received renewed attention in social network analysis. In this paper, we present an activity-edge centric multi-label classification framework for analyzing heterogeneous information networks with three unique features. First, we model a heterogeneous information network in terms of a collaboration graph and multiple associated activity graphs. We introduce a novel concept of vertex-edge homophily in terms of both vertex labels and edge labels and transform a general collaboration graph into an activity-based collaboration multigraph by augmenting its edges with class labels from each activity graph through activity-based edge classification. Second, we utilize the label vicinity to capture the pairwise vertex closeness based on the labeling on the activity-based collaboration multigraph. We incorporate both the structure affinity and the label vicinity into a unified classifier to speed up the classification convergence. Third, we design an iterative learning algorithm, AEClass, to dynamically refine the classification result by continuously adjusting the weights on different activity-based edge classification schemes from multiple activity graphs, while constantly learning the contribution of the structure affinity and the label vicinity in the unified classifier. Extensive evaluation on real datasets demonstrates that AEClass outperforms existing representative methods in terms of both effectiveness and efficiency.
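A toy sketch of the edge-augmentation step, assuming the activity-based edge classifiers have already produced per-activity edge labels; the data structures and names are hypothetical simplifications of the paper's multigraph construction.

```python
from collections import defaultdict

def build_activity_multigraph(collab_edges, activity_edge_labels):
    """Augment collaboration edges with per-activity class labels.

    `activity_edge_labels` maps an activity name to a dict from edge
    (u, v) to the class label its activity-based edge classifier
    predicted -- a simplified stand-in for the paper's construction.
    """
    multigraph = defaultdict(list)
    for u, v in collab_edges:
        for activity, labels in activity_edge_labels.items():
            label = labels.get((u, v)) or labels.get((v, u))
            if label is not None:
                # One parallel edge per (activity, label) pair.
                multigraph[(u, v)].append((activity, label))
    return multigraph

edges = [("alice", "bob"), ("bob", "carol")]
labels = {"co-author": {("alice", "bob"): "ML"},
          "citation": {("bob", "carol"): "DB"}}
print(dict(build_activity_multigraph(edges, labels)))
```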
Citations: 40
FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning
Yashoteja Prabhu, M. Varma
The objective in extreme multi-label classification is to learn a classifier that can automatically tag a data point with the most relevant subset of labels from a large label set. Extreme multi-label classification is an important research problem since not only does it enable the tackling of applications with many labels but it also allows the reformulation of ranking problems with certain advantages over existing formulations. Our objective, in this paper, is to develop an extreme multi-label classifier that is faster to train and more accurate at prediction than the state-of-the-art Multi-label Random Forest (MLRF) algorithm [2] and the Label Partitioning for Sub-linear Ranking (LPSR) algorithm [35]. MLRF and LPSR learn a hierarchy to deal with the large number of labels but optimize task-independent measures, such as the Gini index or clustering error, in order to learn the hierarchy. Our proposed FastXML algorithm achieves significantly higher accuracies by directly optimizing an nDCG-based ranking loss function. We also develop an alternating minimization algorithm for efficiently optimizing the proposed formulation. Experiments reveal that FastXML can be trained on problems with more than a million labels on a standard desktop in eight hours using a single core and in an hour using multiple cores.
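For reference, the metric FastXML optimizes against: nDCG@k on a ranked list of graded relevances, computed here on a toy example (the metric itself, not the paper's loss or optimizer).

```python
import math

def ndcg_at_k(relevance, k):
    """nDCG@k for a ranked list of graded relevance judgments."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Relevance of the labels in predicted rank order vs. the ideal order.
print(ndcg_at_k([1, 0, 1, 1, 0], k=3))   # ~0.704
```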
Citations: 374
Streaming submodular maximization: massive data summarization on the fly
Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, Andreas Krause
How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. That is, we would like to select a subset of, say, k data points from the stream that is most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with a constant-factor (1/2 - ε) approximation guarantee relative to the optimum solution, requiring only a single pass through the data and memory independent of data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
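A single-threshold slice of the sieve idea, under simplifying assumptions: the paper's full algorithm runs many such sieves in parallel, one per geometric guess of the optimum value, and uses an adaptive threshold rather than the fixed `tau` here. The coverage objective in the toy usage is submodular, which is what justifies the thresholding.

```python
def sieve_pass(stream, f, k, tau):
    """One pass of a fixed-threshold sieve over the stream.

    Keeps an element iff its marginal gain under the submodular set
    function `f` beats `tau` and the budget k is not exhausted -- a
    simplification of the paper's adaptive-threshold sieves.
    """
    S = set()
    for x in stream:
        if len(S) < k and f(S | {x}) - f(S) >= tau:
            S.add(x)
    return S

# Toy coverage objective: the value of a set of "documents" is the
# number of distinct words covered (coverage functions are submodular).
docs = [frozenset({"flu", "vaccine"}), frozenset({"flu"}),
        frozenset({"vaccine", "mask"}), frozenset({"travel"})]
f = lambda S: len(frozenset().union(*S)) if S else 0
print(sieve_pass(docs, f, k=2, tau=1.0))
```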
Citations: 327
Scalable heterogeneous translated hashing
Ying Wei, Yangqiu Song, Yi Zhen, Bo Liu, Qiang Yang
Hashing has enjoyed great success in large-scale similarity search. Recently, researchers have studied multi-modal hashing to meet the need for similarity search across different types of media. However, most existing methods target search across multiple views for which explicit bridge information is provided. Given a heterogeneous media search task, we observe that abundant multi-view data can be found on the Web that can serve as an auxiliary bridge. In this paper, we propose a Heterogeneous Translated Hashing (HTH) method that incorporates such an auxiliary bridge, not only to improve current multi-view search but also to enable similarity search across heterogeneous media that have no direct correspondence. HTH simultaneously learns hash functions embedding heterogeneous media into different Hamming spaces, and translators aligning these spaces. Unlike almost all existing methods, which map heterogeneous data into a common Hamming space, mapping to different spaces provides greater flexibility and discriminative ability. We empirically verify the effectiveness and efficiency of our algorithm on two large real-world datasets: a publicly available Flickr dataset and the MIRFLICKR-Yahoo Answers dataset.
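A toy sketch of where a translator sits in the query path, assuming binary codes stored as integers and a bit-permutation standing in for HTH's learned translators; everything here is illustrative, not the paper's construction.

```python
def hamming(a, b):
    """Hamming distance between equal-length binary codes (as ints)."""
    return bin(a ^ b).count("1")

def translate(code, perm):
    """Toy 'translator': align two Hamming spaces by permuting bit
    positions. HTH learns real translators; this only shows where
    translation sits between the query's space and the database's."""
    return sum(((code >> i) & 1) << j for i, j in enumerate(perm))

# Query code from one modality, database codes from another.
query, db = 0b1011, [0b1010, 0b0101, 0b1111]
q = translate(query, perm=[1, 0, 3, 2])        # hypothetical alignment
best = min(db, key=lambda c: hamming(q, c))    # nearest neighbor by Hamming
print(format(best, "04b"))
```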
Citations: 35
Inside the atoms: ranking on a network of networks
Jingchao Ni, Hanghang Tong, Wei Fan, Xiang Zhang
Networks are prevalent and have posed many fascinating research questions. How can we spot similar users, e.g., virtual identical twins, in Cleveland for a New Yorker? Given a query disease, how can we prioritize its candidate genes by incorporating the tissue-specific protein interaction networks of those similar diseases? In most, if not all, of the existing network ranking methods, the nodes are the ranking objects with the finest granularity. In this paper, we propose a new network data model, a Network of Networks (NoN), where each node of the main network itself can be further represented as another (domain-specific) network. This new data model makes it possible to compare nodes in a broader context and rank them at a finer granularity. Moreover, such an NoN model enables much more efficient search when the ranking targets reside in a certain domain-specific network. We formulate ranking on an NoN as a regularized optimization problem, propose efficient algorithms, and provide theoretical analysis of optimality, convergence, complexity, and equivalence. Extensive experimental evaluations demonstrate the effectiveness and the efficiency of our methods.
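The paper formulates NoN ranking as a regularized optimization; a related building block one could compose per domain-specific network is random walk with restart, sketched below on a toy column-stochastic network. This is not the paper's actual formulation, only a familiar primitive in the same family.

```python
import numpy as np

def random_walk_with_restart(A, seed, alpha=0.85, iters=100):
    """Rank nodes of one network by random walk with restart.

    `A` is a column-stochastic adjacency matrix and `seed` the restart
    distribution (e.g., all mass on the query node).
    """
    r = seed.copy()
    for _ in range(iters):
        r = alpha * (A @ r) + (1 - alpha) * seed   # walk step + restart
    return r

A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])           # toy 3-node network, columns sum to 1
seed = np.array([1.0, 0.0, 0.0])          # query node 0
print(random_walk_with_restart(A, seed))  # ranking scores biased toward node 0
```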
Citations: 58