How can we optimize the topology of a networked system to bring a flu outbreak under control, propel a video to popularity, or stifle network malware in its infancy? Previous work on information diffusion has focused on modeling the diffusion dynamics and selecting nodes to maximize or minimize influence. Only a few recent studies have attempted to address network modification problems, where the goal is to either facilitate desirable spreads or curtail undesirable ones by adding or deleting a small subset of network nodes or edges. In this paper, we focus on the widely studied linear threshold diffusion model, and prove, for the first time, that the network modification problems under this model have supermodular objective functions. This surprising property allows us to design efficient data structures and scalable algorithms with provable approximation guarantees, despite the hardness of the problems in question. Both the time and space complexities of our algorithms are linear in the size of the network, which allows us to experiment with millions of nodes and edges. We show that our algorithms outperform an array of heuristics in terms of their effectiveness in controlling diffusion processes, often beating the next best by a significant margin.
{"title":"Scalable diffusion-aware optimization of network topology","authors":"Elias Boutros Khalil, B. Dilkina, Le Song","doi":"10.1145/2623330.2623704","DOIUrl":"https://doi.org/10.1145/2623330.2623704","url":null,"abstract":"How can we optimize the topology of a networked system to bring a flu under control, propel a video to popularity, or stifle a network malware in its infancy? Previous work on information diffusion has focused on modeling the diffusion dynamics and selecting nodes to maximize/minimize influence. Only a paucity of recent studies have attempted to address the network modification problems, where the goal is to either facilitate desirable spreads or curtail undesirable ones by adding or deleting a small subset of network nodes or edges. In this paper, we focus on the widely studied linear threshold diffusion model, and prove, for the first time, that the network modification problems under this model have supermodular objective functions. This surprising property allows us to design efficient data structures and scalable algorithms with provable approximation guarantees, despite the hardness of the problems in question. Both the time and space complexities of our algorithms are linear in the size of the network, which allows us to experiment with millions of nodes and edges. We show that our algorithms outperform an array of heuristics in terms of their effectiveness in controlling diffusion processes, often beating the next best by a significant margin.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86763196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Building intelligent systems that are capable of extracting high-level representations from high-dimensional data lies at the core of solving many AI-related tasks, including visual object or pattern recognition, speech perception, and language understanding. Theoretical and biological arguments strongly suggest that building such systems requires deep architectures that involve many layers of nonlinear processing. Many existing learning algorithms use shallow architectures, including neural networks with only one hidden layer, support vector machines, kernel logistic regression, and many others. The internal representations learned by such systems are necessarily simple and are incapable of extracting some types of complex structure from high-dimensional input. In the past few years, researchers across many different communities, from applied statistics to engineering, computer science, and neuroscience, have proposed several deep (hierarchical) models that are capable of extracting meaningful, high-level representations. An important property of these models is that they can extract complex statistical dependencies from data and efficiently learn high-level representations by re-using and combining intermediate concepts, allowing these models to generalize well across a wide variety of tasks. The learned high-level representations have been shown to give state-of-the-art results in many challenging learning problems and have been successfully applied in a wide variety of application domains, including visual object recognition, information retrieval, natural language processing, and speech perception. A few notable examples of such models include Deep Belief Networks, Deep Boltzmann Machines, Deep Autoencoders, and sparse coding-based methods. The goal of the tutorial is to introduce recent developments in deep learning methods to the KDD community. The core focus will be placed on algorithms that can learn multi-layer hierarchies of representations, emphasizing their applications in information retrieval, object recognition, and speech perception.
{"title":"Deep learning","authors":"R. Salakhutdinov","doi":"10.1145/2623330.2630809","DOIUrl":"https://doi.org/10.1145/2623330.2630809","url":null,"abstract":"Building intelligent systems that are capable of extracting high-level representations from high-dimensional data lies at the core of solving many AI related tasks, including visual object or pattern recognition, speech perception, and language understanding. Theoretical and biological arguments strongly suggest that building such systems requires deep architectures that involve many layers of nonlinear processing. Many existing learning algorithms use shallow architectures, including neural networks with only one hidden layer, support vector machines, kernel logistic regression, and many others. The internal representations learned by such systems are necessarily simple and are incapable of extracting some types of complex structure from high-dimensional input. In the past few years, researchers across many different communities, from applied statistics to engineering, computer science, and neuroscience, have proposed several deep (hierarchical) models that are capable of extracting meaningful, high-level representations. An important property of these models is that they can extract complex statistical dependencies from data and efficiently learn high-level representations by re-using and combining intermediate concepts, allowing these models to generalize well across a wide variety of tasks. The learned high-level representations have been shown to give state-of-the-art results in many challenging learning problems and have been successfully applied in a wide variety of application domains, including visual object recognition, information retrieval, natural language processing, and speech perception. A few notable examples of such models include Deep Belief Networks, Deep Boltzmann Machines, Deep Autoencoders, and sparse coding-based methods. The goal of the tutorial is to introduce the recent developments of various deep learning methods to the KDD community. The core focus will be placed on algorithms that can learn multi-layer hierarchies of representations, emphasizing their applications in information retrieval, object recognition, and speech perception.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86792510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
It is often crucial for manufacturers to decide which products to produce so that they can increase their market share in an increasingly competitive market. To decide which products to produce, manufacturers need to analyze consumers' requirements and how consumers make their purchase decisions, so that the new products will be competitive in the market. In this paper, we first present a general distance-based product adoption model to capture consumers' purchase behavior. Under this model, various distance metrics can be used to describe different real-life purchase behaviors. We then provide a learning algorithm to decide which set of distance metrics to use given historical purchase data. Based on the product adoption model, we formalize the k most marketable products (or k-MMP) selection problem and formally prove that it is NP-hard. To tackle this problem, we propose an efficient greedy approximation algorithm with a provable solution guarantee. Using submodularity analysis, we prove that our approximation algorithm achieves at least 63% of the optimal solution. We apply our algorithm to both synthetic datasets and real-world datasets (TripAdvisor.com), and show that it achieves a speedup of five or more orders of magnitude over exhaustive search while attaining about 96% of the optimal solution on average. Our experiments also show the significant impact of different distance metrics on the results, and how proper distance metrics can improve the accuracy of product selection.
{"title":"Product selection problem: improve market share by learning consumer behavior","authors":"Silei Xu, John C.S. Lui","doi":"10.1145/2623330.2623692","DOIUrl":"https://doi.org/10.1145/2623330.2623692","url":null,"abstract":"It is often crucial for manufacturers to decide what products to produce so that they can increase their market share in an increasingly fierce market. To decide which products to produce, manufacturers need to analyze the consumers' requirements and how consumers make their purchase decisions so that the new products will be competitive in the market. In this paper, we first present a general distance-based product adoption model to capture consumers' purchase behavior. Using this model, various distance metrics can be used to describe different real life purchase behavior. We then provide a learning algorithm to decide which set of distance metrics one should use when we are given some historical purchase data. Based on the product adoption model, we formalize the k most marketable products (or k-MMP) selection problem and formally prove that the problem is NP-hard. To tackle this problem, we propose an efficient greedy-based approximation algorithm with a provable solution guarantee. Using submodularity analysis, we prove that our approximation algorithm can achieve at least 63% of the optimal solution. We apply our algorithm on both synthetic datasets and real-world datasets (TripAdvisor.com), and show that our algorithm can easily achieve five or more orders of speedup over the exhaustive search and achieve about 96% of the optimal solution on average. Our experiments also show the significant impact of different distance metrics on the results, and how proper distance metrics can improve the accuracy of product selection.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"128 11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85079616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Millions of people use social networks every day to talk about a variety of subjects, publish opinions, and share information. Understanding this data to infer users' topical interests is a challenging problem with applications in various data-powered products. In this paper, we present LASTA (Large Scale Topic Assignment), a full production system used at Klout, Inc., which mines topical interests from five social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis. The system continuously collects streams of user data and reacts to fresh information, updating topics for users as their interests shift. LASTA generates over 50 distinct features derived from signals such as user-generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags, and endorsements, as well as signals based on social-graph connections. We show that using this diverse set of features leads to a better representation of a user's topical interests than using only generated text or only graph-based features. We also show that using cross-network information for a user leads to a more complete and accurate understanding of the user's topics than using any single network. We evaluate LASTA's topic assignment system on an internal labeled corpus of 32,264 user-topic labels generated from real users.
{"title":"LASTA: large scale topic assignment on multiple social networks","authors":"Nemanja Spasojevic, Jinyun Yan, Adithya Rao, Prantik Bhattacharyya","doi":"10.1145/2623330.2623350","DOIUrl":"https://doi.org/10.1145/2623330.2623350","url":null,"abstract":"Millions of people use social networks everyday to talk about a variety of subjects, publish opinions and share information. Understanding this data to infer user's topical interests is a challenging problem with applications in various data-powered products. In this paper, we present 'LASTA' (Large Scale Topic Assignment), a full production system used at Klout, Inc., which mines topical interests from five social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis. The system continuously collects streams of user data and is reactive to fresh information, updating topics for users as interests shift. LASTA generates over 50 distinct features derived from signals such as user generated posts and profiles, user reactions such as comments and retweets, user attributions such as lists, tags and endorsements, as well as signals based on social graph connections. We show that using this diverse set of features leads to a better representation of a user's topical interests as compared to using only generated text or only graph based features. We also show that using cross-network information for a user leads to a more complete and accurate understanding of the user's topics, as compared to using any single network. We evaluate LASTA's topic assignment system on an internal labeled corpus of 32,264 user-topic labels generated from real users.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90470930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Food safety is an important health issue in Singapore, as the number of food poisoning cases has increased significantly over the past few decades. The National Environment Agency of Singapore (NEA) is the primary government agency responsible for monitoring and mitigating food safety risks. In an effort to proactively monitor emerging food safety issues and to stay abreast of developments related to food safety worldwide, NEA tracks the World Wide Web as a source of news feeds to identify food-safety-related articles. However, such information gathering is a difficult and time-consuming process due to information overload. In this paper, we present FoodSIS, a system for end-to-end web information gathering for food safety. FoodSIS improves the efficiency of this focused information-gathering process by using machine learning techniques to identify and rank relevant content. We discuss the challenges in building such a system and describe how thoughtful system design and recent advances in machine learning provide a framework that synthesizes interactive learning with classification, yielding a system that is used in daily operations. Our experiments demonstrate that the classification approach improves efficiency by 35% on average compared to a conventional approach, and that the ranking approach yields a 16% average improvement in elevating the ranks of relevant articles.
{"title":"FoodSIS: a text mining system to improve the state of food safety in singapore","authors":"K. Kate, S. Chaudhari, A. Prapanca, J. Kalagnanam","doi":"10.1145/2623330.2623369","DOIUrl":"https://doi.org/10.1145/2623330.2623369","url":null,"abstract":"Food safety is an important health issue in Singapore as the number of food poisoning cases have increased significantly over the past few decades. The National Environment Agency of Singapore (NEA) is the primary government agency responsible for monitoring and mitigating the food safety risks. In an effort to pro-actively monitor emerging food safety issues and to stay abreast with developments related to food safety in the world, NEA tracks the World Wide Web as a source of news feeds to identify food safety related articles. However, such information gathering is a difficult and time consuming process due to information overload. In this paper, we present FoodSIS, a system for end-to-end web information gathering for food safety. FoodSIS improves efficiency of such focused information gathering process with the use of machine learning techniques to identify and rank relevant content. We discuss the challenges in building such a system and describe how thoughtful system design and recent advances in machine learning provide a framework that synthesizes interactive learning with classification to provide a system that is used in daily operations. We conduct experiments and demonstrate that our classification approach results in improving the efficiency by average 35% compared to a conventional approach and the ranking approach leads to average 16% improvement in elevating the ranks of relevant articles.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90548942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-label classification of heterogeneous information networks has received renewed attention in social network analysis. In this paper, we present an activity-edge centric multi-label classification framework for analyzing heterogeneous information networks with three unique features. First, we model a heterogeneous information network in terms of a collaboration graph and multiple associated activity graphs. We introduce a novel concept of vertex-edge homophily in terms of both vertex labels and edge labels and transform a general collaboration graph into an activity-based collaboration multigraph by augmenting its edges with class labels from each activity graph through activity-based edge classification. Second, we utilize the label vicinity to capture the pairwise vertex closeness based on the labeling on the activity-based collaboration multigraph. We incorporate both the structure affinity and the label vicinity into a unified classifier to speed up the classification convergence. Third, we design an iterative learning algorithm, AEClass, to dynamically refine the classification result by continuously adjusting the weights on different activity-based edge classification schemes from multiple activity graphs, while constantly learning the contribution of the structure affinity and the label vicinity in the unified classifier. Extensive evaluation on real datasets demonstrates that AEClass outperforms existing representative methods in terms of both effectiveness and efficiency.
{"title":"Activity-edge centric multi-label classification for mining heterogeneous information networks","authors":"Yang Zhou, Ling Liu","doi":"10.1145/2623330.2623737","DOIUrl":"https://doi.org/10.1145/2623330.2623737","url":null,"abstract":"Multi-label classification of heterogeneous information networks has received renewed attention in social network analysis. In this paper, we present an activity-edge centric multi-label classification framework for analyzing heterogeneous information networks with three unique features. First, we model a heterogeneous information network in terms of a collaboration graph and multiple associated activity graphs. We introduce a novel concept of vertex-edge homophily in terms of both vertex labels and edge labels and transform a general collaboration graph into an activity-based collaboration multigraph by augmenting its edges with class labels from each activity graph through activity-based edge classification. Second, we utilize the label vicinity to capture the pairwise vertex closeness based on the labeling on the activity-based collaboration multigraph. We incorporate both the structure affinity and the label vicinity into a unified classifier to speed up the classification convergence. Third, we design an iterative learning algorithm, AEClass, to dynamically refine the classification result by continuously adjusting the weights on different activity-based edge classification schemes from multiple activity graphs, while constantly learning the contribution of the structure affinity and the label vicinity in the unified classifier. Extensive evaluation on real datasets demonstrates that AEClass outperforms existing representative methods in terms of both effectiveness and efficiency.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81310475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The objective in extreme multi-label classification is to learn a classifier that can automatically tag a data point with the most relevant subset of labels from a large label set. Extreme multi-label classification is an important research problem since not only does it enable tackling applications with many labels, but it also allows the reformulation of ranking problems with certain advantages over existing formulations. Our objective in this paper is to develop an extreme multi-label classifier that is faster to train and more accurate at prediction than the state-of-the-art Multi-label Random Forest (MLRF) algorithm [2] and the Label Partitioning for Sub-linear Ranking (LPSR) algorithm [35]. MLRF and LPSR learn a hierarchy to deal with the large number of labels but optimize task-independent measures, such as the Gini index or clustering error, in order to learn the hierarchy. Our proposed FastXML algorithm achieves significantly higher accuracies by directly optimizing an nDCG-based ranking loss function. We also develop an alternating minimization algorithm for efficiently optimizing the proposed formulation. Experiments reveal that FastXML can be trained on problems with more than a million labels on a standard desktop in eight hours using a single core and in an hour using multiple cores.
{"title":"FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning","authors":"Yashoteja Prabhu, M. Varma","doi":"10.1145/2623330.2623651","DOIUrl":"https://doi.org/10.1145/2623330.2623651","url":null,"abstract":"The objective in extreme multi-label classification is to learn a classifier that can automatically tag a data point with the most relevant subset of labels from a large label set. Extreme multi-label classification is an important research problem since not only does it enable the tackling of applications with many labels but it also allows the reformulation of ranking problems with certain advantages over existing formulations. Our objective, in this paper, is to develop an extreme multi-label classifier that is faster to train and more accurate at prediction than the state-of-the-art Multi-label Random Forest (MLRF) algorithm [2] and the Label Partitioning for Sub-linear Ranking (LPSR) algorithm [35]. MLRF and LPSR learn a hierarchy to deal with the large number of labels but optimize task independent measures, such as the Gini index or clustering error, in order to learn the hierarchy. Our proposed FastXML algorithm achieves significantly higher accuracies by directly optimizing an nDCG based ranking loss function. We also develop an alternating minimization algorithm for efficiently optimizing the proposed formulation. Experiments reveal that FastXML can be trained on problems with more than a million labels on a standard desktop in eight hours using a single core and in an hour using multiple cores.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"30 5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82999840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. That is, we would like to select a subset of, say, k data points from the stream that are most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with a constant-factor (1/2 - ε) approximation guarantee to the optimum solution, requiring only a single pass through the data and memory independent of the data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
{"title":"Streaming submodular maximization: massive data summarization on the fly","authors":"Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, Andreas Krause","doi":"10.1145/2623330.2623637","DOIUrl":"https://doi.org/10.1145/2623330.2623637","url":null,"abstract":"How can one summarize a massive data set \"on the fly\", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. I.e., we would like to select a subset of say k data points from the stream that are most representative according to some objective function. Many natural notions of \"representativeness\" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with constant factor 1/2-ε approximation guarantee to the optimum solution, requiring only a single pass through the data, and memory independent of data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"269 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83483818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hashing has enjoyed great success in large-scale similarity search. Recently, researchers have studied multi-modal hashing to meet the need for similarity search across different types of media. However, most existing methods search across multiple views for which explicit bridge information is provided. Given a heterogeneous media search task, we observe that abundant multi-view data can be found on the Web and can serve as an auxiliary bridge. In this paper, we propose a Heterogeneous Translated Hashing (HTH) method that incorporates such an auxiliary bridge not only to improve multi-view search but also to enable similarity search across heterogeneous media that have no direct correspondence. HTH simultaneously learns hash functions embedding heterogeneous media into different Hamming spaces, and translators aligning these spaces. Unlike almost all existing methods, which map heterogeneous data into a common Hamming space, mapping to different spaces provides greater flexibility and discriminative power. We empirically verify the effectiveness and efficiency of our algorithm on two large real-world datasets: a publicly available Flickr dataset and a MIRFLICKR-Yahoo Answers dataset.
{"title":"Scalable heterogeneous translated hashing","authors":"Ying Wei, Yangqiu Song, Yi Zhen, Bo Liu, Qiang Yang","doi":"10.1145/2623330.2623688","DOIUrl":"https://doi.org/10.1145/2623330.2623688","url":null,"abstract":"Hashing has enjoyed a great success in large-scale similarity search. Recently, researchers have studied the multi-modal hashing to meet the need of similarity search across different types of media. However, most of the existing methods are applied to search across multi-views among which explicit bridge information is provided. Given a heterogeneous media search task, we observe that abundant multi-view data can be found on the Web which can serve as an auxiliary bridge. In this paper, we propose a Heterogeneous Translated Hashing (HTH) method with such auxiliary bridge incorporated not only to improve current multi-view search but also to enable similarity search across heterogeneous media which have no direct correspondence. HTH simultaneously learns hash functions embedding heterogeneous media into different Hamming spaces, and translators aligning these spaces. Unlike almost all existing methods that map heterogeneous data in a common Hamming space, mapping to different spaces provides more flexible and discriminative ability. We empirically verify the effectiveness and efficiency of our algorithm on two real world large datasets, one publicly available dataset of Flickr and the other MIRFLICKR-Yahoo Answers dataset.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89389650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Networks are prevalent and have posed many fascinating research questions. How can we spot similar users, e.g., virtual identical twins, in Cleveland for a New Yorker? Given a query disease, how can we prioritize its candidate genes by incorporating the tissue-specific protein interaction networks of similar diseases? In most, if not all, existing network ranking methods, the nodes are the ranking objects at the finest granularity. In this paper, we propose a new network data model, a Network of Networks (NoN), where each node of the main network can itself be represented as another (domain-specific) network. This new data model makes it possible to compare nodes in a broader context and rank them at a finer granularity. Moreover, the NoN model enables much more efficient search when the ranking targets reside in a certain domain-specific network. We formulate ranking on an NoN as a regularized optimization problem, propose efficient algorithms, and provide theoretical analysis of their optimality, convergence, complexity, and equivalence. Extensive experimental evaluations demonstrate the effectiveness and efficiency of our methods.
{"title":"Inside the atoms: ranking on a network of networks","authors":"Jingchao Ni, Hanghang Tong, Wei Fan, Xiang Zhang","doi":"10.1145/2623330.2623643","DOIUrl":"https://doi.org/10.1145/2623330.2623643","url":null,"abstract":"Networks are prevalent and have posed many fascinating research questions. How can we spot similar users, e.g., virtual identical twins, in Cleveland for a New Yorker? Given a query disease, how can we prioritize its candidate genes by incorporating the tissue-specific protein interaction networks of those similar diseases? In most, if not all, of the existing network ranking methods, the nodes are the ranking objects with the finest granularity. In this paper, we propose a new network data model, a Network of Networks (NoN), where each node of the main network itself can be further represented as another (domain-specific) network. This new data model enables to compare the nodes in a broader context and rank them at a finer granularity. Moreover, such an NoN model enables much more efficient search when the ranking targets reside in a certain domain-specific network. We formulate ranking on NoN as a regularized optimization problem; propose efficient algorithms and provide theoretical analysis, such as optimality, convergence, complexity and equivalence. Extensive experimental evaluations demonstrate the effectiveness and the efficiency of our methods.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89882891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}