Clustering offers significant insights in data analysis. Density based algorithms have emerged as flexible and efficient techniques, able to discover high-quality and potentially irregularly shaped- clusters. We present two fast density-based clustering algorithms based on random projections. Both algorithms demonstrate one to two orders of magnitude speedup compared to equivalent state-of-art density based techniques, even for modest-size datasets. We give a comprehensive analysis of both our algorithms and show runtime of O(dNlog2 N), for a d-dimensional dataset. Our first algorithm can be viewed as a fast variant of the OPTICS density-based algorithm, but using a softer definition of density combined with sampling. The second algorithm is parameter-less, and identifies areas separating clusters.
{"title":"Fast parameterless density-based clustering via random projections","authors":"Johannes Schneider, M. Vlachos","doi":"10.1145/2505515.2505590","DOIUrl":"https://doi.org/10.1145/2505515.2505590","url":null,"abstract":"Clustering offers significant insights in data analysis. Density based algorithms have emerged as flexible and efficient techniques, able to discover high-quality and potentially irregularly shaped- clusters. We present two fast density-based clustering algorithms based on random projections. Both algorithms demonstrate one to two orders of magnitude speedup compared to equivalent state-of-art density based techniques, even for modest-size datasets. We give a comprehensive analysis of both our algorithms and show runtime of O(dNlog2 N), for a d-dimensional dataset. Our first algorithm can be viewed as a fast variant of the OPTICS density-based algorithm, but using a softer definition of density combined with sampling. The second algorithm is parameter-less, and identifies areas separating clusters.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75102743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although, many applications use unstructured meshes, there is no specialized mesh database which supports storing and querying mesh data. Existing mesh libraries do not support declarative querying and are expensive to maintain. A mesh database can benefit the domains in several ways such as: declarative query language, ease of maintenance, etc. In this paper, we propose the Incidence multi-Graph Complex (ImG-Complex) data model for storing topological aspects of meshes in a database. ImG-Complex extends incidence graph (IG) model with multi-incidence information to represent a new object class which we call ImG-Complexes. We introduce optional and application-specific constraints to limit the ImG model to smaller object classes and validate mesh structures based on the modeled object class properties. We show how Neo4j graph database can be used to query mesh topology based on the (possibly constrained) ImG model. Finally, we experiment Neo4j and PostgreSQL performance on executing topological mesh queries.
{"title":"ImG-complex: graph data model for topology of unstructured meshes","authors":"Alireza Rezaei Mahdiraji, P. Baumann, G. Berti","doi":"10.1145/2505515.2505733","DOIUrl":"https://doi.org/10.1145/2505515.2505733","url":null,"abstract":"Although, many applications use unstructured meshes, there is no specialized mesh database which supports storing and querying mesh data. Existing mesh libraries do not support declarative querying and are expensive to maintain. A mesh database can benefit the domains in several ways such as: declarative query language, ease of maintenance, etc. In this paper, we propose the Incidence multi-Graph Complex (ImG-Complex) data model for storing topological aspects of meshes in a database. ImG-Complex extends incidence graph (IG) model with multi-incidence information to represent a new object class which we call ImG-Complexes. We introduce optional and application-specific constraints to limit the ImG model to smaller object classes and validate mesh structures based on the modeled object class properties. We show how Neo4j graph database can be used to query mesh topology based on the (possibly constrained) ImG model. Finally, we experiment Neo4j and PostgreSQL performance on executing topological mesh queries.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75561708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The problem of aligning ontologies and database schemas across different knowledge bases and databases is fundamental to knowledge management problems, including the problem of integrating the disparate knowledge sources that form the semantic web's Linked Data [5]. We present a novel approach to this ontology alignment problem that employs a very large natural language text corpus as an interlingua to relate different knowledge bases (KBs). The result is a scalable and robust method (PIDGIN) that aligns relations and categories across different KBs by analyzing both (1) shared relation instances across these KBs, and (2) the verb phrases in the text instantiations of these relation instances. Experiments with PIDGIN demonstrate its superior performance when aligning ontologies across large existing KBs including NELL, Yago and Freebase. Furthermore, we show that in addition to aligning ontologies, PIDGIN can automatically learn from text, the verb phrases to identify relations, and can also type the arguments of relations of different KBs.
{"title":"PIDGIN: ontology alignment using web text as interlingua","authors":"D. Wijaya, P. Talukdar, Tom Michael Mitchell","doi":"10.1145/2505515.2505559","DOIUrl":"https://doi.org/10.1145/2505515.2505559","url":null,"abstract":"The problem of aligning ontologies and database schemas across different knowledge bases and databases is fundamental to knowledge management problems, including the problem of integrating the disparate knowledge sources that form the semantic web's Linked Data [5]. We present a novel approach to this ontology alignment problem that employs a very large natural language text corpus as an interlingua to relate different knowledge bases (KBs). The result is a scalable and robust method (PIDGIN) that aligns relations and categories across different KBs by analyzing both (1) shared relation instances across these KBs, and (2) the verb phrases in the text instantiations of these relation instances. Experiments with PIDGIN demonstrate its superior performance when aligning ontologies across large existing KBs including NELL, Yago and Freebase. Furthermore, we show that in addition to aligning ontologies, PIDGIN can automatically learn from text, the verb phrases to identify relations, and can also type the arguments of relations of different KBs.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73082021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, due to ubiquitous data uncertainty in many real-life applications, it has become increasingly important to study efficient and effective processing of various probabilistic queries over uncertain data, which usually retrieve uncertain objects that satisfy query predicates with high probabilities. However, one annoying, yet challenging, problem is that, some probabilistic queries are very sensitive to low-quality objects in uncertain databases, and the returned query answers might miss some important results (due to low data quality). To identify both accurate query answers and those potentially low-quality objects, in this paper, we investigate the causes of query answers/non-answers from a novel angle of causality and responsibility (CR), and propose a new interpretation of probabilistic queries. Particularly, we focus on the problem of CR-based probabilistic nearest neighbor (CR-PNN) query, and design a general framework for answering CR-based queries (including CR-PNN), which can return both query answers with high confidences and low-quality objects that may potentially affect query results (for data cleaning purposes). To efficiently process CR-PNN queries, we propose effective pruning strategies to quickly filter out false alarms, and design efficient algorithms to obtain CR-PNN answers. Extensive experiments have been conducted to verify the efficiency and effectiveness of our proposed approaches.
{"title":"Causality and responsibility: probabilistic queries revisited in uncertain databases","authors":"Xiang Lian, Lei Chen","doi":"10.1145/2505515.2505754","DOIUrl":"https://doi.org/10.1145/2505515.2505754","url":null,"abstract":"Recently, due to ubiquitous data uncertainty in many real-life applications, it has become increasingly important to study efficient and effective processing of various probabilistic queries over uncertain data, which usually retrieve uncertain objects that satisfy query predicates with high probabilities. However, one annoying, yet challenging, problem is that, some probabilistic queries are very sensitive to low-quality objects in uncertain databases, and the returned query answers might miss some important results (due to low data quality). To identify both accurate query answers and those potentially low-quality objects, in this paper, we investigate the causes of query answers/non-answers from a novel angle of causality and responsibility (CR), and propose a new interpretation of probabilistic queries. Particularly, we focus on the problem of CR-based probabilistic nearest neighbor (CR-PNN) query, and design a general framework for answering CR-based queries (including CR-PNN), which can return both query answers with high confidences and low-quality objects that may potentially affect query results (for data cleaning purposes). To efficiently process CR-PNN queries, we propose effective pruning strategies to quickly filter out false alarms, and design efficient algorithms to obtain CR-PNN answers. Extensive experiments have been conducted to verify the efficiency and effectiveness of our proposed approaches.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74964086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conversational Recommendation mimics the kind of dialog that takes between a customer and a shopkeeper involving multiple interactions where the user can give feedback at every interaction as opposed to Single Shot Retrieval, which corresponds to a scheme where the system retrieves a set of items in response to a user query in a single interaction. Compromise refers to a particular user preference which the recommender system failed to satisfy. But in the context of conversational systems, where the user's preferences keep on evolving as she interacts with the system, what constitutes as a compromise for her also keeps on changing. Typically, in Single Shot retrieval, the notion of compromise is characterized by the assignment of a particular feature to a particular dominance group such as MIB (higher value is better) or LIB (lower value is better) and this assignment remains true for all the users who use the system. In this paper, we propose a way to realize the notion of compromise in a conversational setting. Our approach, Flexi-Comp, introduces the notion of dynamically assigning a feature to two dominance groups simultaneously which is then used to redefine the notion of compromise. We show experimentally that a utility function based on this notion of compromise outperforms the existing conversational recommenders in terms of recommendation efficiency.
{"title":"Flexible and dynamic compromises for effective recommendations","authors":"Saurabh Gupta, Sutanu Chakraborti","doi":"10.1145/2505515.2507893","DOIUrl":"https://doi.org/10.1145/2505515.2507893","url":null,"abstract":"Conversational Recommendation mimics the kind of dialog that takes between a customer and a shopkeeper involving multiple interactions where the user can give feedback at every interaction as opposed to Single Shot Retrieval, which corresponds to a scheme where the system retrieves a set of items in response to a user query in a single interaction. Compromise refers to a particular user preference which the recommender system failed to satisfy. But in the context of conversational systems, where the user's preferences keep on evolving as she interacts with the system, what constitutes as a compromise for her also keeps on changing. Typically, in Single Shot retrieval, the notion of compromise is characterized by the assignment of a particular feature to a particular dominance group such as MIB (higher value is better) or LIB (lower value is better) and this assignment remains true for all the users who use the system. In this paper, we propose a way to realize the notion of compromise in a conversational setting. Our approach, Flexi-Comp, introduces the notion of dynamically assigning a feature to two dominance groups simultaneously which is then used to redefine the notion of compromise. We show experimentally that a utility function based on this notion of compromise outperforms the existing conversational recommenders in terms of recommendation efficiency.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72803530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider in this paper top-k query answering in social applications, with a focus on social tagging. This problem requires a significant departure from socially agnostic techniques. In a network- aware context, one can (and should) exploit the social links, which can indicate how users relate to the seeker and how much weight their tagging actions should have in the result build-up. We propose algorithms that have the potential to scale to current applications. While the problem has already been considered in previous literature, this was done either under strong simplifying assumptions or under choices that cannot scale to even moderate-size real-world applications. We first revisit a key aspect of the problem, which is accessing the closest or most relevant users for a given seeker. We describe how this can be done on the fly (without any pre- computations) for several possible choices -- arguably the most natural ones -- of proximity computation in a user network. Based on this, our top-k algorithm is sound and complete, addressing the applicability issues of the existing ones. Moreover, it performs significantly better in general and is instance optimal in the case when the search relies exclusively on the social weight of tagging actions. To further address the efficiency needs of online applications, for which the exact search, albeit optimal, may still be expensive, we then consider approximate algorithms. Specifically, these rely on concise statistics about the social network or on approximate shortest-paths computations. Extensive experiments on real-world data from Twitter show that our techniques can drastically improve response time, without sacrificing precision.
{"title":"Network-aware search in social tagging applications: instance optimality versus efficiency","authors":"S. Maniu, Bogdan Cautis","doi":"10.1145/2505515.2505760","DOIUrl":"https://doi.org/10.1145/2505515.2505760","url":null,"abstract":"We consider in this paper top-k query answering in social applications, with a focus on social tagging. This problem requires a significant departure from socially agnostic techniques. In a network- aware context, one can (and should) exploit the social links, which can indicate how users relate to the seeker and how much weight their tagging actions should have in the result build-up. We propose algorithms that have the potential to scale to current applications. While the problem has already been considered in previous literature, this was done either under strong simplifying assumptions or under choices that cannot scale to even moderate-size real-world applications. We first revisit a key aspect of the problem, which is accessing the closest or most relevant users for a given seeker. We describe how this can be done on the fly (without any pre- computations) for several possible choices -- arguably the most natural ones -- of proximity computation in a user network. Based on this, our top-k algorithm is sound and complete, addressing the applicability issues of the existing ones. Moreover, it performs significantly better in general and is instance optimal in the case when the search relies exclusively on the social weight of tagging actions. To further address the efficiency needs of online applications, for which the exact search, albeit optimal, may still be expensive, we then consider approximate algorithms. Specifically, these rely on concise statistics about the social network or on approximate shortest-paths computations. Extensive experiments on real-world data from Twitter show that our techniques can drastically improve response time, without sacrificing precision.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73961550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Paul Seitlinger, Dominik Kowald, C. Trattner, Tobias Ley
When interacting with social tagging systems, humans exercise complex processes of categorization that have been the topic of much research in cognitive science. In this paper we present a recommender approach for social tags derived from ALCOVE, a model of human category learning. The basic architecture is a simple three-layers connectionist model. The input layer encodes patterns of semantic features of a user-specific resource, such as latent topics elicited through Latent Dirichlet Allocation (LDA) or available external categories. The hidden layer categorizes the resource by matching the encoded pattern against already learned exemplar patterns. The latter are composed of unique feature patterns and associated tag distributions. Finally, the output layer samples tags from the associated tag distributions to verbalize the preceding categorization process. We have evaluated this approach on a real-world folksonomy gathered from Wikipedia bookmarks in Delicious. In the experiment our approach outperformed LDA, a well-established algorithm. We attribute this to the fact that our approach processes semantic information (either latent topics or external categories) across the three different layers. With this paper, we demonstrate that a theoretically guided design of algorithms not only holds potential for improving existing recommendation mechanisms, but it also allows us to derive more generalizable insights about how human information interaction on the Web is determined by both semantic and verbal processes.
{"title":"Recommending tags with a model of human categorization","authors":"Paul Seitlinger, Dominik Kowald, C. Trattner, Tobias Ley","doi":"10.1145/2505515.2505625","DOIUrl":"https://doi.org/10.1145/2505515.2505625","url":null,"abstract":"When interacting with social tagging systems, humans exercise complex processes of categorization that have been the topic of much research in cognitive science. In this paper we present a recommender approach for social tags derived from ALCOVE, a model of human category learning. The basic architecture is a simple three-layers connectionist model. The input layer encodes patterns of semantic features of a user-specific resource, such as latent topics elicited through Latent Dirichlet Allocation (LDA) or available external categories. The hidden layer categorizes the resource by matching the encoded pattern against already learned exemplar patterns. The latter are composed of unique feature patterns and associated tag distributions. Finally, the output layer samples tags from the associated tag distributions to verbalize the preceding categorization process. We have evaluated this approach on a real-world folksonomy gathered from Wikipedia bookmarks in Delicious. In the experiment our approach outperformed LDA, a well-established algorithm. We attribute this to the fact that our approach processes semantic information (either latent topics or external categories) across the three different layers. With this paper, we demonstrate that a theoretically guided design of algorithms not only holds potential for improving existing recommendation mechanisms, but it also allows us to derive more generalizable insights about how human information interaction on the Web is determined by both semantic and verbal processes.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84434952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most modern web search engines yield a list of documents of a fixed length (usually 10) in response to a user query. The next ten search results are usually available in one click. These documents either replace the current result page or are appended to the end. Hence, in order to examine more documents than the first 10 the user needs to explicitly express her intention. Although clickthrough numbers are lower for documents on the second and later result pages, they still represent a noticeable amount of traffic. We propose a modification of the Dynamic Bayesian Network (DBN) click model by explicitly including into the model the probability of transition between result pages. We show that our new click model can significantly better capture user behavior on the second and later result pages while giving the same performance on the first result page.
{"title":"Modeling clicks beyond the first result page","authors":"A. Chuklin, P. Serdyukov, M. de Rijke","doi":"10.1145/2505515.2507859","DOIUrl":"https://doi.org/10.1145/2505515.2507859","url":null,"abstract":"Most modern web search engines yield a list of documents of a fixed length (usually 10) in response to a user query. The next ten search results are usually available in one click. These documents either replace the current result page or are appended to the end. Hence, in order to examine more documents than the first 10 the user needs to explicitly express her intention. Although clickthrough numbers are lower for documents on the second and later result pages, they still represent a noticeable amount of traffic. We propose a modification of the Dynamic Bayesian Network (DBN) click model by explicitly including into the model the probability of transition between result pages. We show that our new click model can significantly better capture user behavior on the second and later result pages while giving the same performance on the first result page.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84776334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yanan Xie, Liang Chen, Kunyang Jia, Lichuan Ji, Jian Wu
Online news reading has become the major method to know about the world as web provide more information than other media like TV and radio. However, traditional online news reading interface is inconvenient for many types of people, especially for those who are disabled or taking a bus. This paper presents a mobile application iNewsBox enabling users to listen to news collected from the Internet. In order to simplify necessary interactions of getting valuable news, we also propose a framework for using implicit feedback to recommend news in this paper. Experiment shows our algorithms in iNewsBox are effective.
{"title":"iNewsBox: modeling and exploiting implicit feedback for building personalized news radio","authors":"Yanan Xie, Liang Chen, Kunyang Jia, Lichuan Ji, Jian Wu","doi":"10.1145/2505515.2508199","DOIUrl":"https://doi.org/10.1145/2505515.2508199","url":null,"abstract":"Online news reading has become the major method to know about the world as web provide more information than other media like TV and radio. However, traditional online news reading interface is inconvenient for many types of people, especially for those who are disabled or taking a bus. This paper presents a mobile application iNewsBox enabling users to listen to news collected from the Internet. In order to simplify necessary interactions of getting valuable news, we also propose a framework for using implicit feedback to recommend news in this paper. Experiment shows our algorithms in iNewsBox are effective.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85027029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Today, machine learning (ML) methods play a central role in industry and science. The growth of the Web and improvements in sensor data collection technology have been rapidly increasing the magnitude and complexity of the ML tasks we must solve. This growth is driving the need for scalable, parallel ML algorithms that can handle "Big Data." In this talk, we will focus on: Examining common algorithmic patterns in distributed ML methods. Qualifying the challenges of implementing these algorithms in real distributed systems. Describing computational frameworks for implementing these algorithms at scale. Addressing a significant core challenge to large-scale ML -- enabling the widespread adoption of machine learning beyond experts. In the latter part, we will focus mainly on the GraphLab framework, which naturally expresses asynchronous, dynamic graph computations that are key for state-of-the-art ML algorithms. When these algorithms are expressed in our higher-level abstraction, GraphLab will effectively address many of the underlying parallelism challenges, including data distribution, optimized communication, and guaranteeing sequential consistency, a property that is surprisingly important for many ML algorithms. On a variety of large-scale tasks, GraphLab provides 20-100x performance improvements over Hadoop. In recent months, GraphLab has received many tens of thousands of downloads, and is being actively used by a number of startups, companies, research labs and universities.
{"title":"Usability in machine learning at scale with graphlab","authors":"Carlos Guestrin","doi":"10.1145/2505515.2527108","DOIUrl":"https://doi.org/10.1145/2505515.2527108","url":null,"abstract":"Today, machine learning (ML) methods play a central role in industry and science. The growth of the Web and improvements in sensor data collection technology have been rapidly increasing the magnitude and complexity of the ML tasks we must solve. This growth is driving the need for scalable, parallel ML algorithms that can handle \"Big Data.\" In this talk, we will focus on: Examining common algorithmic patterns in distributed ML methods. Qualifying the challenges of implementing these algorithms in real distributed systems. Describing computational frameworks for implementing these algorithms at scale. Addressing a significant core challenge to large-scale ML -- enabling the widespread adoption of machine learning beyond experts. In the latter part, we will focus mainly on the GraphLab framework, which naturally expresses asynchronous, dynamic graph computations that are key for state-of-the-art ML algorithms. When these algorithms are expressed in our higher-level abstraction, GraphLab will effectively address many of the underlying parallelism challenges, including data distribution, optimized communication, and guaranteeing sequential consistency, a property that is surprisingly important for many ML algorithms. On a variety of large-scale tasks, GraphLab provides 20-100x performance improvements over Hadoop. In recent months, GraphLab has received many tens of thousands of downloads, and is being actively used by a number of startups, companies, research labs and universities.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85047185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}