While supervised learning-to-rank algorithms have largely supplanted unsupervised query-document similarity measures for search, the exploration of query-document measures by many researchers over many years produced insights that might be exploited in other domains. For example, the BM25 measure substantially and consistently outperforms cosine across many tested environments, and potentially provides retrieval effectiveness approaching that of the best learning-to-rank methods over equivalent feature sets. Other measures based on language modeling and divergence from randomness can outperform BM25 in some circumstances. Despite this evidence, cosine remains the prevalent method for determining inter-document similarity for clustering and other applications. However, recent research demonstrates that BM25 term weights can significantly improve clustering. In this work, we extend that result, presenting and evaluating novel inter-document similarity measures based on BM25, language modeling, and divergence from randomness. In our first experiment, we analyze the accuracy of nearest neighborhoods when using our measures. In our second experiment, we analyze the use of clustering algorithms in conjunction with our measures. Our novel symmetric BM25 and language modeling similarity measures outperform alternative measures in both experiments. This outcome strongly recommends the adoption of these measures, replacing cosine similarity in future work.
{"title":"Effective measures for inter-document similarity","authors":"John S. Whissell, C. Clarke","doi":"10.1145/2505515.2505526","DOIUrl":"https://doi.org/10.1145/2505515.2505526","journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management"},"publicationDate":"2013-10-27"}
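To make the contrast with cosine concrete, here is a minimal sketch of BM25 term weighting applied to document-to-document comparison. It is not the measure proposed in the paper: it assumes the common Okapi BM25 weight with the non-negative IDF variant, the conventional defaults `k1=1.2` and `b=0.75`, and a hypothetical `similarity` function that simply takes the dot product of two weight vectors.

```python
import math
from collections import Counter

def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: BM25 weight} dict per tokenized document."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        # Length normalization shared by every term in this document.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        weights.append({
            t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5)) * f * (k1 + 1) / (f + norm)
            for t, f in tf.items()
        })
    return weights

def similarity(w_a, w_b):
    """Illustrative symmetric score: dot product of two BM25 weight vectors."""
    return sum(w * w_b[t] for t, w in w_a.items() if t in w_b)
```

Because the score is a dot product over shared terms, it is symmetric in its two arguments, which is the property a clustering algorithm typically needs from an inter-document measure.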
T. Zhao, Chunping Li, Mengya Li, Qiang Ding, Li Li
We study the problem of social recommendation incorporating topic mining and social trust analysis. Unlike other work on social recommendation, we merge topic mining and social trust analysis techniques into recommender systems to find topics from the tags of the items and to estimate topic-specific social trust. We propose a probabilistic matrix factorization (TTMF) algorithm that aims to improve recommendation accuracy by using the estimated topic-specific social trust relations. Moreover, TTMF offers a convenient way to address the item cold start problem by inferring the features (topics) of new items from their tags. Experiments are conducted on three different data sets. The results validate the effectiveness of our method for improving recommendation performance and its applicability to the cold start problem.
{"title":"Social recommendation incorporating topic mining and social trust analysis","authors":"T. Zhao, Chunping Li, Mengya Li, Qiang Ding, Li Li","doi":"10.1145/2505515.2505592","DOIUrl":"https://doi.org/10.1145/2505515.2505592","journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management"},"publicationDate":"2013-10-27"}
Jing Guo, Peng Zhang, Chuan Zhou, Yanan Cao, Li Guo
In this paper, we study a new problem in social network influence maximization. The problem is defined as follows: given a target user $w$, find the top-k most influential nodes for that user. Unlike existing influence maximization work, which aims to find a small subset of nodes that maximizes the spread of influence over the entire network (i.e., global optima), our problem aims to find a small subset of nodes that maximizes the influence spread to a given target user (i.e., local optima). The solution is critical for personalized services on social networks, where a full understanding of each specific user is essential. Although some global influence maximization models can be narrowed down to serve as a solution, these methods are often biased toward the target node itself. To this end, we present a local influence maximization solution. We first provide a random function, with a low-variance guarantee, to simulate the objective function of local influence maximization. Then, we present efficient algorithms with an approximation guarantee. For online social network applications, we also present a scalable approximate algorithm that exploits the local cascade structure of the target user. We test the proposed algorithms on several real-world social networks. Experimental results validate the performance of the proposed algorithms.
{"title":"Personalized influence maximization on social networks","authors":"Jing Guo, Peng Zhang, Chuan Zhou, Yanan Cao, Li Guo","doi":"10.1145/2505515.2505571","DOIUrl":"https://doi.org/10.1145/2505515.2505571","journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management"},"publicationDate":"2013-10-27"}
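The objective described in the abstract, influence spread to a single target user, can be approximated by simulation. The sketch below is an assumption-laden illustration rather than the paper's algorithm: it assumes the independent cascade diffusion model (the abstract does not name one), a hypothetical adjacency format `{node: [(neighbor, probability), ...]}`, and plain Monte Carlo averaging.

```python
import random

def activates_target(graph, seeds, target, rng):
    """Run one independent-cascade simulation; True if the target becomes active."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        node = frontier.pop()
        # Each newly active node gets one chance to activate each inactive neighbor.
        for neighbor, prob in graph.get(node, []):
            if neighbor not in active and rng.random() < prob:
                active.add(neighbor)
                frontier.append(neighbor)
    return target in active

def estimate_influence(graph, seeds, target, runs=2000, seed=0):
    """Estimate the probability that `seeds` activate `target` by repeated simulation."""
    rng = random.Random(seed)
    hits = sum(activates_target(graph, seeds, target, rng) for _ in range(runs))
    return hits / runs
```

A greedy top-k selection could call `estimate_influence` once per candidate node and keep the node with the largest marginal gain, though that loop is omitted here.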
Lars Dannecker, R. Lorenz, Philipp J. Rösch, Wolfgang Lehner, Gregor Hackenbroich
Forecasting is used as the basis for business planning in many application areas such as energy, sales and traffic management. Time series data used in these areas is often hierarchically organized and thus aggregated along the hierarchy levels based on their dimensional features. Calculating forecasts in these environments is very time consuming because forecasts must be kept consistent between hierarchy levels. To increase the forecasting efficiency for hierarchically organized time series, we introduce a novel forecasting approach that takes advantage of the hierarchical organization. In this approach, we reuse the forecast models maintained on the lowest level of the hierarchy to almost instantly create already estimated forecast models on higher hierarchical levels. In addition, we define a hierarchical communication framework that increases communication flexibility and efficiency. Our experiments show significant runtime improvements for creating a forecast model at higher hierarchical levels, while still providing very high accuracy.
{"title":"Efficient forecasting for hierarchical time series","authors":"Lars Dannecker, R. Lorenz, Philipp J. Rösch, Wolfgang Lehner, Gregor Hackenbroich","doi":"10.1145/2505515.2505622","DOIUrl":"https://doi.org/10.1145/2505515.2505622","journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management"},"publicationDate":"2013-10-27"}
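As a point of reference for the consistency requirement mentioned in the abstract, the following sketch shows plain bottom-up aggregation: forecasts at a higher level are obtained by summing the forecasts of its children, so the levels agree by construction. This is a simpler, well-known technique and not the model-derivation approach the paper introduces; the function name and the dictionary layout are hypothetical.

```python
def bottom_up(base_forecasts, hierarchy):
    """Sum base-level forecasts into each parent so all levels stay consistent.

    base_forecasts: {series_name: [forecast per time step]}
    hierarchy: {parent_name: [child series names]}
    """
    totals = {}
    for parent, children in hierarchy.items():
        series = [base_forecasts[c] for c in children]
        # zip(*series) groups the children's values for the same time step.
        totals[parent] = [sum(step) for step in zip(*series)]
    return totals
```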
E. Kharitonov, C. Macdonald, P. Serdyukov, I. Ounis
Query suggestion or auto-completion mechanisms help users type less while interacting with a search engine. A basic approach that ranks suggestions according to their frequency in query logs is suboptimal. Firstly, many candidate queries with the same prefix can be removed as redundant. Secondly, the suggestions can also be personalised based on the user's context. These two directions for improving the mechanisms' quality can be in opposition: while the latter aims to promote suggestions that address search intents that a user is likely to have, the former aims to diversify the suggestions to cover as many intents as possible. We introduce a contextualisation framework that utilises a short-term context drawn from the user's behaviour within the current search session, such as the previous query, the documents examined, and the candidate query suggestions that the user has discarded. This short-term context is used to contextualise and diversify the ranking of query suggestions by modelling the user's information need as a mixture of intent-specific user models. The evaluation is performed offline on a set of approximately 1.0M test user sessions. Our results suggest that the proposed approach significantly improves query suggestions compared to the baseline approach.
{"title":"Intent models for contextualising and diversifying query suggestions","authors":"E. Kharitonov, C. Macdonald, P. Serdyukov, I. Ounis","doi":"10.1145/2505515.2505661","DOIUrl":"https://doi.org/10.1145/2505515.2505661","journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management"},"publicationDate":"2013-10-27"}
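The basic approach that the abstract calls suboptimal, ranking completions by their frequency in the query log, can be sketched in a few lines. The function name `suggest` and the list-of-strings log format are assumptions for illustration; the contextualised and diversified ranking proposed in the paper is not reproduced here.

```python
from collections import Counter

def suggest(query_log, prefix, k=3):
    """Return the k most frequent logged queries that start with `prefix`."""
    counts = Counter(q for q in query_log if q.startswith(prefix))
    return [q for q, _ in counts.most_common(k)]
```

This baseline ignores both the session context and the redundancy among candidates, which are the two shortcomings the abstract sets out to address.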
Cornelia Caragea, C. Lee Giles, L. Rokach, Xiaozhong Liu
The field of Scientometrics is concerned with the analysis of science and scientific research. As science advances, scientists around the world continue to produce large numbers of research articles, which provide the technological basis for worldwide collection, sharing, and dissemination of scientific discoveries. Research ideas are generally developed based on high quality citations. Understanding how research ideas emerge, evolve, or disappear as a topic, what is a good measure of quality of published works, what are the most promising areas of research, how authors connect and influence each other, who are the experts in a field, what works are similar, and who funds a particular research topic are some of the major foci of the rapidly emerging field of Scientometrics. Digital libraries and other databases that store research articles have become a medium for answering such questions. Citation analysis is used to mine large publication graphs in order to extract patterns in the data (e.g., citations per article) that can help measure the quality of a journal. Scientometrics, on the other hand, is used to mine graphs that link together multiple types of entities (authors, publications, conference venues, journals, institutions, etc.) in order to assess the quality of science and answer complex questions such as those listed above. Tools such as maps of science that are built from digital libraries allow different categories of users to satisfy various needs, e.g., helping researchers to easily access research results, identify relevant funding opportunities, and find collaborators. Moreover, recent developments in data mining, machine learning, natural language processing, and information retrieval make it possible to transform the way we analyze research publications, funded proposals, patents, etc., on a web-wide scale.
{"title":"2013 international workshop on computational scientometrics: theory and applications","authors":"Cornelia Caragea, C. Lee Giles, L. Rokach, Xiaozhong Liu","doi":"10.1145/2505515.2505809","DOIUrl":"https://doi.org/10.1145/2505515.2505809","journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management"},"publicationDate":"2013-10-27"}
What is the role of the Internet in politics in general and during campaigns in particular? And what is the role of large amounts of user data in all of this? In the 2008 and 2012 U.S. presidential campaigns the Democrats were far more successful than the Republicans in utilizing online media for mobilization, co-ordination and fundraising. Year over year, social media and the Internet play a fundamental role in political campaigns. However, technical research in this area is still limited and fragmented. The goal of this workshop is to bring together researchers working at the intersection of social network analysis, computational social science and political science to share and discuss their ideas in a common forum, and to inspire further developments in this growing, fascinating field.
{"title":"PLEAD 2013: politics, elections and data","authors":"Ingmar Weber, A. Popescu, M. Pennacchiotti","doi":"10.1145/2505515.2505813","DOIUrl":"https://doi.org/10.1145/2505515.2505813","journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management"},"publicationDate":"2013-10-27"}
High-level productivity languages such as Python, Matlab, and R are popular choices for scientists doing data analysis. However, for today's increasingly large datasets, applications written in these languages may run too slowly, if at all. In such cases, an experienced programmer must typically rewrite the application in a less productive but performant language such as C or C++, but this work is intricate, tedious, and often non-reusable. To bridge this gap between programmer productivity and performance, we extend an existing framework that uses just-in-time code generation and compilation. This framework uses the SEJITS methodology (Selective Embedded Just-In-Time Specialization [11]), converting programs written in domain specific embedded languages (DSELs) to programs in languages suitable for high performance or parallel computation. We present a Python DSEL for a recently developed, scalable bootstrapping method; the DSEL executes efficiently in a distributed cluster. In previous work [18], Prasad et al. created a DSEL compiler for the same DSEL (with minor differences) to generate OpenMP or Cilk code. In this work, we create a new DSEL compiler which instead emits code to run on Spark [16], a distributed processing framework. Using two example applications of bootstrapping, we show that the resulting distributed code achieves near-perfect strong scaling from 4 to 32 eight-core computers (32 to 256 cores) on datasets up to hundreds of gigabytes in size. With our DSEL, a data scientist can write a single program in serial Python that can run "toy" problems in plain Python, non-toy problems fitting on a single computer in OpenMP or Cilk, and non-toy problems with large datasets on a multi-computer Spark installation.
{"title":"Scalable bootstrapping for python","authors":"P. Birsinger, R. Xia, A. Fox","doi":"10.1145/2505515.2505630","DOIUrl":"https://doi.org/10.1145/2505515.2505630","journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management"},"publicationDate":"2013-10-27"}
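For readers unfamiliar with bootstrapping, the "toy problem in plain Python" case can be illustrated with the classic serial bootstrap. This is only a sketch of the ordinary bootstrap, not the scalable method or the DSEL described in the abstract; the function name and its defaults are hypothetical.

```python
import random
import statistics

def bootstrap_std_error(data, estimator=statistics.mean, resamples=1000, seed=0):
    """Estimate the standard error of `estimator` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample draws n points with replacement and recomputes the estimator.
    estimates = [estimator(rng.choices(data, k=n)) for _ in range(resamples)]
    return statistics.stdev(estimates)
```

Every resample touches a full-size copy of the data, which is the cost that makes the serial version impractical on datasets of hundreds of gigabytes.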
Most previous research on location recommendation services in location-based social networks (LBSNs) makes recommendations without considering where the targeted user is currently located. Such services may recommend a place near her hometown even if the user is traveling out of town. In this paper, we study the issues in making location recommendations for out-of-town users by taking into account user preference, social influence and geographical proximity. Accordingly, we propose a collaborative recommendation framework, called User Preference, Proximity and Social-Based Collaborative Filtering (UPS-CF), to make location recommendations for mobile users in LBSNs. We validate our ideas through comprehensive experiments using real datasets collected from Foursquare and Gowalla. By comparing against baseline algorithms and the conventional collaborative filtering approach (and its variants), we show that UPS-CF exhibits the best performance. Additionally, we find that preference derived from similar users is important for in-town users while social influence becomes more important for out-of-town users.
{"title":"Location recommendation for out-of-town users in location-based social networks","authors":"Gregory Ference, Mao Ye, Wang-Chien Lee","doi":"10.1145/2505515.2505637","DOIUrl":"https://doi.org/10.1145/2505515.2505637","journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management"},"publicationDate":"2013-10-27"}
Systems that capture and store data provenance, the record of how an object has arrived at its current state, accumulate historical metadata over time, forming a large graph. Local clustering in these graphs, in which we start with a seed vertex and grow a cluster around it, is of paramount importance because it supports critical provenance applications such as identifying semantically meaningful tasks in an object's history. However, generic graph clustering algorithms are not effective at these tasks. We identify three key properties of provenance graphs and exploit them to justify two new centrality metrics we developed for use in performing local clustering on provenance graphs.
{"title":"Local clustering in provenance graphs","authors":"P. Macko, Daniel W. Margo, M. Seltzer","doi":"10.1145/2505515.2505624","DOIUrl":"https://doi.org/10.1145/2505515.2505624","journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management"},"publicationDate":"2013-10-27"}