Proceedings of the 22nd ACM international conference on Information & Knowledge Management: Latest Articles

Effective measures for inter-document similarity

Pub Date : 2013-10-27 DOI: 10.1145/2505515.2505526

John S. Whissell, C. Clarke

While supervised learning-to-rank algorithms have largely supplanted unsupervised query-document similarity measures for search, the exploration of query-document measures by many researchers over many years produced insights that might be exploited in other domains. For example, the BM25 measure substantially and consistently outperforms cosine across many tested environments, and potentially provides retrieval effectiveness approaching that of the best learning-to-rank methods over equivalent feature sets. Other measures based on language modeling and divergence from randomness can outperform BM25 in some circumstances. Despite this evidence, cosine remains the prevalent method for determining inter-document similarity for clustering and other applications. However, recent research demonstrates that BM25 term weights can significantly improve clustering. In this work, we extend that result, presenting and evaluating novel inter-document similarity measures based on BM25, language modeling, and divergence from randomness. In our first experiment, we analyze the accuracy of nearest neighborhoods when using our measures. In our second experiment, we analyze the use of clustering algorithms in conjunction with our measures. Our novel symmetric BM25 and language modeling similarity measures outperform alternative measures in both experiments. This outcome strongly supports adopting these measures in place of cosine similarity in future work.
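
The abstract does not define the symmetric BM25 measure, so the following is only a minimal sketch of one plausible symmetrization: each document is scored against the other with the standard Okapi BM25 formula, treating the first document's unique terms as the query, and the two directions are averaged. The function names, the whitespace tokenizer, the averaging step, and the parameter values `k1=1.2` and `b=0.75` are assumptions made for illustration, not the authors' definition.

```python
import math
from collections import Counter


def build_index(docs):
    """Tokenize on whitespace and collect term and document frequencies."""
    tfs = [Counter(doc.lower().split()) for doc in docs]
    df = Counter(term for tf in tfs for term in tf)
    avg_len = sum(sum(tf.values()) for tf in tfs) / len(tfs)
    return tfs, df, avg_len


def bm25(query_tf, doc_tf, avg_len, df, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 score of one document against the unique terms of a query."""
    doc_len = sum(doc_tf.values())
    score = 0.0
    for term in query_tf:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        length_norm = k1 * (1.0 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1.0) / (tf + length_norm)
    return score


def symmetric_bm25(index, i, j):
    """Hypothetical symmetrization: average the two query directions."""
    tfs, df, avg_len = index
    n_docs = len(tfs)
    forward = bm25(tfs[i], tfs[j], avg_len, df, n_docs)
    backward = bm25(tfs[j], tfs[i], avg_len, df, n_docs)
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    corpus = [
        "bm25 ranking for search",
        "bm25 ranking for clustering",
        "social trust in recommender systems",
    ]
    index = build_index(corpus)
    print(symmetric_bm25(index, 0, 1), symmetric_bm25(index, 0, 2))
```

Averaging is only one way to obtain a symmetric score; the point of the sketch is that plain BM25 is asymmetric in its two arguments, so an inter-document measure has to reconcile the two directions somehow.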

Citations: 24

Social recommendation incorporating topic mining and social trust analysis

Pub Date : 2013-10-27 DOI: 10.1145/2505515.2505592

T. Zhao, Chunping Li, Mengya Li, Qiang Ding, Li Li

We study the problem of social recommendation incorporating topic mining and social trust analysis. Unlike other work on social recommendation, we merge topic mining and social trust analysis techniques into recommender systems to find topics from the tags of items and to estimate topic-specific social trust. We propose a probabilistic matrix factorization algorithm (TTMF) that aims to improve recommendation accuracy by utilizing the estimated topic-specific social trust relations. Moreover, TTMF can address the item cold-start problem by inferring the features (topics) of new items from their tags. Experiments are conducted on three different data sets. The results validate the effectiveness of our method in improving recommendation performance and its applicability to the cold-start problem.
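
For orientation, here is a minimal sketch of plain probabilistic matrix factorization trained by stochastic gradient descent, the base model that TTMF is described as extending. It does not reproduce the topic mining or trust terms of TTMF, which the abstract does not specify. The function names and the hyperparameters (`rank`, `lr`, `reg`, `epochs`) are illustrative assumptions.

```python
import random


def train_pmf(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02,
              epochs=500, seed=0):
    """Fit user and item factors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    users = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    items = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(users[u], items[i]))
            for k in range(rank):
                uk, ik = users[u][k], items[i][k]
                # Gradient step on squared error with L2 (Gaussian prior) penalty.
                users[u][k] += lr * (err * ik - reg * uk)
                items[i][k] += lr * (err * uk - reg * ik)
    return users, items


def predict(users, items, u, i):
    """Predicted rating is the inner product of the two factor vectors."""
    return sum(a * b for a, b in zip(users[u], items[i]))


if __name__ == "__main__":
    observed = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0), (2, 1, 5.0)]
    users, items = train_pmf(observed, n_users=3, n_items=2)
    print(predict(users, items, 0, 0), predict(users, items, 2, 0))
```

An item with no ratings never receives a gradient update in this sketch, which is the cold-start gap that the abstract says TTMF fills by inferring item features from tags.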

Citations: 27

Personalized influence maximization on social networks

Pub Date : 2013-10-27 DOI: 10.1145/2505515.2505571

Jing Guo, Peng Zhang, Chuan Zhou, Yanan Cao, Li Guo

In this paper, we study a new problem in social network influence maximization. The problem is defined as follows: given a target user $w$, find the top-k most influential nodes for that user. Unlike existing influence maximization work, which aims to find a small subset of nodes that maximizes the spread of influence over the entire network (i.e., global optima), our problem aims to find a small subset of nodes that maximizes the influence spread to a given target user (i.e., local optima). The solution is critical for personalized services on social networks, where a full understanding of each specific user is essential. Although some global influence maximization models can be narrowed down to serve as a solution, these methods are often biased toward the target node itself. To this end, we present a local influence maximization solution. We first provide a random function, with a low-variance guarantee, to randomly simulate the objective function of local influence maximization. Then, we present efficient algorithms with approximation guarantees. For online social network applications, we also present a scalable approximate algorithm that explores the local cascade structure of the target user. We test the proposed algorithms on several real-world social networks. Experimental results validate the performance of the proposed algorithms.
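
The local objective can be illustrated with a small Monte Carlo sketch. It assumes the independent cascade diffusion model, which the abstract does not name, and uses a naive greedy selection rather than the authors' algorithms. The graph encoding (a dict mapping each node to a list of `(neighbor, probability)` pairs) and all function names are hypothetical.

```python
import random


def target_activation_prob(graph, seeds, target, trials=2000, rng=None):
    """Estimate the probability that `target` is activated from `seeds`."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        active = set(seeds)
        frontier = list(seeds)
        # Independent cascade: each newly active node gets one chance per edge.
        while frontier and target not in active:
            next_frontier = []
            for u in frontier:
                for v, p in graph.get(u, []):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        if target in active:
            hits += 1
    return hits / trials


def greedy_local_seeds(graph, target, k, trials=2000, seed=0):
    """Greedily pick k seeds that maximize the estimated activation of target."""
    rng = random.Random(seed)
    candidates = set(graph) | {v for nbrs in graph.values() for v, _ in nbrs}
    candidates.discard(target)  # the target itself is not a valid seed
    chosen = []
    for _ in range(k):
        best, best_prob = None, -1.0
        for node in sorted(candidates - set(chosen)):
            prob = target_activation_prob(graph, chosen + [node], target,
                                          trials, rng)
            if prob > best_prob:
                best, best_prob = node, prob
        if best is None:
            break
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    toy = {"a": [("w", 0.9)], "b": [("w", 0.1)], "c": [("a", 0.5)]}
    print(greedy_local_seeds(toy, target="w", k=1))
```

Excluding the target from the candidate set is one simple way to avoid the bias toward the target node that the abstract attributes to narrowed-down global methods.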

Citations: 91

Efficient forecasting for hierarchical time series

Pub Date : 2013-10-27 DOI: 10.1145/2505515.2505622

Lars Dannecker, R. Lorenz, Philipp J. Rösch, Wolfgang Lehner, Gregor Hackenbroich

Forecasting is used as the basis for business planning in many application areas such as energy, sales and traffic management. Time series data used in these areas is often hierarchically organized and thus aggregated along the hierarchy levels based on its dimensional features. Calculating forecasts in these environments is very time consuming, because forecasts must remain consistent across hierarchy levels. To increase the forecasting efficiency for hierarchically organized time series, we introduce a novel forecasting approach that takes advantage of the hierarchical organization. Specifically, we reuse the forecast models maintained on the lowest level of the hierarchy to almost instantly create already estimated forecast models on higher hierarchical levels. In addition, we define a hierarchical communication framework that increases communication flexibility and efficiency. Our experiments show significant runtime improvements for creating a forecast model at higher hierarchical levels, while still providing very high accuracy.
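
To make the consistency requirement concrete, the sketch below shows the simplest bottom-up scheme: forecast each lowest-level series and sum the results for each parent node. It is not the authors' model-reuse approach, whose details the abstract does not give. Simple exponential smoothing, the smoothing factor `alpha=0.5`, and the dict-based hierarchy encoding are assumptions chosen for brevity.

```python
def ses_forecast(series, alpha=0.5):
    """One-step-ahead forecast by simple exponential smoothing."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1.0 - alpha) * level
    return level


def bottom_up(base_series, hierarchy, alpha=0.5):
    """Forecast base series, then sum child forecasts for each parent node."""
    base = {name: ses_forecast(s, alpha) for name, s in base_series.items()}
    parents = {
        parent: sum(base[child] for child in children)
        for parent, children in hierarchy.items()
    }
    return base, parents


if __name__ == "__main__":
    regions = {"north": [10, 12, 14], "south": [20, 20, 20]}
    base, parents = bottom_up(regions, {"total": ["north", "south"]})
    print(base, parents)
```

Because exponential smoothing with a fixed `alpha` is linear in the observations, smoothing the summed series directly gives the same value as summing the per-region forecasts, so the two levels agree by construction in this toy case. With separately fitted models per level that agreement is not automatic, which is the cost the abstract refers to.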

Citations: 4

Intent models for contextualising and diversifying query suggestions

Pub Date : 2013-10-27 DOI: 10.1145/2505515.2505661

E. Kharitonov, C. Macdonald, P. Serdyukov, I. Ounis

Query suggestion and auto-completion mechanisms help users type less while interacting with a search engine. A basic approach that ranks suggestions according to their frequency in query logs is suboptimal. Firstly, many candidate queries with the same prefix can be removed as redundant. Secondly, the suggestions can also be personalised based on the user's context. These two directions for improving the quality of the mechanisms can be in opposition: while the latter aims to promote suggestions that address search intents that a user is likely to have, the former aims to diversify the suggestions to cover as many intents as possible. We introduce a contextualisation framework that utilises a short-term context drawn from the user's behaviour within the current search session, such as the previous query, the documents examined, and the candidate query suggestions that the user has discarded. This short-term context is used to contextualise and diversify the ranking of query suggestions, by modelling the user's information need as a mixture of intent-specific user models. The evaluation is performed offline on a set of approximately 1.0M test user sessions. Our results suggest that the proposed approach significantly improves query suggestions compared to the baseline approach.
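
The frequency baseline that the abstract calls suboptimal is easy to state in code. The sketch below ranks logged queries that share the typed prefix by how often they occur; the function name, the alphabetical tie-break, and the toy log are illustrative assumptions.

```python
from collections import Counter


def suggest_by_frequency(query_log, prefix, k=3):
    """Return the k most frequent logged queries that start with `prefix`."""
    counts = Counter(q for q in query_log if q.startswith(prefix))
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [query for query, _ in ranked[:k]]


if __name__ == "__main__":
    log = ["cikm 2013", "cikm 2013", "cikm papers", "cinema",
           "cikm 2013 workshops"]
    print(suggest_by_frequency(log, "cikm"))
```

The output depends only on the prefix and global counts, so it cannot react to the previous query or to suggestions the user has already skipped, and near-duplicate completions can crowd the list. These are the two gaps the abstract sets out to address.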

Citations: 23

2013 international workshop on computational scientometrics: theory and applications

Pub Date : 2013-10-27 DOI: 10.1145/2505515.2505809

Cornelia Caragea, C. Lee Giles, L. Rokach, Xiaozhong Liu

The field of Scientometrics is concerned with the analysis of science and scientific research. As science advances, scientists around the world continue to produce large numbers of research articles, which provide the technological basis for worldwide collection, sharing, and dissemination of scientific discoveries. Research ideas are generally developed based on high quality citations. Understanding how research ideas emerge, evolve, or disappear as a topic, what is a good measure of quality of published works, what are the most promising areas of research, how authors connect and influence each other, who are the experts in a field, what works are similar, and who funds a particular research topic are some of the major foci of the rapidly emerging field of Scientometrics. Digital libraries and other databases that store research articles have become a medium for answering such questions. Citation analysis is used to mine large publication graphs in order to extract patterns in the data (e.g., citations per article) that can help measure the quality of a journal. Scientometrics, on the other hand, is used to mine graphs that link together multiple types of entities (authors, publications, conference venues, journals, institutions, etc.) in order to assess the quality of science and answer complex questions such as those listed above. Tools such as maps of science that are built from digital libraries allow different categories of users to satisfy various needs, e.g., helping researchers to easily access research results, identify relevant funding opportunities, and find collaborators. Moreover, recent developments in data mining, machine learning, natural language processing, and information retrieval make it possible to transform the way we analyze research publications, funded proposals, patents, etc., on a web-wide scale.

Citations: 0

PLEAD 2013: politics, elections and data

Pub Date : 2013-10-27 DOI: 10.1145/2505515.2505813

Ingmar Weber, A. Popescu, M. Pennacchiotti

What is the role of the Internet in politics in general and during campaigns in particular? And what is the role of large amounts of user data in all of this? In the 2008 and 2012 U.S. presidential campaigns the Democrats were far more successful than the Republicans in utilizing online media for mobilization, co-ordination and fundraising. Year after year, social media and the Internet play a fundamental role in political campaigns. However, technical research in this area is still limited and fragmented. The goal of this workshop is to bring together researchers working at the intersection of social network analysis, computational social science and political science, to share and discuss their ideas in a common forum, and to inspire further developments in this growing, fascinating field.

Citations: 3

Scalable bootstrapping for python

Pub Date : 2013-10-27 DOI: 10.1145/2505515.2505630

P. Birsinger, R. Xia, A. Fox

High-level productivity languages such as Python, Matlab, and R are popular choices for scientists doing data analysis. However, for today's increasingly large datasets, applications written in these languages may run too slowly, if at all. In such cases, an experienced programmer must typically rewrite the application in a less productive but more performant language such as C or C++, but this work is intricate, tedious, and often non-reusable. To bridge this gap between programmer productivity and performance, we extend an existing framework that uses just-in-time code generation and compilation. This framework uses the SEJITS methodology (Selective Embedded Just-In-Time Specialization [11]), converting programs written in domain-specific embedded languages (DSELs) to programs in languages suitable for high performance or parallel computation. We present a Python DSEL for a recently developed, scalable bootstrapping method; the DSEL executes efficiently in a distributed cluster. In previous work [18], Prasad et al. created a DSEL compiler for the same DSEL (with minor differences) to generate OpenMP or Cilk code. In this work, we create a new DSEL compiler which instead emits code to run on Spark [16], a distributed processing framework. Using two example applications of bootstrapping, we show that the resulting distributed code achieves near-perfect strong scaling from 4 to 32 eight-core computers (32 to 256 cores) on datasets up to hundreds of gigabytes in size. With our DSEL, a data scientist can write a single program in serial Python that can run "toy" problems in plain Python, non-toy problems fitting on a single computer in OpenMP or Cilk, and non-toy problems with large datasets on a multi-computer Spark installation.
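
As a reference point for the "toy problems in plain Python" case, here is a minimal sketch of the classic bootstrap in serial Python. It is not the scalable variant or the DSEL that the abstract describes, neither of which is specified there; the function name and the default of 1000 resamples are assumptions.

```python
import random
import statistics


def bootstrap_stderr(data, estimator=statistics.mean, n_resamples=1000, seed=0):
    """Estimate the standard error of `estimator` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [estimator(rng.choices(data, k=n)) for _ in range(n_resamples)]
    # The spread of the resampled estimates approximates the standard error.
    return statistics.stdev(estimates)


if __name__ == "__main__":
    sample = list(range(1, 101))
    print(bootstrap_stderr(sample))
    print(bootstrap_stderr(sample, estimator=statistics.median))
```

Each resample touches all `n` points and the loop repeats `n_resamples` times, so the serial cost grows quickly with dataset size. That cost is what motivates both the scalable method and the generated OpenMP, Cilk, or Spark code mentioned in the abstract.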

Citations: 4

Location recommendation for out-of-town users in location-based social networks

Pub Date : 2013-10-27 DOI: 10.1145/2505515.2505637

Gregory Ference, Mao Ye, Wang-Chien Lee

Most previous research on location recommendation services in location-based social networks (LBSNs) makes recommendations without considering where the targeted user is currently located. Such services may recommend a place near her hometown even if the user is traveling out of town. In this paper, we study the issues in making location recommendations for out-of-town users by taking into account user preference, social influence and geographical proximity. Accordingly, we propose a collaborative recommendation framework, called User Preference, Proximity and Social-Based Collaborative Filtering (UPS-CF), to make location recommendations for mobile users in LBSNs. We validate our ideas by comprehensive experiments using real datasets collected from Foursquare and Gowalla. By comparing against baseline algorithms and the conventional collaborative filtering approach (and its variants), we show that UPS-CF exhibits the best performance. Additionally, we find that preference derived from similar users is important for in-town users, while social influence becomes more important for out-of-town users.

Citations: 134

Local clustering in provenance graphs

Pub Date : 2013-10-27 DOI: 10.1145/2505515.2505624

P. Macko, Daniel W. Margo, M. Seltzer

Systems that capture and store data provenance, the record of how an object has arrived at its current state, accumulate historical metadata over time, forming a large graph. Local clustering in these graphs, in which we start with a seed vertex and grow a cluster around it, is of paramount importance because it supports critical provenance applications such as identifying semantically meaningful tasks in an object's history. However, generic graph clustering algorithms are not effective at these tasks. We identify three key properties of provenance graphs and exploit them to justify two new centrality metrics we developed for use in performing local clustering on provenance graphs.

Citations: 17
