In the past 30 years, tremendous progress has been achieved in building effective shallow classification models. Despite this success, we have come to realize that, for many applications, the key bottleneck is not the quality of classifiers but that of features. The inability to automatically obtain useful features has become the main limitation of shallow models. Since 2006, learning high-level features from raw data using deep architectures has driven a huge wave of new learning paradigms. In the past two years, deep learning has produced many performance breakthroughs, for example in image understanding and speech recognition. In this talk, I will walk through some of the latest technical advances in deep learning at Baidu, and discuss the main challenges, e.g., developing effective models for various applications and scaling up model training across many GPUs. At the end of the talk, I will discuss what might be interesting directions for future work.
{"title":"Large-scale deep learning at Baidu","authors":"Kai Yu","doi":"10.1145/2505515.2514699","DOIUrl":"https://doi.org/10.1145/2505515.2514699","url":null,"abstract":"In the past 30 years, tremendous progress has been achieved in building effective shallow classification models. Despite the success, we come to realize that, for many applications, the key bottleneck is not the qualify of classifiers but that of features. Not being able to automatically get useful features has become the main limitation for shallow models. Since 2006, learning high-level features using deep architectures from raw data has become a huge wave of new learning paradigms. In recent two years, deep learning has made many performance breakthroughs, for example, in the areas of image understanding and speech recognition. In this talk, I will walk through some of the latest technology advances of deep learning within Baidu, and discuss the main challenges, e.g., developing effective models for various applications, and scaling up the model training using many GPUs. In the end of the talk I will discuss what might be interesting future directions.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89621263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Baraglia, Cristina Ioana Muntean, F. M. Nardini, F. Silvestri
In this paper, we tackle the problem of predicting the "next" geographical position of a tourist given her history (i.e., the prediction is made according to the tourist's current trail) by means of supervised learning techniques, namely Gradient Boosted Regression Trees and Ranking SVM. Learning is performed over an object space represented by a 68-dimensional feature vector specifically designed for tourism-related data. Furthermore, we present a thorough comparison against several methods considered state-of-the-art in tourist recommendation and trail prediction, as well as against a strong popularity baseline. Experiments show that our methods outperform these competitors and baselines, providing strong evidence of the effectiveness of our solutions.
{"title":"LearNext: learning to predict tourists movements","authors":"R. Baraglia, Cristina Ioana Muntean, F. M. Nardini, F. Silvestri","doi":"10.1145/2505515.2505656","DOIUrl":"https://doi.org/10.1145/2505515.2505656","url":null,"abstract":"In this paper, we tackle the problem of predicting the \"next\" geographical position of a tourist given her history (i.e., the prediction is done accordingly to the tourist's current trail) by means of supervised learning techniques, namely Gradient Boosted Regression Trees and Ranking SVM. The learning is done on the basis of an object space represented by a 68 dimension feature vector, specifically designed for tourism related data. Furthermore, we propose a thorough comparison of several methods that are considered state-of-the-art in touristic recommender and trail prediction systems as well as a strong popularity baseline. Experiments show that the methods we propose outperform important competitors and baselines thus providing strong evidence of the performance of our solutions.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89758505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The problem of discovering information-flow trends and influencers in social networks has become increasingly relevant, both because of the increasing amount of content available from online networks in the form of social streams, and because of its value as a tool for analyzing content trends. An important part of this analysis is determining the key patterns of flow and the corresponding influencers in the underlying network. Almost all prior work on influence analysis has focused on fixed models of the network structure and on edge-based transmission between nodes. In this paper, we propose a fully content-centric model of flow analysis in social-network streams, in which the analysis is based on actual content transmissions in the network rather than on a static model of transmission along the edges. We first introduce the problem of information-flow mining in social streams, and then propose a novel algorithm, InFlowMine, to discover information-flow patterns in the network. We then leverage this approach to determine the key influencers in the network. Our approach is flexible, since it can also determine topic-specific influencers. We experimentally show the effectiveness and efficiency of our model.
{"title":"Content-centric flow mining for influence analysis in social streams","authors":"Karthik Subbian, C. Aggarwal, J. Srivastava","doi":"10.1145/2505515.2505626","DOIUrl":"https://doi.org/10.1145/2505515.2505626","url":null,"abstract":"The problem of discovering information flow trends and influencers in social networks has become increasingly relevant both because of the increasing amount of content available from online networks in the form of social streams, and because of its relevance as a tool for content trends analysis. An important part of this analysis is to determine the key patterns of flow and corresponding influencers in the underlying network. Almost all the work on influence analysis has focused on fixed models of the network structure, and edge-based transmission between nodes. In this paper, we propose a fully content-centered model of flow analysis in social network streams, in which the analysis is based on actual content transmissions in the network, rather than a static model of transmission on the edges. First, we introduce the problem of information flow mining in social streams, and then propose a novel algorithm InFlowMine to discover the information flow patterns in the network. We then leverage this approach to determine the key influencers in the network. Our approach is flexible, since it can also determine topic-specific influencers. We experimentally show the effectiveness and efficiency of our model.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"86 5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86499554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Dirichlet process mixture (DPM) model is one of the most important Bayesian nonparametric models, owing to its efficient inference and its flexibility across applications. A fundamental assumption made by the DPM model is that all data items are generated from a single, shared DP. This assumption, however, is restrictive in many practical settings where samples are generated from a collection of dependent DPs, each associated with a point in some covariate space. For example, documents in conference proceedings are organized by year, and photos may be tagged and recorded with GPS locations. We present a general method for constructing dependent Dirichlet processes (DPs) on an arbitrary covariate space. The approach is based on restricting and projecting a DP defined on a space of continuous functions with different domains, which yields a collection of dependent random measures, each associated with a point in the covariate space and each marginally DP-distributed. The constructed collection of dependent DPs can be used as a nonparametric prior for infinite dynamic mixture models, which allow each mixture component to appear or disappear and to vary within a subspace of the covariate space. Furthermore, we discuss choices of base distributions over functions in a variety of settings as a flexible means of controlling dependencies. In addition, we develop an efficient Gibbs sampler for model inference in which all underlying random measures are integrated out. Finally, experimental results on temporal and spatial modeling datasets demonstrate the effectiveness of the method in modeling dynamic mixtures over different types of covariates.
{"title":"Functional dirichlet process","authors":"Lijing Qin, Xiaoyan Zhu","doi":"10.1145/2505515.2505537","DOIUrl":"https://doi.org/10.1145/2505515.2505537","url":null,"abstract":"Dirichlet process mixture (DPM) model is one of the most important Bayesian nonparametric models owing to its efficiency of inference and flexibility for various applications. A fundamental assumption made by DPM model is that all data items are generated from a single, shared DP. This assumption, however, is restrictive in many practical settings where samples are generated from a collection of dependent DPs, each associated with a point in some covariate space. For example, documents in the proceedings of a conference are organized by year, or photos may be tagged and recorded with GPS locations. We present a general method for constructing dependent Dirichlet processes (DP) on arbitrary covariate space. The approach is based on restricting and projecting a DP defined on a space of continuous functions with different domains, which results in a collection of dependent random measures, each associated with a point in covariate space and is marginally DP distributed. The constructed collection of dependent DPs can be used as a nonparametric prior of infinite dynamic mixture models, which allow each mixture component to appear/disappear and vary in a subspace of covariate space. Furthermore, we discuss choices of base distributions of functions in a variety of settings as a flexible method to control dependencies. In addition, we develop an efficient Gibbs sampler for model inference where all underlying random measures are integrated out. Finally, experiment results on temporal modeling and spatial modeling datasets demonstrate the effectiveness of the method in modeling dynamic mixture models on different types of covariates.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85817579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frequent sequential pattern mining is a central task in many fields, such as biology and finance. However, the release of these patterns raises increasing concerns about individual privacy. In this paper, we study the sequential pattern mining problem under the differential privacy framework, which provides formal and provable privacy guarantees. Because the differential privacy mechanism perturbs the frequency results with noise, and because the pattern space is high-dimensional, this mining problem is particularly challenging. In this work, we propose a novel two-phase algorithm for mining both prefix and substring patterns. In the first phase, our approach exploits the statistical properties of the data to construct a model-based prefix tree, which is used to mine prefixes and a candidate set of substring patterns. The frequencies of the substring patterns are then refined in the second phase, where we employ a novel transformation of the original data to reduce the perturbation noise. Extensive experimental results on real datasets show that our approach is effective for mining both substring and prefix patterns compared with state-of-the-art solutions.
{"title":"A two-phase algorithm for mining sequential patterns with differential privacy","authors":"Luca Bonomi, Li Xiong","doi":"10.1145/2505515.2505553","DOIUrl":"https://doi.org/10.1145/2505515.2505553","url":null,"abstract":"Frequent sequential pattern mining is a central task in many fields such as biology and finance. However, release of these patterns is raising increasing concerns on individual privacy. In this paper, we study the sequential pattern mining problem under the differential privacy framework which provides formal and provable guarantees of privacy. Due to the nature of the differential privacy mechanism which perturbs the frequency results with noise, and the high dimensionality of the pattern space, this mining problem is particularly challenging. In this work, we propose a novel two-phase algorithm for mining both prefixes and substring patterns. In the first phase, our approach takes advantage of the statistical properties of the data to construct a model-based prefix tree which is used to mine prefixes and a candidate set of substring patterns. The frequency of the substring patterns is further refined in the successive phase where we employ a novel transformation of the original data to reduce the perturbation noise. Extensive experiment results using real datasets showed that our approach is effective for mining both substring and prefix patterns in comparison to the state-of-the-art solutions.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78901264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To deal with this problem, we propose a new hypertext resource discovery method, called "selective recrawling", for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns and select those pages that have the widest coverage and the least irrelevance and redundancy relative to a pre-defined vertical domain. The method requires only a few seed objects and selects the set of URL patterns that covers the greatest number of objects. The selected set can be used for some time to recrawl web pages and can be renewed periodically, leading to significant savings in hardware and network resources. In this paper, we present a detailed framework of selective recrawling for object-level vertical search. The method automatically extends the set of candidate websites from the initial seed objects. Based on the objects extracted from these websites, it learns a set of URL patterns that covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method greatly reduces the number of downloaded web pages while maintaining comparable object coverage.
{"title":"A pattern-based selective recrawling approach for object-level vertical search","authors":"Yaqian Zhou, Qi Zhang, Xuanjing Huang, Lide Wu","doi":"10.1145/2505515.2505707","DOIUrl":"https://doi.org/10.1145/2505515.2505707","url":null,"abstract":"Traditional recrawling methods learn navigation patterns in order to crawl related web pages. However, they cannot remove the redundancy found on the web, especially at the object level. To deal with this problem, we propose a new hypertext resource discovery method, called ``selective recrawling'' for object-level vertical search applications. The goal of selective recrawling is to automatically generate URL patterns, then select those pages that have the widest coverage, and least irrelevance and redundancy relative to a pre-defined vertical domain. This method only requires a few seed objects and can select the set of URL patterns that covers the greatest number of objects. The selected set can continue to be used for some time to recrawl web pages and can be renewed periodically. This leads to significant savings in hardware and network resources. In this paper we present a detailed framework of selective recrawling for object-level vertical search. The selective recrawling method automatically extends the set of candidate websites from initial seed objects. Based on the objects extracted from these websites it learns a set of URL patterns which covers the greatest number of target objects with little redundancy. Finally, the navigation patterns generated from the selected URL pattern set are used to guide future crawling. Experiments on local event data show that our method can greatly reduce downloading of web pages while maintaining comparative object coverage.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79051073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zi Yang, E. Garduño, Yan Fang, Avner Maiberg, Collin McCormack, Eric Nyberg
Software frameworks that support the integration and scaling of text analysis algorithms make it possible to build complex, high-performance information systems for information extraction, information retrieval, and question answering; IBM's Watson is a prominent example. As the complexity and scale of information systems grow, it becomes much more challenging to determine, effectively and efficiently, which toolkits, algorithms, knowledge bases, or other resources should be integrated into a system in order to achieve a desired or optimal level of performance on a given task. This paper presents a formal representation of the space of possible system configurations, given a set of information processing components and their parameters (the configuration space), and discusses algorithmic approaches to determining the optimal configuration within a given configuration space (configuration space exploration, or CSE). We introduce the CSE framework, an extension of the UIMA framework that provides a general distributed solution for building and exploring configuration spaces for information systems. The CSE framework was used to implement biomedical information systems in case studies involving over a trillion different combinations of components and parameter values, operating on question answering tasks from the TREC Genomics track. The framework automatically and efficiently evaluated different system configurations and identified configurations that achieved better results than previously published ones.
{"title":"Building optimal information systems automatically: configuration space exploration for biomedical information systems","authors":"Zi Yang, E. Garduño, Yan Fang, Avner Maiberg, Collin McCormack, Eric Nyberg","doi":"10.1145/2505515.2505692","DOIUrl":"https://doi.org/10.1145/2505515.2505692","url":null,"abstract":"Software frameworks which support integration and scaling of text analysis algorithms make it possible to build complex, high performance information systems for information extraction, information retrieval, and question answering; IBM's Watson is a prominent example. As the complexity and scaling of information systems become ever greater, it is much more challenging to effectively and efficiently determine which toolkits, algorithms, knowledge bases or other resources should be integrated into an information system in order to achieve a desired or optimal level of performance on a given task. This paper presents a formal representation of the space of possible system configurations, given a set of information processing components and their parameters (configuration space) and discusses algorithmic approaches to determine the optimal configuration within a given configuration space (configuration space exploration or CSE). We introduce the CSE framework, an extension to the UIMA framework which provides a general distributed solution for building and exploring configuration spaces for information systems. The CSE framework was used to implement biomedical information systems in case studies involving over a trillion different configuration combinations of components and parameter values operating on question answering tasks from the TREC Genomics. The framework automatically and efficiently evaluated different system configurations, and identified configurations that achieved better results than prior published results.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77282475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Paul N. Bennett, C. Lee Giles, A. Halevy, Jiawei Han, Marti A. Hearst, J. Leskovec
With massive amounts of data being generated and stored ubiquitously in every discipline and every aspect of our daily life, handling such big data poses many challenging issues for researchers in data and information systems. The participants of CIKM 2013 are active researchers in large-scale data, information, and knowledge management, drawn from multiple disciplines, including database systems, data mining, information retrieval, human-computer interaction, and knowledge and information management. As a group of experienced researchers from academia and industry, the panelists will present their visions of the challenging research issues in this promising research frontier, and we hope to spark lively discussion and debate with the audience. We expect panelists with diverse backgrounds to raise different challenging research problems and to exchange views with each other and with the audience. A lively discussion may help young researchers understand the needs of research in both industry and academia, invest their efforts in the most important research issues, and make an impact on the development of new principles, methodologies, and technologies.
{"title":"Channeling the deluge: research challenges for big data and information systems","authors":"Paul N. Bennett, C. Lee Giles, A. Halevy, Jiawei Han, Marti A. Hearst, J. Leskovec","doi":"10.1145/2505515.2525541","DOIUrl":"https://doi.org/10.1145/2505515.2525541","url":null,"abstract":"With massive amounts of data being generated and stored ubiquitously in every discipline and every aspect of our daily life, how to handle such big data poses many challenging issues to researchers in data and information systems. The participants of CIKM 2013 are active researchers on large scale data, information and knowledge management, from multiple disciplines, including database systems, data mining, information retrieval, human-computer interaction, and knowledge or information management. As a group of experienced researchers in academia and industry, we will present at this panel our visions on what should be the challenging research issues in this promising research frontier and hope to attract heated discussions and debates from the audience. We expect panelists with diverse backgrounds raise different challenging research problems and exchange their views with each other and with the audience. A heated discussion may help young researchers understand the need for research in both industry and academia and invest their efforts on more important research issues and make impacts to the development of new principles, methodologies, and technologies.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"13 1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79883982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thanh-Son Nguyen, Hady W. Lauw, Panayiotis Tsaparas
Online reviews are an invaluable resource for web users trying to make decisions about products or services. However, the abundance of review content, as well as the unstructured, lengthy, and verbose nature of reviews, makes it hard for users to locate the appropriate reviews and distill the useful information. With the recent growth of social networking and micro-blogging services, we observe the emergence of a new type of online review content, consisting of bite-sized, 140-character reviews often posted reactively, on the spot, via mobile devices. These micro-reviews are short, concise, and focused, nicely complementing the lengthy, elaborate, and verbose nature of full-text reviews. We propose a novel methodology that brings these two diverse types of review content together to obtain something that is more than the sum of its parts. We use micro-reviews as a crowdsourced means of extracting the salient aspects of the reviewed item, and propose a new formulation of the review selection problem that aims to find a small set of reviews that efficiently covers the micro-reviews. Our approach consists of a two-step process: matching review sentences to micro-reviews, and then selecting reviews such that we cover as many micro-reviews as possible with few sentences. We perform a detailed evaluation of all the steps of our methodology using data collected from Foursquare and Yelp.
{"title":"Using micro-reviews to select an efficient set of reviews","authors":"Thanh-Son Nguyen, Hady W. Lauw, Panayiotis Tsaparas","doi":"10.1145/2505515.2505568","DOIUrl":"https://doi.org/10.1145/2505515.2505568","url":null,"abstract":"Online reviews are an invaluable resource for web users trying to make decisions regarding products or services. However, the abundance of review content, as well as the unstructured, lengthy, and verbose nature of reviews make it hard for users to locate the appropriate reviews, and distill the useful information. With the recent growth of social networking and micro-blogging services, we observe the emergence of a new type of online review content, consisting of bite-sized, 140 character-long reviews often posted reactively on the spot via mobile devices. These micro-reviews are short, concise, and focused, nicely complementing the lengthy, elaborate, and verbose nature of full-text reviews. We propose a novel methodology that brings together these two diverse types of review content, to obtain something that is more than the sum of its parts. We use micro-reviews as a crowdsourced way to extract the salient aspects of the reviewed item, and propose a new formulation of the review selection problem that aims to find a small set of reviews that efficiently cover the micro-reviews. Our approach consists of a two-step process: matching review sentences to micro-reviews and then selecting reviews such that we cover as many micro-reviews as possible, with few sentences. We perform a detailed evaluation of all the steps of our methodology using data collected from Foursquare and Yelp.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86271808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper exploits users' browsing intents to predict the category of a user's next access during web surfing, and applies the results to objectionable content filtering. A user's access trail, represented as a sequence of URLs, reveals contextual information about web browsing behavior. We extract behavioral features of each clicked URL, i.e., hostname, bag-of-words, gTLD, IP, and port, to develop a linear-chain CRF model for context-aware category prediction. Large-scale experiments show that our method achieves a promising accuracy of 0.9396 for identifying objectionable accesses without requesting the corresponding page content. Error analysis indicates that our model yields a low false positive rate of 0.0571. In real-life filtering simulations, our model achieves a macro-averaged blocking rate of 0.9271 while maintaining a favorably low macro-averaged over-blocking rate of 0.0575 for collaboratively filtering objectionable content as the dynamic web changes over time.
{"title":"Objectionable content filtering by click-through data","authors":"Lung-Hao Lee, Yen-Cheng Juan, Hsin-Hsi Chen, Yuen-Hsien Tseng","doi":"10.1145/2505515.2507849","DOIUrl":"https://doi.org/10.1145/2505515.2507849","url":null,"abstract":"This paper explores users' browsing intents to predict the category of a user's next access during web surfing, and applies the results to objectionable content filtering. A user's access trail represented as a sequence of URLs reveals the contextual information of web browsing behaviors. We extract behavioral features of each clicked URL, i.e., hostname, bag-of-words, gTLD, IP, and port, to develop a linear chain CRF model for context-aware category prediction. Large-scale experiments show that our method achieves a promising accuracy of 0.9396 for objectionable access identification without requesting their corresponding page content. Error analysis indicates that our proposed model results in a low false positive rate of 0.0571. In real-life filtering simulations, our proposed model accomplishes macro-averaging blocking rate 0.9271, while maintaining a favorably low macro-averaging over-blocking rate 0.0575 for collaboratively filtering objectionable content with time change on the dynamic web.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"62 7","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91553620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}