Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management: latest articles

Partial duplicate detection for large book collections

Pub Date : 2011-10-24 DOI: 10.1145/2063576.2063647

I. Z. Yalniz, E. Can, R. Manmatha

A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words that appear only once in the book, in the order they appear in the text. These words are referred to as "unique words", and they constitute a small percentage of all the words in a typical book. Together with the order information, the set of unique words provides a compact representation that is highly descriptive of the content and the flow of ideas in the book. By aligning the sequences of unique words from two books using the longest common subsequence (LCS), one can discover whether the two books are duplicates. Experiments on several datasets show that the framework, DUPNIQ, is fast and more accurate than traditional duplicate detection methods such as shingling. On a collection of 100K scanned English books, DUPNIQ detects partial duplicates in 30 min using 350 cores and achieves precision 0.996 and recall 0.833, compared to precision 0.992 and recall 0.720 for shingling. The technique works on other languages as well and is demonstrated on a French dataset.
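The unique-word representation and the LCS alignment described in this abstract can be illustrated with a minimal Python sketch. It is not the DUPNIQ implementation: whitespace tokenization, the function names, and the normalization in `overlap_score` (LCS length divided by the length of the shorter sequence) are assumptions made for illustration only.

```python
from collections import Counter


def unique_word_sequence(text: str) -> list[str]:
    """Return the words that occur exactly once, in order of appearance."""
    words = text.lower().split()  # assumption: plain whitespace tokenization
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_score(text_a: str, text_b: str) -> float:
    """Hypothetical score: LCS length over the shorter unique-word sequence."""
    ua, ub = unique_word_sequence(text_a), unique_word_sequence(text_b)
    shorter = min(len(ua), len(ub))
    return lcs_length(ua, ub) / shorter if shorter else 0.0


# A single simulated OCR error ("fox" read as "fax") lowers the score only slightly.
clean = "the quick brown fox jumps over the lazy dog"
noisy = "the quick brown fax jumps over the lazy dog"
score = overlap_score(clean, noisy)
```

Because only words that occur once are kept, the sequences being aligned are much shorter than the full text, which keeps the quadratic LCS step affordable in this sketch.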

Pages: 469-474

Citations: 22

Detecting anomalies in graphs with numeric labels

Pub Date : 2011-10-24 DOI: 10.1145/2063576.2063749

Michael Davis, Weiru Liu, P. Miller, G. Redpath

This paper presents Yagada, an algorithm to search labelled graphs for anomalies using both structural data and numeric attributes. Yagada is explained using several security-related examples and validated with experiments on a physical Access Control database. Quantitative analysis shows that in the upper range of anomaly thresholds, Yagada detects twice as many anomalies as the best-performing numeric discretization algorithm. Qualitative evaluation shows that the detected anomalies are meaningful, representing a combination of structural irregularities and numerical outliers.

Citations: 40

Question identification on twitter

Pub Date : 2011-10-24 DOI: 10.1145/2063576.2063996

Baichuan Li, Xiance Si, Michael R. Lyu, Irwin King, E. Chang

In this paper, we investigate the novel problem of automatic question identification in the microblog environment. The task consists of two steps: detecting tweets that contain questions (we call them "interrogative tweets") and extracting from them the tweets that really seek information or ask for help (so-called "qweets"). To detect interrogative tweets, both a traditional rule-based approach and a state-of-the-art learning-based method are employed. To extract qweets, context features such as short URLs and Tweet-specific features such as Retweets are carefully selected for classification. We conduct an empirical study on a one-hour sample of English tweets and report our experimental results for question identification on Twitter.
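As a rough illustration of the first step, a rule-based detector of interrogative tweets might look like the Python sketch below. The two rules (a question mark outside URLs, or a leading question word), the word list, and the feature names are assumptions chosen for this example; the abstract does not specify the authors' actual rules or feature set.

```python
import re

# Hypothetical rule set: a question mark, or a leading question word.
QUESTION_WORDS = ("who", "what", "when", "where", "why", "how", "which",
                  "is", "are", "can", "could", "does", "do", "should", "would")
URL_PATTERN = re.compile(r"https?://\S+")


def is_interrogative(tweet: str) -> bool:
    """Flag a tweet as interrogative using two simple surface rules."""
    # Strip URLs first, because query strings also contain '?'.
    text = URL_PATTERN.sub("", tweet).strip().lower()
    if "?" in text:
        return True
    tokens = re.findall(r"[a-z']+", text)
    return bool(tokens) and tokens[0] in QUESTION_WORDS


def context_features(tweet: str) -> dict[str, bool]:
    """Toy features of the kind the abstract mentions for classifying qweets."""
    return {
        "has_url": bool(URL_PATTERN.search(tweet)),
        "is_retweet": tweet.startswith("RT @"),
    }
```

In a two-step pipeline of this shape, only tweets flagged by `is_interrogative` would be passed on, together with features such as those from `context_features`, to a trained classifier that separates qweets from rhetorical or promotional questions.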

Citations: 59

More or better: on trade-offs in compacting textual problem solution repositories

Pub Date : 2011-10-24 DOI: 10.1145/2063576.2063956

P Deepak, Sutanu Chakraborti, D. Khemani

In this paper, we look into the problem of filtering problem-solution repositories (from sources such as community-driven question answering systems) to render them more suitable for use in knowledge reuse systems. We explore harnessing the fuzzy nature of the usability of a solution to a problem for such compaction. Fuzzy usabilities lead to several challenges, notably the trade-off between choosing generic solutions and choosing better ones. We develop an approach that can heed a user specification of the trade-off between these criteria, and we introduce several quality measures based on fuzzy usability estimates to ascertain the quality of a problem-solution repository for use in a Case Based Reasoning system. We establish, through a detailed empirical analysis, that our approach outperforms state-of-the-art approaches on virtually all quality measures.

Citations: 2

Extracting multi-dimensional relations: a generative model of groups of entities in a corpus

Pub Date : 2011-10-24 DOI: 10.1145/2063576.2063750

C. Yeung, Tomoharu Iwata

Extracting relations among different entities from various data sources has been an important topic in data mining. While many methods focus on only a single type of relation, real-world entities maintain relations that contain much richer information. We propose a hierarchical Bayesian model for extracting multi-dimensional relations among entities from a text corpus. Using data from Wikipedia, we show that our model can accurately predict the relevance of an entity given the topic of the document as well as the set of entities that are already mentioned in that document.

Citations: 1

Image clustering fusion technique based on BFS

Pub Date : 2011-10-24 DOI: 10.1145/2063576.2063898

Luca Costantini, Raffaele Nicolussi

With the increase in the number and size of databases dedicated to the storage of visual content, the need for effective retrieval systems has become crucial. The proposed method contributes to meeting this need through a technique in which sets of clusters are fused together to create a unique and more significant set of clusters. The images are represented by a set of features and are then grouped according to these features, which are considered one by one. A probability matrix is then built and explored by the breadth-first search algorithm with the aim of selecting a unique set of clusters. Experimental results, obtained using two different datasets, show the effectiveness of the proposed technique. Furthermore, the proposed approach overcomes the drawback of having to tune a set of parameters that fuse the similarity measurements obtained from each feature into an overall similarity between two images.
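One plausible reading of this pipeline is sketched below in Python, under stated assumptions. The abstract does not define the probability matrix or how the search selects clusters, so this sketch assumes a co-association matrix (the fraction of per-feature clusterings that place two images together) and a breadth-first search over links above a fixed `threshold`. Both choices, and all function names, are hypothetical; the `threshold` in particular is a parameter that the authors' method may not require.

```python
from collections import deque


def co_association(clusterings: list[list[int]]) -> list[list[float]]:
    """P[i][j] = fraction of clusterings that put items i and j together."""
    n = len(clusterings[0])
    return [[sum(labels[i] == labels[j] for labels in clusterings) / len(clusterings)
             for j in range(n)] for i in range(n)]


def fuse_by_bfs(prob: list[list[float]], threshold: float = 0.5) -> list[list[int]]:
    """Group items reachable through links whose probability exceeds the threshold."""
    n = len(prob)
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue, members = deque([start]), []
        while queue:  # standard breadth-first traversal from `start`
            i = queue.popleft()
            members.append(i)
            for j in range(n):
                if not seen[j] and prob[i][j] > threshold:
                    seen[j] = True
                    queue.append(j)
        clusters.append(sorted(members))
    return clusters


# Three per-feature clusterings of five images (cluster labels are arbitrary).
clusterings = [[0, 0, 1, 1, 2], [0, 0, 0, 1, 1], [1, 1, 2, 2, 2]]
fused = fuse_by_bfs(co_association(clusterings))
```

Here images 0 and 1 agree in all three clusterings, while 2, 3, and 4 are chained together by pairwise agreement in two of three, so the fused result has two clusters.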

Citations: 3

Semantic convolution kernels over dependency trees: smoothed partial tree kernel

Pub Date : 2011-10-24 DOI: 10.1145/2063576.2063878

D. Croce, Alessandro Moschitti, Roberto Basili

In recent years, natural language processing techniques have been used more and more in IR. Among others, syntactic and semantic parsing are effective methods for the design of complex applications such as question answering and sentiment analysis. Unfortunately, extracting feature representations suitable for machine learning algorithms from linguistic structures is typically difficult. In this paper, we describe one of the most advanced pieces of technology for automatic engineering of syntactic and semantic patterns. This method merges convolution dependency tree kernels with lexical similarities. It can efficiently and effectively measure the similarity between dependency structures whose lexical nodes are partly or completely different. Its use in powerful algorithms such as Support Vector Machines (SVMs) allows for fast design of accurate automatic systems. We report some experiments on question classification, which show an unprecedented result, e.g. a 41% error reduction relative to the former state of the art, along with an analysis of the nice properties of the approach.

Pages: 2013-2016

Citations: 30

The list Viterbi training algorithm and its application to keyword search over databases

Pub Date : 2011-10-24 DOI: 10.1145/2063576.2063808

Silvia Rota, S. Bergamaschi, F. Guerra

Hidden Markov Models (HMMs) are today employed in a variety of applications, ranging from speech recognition to bioinformatics. In this paper, we present the List Viterbi training algorithm, a version of the Expectation-Maximization (EM) algorithm based on the List Viterbi algorithm instead of the commonly used forward-backward algorithm. We developed batch and online versions of the algorithm, and we also describe an interesting application in the context of keyword search over databases, where we exploit an HMM for matching keywords to database terms. In our experiments, we tested the online version of the training algorithm in a semi-supervised setting that allows us to take into account the feedback provided by users.
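For orientation, the Python sketch below shows standard Viterbi decoding, which returns only the single most likely state sequence; the List Viterbi algorithm named in the abstract generalizes this to the top-k sequences, and the training procedure itself is not shown. The two states, the three keywords, and every probability are invented toy values, loosely framed as mapping keywords to database terms, and are not taken from the paper.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the single most likely state sequence."""
    # best[s] holds the best (probability, path) among sequences ending in s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


# Toy model (all names and numbers are hypothetical).
states = ["title", "actor"]
start_p = {"title": 0.6, "actor": 0.4}
trans_p = {"title": {"title": 0.7, "actor": 0.3},
           "actor": {"title": 0.4, "actor": 0.6}}
emit_p = {"title": {"casablanca": 0.5, "1942": 0.4, "bogart": 0.1},
          "actor": {"casablanca": 0.1, "1942": 0.3, "bogart": 0.6}}

prob, path = viterbi(["casablanca", "1942", "bogart"], states, start_p, trans_p, emit_p)
# path is ["title", "title", "actor"] for this toy model
```

A Viterbi-style training loop would alternate between decoding with the current parameters and re-estimating the parameters from the decoded paths; keeping a list of the k best paths rather than one is what distinguishes the list variant.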

Citations: 6

Insights into explicit semantic analysis

Pub Date : 2011-10-24 DOI: 10.1145/2063576.2063865

Thomas Gottron, Maik Anderka, Benno Stein

Since its debut, Explicit Semantic Analysis (ESA) has received much attention in the IR community. ESA has been shown to perform surprisingly well in several tasks and in different contexts. However, given the conceptual motivation for ESA, recent work has observed unexpected behavior. In this paper, we look at the foundations of ESA from a theoretical point of view and employ a general probabilistic model for term weights, which reveals how ESA actually works. Based on this model, we explain some of the phenomena that have been observed in previous work and support our findings with new experiments. Moreover, we provide a theoretical grounding for how the size and the composition of the index collection affect the ESA-based computation of similarity values for texts.
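For readers unfamiliar with ESA, the basic computation can be sketched in Python as follows: each text is mapped to a vector with one entry per document of an index collection, and two texts are compared by the cosine of these vectors. This is a simplified illustration, not the probabilistic model analysed in the paper; the tiny three-document index, the whitespace tokenization, and the particular tf-idf variant are assumptions.

```python
import math
from collections import Counter


def tfidf_index(index_docs: list[str]) -> list[dict[str, float]]:
    """One tf-idf weighted term vector per index document ("concept")."""
    tokenized = [Counter(doc.lower().split()) for doc in index_docs]
    n = len(tokenized)
    df = Counter(term for counts in tokenized for term in counts)
    return [{t: tf * math.log(n / df[t]) for t, tf in counts.items()}
            for counts in tokenized]


def esa_vector(text: str, index: list[dict[str, float]]) -> list[float]:
    """Concept-space vector: one association score per index document."""
    counts = Counter(text.lower().split())
    return [sum(tf * concept.get(t, 0.0) for t, tf in counts.items())
            for concept in index]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical three-document index collection.
index = tfidf_index(["cat dog pet animal",
                     "stock market bank money",
                     "guitar piano music song"])
related = cosine(esa_vector("dog pet", index), esa_vector("cat animal", index))
unrelated = cosine(esa_vector("dog pet", index), esa_vector("bank money", index))
```

In this toy setup, "dog pet" and "cat animal" share no words yet map to the same index document, so `related` is close to 1 while `unrelated` is 0. Changing which documents make up the index changes these values, which is the kind of dependence on the index collection the abstract refers to.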

Citations: 62

YANA: an efficient privacy-preserving recommender system for online social communities

Pub Date : 2011-10-24 DOI: 10.1145/2063576.2063943

Dongsheng Li, Q. Lv, L. Shang, Ning Gu

In online social communities, many recommender systems use collaborative filtering, a method that makes recommendations based on what is liked by other users with similar interests. Serious privacy issues may arise in this process, as sensitive personal information (e.g., content interests) may be collected and disclosed to other parties, especially the recommender server. In this paper, we propose YANA (short for "you are not alone"), an efficient group-based privacy-preserving collaborative filtering system for content recommendation in online social communities. We have developed a prototype system on desktop and mobile devices and evaluated it using real-world data. The results demonstrate that YANA can effectively protect users' privacy while achieving high recommendation quality and energy efficiency.
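The collaborative filtering idea in the first sentence can be made concrete with a small Python sketch of plain, centralized user-based filtering. This is not YANA: it is the conventional setup in which one party sees every user's likes, which is exactly the exposure the abstract is concerned with. The Jaccard similarity, the scoring rule, and all names are assumptions for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two users' liked items, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target: str, likes: dict[str, set[str]], top_n: int = 3) -> list[str]:
    """Rank unseen items by the summed similarity of the users who like them."""
    scores: dict[str, float] = {}
    for user, items in likes.items():
        if user == target:
            continue
        sim = jaccard(likes[target], items)
        for item in items - likes[target]:
            scores[item] = scores.get(item, 0.0) + sim
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical users and items.
likes = {"ann": {"a", "b", "c"}, "bob": {"a", "b", "d"}, "cy": {"x", "y"}}
suggestions = recommend("ann", likes)
```

Only the user with overlapping interests contributes a recommendation here; items liked solely by a user with no overlap receive a score of zero and are dropped.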

Citations: 8
