Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management最新文献
A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words (in the order they appear in the text) which appear only once in the book. These words are referred to as "unique words" and they constitute a small percentage of all the words in a typical book. Along with the order information the set of unique words provides a compact representation which is highly descriptive of the content and the flow of ideas in the book. By aligning the sequence of unique words from two books using the longest common subsequence (LCS) one can discover whether two books are duplicates. Experiments on several datasets show that DUPNIQ is more accurate than traditional methods for duplicate detection such as shingling and is fast. On a collection of 100K scanned English books DUPNIQ detects partial duplicates in 30 min using 350 cores and has precision 0.996 and recall 0.833 compared to shingling with precision 0.992 and recall 0.720. The technique works on other languages as well and is demonstrated for a French dataset.
{"title":"Partial duplicate detection for large book collections","authors":"I. Z. Yalniz, E. Can, R. Manmatha","doi":"10.1145/2063576.2063647","DOIUrl":"https://doi.org/10.1145/2063576.2063647","url":null,"abstract":"A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words (in the order they appear in the text) which appear only once in the book. These words are referred to as \"unique words\" and they constitute a small percentage of all the words in a typical book. Along with the order information the set of unique words provides a compact representation which is highly descriptive of the content and the flow of ideas in the book. By aligning the sequence of unique words from two books using the longest common subsequence (LCS) one can discover whether two books are duplicates. Experiments on several datasets show that DUPNIQ is more accurate than traditional methods for duplicate detection such as shingling and is fast. On a collection of 100K scanned English books DUPNIQ detects partial duplicates in 30 min using 350 cores and has precision 0.996 and recall 0.833 compared to shingling with precision 0.992 and recall 0.720. The technique works on other languages as well and is demonstrated for a French dataset.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"238 1","pages":"469-474"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82244274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents Yagada, an algorithm to search labelled graphs for anomalies using both structural data and numeric attributes. Yagada is explained using several security-related examples and validated with experiments on a physical Access Control database. Quantitative analysis shows that in the upper range of anomaly thresholds, Yagada detects twice as many anomalies as the best-performing numeric discretization algorithm. Qualitative evaluation shows that the detected anomalies are meaningful, representing a combination of structural irregularities and numerical outliers.
{"title":"Detecting anomalies in graphs with numeric labels","authors":"Michael Davis, Weiru Liu, P. Miller, G. Redpath","doi":"10.1145/2063576.2063749","DOIUrl":"https://doi.org/10.1145/2063576.2063749","url":null,"abstract":"This paper presents Yagada, an algorithm to search labelled graphs for anomalies using both structural data and numeric attributes. Yagada is explained using several security-related examples and validated with experiments on a physical Access Control database. Quantitative analysis shows that in the upper range of anomaly thresholds, Yagada detects twice as many anomalies as the best-performing numeric discretization algorithm. Qualitative evaluation shows that the detected anomalies are meaningful, representing a combination of structural irregularities and numerical outliers.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"209 1","pages":"1197-1202"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88058276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Baichuan Li, Xiance Si, Michael R. Lyu, Irwin King, E. Chang
In this paper, we investigate the novel problem of automatic question identification in the microblog environment. It contains two steps: detecting tweets that contain questions (we call them "interrogative tweets") and extracting the tweets which really seek information or ask for help (so called "qweets") from interrogative tweets. To detect interrogative tweets, both traditional rule-based approach and state-of-the-art learning-based method are employed. To extract qweets, context features like short urls and Tweet-specific features like Retweets are elaborately selected for classification. We conduct an empirical study with sampled one hour's English tweets and report our experimental results for question identification on Twitter.
{"title":"Question identification on twitter","authors":"Baichuan Li, Xiance Si, Michael R. Lyu, Irwin King, E. Chang","doi":"10.1145/2063576.2063996","DOIUrl":"https://doi.org/10.1145/2063576.2063996","url":null,"abstract":"In this paper, we investigate the novel problem of automatic question identification in the microblog environment. It contains two steps: detecting tweets that contain questions (we call them \"interrogative tweets\") and extracting the tweets which really seek information or ask for help (so called \"qweets\") from interrogative tweets. To detect interrogative tweets, both traditional rule-based approach and state-of-the-art learning-based method are employed. To extract qweets, context features like short urls and Tweet-specific features like Retweets are elaborately selected for classification. We conduct an empirical study with sampled one hour's English tweets and report our experimental results for question identification on Twitter.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"45 1","pages":"2477-2480"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83939842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we look into the problem of filtering problem solution repositories (from sources such as community-driven question answering systems) to render them more suitable for usage in knowledge reuse systems. We explore harnessing the fuzzy nature of usability of a solution to a problem, for such compaction. Fuzzy usabilities lead to several challenges; notably, the trade-off between choosing generic or better solutions. We develop an approach that can heed to a user specification of the trade-off between these criteria and introduce several quality measures based on fuzzy usability estimates to ascertain the quality of a problem-solution repository for usage in a Case Based Reasoning system. We establish, through a detailed empirical analysis, that our approach outperforms state-of-the-art approaches on virtually all quality measures.
{"title":"More or better: on trade-offs in compacting textual problem solution repositories","authors":"P Deepak, Sutanu Chakraborti, D. Khemani","doi":"10.1145/2063576.2063956","DOIUrl":"https://doi.org/10.1145/2063576.2063956","url":null,"abstract":"In this paper, we look into the problem of filtering problem solution repositories (from sources such as community-driven question answering systems) to render them more suitable for usage in knowledge reuse systems. We explore harnessing the fuzzy nature of usability of a solution to a problem, for such compaction. Fuzzy usabilities lead to several challenges; notably, the trade-off between choosing generic or better solutions. We develop an approach that can heed to a user specification of the trade-off between these criteria and introduce several quality measures based on fuzzy usability estimates to ascertain the quality of a problem-solution repository for usage in a Case Based Reasoning system. We establish, through a detailed empirical analysis, that our approach outperforms state-of-the-art approaches on virtually all quality measures.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"1 1","pages":"2321-2324"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82934697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extracting relations among different entities from various data sources has been an important topic in data mining. While many methods focus only on a single type of relations, real world entities maintain relations that contain much richer information. We propose a hierarchical Bayesian model for extracting multi-dimensional relations among entities from a text corpus. Using data from Wikipedia, we show that our model can accurately predict the relevance of an entity given the topic of the document as well as the set of entities that are already mentioned in that document.
{"title":"Extracting multi-dimensional relations: a generative model of groups of entities in a corpus","authors":"C. Yeung, Tomoharu Iwata","doi":"10.1145/2063576.2063750","DOIUrl":"https://doi.org/10.1145/2063576.2063750","url":null,"abstract":"Extracting relations among different entities from various data sources has been an important topic in data mining. While many methods focus only on a single type of relations, real world entities maintain relations that contain much richer information. We propose a hierarchical Bayesian model for extracting multi-dimensional relations among entities from a text corpus. Using data from Wikipedia, we show that our model can accurately predict the relevance of an entity given the topic of the document as well as the set of entities that are already mentioned in that document.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"36 1","pages":"1203-1208"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89215350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the increasing in number and size of databases dedicated to the storage of visual content, the need for effective retrieval systems has become crucial. The proposed method makes a significant contribution to meet this need through a technique in which sets of clusters are fused together to create an unique and more significant set of clusters. The images are represented by some features and then are grouped by these features, that are considered one by one. A probability matrix is then built and explored by the breadth first search algorithm with the aim of select an unique set of clusters. Experimental results, obtained using two different datasets, show the effectiveness of the proposed technique. Furthermore, the proposed approach overcomes the drawback of tuning a set of parameters that fuse the similarity measurement obtained by each feature to get an overall similarity between two images.
{"title":"Image clustering fusion technique based on BFS","authors":"Luca Costantini, Raffaele Nicolussi","doi":"10.1145/2063576.2063898","DOIUrl":"https://doi.org/10.1145/2063576.2063898","url":null,"abstract":"With the increasing in number and size of databases dedicated to the storage of visual content, the need for effective retrieval systems has become crucial. The proposed method makes a significant contribution to meet this need through a technique in which sets of clusters are fused together to create an unique and more significant set of clusters. The images are represented by some features and then are grouped by these features, that are considered one by one. A probability matrix is then built and explored by the breadth first search algorithm with the aim of select an unique set of clusters. Experimental results, obtained using two different datasets, show the effectiveness of the proposed technique. Furthermore, the proposed approach overcomes the drawback of tuning a set of parameters that fuse the similarity measurement obtained by each feature to get an overall similarity between two images.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"9 1","pages":"2093-2096"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90436068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, natural language processing techniques have been used more and more in IR. Among other syntactic and semantic parsing are effective methods for the design of complex applications like for example question answering and sentiment analysis. Unfortunately, extracting feature representations suitable for machine learning algorithms from linguistic structures is typically difficult. In this paper, we describe one of the most advanced piece of technology for automatic engineering of syntactic and semantic patterns. This method merges together convolution dependency tree kernels with lexical similarities. It can efficiently and effectively measure the similarity between dependency structures, whose lexical nodes are in part or completely different. Its use in powerful algorithm such as Support Vector Machines (SVMs) allows for fast design of accurate automatic systems. We report some experiments on question classification, which show an unprecedented result, e.g. 41% of error reduction of the former state-of-the-art, along with the analysis of the nice properties of the approach.
{"title":"Semantic convolution kernels over dependency trees: smoothed partial tree kernel","authors":"D. Croce, Alessandro Moschitti, Roberto Basili","doi":"10.1145/2063576.2063878","DOIUrl":"https://doi.org/10.1145/2063576.2063878","url":null,"abstract":"In recent years, natural language processing techniques have been used more and more in IR. Among other syntactic and semantic parsing are effective methods for the design of complex applications like for example question answering and sentiment analysis. Unfortunately, extracting feature representations suitable for machine learning algorithms from linguistic structures is typically difficult. In this paper, we describe one of the most advanced piece of technology for automatic engineering of syntactic and semantic patterns. This method merges together convolution dependency tree kernels with lexical similarities. It can efficiently and effectively measure the similarity between dependency structures, whose lexical nodes are in part or completely different. Its use in powerful algorithm such as Support Vector Machines (SVMs) allows for fast design of accurate automatic systems.\u0000 We report some experiments on question classification, which show an unprecedented result, e.g. 41% of error reduction of the former state-of-the-art, along with the analysis of the nice properties of the approach.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"65 1","pages":"2013-2016"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80731405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hidden Markov Models (HMMs) are today employed in a variety of applications, ranging from speech recognition to bioinformatics. In this paper, we present the List Viterbi training algorithm, a version of the Expectation-Maximization (EM) algorithm based on the List Viterbi algorithm instead of the commonly used forward-backward algorithm. We developed the batch and online versions of the algorithm, and we also describe an interesting application in the context of keyword search over databases, where we exploit a HMM for matching keywords into database terms. In our experiments we tested the online version of the training algorithm in a semi-supervised setting that allows us to take into account the feedbacks provided by the users.
{"title":"The list Viterbi training algorithm and its application to keyword search over databases","authors":"Silvia Rota, S. Bergamaschi, F. Guerra","doi":"10.1145/2063576.2063808","DOIUrl":"https://doi.org/10.1145/2063576.2063808","url":null,"abstract":"Hidden Markov Models (HMMs) are today employed in a variety of applications, ranging from speech recognition to bioinformatics. In this paper, we present the List Viterbi training algorithm, a version of the Expectation-Maximization (EM) algorithm based on the List Viterbi algorithm instead of the commonly used forward-backward algorithm. We developed the batch and online versions of the algorithm, and we also describe an interesting application in the context of keyword search over databases, where we exploit a HMM for matching keywords into database terms. In our experiments we tested the online version of the training algorithm in a semi-supervised setting that allows us to take into account the feedbacks provided by the users.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"12 1","pages":"1601-1606"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79662087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Since its debut the Explicit Semantic Analysis (ESA) has received much attention in the IR community. ESA has been proven to perform surprisingly well in several tasks and in different contexts. However, given the conceptual motivation for ESA, recent work has observed unexpected behavior. In this paper we look at the foundations of ESA from a theoretical point of view and employ a general probabilistic model for term weights which reveals how ESA actually works. Based on this model we explain some of the phenomena that have been observed in previous work and support our findings with new experiments. Moreover, we provide a theoretical grounding on how the size and the composition of the index collection affect the ESA-based computation of similarity values for texts.
{"title":"Insights into explicit semantic analysis","authors":"Thomas Gottron, Maik Anderka, Benno Stein","doi":"10.1145/2063576.2063865","DOIUrl":"https://doi.org/10.1145/2063576.2063865","url":null,"abstract":"Since its debut the Explicit Semantic Analysis (ESA) has received much attention in the IR community. ESA has been proven to perform surprisingly well in several tasks and in different contexts. However, given the conceptual motivation for ESA, recent work has observed unexpected behavior. In this paper we look at the foundations of ESA from a theoretical point of view and employ a general probabilistic model for term weights which reveals how ESA actually works. Based on this model we explain some of the phenomena that have been observed in previous work and support our findings with new experiments. Moreover, we provide a theoretical grounding on how the size and the composition of the index collection affect the ESA-based computation of similarity values for texts.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"46 1","pages":"1961-1964"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79678444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In online social communities, many recommender systems use collaborative filtering, a method that makes recommendations based on what are liked by other users with similar interests. Serious privacy issues may arise in this process, as sensitive personal information (e.g., content interests) may be collected and disclosed to other parties, especially the recommender server. In this paper, we propose YANA (short for "you are not alone"), an efficient group-based privacy-preserving collaborative filtering system for content recommendation in online social communities. We have developed a prototype system on desktop and mobile devices, and evaluated it using real world data. The results demonstrate that YANA can effectively protect users' privacy, while achieving high recommendation quality and energy efficiency.
在在线社交社区中,许多推荐系统使用协同过滤,这是一种基于其他有相似兴趣的用户喜欢的内容进行推荐的方法。在此过程中可能会出现严重的隐私问题,因为敏感的个人信息(例如,内容兴趣)可能会被收集并披露给其他方,特别是推荐服务器。在本文中,我们提出了YANA (you are not alone的缩写),这是一个高效的基于群体的隐私保护协同过滤系统,用于在线社交社区的内容推荐。我们已经在桌面和移动设备上开发了一个原型系统,并使用真实世界的数据对其进行了评估。结果表明,YANA可以有效地保护用户隐私,同时获得较高的推荐质量和能效。
{"title":"YANA: an efficient privacy-preserving recommender system for online social communities","authors":"Dongsheng Li, Q. Lv, L. Shang, Ning Gu","doi":"10.1145/2063576.2063943","DOIUrl":"https://doi.org/10.1145/2063576.2063943","url":null,"abstract":"In online social communities, many recommender systems use collaborative filtering, a method that makes recommendations based on what are liked by other users with similar interests. Serious privacy issues may arise in this process, as sensitive personal information (e.g., content interests) may be collected and disclosed to other parties, especially the recommender server. In this paper, we propose YANA (short for \"you are not alone\"), an efficient group-based privacy-preserving collaborative filtering system for content recommendation in online social communities. We have developed a prototype system on desktop and mobile devices, and evaluated it using real world data. The results demonstrate that YANA can effectively protect users' privacy, while achieving high recommendation quality and energy efficiency.","PeriodicalId":74507,"journal":{"name":"Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management","volume":"23 1","pages":"2269-2272"},"PeriodicalIF":0.0,"publicationDate":"2011-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83224737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management