
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval: latest papers

Lightening the load of document smoothing for better language modeling retrieval
Mark D. Smucker, James Allan
We hypothesized that language modeling retrieval would improve if we reduced the need for document smoothing to provide an inverse document frequency (IDF) like effect. We created inverse collection frequency (ICF) weighted query models as a tool to partially separate the IDF-like role from document smoothing. Compared to maximum likelihood estimated (MLE) queries, the ICF weighted queries achieved a 6.4% improvement in mean average precision on description queries. The ICF weighted queries performed better with less document smoothing than that required by MLE queries. Language modeling retrieval may benefit from a means to separately incorporate an IDF-like behavior outside of document smoothing.
DOI: https://doi.org/10.1145/1148170.1148324 (published 2006-08-06)
Citations: 9
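The abstract above does not give the ICF formula, so the following is only a minimal sketch under a stated assumption: each query term's weight is taken as its query count times log(collection size / collection frequency), normalized to sum to 1. It is shown next to the MLE query model (relative frequency within the query) for contrast. The function names and the toy collection counts are hypothetical.

```python
import math
from collections import Counter


def mle_query_model(query_terms):
    """P(w|Q) as the relative frequency of w in the query (MLE)."""
    counts = Counter(query_terms)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def icf_query_model(query_terms, collection_freq, collection_size):
    """Assumed ICF weighting: count(w) * log(collection_size / cf(w)),
    normalized so the weights sum to 1. This form is an illustration,
    not necessarily the formula used in the paper."""
    counts = Counter(query_terms)
    raw = {
        w: c * math.log(collection_size / collection_freq[w])
        for w, c in counts.items()
    }
    norm = sum(raw.values())
    return {w: v / norm for w, v in raw.items()}


# Hypothetical collection statistics (token counts), for illustration only.
collection_freq = {"the": 50_000, "effects": 800, "of": 30_000, "smoothing": 20}
collection_size = 1_000_000

query = ["the", "effects", "of", "smoothing"]
mle = mle_query_model(query)
icf = icf_query_model(query, collection_freq, collection_size)

for term in query:
    print(f"{term:10s} MLE={mle[term]:.3f} ICF={icf[term]:.3f}")
```

Under this assumption the rare term "smoothing" receives more query weight than the frequent term "the", whereas the MLE model weights all four terms equally. That shift is the IDF-like effect the abstract describes moving out of document smoothing.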
NMF and PLSI: equivalence and a hybrid algorithm
C. Ding, Tao Li, Wei Peng
In this paper, we show that PLSI and NMF optimize the same objective function, although PLSI and NMF are different algorithms, as verified by experiments. We also propose a new hybrid method that runs PLSI and NMF alternately to achieve better solutions.
DOI: https://doi.org/10.1145/1148170.1148295 (published 2006-08-06)
Citations: 24
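One way to see how the two methods in the entry above can share an objective is sketched below. It assumes the NMF objective is the I-divergence (generalized KL divergence) between a normalized term-document matrix X and the factorization FG^T. The abstract does not say which NMF objective is meant, and the symbols X, F, G and the normalization conditions are assumptions introduced here.

```latex
% PLSI log-likelihood, with X_{ij} the normalized count of word w_i in document d_j
L_{\mathrm{PLSI}} = \sum_{i,j} X_{ij} \log P(w_i, d_j),
\qquad
P(w_i, d_j) = \sum_{k} P(w_i \mid z_k)\, P(z_k)\, P(d_j \mid z_k)

% NMF with the I-divergence objective, X \approx F G^{\top}, F, G \ge 0
J_{\mathrm{NMF}} = \sum_{i,j} \left[ X_{ij} \log \frac{X_{ij}}{(FG^{\top})_{ij}}
                   - X_{ij} + (FG^{\top})_{ij} \right]

% Assume \sum_{i,j} X_{ij} = \sum_{i,j} (FG^{\top})_{ij} = 1
% and read (FG^{\top})_{ij} as P(w_i, d_j). Then
J_{\mathrm{NMF}} = \sum_{i,j} X_{ij} \log X_{ij} \;-\; L_{\mathrm{PLSI}}
```

The first term on the last line does not depend on the factors, so under these assumptions minimizing J_NMF is the same problem as maximizing L_PLSI, even though the two update procedures differ.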
Contextual search and name disambiguation in email using graphs
Einat Minkov, William W. Cohen, A. Ng
Similarity measures for text have historically been an important tool for solving information retrieval problems. In many interesting settings, however, documents are often closely connected to other documents, as well as other non-textual objects: for instance, email messages are connected to other messages via header information. In this paper we consider extended similarity metrics for documents and other objects embedded in graphs, facilitated via a lazy graph walk. We provide a detailed instantiation of this framework for email data, where content, social networks and a timeline are integrated in a structural graph. The suggested framework is evaluated for two email-related problems: disambiguating names in email documents, and threading. We show that reranking schemes based on the graph-walk similarity measures often outperform baseline methods, and that further improvements can be obtained by use of appropriate learning methods.
DOI: https://doi.org/10.1145/1148170.1148179 (published 2006-08-06)
Citations: 191
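The lazy graph walk mentioned in the entry above can be illustrated with a small sketch. It assumes a simple form of laziness: at every step the walker stays at its current node with a fixed probability and otherwise moves to a uniformly chosen neighbor. The abstract does not specify edge types, edge weights, or the stay probability, so the toy graph, the `stay_prob` value, and the function name are all hypothetical.

```python
def lazy_walk(graph, start, steps, stay_prob=0.5):
    """Propagate a probability distribution over nodes for a fixed number
    of steps. At each step, mass stays put with probability stay_prob and
    is otherwise split evenly among the node's neighbors."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, p in dist.items():
            neighbors = graph.get(node, [])
            if not neighbors:
                # Dead end: all mass remains on the node.
                nxt[node] = nxt.get(node, 0.0) + p
                continue
            nxt[node] = nxt.get(node, 0.0) + stay_prob * p
            share = (1.0 - stay_prob) * p / len(neighbors)
            for nb in neighbors:
                nxt[nb] = nxt.get(nb, 0.0) + share
        dist = nxt
    return dist


# Hypothetical email graph: messages linked to the people in their headers.
graph = {
    "msg1": ["alice", "bob"],
    "msg2": ["alice"],
    "alice": ["msg1", "msg2"],
    "bob": ["msg1"],
}

scores = lazy_walk(graph, "msg2", steps=2)
for node, p in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node:6s} {p:.3f}")
```

Nodes reachable from the start node in few steps accumulate more probability mass, so the resulting distribution can serve as a graph-based similarity score for ranking. In this toy run "bob" is two hops beyond "alice" and receives no mass within two steps.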
Swordfish: an unsupervised Ngram based approach to morphological analysis
Christopher T. Jordan, J. Healy, Vlado Keselj
Extracting morphemes from words is a nontrivial task. Rule based stemming approaches such as Porter's algorithm have had some success; however, they can identify only a limited number of affixes and are language dependent. When dealing with languages with many affixes, rule based approaches generally require many more rules to deal with all the possible word forms. Deriving these rules requires a larger effort on the part of linguists and in some instances can be simply impractical. We propose an unsupervised ngram based approach, named Swordfish. Using ngram probabilities in the corpus, possible morphemes are identified. We look at two possible methods for identifying candidate morphemes, one using joint probabilities between two ngrams, and the second based on log odds between prefix probabilities. Initial results indicate the joint probability approach to be better for English while the prefix ratio approach is better for Finnish and Turkish.
DOI: https://doi.org/10.1145/1148170.1148303 (published 2006-08-06)
Citations: 7
You are what you say: privacy risks of public mentions
Dan Frankowski, D. Cosley, Shilad Sen, L. Terveen, J. Riedl
In today's data-rich networked world, people express many aspects of their lives online. It is common to segregate different aspects in different places: you might write opinionated rants about movies in your blog under a pseudonym while participating in a forum or web site for scholarly discussion of medical ethics under your real name. However, it may be possible to link these separate identities, because the movies, journal articles, or authors you mention are from a sparse relation space whose properties (e.g., many items related to by only a few users) allow re-identification. This re-identification violates people's intentions to separate aspects of their life and can have negative consequences; it also may allow other privacy violations, such as obtaining a stronger identifier like name and address. This paper examines this general problem in a specific setting: re-identification of users from a public web movie forum in a private movie ratings dataset. We present three major results. First, we develop algorithms that can re-identify a large proportion of public users in a sparse relation space. Second, we evaluate whether private dataset owners can protect user privacy by hiding data; we show that this requires extensive and undesirable changes to the dataset, making it impractical. Third, we evaluate two methods for users in a public forum to protect their own privacy: suppression and misdirection. Suppression doesn't work here either. However, we show that a simple misdirection strategy works well: mention a few popular items that you haven't rated.
DOI: https://doi.org/10.1145/1148170.1148267 (published 2006-08-06)
Citations: 93
Large scale semi-supervised linear SVMs
Vikas Sindhwani, S. Keerthi
Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In many information retrieval and data mining applications, linear classifiers are strongly preferred because of their ease of implementation, interpretability and empirical performance. In this work, we present a family of semi-supervised linear support vector classifiers that are designed to handle partially-labeled sparse datasets with a possibly very large number of examples and features. At their core, our algorithms employ recently developed modified finite Newton techniques. Our contributions in this paper are as follows: (a) We provide an implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques, for linear classification problems involving large, sparse datasets. (b) We propose a variant of TSVM that involves multiple switching of labels. Experimental results show that this variant provides an order of magnitude further improvement in training efficiency. (c) We present a new algorithm for semi-supervised learning based on a Deterministic Annealing (DA) approach. This algorithm alleviates the problem of local minima in the TSVM optimization procedure while also being computationally attractive. We conduct an empirical study on several document classification tasks that confirms the value of our methods in large scale semi-supervised settings.
DOI: https://doi.org/10.1145/1148170.1148253 (published 2006-08-06)
Citations: 260
Simple questions to improve pseudo-relevance feedback results
G. Kumaran, James Allan
We explore interactive methods to further improve the performance of pseudo-relevance feedback. Studies suggest that new methods for tackling difficult queries are required. Our approach is to gather more information about the query from the user by asking her simple questions. The equally simple responses are used to modify the original query. Our experiments using the TREC Robust Track queries show that we can obtain a significant improvement in mean average precision, averaging around 5%, over pseudo-relevance feedback. This improvement is also spread across more queries compared to ordinary pseudo-relevance feedback, as suggested by geometric mean average precision.
DOI: https://doi.org/10.1145/1148170.1148305 (published 2006-08-06)
Citations: 14
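The entry above uses geometric mean average precision (GMAP) as evidence that gains are spread across queries. The sketch below shows why: two hypothetical runs with the same arithmetic mean of per-query average precision (MAP) can have different GMAP, because the geometric mean rewards gains on poorly performing queries more. The per-query scores are invented for illustration, and the small `eps` floor used to keep the logarithm defined is an assumption.

```python
import math


def mean_average_precision(ap_scores):
    """Arithmetic mean of per-query average precision."""
    return sum(ap_scores) / len(ap_scores)


def geometric_mean_average_precision(ap_scores, eps=1e-5):
    """Geometric mean of per-query average precision. Scores are floored
    at eps (an assumed value) so that log(0) never occurs."""
    logs = [math.log(max(ap, eps)) for ap in ap_scores]
    return math.exp(sum(logs) / len(logs))


# Hypothetical per-query average precision for four queries.
run_a = [0.60, 0.40, 0.02, 0.01]  # gain concentrated on one easy query
run_b = [0.50, 0.40, 0.07, 0.06]  # same total gain spread over two hard queries

for name, run in (("run_a", run_a), ("run_b", run_b)):
    print(
        name,
        f"MAP={mean_average_precision(run):.4f}",
        f"GMAP={geometric_mean_average_precision(run):.4f}",
    )
```

Both runs have the same MAP, but run_b has the higher GMAP because its improvement falls on the queries that scored worst.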
Latent semantic analysis for multiple-type interrelated data objects
Xuanhui Wang, Jian-Tao Sun, Zheng Chen, ChengXiang Zhai
Co-occurrence data is quite common in many real applications. Latent Semantic Analysis (LSA) has been successfully used to identify semantic relations in such data. However, LSA can only handle a single co-occurrence relationship between two types of objects. In practical applications, there are many cases where multiple types of objects exist and any pair of these objects could have a pairwise co-occurrence relation. All these co-occurrence relations can be exploited to alleviate data sparseness or to represent objects more meaningfully. In this paper, we propose a novel algorithm, M-LSA, which conducts latent semantic analysis by incorporating all pairwise co-occurrences among multiple types of objects. Based on the mutual reinforcement principle, M-LSA identifies the most salient concepts among the co-occurrence data and represents all the objects in a unified semantic space. M-LSA is general and we show that several variants of LSA are special cases of our algorithm. Experimental results show that M-LSA outperforms LSA on multiple applications, including collaborative filtering, text clustering, and text categorization.
DOI: https://doi.org/10.1145/1148170.1148214 (published 2006-08-06)
Citations: 74
Supporting semantic visual feature browsing in content-based video retrieval
Xiangming Mu
A new shot level video retrieval system that supports browsing by semantic visual features (e.g., car, mountain, and fire) is developed to facilitate content-based retrieval. The video's binary semantic feature vector is used to calculate a similarity score between two shot keyframes. The score is then used to browse the "similar" keyframes in terms of semantic visual features.
DOI: https://doi.org/10.1145/1148170.1148347 (published 2006-08-06)
Citations: 1
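The entry above scores keyframe similarity from binary semantic feature vectors but does not name the similarity measure. As a minimal sketch, the code below assumes Jaccard similarity (shared features divided by features present in either keyframe) and ranks the other keyframes against a query keyframe. The feature order, shot identifiers, and function names are hypothetical.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two equal-length binary feature vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def most_similar(query_id, features, top_k=3):
    """Rank all other keyframes by similarity to the query keyframe.
    Ties are broken by keyframe id so the ordering is deterministic."""
    query_vec = features[query_id]
    scored = [
        (jaccard_similarity(query_vec, vec), kid)
        for kid, vec in features.items()
        if kid != query_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(kid, score) for score, kid in scored[:top_k]]


# Hypothetical keyframes; feature order is (car, mountain, fire).
features = {
    "shot_01": (1, 1, 0),
    "shot_02": (1, 0, 0),
    "shot_03": (0, 0, 1),
    "shot_04": (1, 1, 1),
}

ranked = most_similar("shot_01", features)
for kid, score in ranked:
    print(f"{kid} {score:.3f}")
```

Other measures over binary vectors, such as Hamming distance or cosine similarity, would slot into the same ranking function. Jaccard is used here only because it ignores features absent from both keyframes.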
Information graphics: an untapped resource for digital libraries
S. Carberry, S. Schwartz, Seniz Demir
Information graphics are non-pictorial graphics such as bar charts and line graphs that depict attributes of entities and relations among entities. Most information graphics appearing in popular media have a communicative goal or intended message; consequently, information graphics constitute a form of language. This paper argues that information graphics are a valuable knowledge resource that should be retrievable from a digital library and that such graphics should be taken into account when summarizing a multimodal document for subsequent indexing and retrieval. But to accomplish this, the information graphic must be understood and its message recognized. The paper presents our Bayesian system for recognizing the primary message of one kind of information graphic (simple bar charts) and discusses the potential role of an information graphic's message in indexing graphics and summarizing multimodal documents.
DOI: https://doi.org/10.1145/1148170.1148270 (published 2006-08-06)
Citations: 90