
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval: latest articles

Text clustering with extended user feedback
Yifen Huang, Tom Michael Mitchell
Text clustering is most commonly treated as a fully automated task without user feedback. However, a variety of researchers have explored mixed-initiative clustering methods which allow a user to interact with and advise the clustering algorithm. This mixed-initiative approach is especially attractive for text clustering tasks where the user is trying to organize a corpus of documents into clusters for some particular purpose (e.g., clustering their email into folders that reflect various activities in which they are involved). This paper introduces a new approach to mixed-initiative clustering that handles several natural types of user feedback. We first introduce a new probabilistic generative model for text clustering (the SpeClustering model) and show that it outperforms the commonly used mixture of multinomials clustering model, even when used in fully autonomous mode with no user input. We then describe how to incorporate four distinct types of user feedback into the clustering algorithm, and provide experimental evidence showing substantial improvements in text clustering when this user feedback is incorporated.
DOI: https://doi.org/10.1145/1148170.1148242. Published: 2006-08-06.
Citations: 93
Music structure based vector space retrieval
N. Maddage, Haizhou Li, M. Kankanhalli
This paper proposes a novel framework for music content indexing and retrieval. The music structure information, i.e., timing, harmony and music region content, is represented by the layers of the music structure pyramid. We begin by extracting this layered structure information. We analyze the rhythm of the music and then segment the signal proportional to the inter-beat intervals. Thus, the timing information is incorporated in the segmentation process, which we call Beat Space Segmentation. To describe Harmony Events, we propose a two-layer hierarchical approach to model the music chords. We also model the progression of instrumental and vocal content as Acoustic Events. After information extraction, we propose a vector space modeling approach which uses these events as the indexing terms. In query-by-example music retrieval, a query is represented by a vector of the statistics of the n-gram events. We then propose two effective retrieval models, a hard-indexing scheme and a soft-indexing scheme. Experiments show that the vector space modeling is effective in representing the layered music information, achieving 82.5% top-5 retrieval accuracy using 15-sec music clips as the queries. The soft-indexing outperforms hard-indexing in general.
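The vector space step described in this abstract can be illustrated with a minimal sketch. It assumes each clip has already been reduced to a sequence of event labels (for example chord symbols), uses raw n-gram counts, and ranks by cosine similarity; the function names and the choice of cosine similarity are illustrative assumptions, not the authors' hard- or soft-indexing schemes.

```python
from collections import Counter
from math import sqrt


def ngram_vector(events, n=2):
    """Count n-grams over a sequence of event labels (e.g. chord symbols)."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_events, collection, n=2):
    """Return track ids sorted by decreasing similarity to the query clip.

    `collection` maps a (hypothetical) track id to its event sequence.
    """
    q = ngram_vector(query_events, n)
    scores = {tid: cosine(q, ngram_vector(ev, n)) for tid, ev in collection.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank(["C", "G", "Am"], {"a": [...], "b": [...]})` returns the track ids ordered by how many event bigrams they share with the query, normalized by vector length.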
DOI: https://doi.org/10.1145/1148170.1148185. Published: 2006-08-06.
Citations: 38
PENG: integrated search of distributed news archives
Mark Baillie, F. Crestani, M. Landoni
The PENG system is intended to provide an integrated and personalized environment for news professionals, providing functionalities for filtering, distributed retrieval, and a flexible interface environment for the display and manipulation of news materials. In this paper we review the progress and results of the PENG system to date, and describe in detail the document filtering part of the system, which is designed to gather and filter documents to user profiles. The current architecture will be described, along with some of the main issues which have so far been found in its development.
DOI: https://doi.org/10.1145/1148170.1148278. Published: 2006-08-06.
Citations: 7
A study of real-time query expansion effectiveness
Ryen W. White, G. Marchionini
In this poster, we describe the study of an interface technique that provides a list of suggested additional query terms as a searcher types a search query, in effect offering interactive query expansion (IQE) options while the query is formulated. Analysis of the results shows that offering IQE during query formulation leads to better quality initial queries, and an increased uptake of query expansion. These findings have implications for how IQE should be offered in retrieval interfaces.
DOI: https://doi.org/10.1145/1148170.1148332. Published: 2006-08-06.
Citations: 15
A location annotation system for personal photos
Chufeng Chen, M. Oakes, J. Tait
DOI: https://doi.org/10.1145/1148170.1148339. Published: 2006-08-06.
Citations: 3
Community-based snippet-indexes for pseudo-anonymous personalization in web search
Oisín Boydell, Barry Smyth
We describe and evaluate an approach to personalizing Web search that involves post-processing the results returned by some underlying search engine so that they reflect the interests of a community of like-minded searchers. To do this we leverage the search experiences of the community by mining the title and snippet texts of results that have been selected by community members in response to their queries. Our approach seeks to build a community-based snippet index that reflects the evolving interests of a group of searchers. This index is then used to re-rank the results returned by the underlying search engine by boosting the ranking of key results that have been frequently selected for similar queries by community members in the past.
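The re-ranking idea in this abstract can be sketched roughly as follows. The class name, the whitespace tokenizer, and the additive term-overlap score are all assumptions made for illustration; the abstract does not specify how the index is built or how boosts are computed.

```python
from collections import defaultdict


class SnippetIndex:
    """Hypothetical community index: terms from titles/snippets of selected results."""

    def __init__(self):
        # url -> term -> number of times the term appeared in a selected result
        self.term_hits = defaultdict(lambda: defaultdict(int))

    def record_selection(self, url, title, snippet):
        """Index the title and snippet of a result a community member clicked."""
        for term in (title + " " + snippet).lower().split():
            self.term_hits[url][term] += 1

    def score(self, url, query):
        """Sum of indexed counts for the query terms; 0 for unseen results."""
        terms = self.term_hits.get(url, {})
        return sum(terms.get(t, 0) for t in query.lower().split())


def rerank(results, query, index):
    """Boost community-selected results; ties keep the engine's original order."""
    return sorted(results, key=lambda url: -index.score(url, query))
```

Because Python's `sorted` is stable, results the community has never selected keep the order the underlying engine returned them in.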
DOI: https://doi.org/10.1145/1148170.1148283. Published: 2006-08-06.
Citations: 0
Adaptive query-based sampling for distributed IR
L. Azzopardi, Mark Baillie, F. Crestani
In Distributed Information Retrieval (DIR) systems, the widely accepted solution for resource description acquisition is Query-Based Sampling (QBS) [1]. In the standard approach to QBS, sampling is curtailed once 300-500 unique documents have been retrieved. This threshold was obtained by empirically measuring the estimated resource description against the actual resource, and then considering the corresponding retrieval selection accuracy [1]. However, a fixed threshold may not generalise to collections and environments beyond those on which it was estimated (i.e., a set of resources of uniform size [1]). Cases where the blanket application of such a heuristic would be inappropriate include (1) when resource sizes are highly skewed and (2) when the resources are very heterogeneous. In the former, if a resource is very large then undersampling will occur because not enough documents were obtained. Conversely, if a collection is very small, then oversampling will occur, increasing costs beyond necessity. In the latter case, if the resource is varied and highly heterogeneous, then obtaining a sufficiently accurate description would require more documents to be sampled than when resources are homogeneous. Either way, adopting a flat cutoff will not necessarily provide sufficiently good resource descriptions for all resources.
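To make the contrast with a flat cutoff concrete, here is a minimal sketch of one possible adaptive stopping rule. The abstract does not state which criterion the authors use; the vocabulary-growth rule, the `fetch_batch` callback, and every parameter below are hypothetical choices for illustration only.

```python
def sample_resource(fetch_batch, max_docs=2000, window=3, min_new_terms=5):
    """Sample until `window` consecutive batches each add fewer than
    `min_new_terms` unseen terms (assumed criterion, not the paper's).

    `fetch_batch` is a hypothetical callable returning a list of document
    strings retrieved by the next probe query, or an empty list when done.
    """
    vocabulary = set()
    docs_seen = 0
    quiet_batches = 0
    while docs_seen < max_docs and quiet_batches < window:
        batch = fetch_batch()
        if not batch:
            break  # the resource has nothing more to return
        before = len(vocabulary)
        for doc in batch:
            vocabulary.update(doc.lower().split())
        docs_seen += len(batch)
        new_terms = len(vocabulary) - before
        quiet_batches = quiet_batches + 1 if new_terms < min_new_terms else 0
    return docs_seen, vocabulary
```

Under this rule a homogeneous resource stops early, because its vocabulary saturates after a few batches, while a heterogeneous one keeps being sampled, which is the behaviour a fixed 300-500 document threshold cannot provide.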
DOI: https://doi.org/10.1145/1148170.1148277. Published: 2006-08-06.
Citations: 12
Building bridges for web query classification
Dou Shen, Jian-Tao Sun, Qiang Yang, Zheng Chen
Web query classification (QC) aims to classify Web users' queries, which are often short and ambiguous, into a set of target categories. QC has many applications including page ranking in Web search, targeted advertisement in response to queries, and personalization. In this paper, we present a novel approach for QC that outperforms the winning solution of the ACM KDDCUP 2005 competition, whose objective is to classify 800,000 real user queries. In our approach, we first build a bridging classifier on an intermediate taxonomy in an offline mode. This classifier is then used in an online mode to map user queries to the target categories via the above intermediate taxonomy. A major innovation is that by leveraging the similarity distribution over the intermediate taxonomy, we do not need to retrain a new classifier for each new set of target categories, and therefore the bridging classifier needs to be trained only once. In addition, we introduce category selection as a new method for narrowing down the scope of the intermediate taxonomy based on which we classify the queries. Category selection can improve both efficiency and effectiveness of the online classification. Combining our algorithm with the winning solution of KDDCUP 2005 improves precision by 9.7% and F1 by 3.8% over the best KDDCUP 2005 results.
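The bridging step can be read as marginalizing over the intermediate taxonomy. The sketch below assumes two probability tables are already available, one scoring intermediate categories for a query and one relating intermediate categories to target categories; the function name, the table layout, and the simple sum-of-products rule are assumptions, not a reproduction of the paper's classifier.

```python
def bridge(query_to_intermediate, intermediate_to_target):
    """Score target categories for one query via an intermediate taxonomy.

    query_to_intermediate: {intermediate category: p(intermediate | query)}
    intermediate_to_target: {intermediate category: {target: p(target | intermediate)}}
    Both tables are assumed inputs; only the second depends on the target set,
    so changing target categories does not require retraining the first.
    """
    scores = {}
    for inter, p_inter in query_to_intermediate.items():
        for target, p_target in intermediate_to_target.get(inter, {}).items():
            scores[target] = scores.get(target, 0.0) + p_target * p_inter
    return scores
```

For a query scored 0.7 on a hypothetical intermediate category "Autos" and 0.3 on "Animals", the target category mapped from "Autos" receives most of the probability mass.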
DOI: https://doi.org/10.1145/1148170.1148196. Published: 2006-08-06.
Citations: 316
Document clustering with prior knowledge
Xiang-Hua Ji, W. Xu
Document clustering is an important tool for text analysis and is used in many different applications. We propose to incorporate prior knowledge of cluster membership for document cluster analysis and develop a novel semi-supervised document clustering model. The method models a set of documents with a weighted graph in which each document is represented as a vertex, and each edge connecting a pair of vertices is weighted with the similarity value of the two corresponding documents. The prior knowledge indicates pairs of documents that are known to belong to the same cluster. Then, the prior knowledge is transformed into a set of constraints. The document clustering task is accomplished by finding the best cuts of the graph under the constraints. We apply the model to the Normalized Cut method to demonstrate the idea and concept. Our experimental evaluations show that the proposed document clustering model reveals remarkable performance improvements with very limited training samples, and hence is a very effective semi-supervised classification tool.
DOI: https://doi.org/10.1145/1148170.1148241. Published: 2006-08-06.
Citations: 155
What makes a query difficult?
David Carmel, E. Yom-Tov, Adam Darlow, D. Pelleg
This work tries to answer the question of what makes a query difficult. It presents a novel model that captures the main components of a topic and the relationship between those components and topic difficulty. The three components of a topic are the textual expression describing the information need (the query or queries), the set of documents relevant to the topic (the Qrels), and the entire collection of documents. We show experimentally that topic difficulty strongly depends on the distances between these components. In the absence of knowledge about one of the model components, the model is still useful by approximating the missing component based on the other components. We demonstrate the applicability of the difficulty model for several uses such as predicting query difficulty, predicting the number of topic aspects expected to be covered by the search results, and analyzing the findability of a specific domain.
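One way to picture "distances between these components" is to treat each component as a unigram distribution and compare them pairwise. The sketch below uses Jensen-Shannon divergence with base-2 logarithms as the distance; that choice, the whitespace tokenizer, and the function names are assumptions, since the abstract does not name a distance measure.

```python
from collections import Counter
from math import log2


def unigram(texts):
    """Maximum-likelihood unigram distribution over a non-empty list of texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jsd(p, q):
    """Jensen-Shannon divergence (base 2), bounded in [0, 1]."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        return sum(a[w] * log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def topic_distances(queries, relevant_docs, collection_docs):
    """Pairwise distances between the three topic components."""
    q, r, c = unigram(queries), unigram(relevant_docs), unigram(collection_docs)
    return {
        "query-qrels": jsd(q, r),
        "query-collection": jsd(q, c),
        "qrels-collection": jsd(r, c),
    }
```

The three returned values could then serve as features for a difficulty predictor; how they are combined is left open here.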
DOI: https://doi.org/10.1145/1148170.1148238. Published: 2006-08-06.
Citations: 226