Document clustering with evolved search queries

Laurence Hirsch, A. D. Nuovo
{"title":"Document clustering with evolved search queries","authors":"Laurence Hirsch, A. D. Nuovo","doi":"10.1109/CEC.2017.7969447","DOIUrl":null,"url":null,"abstract":"Search queries define a set of documents located in a collection and can be used to rank the documents by assigning each document a score according to their closeness to the query in the multidimensional space of weighted terms. In this paper, we describe a system whereby an island model genetic algorithm (GA) creates individuals which can generate a set of Apache Lucene search queries for the purpose of text document clustering. A cluster is specified by the documents returned by a single query in the set. Each document that is included in only one of the clusters adds to the fitness of the individual and each document that is included in more than one cluster will reduce the fitness. The method can be refined by using the ranking score of each document in the fitness test. The system has a number of advantages; in particular, the final search queries are easily understood and offer a simple explanation of the clusters, meaning that an extra cluster labelling stage is not required. We describe how the GA can be used to build queries and show results for clustering on various data sets and with different query sizes. Results are also compared with clusters built using the widely used k-means algorithm.","PeriodicalId":335123,"journal":{"name":"2017 IEEE Congress on Evolutionary Computation (CEC)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Congress on Evolutionary Computation (CEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CEC.2017.7969447","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Search queries define a set of documents located in a collection and can be used to rank the documents by assigning each document a score according to their closeness to the query in the multidimensional space of weighted terms. In this paper, we describe a system whereby an island model genetic algorithm (GA) creates individuals which can generate a set of Apache Lucene search queries for the purpose of text document clustering. A cluster is specified by the documents returned by a single query in the set. Each document that is included in only one of the clusters adds to the fitness of the individual and each document that is included in more than one cluster will reduce the fitness. The method can be refined by using the ranking score of each document in the fitness test. The system has a number of advantages; in particular, the final search queries are easily understood and offer a simple explanation of the clusters, meaning that an extra cluster labelling stage is not required. We describe how the GA can be used to build queries and show results for clustering on various data sets and with different query sizes. Results are also compared with clusters built using the widely used k-means algorithm.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
具有进化搜索查询的文档聚类
搜索查询定义了位于集合中的一组文档,可以根据每个文档在加权词的多维空间中与查询的接近程度为其分配分数,从而对文档进行排序。在本文中,我们描述了一个系统,其中岛模型遗传算法(GA)创建个体,这些个体可以生成一组Apache Lucene搜索查询,用于文本文档聚类。集群由集合中单个查询返回的文档指定。只包含在一个集群中的每个文档增加了个体的适应度,而包含在多个集群中的每个文档将降低适应度。该方法可以通过使用每个文档在适应度检验中的排名分数来改进。该系统有许多优点;特别是,最后的搜索查询很容易理解,并提供了对聚类的简单解释,这意味着不需要额外的聚类标记阶段。我们描述了如何使用遗传算法构建查询并显示在不同数据集和不同查询大小上聚类的结果。结果还与使用广泛使用的k-means算法构建的聚类进行了比较。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Knowledge-based particle swarm optimization for PID controller tuning Local Optima Networks of the Permutation Flowshop Scheduling Problem: Makespan vs. total flow time Information core optimization using Evolutionary Algorithm with Elite Population in recommender systems New heuristics for multi-objective worst-case optimization in evidence-based robust design Bus Routing for emergency evacuations: The case of the Great Fire of Valparaiso
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1