Concept Forest: A New Ontology-assisted Text Document Similarity Measurement Method

Nan Du, Bin Wu, Bai Wang
IEEE/WIC/ACM International Conference on Web Intelligence (WI'07), published 2007-11-02. DOI: 10.1109/WI.2007.36 (https://doi.org/10.1109/WI.2007.36)
Citations: 60

Abstract

Although using ontologies to assist information retrieval and text document processing has attracted growing attention, existing ontology-based approaches have not shown advantages over the traditional keyword-based Latent Semantic Indexing (LSI) method. This paper proposes an algorithm that extracts a concept forest (CF) from a document with the assistance of a natural language ontology, the WordNet lexical database. With concept forests representing the semantics of text documents, the semantic similarity between documents is measured as the commonality of their concept forests. Performance studies of text document clustering based on different document similarity measurement methods show that the CF-based similarity measurement is an effective alternative to the existing keyword-based methods. In particular, the CF-based approach has clear advantages over the existing keyword-based methods, including LSI, when processing short text documents, or in P2P or live news environments where it is impractical to collect the entire document corpus for analysis.
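
The idea of comparing documents through shared concept structure can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not specify: the tiny TOY_HYPERNYMS map is a hypothetical stand-in for WordNet, the forest is assumed to be the set of (child, parent) hypernym edges reachable from a document's words, and "commonality" is assumed to be the Jaccard overlap of two edge sets. The function names are likewise illustrative.

```python
# Hypothetical toy ontology standing in for WordNet (not data from the paper).
# Each entry maps a concept to its single parent (hypernym).
TOY_HYPERNYMS = {
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "car": "vehicle",
    "truck": "vehicle",
}


def concept_forest(words, hypernyms=TOY_HYPERNYMS):
    """Collect the (child, parent) edges reachable from the given words.

    Words missing from the ontology contribute nothing. The edge set
    forms a forest because every concept has at most one parent here.
    """
    edges = set()
    for word in words:
        node = word.lower()
        while node in hypernyms:
            parent = hypernyms[node]
            edges.add((node, parent))
            node = parent
    return edges


def forest_similarity(doc_a, doc_b, hypernyms=TOY_HYPERNYMS):
    """Jaccard overlap of two documents' edge sets (an assumed measure)."""
    forest_a = concept_forest(doc_a.split(), hypernyms)
    forest_b = concept_forest(doc_b.split(), hypernyms)
    if not forest_a and not forest_b:
        return 0.0
    return len(forest_a & forest_b) / len(forest_a | forest_b)


if __name__ == "__main__":
    print(forest_similarity("dog", "cat"))
    print(forest_similarity("dog", "car"))
```

In this sketch "dog" and "cat" share no keyword, yet they overlap on the mammal-to-animal edge, while "dog" and "car" share nothing. Note also that the score is computed from the two documents alone, with no corpus-wide statistics, which matches the setting the abstract highlights for short texts and for P2P or live news environments.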