Improving Access to Large-scale Digital Libraries Through Semantic-enhanced Search and Disambiguation

A. Hinze, Craig Taube-Schock, D. Bainbridge, Rangi Matamua, J. S. Downie
Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, published 2015-06-21
DOI: 10.1145/2756406.2756920 (https://doi.org/10.1145/2756406.2756920)
Citations: 25

Abstract

The HathiTrust Digital Library (HTDL) holds 13,000,000 volumes comprising 4.5 billion pages of text, which currently makes it very difficult for scholars to locate sets of documents relevant to their research using traditional lexically-based retrieval techniques. Existing document search tools and document clustering approaches use purely lexical analysis, which cannot address the inherent ambiguity of natural language. A semantic search approach offers the potential to overcome this shortcoming of lexical search, but even if an appropriate network of ontologies could be decided upon, it would require a full semantic markup of each document. In this paper, we present a conceptual design and report on the initial implementation of a new framework that affords the benefits of semantic search while minimizing the problems associated with applying existing semantic analysis at scale. Our approach avoids the need for complete semantic document markup using pre-existing ontologies; instead, it develops an automatically generated Concept-in-Context (CiC) network seeded by a priori analysis of Wikipedia texts and identification of semantic metadata. Our Capisco system analyzes documents by the semantics and context of their content. Search queries are disambiguated interactively, to fully utilize the domain knowledge of the scholar. Our method achieves a form of semantic-enhanced search that simultaneously exploits the proven scale benefits provided by lexical indexing.
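As a rough illustration of the idea in the abstract (not the Capisco implementation, whose details are not given here), the following minimal Python sketch layers a sense choice on top of a plain lexical inverted index. Everything in it is an assumption made for illustration: the toy `CONCEPTS` table stands in for a Concept-in-Context network, the `DOCS` corpus is invented, and the function names `build_lexical_index`, `candidate_senses`, and `semantic_search` are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for a Concept-in-Context network: each concept
# (word sense) has a surface label and a set of context terms.
CONCEPTS = {
    "jaguar_animal": {"label": "jaguar", "context": {"cat", "jungle", "predator"}},
    "jaguar_car": {"label": "jaguar", "context": {"car", "engine", "sedan"}},
}

# Invented toy corpus.
DOCS = {
    "d1": "the jaguar is a large cat and apex predator of the jungle",
    "d2": "the jaguar sedan has a powerful engine",
    "d3": "a jungle expedition diary",
}


def build_lexical_index(docs):
    """Plain inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def candidate_senses(term, concepts):
    """Senses to offer the scholar for interactive disambiguation."""
    return sorted(cid for cid, c in concepts.items() if c["label"] == term)


def semantic_search(term, chosen_sense, concepts, index, docs):
    """Look the term up lexically, then keep only documents whose text
    shares at least one context term with the sense the user chose."""
    context = concepts[chosen_sense]["context"]
    hits = []
    for doc_id in sorted(index.get(term, ())):
        overlap = len(set(docs[doc_id].lower().split()) & context)
        if overlap > 0:
            hits.append((doc_id, overlap))
    # Highest context overlap first; ties broken by document id.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


index = build_lexical_index(DOCS)
senses = candidate_senses("jaguar", CONCEPTS)
animal_hits = semantic_search("jaguar", "jaguar_animal", CONCEPTS, index, DOCS)
car_hits = semantic_search("jaguar", "jaguar_car", CONCEPTS, index, DOCS)
```

In this sketch the candidate lookup still goes through the lexical index, so it keeps that index's scaling behaviour, and the sense chosen by the user only filters and ranks the lexical matches. A real system would presumably annotate documents with concepts ahead of time rather than comparing context terms at query time.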