Bio-SODA UX: enabling natural language question answering over knowledge graphs with user disambiguation.

IF 1.5 | Zone 4 (Computer Science) | Q3 (COMPUTER SCIENCE, INFORMATION SYSTEMS) | Distributed and Parallel Databases | Pub Date: 2022-01-01 | Epub Date: 2022-07-16 | DOI: 10.1007/s10619-022-07414-w
Ana Claudia Sima, Tarcisio Mendes de Farias, Maria Anisimova, Christophe Dessimoz, Marc Robinson-Rechavi, Erich Zbinden, Kurt Stockinger
Distributed and Parallel Databases, vol. 40, no. 2-3, pp. 409-440. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9458692/pdf/
Citations: 2

Abstract


Bio-SODA UX: enabling natural language question answering over knowledge graphs with user disambiguation.

The problem of natural language processing over structured data has become a growing research field within both the relational database and the Semantic Web communities, with significant efforts devoted to question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available. In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, as well as the CORDIS dataset of European projects, show that Bio-SODA outperforms publicly available KGQA systems by at least 20% in F1-score and by an even higher factor on more complex bioinformatics datasets. Finally, we introduce Bio-SODA UX, a graphical user interface designed to assist users in the exploration of large knowledge graphs and in dynamically disambiguating natural language questions that target the data available in these graphs.
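
To make the idea of centrality-aware ranking more concrete, here is a minimal Python sketch. It is not Bio-SODA's actual algorithm, which the abstract does not specify in detail. The schema graph, the candidate queries, the `match_score` values, the `alpha` weight, and the use of plain degree centrality are all assumptions chosen purely for illustration.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return degree / (n - 1) for every node of an undirected graph."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    n = len(neighbours)
    # max(..., 1) avoids division by zero for a single-node graph.
    return {node: len(adj) / max(n - 1, 1) for node, adj in neighbours.items()}


def rank_candidates(candidates, centrality, alpha=0.5):
    """Sort candidates by a weighted mix of match score and mean centrality.

    `alpha` is a hypothetical weighting parameter, not taken from the paper.
    """
    def score(candidate):
        nodes = candidate["nodes"]
        mean_centrality = sum(centrality[n] for n in nodes) / len(nodes)
        return alpha * candidate["match_score"] + (1 - alpha) * mean_centrality

    return sorted(candidates, key=score, reverse=True)


# Hypothetical schema graph: edges between classes of a toy knowledge graph.
EDGES = [
    ("Gene", "Protein"),
    ("Gene", "Species"),
    ("Gene", "AnatomicalEntity"),
    ("Protein", "Disease"),
]

# Two hypothetical interpretations of an ambiguous keyword. Each one stands
# for a generated SPARQL candidate with an equal (made-up) text-match score.
CANDIDATES = [
    {"sparql": "SELECT ?x WHERE { ?x a :Disease }", "nodes": ["Disease"], "match_score": 0.8},
    {"sparql": "SELECT ?x WHERE { ?x a :Gene }", "nodes": ["Gene"], "match_score": 0.8},
]

CENTRALITY = degree_centrality(EDGES)
RANKED = rank_candidates(CANDIDATES, CENTRALITY)

for candidate in RANKED:
    print(candidate["sparql"])
```

With equal text-match scores, the candidate that touches the better-connected `Gene` class is listed first, which is the intuition behind using centrality as a tie-breaking relevance signal in this sketch.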

Source journal
Distributed and Parallel Databases (Engineering and Technology: Computer Science, Theory and Methods)
CiteScore: 3.50
Self-citation rate: 0.00%
Articles published: 17
Review time: >12 weeks
Journal description: Distributed and Parallel Databases publishes papers in all the traditional as well as most emerging areas of database research, including: Availability and reliability; Benchmarking, performance evaluation, and tuning; Big Data Storage and Processing; Cloud Computing and Database-as-a-Service; Crowdsourcing; Data curation, annotation and provenance; Data integration, metadata management, and interoperability; Data models, semantics, query languages; Data mining and knowledge discovery; Data privacy, security, trust; Data provenance, workflows, Scientific Data Management; Data visualization and interactive data exploration; Data warehousing, OLAP, Analytics; Graph data management, RDF, social networks; Information Extraction and Data Cleaning; Middleware and Workflow Management; Modern Hardware and In-Memory Database Systems; Query Processing and Optimization; Semantic Web and open data; Social Networks; Storage, indexing, and physical database design; Streams, sensor networks, and complex event processing; Strings, Texts, and Keyword Search; Spatial, temporal, and spatio-temporal databases; Transaction processing; Uncertain, probabilistic, and approximate databases.