Application of machine reading comprehension techniques for named entity recognition in materials science

IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Journal of Cheminformatics | Pub Date: 2024-07-02 | DOI: 10.1186/s13321-024-00874-5
Zihui Huang, Liqiang He, Yuhang Yang, Andi Li, Zhiwen Zhang, Siwei Wu, Yang Wang, Yan He, Xujie Liu
Citations: 0

Abstract

Materials science is an interdisciplinary field that studies the properties, structures, and behaviors of different materials. The scientific literature contains a wealth of materials science knowledge, but manually analyzing papers to find material-related data is a daunting task. In information processing, named entity recognition (NER) plays a crucial role because it can automatically extract materials science entities, which are valuable for tasks such as building knowledge graphs. The sequence labeling methods typically used for named entity recognition in materials science (MatNER) often fail to fully utilize the semantic information in the dataset and cannot effectively extract nested entities. Herein, we propose converting the sequence labeling task into a machine reading comprehension (MRC) task. The MRC method can effectively solve the challenge of extracting multiple overlapping entities by recasting extraction as answering multiple independent questions. Moreover, by integrating prior knowledge from the queries, the MRC framework allows for a more comprehensive understanding of the contextual information and semantic relationships within materials science literature. With the MRC approach, state-of-the-art (SOTA) performance was achieved on the Matscholar, BC4CHEMD, NLMChem, SOFC, and SOFC-Slot datasets, with F1-scores of 89.64%, 94.30%, 85.89%, 85.95%, and 71.73%, respectively. By effectively utilizing semantic information and extracting nested entities, this approach holds great significance for knowledge extraction and data analysis in materials science, thereby accelerating the development of the field.
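
The reformulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the entity types, the query wording, the example sentence, and the helper name `to_mrc_examples` are all assumptions made for illustration, since the abstract does not give the actual queries. The sketch only shows why one question per entity type lets overlapping spans coexist, which a single tag per token cannot express.

```python
from dataclasses import dataclass

# Hypothetical entity types and query wording (assumptions for illustration;
# the queries used in the paper are not given in the abstract).
QUERIES = {
    "MATERIAL": "Which materials are mentioned in the text?",
    "COMPONENT": "Which device components are mentioned in the text?",
    "PROPERTY": "Which material properties are mentioned in the text?",
}


@dataclass(frozen=True)
class Span:
    start: int  # index of the first token, inclusive
    end: int    # index of the last token, inclusive
    label: str  # entity type


def to_mrc_examples(tokens, spans, queries=QUERIES):
    """Build one (query, context, answers) example per entity type.

    Each entity type is asked about independently, so spans of different
    types may overlap or nest without conflicting with each other.
    """
    examples = []
    for label, query in queries.items():
        answers = sorted((s.start, s.end) for s in spans if s.label == label)
        examples.append(
            {"label": label, "query": query, "context": tokens, "answers": answers}
        )
    return examples


# Made-up sentence: "LiFePO4" (MATERIAL) is nested inside
# "LiFePO4 cathode" (COMPONENT).
TOKENS = ["LiFePO4", "cathode", "shows", "high", "capacity"]
SPANS = [Span(0, 0, "MATERIAL"), Span(0, 1, "COMPONENT"), Span(4, 4, "PROPERTY")]

examples = to_mrc_examples(TOKENS, SPANS)
for ex in examples:
    texts = [" ".join(TOKENS[a:b + 1]) for a, b in ex["answers"]]
    print(ex["label"], "->", texts)
```

In a flat tagging scheme the token "LiFePO4" could carry only one label, so one of the two nested entities would be lost; here each question has its own answer set. In a full system, a reader model would predict the answer spans from the query and context instead of reading them from gold annotations.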

Scientific contribution

We have developed an innovative NER method that enhances the efficiency and accuracy of automatic entity extraction in materials science by transforming the sequence labeling task into an MRC task. This approach provides robust support for constructing knowledge graphs and other data analysis tasks.

Source journal
Journal of Cheminformatics
Categories: CHEMISTRY, MULTIDISCIPLINARY; COMPUTER SCIENCE, INFORMATION SYSTEMS
CiteScore: 14.10
Self-citation rate: 7.00%
Articles published: 82
Review time: 3 months
Journal description: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.
Latest articles from this journal
cidalsDB: an AI-empowered platform for anti-pathogen therapeutics research
Group graph: a molecular graph representation with enhanced performance, efficiency and interpretability
GT-NMR: a novel graph transformer-based approach for accurate prediction of NMR chemical shifts
Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature
Molecular identification via molecular fingerprint extraction from atomic force microscopy images