Improving classification of low-resource COVID-19 literature by using Named Entity Recognition.

Genomics and Informatics (Q2, Agricultural and Biological Sciences). Pub Date: 2021-09-01. Epub Date: 2021-09-30. DOI: 10.5808/gi.21018
Oscar Lithgow-Serrano, Joseph Cornelius, Vani Kanjirangat, Carlos-Francisco Méndez-Cruz, Fabio Rinaldi
Genomics and Informatics, volume 19, issue 3, e22. Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8510872/pdf/
Citations: 1

Abstract

Automatic document classification for highly interrelated classes is a demanding task that becomes more challenging when little labeled data is available for training. Such is the case of the coronavirus disease 2019 (COVID-19) Clinical repository, a repository of classified and translated academic articles related to COVID-19 and relevant to clinical practice, where a 3-way classification scheme is being applied to COVID-19 literature. During the 7th Biomedical Linked Annotation Hackathon (BLAH7), we performed experiments to explore the use of named entity recognition (NER) to improve the classification. We processed the literature with OntoGene's Biomedical Entity Recogniser (OGER) and used the identified named entities (NEs) and their links to major biological databases as extra input features for the classifier. We compared the results with a baseline model that did not use the OGER-extracted features. In these proof-of-concept experiments, we observed a clear gain in COVID-19 literature classification. In particular, the origin of the NEs was useful for classifying document types, and the type of the NEs for classifying clinical specialties. Because of the limitations of the small dataset, we can only conclude that our results suggest that NER would benefit this classification task. Further experiments with a larger dataset would be needed to estimate this benefit accurately.
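
The idea of adding NER output as extra classifier input can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_features`, the tuple layout of the annotations, and the example document are all assumptions made here, and the annotation format is not OGER's actual output format. The sketch only shows how counts of entity types and of entity origins (source databases) could be appended to a plain bag-of-words baseline.

```python
from collections import Counter


def build_features(text, entities=None):
    """Combine bag-of-words counts with optional NER-derived counts.

    `entities` is a list of (surface_form, entity_type, origin_database)
    tuples. This layout is a hypothetical one chosen for illustration.
    """
    features = Counter()
    # Baseline features: lowercase token counts from the document text.
    for token in text.lower().split():
        features["tok=" + token] += 1
    # Extra features: how often each entity type and each origin occurs.
    for _surface, entity_type, origin in entities or []:
        features["ne_type=" + entity_type] += 1
        features["ne_origin=" + origin] += 1
    return dict(features)


# Hypothetical document and annotations, invented for this example.
doc = "Remdesivir reduced recovery time in patients with pneumonia"
annotations = [
    ("Remdesivir", "chemical", "CHEBI"),
    ("pneumonia", "disease", "MESH"),
]

baseline = build_features(doc)               # text-only features
enriched = build_features(doc, annotations)  # text plus NER features
```

Both feature dictionaries could then be passed to the same classifier, so that any difference in accuracy is attributable to the added `ne_type=` and `ne_origin=` features, which mirrors the baseline comparison described in the abstract.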
