Improving classification of low-resource COVID-19 literature by using Named Entity Recognition.

Genomics and Informatics (Q2, Agricultural and Biological Sciences) | Pub Date: 2021-09-01 | Epub Date: 2021-09-30 | DOI: 10.5808/gi.21018

Oscar Lithgow-Serrano, Joseph Cornelius, Vani Kanjirangat, Carlos-Francisco Méndez-Cruz, Fabio Rinaldi

Genomics and Informatics, volume 19, issue 3, e22. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8510872/pdf/

Citations: 1

Abstract

Automatic document classification for highly interrelated classes is a demanding task that becomes more challenging when there is little labeled data for training. Such is the case of the coronavirus disease 2019 (COVID-19) Clinical repository, a repository of classified and translated academic articles related to COVID-19 and relevant to clinical practice, where a 3-way classification scheme is being applied to COVID-19 literature. During the 7th Biomedical Linked Annotation Hackathon (BLAH7), we performed experiments to explore the use of named entity recognition (NER) to improve the classification. We processed the literature with OntoGene's Biomedical Entity Recogniser (OGER) and used the identified named entities (NEs) and their links to major biological databases as extra input features for the classifier. We compared the results with a baseline model without the OGER-extracted features. In these proof-of-concept experiments, we observed a clear gain in COVID-19 literature classification. In particular, an NE's origin was useful for classifying document types, and an NE's type for classifying clinical specialties. Due to the limitations of the small dataset, we can only conclude that our results suggest that NER would benefit this classification task. To estimate this benefit accurately, further experiments with a larger dataset would be needed.
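As a rough illustration of the idea described in the abstract, the sketch below appends entity-type and entity-origin counts to a plain bag-of-words baseline. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `Entity` tuple, the feature-name prefixes, the toy sentence, and the example type and origin labels are all hypothetical, and the code does not call OGER or reproduce its output format.

```python
from __future__ import annotations

from collections import Counter
from typing import NamedTuple


class Entity(NamedTuple):
    """Hypothetical stand-in for one NER annotation (not OGER's real output)."""
    text: str
    ne_type: str  # entity type, e.g. "chemical" or "disease" (illustrative)
    origin: str   # database the entity links to (illustrative label)


def baseline_features(text: str) -> Counter:
    """Text-only baseline: lowercase whitespace-token counts."""
    return Counter(text.lower().split())


def ner_augmented_features(text: str, entities: list[Entity]) -> Counter:
    """Baseline counts plus one count per entity type and per entity origin."""
    features = baseline_features(text)
    for ent in entities:
        features[f"NE_TYPE={ent.ne_type}"] += 1
        features[f"NE_ORIGIN={ent.origin}"] += 1
    return features


# Toy document and hand-written annotations, for illustration only.
doc = "Remdesivir trial in hospitalized COVID-19 patients"
ents = [
    Entity("Remdesivir", "chemical", "CHEBI"),
    Entity("COVID-19", "disease", "MeSH"),
]
features = ner_augmented_features(doc, ents)
```

Either feature dictionary could then be passed to any standard classifier, so that a baseline and an NER-augmented model are compared on the same documents. Prefixing the entity features keeps them from colliding with ordinary tokens.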


