SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.

IF 2 3区工程技术 Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY Journal of Biomedical Semantics Pub Date : 2022-05-08 DOI:10.1186/s13326-022-00269-1

Lucas Emanuel Silva E Oliveira, Ana Carolina Peters, Adalniza Moura Pucca da Silva, Caroline Pilatti Gebeluca, Yohan Bonescki Gumiel, Lilian Mie Mukai Cintho, Deborah Ribeiro Carvalho, Sadid Al Hasan, Claudia Maria Cabral Moro

{"title":"SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.","authors":"Lucas Emanuel Silva E Oliveira, Ana Carolina Peters, Adalniza Moura Pucca da Silva, Caroline Pilatti Gebeluca, Yohan Bonescki Gumiel, Lilian Mie Mukai Cintho, Deborah Ribeiro Carvalho, Sadid Al Hasan, Claudia Maria Cabral Moro","doi":"10.1186/s13326-022-00269-1","DOIUrl":null,"url":null,"abstract":"Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present, (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations.Results: This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying strict match) to 0.92 (considering a relaxed match) while accepting partial overlaps and hierarchically related semantic types. The extrinsic evaluation, when applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of annotations, with the systems achieving results that were consistent with the agreement scores.Conclusion: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"13 1","pages":"13"},"PeriodicalIF":2.0000,"publicationDate":"2022-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9080187/pdf/","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Biomedical Semantics","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1186/s13326-022-00269-1","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 14

Abstract

Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.

Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present, (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations.

Results: This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying strict match) to 0.92 (considering a relaxed match) while accepting partial overlaps and hierarchically related semantic types. The extrinsic evaluation, when applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of annotations, with the systems achieving results that were consistent with the agreement scores.

Conclusion: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

SemClinBr -一个多机构和多专业语义注释的语料库，用于葡萄牙临床NLP任务。

背景:从电子健康记录(EHRs)中提取患者信息的大量研究导致对标注语料库的需求增加，这是开发和评估自然语言处理(NLP)算法的宝贵资源。在英语语言范围之外缺乏多用途临床语料库，特别是在巴西葡萄牙语中，这是显而易见的，并严重影响了生物医学NLP领域的科学进展。方法:在本研究中，使用来自多个医学专业、文献类型和机构的临床文本开发了一个语义注释的语料库。此外，我们提出了(1)一项调查，列出了共同的方面、差异和从以往研究中吸取的教训;(2)一个可以复制的细粒度注释模式，以指导其他注释计划;(3)一个基于web的注释工具，专注于注释建议功能;(4)对注释进行内在和外在评估。结果:该研究产生了SemClinBr，这是一个包含1000个临床记录的语料库，标记了65117个实体和11263个关系。此外，还生成了否定线索和医学缩写词典。在接受部分重叠和层次相关的语义类型时，注释器协议的平均得分从0.71(应用严格匹配)到0.92(考虑宽松匹配)不等。当将语料库应用于两个下游NLP任务时，外部评估证明了注释的可靠性和有用性，系统获得的结果与协议分数一致。结论:本工作生成的SemClinBr语料库和其他资源可以支持临床NLP研究，为研究界提供一个共同的开发和评估资源，促进电子病历在临床实践和生物医学研究中的应用。据我们所知，SemClinBr是第一个可用的葡萄牙临床语料库。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Journal of Biomedical Semantics MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

4.20

自引率

5.30%

发文量

审稿时长

30 weeks

期刊介绍： Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas: Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability. Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources; and tools for investigation, reasoning, prediction, and discoveries in biomedicine.