A configurable software platform for creating, reviewing and adjudicating annotation of unstructured text.

International Journal of Population Data Science (IF 1.6, Q3, Health Care Sciences & Services). Pub Date: 2022-08-25. DOI: 10.23889/ijpds.v7i3.1953
R. Beare, Adam Morris, Tanya Ravipati, Elizabeth Le, T. Collyer, Helene Roberts, V. Srikanth, Nadine E. Andrew
Citations: 0

Abstract

Objectives

To develop a flexible platform for creating, reviewing and adjudicating annotation of unstructured text. Natural Language Processing models and statistical classifiers use the results for analysis of large databases of text, such as electronic health records, that are curated by the National Centre for Healthy Ageing (NCHA) Data Platform.

Approach

Automated approaches are essential for large-scale extraction of structured data from unstructured documents. We applied the CogStack suite to annotate clinical text from hospital inpatient records based on the Unified Medical Language System (UMLS) for classifying dementia status. We trained a logistic regression classifier to determine dementia/non-dementia status within two cohorts, one with confirmed dementia based on clinical assessment and the other with confirmed non-dementia based on telephone cognitive interview. The classifier used the frequency of occurrence of a set of terms provided by experts. We used our annotation platform to review the accuracy of concepts assigned by CogStack.

Results

There were 368 people with clinically confirmed dementia and 218 who screened negative for dementia. Of these, 259 with dementia and 195 without dementia had documents in the inpatient electronic health record system, comprising 84,045 and 16,950 inpatient documents for the dementia and non-dementia cohorts respectively. A set of key words pertaining to dementia was generated by a specialist neurologist and a health information manager, and matched to UMLS concepts. The NCHA data platform holds a copy of the inpatient text records (>13 million documents) that has been annotated using CogStack. Annotated documents corresponding to the study cohort were extracted. We tested the true positive rates of annotation for 50 concepts, judged by a neurologist and a health information manager to be relevant to dementia patients, by manual review of 100 documents.

Conclusion

Automated annotations must be validated. The platform we have developed allows efficient review and correction of annotations, so that models can be trained further or so that users can be confident that accuracy is sufficient for subsequent analysis. Implementation within our linked NCHA data platform will allow incorporation of text-based data at scale.
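
The validation step described in the abstract (manual review of automated annotations, summarised as a true positive rate per concept) can be illustrated with a minimal sketch. It assumes that review decisions are available as (document, concept, accepted) records. The record layout, the placeholder concept labels, and the function name `true_positive_rates` are hypothetical and are not taken from the platform or from CogStack. The rate is computed here as the fraction of reviewed automated annotations that the reviewer accepted.

```python
from collections import defaultdict

# Hypothetical review records: (document_id, concept_id, reviewer_accepted).
# The concept labels are placeholders, not concepts from the study.
reviews = [
    ("doc1", "CONCEPT_A", True),
    ("doc1", "CONCEPT_B", False),
    ("doc2", "CONCEPT_A", True),
    ("doc3", "CONCEPT_A", False),
    ("doc3", "CONCEPT_B", True),
]


def true_positive_rates(records):
    """Return, per concept, the fraction of reviewed annotations that were accepted."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for _doc_id, concept, ok in records:
        total[concept] += 1
        if ok:
            accepted[concept] += 1
    return {concept: accepted[concept] / total[concept] for concept in total}


rates = true_positive_rates(reviews)
for concept, rate in sorted(rates.items()):
    print(f"{concept}: {rate:.2f}")
```

Concepts with a low rate in such a summary would be natural candidates for correction in the review platform or for further model training, in line with the conclusion above.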