A configurable software platform for creating, reviewing and adjudicating annotation of unstructured text.

International Journal of Population Data Science (IF 2.2; Q3, Health Care Sciences & Services). Pub Date: 2022-08-25. DOI: 10.23889/ijpds.v7i3.1953

R. Beare, Adam Morris, Tanya Ravipati, Elizabeth Le, T. Collyer, Helene Roberts, V. Srikanth, Nadine E. Andrew


Citations: 0

Abstract




Objectives: To develop a flexible platform for creating, reviewing and adjudicating annotations of unstructured text. The resulting annotations are used by natural language processing models and statistical classifiers to analyse large text databases, such as electronic health records, curated by the National Centre for Healthy Ageing (NCHA) Data Platform.

Approach: Automated approaches are essential for large-scale extraction of structured data from unstructured documents. We applied the CogStack suite to annotate clinical text from hospital inpatient records with Unified Medical Language System (UMLS) concepts, with the aim of classifying dementia status. We trained a logistic regression classifier to determine dementia or non-dementia status from the frequency of occurrence of a set of expert-provided terms in two cohorts: one with dementia confirmed by clinical assessment, the other with non-dementia confirmed by telephone cognitive interview. We used our annotation platform to review the accuracy of the concepts assigned by CogStack.

Results: There were 368 people with clinically confirmed dementia and 218 who screened negative for dementia. Of these, 259 with dementia and 195 without dementia had documents in the inpatient electronic health record system, amounting to 84,045 and 16,950 inpatient documents for the dementia and non-dementia cohorts, respectively. A set of keywords pertaining to dementia was generated by a specialist neurologist and a health information manager and matched to UMLS concepts. The NCHA Data Platform holds a copy of the inpatient text records (>13 million documents) that has been annotated using CogStack; the annotated documents corresponding to the study cohort were extracted. We tested the true positive rate of annotation for 50 concepts judged by the neurologist and health information manager to be relevant to dementia patients, by manual review of 100 documents.

Conclusion: Automated annotations must be validated. The platform we have developed allows efficient review and correction of annotations, so that models can be trained further or so that users can be confident that accuracy is sufficient for subsequent analysis. Implementation within our linked NCHA Data Platform will allow incorporation of text-based data at scale.
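
The validation step described in the abstract (manual review of machine-assigned concepts) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ReviewedAnnotation` record, its field names, the placeholder concept identifiers and the toy data are all assumptions made for illustration. "True positive rate" is interpreted here as the share of reviewed machine annotations for a concept that the reviewer confirmed as correct.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedAnnotation:
    """One machine-assigned annotation after manual review (hypothetical schema)."""
    document_id: str
    concept_id: str    # placeholder for a UMLS concept identifier
    is_correct: bool   # reviewer's verdict on the machine annotation


def true_positive_rate_by_concept(reviews):
    """Return {concept_id: confirmed / reviewed} over the reviewed annotations."""
    confirmed = defaultdict(int)
    reviewed = defaultdict(int)
    for review in reviews:
        reviewed[review.concept_id] += 1
        if review.is_correct:
            confirmed[review.concept_id] += 1
    # Every concept in `reviewed` has at least one annotation, so no division by zero.
    return {cid: confirmed[cid] / reviewed[cid] for cid in reviewed}


# Toy data invented for illustration; not taken from the study.
reviews = [
    ReviewedAnnotation("doc1", "CONCEPT_A", True),
    ReviewedAnnotation("doc2", "CONCEPT_A", True),
    ReviewedAnnotation("doc3", "CONCEPT_A", False),
    ReviewedAnnotation("doc1", "CONCEPT_B", True),
]
rates = true_positive_rate_by_concept(reviews)

for concept_id, rate in sorted(rates.items()):
    print(f"{concept_id}: {rate:.2f}")
```

Concepts with a low confirmed share would be candidates for correction or further model training, while a high share would support using the annotations in downstream analysis, which is the decision the conclusion describes.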
