Document Sublanguage Clustering to Detect Medical Specialty in Cross-institutional Clinical Texts.

Proceedings of the ACM ... International Workshop on Data and Text Mining in Biomedical Informatics. Pub Date: 2013-10-01. DOI: 10.1145/2512089.2512101

Kristina Doing-Harris, Olga Patterson, Sean Igo, John Hurdle

Volume 2013, pages 9-12.

Citations: 25

Abstract

This paper reports on a set of studies designed to identify sublanguages in documents for domain-specific processing across institutions. Psychological evidence indicates that humans use context-specific linguistic information when they read. Natural Language Processing (NLP) pipelines are successful within specific domains (i.e., contexts). To limit the number of domain-specific NLP systems, a natural focus would be on sublanguages. Sublanguages are identified by shared lexical and semantic features.[1] Patterson and Hurdle[2] developed a sublanguage identification system that functioned well for 12 clinical specialties at the University of Utah. The current work compares sublanguages across institutions. Using a clinical NLP pipeline augmented by a new document corpus from the University of Pittsburgh (UPitt), new documents were assigned to clusters based on the minimum cosine distance to a Utah cluster centroid. The UPitt documents were divided into a nine-group specialty corpus. Across institutions, five of the specialty groups fell within the expected clusters. We find that clustering encounters difficulty due to documents with mixed sublanguages, naming convention differences across institutions, and document types used across specialties. The findings indicate that clinical specialty sublanguages can be identified across institutions.
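The assignment step described in the abstract (placing a new document in the cluster whose centroid is at minimum cosine distance) can be illustrated with a minimal sketch. This is not the authors' pipeline: the feature vectors, the specialty labels, and the function names `cosine_distance` and `assign_to_cluster` are hypothetical, chosen only to show the nearest-centroid rule under the assumption that documents and centroids are already represented as numeric feature vectors of equal length.

```python
import math
from typing import Dict, List, Tuple


def cosine_distance(u: List[float], v: List[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def assign_to_cluster(
    doc: List[float], centroids: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Return the (label, distance) of the centroid closest to doc."""
    return min(
        ((label, cosine_distance(doc, c)) for label, c in centroids.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical 3-feature centroids standing in for clusters learned at one site.
centroids = {
    "cardiology": [0.9, 0.1, 0.0],
    "radiology": [0.1, 0.8, 0.3],
}

# Hypothetical feature vector for a new document from another institution.
new_doc = [0.7, 0.2, 0.1]
label, distance = assign_to_cluster(new_doc, centroids)
print(label, round(distance, 4))
```

Cosine distance depends only on the direction of the vectors, not their length, so a long and a short document with similar feature proportions land near the same centroid. A document with mixed sublanguages would sit between centroids, which is consistent with the difficulty the abstract reports for such documents.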


