Document Sublanguage Clustering to Detect Medical Specialty in Cross-institutional Clinical Texts.

Kristina Doing-Harris, Olga Patterson, Sean Igo, John Hurdle
Published in: Proceedings of the ACM ... International Workshop on Data and Text Mining in Biomedical Informatics, vol. 2013, pp. 9-12
Publication date: 2013-10-01
DOI: https://doi.org/10.1145/2512089.2512101
Citation count: 25

Abstract

This paper reports on a set of studies designed to identify sublanguages in documents so that domain-specific processing can be applied across institutions. Psychological evidence indicates that humans use context-specific linguistic information when they read. Natural Language Processing (NLP) pipelines are successful within specific domains (i.e., contexts). To limit the number of domain-specific NLP systems, a natural focus is sublanguages, which are identified by shared lexical and semantic features [1]. Patterson and Hurdle [2] developed a sublanguage identification system that functioned well for 12 clinical specialties at the University of Utah. The current work compares sublanguages across institutions. Using a clinical NLP pipeline augmented by a new document corpus from the University of Pittsburgh (UPitt), new documents were assigned to clusters based on the minimum cosine distance to a Utah cluster centroid. The UPitt documents were divided into a nine-group specialty corpus. Across institutions, five of the nine specialty groups fell within the expected clusters. We find that clustering encounters difficulty for three reasons: documents with mixed sublanguages, naming convention differences across institutions, and document types used across specialties. The findings indicate that clinical specialty sublanguages can be identified across institutions.
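
The assignment step described in the abstract (placing a new document in the cluster whose centroid has the minimum cosine distance) can be illustrated with a minimal sketch. This is not the authors' pipeline: the feature vectors, the specialty labels, the function names, and the handling of zero vectors are all hypothetical choices made for illustration, and the abstract does not say how documents are turned into vectors.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a vector with no features is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def assign_to_nearest_centroid(doc_vector, centroids):
    """Return (label, distance) for the centroid closest to doc_vector."""
    scored = ((label, cosine_distance(doc_vector, c)) for label, c in centroids.items())
    return min(scored, key=lambda pair: pair[1])


# Hypothetical centroids over three made-up features; real centroids would
# come from clusters built on the original institution's documents.
centroids = {
    "cardiology": [0.9, 0.1, 0.0],
    "radiology": [0.1, 0.8, 0.3],
}

# Hypothetical feature vector for a document from a second institution.
new_doc = [0.7, 0.2, 0.1]
label, dist = assign_to_nearest_centroid(new_doc, centroids)
print(label, round(dist, 3))
```

Cosine distance depends only on the direction of the vectors, so two documents of very different lengths with similar feature proportions land near the same centroid. A document with mixed sublanguages, one of the difficulties the abstract names, would sit at a similar distance from several centroids, which makes the minimum less decisive.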
