Knowledge-Based Biomedical Word Sense Disambiguation with Neural Concept Embeddings

Akm Sabbir, Antonio Jimeno-Yepes, Ramakanth Kavuluru
Published in: Proceedings. IEEE International Symposium on Bioinformatics and Bioengineering, 2017-10-01 (Epub 2018-01-11)
DOI: https://doi.org/10.1109/BIBE.2017.00-61
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5792196/pdf/nihms919324.pdf

Abstract

Biomedical word sense disambiguation (WSD) is an important intermediate task in many natural language processing applications such as named entity recognition, syntactic parsing, and relation extraction. In this paper, we employ knowledge-based approaches that also exploit recent advances in neural word/concept embeddings to improve over the state of the art in biomedical WSD, using the public MSH WSD dataset [1] as the test set. Our methods involve weak supervision: we do not use any hand-labeled examples for WSD to build our prediction models; however, we employ an existing concept mapping program, MetaMap, to obtain our concept vectors. Over the MSH WSD dataset, our linear time method (linear in the number of senses and the number of words in the test instance) achieves an accuracy of 92.24%, which is a 3% improvement over the best known results [2] obtained via unsupervised means. A more expensive approach that we developed relies on a nearest neighbor framework and achieves an accuracy of 94.34%, essentially cutting the error rate in half. Employing dense vector representations learned from unlabeled free text has recently been shown to benefit many language processing tasks, and our efforts show that biomedical WSD is no exception to this trend. For a complex and rapidly evolving domain such as biomedicine, building labeled datasets for larger sets of ambiguous terms may be impractical. Here, we show that weak supervision that leverages recent advances in representation learning can rival supervised approaches in biomedical WSD. However, external knowledge bases (here, sense inventories) play a key role in the improvements achieved.
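To make the idea of a method that is linear in the number of senses and context words more concrete, here is a minimal sketch of one way such a scorer could look. It is not the authors' implementation: the abstract does not state the scoring function, so averaging the context word vectors and ranking candidate senses by cosine similarity is an assumption, as is the premise that word and concept vectors share one embedding space. The function names, the sense identifiers, and the toy 2-dimensional vectors are all hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(context_words, candidate_senses, word_vecs, concept_vecs):
    """Pick the candidate sense whose concept vector is closest to the context.

    One pass over the context words (to average them) plus one cosine
    per candidate sense, so the cost grows linearly in both counts.
    Assumption: word and concept vectors live in the same space.
    """
    known = [word_vecs[w] for w in context_words if w in word_vecs]
    if not known:
        return None  # no usable context, so no prediction
    context = mean_vector(known)
    return max(candidate_senses, key=lambda s: cosine(context, concept_vecs[s]))


# Toy embeddings invented for illustration only (hypothetical values).
word_vecs = {
    "patient": [0.9, 0.1],
    "fever": [0.8, 0.3],
    "weather": [0.1, 0.9],
    "winter": [0.2, 0.8],
}
# Hypothetical sense identifiers for the ambiguous term "cold".
concept_vecs = {
    "SENSE_COMMON_COLD": [1.0, 0.0],
    "SENSE_COLD_TEMPERATURE": [0.0, 1.0],
}

senses = list(concept_vecs)
print(disambiguate(["patient", "fever"], senses, word_vecs, concept_vecs))
print(disambiguate(["weather", "winter"], senses, word_vecs, concept_vecs))
```

In this sketch the sense inventory supplies the candidate list, which mirrors the abstract's point that external knowledge bases play a key role. A nearest neighbor variant would instead compare the test context against many stored example contexts, which is why it would be more expensive.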
