The Value of Numbers in Clinical Text Classification

Machine Learning and Knowledge Extraction (IF 4.0; JCR Q2, Computer Science, Artificial Intelligence), pp. 746-762. Pub Date: 2023-07-07. DOI: 10.3390/make5030040
Kristian Miok, P. Corcoran, Irena Spasic

Abstract

Clinical text often includes numbers of various types and formats. However, most current text classification approaches do not take advantage of these numbers. This study aims to demonstrate that using numbers as features can significantly improve the performance of text classification models. This study also demonstrates the feasibility of extracting such features from clinical text. Unsupervised learning was used to identify patterns of number usage in clinical text. These patterns were analyzed manually and converted into pattern-matching rules. Information extraction was used to incorporate numbers as features into a document representation model. We evaluated text classification models trained on such representations. Our experiments were performed with two document representation models (vector space model and word embedding model) and two classification models (support vector machines and neural networks). The results showed that even a handful of numerical features can significantly improve text classification performance. We conclude that commonly used document representations do not represent numbers in a way that lets machine learning algorithms use them effectively as features. Although we demonstrated that traditional information extraction can be effective in converting numbers into features, further community-wide research is required to systematically incorporate number representation into the word embedding process.
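The general idea of turning numbers into features can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the actual rules, so the three regular expressions below, the feature names (`weight_kg`, `bmi`, `systolic_bp`), the tiny vocabulary, and the choice to encode a missing value as 0.0 are all assumptions made for illustration. A plain bag-of-words count stands in for the vector space model, and the extracted numbers are appended to it as extra dimensions.

```python
import re
from collections import Counter

# Hypothetical pattern-matching rules (assumed for illustration; the paper's
# real rules were derived from patterns found in clinical text and are not
# given in the abstract). Each rule captures one number in group 1.
RULES = {
    "weight_kg": re.compile(
        r"\bweigh(?:t|s)?\D{0,15}?(\d+(?:\.\d+)?)\s*kg\b", re.IGNORECASE
    ),
    "bmi": re.compile(r"\bBMI\D{0,15}?(\d+(?:\.\d+)?)", re.IGNORECASE),
    "systolic_bp": re.compile(
        r"\b(?:BP|blood pressure)\D{0,15}?(\d{2,3})\s*/\s*\d{2,3}", re.IGNORECASE
    ),
}


def extract_numeric_features(text):
    """Apply each rule once and return the captured numbers in rule order.

    A rule that does not match contributes 0.0 (an assumed placeholder for
    a missing value).
    """
    features = []
    for pattern in RULES.values():
        match = pattern.search(text)
        features.append(float(match.group(1)) if match else 0.0)
    return features


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the lowercased text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in vocabulary]


def represent(text, vocabulary):
    """Concatenate word counts with the extracted numeric features."""
    return bag_of_words(text, vocabulary) + extract_numeric_features(text)


note = "Patient weighs 112 kg, BMI of 38.5. BP 150/95 on admission."
vocabulary = ["patient", "admission", "diabetes"]
vector = represent(note, vocabulary)
print(vector)  # [1.0, 1.0, 0.0, 112.0, 38.5, 150.0]
```

The resulting vector could then be passed to any classifier. In practice the numeric columns would likely need scaling, and a separate indicator for missing values may be preferable to the 0.0 placeholder used here.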