The Value of Numbers in Clinical Text Classification

Machine Learning and Knowledge Extraction, pp. 746-762 · IF 6.0 · JCR Q2 (Computer Science, Artificial Intelligence) · Pub Date: 2023-07-07 · DOI: 10.3390/make5030040

Kristian Miok, P. Corcoran, Irena Spasic


Abstract

Clinical text often includes numbers of various types and formats. However, most current text classification approaches do not take advantage of these numbers. This study aims to demonstrate that using numbers as features can significantly improve the performance of text classification models. It also demonstrates the feasibility of extracting such features from clinical text. Unsupervised learning was used to identify patterns of number usage in clinical text. These patterns were analyzed manually and converted into pattern-matching rules. Information extraction was then used to incorporate numbers as features into a document representation model. We evaluated text classification models trained on this representation. Our experiments were performed with two document representation models (vector space model and word embedding model) and two classification models (support vector machines and neural networks). The results showed that even a handful of numerical features can significantly improve text classification performance. We conclude that commonly used document representations do not represent numbers in a way that lets machine learning algorithms use them effectively as features. Although we demonstrated that traditional information extraction can be effective in converting numbers into features, further community-wide research is required to systematically incorporate number representation into the word embedding process.
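The approach outlined in the abstract (pattern-matching rules that extract numbers, which are then added as features to a document representation) can be illustrated with a minimal Python sketch. The rule names, the regular expressions, the tiny vocabulary, and the encoding of missing values as a zero plus a presence flag are all assumptions made for illustration; they are not the rules or the representation used in the paper.

```python
import re
from collections import Counter

# Hypothetical pattern-matching rules (illustrative only, not the paper's rules).
RULES = {
    "age_years": re.compile(r"\b(\d{1,3})[\s-]*(?:years?[\s-]*old|y/?o)\b", re.IGNORECASE),
    "systolic_bp": re.compile(r"\bBP\s*:?\s*(\d{2,3})\s*/\s*\d{2,3}\b", re.IGNORECASE),
    "weight_kg": re.compile(r"\b(\d{2,3}(?:\.\d+)?)\s*kg\b", re.IGNORECASE),
}


def extract_numeric_features(text):
    """Return one value per rule: a float if the rule matches, else None."""
    features = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        features[name] = float(match.group(1)) if match else None
    return features


def bag_of_words(text, vocabulary):
    """Count vocabulary terms; digits are dropped, as in a number-blind model."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]


def represent(text, vocabulary):
    """Concatenate term counts with extracted numeric features."""
    numeric = extract_numeric_features(text)
    numeric_vector = []
    for name in RULES:
        value = numeric[name]
        # Assumed encoding: (value, presence flag); missing values become (0.0, 0.0).
        numeric_vector.extend([0.0 if value is None else value,
                               0.0 if value is None else 1.0])
    return bag_of_words(text, vocabulary) + numeric_vector


note = "Patient is a 54 year old male, BP 142/90, weight 101.5 kg, reports chest pain."
vocab = ["patient", "pain", "chest", "obese"]
vector = represent(note, vocab)
print(vector)
```

The resulting vector could be passed to any classifier that accepts numeric input. In practice the numeric columns would likely need scaling before being combined with term counts or embeddings, which this sketch omits.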
