Using machine learning techniques for exploration and classification of laboratory data

IF 1.1, Zone 4 (Medicine), JCR Q4 (MEDICAL LABORATORY TECHNOLOGY), Journal of Laboratory Medicine, Pub Date: 2024-08-12, DOI: 10.1515/labmed-2024-0100

Inga Trulson, Stefan Holdenrieder, Georg Hoffmann

Abstract

Objectives: The study aims to acquaint readers with six widely used machine learning (ML) techniques that might be useful for the analysis of laboratory data: Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), k-means, hierarchical clustering, and the decision tree models rpart and random forest.

Methods: Using a recently validated data set from lung cancer diagnostics, we investigate how ML can support the search for a suitable tumor marker panel for differentiating small cell lung cancer (SCLC) from non-small cell lung cancer (NSCLC).

Results: The ML techniques used here helped to gain a quick overview of the data structures and provided initial answers to the clinical questions. Dimensionality reduction techniques such as PCA and UMAP offered an insightful visualization of the data structure, suggesting the existence of two tumor groups with a large overlap of largely inconspicuous values. This impression was confirmed by a cluster analysis with the k-means algorithm, an unsupervised learning method. For supervised learning, decision tree models such as rpart and random forest demonstrated their utility in the differential diagnosis of the two tumor types. The rpart model, which constructs binary decision trees by recursive partitioning, suggested a tree involving four serum tumor markers (STMs), which were confirmed by the random forest approach. Both highlighted pro-gastrin-releasing peptide (ProGRP), neuron-specific enolase (NSE), cytokeratin-19 fragment (CYFRA 21-1) and cancer antigen (CA) 72-4 as key tumor markers, in line with the outcomes of the initial statistical analysis. Cross-validation of the two proposals showed a higher area under the receiver operating characteristic curve (AUROC) for the random forest model, 0.95 (95 % confidence interval (CI): 0.92–0.97), than for the rpart model, 0.88 (95 % CI: 0.83–0.93).

Conclusions: ML can provide a useful overview of inherent medical data structures and distinguish significant from less pertinent features. While by no means replacing human medical and statistical expertise, ML can significantly accelerate the evaluation of medical data, supporting a more informed diagnostic dialogue between physicians and statisticians.
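
The AUROC figures with 95 % confidence intervals quoted in the abstract can be made concrete with a small sketch. The abstract does not state how the intervals were obtained, so the percentile bootstrap used here is an assumption, as are the function names `auroc` and `bootstrap_ci`. The scores are synthetic and unrelated to the study's tumor marker data; the sketch uses only the Python standard library.

```python
import random


def auroc(scores, labels):
    """AUROC as the probability that a random positive case scores higher
    than a random negative case; ties count one half (Mann-Whitney U)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC
    (an assumed method, not necessarily the one used in the paper)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic classifier scores: class 1 tends to score higher than class 0.
rng = random.Random(42)
labels = [1] * 60 + [0] * 60
scores = ([rng.gauss(1.5, 1.0) for _ in range(60)]
          + [rng.gauss(0.0, 1.0) for _ in range(60)])

point = auroc(scores, labels)
low, high = bootstrap_ci(scores, labels)
print(f"AUROC = {point:.2f} (95 % CI: {low:.2f}-{high:.2f})")
```

Two models can be compared in this way by computing the point estimate and interval for each model's predicted scores on the same held-out cases. In a cross-validated setting, the scores would come from the folds in which each case was not used for training.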

Source journal

Journal of Laboratory Medicine Mathematics-Discrete Mathematics and Combinatorics

CiteScore

2.50

Self-citation rate

0.00%

Review time

10 weeks

About the journal: The Journal of Laboratory Medicine (JLM) is published bi-monthly and reports on the latest developments in laboratory medicine. Particular focus is placed on the diagnostic aspects of the clinical laboratory, although technical, regulatory, and educational topics are equally covered. The Journal specializes in the publication of high-standard, competent and timely review articles on clinical, methodological and pathogenic aspects of modern laboratory diagnostics. These reviews are critically assessed by expert reviewers and JLM’s Associate Editors, who are specialists in the various subdisciplines of laboratory medicine. In addition, JLM publishes original research articles, case reports, point/counterpoint articles and letters to the editor, all of which are peer reviewed by at least two experts in the field.