
Biodata Mining: Latest Publications

Effective hybrid feature selection using different bootstrap enhances cancers classification performance.
IF 4.5 Biology (CAS Tier 3) Q1 Mathematics Pub Date: 2022-09-30 DOI: 10.1186/s13040-022-00304-y
Noura Mohammed Abdelwahed, Gh S El-Tawel, M A Makhlouf

Background: Machine learning can be used to predict the onset of different human cancers. High-dimensional data pose enormous, complicated problems, among them an excessive number of genes, over-fitting, long fitting times, and reduced classification accuracy. Recursive Feature Elimination (RFE) is a wrapper method that selects the subset of features yielding the best accuracy. Despite the high performance of RFE, computation time and over-fitting are two disadvantages of this algorithm. Random forest for selection (RFS) has proven effective at selecting informative features and mitigating over-fitting.

Method: This paper proposes a method, positions first bootstrap step (PFBS) random forest selection recursive feature elimination (RFS-RFE), abbreviated PFBS-RFS-RFE, to enhance cancer classification performance. It applies the bootstrap at several positions: the outer first bootstrap step (OFBS), the inner first bootstrap step (IFBS), and the combined outer/inner first bootstrap step (O/IFBS). In the first position, OFBS applies resampling with replacement (bootstrap) before the selection step; RFS is then run with bootstrap = false, i.e., the whole dataset is used to build each tree. The resulting feature importances are hybridized with RFE to select the most relevant subset of features. In the second position, IFBS applies resampling with replacement while RFS runs, and the feature importances are again hybridized with RFE. In the third position, O/IFBS combines the first and second positions. RFE uses logistic regression (LR) as its estimator. The proposed methods are paired with four classifiers to solve the feature selection problem and improve the performance of RFE, and five datasets of different sizes are used to assess the performance of PFBS-RFS-RFE.
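
This listing carries no code, but a minimal scikit-learn sketch of the first (OFBS) position may clarify the pipeline; the toy data, the 50-gene importance cut-off, and all hyperparameters below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the OFBS position of PFBS-RFS-RFE; all values are toy placeholders.
import numpy as np
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 500)          # toy gene-expression matrix (samples x genes)
y = rng.randint(0, 2, 100)      # toy binary cancer labels

# Outer first bootstrap step (OFBS): resample with replacement before selection.
X_boot, y_boot = resample(X, y, replace=True, random_state=0)

# Random forest selection (RFS) with bootstrap=False, so each tree is built
# on the whole (already bootstrapped) dataset.
rfs = RandomForestClassifier(n_estimators=100, bootstrap=False, random_state=0)
rfs.fit(X_boot, y_boot)

# Keep the genes the forest found important, then hybridize with RFE,
# using logistic regression as the RFE estimator.
top = np.argsort(rfs.feature_importances_)[-50:]   # assumed cut-off of 50 genes
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_boot[:, top], y_boot)
selected_genes = top[rfe.support_]
print(selected_genes)
```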

Results: O/IFBS-RFS-RFE achieved the best performance compared with previous work, improving the accuracy, variance and ROC area on the RNA gene and dermatology erythemato-squamous diseases datasets to 99.994%, 0.0000004 and 1.000, and to 100.000%, 0.0 and 1.000, respectively.

Conclusion: High-dimensional datasets and the RFE algorithm face many difficulties in cancer classification. PFBS-RFS-RFE is proposed to address these difficulties through its different bootstrap positions. The important features extracted by RFS are used with RFE to obtain the effective features.

{"title":"Effective hybrid feature selection using different bootstrap enhances cancers classification performance.","authors":"Noura Mohammed Abdelwahed,&nbsp;Gh S El-Tawel,&nbsp;M A Makhlouf","doi":"10.1186/s13040-022-00304-y","DOIUrl":"https://doi.org/10.1186/s13040-022-00304-y","url":null,"abstract":"<p><strong>Background: </strong>Machine learning can be used to predict the different onset of human cancers. Highly dimensional data have enormous, complicated problems. One of these is an excessive number of genes plus over-fitting, fitting time, and classification accuracy. Recursive Feature Elimination (RFE) is a wrapper method for selecting the best subset of features that cause the best accuracy. Despite the high performance of RFE, time computation and over-fitting are two disadvantages of this algorithm. Random forest for selection (RFS) proves its effectiveness in selecting the effective features and improving the over-fitting problem.</p><p><strong>Method: </strong>This paper proposed a method, namely, positions first bootstrap step (PFBS) random forest selection recursive feature elimination (RFS-RFE) and its abbreviation is PFBS- RFS-RFE to enhance cancer classification performance. It used a bootstrap with many positions included in the outer first bootstrap step (OFBS), inner first bootstrap step (IFBS), and outer/ inner first bootstrap step (O/IFBS). In the first position, OFBS is applied as a resampling method (bootstrap) with replacement before selection step. The RFS is applied with bootstrap = false i.e., the whole datasets are used to build each tree. The importance features are hybrid with RFE to select the most relevant subset of features. In the second position, IFBS is applied as a resampling method (bootstrap) with replacement during applied RFS. The importance features are hybrid with RFE. In the third position, O/IFBS is applied as a hybrid of first and second positions. RFE used logistic regression (LR) as an estimator. The proposed methods are incorporated with four classifiers to solve the feature selection problems and modify the performance of RFE, in which five datasets with different size are used to assess the performance of the PFBS-RFS-RFE.</p><p><strong>Results: </strong>The results showed that the O/IFBS-RFS-RFE achieved the best performance compared with previous work and enhanced the accuracy, variance and ROC area for RNA gene and dermatology erythemato-squamous diseases datasets to become 99.994%, 0.0000004, 1.000 and 100.000%, 0.0 and 1.000, respectively.</p><p><strong>Conclusion: </strong>High dimensional datasets and RFE algorithm face many troubles in cancers classification performance. PFBS-RFS-RFE is proposed to fix these troubles with different positions. The importance features which extracted from RFS are used with RFE to obtain the effective features.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2022-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9523996/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40382338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Polygenic risk modeling of tumor stage and survival in bladder cancer.
IF 4.5 Biology (CAS Tier 3) Q1 Mathematics Pub Date: 2022-09-30 DOI: 10.1186/s13040-022-00306-w
Mauro Nascimben, Lia Rimondini, Davide Corà, Manolo Venturin

Introduction: Bladder cancer assessment with non-invasive gene expression signatures facilitates the detection of patients at risk and the surveillance of their status, bypassing the discomfort of cystoscopy. To achieve accurate cancer estimation, analysis pipelines for gene expression data (GED) may integrate a sequence of machine learning and bio-statistical techniques to model complex characteristics of pathological patterns.

Methods: Numerical experiments tested the combination of GED preprocessing by discretization with tree ensemble embeddings and nonlinear dimensionality reduction to categorize oncological patients comprehensively. Modeling aimed to identify tumor stage and distinguish survival outcomes in two situations: complete and partial data embedding. The latter condition simulates the addition of new patients to an existing model for rapid monitoring of disease progression. Machine learning procedures were employed to identify the genes most relevant to patient prognosis and to test how well preprocessed GED, compared to untransformed data, predicts patient conditions.

Results: Data embedding paired with dimensionality reduction produced prognostic maps with well-defined clusters of patients, suitable for medical decision support. A second experiment simulated the addition of new patients to an existing model (partial data embedding): the Uniform Manifold Approximation and Projection (UMAP) methodology with uniform data discretization led to better outcomes than the other analyzed pipelines. Further exploration of the parameter space for UMAP and t-distributed stochastic neighbor embedding (t-SNE) underlined the importance of tuning a larger number of parameters for UMAP than for t-SNE. Moreover, two machine learning experiments identified a group of genes valuable for partitioning patients (gene relevance analysis) and showed the higher precision obtained with preprocessed data in predicting tumor outcomes for cancer stage and survival rate (six-class prediction).
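
A hedged sketch of this kind of pipeline (uniform discretization, tree-ensemble embedding, then UMAP) is shown below; the library choices (scikit-learn, umap-learn) and every hyperparameter are assumptions for illustration rather than the paper's exact settings.

```python
# Sketch: discretize GED, embed with a tree ensemble, project to 2-D with UMAP.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.ensemble import RandomTreesEmbedding
import umap  # pip install umap-learn

rng = np.random.RandomState(0)
ged = rng.rand(200, 1000)  # toy gene-expression matrix (patients x genes)

# Uniform data discretization.
binned = KBinsDiscretizer(n_bins=5, encode="ordinal",
                          strategy="uniform").fit_transform(ged)

# Tree ensemble embedding of the discretized data.
embedding = RandomTreesEmbedding(n_estimators=100,
                                 random_state=0).fit_transform(binned)

# Nonlinear dimensionality reduction to a 2-D prognostic map.
coords = umap.UMAP(n_neighbors=15, min_dist=0.1,
                   random_state=0).fit_transform(embedding)
print(coords.shape)  # (200, 2)
```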

Conclusions: The present investigation proposed new analysis pipelines for disease outcome modeling from bladder cancer-related biomarkers. Complete and partial data embedding experiments suggested that pipelines employing UMAP had more accurate predictive ability, supporting recent literature trends on this methodology. However, several UMAP parameters were found to influence experimental results, so researchers are advised to pay attention to this aspect of the UMAP technique. Machine learning procedures further demonstrated the effectiveness of the proposed preprocessing in predicting patients' conditions and determined a sub-group of biomarkers significant for forecasting bladder cancer prognosis.

{"title":"Polygenic risk modeling of tumor stage and survival in bladder cancer.","authors":"Mauro Nascimben,&nbsp;Lia Rimondini,&nbsp;Davide Corà,&nbsp;Manolo Venturin","doi":"10.1186/s13040-022-00306-w","DOIUrl":"https://doi.org/10.1186/s13040-022-00306-w","url":null,"abstract":"<p><strong>Introduction: </strong>Bladder cancer assessment with non-invasive gene expression signatures facilitates the detection of patients at risk and surveillance of their status, bypassing the discomforts given by cystoscopy. To achieve accurate cancer estimation, analysis pipelines for gene expression data (GED) may integrate a sequence of several machine learning and bio-statistical techniques to model complex characteristics of pathological patterns.</p><p><strong>Methods: </strong>Numerical experiments tested the combination of GED preprocessing by discretization with tree ensemble embeddings and nonlinear dimensionality reductions to categorize oncological patients comprehensively. Modeling aimed to identify tumor stage and distinguish survival outcomes in two situations: complete and partial data embedding. This latter experimental condition simulates the addition of new patients to an existing model for rapid monitoring of disease progression. Machine learning procedures were employed to identify the most relevant genes involved in patient prognosis and test the performance of preprocessed GED compared to untransformed data in predicting patient conditions.</p><p><strong>Results: </strong>Data embedding paired with dimensionality reduction produced prognostic maps with well-defined clusters of patients, suitable for medical decision support. A second experiment simulated the addition of new patients to an existing model (partial data embedding): Uniform Manifold Approximation and Projection (UMAP) methodology with uniform data discretization led to better outcomes than other analyzed pipelines. Further exploration of parameter space for UMAP and t-distributed stochastic neighbor embedding (t-SNE) underlined the importance of tuning a higher number of parameters for UMAP rather than t-SNE. Moreover, two different machine learning experiments identified a group of genes valuable for partitioning patients (gene relevance analysis) and showed the higher precision obtained by preprocessed data in predicting tumor outcomes for cancer stage and survival rate (six classes prediction).</p><p><strong>Conclusions: </strong>The present investigation proposed new analysis pipelines for disease outcome modeling from bladder cancer-related biomarkers. Complete and partial data embedding experiments suggested that pipelines employing UMAP had a more accurate predictive ability, supporting the recent literature trends on this methodology. However, it was also found that several UMAP parameters influence experimental results, therefore deriving a recommendation for researchers to pay attention to this aspect of the UMAP technique. 
Machine learning procedures further demonstrated the effectiveness of the proposed preprocessing in predicting patients' conditions and determined a sub-group of biomarkers significant for forecasting bladder cancer prognosis.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2022-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9523990/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40384186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
A Gated Recurrent Unit based architecture for recognizing ontology concepts from biological literature.
IF 4.5 Biology (CAS Tier 3) Q1 Mathematics Pub Date: 2022-09-28 DOI: 10.1186/s13040-022-00310-0
Pratik Devkota, Somya D Mohanty, Prashanti Manda

Background: Annotating scientific literature with ontology concepts is a critical task in biology and several other domains for knowledge discovery. Ontology based annotations can power large-scale comparative analyses in a wide range of applications ranging from evolutionary phenotypes to rare human diseases to the study of protein functions. Computational methods that can tag scientific text with ontology terms have included lexical/syntactic methods, traditional machine learning, and most recently, deep learning.

Results: Here, we present state-of-the-art deep learning architectures based on Gated Recurrent Units for annotating text with ontology concepts. We use the Colorado Richly Annotated Full Text Corpus (CRAFT) as a gold standard for training and testing. We explore a number of additional information sources, including NCBI's BioThesaurus and the Unified Medical Language System (UMLS), to augment the information from CRAFT and increase prediction accuracy. Our best model reaches 0.84 in F1 score and semantic similarity.
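
For readers unfamiliar with this architecture class, a minimal PyTorch sketch of a bidirectional GRU token tagger follows; the vocabulary size, dimensions and single-layer design are illustrative assumptions, not the paper's model.

```python
# Minimal bidirectional GRU tagger: one ontology label (or "O") per token.
import torch
import torch.nn as nn

class GRUTagger(nn.Module):
    def __init__(self, vocab_size, n_labels, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)  # 2x for both directions

    def forward(self, token_ids):             # (batch, seq_len)
        h, _ = self.gru(self.emb(token_ids))  # (batch, seq_len, 2*hidden)
        return self.out(h)                    # per-token label logits

model = GRUTagger(vocab_size=20000, n_labels=50)
logits = model(torch.randint(0, 20000, (2, 30)))
print(logits.shape)  # torch.Size([2, 30, 50])
```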

Conclusion: The results shown here underscore the impact of using deep learning architectures to automatically recognize ontology concepts from literature. Augmenting the models with biological information beyond that present in the gold standard corpus yields a distinct improvement in prediction accuracy.

{"title":"A Gated Recurrent Unit based architecture for recognizing ontology concepts from biological literature.","authors":"Pratik Devkota,&nbsp;Somya D Mohanty,&nbsp;Prashanti Manda","doi":"10.1186/s13040-022-00310-0","DOIUrl":"https://doi.org/10.1186/s13040-022-00310-0","url":null,"abstract":"<p><strong>Background: </strong>Annotating scientific literature with ontology concepts is a critical task in biology and several other domains for knowledge discovery. Ontology based annotations can power large-scale comparative analyses in a wide range of applications ranging from evolutionary phenotypes to rare human diseases to the study of protein functions. Computational methods that can tag scientific text with ontology terms have included lexical/syntactic methods, traditional machine learning, and most recently, deep learning.</p><p><strong>Results: </strong>Here, we present state of the art deep learning architectures based on Gated Recurrent Units for annotating text with ontology concepts. We use the Colorado Richly Annotated Full Text Corpus (CRAFT) as a gold standard for training and testing. We explore a number of additional information sources including NCBI's BioThesauraus and Unified Medical Language System (UMLS) to augment information from CRAFT for increasing prediction accuracy. Our best model results in a 0.84 F1 and semantic similarity.</p><p><strong>Conclusion: </strong>The results shown here underscore the impact for using deep learning architectures for automatically recognizing ontology concepts from literature. The augmentation of the models with biological information beyond that present in the gold standard corpus shows a distinct improvement in prediction accuracy.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2022-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9516808/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40380616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Interpretable recurrent neural network models for dynamic prediction of the extubation failure risk in patients with invasive mechanical ventilation in the intensive care unit.
IF 4.5 Biology (CAS Tier 3) Q1 Mathematics Pub Date: 2022-09-27 DOI: 10.1186/s13040-022-00309-7
Zhixuan Zeng, Xianming Tang, Yang Liu, Zhengkun He, Xun Gong

Background: The clinical decision to extubate is a challenge in the treatment of patients with invasive mechanical ventilation (IMV), since existing extubation protocols are not capable of precisely predicting extubation failure (EF). This study aims to develop and validate interpretable recurrent neural network (RNN) models for dynamically predicting EF risk.

Methods: A retrospective cohort study was conducted on IMV patients from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database. Time series with a 4-h resolution were built for all included patients. Two types of RNN models, the long short-term memory (LSTM) and the gated recurrent unit (GRU), were developed. A stepwise logistic regression model was used to select key features for developing light-version RNN models. The RNN models were compared to five other non-temporal machine learning models. The Shapley additive explanations (SHAP) value was applied to explain the influence of the features on model prediction.
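
As a rough illustration of such models, the following PyTorch sketch builds either an LSTM or a GRU classifier over fixed-length 4-h windows; the feature count, sizes and single-layer design are assumptions, not the study's configuration.

```python
# Sketch of the two RNN variants on 4-h-resolution feature sequences.
import torch
import torch.nn as nn

class RiskRNN(nn.Module):
    def __init__(self, n_features, hidden=64, cell="lstm"):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):            # x: (batch, time_steps, n_features)
        h, _ = self.rnn(x)
        # Use the last time step to emit the current extubation-failure risk.
        return torch.sigmoid(self.head(h[:, -1]))

x = torch.randn(8, 12, 26)           # 8 patients, 12 windows, 26 features
for cell in ("lstm", "gru"):
    print(cell, RiskRNN(26, cell=cell)(x).shape)  # (8, 1) risk per patient
```

A SHAP explainer (e.g., shap.DeepExplainer) could then be applied to such a model to attribute the risk prediction to input features, as the paper describes.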

Results: Of 8,599 included patients, 2,609 had EF (30.3%). The area under the receiver operating characteristic curve (AUROC) of LSTM and GRU showed no statistical difference on the test set (0.828 vs. 0.829). The light-version RNN models, based on 26 features selected out of a total of 89, showed performance comparable to their corresponding full-version models. Among the non-temporal models, only the random forest (RF) (AUROC: 0.820) and the extreme gradient boosting (XGB) model (AUROC: 0.823) were comparable to the RNN models, but their calibration deviated.

Conclusions: The RNN models have excellent performance in predicting EF risk and have the potential to become real-time decision-support systems for extubation.

{"title":"Interpretable recurrent neural network models for dynamic prediction of the extubation failure risk in patients with invasive mechanical ventilation in the intensive care unit.","authors":"Zhixuan Zeng,&nbsp;Xianming Tang,&nbsp;Yang Liu,&nbsp;Zhengkun He,&nbsp;Xun Gong","doi":"10.1186/s13040-022-00309-7","DOIUrl":"https://doi.org/10.1186/s13040-022-00309-7","url":null,"abstract":"<p><strong>Background: </strong>Clinical decision of extubation is a challenge in the treatment of patient with invasive mechanical ventilation (IMV), since existing extubation protocols are not capable of precisely predicting extubation failure (EF). This study aims to develop and validate interpretable recurrent neural network (RNN) models for dynamically predicting EF risk.</p><p><strong>Methods: </strong>A retrospective cohort study was conducted on IMV patients from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database. Time series with a 4-h resolution were built for all included patients. Two types of RNN models, the long short-term memory (LSTM) and the gated recurrent unit (GRU), were developed. A stepwise logistic regression model was used to select key features for developing light-version RNN models. The RNN models were compared to other five non-temporal machine learning models. The Shapley additive explanations (SHAP) value was applied to explain the influence of the features on model prediction.</p><p><strong>Results: </strong>Of 8,599 included patients, 2,609 had EF (30.3%). The area under receiver operating characteristic curve (AUROC) of LSTM and GRU showed no statistical difference on the test set (0.828 vs. 0.829). The light-version RNN models based on the 26 features selected out of a total of 89 features showed comparable performance as their corresponding full-version models. Among the non-temporal models, only the random forest (RF) (AUROC: 0.820) and the extreme gradient boosting (XGB) model (AUROC: 0.823) were comparable to the RNN models, but their calibration was deviated.</p><p><strong>Conclusions: </strong>The RNN models have excellent predictive performance for predicting EF risk and have potential to become real-time assistant decision-making systems for extubation.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2022-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9513908/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40375927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Machine Learning Algorithms for understanding the determinants of under-five Mortality.
IF 4 Biology (CAS Tier 3) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2022-09-24 DOI: 10.1186/s13040-022-00308-8
Rakesh Kumar Saroj, Pawan Kumar Yadav, Rajneesh Singh, Obvious N Chilyabanyama

Background: Under-five mortality is a matter of serious concern for child health as well as for the social development of any country. This paper aimed to determine the accuracy of machine learning models in predicting under-five mortality and to identify the most significant factors associated with it.

Method: The data were taken from the National Family Health Survey (NFHS-IV) of Uttar Pradesh. First, we used multivariate logistic regression because of its capability to identify important factors; then we applied machine learning techniques such as decision tree, random forest, Naïve Bayes, K-nearest neighbor (KNN), logistic regression, support vector machine (SVM), neural network, and ridge classifier. Each model's performance was checked by a confusion matrix, accuracy, precision, recall, F1 score, Cohen's Kappa, and the area under the receiver operating characteristic curve (AUROC). Information gain rank was used to find the important factors for under-five mortality. Data analysis was performed using STATA-16.0, Python 3.3, and IBM SPSS Statistics for Windows, Version 27.0.
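
A compact scikit-learn sketch of this kind of multi-model comparison appears below; it uses toy imbalanced data and default stand-ins for the listed classifiers, not the NFHS-IV data or the paper's tuning.

```python
# Compare several classifiers with 5-fold cross-validation on toy imbalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9],
                           random_state=0)  # imbalanced toy stand-in
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "naive bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "neural network": MLPClassifier(max_iter=1000, random_state=0),
    "ridge": RidgeClassifier(),
}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring=["accuracy", "precision", "recall", "f1"])
    print(name, round(scores["test_f1"].mean(), 3))
```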

Result: Applying the machine learning models showed that the neural network was the best predictive model for under-five mortality compared with the other models, with accuracy of 95.29% to 95.96%, recall of 71.51% to 81.03%, precision of 36.64% to 51.83%, F1 score of 50.46% to 62.68%, Cohen's Kappa of 0.48 to 0.60, AUROC of 93.51% to 96.22%, and a precision-recall curve range of 99.52% to 99.73%. The neural network was the most efficient model, but logistic regression also performed well in predicting under-five mortality, with accuracy of 94% to 95%, AUROC of 93.4% to 94.8%, and a precision-recall curve of 99.5% to 99.6%. The number of living children, survival time, wealth index, child size at birth, births in the last five years, the total number of children ever born, the mother's education level, and birth order were identified as important factors influencing under-five mortality.

Conclusion: The neural network model predicted under-five mortality better than the other machine learning models, although logistic regression analysis also showed good results. These models may be helpful for the analysis of high-dimensional data in health research.

{"title":"Machine Learning Algorithms for understanding the determinants of under-five Mortality.","authors":"Rakesh Kumar Saroj, Pawan Kumar Yadav, Rajneesh Singh, Obvious N Chilyabanyama","doi":"10.1186/s13040-022-00308-8","DOIUrl":"10.1186/s13040-022-00308-8","url":null,"abstract":"<p><strong>Background: </strong>Under-five mortality is a matter of serious concern for child health as well as the social development of any country. The paper aimed to find the accuracy of machine learning models in predicting under-five mortality and identify the most significant factors associated with under-five mortality.</p><p><strong>Method: </strong>The data was taken from the National Family Health Survey (NFHS-IV) of Uttar Pradesh. First, we used multivariate logistic regression due to its capability for predicting the important factors, then we used machine learning techniques such as decision tree, random forest, Naïve Bayes, K- nearest neighbor (KNN), logistic regression, support vector machine (SVM), neural network, and ridge classifier. Each model's accuracy was checked by a confusion matrix, accuracy, precision, recall, F1 score, Cohen's Kappa, and area under the receiver operating characteristics curve (AUROC). Information gain rank was used to find the important factors for under-five mortality. Data analysis was performed using, STATA-16.0, Python 3.3, and IBM SPSS Statistics for Windows, Version 27.0 software.</p><p><strong>Result: </strong>By applying the machine learning models, results showed that the neural network model was the best predictive model for under-five mortality when compared with other predictive models, with model accuracy of (95.29% to 95.96%), recall (71.51% to 81.03%), precision (36.64% to 51.83%), F1 score (50.46% to 62.68%), Cohen's Kappa value (0.48 to 0.60), AUROC range (93.51% to 96.22%) and precision-recall curve range (99.52% to 99.73%). The neural network was the most efficient model, but logistic regression also shows well for predicting under-five mortality with accuracy (94% to 95%)., AUROC range (93.4% to 94.8%), and precision-recall curve (99.5% to 99.6%). The number of living children, survival time, wealth index, child size at birth, birth in the last five years, the total number of children ever born, mother's education level, and birth order were identified as important factors influencing under-five mortality.</p><p><strong>Conclusion: </strong>The neural network model was a better predictive model compared to other machine learning models in predicting under-five mortality, but logistic regression analysis also shows good results. These models may be helpful for the analysis of high-dimensional data for health research.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.0,"publicationDate":"2022-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9509654/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"33480090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ParticleChromo3D: a Particle Swarm Optimization algorithm for chromosome 3D structure prediction from Hi-C data.
IF 4.5 Biology (CAS Tier 3) Q1 Mathematics Pub Date: 2022-09-21 DOI: 10.1186/s13040-022-00305-x
David Vadnais, Michael Middleton, Oluwatosin Oluwadare

Background: The three-dimensional (3D) structure of chromatin has a massive effect on its function. Because of this, it is desirable to have an understanding of the 3D structural organization of chromatin. To gain greater insight into the spatial organization of chromosomes and genomes and the functions they perform, chromosome conformation capture (3C) techniques, particularly Hi-C, have been developed. The Hi-C technology is widely used and well-known because of its ability to profile interactions for all read pairs in an entire genome. The advent of Hi-C has greatly expanded our understanding of the 3D genome, genome folding, gene regulation and has enabled the development of many 3D chromosome structure reconstruction methods.

Results: Here, we propose a novel approach for 3D chromosome and genome structure reconstruction from Hi-C data using a Particle Swarm Optimization (PSO) approach called ParticleChromo3D. The algorithm begins with a group of candidate solution locations for each chromosome bin, according to the particle swarm algorithm, and then iterates their positions towards a global best candidate solution. While moving towards the optimal global solution, each candidate solution, or particle, uses its own local best information and a randomizer to choose its path. Using several metrics to validate our results, we show that ParticleChromo3D produces a robust and rigorous representation of the 3D structure for input Hi-C data. We evaluated our algorithm on simulated and real Hi-C data in this work. Our results show that ParticleChromo3D is more accurate than most of the existing algorithms for 3D structure reconstruction.
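
To make the optimization idea concrete, here is a toy PSO sketch in Python where each particle is a full set of 3D bin coordinates scored against distances derived from (random) Hi-C contacts; the contact-to-distance conversion, swarm settings and scoring are illustrative assumptions, not ParticleChromo3D itself.

```python
# Toy PSO: fit 3-D coordinates whose pairwise distances match target distances.
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_particles, iters = 20, 30, 200
contacts = rng.random((n_bins, n_bins)) + 1e-3
contacts = (contacts + contacts.T) / 2
target = 1.0 / contacts ** 0.5          # assumed contact-to-distance conversion

def fitness(coords):
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    return np.sum((d - target) ** 2)

pos = rng.standard_normal((n_particles, n_bins, 3))   # one structure per particle
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random(2)
    # Standard PSO update: inertia plus pulls toward personal and global bests.
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    fit = np.array([fitness(p) for p in pos])
    improved = fit < pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmin()].copy()

print(round(pbest_fit.min(), 2))
```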

Conclusions: Our results also show that the constructed ParticleChromo3D structures are very consistent, indicating that the algorithm reliably arrives at the global solution on each run. The source code for ParticleChromo3D, the simulated and real Hi-C datasets, and the models generated for these datasets are available here: https://github.com/OluwadareLab/ParticleChromo3D.

{"title":"ParticleChromo3D: a Particle Swarm Optimization algorithm for chromosome 3D structure prediction from Hi-C data.","authors":"David Vadnais,&nbsp;Michael Middleton,&nbsp;Oluwatosin Oluwadare","doi":"10.1186/s13040-022-00305-x","DOIUrl":"https://doi.org/10.1186/s13040-022-00305-x","url":null,"abstract":"<p><strong>Background: </strong>The three-dimensional (3D) structure of chromatin has a massive effect on its function. Because of this, it is desirable to have an understanding of the 3D structural organization of chromatin. To gain greater insight into the spatial organization of chromosomes and genomes and the functions they perform, chromosome conformation capture (3C) techniques, particularly Hi-C, have been developed. The Hi-C technology is widely used and well-known because of its ability to profile interactions for all read pairs in an entire genome. The advent of Hi-C has greatly expanded our understanding of the 3D genome, genome folding, gene regulation and has enabled the development of many 3D chromosome structure reconstruction methods.</p><p><strong>Results: </strong>Here, we propose a novel approach for 3D chromosome and genome structure reconstruction from Hi-C data using Particle Swarm Optimization (PSO) approach called ParticleChromo3D. This algorithm begins with a grouping of candidate solution locations for each chromosome bin, according to the particle swarm algorithm, and then iterates its position towards a global best candidate solution. While moving towards the optimal global solution, each candidate solution or particle uses its own local best information and a randomizer to choose its path. Using several metrics to validate our results, we show that ParticleChromo3D produces a robust and rigorous representation of the 3D structure for input Hi-C data. We evaluated our algorithm on simulated and real Hi-C data in this work. Our results show that ParticleChromo3D is more accurate than most of the existing algorithms for 3D structure reconstruction.</p><p><strong>Conclusions: </strong>Our results also show that constructed ParticleChromo3D structures are very consistent, hence indicating that it will always arrive at the global solution at every iteration. The source code for ParticleChromo3D, the simulated and real Hi-C datasets, and the models generated for these datasets are available here: https://github.com/OluwadareLab/ParticleChromo3D.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2022-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9494900/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40374693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Learning and visualizing chronic latent representations using electronic health records.
IF 4.5 Biology (CAS Tier 3) Q1 Mathematics Pub Date: 2022-09-05 DOI: 10.1186/s13040-022-00303-z
David Chushig-Muzo, Cristina Soguero-Ruiz, Pablo de Miguel Bohoyo, Inmaculada Mora-Jiménez

Background: Nowadays, the number of patients with chronic diseases such as diabetes and hypertension has reached alarming levels worldwide. These diseases increase the risk of developing acute complications and involve a substantial economic burden and demand for health resources. The widespread adoption of Electronic Health Records (EHRs) is opening great opportunities for supporting decision-making. Nevertheless, data extracted from EHRs are complex (heterogeneous, high-dimensional and usually noisy), hampering knowledge extraction with conventional approaches.

Methods: We propose the use of the Denoising Autoencoder (DAE), a Machine Learning (ML) technique that transforms high-dimensional data into latent representations (LRs), thus addressing the main challenges of clinical data. We explore in this work how the combination of LRs with a visualization method can be used to map patient data into a two-dimensional space, gaining knowledge about the distribution of patients with different chronic conditions. Furthermore, this representation can also be used to characterize the evolution of a patient's health status, which is of paramount importance in the clinical setting.
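
A minimal PyTorch sketch of a denoising autoencoder of this kind follows; the layer sizes, noise level and toy data are assumptions for illustration, not the paper's configuration.

```python
# Denoising autoencoder: corrupt the input, reconstruct the clean record,
# and use the bottleneck activations as the latent representation (LR).
import torch
import torch.nn as nn

n_features, latent_dim = 300, 16   # e.g., binary diagnosis/drug flags
enc = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                    nn.Linear(64, latent_dim))
dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                    nn.Linear(64, n_features))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(512, n_features)            # toy stand-in for EHR vectors
for _ in range(100):
    noisy = x + 0.1 * torch.randn_like(x)  # denoising: corrupt the input
    recon = dec(enc(noisy))
    loss = loss_fn(recon, x)               # reconstruct the *clean* input
    opt.zero_grad()
    loss.backward()
    opt.step()

latent = enc(x)                            # LRs to feed a 2-D visualization
print(latent.shape)                        # torch.Size([512, 16])
```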

Results: To obtain clinical LRs, we considered real-world data extracted from EHRs linked to the University Hospital of Fuenlabrada in Spain. Experimental results showed the great potential of DAEs to identify patients with clinical patterns linked to hypertension, diabetes and multimorbidity. The procedure allowed us to find patients with the same main chronic disease but different clinical characteristics. Thus, we identified two kinds of diabetic patients differing in their drug therapy (insulin-dependent and non-insulin-dependent), as well as a group of women affected by hypertension and gestational diabetes. We also present a proof of concept for mapping the health status evolution of synthetic patients when considering the most significant diagnoses and drugs associated with chronic patients.

Conclusion: Our results highlighted the value of ML techniques for extracting clinical knowledge, supporting the identification of patients with certain chronic conditions. Furthermore, the progression of a patient's health status in the two-dimensional space might be used as a tool by clinicians aiming to characterize health conditions and identify the most relevant clinical codes.

{"title":"Learning and visualizing chronic latent representations using electronic health records.","authors":"David Chushig-Muzo,&nbsp;Cristina Soguero-Ruiz,&nbsp;Pablo de Miguel Bohoyo,&nbsp;Inmaculada Mora-Jiménez","doi":"10.1186/s13040-022-00303-z","DOIUrl":"https://doi.org/10.1186/s13040-022-00303-z","url":null,"abstract":"<p><strong>Background: </strong>Nowadays, patients with chronic diseases such as diabetes and hypertension have reached alarming numbers worldwide. These diseases increase the risk of developing acute complications and involve a substantial economic burden and demand for health resources. The widespread adoption of Electronic Health Records (EHRs) is opening great opportunities for supporting decision-making. Nevertheless, data extracted from EHRs are complex (heterogeneous, high-dimensional and usually noisy), hampering the knowledge extraction with conventional approaches.</p><p><strong>Methods: </strong>We propose the use of the Denoising Autoencoder (DAE), a Machine Learning (ML) technique allowing to transform high-dimensional data into latent representations (LRs), thus addressing the main challenges with clinical data. We explore in this work how the combination of LRs with a visualization method can be used to map the patient data in a two-dimensional space, gaining knowledge about the distribution of patients with different chronic conditions. Furthermore, this representation can be also used to characterize the patient's health status evolution, which is of paramount importance in the clinical setting.</p><p><strong>Results: </strong>To obtain clinical LRs, we considered real-world data extracted from EHRs linked to the University Hospital of Fuenlabrada in Spain. Experimental results showed the great potential of DAEs to identify patients with clinical patterns linked to hypertension, diabetes and multimorbidity. The procedure allowed us to find patients with the same main chronic disease but different clinical characteristics. Thus, we identified two kinds of diabetic patients with differences in their drug therapy (insulin and non-insulin dependant), and also a group of women affected by hypertension and gestational diabetes. We also present a proof of concept for mapping the health status evolution of synthetic patients when considering the most significant diagnoses and drugs associated with chronic patients.</p><p><strong>Conclusion: </strong>Our results highlighted the value of ML techniques to extract clinical knowledge, supporting the identification of patients with certain chronic conditions. Furthermore, the patient's health status progression on the two-dimensional space might be used as a tool for clinicians aiming to characterize health conditions and identify their more relevant clinical codes.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2022-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9446539/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40351439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Analysis of risk factors progression of preterm delivery using electronic health records.
IF 4.5 Biology (CAS Tier 3) Q1 Mathematics Pub Date: 2022-08-17 DOI: 10.1186/s13040-022-00298-7
Zeineb Safi, Neethu Venugopal, Haytham Ali, Michel Makhlouf, Faisal Farooq, Sabri Boughorbel

Background: Preterm deliveries have many negative health implications for both mother and child. Identifying the population-level factors that increase the risk of preterm deliveries is an important step towards mitigating their impact and reducing their frequency. The purpose of this work is to identify preterm delivery risk factors and their progression throughout the pregnancy from a large collection of Electronic Health Records (EHR).

Results: The study cohort includes about 60,000 deliveries in the USA with the complete medical history from EHR for diagnoses, medications and procedures. We propose a temporal analysis of risk factors by estimating and comparing risk ratios and variable importance at different time points prior to the delivery event: 0, 12 and 24 weeks of gestation. We did so by conducting a retrospective cohort study of patient history for a selected set of mothers who delivered preterm and a control group of mothers who delivered full-term. We analyzed the extracted data using logistic regression and random forest models. Our analyses showed that the highest risk ratio and variable importance correspond to a history of previous preterm delivery. Other risk factors were identified; some are consistent with those reported in the literature, while others need further investigation.
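
The two quantities being compared here can be illustrated with a short Python sketch: an unadjusted risk ratio for one binary factor, and random-forest variable importances over several candidate factors; the arrays below are toy stand-ins, not the study cohort.

```python
# Risk ratio for a binary exposure, plus random-forest variable importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
exposed = rng.integers(0, 2, 5000)           # e.g., prior preterm delivery flag
preterm = (rng.random(5000) < np.where(exposed, 0.25, 0.08)).astype(int)

# Risk ratio = P(preterm | exposed) / P(preterm | unexposed).
rr = preterm[exposed == 1].mean() / preterm[exposed == 0].mean()
print(f"risk ratio: {rr:.2f}")

# Variable importance from a random forest over several candidate factors.
X = rng.integers(0, 2, (5000, 10))
X[:, 0] = exposed                            # put the strong factor first
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, preterm)
print(rf.feature_importances_.round(3))
```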

Conclusions: The comparative analysis of risk factors at different time points showed that risk factors in early pregnancy relate to patient history and chronic conditions, while risk factors in late pregnancy are specific to the current pregnancy. Our analysis unifies several previously reported studies on preterm risk factors. It also gives important insights into the changes of risk factors over the course of pregnancy. The code used for data analysis will be made available on GitHub.

{"title":"Analysis of risk factors progression of preterm delivery using electronic health records.","authors":"Zeineb Safi,&nbsp;Neethu Venugopal,&nbsp;Haytham Ali,&nbsp;Michel Makhlouf,&nbsp;Faisal Farooq,&nbsp;Sabri Boughorbel","doi":"10.1186/s13040-022-00298-7","DOIUrl":"https://doi.org/10.1186/s13040-022-00298-7","url":null,"abstract":"<p><strong>Background: </strong>Preterm deliveries have many negative health implications on both mother and child. Identifying the population level factors that increase the risk of preterm deliveries is an important step in the direction of mitigating the impact and reducing the frequency of occurrence of preterm deliveries. The purpose of this work is to identify preterm delivery risk factors and their progression throughout the pregnancy from a large collection of Electronic Health Records (EHR).</p><p><strong>Results: </strong>The study cohort includes about 60,000 deliveries in the USA with the complete medical history from EHR for diagnoses, medications and procedures. We propose a temporal analysis of risk factors by estimating and comparing risk ratios and variable importance at different time points prior to the delivery event. We selected the following time points before delivery: 0, 12 and 24 week(s) of gestation. We did so by conducting a retrospective cohort study of patient history for a selected set of mothers who delivered preterm and a control group of mothers that delivered full-term. We analyzed the extracted data using logistic regression and random forests models. The results of our analyses showed that the highest risk ratio and variable importance corresponds to history of previous preterm delivery. Other risk factors were identified, some of which are consistent with those that are reported in the literature, others need further investigation.</p><p><strong>Conclusions: </strong>The comparative analysis of the risk factors at different time points showed that risk factors in the early pregnancy related to patient history and chronic condition, while the risk factors in late pregnancy are specific to the current pregnancy. Our analysis unifies several previously reported studies on preterm risk factors. It also gives important insights on the changes of risk factors in the course of pregnancy. The code used for data analysis will be made available on github.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2022-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9386949/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40718285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Neural network methods for diagnosing patient conditions from cardiopulmonary exercise testing data.
IF 4 Biology (CAS Tier 3) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2022-08-13 DOI: 10.1186/s13040-022-00299-6
Donald E Brown, Suchetha Sharma, James A Jablonski, Arthur Weltman

Background: Cardiopulmonary exercise testing (CPET) provides a reliable and reproducible approach to measuring fitness in patients and diagnosing their health problems. However, the data from CPET consist of multiple time series that require training to interpret. Part of this training teaches the use of flow charts or nested decision trees to interpret the CPET results. This paper investigates two machine learning techniques using neural networks to predict patient health conditions from CPET data, in contrast to flow charts. The data for this investigation come from a small sample of patients with known health problems who had CPET results. The small sample size also allows us to investigate the use and performance of deep learning neural networks on health care problems with limited amounts of labeled training and testing data.

Methods: This paper compares the current standard for interpreting and classifying CPET data, flowcharts, to neural network techniques: autoencoders and convolutional neural networks (CNNs). The study also investigated the performance of principal component analysis (PCA) with logistic regression to provide an additional baseline for comparison to the neural network techniques.
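
A brief scikit-learn sketch of the PCA-plus-logistic-regression baseline with 5-fold cross-validation follows; the flattened toy features and the 10-component choice are placeholder assumptions, not the paper's setup.

```python
# PCA + logistic regression baseline, evaluated with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=120, n_features=200, n_informative=20,
                           random_state=0)  # toy stand-in for flattened CPET series
pipe = make_pipeline(StandardScaler(), PCA(n_components=10),
                     LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean())
```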

Results: The patients in the sample had two primary diagnoses: heart failure and metabolic syndrome. All model-based testing was done with 5-fold cross-validation and metrics of precision, recall, F1 score, and accuracy. As a baseline for comparison to our models, the highest performing flow chart method achieved an accuracy of 77%. Both PCA regression and CNN achieved an average accuracy of 90% and outperformed the flow chart methods on all metrics. The autoencoder with logistic regression performed the best on each of the metrics and had an average accuracy of 94%.

Conclusions: This study suggests that machine learning and, in particular, neural network techniques can provide higher levels of accuracy with CPET data than traditional flowchart methods. Further, the CNN performed well with a small data set, showing that these techniques can be designed to perform well on the small-data problems often found in health care and the life sciences. Further testing with larger data sets is needed to continue evaluating the use of machine learning to interpret CPET data.

{"title":"Neural network methods for diagnosing patient conditions from cardiopulmonary exercise testing data.","authors":"Donald E Brown, Suchetha Sharma, James A Jablonski, Arthur Weltman","doi":"10.1186/s13040-022-00299-6","DOIUrl":"10.1186/s13040-022-00299-6","url":null,"abstract":"<p><strong>Background: </strong>Cardiopulmonary exercise testing (CPET) provides a reliable and reproducible approach to measuring fitness in patients and diagnosing their health problems. However, the data from CPET consist of multiple time series that require training to interpret. Part of this training teaches the use of flow charts or nested decision trees to interpret the CPET results. This paper investigates the use of two machine learning techniques using neural networks to predict patient health conditions with CPET data in contrast to flow charts. The data for this investigation comes from a small sample of patients with known health problems and who had CPET results. The small size of the sample data also allows us to investigate the use and performance of deep learning neural networks on health care problems with limited amounts of labeled training and testing data.</p><p><strong>Methods: </strong>This paper compares the current standard for interpreting and classifying CPET data, flowcharts, to neural network techniques, autoencoders and convolutional neural networks (CNN). The study also investigated the performance of principal component analysis (PCA) with logistic regression to provide an additional baseline of comparison to the neural network techniques.</p><p><strong>Results: </strong>The patients in the sample had two primary diagnoses: heart failure and metabolic syndrome. All model-based testing was done with 5-fold cross-validation and metrics of precision, recall, F1 score, and accuracy. As a baseline for comparison to our models, the highest performing flow chart method achieved an accuracy of 77%. Both PCA regression and CNN achieved an average accuracy of 90% and outperformed the flow chart methods on all metrics. The autoencoder with logistic regression performed the best on each of the metrics and had an average accuracy of 94%.</p><p><strong>Conclusions: </strong>This study suggests that machine learning and neural network techniques, in particular, can provide higher levels of accuracy with CPET data than traditional flowchart methods. Further, the CNN performed well with a small data set showing that these techniques can be designed to perform well on small data problems that are often found in health care and the life sciences. Further testing with larger data sets is needed to continue evaluating the use of machine learning to interpret CPET data.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.0,"publicationDate":"2022-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9375280/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40626572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Benchmarking AutoML frameworks for disease prediction using medical claims.
IF 4.5 Biology (CAS Tier 3) Q1 Mathematics Pub Date: 2022-07-26 DOI: 10.1186/s13040-022-00300-2
Roland Albert A Romero, Mariefel Nicole Y Deypalan, Suchit Mehrotra, John Titus Jungao, Natalie E Sheils, Elisabetta Manduchi, Jason H Moore

Objectives: To ascertain and compare the performance of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets.

Materials and methods: We generated a large dataset from historical de-identified administrative claims, including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performance on several metrics.
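
The paper's own pipeline is not reproduced here, but the evaluation setup it describes can be approximated in a few lines. The sketch below assumes scikit-learn; make_classification is a synthetic stand-in for the de-identified claims features (binary disease-code flags plus demographics), and the baseline is a random forest, as named in the results.

```python
# Minimal sketch of the evaluation setup described above (assumptions:
# scikit-learn; synthetic data in place of the claims features).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Highly imbalanced binary outcome (~1% positives), mimicking low disease prevalence.
X, y = make_classification(n_samples=20_000, n_features=50, n_informative=10,
                           weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

baseline = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
baseline.fit(X_train, y_train)
scores = baseline.predict_proba(X_test)[:, 1]

# PR-AUC (average precision) is the headline metric for imbalanced outcomes.
print(f"PR-AUC : {average_precision_score(y_test, scores):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_test, scores):.3f}")
```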

Results: The AutoML tools improved on the baseline random forest model but did not differ significantly from each other. All models recorded a low area under the precision-recall curve and failed to identify true positives while keeping the true-negative rate high. Model performance was not directly related to prevalence. We provide a specific use case to illustrate how to select a threshold that gives the best balance between true- and false-positive rates, as this is an important consideration in medical applications.
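
The abstract does not state which thresholding rule the use case applies. One common choice that balances true- and false-positive rates is to maximize Youden's J statistic (TPR − FPR) over the ROC curve; the sketch below, which assumes scikit-learn and reuses the scores array from the previous sketch, illustrates that rule rather than the paper's own procedure.

```python
# One plausible threshold-selection rule (not necessarily the paper's):
# maximize Youden's J = TPR - FPR over the ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, scores):
    """Return the score threshold that maximizes TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]

# Continuing the previous sketch:
# t = youden_threshold(y_test, scores)
# y_pred = (scores >= t).astype(int)
```

In a clinical setting one might instead fix a minimum sensitivity and accept the resulting false-positive rate; either way, the threshold is an application-specific choice, which is the point of the use case.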

Discussion: Healthcare datasets present several challenges for AutoML tools, including large sample sizes, high class imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps toward better performance.
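
As one concrete instance of the resampling-plus-ensemble direction, the sketch below chains SMOTE oversampling with a gradient-boosting ensemble via imbalanced-learn's pipeline, which applies resampling only during fitting so that evaluation folds stay untouched. The libraries and hyperparameters are assumptions for illustration, not the authors' proposed setup.

```python
# Sketch of one imbalance-learning combination mentioned above
# (assumed libraries: scikit-learn and imbalanced-learn).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("smote", SMOTE(random_state=0)),         # oversample the minority class
    ("model", GradientBoostingClassifier()),  # ensemble learner
])

# With X, y as in the earlier sketch:
# cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="average_precision")
```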

Conclusion: Among the three tools explored, no AutoML framework consistently outperformed the rest in predictive performance. The performance of the models in this study suggests there is room for improvement in handling medical claims data. Finally, the selection of the optimal prediction threshold should be guided by the specific practical application.

{"title":"Benchmarking AutoML frameworks for disease prediction using medical claims.","authors":"Roland Albert A Romero,&nbsp;Mariefel Nicole Y Deypalan,&nbsp;Suchit Mehrotra,&nbsp;John Titus Jungao,&nbsp;Natalie E Sheils,&nbsp;Elisabetta Manduchi,&nbsp;Jason H Moore","doi":"10.1186/s13040-022-00300-2","DOIUrl":"https://doi.org/10.1186/s13040-022-00300-2","url":null,"abstract":"<p><strong>Objectives: </strong>Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets.</p><p><strong>Materials and methods: </strong>We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics.</p><p><strong>Results: </strong>The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications.</p><p><strong>Discussion: </strong>Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance.</p><p><strong>Conclusion: </strong>Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2022-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9327416/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40541160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8