
Biodata Mining: Latest Articles

Reference-free phylogeny from sequencing data.
IF 4.5 Zone 3 Biology Q1 Mathematics Pub Date: 2023-03-27 DOI: 10.1186/s13040-023-00329-x
Petr Ryšavý, Filip Železný

Motivation: Clustering of genetic sequences is one of the key parts of bioinformatics analyses. Resulting phylogenetic trees are beneficial for solving many research questions, including tracing the history of species, studying migration in the past, or tracing a source of a virus outbreak. At the same time, biologists provide more data in the raw form of reads or only on contig-level assembly. Therefore, tools that are able to process those data without supervision need to be developed.

Results: In this paper, we present a tool for reference-free phylogeny capable of handling data where no mature-level assembly is available. The tool allows distance calculation for raw reads, contigs, and a combination of the two. The tool provides an estimate of the Levenshtein distance between the sequences, which in turn estimates the number of mutations between the organisms. Compared with previous research, the novelty of the method lies in a newly proposed combination of the read and contig measures, a new method for read-contig mapping, and an efficient embedding of contigs.
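For orientation, the quantity the tool estimates can be computed exactly for two short, fully known sequences with the textbook dynamic program. The sketch below is only that textbook algorithm, not the authors' estimator for reads and contigs; the function name `levenshtein` and the example strings are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# One deleted base separates these two toy sequences.
print(levenshtein("ACGT", "AGT"))  # 1
```

The exact computation is quadratic in sequence length and needs the assembled sequences, which is why an estimate from reads or contigs is attractive when no mature assembly exists.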

Citations: 0
Unsupervised encoding selection through ensemble pruning for biomedical classification.
IF 4.5 Zone 3 Biology Q1 Mathematics Pub Date: 2023-03-16 DOI: 10.1186/s13040-022-00317-7
Sebastian Spänig, Alexander Michel, Dominik Heider

Background: Owing to the rising levels of multi-resistant pathogens, antimicrobial peptides, an alternative strategy to classic antibiotics, have received more attention. A crucial part of this work is the costly identification and validation. With the ever-growing number of annotated peptides, researchers leverage artificial intelligence to circumvent the cumbersome, wet-lab-based identification and automate the detection of promising candidates. However, the prediction of a peptide's function is not limited to antimicrobial efficiency. To date, multiple studies have successfully classified additional properties, e.g., antiviral or cell-penetrating effects. In this light, ensemble classifiers are employed to further improve the prediction. Although we recently presented a workflow to significantly diminish the initial encoding choice, a fully unsupervised encoding selection, considering various machine learning models, is still lacking.

Results: We developed a workflow, automatically selecting encodings and generating classifier ensembles by employing sophisticated pruning methods. We observed that the Pareto frontier pruning is a good method to create encoding ensembles for the datasets at hand. In addition, encodings combined with the Decision Tree classifier as the base model are often superior. However, our results also demonstrate that none of the ensemble building techniques is outstanding for all datasets.
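The idea behind Pareto frontier pruning can be sketched as keeping only those candidates that no other candidate beats on every objective. The abstract does not state which objectives the workflow uses, so the two scores below (both to be maximized), the candidate names, and the function name `pareto_frontier` are hypothetical.

```python
from typing import Dict, List, Tuple


def pareto_frontier(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates.

    Each candidate has two scores, and both are maximized (an assumption).
    """
    def dominates(p: Tuple[float, float], q: Tuple[float, float]) -> bool:
        # p dominates q if it is at least as good on both objectives
        # and strictly better on at least one of them.
        return p[0] >= q[0] and p[1] >= q[1] and (p[0] > q[0] or p[1] > q[1])

    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    )


# Hypothetical (score_1, score_2) pairs for four encodings.
candidates = {
    "enc_a": (0.90, 0.10),
    "enc_b": (0.85, 0.30),
    "enc_c": (0.80, 0.20),  # worse than enc_b on both objectives
    "enc_d": (0.70, 0.40),
}
print(pareto_frontier(candidates))  # ['enc_a', 'enc_b', 'enc_d']
```

The pairwise check is quadratic in the number of candidates, which is usually acceptable for ensembles of modest size.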

Conclusion: The workflow conducts multiple pruning methods to evaluate ensemble classifiers composed from a wide range of peptide encodings and base models. Consequently, researchers can use the workflow for unsupervised encoding selection and ensemble creation. Ultimately, the extensible workflow can be used as a plugin for the PEPTIDE REACToR, further establishing it as a versatile tool in the domain.

Citations: 0
Algorithm-based detection of acute kidney injury according to full KDIGO criteria including urine output following cardiac surgery: a descriptive analysis.
IF 4.5 Zone 3 Biology Q1 Mathematics Pub Date: 2023-03-16 DOI: 10.1186/s13040-023-00323-3
Nico Schmid, Mihnea Ghinescu, Moritz Schanz, Micha Christ, Severin Schricker, Markus Ketteler, Mark Dominik Alscher, Ulrich Franke, Nora Goebel

Background: Automated data analysis and processing has the potential to assist, improve and guide decision making in medical practice. However, it has not yet been fully integrated in a clinical setting. Herein we present the first results of applying algorithm-based detection to the diagnosis of postoperative acute kidney injury (AKI), using patient data from a cardiac surgical intensive care unit (ICU).

Methods: First, we generated a well-defined study population of cardiac surgical ICU patients by implementing an application programming interface (API) to extract, clean and select relevant data from the archived digital patient management system. Health records of N = 21,045 adult patients admitted to the ICU following cardiac surgery between 2012 and 2022 were analyzed. Secondly, we developed a software functionality to detect the incidence of AKI according to Kidney Disease: Improving Global Outcomes (KDIGO) criteria, including urine output. Incidence, severity, and temporal evolution of AKI were assessed.
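A staging rule of this kind can be sketched from the published KDIGO thresholds for serum creatinine and urine output. This is a simplified illustration, not the authors' software: it assumes the caller has already summarized each patient into the inputs below, it ignores the guideline's time windows for the creatinine rise and the renal-replacement-therapy criterion, and the function name and parameters are hypothetical.

```python
def kdigo_stage(scr_ratio: float, scr_rise_mg_dl: float, scr_mg_dl: float,
                hours_uo_below_0_5: float, hours_uo_below_0_3: float,
                hours_anuric: float) -> int:
    """Return the AKI stage (0 = no AKI) as the worse of two criteria.

    scr_ratio           current serum creatinine divided by baseline
    scr_rise_mg_dl      absolute rise over baseline in mg/dL
    scr_mg_dl           current serum creatinine in mg/dL
    hours_uo_below_0_5  consecutive hours with urine output < 0.5 mL/kg/h
    hours_uo_below_0_3  consecutive hours with urine output < 0.3 mL/kg/h
    hours_anuric        consecutive hours without urine output
    """
    meets_definition = scr_ratio >= 1.5 or scr_rise_mg_dl >= 0.3

    # Serum creatinine criterion.
    if scr_ratio >= 3.0 or (scr_mg_dl >= 4.0 and meets_definition):
        scr_stage = 3
    elif scr_ratio >= 2.0:
        scr_stage = 2
    elif meets_definition:
        scr_stage = 1
    else:
        scr_stage = 0

    # Urine output criterion.
    if hours_uo_below_0_3 >= 24 or hours_anuric >= 12:
        uo_stage = 3
    elif hours_uo_below_0_5 >= 12:
        uo_stage = 2
    elif hours_uo_below_0_5 >= 6:
        uo_stage = 1
    else:
        uo_stage = 0

    return max(scr_stage, uo_stage)


# Creatinine barely changed, but 13 hours of low urine output gives stage 2.
print(kdigo_stage(1.2, 0.1, 1.0, 13, 0, 0))  # 2
```

Taking the maximum of the two criteria is what lets urine output alone trigger a detection, as in the example call.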

Results: With the use of our automated data analysis model, the overall incidence of postoperative AKI was 65.4% (N = 13,755). Divided by stages, AKI 2 was the most frequent maximum disease stage with 30.5% of patients (stage 1 in 17.6%, stage 3 in 17.2%). We observed considerable temporal divergence between first detections and maximum AKI stages: 51% of patients developed AKI stage 2 or 3 after a previously identified lower stage. Length of ICU stay was significantly prolonged in AKI patients (8.8 vs. 6.6 days, p < 0.001) and increased for higher AKI stages up to 10.1 days on average. In terms of AKI criteria, urine output proved to be most relevant, contributing to detection in 87.3% (N = 12,004) of cases.

Conclusion: The incidence of postoperative AKI following cardiac surgery is strikingly high at 65.4% when using full KDIGO criteria including urine output. Automated data analysis demonstrated reliable early detection of AKI with progressive deterioration of renal function in the majority of patients, therefore allowing for potential earlier therapeutic intervention for preventing or lessening disease progression, reducing the length of ICU stay, and ultimately improving overall patient outcomes.

Citations: 0
Clinical assistant decision-making model of tuberculosis based on electronic health records.
IF 4.5 Zone 3 Biology Q1 Mathematics Pub Date: 2023-03-16 DOI: 10.1186/s13040-023-00328-y
Mengying Wang, Cuixia Lee, Zhenhao Wei, Hong Ji, Yingyun Yang, Cheng Yang

Background: Tuberculosis is a dangerous infectious disease with the largest number of reported cases in China every year. Preventing missed diagnosis has an important impact on the prevention, treatment, and recovery of tuberculosis. The earliest pulmonary tuberculosis prediction models mainly used traditional image data combined with neural network models. However, a single data source tends to miss important information, such as primary symptoms and laboratory test results, that is available in multi-source data like medical records and tests. In this study, we propose a multi-stream integrated pulmonary tuberculosis diagnosis model based on structured and unstructured multi-source data from electronic health records. With the limited number of lung specialists and the high prevalence of tuberculosis, the application of this auxiliary diagnosis model can make substantial contributions to clinical settings.

Methods: The subjects were patients at the respiratory department and infectious cases department of a large comprehensive hospital in China between 2015 and 2020. A total of 95,294 medical records were selected through a quality control process. Each record contains structured and unstructured data. First, numerical expressions of features for structured data were created. Then, feature engineering was performed through a decision tree model, random forest, and GBDT. Features were included in the feature exclusion set as per their weights in descending order. When the importance of the set was higher than 0.7, this process was concluded. Finally, the contained features were used for model training. In addition, the unstructured free-text data was segmented at the character level and input into the model after indexing. Tuberculosis prediction was conducted through a multi-stream integration tuberculosis diagnosis model (MSI-PTDM), and the evaluation indices of accuracy, AUC, sensitivity, and specificity were compared against the prediction results of XGBoost, Text-CNN, Random Forest, SVM, and so on.
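The selection step reads as a cumulative-importance cutoff: features are taken in descending order of weight until the accumulated importance of the set exceeds 0.7. The abstract's wording is ambiguous, so this is one possible reading rather than the authors' procedure; the weights below are invented for illustration and the function name is hypothetical.

```python
from typing import Dict, List


def select_by_cumulative_importance(weights: Dict[str, float],
                                    threshold: float = 0.7) -> List[str]:
    """Take features by descending weight until their normalized
    cumulative importance exceeds the threshold."""
    total = sum(weights.values())
    selected: List[str] = []
    cumulative = 0.0
    for name, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(name)
        cumulative += weight / total
        if cumulative > threshold:
            break  # the set is now "important enough"; stop adding features
    return selected


# Invented importances; a real run would take them from a fitted tree model.
weights = {"hemoptysis": 0.40, "cough": 0.25, "esr": 0.15, "age": 0.12, "fever": 0.08}
print(select_by_cumulative_importance(weights))  # ['hemoptysis', 'cough', 'esr']
```

With these weights the running totals are 0.40, 0.65, and 0.80, so the loop stops after the third feature.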

Results: Through a variety of feature engineering methods, 20 features, such as chief complaint of hemoptysis, cough, and erythrocyte sedimentation rate test, were selected, and the influencing factors were analyzed using the Chinese diagnostic standard of pulmonary tuberculosis. The area under the curve values for MSI-PTDM, XGBoost, Text-CNN, RF, and SVM were 0.9858, 0.9571, 0.9486, 0.9428, and 0.9429, respectively. The sensitivity, specificity, and accuracy of MSI-PTDM were 93.18%, 96.96%, and 96.96%, respectively. The MSI-PTDM prediction model was installed at a doctor workstation and operated in a real clinic environment for 4 months. A total of 692,949 patients were monitored, including 484 patients with confirmed pulmonary tuberculosis. The model predicted 440 cases of pulmonary tuberculosis. The positive sample recognition rate was 90.91%, the false-positive rate was 9.09%, the negative sample recognition rate was 96.17%, and the false-negative rate was 3.83%.

Conclusion: MSI-PTDM can process sparse data, dense data, and unstructured text data simultaneously. The model adds a feature-domain vector embedding for sparse medical features, representing single-valued sparse vectors as multi-dimensional dense hidden vectors, which both enhances feature expression and alleviates the side effects of sparsity on model training. However, information may be lost when features are extracted from text; adding the processing of the original unstructured text compensates to some extent for errors in that step, allowing the model to learn from the data more comprehensively and effectively. In addition, MSI-PTDM allows interaction between features, considers the combined effects of patient features, and adds more complex nonlinear computation, improving the model's learning ability. It has been validated on a test set and through deployment in a real outpatient environment.
Citations: 0
Genetics and precision health: the ecological fallacy and artificial intelligence solutions.
IF 4.5 Zone 3 Biology Q1 Mathematics Pub Date: 2023-03-13 DOI: 10.1186/s13040-023-00327-z
Scott M Williams, Jason H Moore
Citations: 0
Prediction of the risk of developing end-stage renal diseases in newly diagnosed type 2 diabetes mellitus using artificial intelligence algorithms.
IF 4.5 Zone 3 Biology Q1 Mathematics Pub Date: 2023-03-10 DOI: 10.1186/s13040-023-00324-2
Shuo-Ming Ou, Ming-Tsun Tsai, Kuo-Hua Lee, Wei-Cheng Tseng, Chih-Yu Yang, Tz-Heng Chen, Pin-Jie Bin, Tzeng-Ji Chen, Yao-Ping Lin, Wayne Huey-Herng Sheu, Yuan-Chia Chu, Der-Cherng Tarng

Objectives: Type 2 diabetes mellitus (T2DM) imposes a great burden on healthcare systems, and these patients experience higher long-term risks for developing end-stage renal disease (ESRD). Managing diabetic nephropathy becomes more challenging when kidney function starts declining. Therefore, developing predictive models for the risk of developing ESRD in newly diagnosed T2DM patients may be helpful in clinical settings.

Methods: We established machine learning models constructed from a subset of clinical features collected from 53,477 newly diagnosed T2DM patients from January 2008 to December 2018 and then selected the best model. The cohort was divided, with 70% and 30% of patients randomly assigned to the training and testing sets, respectively.

Results: The discriminative ability of our machine learning models, including logistic regression, extra tree classifier, random forest, gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine, was evaluated across the cohort. XGBoost yielded the highest area under the receiver operating characteristic curve (AUC) of 0.953, followed by extra tree and GBDT, with AUC values of 0.952 and 0.938 on the testing dataset. The SHapley Additive exPlanations (SHAP) summary plot for the XGBoost model illustrated that the top five important features included baseline serum creatinine, mean serum creatinine within 1 year before the diagnosis of T2DM, high-sensitivity C-reactive protein, spot urine protein-to-creatinine ratio and female gender.
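The AUC used to rank the models has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute-force pairwise comparison on toy data; the labels, scores, and function name are illustrative and unrelated to the study's cohort.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0  # ties count as half a win
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise form is quadratic and meant for explanation only; for a cohort of tens of thousands of patients a rank-based implementation would be the practical choice.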

Conclusions: Because our machine learning prediction models were based on routinely collected clinical features, they can be used as risk assessment tools for developing ESRD. By identifying high-risk patients, intervention strategies may be provided at an early stage.

Citations: 5
Signature literature review reveals AHCY, DPYSL3, and NME1 as the most recurrent prognostic genes for neuroblastoma.
IF 4.5 Zone 3 Biology Q1 Mathematics Pub Date: 2023-03-04 DOI: 10.1186/s13040-023-00325-1
Davide Chicco, Tiziana Sanavia, Giuseppe Jurman

Neuroblastoma is a childhood neurological tumor which affects hundreds of thousands of children worldwide, and information about its prognosis can be pivotal for patients, their families, and clinicians. One of the main goals in the related bioinformatics analyses is to provide stable genetic signatures that include genes whose expression levels can effectively predict the prognosis of the patients. In this study, we collected the prognostic signatures for neuroblastoma published in the biomedical literature, and noticed that the most frequent genes among them were three: AHCY, DPYSL3, and NME1. We therefore investigated the prognostic power of these three genes by performing a survival analysis and a binary classification on multiple gene expression datasets of different groups of patients diagnosed with neuroblastoma. Finally, we discussed the main studies in the literature associating these three genes with neuroblastoma. Our results, in each of these three steps of validation, confirm the prognostic capability of AHCY, DPYSL3, and NME1, and highlight their key role in neuroblastoma prognosis. Our results can have an impact on neuroblastoma genetics research: biologists and medical researchers can pay more attention to the regulation and expression of these three genes in patients having neuroblastoma, and therefore can develop better cures and treatments which can save patients' lives.

Citations: 1
Ten simple rules for providing bioinformatics support within a hospital.
IF 4, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2023-02-23, DOI: 10.1186/s13040-023-00326-0
Davide Chicco, Giuseppe Jurman

Bioinformatics has become a key aspect of the biomedical research programmes of many hospitals' scientific centres, and the establishment of bioinformatics facilities within hospitals has become a common practice worldwide. Bioinformaticians working in these facilities provide computational biology support to medical doctors and principal investigators who deal daily with patient data to analyze. These bioinformatics analysts, although pivotal, usually do not receive formal training for this job. We therefore propose these ten simple rules to guide these bioinformaticians in their work: ten pieces of advice on how to provide bioinformatics support to medical doctors in hospitals. We believe these simple rules can help bioinformatics facility analysts produce better scientific results and work in a serene and fruitful environment.

Citations: 0
iU-Net: a hybrid structured network with a novel feature fusion approach for medical image segmentation.
IF 4.5, Zone 3 (Biology), Q1 Mathematics, Pub Date: 2023-02-21, DOI: 10.1186/s13040-023-00320-6
Yun Jiang, Jinkun Dong, Tongtong Cheng, Yuan Zhang, Xin Lin, Jing Liang

In recent years, convolutional neural networks (CNNs) have made great achievements in the field of medical image segmentation, especially fully convolutional neural networks based on U-shaped structures and skip connections. However, because of the inherent locality of convolution, CNN-based methods usually struggle to model long-range dependencies and are unable to extract large amounts of global contextual information, which deprives neural networks of the ability to adapt to different visual modalities. In this paper, we propose our own model, which is called iU-Net because its structure closely resembles the combination of i and U. iU-Net is a multiple encoder-decoder structure combining Swin Transformer and CNN. We use a hierarchical Swin Transformer structure with shifted windows as the primary encoder and convolution as the secondary encoder to complement the context information extracted by the primary encoder. To sufficiently fuse the feature information extracted from multiple encoders, we design a feature fusion module (W-FFM) based on wave function representation. In addition, a three-branch upsampling method (Tri-Upsample) has been developed to replace the patch expand in the Swin Transformer, which can effectively avoid the checkerboard artifacts caused by the patch expand. On the skin lesion region segmentation task, the segmentation performance of iU-Net is optimal, with Dice and IoU reaching 90.12% and 83.06%, respectively. To verify the generalization of iU-Net, we used the model trained on the ISIC2018 dataset to test on the PH2 dataset, and achieved 93.80% Dice and 88.74% IoU. On the lung field segmentation task, iU-Net achieved optimal results on IoU and Precision, reaching 98.54% and 94.35%, respectively. Extensive experiments demonstrate the segmentation performance and generalization ability of iU-Net.
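For readers unfamiliar with the two overlap metrics reported in this abstract, the following is a minimal sketch of how Dice and IoU are commonly computed for a single pair of binary masks. The function name `dice_and_iou` and the toy masks are assumptions for illustration, masks are flattened to lists of 0/1 labels, and the paper's own evaluation code (for example, how scores are averaged over images) may differ.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equal-length sequences of 0/1 pixel labels."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labelled foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, truth))
    pred_area = sum(pred)
    truth_area = sum(truth)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention, assumed here).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


# Toy 8-pixel masks, made up for illustration.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 0, 1, 1]
dice, iou = dice_and_iou(pred, truth)
print(f"Dice = {dice:.2f}, IoU = {iou:.2f}")  # Dice = 0.75, IoU = 0.60
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Scores averaged over a dataset need not satisfy this identity exactly.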

Citations: 0
The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification.
IF 4, Zone 3 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Pub Date: 2023-02-17, DOI: 10.1186/s13040-023-00322-4
Davide Chicco, Giuseppe Jurman

Binary classification is a common task for which machine learning and computational statistics are used, and the area under the receiver operating characteristic curve (ROC AUC) has become the common standard metric to evaluate binary classifications in most scientific fields. The ROC curve has true positive rate (also called sensitivity or recall) on the y axis and false positive rate on the x axis, and the ROC AUC can range from 0 (worst result) to 1 (perfect result). The ROC AUC, however, has several flaws and drawbacks. This score is generated including predictions that obtained insufficient sensitivity and specificity, and moreover it does not say anything about the positive predictive value (also known as precision) or the negative predictive value (NPV) obtained by the classifier, therefore potentially generating inflated, overoptimistic results. Since it is common to report ROC AUC alone, without precision and negative predictive value, a researcher might erroneously conclude that their classification was successful. Furthermore, a given point in the ROC space does not identify a single confusion matrix nor a group of matrices sharing the same MCC value. Indeed, a given (sensitivity, specificity) pair can cover a broad MCC range, which casts doubts on the reliability of ROC AUC as a performance measure. In contrast, the Matthews correlation coefficient (MCC) generates a high score in its $[-1, +1]$ interval only if the classifier scored a high value for all four basic rates of the confusion matrix: sensitivity, specificity, precision, and negative predictive value. A high MCC (for example, MCC $\geq$ 0.9), moreover, always corresponds to a high ROC AUC, and not vice versa. In this short study, we explain why the Matthews correlation coefficient should replace the ROC AUC as the standard statistic in all scientific studies involving a binary classification, in all scientific fields.
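The claim that one (sensitivity, specificity) pair can correspond to very different MCC values can be checked with a small worked example. The sketch below uses the standard MCC formula on two hypothetical confusion matrices chosen for illustration (they are not taken from the paper); the helper name `rates_and_mcc` is likewise an assumption.

```python
import math


def rates_and_mcc(tp, fn, tn, fp):
    """Return (sensitivity, specificity, precision, NPV, MCC) for one confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    npv = tn / (tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column sum is zero; returning 0.0 is a
    # common convention, assumed here.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, precision, npv, mcc


# Two hypothetical matrices with identical sensitivity and specificity (both 0.9)
# but different class balance.
balanced = rates_and_mcc(tp=90, fn=10, tn=90, fp=10)
imbalanced = rates_and_mcc(tp=9, fn=1, tn=900, fp=100)

for name, result in (("balanced", balanced), ("imbalanced", imbalanced)):
    sens, spec, prec, npv, mcc = result
    print(f"{name}: sens={sens:.2f} spec={spec:.2f} "
          f"precision={prec:.3f} NPV={npv:.3f} MCC={mcc:.3f}")
```

Both matrices map to the same point in ROC space, yet the balanced case gives MCC = 0.8 while the imbalanced case gives roughly 0.26, because its precision drops to about 0.08. This is the kind of failure that sensitivity and specificity alone cannot reveal.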

Citations: 0