Automated quantitative trait locus analysis (AutoQTL)
Pub Date: 2023-04-10 | DOI: 10.1186/s13040-023-00331-3
Philip J Freda, Attri Ghosh, Elizabeth Zhang, Tianhao Luo, Apurva S Chitre, Oksana Polesskaya, Celine L St Pierre, Jianjun Gao, Connor D Martin, Hao Chen, Angel G Garcia-Martinez, Tengfei Wang, Wenyan Han, Keita Ishiwari, Paul Meyer, Alexander Lamparelli, Christopher P King, Abraham A Palmer, Ruowang Li, Jason H Moore
Background: Quantitative Trait Locus (QTL) analysis and Genome-Wide Association Studies (GWAS) have the power to identify variants that capture significant levels of phenotypic variance in complex traits. However, effort and time are required to select the best methods and optimize parameters and pre-processing steps. Although machine learning approaches have been shown to greatly assist in optimization and data processing, applying them to QTL analysis and GWAS is challenging due to the complexity of large, heterogeneous datasets. Here, we describe a proof of concept for an automated machine learning approach, AutoQTL, with the ability to automate many complicated decisions related to analysis of complex traits and generate solutions to describe relationships that exist in genetic data.
Results: Using a publicly available dataset of 18 putative QTL from a large-scale GWAS of body mass index in the laboratory rat, Rattus norvegicus, AutoQTL captures the phenotypic variance explained under a standard additive model. AutoQTL also detects evidence of non-additive effects including deviations from additivity and 2-way epistatic interactions in simulated data via multiple optimal solutions. Additionally, feature importance metrics provide different insights into the inheritance models and predictive power of multiple GWAS-derived putative QTL.
Conclusions: This proof-of-concept illustrates that automated machine learning techniques can complement standard approaches and have the potential to detect both additive and non-additive effects via various optimal solutions and feature importance metrics. In the future, we aim to expand AutoQTL to accommodate omics-level datasets with intelligent feature selection and feature engineering strategies.
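As a brief illustration of the baseline that AutoQTL is compared against, the sketch below estimates the phenotypic variance explained by a set of putative QTL under a standard additive model, i.e., the R² of a linear regression of the phenotype on additively encoded genotypes (minor-allele counts of 0, 1, 2). The simulated data and variable names are assumptions for illustration only, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical example: 1,000 individuals and 18 biallelic loci encoded
# additively as minor-allele counts (0, 1, 2), mimicking GWAS-derived putative QTL.
genotypes = rng.integers(0, 3, size=(1000, 18)).astype(float)
true_effects = rng.normal(0.0, 0.3, size=18)
phenotype = genotypes @ true_effects + rng.normal(0.0, 1.0, size=1000)

# Variance explained under a standard additive model = R^2 of a linear
# regression of the phenotype on the additively encoded genotypes.
additive_model = LinearRegression().fit(genotypes, phenotype)
print(f"Phenotypic variance explained (additive model): {additive_model.score(genotypes, phenotype):.3f}")
```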
{"title":"Automated quantitative trait locus analysis (AutoQTL).","authors":"Philip J Freda, Attri Ghosh, Elizabeth Zhang, Tianhao Luo, Apurva S Chitre, Oksana Polesskaya, Celine L St Pierre, Jianjun Gao, Connor D Martin, Hao Chen, Angel G Garcia-Martinez, Tengfei Wang, Wenyan Han, Keita Ishiwari, Paul Meyer, Alexander Lamparelli, Christopher P King, Abraham A Palmer, Ruowang Li, Jason H Moore","doi":"10.1186/s13040-023-00331-3","DOIUrl":"https://doi.org/10.1186/s13040-023-00331-3","url":null,"abstract":"<p><strong>Background: </strong>Quantitative Trait Locus (QTL) analysis and Genome-Wide Association Studies (GWAS) have the power to identify variants that capture significant levels of phenotypic variance in complex traits. However, effort and time are required to select the best methods and optimize parameters and pre-processing steps. Although machine learning approaches have been shown to greatly assist in optimization and data processing, applying them to QTL analysis and GWAS is challenging due to the complexity of large, heterogenous datasets. Here, we describe proof-of-concept for an automated machine learning approach, AutoQTL, with the ability to automate many complicated decisions related to analysis of complex traits and generate solutions to describe relationships that exist in genetic data.</p><p><strong>Results: </strong>Using a publicly available dataset of 18 putative QTL from a large-scale GWAS of body mass index in the laboratory rat, Rattus norvegicus, AutoQTL captures the phenotypic variance explained under a standard additive model. AutoQTL also detects evidence of non-additive effects including deviations from additivity and 2-way epistatic interactions in simulated data via multiple optimal solutions. Additionally, feature importance metrics provide different insights into the inheritance models and predictive power of multiple GWAS-derived putative QTL.</p><p><strong>Conclusions: </strong>This proof-of-concept illustrates that automated machine learning techniques can complement standard approaches and have the potential to detect both additive and non-additive effects via various optimal solutions and feature importance metrics. In the future, we aim to expand AutoQTL to accommodate omics-level datasets with intelligent feature selection and feature engineering strategies.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"14"},"PeriodicalIF":4.5,"publicationDate":"2023-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10088184/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9633507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reference-free phylogeny from sequencing data
Pub Date: 2023-03-27 | DOI: 10.1186/s13040-023-00329-x
Petr Ryšavý, Filip Železný
Motivation: Clustering of genetic sequences is one of the key parts of bioinformatics analyses. The resulting phylogenetic trees are beneficial for solving many research questions, including tracing the history of species, studying past migrations, or tracing the source of a virus outbreak. At the same time, biologists increasingly provide data only in the raw form of reads or as contig-level assemblies. Therefore, tools that can process such data without supervision need to be developed.
Results: In this paper, we present a tool for reference-free phylogeny capable of handling data where no mature-level assembly is available. The tool allows distance calculation for raw reads, for contigs, and for a combination of the two. The tool provides an estimate of the Levenshtein distance between the sequences, which in turn estimates the number of mutations between the organisms. Compared to previous research, the novelty of the method lies in a newly proposed combination of the read and contig measures, a new method for read-contig mapping, and an efficient embedding of contigs.
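For readers unfamiliar with the metric being estimated, the snippet below is the exact dynamic-programming definition of the Levenshtein distance between two sequences; it is a reference definition only, not the paper's alignment-free estimator for reads and contigs.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("ACGTACGT", "ACGTTCGA"))  # two substitutions -> 2
```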
{"title":"Reference-free phylogeny from sequencing data.","authors":"Petr Ryšavý, Filip Železný","doi":"10.1186/s13040-023-00329-x","DOIUrl":"https://doi.org/10.1186/s13040-023-00329-x","url":null,"abstract":"<p><strong>Motivation: </strong>Clustering of genetic sequences is one of the key parts of bioinformatics analyses. Resulting phylogenetic trees are beneficial for solving many research questions, including tracing the history of species, studying migration in the past, or tracing a source of a virus outbreak. At the same time, biologists provide more data in the raw form of reads or only on contig-level assembly. Therefore, tools that are able to process those data without supervision need to be developed.</p><p><strong>Results: </strong>In this paper, we present a tool for reference-free phylogeny capable of handling data where no mature-level assembly is available. The tool allows distance calculation for raw reads, contigs, and the combination of the latter. The tool provides an estimation of the Levenshtein distance between the sequences, which in turn estimates the number of mutations between the organisms. Compared to the previous research, the novelty of the method lies in a newly proposed combination of the read and contig measures, a new method for read-contig mapping, and an efficient embedding of contigs.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"13"},"PeriodicalIF":4.5,"publicationDate":"2023-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10045052/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9256818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised encoding selection through ensemble pruning for biomedical classification
Pub Date: 2023-03-16 | DOI: 10.1186/s13040-022-00317-7
Sebastian Spänig, Alexander Michel, Dominik Heider
Background: Owing to the rising levels of multi-resistant pathogens, antimicrobial peptides, an alternative strategy to classic antibiotics, have received increasing attention. A crucial part of this effort is their costly identification and validation. With the ever-growing number of annotated peptides, researchers are leveraging artificial intelligence to circumvent cumbersome, wet-lab-based identification and to automate the detection of promising candidates. However, the prediction of a peptide's function is not limited to antimicrobial efficiency. To date, multiple studies have successfully classified additional properties, e.g., antiviral or cell-penetrating effects. In this light, ensemble classifiers are employed with the aim of further improving prediction. Although we recently presented a workflow to significantly diminish the initial encoding choice, a fully unsupervised encoding selection, considering various machine learning models, is still lacking.
Results: We developed a workflow that automatically selects encodings and generates classifier ensembles by employing sophisticated pruning methods. We observed that Pareto frontier pruning is a good method for creating encoding ensembles for the datasets at hand. In addition, encodings combined with the Decision Tree classifier as the base model are often superior. However, our results also demonstrate that none of the ensemble-building techniques is outstanding for all datasets.
Conclusion: The workflow conducts multiple pruning methods to evaluate ensemble classifiers composed of a wide range of peptide encodings and base models. Consequently, researchers can use the workflow for unsupervised encoding selection and ensemble creation. Ultimately, the extensible workflow can be used as a plugin for the PEPTIDE REACToR, further establishing it as a versatile tool in the domain.
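As a minimal sketch of the Pareto-frontier idea mentioned in the Results (not the published workflow itself), the snippet below keeps only those candidate encoding/base-model pairs that are not dominated on two objectives; the objective names and scores are assumptions for illustration.

```python
# Candidates are (name, accuracy, diversity); both objectives are maximized.
# Hypothetical scores; in the workflow these would come from cross-validation.
candidates = [
    ("aac+dt", 0.81, 0.42),
    ("ctd+dt", 0.79, 0.55),
    ("paac+dt", 0.83, 0.30),
    ("binary+dt", 0.74, 0.35),
]

def pareto_frontier(items):
    """Keep candidates that no other candidate dominates on both objectives."""
    frontier = []
    for name, acc, div in items:
        dominated = any(
            a >= acc and d >= div and (a > acc or d > div)
            for n, a, d in items if n != name
        )
        if not dominated:
            frontier.append((name, acc, div))
    return frontier

print(pareto_frontier(candidates))  # "binary+dt" is dominated and pruned
```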
{"title":"Unsupervised encoding selection through ensemble pruning for biomedical classification.","authors":"Sebastian Spänig, Alexander Michel, Dominik Heider","doi":"10.1186/s13040-022-00317-7","DOIUrl":"https://doi.org/10.1186/s13040-022-00317-7","url":null,"abstract":"<p><strong>Background: </strong>Owing to the rising levels of multi-resistant pathogens, antimicrobial peptides, an alternative strategy to classic antibiotics, got more attention. A crucial part is thereby the costly identification and validation. With the ever-growing amount of annotated peptides, researchers leverage artificial intelligence to circumvent the cumbersome, wet-lab-based identification and automate the detection of promising candidates. However, the prediction of a peptide's function is not limited to antimicrobial efficiency. To date, multiple studies successfully classified additional properties, e.g., antiviral or cell-penetrating effects. In this light, ensemble classifiers are employed aiming to further improve the prediction. Although we recently presented a workflow to significantly diminish the initial encoding choice, an entire unsupervised encoding selection, considering various machine learning models, is still lacking.</p><p><strong>Results: </strong>We developed a workflow, automatically selecting encodings and generating classifier ensembles by employing sophisticated pruning methods. We observed that the Pareto frontier pruning is a good method to create encoding ensembles for the datasets at hand. In addition, encodings combined with the Decision Tree classifier as the base model are often superior. However, our results also demonstrate that none of the ensemble building techniques is outstanding for all datasets.</p><p><strong>Conclusion: </strong>The workflow conducts multiple pruning methods to evaluate ensemble classifiers composed from a wide range of peptide encodings and base models. Consequently, researchers can use the workflow for unsupervised encoding selection and ensemble creation. Ultimately, the extensible workflow can be used as a plugin for the PEPTIDE REACToR, further establishing it as a versatile tool in the domain.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"10"},"PeriodicalIF":4.5,"publicationDate":"2023-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10018861/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9133013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Algorithm-based detection of acute kidney injury according to full KDIGO criteria including urine output following cardiac surgery: a descriptive analysis
Pub Date: 2023-03-16 | DOI: 10.1186/s13040-023-00323-3
Nico Schmid, Mihnea Ghinescu, Moritz Schanz, Micha Christ, Severin Schricker, Markus Ketteler, Mark Dominik Alscher, Ulrich Franke, Nora Goebel
Background: Automated data analysis and processing has the potential to assist, improve, and guide decision making in medical practice. However, it has not yet been fully integrated into clinical settings. Herein we present the first results of applying algorithm-based detection to the diagnosis of postoperative acute kidney injury (AKI), using patient data from a cardiac surgical intensive care unit (ICU).
Methods: First, we generated a well-defined study population of cardiac surgical ICU patients by implementing an application programming interface (API) to extract, clean and select relevant data from the archived digital patient management system. Health records of N = 21,045 adult patients admitted to the ICU following cardiac surgery between 2012 and 2022 were analyzed. Secondly, we developed a software functionality to detect the incidence of AKI according to Kidney Disease: Improving Global Outcomes (KDIGO) criteria, including urine output. Incidence, severity, and temporal evolution of AKI were assessed.
Results: Using our automated data analysis model, the overall incidence of postoperative AKI was 65.4% (N = 13,755). Divided by stage, AKI 2 was the most frequent maximum disease stage, seen in 30.5% of patients (stage 1 in 17.6%, stage 3 in 17.2%). We observed considerable temporal divergence between first detections and maximum AKI stages: 51% of patients developed AKI stage 2 or 3 after a previously identified lower stage. Length of ICU stay was significantly prolonged in AKI patients (8.8 vs. 6.6 days, p < 0.001) and increased for higher AKI stages, up to 10.1 days on average. In terms of AKI criteria, urine output proved to be most relevant, contributing to detection in 87.3% (N = 12,004) of cases.
Conclusion: The incidence of postoperative AKI following cardiac surgery is strikingly high at 65.4% when the full KDIGO criteria, including urine output, are applied. Automated data analysis demonstrated reliable early detection of AKI, with progressive deterioration of renal function in the majority of patients, therefore allowing for potentially earlier therapeutic intervention to prevent or lessen disease progression, reduce the length of ICU stay, and ultimately improve overall patient outcomes.
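To make the detection rule concrete, below is a heavily simplified, hedged sketch of KDIGO staging from the serum creatinine ratio and urine output; it is not the authors' software, and it omits creatinine ≥ 4.0 mg/dL, anuria, renal replacement therapy, and the rolling time-window logic that a real implementation on ICU data must handle.

```python
def kdigo_stage(scr_ratio: float, scr_increase_48h: float,
                uo_ml_kg_h: float, uo_low_hours: float) -> int:
    """Simplified KDIGO AKI stage (0 = no AKI) from creatinine and urine output.

    scr_ratio        -- current serum creatinine / baseline creatinine
    scr_increase_48h -- absolute creatinine increase within 48 h (mg/dL)
    uo_ml_kg_h       -- lowest sustained urine output (mL/kg/h)
    uo_low_hours     -- hours the urine output stayed below that level
    Omitted: creatinine >= 4.0 mg/dL, anuria, renal replacement therapy,
    and the rolling time windows needed on real ICU data.
    """
    stage = 0
    # Serum creatinine criteria
    if scr_ratio >= 3.0:
        stage = 3
    elif scr_ratio >= 2.0:
        stage = 2
    elif scr_ratio >= 1.5 or scr_increase_48h >= 0.3:
        stage = 1
    # Urine output criteria; the final stage is the worse of the two.
    if uo_ml_kg_h < 0.3 and uo_low_hours >= 24:
        stage = max(stage, 3)
    elif uo_ml_kg_h < 0.5 and uo_low_hours >= 12:
        stage = max(stage, 2)
    elif uo_ml_kg_h < 0.5 and uo_low_hours >= 6:
        stage = max(stage, 1)
    return stage

# Example: creatinine 1.6x baseline but oliguria for 14 h -> stage 2 by urine output.
print(kdigo_stage(scr_ratio=1.6, scr_increase_48h=0.2, uo_ml_kg_h=0.4, uo_low_hours=14))
```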
{"title":"Algorithm-based detection of acute kidney injury according to full KDIGO criteria including urine output following cardiac surgery: a descriptive analysis.","authors":"Nico Schmid, Mihnea Ghinescu, Moritz Schanz, Micha Christ, Severin Schricker, Markus Ketteler, Mark Dominik Alscher, Ulrich Franke, Nora Goebel","doi":"10.1186/s13040-023-00323-3","DOIUrl":"https://doi.org/10.1186/s13040-023-00323-3","url":null,"abstract":"<p><strong>Background: </strong>Automated data analysis and processing has the potential to assist, improve and guide decision making in medical practice. However, by now it has not yet been fully integrated in a clinical setting. Herein we present the first results of applying algorithm-based detection to the diagnosis of postoperative acute kidney injury (AKI) comprising patient data from a cardiac surgical intensive care unit (ICU).</p><p><strong>Methods: </strong>First, we generated a well-defined study population of cardiac surgical ICU patients by implementing an application programming interface (API) to extract, clean and select relevant data from the archived digital patient management system. Health records of N = 21,045 adult patients admitted to the ICU following cardiac surgery between 2012 and 2022 were analyzed. Secondly, we developed a software functionality to detect the incidence of AKI according to Kidney Disease: Improving Global Outcomes (KDIGO) criteria, including urine output. Incidence, severity, and temporal evolution of AKI were assessed.</p><p><strong>Results: </strong>With the use of our automated data analyzing model the overall incidence of postoperative AKI was 65.4% (N = 13,755). Divided by stages, AKI 2 was the most frequent maximum disease stage with 30.5% of patients (stage 1 in 17.6%, stage 3 in 17.2%). We observed considerable temporal divergence between first detections and maximum AKI stages: 51% of patients developed AKI stage 2 or 3 after a previously identified lower stage. Length of ICU stay was significantly prolonged in AKI patients (8.8 vs. 6.6 days, p < 0.001) and increased for higher AKI stages up to 10.1 days on average. In terms of AKI criteria, urine output proved to be most relevant, contributing to detection in 87.3% (N = 12,004) of cases.</p><p><strong>Conclusion: </strong>The incidence of postoperative AKI following cardiac surgery is strikingly high with 65.4% when using full KDIGO-criteria including urine output. Automated data analysis demonstrated reliable early detection of AKI with progressive deterioration of renal function in the majority of patients, therefore allowing for potential earlier therapeutic intervention for preventing or lessening disease progression, reducing the length of ICU stay, and ultimately improving overall patient outcomes.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"12"},"PeriodicalIF":4.5,"publicationDate":"2023-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10022284/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9138603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clinical assistant decision-making model of tuberculosis based on electronic health records
Pub Date: 2023-03-16 | DOI: 10.1186/s13040-023-00328-y
Mengying Wang, Cuixia Lee, Zhenhao Wei, Hong Ji, Yingyun Yang, Cheng Yang
<p><strong>Background: </strong>Tuberculosis is a dangerous infectious disease with the largest number of reported cases in China every year. Preventing missed diagnosis has an important impact on the prevention, treatment, and recovery of tuberculosis. The earliest pulmonary tuberculosis prediction models mainly used traditional image data combined with neural network models. However, a single data source tends to miss important information, such as primary symptoms and laboratory test results, that is available in multi-source data like medical records and tests. In this study, we propose a multi-stream integrated pulmonary tuberculosis diagnosis model based on structured and unstructured multi-source data from electronic health records. With the limited number of lung specialists and the high prevalence of tuberculosis, the application of this auxiliary diagnosis model can make substantial contributions to clinical settings.</p><p><strong>Methods: </strong>The subjects were patients at the respiratory department and infectious cases department of a large comprehensive hospital in China between 2015 to 2020. A total of 95,294 medical records were selected through a quality control process. Each record contains structured and unstructured data. First, numerical expressions of features for structured data were created. Then, feature engineering was performed through decision tree model, random forest, and GBDT. Features were included in the feature exclusion set as per their weights in descending order. When the importance of the set was higher than 0.7, this process was concluded. Finally, the contained features were used for model training. In addition, the unstructured free-text data was segmented at the character level and input into the model after indexing. Tuberculosis prediction was conducted through a multi-stream integration tuberculosis diagnosis model (MSI-PTDM), and the evaluation indices of accuracy, AUC, sensitivity, and specificity were compared against the prediction results of XGBoost, Text-CNN, Random Forest, SVM, and so on.</p><p><strong>Results: </strong>Through a variety of characteristic engineering methods, 20 characteristic factors, such as main complaint hemoptysis, cough, and test erythrocyte sedimentation rate, were selected, and the influencing factors were analyzed using the Chinese diagnostic standard of pulmonary tuberculosis. The area under the curve values for MSI-PTDM, XGBoost, Text-CNN, RF, and SVM were 0.9858, 0.9571, 0.9486, 0.9428, and 0.9429, respectively. The sensitivity, specificity, and accuracy of MSI-PTDM were 93.18%, 96.96%, and 96.96%, respectively. The MSI-PTDM prediction model was installed at a doctor workstation and operated in a real clinic environment for 4 months. A total of 692,949 patients were monitored, including 484 patients with confirmed pulmonary tuberculosis. The model predicted 440 cases of pulmonary tuberculosis. The positive sample recognition rate was 90.91%, the false-positive ra
{"title":"Clinical assistant decision-making model of tuberculosis based on electronic health records.","authors":"Mengying Wang, Cuixia Lee, Zhenhao Wei, Hong Ji, Yingyun Yang, Cheng Yang","doi":"10.1186/s13040-023-00328-y","DOIUrl":"https://doi.org/10.1186/s13040-023-00328-y","url":null,"abstract":"<p><strong>Background: </strong>Tuberculosis is a dangerous infectious disease with the largest number of reported cases in China every year. Preventing missed diagnosis has an important impact on the prevention, treatment, and recovery of tuberculosis. The earliest pulmonary tuberculosis prediction models mainly used traditional image data combined with neural network models. However, a single data source tends to miss important information, such as primary symptoms and laboratory test results, that is available in multi-source data like medical records and tests. In this study, we propose a multi-stream integrated pulmonary tuberculosis diagnosis model based on structured and unstructured multi-source data from electronic health records. With the limited number of lung specialists and the high prevalence of tuberculosis, the application of this auxiliary diagnosis model can make substantial contributions to clinical settings.</p><p><strong>Methods: </strong>The subjects were patients at the respiratory department and infectious cases department of a large comprehensive hospital in China between 2015 to 2020. A total of 95,294 medical records were selected through a quality control process. Each record contains structured and unstructured data. First, numerical expressions of features for structured data were created. Then, feature engineering was performed through decision tree model, random forest, and GBDT. Features were included in the feature exclusion set as per their weights in descending order. When the importance of the set was higher than 0.7, this process was concluded. Finally, the contained features were used for model training. In addition, the unstructured free-text data was segmented at the character level and input into the model after indexing. Tuberculosis prediction was conducted through a multi-stream integration tuberculosis diagnosis model (MSI-PTDM), and the evaluation indices of accuracy, AUC, sensitivity, and specificity were compared against the prediction results of XGBoost, Text-CNN, Random Forest, SVM, and so on.</p><p><strong>Results: </strong>Through a variety of characteristic engineering methods, 20 characteristic factors, such as main complaint hemoptysis, cough, and test erythrocyte sedimentation rate, were selected, and the influencing factors were analyzed using the Chinese diagnostic standard of pulmonary tuberculosis. The area under the curve values for MSI-PTDM, XGBoost, Text-CNN, RF, and SVM were 0.9858, 0.9571, 0.9486, 0.9428, and 0.9429, respectively. The sensitivity, specificity, and accuracy of MSI-PTDM were 93.18%, 96.96%, and 96.96%, respectively. The MSI-PTDM prediction model was installed at a doctor workstation and operated in a real clinic environment for 4 months. A total of 692,949 patients were monitored, including 484 patients with confirmed pulmonary tuberculosis. The model predicted 440 cases of pulmonary tuberculosis. 
The positive sample recognition rate was 90.91%, the false-positive ra","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"11"},"PeriodicalIF":4.5,"publicationDate":"2023-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10022184/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9145782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
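The cumulative-importance feature selection described in the Methods above can be sketched as follows; this is an assumed illustration with synthetic data and scikit-learn's GBDT, not the study's implementation or its EHR features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-in for structured EHR features (the study used real records).
X, y = make_classification(n_samples=2000, n_features=40, n_informative=8, random_state=0)
gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by importance and keep them, in descending order, until the
# cumulative importance of the kept set exceeds the 0.7 threshold.
order = np.argsort(gbdt.feature_importances_)[::-1]
cumulative = np.cumsum(gbdt.feature_importances_[order])
selected = order[: int(np.searchsorted(cumulative, 0.7)) + 1]
print(f"Selected {len(selected)} of {X.shape[1]} features:", selected.tolist())
```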
Genetics and precision health: the ecological fallacy and artificial intelligence solutions
Pub Date: 2023-03-13 | DOI: 10.1186/s13040-023-00327-z
Scott M Williams, Jason H Moore
{"title":"Genetics and precision health: the ecological fallacy and artificial intelligence solutions.","authors":"Scott M Williams, Jason H Moore","doi":"10.1186/s13040-023-00327-z","DOIUrl":"https://doi.org/10.1186/s13040-023-00327-z","url":null,"abstract":"","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"9"},"PeriodicalIF":4.5,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10018838/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9133009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prediction of the risk of developing end-stage renal diseases in newly diagnosed type 2 diabetes mellitus using artificial intelligence algorithms
Pub Date: 2023-03-10 | DOI: 10.1186/s13040-023-00324-2
Shuo-Ming Ou, Ming-Tsun Tsai, Kuo-Hua Lee, Wei-Cheng Tseng, Chih-Yu Yang, Tz-Heng Chen, Pin-Jie Bin, Tzeng-Ji Chen, Yao-Ping Lin, Wayne Huey-Herng Sheu, Yuan-Chia Chu, Der-Cherng Tarng
Objectives: Type 2 diabetes mellitus (T2DM) imposes a great burden on healthcare systems, and these patients experience higher long-term risks for developing end-stage renal disease (ESRD). Managing diabetic nephropathy becomes more challenging when kidney function starts declining. Therefore, developing predictive models for the risk of developing ESRD in newly diagnosed T2DM patients may be helpful in clinical settings.
Methods: We established machine learning models constructed from a subset of clinical features collected from 53,477 newly diagnosed T2DM patients from January 2008 to December 2018 and then selected the best model. The cohort was divided, with 70% and 30% of patients randomly assigned to the training and testing sets, respectively.
Results: The discriminative ability of our machine learning models, including logistic regression, extra tree classifier, random forest, gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine, was evaluated across the cohort. XGBoost yielded the highest area under the receiver operating characteristic curve (AUC) of 0.953, followed by extra tree and GBDT, with AUC values of 0.952 and 0.938 on the testing dataset. The SHapley Additive exPlanations (SHAP) summary plot for the XGBoost model illustrated that the top five important features included baseline serum creatinine, mean serum creatinine within 1 year before the diagnosis of T2DM, high-sensitivity C-reactive protein, spot urine protein-to-creatinine ratio, and female gender.
Conclusions: Because our machine learning prediction models were based on routinely collected clinical features, they can be used as risk assessment tools for developing ESRD. By identifying high-risk patients, intervention strategies may be provided at an early stage.
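A minimal sketch of the kind of pipeline summarized above (70/30 split, XGBoost classifier, AUC on the test set, SHAP summary plot) is shown below; the synthetic data, class imbalance, and hyperparameters are assumptions for illustration, not the study's configuration.

```python
import shap
import xgboost
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for routinely collected clinical features of newly
# diagnosed T2DM patients; the study used 53,477 real patient records.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = xgboost.XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)
print(f"Test AUC: {roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]):.3f}")

# SHAP summary plot of per-feature contributions, as used to rank the top predictors.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```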
{"title":"Prediction of the risk of developing end-stage renal diseases in newly diagnosed type 2 diabetes mellitus using artificial intelligence algorithms.","authors":"Shuo-Ming Ou, Ming-Tsun Tsai, Kuo-Hua Lee, Wei-Cheng Tseng, Chih-Yu Yang, Tz-Heng Chen, Pin-Jie Bin, Tzeng-Ji Chen, Yao-Ping Lin, Wayne Huey-Herng Sheu, Yuan-Chia Chu, Der-Cherng Tarng","doi":"10.1186/s13040-023-00324-2","DOIUrl":"https://doi.org/10.1186/s13040-023-00324-2","url":null,"abstract":"<p><strong>Objectives: </strong>Type 2 diabetes mellitus (T2DM) imposes a great burden on healthcare systems, and these patients experience higher long-term risks for developing end-stage renal disease (ESRD). Managing diabetic nephropathy becomes more challenging when kidney function starts declining. Therefore, developing predictive models for the risk of developing ESRD in newly diagnosed T2DM patients may be helpful in clinical settings.</p><p><strong>Methods: </strong>We established machine learning models constructed from a subset of clinical features collected from 53,477 newly diagnosed T2DM patients from January 2008 to December 2018 and then selected the best model. The cohort was divided, with 70% and 30% of patients randomly assigned to the training and testing sets, respectively.</p><p><strong>Results: </strong>The discriminative ability of our machine learning models, including logistic regression, extra tree classifier, random forest, gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine were evaluated across the cohort. XGBoost yielded the highest area under the receiver operating characteristic curve (AUC) of 0.953, followed by extra tree and GBDT, with AUC values of 0.952 and 0.938 on the testing dataset. The SHapley Additive explanation summary plot in the XGBoost model illustrated that the top five important features included baseline serum creatinine, mean serum creatine within 1 year before the diagnosis of T2DM, high-sensitivity C-reactive protein, spot urine protein-to-creatinine ratio and female gender.</p><p><strong>Conclusions: </strong>Because our machine learning prediction models were based on routinely collected clinical features, they can be used as risk assessment tools for developing ESRD. By identifying high-risk patients, intervention strategies may be provided at an early stage.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"8"},"PeriodicalIF":4.5,"publicationDate":"2023-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10007785/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9105623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Signature literature review reveals AHCY, DPYSL3, and NME1 as the most recurrent prognostic genes for neuroblastoma
Pub Date: 2023-03-04 | DOI: 10.1186/s13040-023-00325-1
Davide Chicco, Tiziana Sanavia, Giuseppe Jurman
Neuroblastoma is a childhood neurological tumor which affects hundreds of thousands of children worldwide, and information about its prognosis can be pivotal for patients, their families, and clinicians. One of the main goals of the related bioinformatics analyses is to provide stable genetic signatures that include genes whose expression levels are effective in predicting patient prognosis. In this study, we collected the prognostic signatures for neuroblastoma published in the biomedical literature and noticed that three genes were the most frequently present among them: AHCY, DPYSL3, and NME1. We therefore investigated the prognostic power of these three genes by performing a survival analysis and a binary classification on multiple gene expression datasets of different groups of patients diagnosed with neuroblastoma. Finally, we discussed the main studies in the literature associating these three genes with neuroblastoma. Our results, in each of these three steps of validation, confirm the prognostic capability of AHCY, DPYSL3, and NME1, and highlight their key role in neuroblastoma prognosis. Our results can have an impact on neuroblastoma genetics research: biologists and medical researchers can pay more attention to the regulation and expression of these three genes in patients with neuroblastoma, and can therefore develop better cures and treatments which can save patients' lives.
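As a hedged sketch of the single-gene survival analysis described above (not the authors' code or data), the snippet below splits patients by the median expression of one gene and compares Kaplan-Meier survival between the two groups with a log-rank test, using the lifelines library and synthetic data.

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)

# Synthetic stand-in for one gene (e.g., NME1) in a neuroblastoma cohort.
df = pd.DataFrame({
    "expression": rng.normal(size=200),
    "time_months": rng.exponential(60.0, size=200),
    "event": rng.integers(0, 2, size=200),
})
high = df["expression"] > df["expression"].median()

# Kaplan-Meier fits for the high- and low-expression groups.
kmf_high = KaplanMeierFitter().fit(df.loc[high, "time_months"], df.loc[high, "event"], label="high")
kmf_low = KaplanMeierFitter().fit(df.loc[~high, "time_months"], df.loc[~high, "event"], label="low")
print(kmf_high.median_survival_time_, kmf_low.median_survival_time_)

# Log-rank test for a difference in survival between the two groups.
result = logrank_test(
    df.loc[high, "time_months"], df.loc[~high, "time_months"],
    event_observed_A=df.loc[high, "event"], event_observed_B=df.loc[~high, "event"],
)
print(f"log-rank p-value: {result.p_value:.3f}")
```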
{"title":"Signature literature review reveals AHCY, DPYSL3, and NME1 as the most recurrent prognostic genes for neuroblastoma.","authors":"Davide Chicco, Tiziana Sanavia, Giuseppe Jurman","doi":"10.1186/s13040-023-00325-1","DOIUrl":"https://doi.org/10.1186/s13040-023-00325-1","url":null,"abstract":"<p><p>Neuroblastoma is a childhood neurological tumor which affects hundreds of thousands of children worldwide, and information about its prognosis can be pivotal for patients, their families, and clinicians. One of the main goals in the related bioinformatics analyses is to provide stable genetic signatures able to include genes whose expression levels can be effective to predict the prognosis of the patients. In this study, we collected the prognostic signatures for neuroblastoma published in the biomedical literature, and noticed that the most frequent genes present among them were three: AHCY, DPYLS3, and NME1. We therefore investigated the prognostic power of these three genes by performing a survival analysis and a binary classification on multiple gene expression datasets of different groups of patients diagnosed with neuroblastoma. Finally, we discussed the main studies in the literature associating these three genes with neuroblastoma. Our results, in each of these three steps of validation, confirm the prognostic capability of AHCY, DPYLS3, and NME1, and highlight their key role in neuroblastoma prognosis. Our results can have an impact on neuroblastoma genetics research: biologists and medical researchers can pay more attention to the regulation and expression of these three genes in patients having neuroblastoma, and therefore can develop better cures and treatments which can save patients' lives.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"7"},"PeriodicalIF":4.5,"publicationDate":"2023-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9985261/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10280657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ten simple rules for providing bioinformatics support within a hospital
Pub Date: 2023-02-23 | DOI: 10.1186/s13040-023-00326-0
Davide Chicco, Giuseppe Jurman
Bioinformatics has become a key aspect of the biomedical research programmes of many hospitals' scientific centres, and the establishment of bioinformatics facilities within hospitals has become common practice worldwide. Bioinformaticians working in these facilities provide computational biology support to medical doctors and principal investigators who deal daily with patient data to analyze. These bioinformatics analysts, although pivotal, usually do not receive formal training for this job. We therefore propose ten simple rules to guide these bioinformaticians in their work: ten pieces of advice on how to provide bioinformatics support to medical doctors in hospitals. We believe these simple rules can help bioinformatics facility analysts produce better scientific results and work in a serene and fruitful environment.
{"title":"Ten simple rules for providing bioinformatics support within a hospital.","authors":"Davide Chicco, Giuseppe Jurman","doi":"10.1186/s13040-023-00326-0","DOIUrl":"10.1186/s13040-023-00326-0","url":null,"abstract":"<p><p>Bioinformatics has become a key aspect of the biomedical research programmes of many hospitals' scientific centres, and the establishment of bioinformatics facilities within hospitals has become a common practice worldwide. Bioinformaticians working in these facilities provide computational biology support to medical doctors and principal investigators who are daily dealing with data of patients to analyze. These bioinformatics analysts, although pivotal, usually do not receive formal training for this job. We therefore propose these ten simple rules to guide these bioinformaticians in their work: ten pieces of advice on how to provide bioinformatics support to medical doctors in hospitals. We believe these simple rules can help bioinformatics facility analysts in producing better scientific results and work in a serene and fruitful environment.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"6"},"PeriodicalIF":4.0,"publicationDate":"2023-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9948383/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9335813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
iU-Net: a hybrid structured network with a novel feature fusion approach for medical image segmentation
Pub Date: 2023-02-21 | DOI: 10.1186/s13040-023-00320-6
Yun Jiang, Jinkun Dong, Tongtong Cheng, Yuan Zhang, Xin Lin, Jing Liang
In recent years, convolutional neural networks (CNNs) have made great achievements in the field of medical image segmentation, especially fully convolutional neural networks based on U-shaped structures and skip connections. However, constrained by the inherent limitations of convolution, CNN-based methods usually struggle to model long-range dependencies and cannot extract large amounts of global contextual information, which deprives neural networks of the ability to adapt to different visual modalities. In this paper, we propose our own model, called iU-Net because its structure closely resembles the combination of i and U. iU-Net is a multiple encoder-decoder structure combining Swin Transformer and CNN. We use a hierarchical Swin Transformer structure with shifted windows as the primary encoder and convolution as the secondary encoder to complement the context information extracted by the primary encoder. To sufficiently fuse the feature information extracted from multiple encoders, we design a feature fusion module (W-FFM) based on wave function representation. In addition, a three-branch upsampling method (Tri-Upsample) is developed to replace patch expanding in the Swin Transformer, which effectively avoids the checkerboard artifacts caused by patch expanding. On the skin lesion region segmentation task, the segmentation performance of iU-Net is optimal, with Dice and IoU reaching 90.12% and 83.06%, respectively. To verify the generalization of iU-Net, we used the model trained on the ISIC2018 dataset to test on the PH2 dataset, achieving 93.80% Dice and 88.74% IoU. On the lung field segmentation task, iU-Net achieved optimal results on IoU and Precision, reaching 98.54% and 94.35%, respectively. Extensive experiments demonstrate the segmentation performance and generalization ability of iU-Net.
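For reference, the Dice and IoU scores reported above can be computed for binary segmentation masks as in the assumed sketch below; this is a generic metric definition, not the authors' evaluation code.

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    """Dice coefficient and IoU (Jaccard index) for binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    iou = (intersection + eps) / (union + eps)
    return float(dice), float(iou)

# Toy 4x4 masks standing in for a predicted and a ground-truth lesion region.
pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
target = np.array([[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
print(dice_and_iou(pred, target))  # Dice = 6/7 ~ 0.857, IoU = 3/4 = 0.75
```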
{"title":"iU-Net: a hybrid structured network with a novel feature fusion approach for medical image segmentation.","authors":"Yun Jiang, Jinkun Dong, Tongtong Cheng, Yuan Zhang, Xin Lin, Jing Liang","doi":"10.1186/s13040-023-00320-6","DOIUrl":"https://doi.org/10.1186/s13040-023-00320-6","url":null,"abstract":"<p><p>In recent years, convolutional neural networks (CNNs) have made great achievements in the field of medical image segmentation, especially full convolutional neural networks based on U-shaped structures and skip connections. However, limited by the inherent limitations of convolution, CNNs-based methods usually exhibit limitations in modeling long-range dependencies and are unable to extract large amounts of global contextual information, which deprives neural networks of the ability to adapt to different visual modalities. In this paper, we propose our own model, which is called iU-Net bacause its structure closely resembles the combination of i and U. iU-Net is a multiple encoder-decoder structure combining Swin Transformer and CNN. We use a hierarchical Swin Transformer structure with shifted windows as the primary encoder and convolution as the secondary encoder to complement the context information extracted by the primary encoder. To sufficiently fuse the feature information extracted from multiple encoders, we design a feature fusion module (W-FFM) based on wave function representation. Besides, a three branch up sampling method(Tri-Upsample) has developed to replace the patch expand in the Swin Transformer, which can effectively avoid the Checkerboard Artifacts caused by the patch expand. On the skin lesion region segmentation task, the segmentation performance of iU-Net is optimal, with Dice and Iou reaching 90.12% and 83.06%, respectively. To verify the generalization of iU-Net, we used the model trained on ISIC2018 dataset to test on PH2 dataset, and achieved 93.80% Dice and 88.74% IoU. On the lung feild segmentation task, the iU-Net achieved optimal results on IoU and Precision, reaching 98.54% and 94.35% respectively. Extensive experiments demonstrate the segmentation performance and generalization ability of iU-Net.</p>","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"16 1","pages":"5"},"PeriodicalIF":4.5,"publicationDate":"2023-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9942350/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10764875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}