Pub Date: 2024-09-05 DOI: 10.1186/s13040-024-00385-x
Panagiota I Kontou, Pantelis G Bagos
Genome-wide association studies (GWAS) have revolutionized our understanding of the genetic architecture of complex traits and diseases. GWAS summary statistics have become essential tools for various genetic analyses, including meta-analysis, fine-mapping, and risk prediction. However, the increasing number of GWAS summary statistics and the diversity of software tools available for their analysis can make it challenging for researchers to select the most appropriate tools for their specific needs. This systematic review aims to provide a comprehensive overview of the currently available software tools and databases for GWAS summary statistics analysis. We conducted a comprehensive literature search to identify relevant software tools and databases. We categorized the tools and databases by their functionality, including data management, quality control, single-trait analysis, and multiple-trait analysis. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a total of 305 functioning software tools and databases dedicated to GWAS summary statistics, each with unique strengths and limitations. We provide descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discuss the overall usability and applicability of each tool for different research scenarios. This comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases. By providing a detailed overview of the available tools and databases, we aim to facilitate informed tool selection and maximize the effectiveness of GWAS summary statistics analysis.
Title: The goldmine of GWAS summary statistics: a systematic review of methods and tools. Journal: Biodata Mining, vol. 17, no. 1, p. 31. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11375927/pdf/
Pub Date: 2024-09-04 DOI: 10.1186/s13040-024-00384-y
Junliang Zhu, Shaowei Pu, Jiaji He, Dongchao Su, Weijie Cai, Xueying Xu, Hongbo Liu
Objective: Data imbalance is a pervasive issue in medical data mining, often leading to biased and unreliable predictive models. This study aims to address the urgent need for effective strategies to mitigate the impact of data imbalance on classification models. We focus on quantifying the effects of different imbalance degrees and sample sizes on model performance, identifying optimal cut-off values, and evaluating the efficacy of various methods to enhance model accuracy in highly imbalanced and small sample size scenarios.
Methods: We collected medical records of patients receiving assisted reproductive treatment in a reproductive medicine center. Random forest was used to screen the key variables for the prediction target. Various datasets with different imbalance degrees and sample sizes were constructed to compare the classification performance of logistic regression models. Metrics such as AUC, G-mean, F1-Score, Accuracy, Recall, and Precision were used for evaluation. Four imbalance treatment methods (SMOTE, ADASYN, OSS, and CNN) were applied to datasets with low positive rates and small sample sizes to assess their effectiveness.
Results: The logistic model's performance was low when the positive rate was below 10% but stabilized beyond this threshold. Similarly, sample sizes below 1200 yielded poor results, with improvement seen above this threshold. For robustness, the optimal cut-offs for positive rate and sample size were identified as 15% and 1500, respectively. SMOTE and ADASYN oversampling significantly improved classification performance in datasets with low positive rates and small sample sizes.
Conclusions: The study identifies a positive rate of 15% and a sample size of 1500 as optimal cut-offs for stable logistic model performance. For datasets with low positive rates and small sample sizes, SMOTE and ADASYN are recommended to improve balance and model accuracy.
Title: Processing imbalanced medical data at the data level with assisted-reproduction data as an example. Journal: Biodata Mining, vol. 17, no. 1, p. 29. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11373105/pdf/
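The evaluation metrics named in the Methods above (Accuracy, Recall, Precision, F1-Score, G-mean) can be illustrated with a minimal sketch. The function name `imbalance_metrics` and the confusion-matrix counts below are hypothetical and are not taken from the study; the sketch only shows why accuracy alone can mislead when the positive rate is low.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute common classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0           # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # G-mean: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical data: 1000 records with a 5% positive rate.
# Model A predicts "negative" for every record.
all_negative = imbalance_metrics(tp=0, fp=0, tn=950, fn=50)
# Model B finds 40 of the 50 positives at the cost of 100 false alarms.
balanced = imbalance_metrics(tp=40, fp=100, tn=850, fn=10)

print(all_negative)
print(balanced)
```

In this toy case accuracy favours the model that never predicts a positive, while G-mean and F1 favour the model that actually detects positives, which is one reason such metrics are reported alongside accuracy for imbalanced data.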
Pub Date: 2024-09-04 DOI: 10.1186/s13040-024-00386-w
Bolin Chen, Jinlei Zhang, Ci Shao, Jun Bian, Ruiming Kang, Xuequn Shang
Background: Identifying critical genes is important for understanding the pathogenesis of complex diseases. Traditional studies typically compare changes in biomolecules between normal and disease samples, or detect important vertices in a single static biomolecular network, and so often overlook the dynamic changes that occur between disease stages. However, investigating temporal changes in biomolecular networks and identifying critical genes from them is essential for understanding the occurrence and development of diseases.
Methods: This study proposes a novel method called Quantifying Importance of Genes with Tensor Decomposition (QIGTD). It first constructs a time series network by integrating both intra- and inter-temporal network information, preserving connections between networks at adjacent stages according to their local similarities. A tensor describes the connections of this time series network, and a third-order tensor decomposition method is proposed to capture both the topological information of each network snapshot and the time series characteristics of the whole network. QIGTD is also a learning-free and efficient method that can be applied to datasets with a small number of samples.
Results: The effectiveness of QIGTD was evaluated on lung adenocarcinoma (LUAD) datasets, with three state-of-the-art methods (T-degree, T-closeness, and T-betweenness) as benchmarks. Numerical experiments demonstrate that QIGTD outperforms these methods in both precision and mAP. Notably, of the top 50 genes, 29 have been verified to be highly related to LUAD according to the DisGeNET database, and 36 are significantly enriched in LUAD-related Gene Ontology (GO) terms, including nuclear division, mitotic nuclear division, chromosome segregation, organelle fission, and mitotic sister chromatid segregation.
Conclusion: QIGTD effectively captures the temporal changes in gene networks and identifies critical genes. It provides a valuable tool for studying temporal dynamics in biological networks and can aid in understanding the underlying mechanisms of diseases such as LUAD.
Title: QIGTD: identifying critical genes in the evolution of lung adenocarcinoma with tensor decomposition. Journal: Biodata Mining, vol. 17, no. 1, p. 30. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11376055/pdf/
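The general idea of scoring genes by decomposing a genes × genes × stages tensor can be sketched as follows. This is not the QIGTD algorithm: the abstract does not give its construction or decomposition details, so the sketch substitutes a plain rank-1 approximation computed by alternating power iteration, and the four-gene, three-stage network is hypothetical toy data.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_tensor(n_genes, stage_edges):
    """Stack one symmetric adjacency matrix per stage into tensor[i][j][k]."""
    n_stages = len(stage_edges)
    tensor = [[[0.0] * n_stages for _ in range(n_genes)] for _ in range(n_genes)]
    for k, edges in enumerate(stage_edges):
        for i, j in edges:
            tensor[i][j][k] = 1.0
            tensor[j][i][k] = 1.0
    return tensor


def rank1_factors(tensor, iterations=100):
    """Rank-1 approximation of a 3-way tensor by alternating power iteration."""
    n, m, t = len(tensor), len(tensor[0]), len(tensor[0][0])
    a, b, c = normalize([1.0] * n), normalize([1.0] * m), normalize([1.0] * t)
    for _ in range(iterations):
        # Update each factor by contracting the tensor with the other two.
        a = normalize([sum(tensor[i][j][k] * b[j] * c[k]
                           for j in range(m) for k in range(t)) for i in range(n)])
        b = normalize([sum(tensor[i][j][k] * a[i] * c[k]
                           for i in range(n) for k in range(t)) for j in range(m)])
        c = normalize([sum(tensor[i][j][k] * a[i] * b[j]
                           for i in range(n) for j in range(m)) for k in range(t)])
    return a, b, c


# Hypothetical toy network: gene 0 is a hub in every stage.
stage_edges = [
    [(0, 1), (0, 2)],            # stage 0
    [(0, 1), (0, 2), (0, 3)],    # stage 1
    [(0, 1), (0, 3), (2, 3)],    # stage 2
]
tensor = build_tensor(4, stage_edges)
gene_scores, _, stage_weights = rank1_factors(tensor)
ranking = sorted(range(4), key=lambda g: gene_scores[g], reverse=True)
print(ranking)
```

The first factor gives one score per gene that reflects connectivity across all stages at once, and the third factor gives one weight per stage. Like the method described above, this sketch needs no training step, but it ignores the inter-temporal links between adjacent stages that QIGTD is described as preserving.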
Pub Date: 2024-09-03 DOI: 10.1186/s13040-024-00380-2
Vincenzo Bonnici, Davide Chicco
Pangenomics is a relatively new scientific field which investigates the union of all the genomes of a clade. The word pan means everything in ancient Greek; the term pangenomics originally referred to bacterial genomes and was later extended to human genomes as well. Modern bioinformatics offers several tools to analyze pangenomics data, paving the way to an emerging field that we can call computational pangenomics. The computational power currently available to the bioinformatics community has made computational pangenomic analyses easy to perform, but this greater accessibility also increases the chances of making mistakes and producing misleading or inflated results, especially among beginners. To handle this problem, we present here a few quick tips for efficient and correct computational pangenomic analyses, with a focus on bacterial pangenomics, describing common mistakes to avoid and established best practices to follow in this field. We believe our recommendations can help readers perform more robust and sound pangenomic analyses and generate more reliable results.
Title: Seven quick tips for gene-focused computational pangenomic analysis. Journal: Biodata Mining, vol. 17, no. 1, p. 28. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11370085/pdf/
Pub Date: 2024-08-28 DOI: 10.1186/s13040-024-00381-1
Luís B Elvas, Sara Gomes, João C Ferreira, Luís Brás Rosário, Tomás Brandão
Cardiovascular diseases are the main cause of death in the world, and cardiovascular imaging techniques are the mainstay of noninvasive diagnosis. Aortic stenosis is a lethal cardiac disease preceded for several years by aortic valve calcification. Data-driven tools developed with Deep Learning (DL) algorithms can process and categorize medical image data, providing fast diagnoses with considerable reliability, to improve healthcare effectiveness. A systematic review of DL applications on medical images for pathologic calcium detection concluded that there are established techniques in this field, primarily using CT scans, at the expense of radiation exposure. Echocardiography is an unexplored alternative for detecting calcium, but still needs technological developments. In this article, a fully automated method based on Convolutional Neural Networks (CNNs) was developed to detect aortic calcification in echocardiography images, consisting of two essential processes: (1) an object detector to locate the aortic valve, which achieved 95% precision and 100% recall; and (2) a classifier to identify calcium structures in the valve, which achieved 92% precision and 100% recall. This work makes it possible to automate the echocardiographic detection of aortic valve calcification, a lethal and prevalent disease.
Title: Deep learning for automatic calcium detection in echocardiography. Journal: Biodata Mining, vol. 17, no. 1, p. 27. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11351547/pdf/
Pub Date: 2024-08-27 DOI: 10.1186/s13040-024-00382-0
Yang Chen, Qingqing Zheng, Hui Wang, Peiren Tang, Li Deng, Pu Li, Huan Li, Jianhong Hou, Jie Li, Li Wang, Jun Peng
Background: In recent years, significant morbidity and mortality in patients with severe inflammatory bowel disease (IBD) and cytomegalovirus (CMV) have drawn considerable attention to the status of CMV infection in the intestinal mucosa of IBD patients and its role in disease progression. However, there are currently no high-throughput sequencing data for ulcerative colitis patients with CMV infection (CMV + UC), and the immune microenvironment in CMV + UC patients has yet to be explored.
Method: The xCell algorithm was used to evaluate the immune microenvironment of CMV + UC patients. Then, WGCNA was used to obtain the co-expression modules linking abnormal immune cells to expression at the gene level or protein level. Next, three machine learning approaches (Random Forest, SVM-RFE, and Lasso) were used to filter candidate biomarkers. Finally, the Best Subset Selection algorithm was used to construct the diagnostic model.
Results: In this study, we performed transcriptomic and proteomic sequencing on CMV + UC patients to establish a comprehensive immune microenvironment profile and found 11 specific abnormal immune cells in the CMV + UC group. After using multi-omics integration algorithms, we identified seven co-expression gene modules and five co-expression protein modules. Subsequently, we utilized various machine learning algorithms to identify key biomarkers with diagnostic efficacy and constructed an early diagnostic model. We identified a total of eight biomarkers (PPP1R12B, CIRBP, CSNK2A2, DNAJB11, PIK3R4, RRBP1, STX5, TMEM214) that play crucial roles in the immune microenvironment of CMV + UC and exhibit superior diagnostic performance for CMV + UC.
Conclusion: This eight-biomarker model offers a new paradigm for the diagnosis and treatment of IBD patients post-CMV infection. Further research into this model will be significant for understanding the changes in the host immune microenvironment following CMV infection.
Title: Integrating transcriptomics and proteomics to analyze the immune microenvironment of cytomegalovirus associated ulcerative colitis and identify relevant biomarkers. Journal: Biodata Mining, vol. 17, no. 1, p. 26. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11348729/pdf/
Pub Date: 2024-08-01 DOI: 10.1186/s13040-024-00378-w
Caroline König, Alfredo Vellido
The analysis of absorption, distribution, metabolism, and excretion (ADME) molecular properties is of relevance to drug design, as they directly influence the drug’s effectiveness at its target location. This study concerns their prediction, using explainable Machine Learning (ML) models. The aim of the study is to find which molecular features are relevant to the prediction of the different ADME properties and measure their impact on the predictive model. The relative relevance of individual features for ADME activity is gauged by estimating feature importance in ML models’ predictions. Feature importance is calculated using feature permutation and the individual impact of features is measured by SHAP additive explanations. The study reveals the relevance of specific molecular descriptors for each ADME property and quantifies their impact on the ADME property prediction. The reported research illustrates how explainable ML models can provide detailed insights about the individual contributions of molecular features to the final prediction of an ADME property, as an effort to support experts in the process of drug candidate selection through a better understanding of the impact of molecular features.
Title: Understanding predictions of drug profiles using explainable machine learning models. Journal: Biodata Mining. DOI URL: https://doi.org/10.1186/s13040-024-00378-w
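Feature permutation, one of the two techniques named in the abstract above, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the two-feature dataset, the `toy_model` predictor, and the use of R² as the score are all hypothetical, and SHAP is not covered here.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination (assumes y_true is not constant)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            score = r2_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical data: the target depends on feature 0 only.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]


def toy_model(row):
    """Stand-in for a fitted model; it ignores feature 1 entirely."""
    return 2.0 * row[0]


importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling a feature the model relies on degrades the score, while shuffling an ignored feature leaves it unchanged, so the size of the drop serves as a global relevance measure for that feature. Per-sample attributions such as SHAP values answer a different question: how much each feature contributed to one particular prediction.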
Pub Date: 2024-07-17 DOI: 10.1186/s13040-024-00375-z
Krishna Prasad, Allen Griffiths, Kavya Agrawal, Michael McEwan, Flavio Macci, Marco Ghisoni, Matthew Stopher, Matthew Napleton, Joel Strickland, David Keating, Thomas Whitehead, Gareth Conduit, Stacey Murray, Lauren Edward
Pharmacokinetic (PK) studies can provide essential information on the abuse liability of nicotine and tobacco products, but they are intrusive and must be conducted in a clinical environment. The objective of this study was to explore whether changes in plasma nicotine levels following use of an e-cigarette can be predicted from real-time monitoring, using wearable devices, of physiological parameters and mouth level exposure (MLE) to nicotine before, during, and after vaping. Such an approach would allow an effective pre-screening process, reducing the number of clinical studies, the number of products to be tested, and the number of blood draws required in a clinical PK study. Establishing such a prediction model might also facilitate the longitudinal collection of data on product use and nicotine expression among consumers using nicotine products in their normal environments, thereby reducing the need for intrusive clinical studies while generating PK data related to product use in the real world.
An exploratory machine learning model was developed to predict changes in plasma nicotine levels following the use of an e-cigarette from real-time monitoring of physiological parameters and MLE to nicotine before, during, and after vaping. This preliminary study identified key parameters, such as heart rate (HR), heart rate variability (HRV), and physiological stress (PS), that may act as predictors of an individual's plasma nicotine response (PK curve). Relative to per-participant baseline measurements, HR showed a significant increase for nicotine-containing e-liquids, and this increase was consistent across sessions (intra-participant). Imputing missing values and training the model on all data resulted in a 57% improvement over the original 'learning' data and achieved a median validation R² of 0.70.
The study is in its exploratory phase, with limitations including a small and non-diverse sample and reliance on data from a single e-cigarette product. These findings call for further research to validate the model and to enhance its generalisability and applicability in real-world settings. This study serves as a foundational step towards developing non-intrusive PK models for nicotine product use.
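As a hedged illustration of the reported metric, the sketch below computes the coefficient of determination (R²) for each validation fold and takes the median across folds. The abstract does not describe the validation scheme, so the fold structure and the function names `r_squared` and `median_validation_r2` are assumptions made for illustration, not the authors' pipeline.

```python
from statistics import mean, median


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    observed_mean = mean(observed)
    # Residual sum of squares: error left after applying the model.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - observed_mean) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


def median_validation_r2(folds):
    """Median R^2 over (observed, predicted) pairs, one pair per fold.

    The fold layout is hypothetical; the abstract does not specify it.
    """
    return median(r_squared(obs, pred) for obs, pred in folds)
```

A median is less sensitive than a mean to a single poorly predicted fold, which matters when the sample is small.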
{"title":"Modelling the nicotine pharmacokinetic profile for e-cigarettes using real time monitoring of consumers' physiological measurements and mouth level exposure.","authors":"Krishna Prasad, Allen Griffiths, Kavya Agrawal, Michael McEwan, Flavio Macci, Marco Ghisoni, Matthew Stopher, Matthew Napleton, Joel Strickland, David Keating, Thomas Whitehead, Gareth Conduit, Stacey Murray, Lauren Edward","doi":"10.1186/s13040-024-00375-z","DOIUrl":"10.1186/s13040-024-00375-z","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"24"},"PeriodicalIF":4.0,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11253374/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141635153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Patients with chronic conditions need multiple medications daily to manage their conditions. However, most patients have poor compliance, which reduces the effectiveness of treatment. To address these challenges, we establish a medication reminder system that intelligently generates a universal medication schedule (UMS), reminding patients with chronic diseases to take their medication accurately and improving the safety of home medication.
Methods: We designed a medication time constraint with one drug (MTCOD) for each drug and a medication time constraint with multi-drug (MTCMD) for each pair of drugs to better regulate the interval and timing of patients' medication. We then established a medication reminder system consisting of a cloud database of drug information, an operator terminal for medical staff, and a patient terminal.
Results: The cloud database contains 153,916 pharmaceutical products, 496,708 drug interaction records, and 153,390 pharmaceutical product-ingredient pairs. It holds 153,916 MTCOD entries and 8,552,712 MTCMD entries. An intelligent UMS medication reminder system was constructed. The system can read patients' prescription information and provide personalized medication guidance with a medication timeline for patients with chronic diseases. At the same time, patients can query medication information and receive remote pharmacy guidance in real time.
Conclusions: Overall, the medication reminder system provides intelligent medication reminders, automatic identification of drug interactions, and a monitoring system, which helps monitor the entire treatment process in patients with chronic diseases.
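To make the two constraint types concrete, here is a minimal sketch of checking a daily schedule against them. The abstract does not specify how MTCOD and MTCMD are encoded, so everything here is an assumption for illustration: MTCOD is treated as a minimum gap in minutes between doses of the same drug, MTCMD as a minimum gap between doses of two different drugs, and the drug names, table layout, and `find_violations` function are hypothetical.

```python
from itertools import combinations

# Hypothetical constraint tables (not from the paper's database).
# MTCOD: minimum minutes between two doses of the same drug.
MTCOD = {"drug_a": 480, "drug_b": 720}
# MTCMD: minimum minutes between doses of two different drugs.
MTCMD = {frozenset({"drug_a", "drug_b"}): 120}


def find_violations(schedule, mtcod, mtcmd):
    """Return every pair of doses that is scheduled too close together.

    schedule: list of (drug, minute_of_day) tuples for a single day.
    Gaps that wrap around midnight are ignored in this simplified sketch.
    """
    violations = []
    ordered = sorted(schedule, key=lambda dose: dose[1])
    for (drug1, time1), (drug2, time2) in combinations(ordered, 2):
        gap = time2 - time1
        if drug1 == drug2:
            required = mtcod.get(drug1, 0)  # same drug: MTCOD applies
        else:
            required = mtcmd.get(frozenset({drug1, drug2}), 0)  # pair: MTCMD
        if gap < required:
            violations.append((drug1, time1, drug2, time2, required))
    return violations
```

Using a `frozenset` as the pair key makes the lookup independent of drug order, so each pair needs only one table entry.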
{"title":"Construction and application of medication reminder system: intelligent generation of universal medication schedule.","authors":"Hangxing Huang, Lu Zhang, Yongyu Yang, Ling Huang, Xikui Lu, Jingyang Li, Huimin Yu, Shuqiao Cheng, Jian Xiao","doi":"10.1186/s13040-024-00376-y","DOIUrl":"10.1186/s13040-024-00376-y","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"23"},"PeriodicalIF":4.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11247871/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141621275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-12 DOI: 10.1186/s13040-024-00373-1
Mateja Napravnik, Franko Hržić, Sebastian Tschauner, Ivan Štajduhar
Background: The use of machine learning in medical diagnosis and treatment has grown significantly in recent years with the development of computer-aided diagnosis systems, often based on annotated medical radiology images. However, the lack of large annotated image datasets remains a major obstacle, as the annotation process is time-consuming and costly. This study aims to overcome this challenge by proposing an automated method for annotating a large database of medical radiology images based on their semantic similarity.
Results: An automated, unsupervised approach is used to create a large annotated dataset of medical radiology images originating from the Clinical Hospital Centre Rijeka, Croatia. The pipeline is built by data-mining three different types of medical data: images, DICOM metadata and narrative diagnoses. The optimal feature extractors are integrated into a multimodal representation, which is then clustered, yielding an automated pipeline that labels a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation.
Conclusions: The results indicate that fusing the embeddings of all three data sources provides the best results for unsupervised clustering of large-scale medical data and leads to the most concise clusters. Hence, this work marks the initial step towards building a much larger and more fine-grained annotated dataset of medical radiology images.
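The cluster-quality measures named above can be sketched as follows. This uses one common entropy-based definition of homogeneity (1 minus the conditional entropy of the reference labels given the clusters, divided by the entropy of the reference labels); the abstract does not state which formulation the authors used. The reference labels (for example, an anatomical region per image) and all function names are assumptions for illustration.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(classes, clusters):
    """H(classes | clusters): class uncertainty left once the cluster is known."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    return -sum(
        (count / n) * log(count / cluster_sizes[cluster])
        for (cluster, _), count in joint.items()
    )


def homogeneity(classes, clusters):
    """1.0 when every cluster contains members of a single class."""
    h_classes = entropy(classes)
    if h_classes == 0:
        return 1.0  # only one class present, so clusters are trivially pure
    return 1.0 - conditional_entropy(classes, clusters) / h_classes


def mutual_information(classes, clusters):
    """I(classes; clusters) = H(classes) - H(classes | clusters), in nats."""
    return entropy(classes) - conditional_entropy(classes, clusters)
```

Homogeneity rewards pure clusters but does not penalize splitting one class across many clusters, which is one reason to report it alongside mutual information.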
{"title":"Building RadiologyNET: an unsupervised approach to annotating a large-scale multimodal medical database.","authors":"Mateja Napravnik, Franko Hržić, Sebastian Tschauner, Ivan Štajduhar","doi":"10.1186/s13040-024-00373-1","DOIUrl":"10.1186/s13040-024-00373-1","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"22"},"PeriodicalIF":4.0,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11245804/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}