Pub Date: 2024-09-03 DOI: 10.1186/s13040-024-00380-2
Vincenzo Bonnici, Davide Chicco
Pangenomics is a relatively new scientific field that investigates the union of all the genomes of a clade. The word pan means everything in ancient Greek; the term pangenomics originally referred to the genomes of bacteria and was later extended to human genomes as well. Modern bioinformatics offers several tools to analyze pangenomics data, paving the way to an emerging field that we can call computational pangenomics. The computational power currently available to the bioinformatics community has made computational pangenomic analyses easy to perform, but this greater accessibility also increases the chance of making mistakes and of producing misleading or inflated results, especially for beginners. To handle this problem, we present here a few quick tips for efficient and correct computational pangenomic analyses with a focus on bacterial pangenomics, describing common mistakes to avoid and established best practices to follow in this field. We believe our recommendations can help readers perform more robust and sound pangenomic analyses and generate more reliable results.
{"title":"Seven quick tips for gene-focused computational pangenomic analysis.","authors":"Vincenzo Bonnici, Davide Chicco","doi":"10.1186/s13040-024-00380-2","DOIUrl":"10.1186/s13040-024-00380-2","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11370085/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142127084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-28 DOI: 10.1186/s13040-024-00381-1
Luís B Elvas, Sara Gomes, João C Ferreira, Luís Brás Rosário, Tomás Brandão
Cardiovascular diseases are the main cause of death in the world, and cardiovascular imaging techniques are the mainstay of noninvasive diagnosis. Aortic stenosis is a lethal cardiac disease preceded by aortic valve calcification for several years. Data-driven tools developed with Deep Learning (DL) algorithms can process and categorize medical image data, providing fast diagnoses with considerable reliability to improve healthcare effectiveness. A systematic review of DL applications on medical images for pathologic calcium detection concluded that there are established techniques in this field, primarily using CT scans, at the expense of radiation exposure. Echocardiography is an unexplored alternative for detecting calcium, but it still needs technological development. In this article, a fully automated method based on Convolutional Neural Networks (CNNs) was developed to detect aortic calcification in echocardiography images. It consists of two essential processes: (1) an object detector that locates the aortic valve, achieving 95% precision and 100% recall; and (2) a classifier that identifies calcium structures in the valve, achieving 92% precision and 100% recall. The outcome of this work is the possibility of automating the echocardiographic detection of aortic valve calcification, a lethal and prevalent disease.
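For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how precision and recall are computed from detection counts. The counts in the example are hypothetical and chosen only to reproduce a 95% / 100% split; they are not taken from the paper, and the function name is an illustrative assumption.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts (not from the paper): 19 correct detections,
# 1 false alarm, and no missed valves.
p, r = precision_recall(tp=19, fp=1, fn=0)
print(f"precision={p:.2f} recall={r:.2f}")
```

A recall of 100% means no true case was missed, while a precision below 100% means some detections were false alarms.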
{"title":"Deep learning for automatic calcium detection in echocardiography.","authors":"Luís B Elvas, Sara Gomes, João C Ferreira, Luís Brás Rosário, Tomás Brandão","doi":"10.1186/s13040-024-00381-1","DOIUrl":"10.1186/s13040-024-00381-1","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11351547/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142094005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-27 DOI: 10.1186/s13040-024-00382-0
Yang Chen, Qingqing Zheng, Hui Wang, Peiren Tang, Li Deng, Pu Li, Huan Li, Jianhong Hou, Jie Li, Li Wang, Jun Peng
Background: In recent years, significant morbidity and mortality in patients with severe inflammatory bowel disease (IBD) and cytomegalovirus (CMV) have drawn considerable attention to the status of CMV infection in the intestinal mucosa of IBD patients and its role in disease progression. However, there are currently no high-throughput sequencing data for ulcerative colitis patients with CMV infection (CMV + UC), and the immune microenvironment in CMV + UC patients has yet to be explored.
Method: The xCell algorithm was used to evaluate the immune microenvironment of CMV + UC patients. Then, WGCNA was used to obtain the co-expression modules between abnormal immune cells and the gene level or protein level. Next, three machine learning approaches (Random Forest, SVM-RFE, and Lasso) were used to filter candidate biomarkers. Finally, a Best Subset Selection algorithm was used to construct the diagnostic model.
Results: In this study, we performed transcriptomic and proteomic sequencing on CMV + UC patients to establish a comprehensive immune microenvironment profile and found 11 specific abnormal immune cell types in the CMV + UC group. After using multi-omics integration algorithms, we identified seven co-expression gene modules and five co-expression protein modules. Subsequently, we utilized various machine learning algorithms to identify key biomarkers with diagnostic efficacy and constructed an early diagnostic model. We identified a total of eight biomarkers (PPP1R12B, CIRBP, CSNK2A2, DNAJB11, PIK3R4, RRBP1, STX5, TMEM214) that play crucial roles in the immune microenvironment of CMV + UC and exhibit superior diagnostic performance for CMV + UC.
Conclusion: This eight-biomarker model offers a new paradigm for the diagnosis and treatment of IBD patients post-CMV infection. Further research into this model will be significant for understanding the changes in the host immune microenvironment following CMV infection.
{"title":"Integrating transcriptomics and proteomics to analyze the immune microenvironment of cytomegalovirus associated ulcerative colitis and identify relevant biomarkers.","authors":"Yang Chen, Qingqing Zheng, Hui Wang, Peiren Tang, Li Deng, Pu Li, Huan Li, Jianhong Hou, Jie Li, Li Wang, Jun Peng","doi":"10.1186/s13040-024-00382-0","DOIUrl":"10.1186/s13040-024-00382-0","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11348729/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142082326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-01 DOI: 10.1186/s13040-024-00378-w
Caroline König, Alfredo Vellido
The analysis of absorption, distribution, metabolism, and excretion (ADME) molecular properties is of relevance to drug design, as they directly influence the drug’s effectiveness at its target location. This study concerns their prediction, using explainable Machine Learning (ML) models. The aim of the study is to find which molecular features are relevant to the prediction of the different ADME properties and measure their impact on the predictive model. The relative relevance of individual features for ADME activity is gauged by estimating feature importance in ML models’ predictions. Feature importance is calculated using feature permutation and the individual impact of features is measured by SHAP additive explanations. The study reveals the relevance of specific molecular descriptors for each ADME property and quantifies their impact on the ADME property prediction. The reported research illustrates how explainable ML models can provide detailed insights about the individual contributions of molecular features to the final prediction of an ADME property, as an effort to support experts in the process of drug candidate selection through a better understanding of the impact of molecular features.
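The feature-permutation idea mentioned above can be sketched in a few lines: shuffle one feature column at a time and measure how much the model's error grows. This is a minimal illustration under stated assumptions, not the authors' pipeline. The toy model, the synthetic data, and the function names are all hypothetical, and SHAP values are not computed here.

```python
import random


def mse(model, X, y):
    """Mean squared error of model predictions on rows X against targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted model: it uses feature 0 and ignores feature 1.
def toy_model(row):
    return 3.0 * row[0]


data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

Shuffling the feature the model relies on raises the error, while shuffling the ignored feature leaves it unchanged, which is the signal used to rank features.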
{"title":"Understanding predictions of drug profiles using explainable machine learning models","authors":"Caroline König, Alfredo Vellido","doi":"10.1186/s13040-024-00378-w","DOIUrl":"https://doi.org/10.1186/s13040-024-00378-w","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141862771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-17 DOI: 10.1186/s13040-024-00375-z
Krishna Prasad, Allen Griffiths, Kavya Agrawal, Michael McEwan, Flavio Macci, Marco Ghisoni, Matthew Stopher, Matthew Napleton, Joel Strickland, David Keating, Thomas Whitehead, Gareth Conduit, Stacey Murray, Lauren Edward
Pharmacokinetic (PK) studies can provide essential information on the abuse liability of nicotine and tobacco products but are intrusive and must be conducted in a clinical environment. The objective of the study was to explore whether changes in plasma nicotine levels following use of an e-cigarette can be predicted from real-time monitoring of physiological parameters and mouth level exposure (MLE) to nicotine before, during, and after e-cigarette vaping, using wearable devices. Such an approach would allow an -effective pre-screening process, reducing the number of clinical studies, the number of products to be tested, and the number of blood draws required in a clinical PK study. Establishing such a prediction model might facilitate the longitudinal collection of data on product use and nicotine expression among consumers using nicotine products in their normal environments, thereby reducing the need for intrusive clinical studies while generating PK data related to product use in the real world. An exploratory machine learning model was developed to predict changes in plasma nicotine levels following the use of an e-cigarette from real-time monitoring of physiological parameters and MLE to nicotine before, during, and after e-cigarette vaping. This preliminary study identified key parameters, such as heart rate (HR), heart rate variability (HRV), and physiological stress (PS), that may act as predictors for an individual's plasma nicotine response (PK curve). Relative to baseline measurements (per participant), HR showed a significant increase for nicotine-containing e-liquids and was consistent across sessions (intra-participant). Imputing missing values and training the model on all data resulted in a 57% improvement over the original 'learning' data and achieved a median validation R² of 0.70. The study is in its exploratory phase, with limitations including a small and non-diverse sample and reliance on data from a single e-cigarette product. 
These findings necessitate further research for validation and to enhance the model's generalisability and applicability in real-world settings. This study serves as a foundational step towards developing non-intrusive PK models for nicotine product use.
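As a reminder of what the reported validation R² measures, here is a minimal sketch of the coefficient of determination, R² = 1 − SS_res / SS_tot. The abstract does not state how the median validation R² was obtained, so the values below are purely hypothetical and the function name is an illustrative assumption.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed vs. predicted values (not study data).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
score = r_squared(observed, predicted)
print(round(score, 2))
```

An R² of 1.0 corresponds to perfect predictions, and 0.0 to a model no better than always predicting the mean.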
{"title":"Modelling the nicotine pharmacokinetic profile for e-cigarettes using real time monitoring of consumers' physiological measurements and mouth level exposure.","authors":"Krishna Prasad, Allen Griffiths, Kavya Agrawal, Michael McEwan, Flavio Macci, Marco Ghisoni, Matthew Stopher, Matthew Napleton, Joel Strickland, David Keating, Thomas Whitehead, Gareth Conduit, Stacey Murray, Lauren Edward","doi":"10.1186/s13040-024-00375-z","DOIUrl":"10.1186/s13040-024-00375-z","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.0,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11253374/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141635153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Patients with chronic conditions need multiple medications daily to manage their condition. However, most patients have poor compliance, which affects the effectiveness of treatment. To address these challenges, we established a medication reminder system for the intelligent generation of a universal medication schedule (UMS) to remind patients with chronic diseases to take medication accurately and to improve the safety of home medication.
Methods: We designed a medication time constraint with one drug (MTCOD) for each drug and a medication time constraint with multi-drug (MTCMD) for each pair of drugs in order to better regulate the interval and timing of patients' medication. We then established a medication reminder system consisting of a cloud database of drug information, an operator terminal for medical staff, and a patient terminal.
Results: The cloud database contains a total of 153,916 pharmaceutical products, 496,708 drug interaction entries, and 153,390 pharmaceutical product-ingredient pairs. There were 153,916 MTCOD entries and 8,552,712 MTCMD entries. An intelligent UMS medication reminder system was constructed. The system can read patients' prescription information and provide personalized medication guidance with a medication timeline for chronic patients. At the same time, patients can query medication information and get remote pharmacy guidance in real time.
Conclusions: Overall, the medication reminder system provides intelligent medication reminders, automatic drug interaction identification, and a monitoring system, which helps to monitor the entire process of treatment in patients with chronic diseases.
{"title":"Construction and application of medication reminder system: intelligent generation of universal medication schedule.","authors":"Hangxing Huang, Lu Zhang, Yongyu Yang, Ling Huang, Xikui Lu, Jingyang Li, Huimin Yu, Shuqiao Cheng, Jian Xiao","doi":"10.1186/s13040-024-00376-y","DOIUrl":"10.1186/s13040-024-00376-y","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11247871/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141621275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-12 DOI: 10.1186/s13040-024-00373-1
Mateja Napravnik, Franko Hržić, Sebastian Tschauner, Ivan Štajduhar
Background: The use of machine learning in medical diagnosis and treatment has grown significantly in recent years with the development of computer-aided diagnosis systems, often based on annotated medical radiology images. However, the lack of large annotated image datasets remains a major obstacle, as the annotation process is time-consuming and costly. This study aims to overcome this challenge by proposing an automated method for annotating a large database of medical radiology images based on their semantic similarity.
Results: An automated, unsupervised approach is used to create a large annotated dataset of medical radiology images originating from the Clinical Hospital Centre Rijeka, Croatia. The pipeline is built by data-mining three different types of medical data: images, DICOM metadata and narrative diagnoses. The optimal feature extractors are then integrated into a multimodal representation, which is then clustered to create an automated pipeline for labelling a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation.
Conclusions: The results indicate that fusing the embeddings of all three data sources together provides the best results for the task of unsupervised clustering of large-scale medical data and leads to the most concise clusters. Hence, this work marks the initial step towards building a much larger and more fine-grained annotated dataset of medical radiology images.
{"title":"Building RadiologyNET: an unsupervised approach to annotating a large-scale multimodal medical database.","authors":"Mateja Napravnik, Franko Hržić, Sebastian Tschauner, Ivan Štajduhar","doi":"10.1186/s13040-024-00373-1","DOIUrl":"10.1186/s13040-024-00373-1","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.0,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11245804/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-11 DOI: 10.1186/s13040-024-00374-0
Emily R Hannon, Carmen J Marsit, Arlene E Dent, Paula Embury, Sidney Ogolla, David Midem, Scott M Williams, James W Kazura
Background: Changing cell-type proportions can confound studies of differential gene expression or DNA methylation (DNAm) from peripheral blood mononuclear cells (PBMCs). We examined how cell-type proportions derived from the transcriptome versus the methylome (DNAm) influence estimates of differentially expressed genes (DEGs) and differentially methylated positions (DMPs).
Methods: Transcriptome and DNAm data were obtained from PBMC RNA and DNA of Kenyan children (n = 8) before, during, and 6 weeks following uncomplicated malaria. DEGs and DMPs between time points were detected using cell-type adjusted modeling with Cibersortx or IDOL, respectively.
Results: Most major cell types and principal components had moderate to high correlation between the two deconvolution methods (r = 0.60-0.96). Estimates of cell-type proportions and DEGs or DMPs were largely unaffected by the method, with the greatest discrepancy in the estimation of neutrophils.
Conclusion: Variation in cell-type proportions is captured similarly by both transcriptomic and methylome deconvolution methods for most major cell types.
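To make the reported between-method correlations (r = 0.60-0.96) concrete, the sketch below computes a Pearson correlation between two sets of cell-type proportion estimates. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the proportions are invented for illustration only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical proportions of one cell type in four samples,
# estimated once from expression data and once from methylation data.
rna_based = [0.10, 0.20, 0.30, 0.40]
dnam_based = [0.12, 0.18, 0.33, 0.37]
r = pearson_r(rna_based, dnam_based)
print(round(r, 3))
```

Values near 1 indicate that the two deconvolution approaches rank and scale the samples similarly for that cell type.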
{"title":"Transcriptome- and DNA methylation-based cell-type deconvolutions produce similar estimates of differential gene expression and differential methylation.","authors":"Emily R Hannon, Carmen J Marsit, Arlene E Dent, Paula Embury, Sidney Ogolla, David Midem, Scott M Williams, James W Kazura","doi":"10.1186/s13040-024-00374-0","DOIUrl":"10.1186/s13040-024-00374-0","url":null,"PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":null,"pages":null},"PeriodicalIF":4.0,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11241886/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141591813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-01 DOI: 10.1186/s13040-024-00369-x
Lin Wang, Jiaming Su, Zhongjie Liu, Shaowei Ding, Yaotan Li, Baoluo Hou, Yuxin Hu, Zhaoxi Dong, Jingyi Tang, Hongfang Liu, Weijing Liu
Background: Diabetic nephropathy (DN) is a major microvascular complication of diabetes and has become the leading cause of end-stage renal disease worldwide. A considerable number of DN patients have progressed irreversibly to end-stage renal disease because the disease could not be diagnosed early. Reliable biomarkers that support early diagnosis and treatment therefore need to be identified. The migration of immune cells to the kidney is considered a key step in the progression of DN-related vascular injury, so markers of this process may be especially helpful for the early diagnosis and progression prediction of DN.
Methods: The gene chip data were retrieved from the GEO database using the search term 'diabetic nephropathy'. The 'limma' software package was used to identify differentially expressed genes (DEGs) between DN and control samples. Gene set enrichment analysis (GSEA) was performed using gene sets obtained from the Molecular Signatures Database (MSigDB). The R package 'WGCNA' was used to identify gene modules associated with tubulointerstitial injury in DN, and these modules were intersected with immune-related DEGs to identify target genes. Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were performed on differentially expressed genes using the 'clusterProfiler' software package in R. Three methods, least absolute shrinkage and selection operator (LASSO), support vector machine recursive feature elimination (SVM-RFE) and random forest (RF), were used to select immune-related biomarkers for diagnosis. We retrieved the tubulointerstitial dataset from the Nephroseq database to construct an external validation dataset. Unsupervised clustering analysis of the expression levels of immune-related biomarkers was performed using the 'ConsensusClusterPlus' R software package. Urine was collected from patients who visited Dongzhimen Hospital of Beijing University of Chinese Medicine from September 2021 to March 2023, and ELISA was used to detect the mRNA expression level of immune-related biomarkers in urine. Pearson correlation analysis was used to assess the relationship between immune-related biomarker expression and renal function in DN patients.
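The step in which module genes are crossed with immune-related DEGs amounts to a set intersection. The sketch below illustrates that step only, in Python rather than the R packages named in the abstract; the function name and the gene identifiers are placeholders, not genes or code from the study.

```python
def target_genes(module_genes, immune_degs):
    """Return genes found in both the injury-associated module and the immune-related DEG list."""
    return sorted(set(module_genes) & set(immune_degs))


# Placeholder identifiers (hypothetical, not genes reported by the study).
module_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
immune_degs = ["GENE_B", "GENE_D", "GENE_E"]

targets = target_genes(module_genes, immune_degs)
print(targets)
```

Sorting the result keeps the candidate list stable between runs, which helps when the same list is later passed to several feature-selection methods.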
Results: Four microarray datasets from the GEO database were included in the analysis: GSE30122, GSE47185, GSE99340 and GSE104954. These datasets included 63 DN patients and 55 healthy controls. A total of 9415 genes were detected in the dataset. We found 153 differentially expressed immune-related genes, of which 112 were up-regulated and 41 were down-regulated, and 119 overlapping genes were identified. GO analysis showed that they were involved in various biological processes, including leukocyte-mediated immunity. KEGG analysis showed that these target genes were mainly involved in the formation of phagosomes in Staphylococcus aureus infection. Among these
Title: Identification of immune-associated biomarkers of diabetes nephropathy tubulointerstitial injury based on machine learning: a bioinformatics multi-chip integrated analysis. Journal: Biodata Mining. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11218417/pdf/
Pub Date: 2024-06-26 DOI: 10.1186/s13040-024-00372-2
Yunfei Yin, Zheng Yuan, Islam Md Tanvir, Xianjian Bao
Missing data in electronic medical records seriously limits the practical application of biomedical data, so filling in these missing data effectively is a meaningful research effort. Currently, state-of-the-art methods focus on using Generative Adversarial Networks (GANs) to fill the missing values of electronic medical records and have achieved breakthrough progress. However, on datasets with high missing rates, the imputation accuracy of these methods decreases sharply. This motivates us to explore the uncertainty of GANs and to improve GAN-based imputation methods. In this paper, the GRUD (Gate Recurrent Unit Decay) network and the UGAN (Uncertainty Generative Adversarial Network) are proposed and combined into a method called UGAN-GRUD. UGAN-GRUD uses the GAN to generate imputation values and then leverages GRUD to compensate them. The UGAN learns the distribution pattern and uncertainty of the data iteratively through its Generator and Discriminator. The GRUD network compensates the UGAN by using a time decay factor, which lets it learn the specific temporal relations in electronic medical records. Experiments on publicly available biomedical datasets show that UGAN-GRUD outperforms the current state-of-the-art methods, with average improvements of 13% in RMSE (Root Mean Squared Error) and 24.5% in MAPE (Mean Absolute Percentage Error).
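The two metrics named above have standard definitions, sketched here in plain Python. How the reported percentage improvements were calculated is not stated in the abstract; the `relative_improvement` helper assumes a relative reduction against a baseline error, and all numbers in the usage lines are invented for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between true and imputed values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; undefined when a true value is zero."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_error, proposed_error):
    """Percentage reduction in error relative to a baseline (assumed definition)."""
    return 100.0 * (baseline_error - proposed_error) / baseline_error


# Invented values: held-out true measurements and two sets of imputations.
held_out = [100.0, 200.0, 50.0]
baseline_imputed = [120.0, 170.0, 60.0]
proposed_imputed = [110.0, 190.0, 55.0]

print(rmse(held_out, baseline_imputed), rmse(held_out, proposed_imputed))
print(mape(held_out, baseline_imputed), mape(held_out, proposed_imputed))
print(relative_improvement(rmse(held_out, baseline_imputed), rmse(held_out, proposed_imputed)))
```

In imputation benchmarks these errors are usually computed only on entries that were deliberately hidden, since the true values of genuinely missing entries are unknown.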
Title: Electronic medical records imputation by temporal Generative Adversarial Network. Journal: Biodata Mining. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11202349/pdf/