Pub Date: 2024-12-27; DOI: 10.1186/s12859-024-06013-z
Qiaosheng Zhang, Yalong Wei, Jie Hou, Hongpeng Li, Zhaoman Zhong
Background: Cancer classification remains a challenging problem, mainly because the data are high-dimensional and patient samples are hard to collect. Obtaining patient samples is costly and resource-intensive, and class imbalance between samples is common. Moreover, expression data are characterized by high dimensionality, small sample sizes and high noise, which can easily lead to problems such as the curse of dimensionality and overfitting. We therefore incorporate prior pathway knowledge and combine an AutoEncoder with a Generative Adversarial Network (GAN) to address these difficulties.
Results: In this study, we propose an effective and efficient deep learning method, named AEGAN, which combines the capabilities of an AutoEncoder and a GAN to generate synthetic samples of the minority class in imbalanced gene expression data. The proposed data balancing technique proved useful for cancer classification and improved the performance of classifier models. Additionally, we integrate prior pathway knowledge and employ the Pathifier algorithm to calculate pathway scores for each sample. This data augmentation approach, referred to as AEGAN-Pathifier, not only preserves the biological functionality of the data but also reduces its dimensionality. Validation with various classifiers shows an improvement in classifier performance.
Conclusion: AEGAN-Pathifier shows improved performance on the imbalanced datasets GSE25066, GSE20194, BRCA and Liver24. Results from various classifiers indicate that AEGAN-Pathifier generalizes well.
Title: AEGAN-Pathifier: a data augmentation method to improve cancer classification for imbalanced gene expression data.
Source: BMC Bioinformatics 25(1): 392. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11673641/pdf/
Pub Date: 2024-12-26; DOI: 10.1186/s12859-024-06017-9
Hyunseok Shin, Sejong Oh
Background: High-dimensional datasets with low sample sizes (HDLSS) are pivotal in biology and bioinformatics. A core objective in analyzing HDLSS data is to select the most informative features and discard redundant or irrelevant ones. This is particularly crucial in bioinformatics, where accurate feature (gene) selection can lead to breakthroughs in drug development and provide insights into disease diagnostics. Despite its importance, identifying optimal features remains a significant challenge in HDLSS settings.
Results: To address this challenge, we propose an effective feature selection method that combines gradual permutation filtering with a heuristic tribrid search strategy, specifically tailored for HDLSS contexts. The proposed method considers inter-feature interactions and leverages feature rankings during the search process. In addition, we suggest a new performance metric for HDLSS data that evaluates both the number and the quality of the selected features. Compared with existing methods on benchmark data, the proposed method reduced the average number of selected features from 37.8 to 5.5 and improved the performance of the prediction model built on the selected features from 0.855 to 0.927.
Conclusions: The proposed method effectively selects a small number of important features and achieves high prediction performance.
Title: An effective heuristic for developing hybrid feature selection in high dimensional and low sample size datasets.
Source: BMC Bioinformatics 25(1): 390. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11670382/pdf/
Pub Date: 2024-12-26; DOI: 10.1186/s12859-024-06016-w
Liqiang Sun, Xiaojing Fan, Yunwei Zhao, Qi Zhang, Mingyang Jiang
As a heterogeneous disease, prostate cancer (PCa) exhibits diverse clinical and biological features, which pose significant challenges for early diagnosis and treatment. Metabolomics offers promising new approaches for early diagnosis, treatment, and prognosis of PCa. However, metabolomics data are characterized by high dimensionality, noise, variability, and small sample sizes, presenting substantial challenges for classification. Although deep learning methods are widely applied elsewhere, their use in metabolomics research has not been extensively explored. In this study, we propose a hybrid model, TransConvNet, which combines transformer and convolutional neural networks for the classification of prostate cancer metabolomics data. We introduce a 1D convolution layer for the inputs to the dot-product attention mechanism, enabling the interaction of both local and global information. Additionally, a gating mechanism is incorporated to dynamically adjust the attention weights. The features extracted by multi-head attention are further refined through 1D convolution, and a residual network is introduced to alleviate the vanishing gradient problem in the convolutional layers. We conducted comparative experiments with seven other machine learning algorithms. Under five-fold cross-validation, TransConvNet achieved an accuracy of 81.03% and an AUC of 0.89, significantly outperforming the other algorithms. Additionally, we validated TransConvNet's generalization ability through experiments on a lung cancer dataset, with the results demonstrating its robustness and adaptability to different metabolomics datasets. We also propose the MI-RF (Mutual Information-based Random Forest) model, which effectively identified key biomarkers associated with prostate cancer by leveraging comprehensive feature weight coefficients. In contrast, traditional methods identified only a limited number of biomarkers. In summary, these results highlight the potential of TransConvNet and MI-RF in both classification tasks and biomarker discovery, providing valuable insights for the clinical application of prostate cancer diagnosis.
Title: Deep learning-based metabolomics data study of prostate cancer.
Source: BMC Bioinformatics 25(1): 391. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11674358/pdf/
Pub Date: 2024-12-24; DOI: 10.1186/s12859-024-06006-y
Zoltán Maróti, Peter Juma Ochieng, József Dombi, Miklós Krész, Tibor Kalmár
Background: Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. Normalization is particularly challenging because of hidden systemic biases such as GC bias, which can significantly affect the sensitivity and specificity of CNV detection. In many cases, the kit manifests provide only the genome coordinates of the targeted regions, and the exact design of the oligo capture baits is not available. Although the on-target regions overlap substantially with the bait design, the missing bait information makes normalization of the coverage data less accurate. In this study, we propose a novel approach that uses a 1D convolutional neural network (CNN) model to predict the positions of capture baits in complex whole-exome sequencing (WES) kits. By accurately identifying the bait coordinates, our model enables precise normalization of GC bias across target regions, thereby allowing better CNV data normalization.
Results: We evaluated the optimal hyperparameters, model architecture, and complexity for predicting the likely positions of the oligo capture baits. Our analysis shows that the CNN models outperform the Dense NN for bait prediction. Batch normalization is the most important parameter for the stable training of CNN models. Our results indicate that the spatiality of the data plays an important role in prediction performance. We have shown that combined input data, including experimental coverage, on-target information, and sequence data, are critical for bait prediction. Furthermore, comparison with the on-target information indicated that the CNN models performed better in predicting bait positions that exhibited a high degree of overlap (>90%) with the true bait positions.
Conclusions: This study highlights the potential of CNN-based approaches to optimize coverage data analysis and improve copy number data normalization. Subsequent CNV detection based on these predicted coordinates facilitates more accurate measurement of coverage profiles and better normalization for GC bias. As a result, this approach could reduce systemic bias and improve the sensitivity and specificity of CNV detection in genomic studies.
Title: Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions.
Source: BMC Bioinformatics 25(1): 389. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11669243/pdf/
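To make the GC-bias normalization step mentioned in the abstract above concrete, here is a minimal sketch of one common approach: group regions by GC content and divide each region's coverage by the median coverage of its GC bin. This is an illustrative assumption, not the authors' pipeline; the function names `gc_fraction` and `normalize_by_gc`, the bin count, and the toy values are all hypothetical.

```python
from statistics import median


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty string)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def normalize_by_gc(coverages, gc_values, n_bins=10):
    """Divide each region's coverage by the median coverage of its GC bin.

    Hypothetical helper: regions (for example predicted baits) are grouped
    into n_bins equal-width GC bins; a value near 1.0 means "typical coverage
    for regions with this GC content".
    """
    def bin_index(gc):
        # Clamp gc == 1.0 into the last bin.
        return min(int(gc * n_bins), n_bins - 1)

    by_bin = {}
    for coverage, gc in zip(coverages, gc_values):
        by_bin.setdefault(bin_index(gc), []).append(coverage)
    bin_median = {b: median(values) for b, values in by_bin.items()}

    normalized = []
    for coverage, gc in zip(coverages, gc_values):
        reference = bin_median[bin_index(gc)]
        # Guard against bins whose median coverage is zero.
        normalized.append(coverage / reference if reference else 0.0)
    return normalized


# Toy data: three mid-GC regions and three high-GC regions with halved coverage.
example_coverage = [100, 110, 90, 50, 55, 45]
example_gc = [0.42, 0.45, 0.48, 0.72, 0.75, 0.78]
example_normalized = normalize_by_gc(example_coverage, example_gc)
```

In the toy data the high-GC regions have systematically lower raw coverage, yet after normalization both groups are centered on 1.0, so the remaining deviations reflect region-level differences and not GC content. The more exactly the region coordinates match the real baits, the more representative the GC values fed into such a step are, which is the motivation the abstract gives for predicting bait positions.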
Pub Date: 2024-12-23; DOI: 10.1186/s12859-024-05988-z
Toshiaki Yachimura, Hanbo Wang, Yusuke Imoto, Momoko Yoshida, Sohei Tasaki, Yoji Kojima, Yukihiro Yabuta, Mitinori Saitou, Yasuaki Hiraoka
Background: Time-series scRNA-seq data have opened the door to elucidating cell differentiation, and in this context optimal transport theory has been attracting much attention. However, critical issues remain in interpretability and computational cost.
Results: We present scEGOT, a comprehensive framework for single-cell trajectory inference, as a generative model with high interpretability and low computational cost. Applied to the human primordial germ cell-like cell (PGCLC) induction system, scEGOT identified the PGCLC progenitor population and the bifurcation time of segregation. Our analysis shows that TFAP2A alone is insufficient for identifying PGCLC progenitors; NKX1-2 is also required. Additionally, MESP1 and GATA6 are crucial for PGCLC/somatic cell segregation.
Conclusions: These findings shed light on the mechanism that segregates PGCLC from somatic lineages. Notably, scEGOT is not limited to scRNA-seq: it can extend to general single-cell data such as scATAC-seq, and hence has the potential to advance our understanding of such datasets and, in turn, of developmental biology.
Title: scEGOT: single-cell trajectory inference framework based on entropic Gaussian mixture optimal transport.
Source: BMC Bioinformatics 25(1): 388. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11665215/pdf/
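For readers unfamiliar with entropic optimal transport, the following is a minimal sketch of the generic Sinkhorn iteration between two discrete distributions, for instance cluster weights at two consecutive time points. It is not scEGOT's Gaussian-mixture formulation, which the abstract does not spell out; the function name `sinkhorn`, the regularization strength `epsilon`, and the toy cost matrix are assumptions chosen for illustration.

```python
import math


def sinkhorn(source, target, cost, epsilon=0.5, iterations=500):
    """Entropy-regularized optimal transport plan via Sinkhorn scaling.

    source, target: probability vectors (each sums to 1).
    cost[i][j]: cost of moving mass from source bin i to target bin j.
    Returns a matrix whose row sums approximate `source` and whose column
    sums match `target`.
    """
    n, m = len(source), len(target)
    # Gibbs kernel: small cost -> large kernel value.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals.
        u = [source[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [target[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: two clusters at time t, two clusters at time t + 1.
weights_t0 = [0.5, 0.5]
weights_t1 = [0.25, 0.75]
transition_cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(weights_t0, weights_t1, transition_cost)
```

Each entry `plan[i][j]` can be read as the amount of mass flowing from cluster `i` at the earlier time point to cluster `j` at the later one. A smaller `epsilon` gives a sharper, less diffuse plan at the price of slower convergence.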
Pub Date: 2024-12-20; DOI: 10.1186/s12859-024-05998-x
Shengrong Zhu, Ruijia Yang, Zifeng Pan, Xuan Tian, Hong Ji
Background: Diagnostic prediction is a central application that spans various medical specialties and scenarios. Sequential diagnosis prediction is the task of predicting future diagnoses from a patient's historical visits. Prior research has underexplored the impact of irregular intervals between patient visits on predictive models, despite its significance.
Method: We developed the Multi-task Fusion Visit Interval for Sequential Diagnosis Prediction (MISDP) framework to address this research gap. The MISDP framework integrates sequential diagnosis prediction with visit interval prediction within a multi-task learning paradigm. It uses positional encoding and interval encoding to handle irregular patient visit intervals. Furthermore, it incorporates historical attention residue to enhance the multi-head self-attention mechanism, focusing on extracting long-term dependencies from historical clinical visits.
Results: The MISDP model exhibited superior performance on real-world healthcare data, irrespective of whether training data were scarce or abundant. With only 20% of the training data, MISDP achieved a 4.2% improvement over KAME; when the training data ranged from 60 to 80%, MISDP surpassed SETOR, the top baseline, by 0.8% in accuracy, underscoring its robustness and efficacy in the sequential diagnosis prediction task.
Conclusions: The MISDP model significantly improves the accuracy of sequential diagnosis prediction. The result highlights the advantage of multi-task learning in synergistically enhancing the performance of individual sub-tasks. Notably, irregular visit interval factors and historical attention residue have been particularly instrumental in refining the precision of sequential diagnosis prediction, suggesting a promising avenue for advancing clinical decision-making through data-driven modeling approaches.
Title: MISDP: multi-task fusion visit interval for sequential diagnosis prediction.
Source: BMC Bioinformatics 25(1): 387. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11662528/pdf/
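The combination of positional encoding and interval encoding described in the Method paragraph above can be sketched as follows. The abstract does not specify how MISDP computes its interval encoding, so this example assumes the standard Transformer sine/cosine formula applied once to the visit index and once to the number of days since the previous visit, with the two vectors summed. The names `sinusoidal_encoding` and `encode_visits`, and the choice of summation, are hypothetical.

```python
import math


def sinusoidal_encoding(value: float, dim: int = 8, base: float = 10000.0) -> list:
    """Transformer-style sine/cosine encoding of a scalar; dim is assumed even."""
    encoding = []
    for i in range(0, dim, 2):
        frequency = 1.0 / (base ** (i / dim))
        encoding.append(math.sin(value * frequency))
        encoding.append(math.cos(value * frequency))
    return encoding


def encode_visits(visit_days, dim: int = 8):
    """Sum a positional encoding (visit index) and an interval encoding
    (days since the previous visit) for each visit in a patient's history.

    visit_days: visit dates expressed as days since the first visit.
    """
    encoded = []
    previous_day = None
    for index, day in enumerate(visit_days):
        interval = 0 if previous_day is None else day - previous_day
        position_part = sinusoidal_encoding(index, dim)
        interval_part = sinusoidal_encoding(interval, dim)
        encoded.append([p + q for p, q in zip(position_part, interval_part)])
        previous_day = day
    return encoded


# Two patients with the same number of visits but different gaps between them.
regular_patient = encode_visits([0, 7, 14])
irregular_patient = encode_visits([0, 7, 180])
```

With a purely positional encoding the two patients would receive identical vectors, because only the visit order is seen. Adding the interval term makes the third visit of each patient distinguishable, which is the kind of information about irregular visit spacing that the abstract argues earlier models underuse.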
Pub Date: 2024-12-19; DOI: 10.1186/s12859-024-05999-w
Chuanlei Zhang, Yubo Li, Yinglun Dong, Wei Chen, Changqing Yu
Background: As a key class of non-coding RNA molecules, miRNAs profoundly affect the regulation of gene expression and are linked to the pathological processes of many human diseases. However, conventional experimental methods for validating miRNA-disease associations are laborious. Consequently, the development of efficient and reliable computational prediction models is crucial for the identification and validation of these associations.
Results: In this research, we developed the PCACFMDA method to predict potential associations between miRNAs and diseases. To construct a multidimensional feature matrix, we consider the fused similarities of miRNAs and diseases together with miRNA-disease pairs. We then use principal component analysis (PCA) to reduce data complexity and extract low-dimensional features. Subsequently, a tuned cascade forest is used to mine the features in depth and output prediction scores. The results of 5-fold cross-validation on the HMDD v2.0 database indicate that the PCACFMDA algorithm achieved an AUC of 98.56%. Additionally, we performed case studies on breast, esophageal and lung neoplasms. The findings revealed that the top 50 miRNAs most strongly linked to each disease have been validated.
Conclusions: Based on PCA and optimized cascade forests, we propose the PCACFMDA model for predicting undiscovered miRNA-disease associations. The experimental results demonstrate superior prediction performance and commendable stability. Consequently, PCACFMDA is a potent instrument for in-depth exploration of miRNA-disease associations.
Title: Prediction of miRNA-disease associations based on PCA and cascade forest.
Source: BMC Bioinformatics 25(1): 386. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11660965/pdf/
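The PCA step in the Results paragraph above, reducing a feature matrix to low-dimensional features, can be illustrated with a small standard-library sketch. It extracts only the first principal component, using power iteration on the sample covariance matrix as a stand-in for a full eigendecomposition. The cascade forest stage is not shown, and the function name `first_principal_component` and the toy matrix are assumptions made for illustration, not the PCACFMDA implementation.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    rows: list of equal-length feature vectors (one per sample).
    direction: unit vector of maximal variance, found by power iteration.
    scores: projection of each mean-centered sample onto that direction.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Sample covariance matrix (d x d).
    covariance = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]

    # Power iteration: repeated multiplication converges to the top eigenvector
    # as long as the starting vector is not orthogonal to it.
    direction = [1.0] * d
    for _ in range(iterations):
        product = [sum(covariance[a][b] * direction[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # All features are constant; no variance to explain.
        direction = [x / norm for x in product]

    scores = [sum(c[j] * direction[j] for j in range(d)) for c in centered]
    return direction, scores


# Toy feature matrix: the second feature is exactly twice the first.
toy_features = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
toy_direction, toy_scores = first_principal_component(toy_features)
```

Because the two toy features are perfectly correlated, a single component carries all of the variance, so the two-column matrix can be replaced by the one-column `toy_scores` without losing information. In a pipeline like the one the abstract describes, such low-dimensional scores would then be passed to the downstream classifier.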
Pub Date: 2024-12-18; DOI: 10.1186/s12859-024-06003-1
Zhongning Jiang, Wei Huang, Raymond H W Lam, Wei Zhang
Recent developments in spatially resolved transcriptomics (SRT) enable the characterization of spatial structures for different tissues. Many decomposition methods have been proposed to depict the cellular distribution within tissues. However, existing computational methods struggle to balance spatial continuity in cell distribution with the preservation of cell-specific characteristics. To address this, we propose Spall, a novel decomposition network that integrates scRNA-seq data with SRT data to accurately infer cell type proportions. Spall introduced the GATv2 module, featuring a flexible dynamic attention mechanism to capture relationships between spots. This improves the identification of cellular distribution patterns in spatial analysis. Additionally, Spall incorporates skip connections to address the loss of cell-specific information, thereby enhancing the prediction capability for rare cell types. Experimental results show that Spall outperforms the state-of-the-art methods in reconstructing cell distribution patterns on multiple datasets. Notably, Spall reveals tumor heterogeneity in human pancreatic ductal adenocarcinoma samples and delineates complex tissue structures, such as the laminar organization of the mouse cerebral cortex and the mouse cerebellum. These findings highlight the ability of Spall to provide reliable low-dimensional embeddings for downstream analyses, offering new opportunities for deciphering tissue structures.
Spall: accurate and robust unveiling cellular landscapes from spatially resolved transcriptomics data using a decomposition network. BMC Bioinformatics 25(1):379. Published 2024-12-18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11656923/pdf/
Pub Date: 2024-12-18 DOI: 10.1186/s12859-024-05985-2
Sasan Azizian, Juan Cui
Background: Interactions between microRNAs and RNA-binding proteins are crucial for microRNA-mediated gene regulation and sorting. Despite their significance, the molecular mechanisms governing these interactions remain underexplored, apart from sequence motifs identified on microRNAs. To date, only a limited number of microRNA-binding proteins have been confirmed, typically through labor-intensive experimental procedures. Advanced bioinformatics tools are urgently needed to facilitate this research.
Methods: We present DeepMiRBP, a novel hybrid deep learning model specifically designed to predict microRNA-binding proteins by modeling molecular interactions. This innovative approach is the first to target the direct interactions between small RNAs and proteins. DeepMiRBP consists of two main components. The first component combines bidirectional long short-term memory (Bi-LSTM) neural networks, which capture sequential dependencies and context within RNA sequences; attention mechanisms, which focus the model on the most relevant features; and transfer learning, which applies knowledge gained from a large dataset of RNA-protein binding sites to the specific task of predicting microRNA-protein interactions. Cosine similarity is applied to assess RNA similarities. The second component uses convolutional neural networks (CNNs) to process the spatial data inherent in protein structures, based on Position-Specific Scoring Matrices (PSSMs) and contact maps, to generate detailed and accurate representations of potential microRNA-binding sites and assess protein similarities.
Results: DeepMiRBP achieved a prediction accuracy of 87.4% during training and 85.4% during testing, with an F score of 0.860. Additionally, we validated our method using three case studies, focusing on microRNAs such as miR-451, -19b, -23a, -21, -223, and let-7d. DeepMiRBP successfully predicted known miRNA interactions with recently discovered RNA-binding proteins, including AGO, YBX1, and FXR2, identified in various exosomes.
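As a rough illustration of the cosine-similarity step, the sketch below compares a query embedding against a few reference embeddings and ranks them. It assumes the RNA sequences have already been encoded as fixed-length vectors; the vectors, the names `rna_a`, `rna_b`, `rna_c`, and the helper `rank_by_similarity` are hypothetical and do not come from DeepMiRBP.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates):
    """Return candidate names sorted from most to least similar to query."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query, candidates[name]),
        reverse=True,
    )

# Hypothetical embeddings standing in for learned RNA representations.
query = [1.0, 0.0, 1.0]
candidates = {
    "rna_a": [2.0, 0.0, 2.0],   # same direction as the query
    "rna_b": [0.0, 1.0, 0.0],   # orthogonal to the query
    "rna_c": [1.0, 1.0, 0.0],   # partially aligned
}
ranking = rank_by_similarity(query, candidates)
```

Cosine similarity ignores vector length, so `rna_a` ranks first even though its magnitude differs from the query's.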
Conclusions: Our proposed DeepMiRBP strategy represents the first of its kind designed for microRNA-protein interaction prediction. Its promising performance underscores the model's potential to uncover novel interactions critical for small RNA sorting and packaging, as well as to infer new RNA transporter proteins. The methodologies and insights from DeepMiRBP offer a scalable template for future small RNA research, from mechanistic discovery to modeling disease-related cell-to-cell communication, emphasizing its adaptability and potential for developing novel small RNA-centric therapeutic interventions and personalized medicine.
DeepMiRBP: a hybrid model for predicting microRNA-protein interactions based on transfer learning and cosine similarity. BMC Bioinformatics 25(1):381. Published 2024-12-18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11656930/pdf/
Pub Date: 2024-12-18 DOI: 10.1186/s12859-024-06001-3
Daniel Antonio Negrón, Shipra Trivedi, Nicholas Tolli, David Ashford, Gabrielle Melton, Stephanie Guertin, Katharine Jennings, Bryan D Necciai, Shanmuga Sozhamannan, Bradley W Abramson
Background: The bacterium Vibrio cholerae causes diarrheal illness and can acquire genetic material leading to multiple drug resistance (MDR). Rapid detection of resistance-conferring mobile genetic elements helps avoid the prescription of ineffective antibiotics for specific strains. Colorimetric loop-mediated isothermal amplification (LAMP) assays provide a rapid and cost-effective means of detection at the point of care in resource-limited regions, since they do not require specialized equipment, require limited expertise to perform, and can take less than 30 min. LAMP output is a color change that can be viewed by eye, but it can be difficult to design primer sets, determine target specificity, and interpret subjective color changes.
Methods: We developed an algorithm for the in silico design and evaluation of LAMP assays within the open-source PCR Signature Erosion Tool (PSET) and a computer vision application for the quantitative analysis of colorimetric outputs. First, Primer3 calculates LAMP primer sequence candidates with settings based on GC-content optimization. Next, PSET aligns the primer sequences of each assay against large sequence databases to determine whether sequence similarity, coverage, and primer arrangement are sufficient for the intended taxa, ultimately generating a confusion matrix. Finally, we tested assay candidates in the laboratory against synthetic constructs.
Results: As an example, we generated new LAMP assays targeting drug resistance in V. cholerae and evaluated existing ones from the literature based on in silico target specificity and in vitro testing. Improvements in the design and testing of LAMP assays, with heightened target specificity and a simple analysis platform, increase utility for in-field applications. Overall, 9 of the 16 tested LAMP assays produced a positive signal with the visual and computer vision-based detection methods developed here. We show LAMP assays tested on synthetic AMR gene targets for aph(6) (aminoglycosides), varG (penicillins and carbapenems), floR (phenicols), qnrVC5 (fluoroquinolones), and almG (polymyxins).
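Two quantities in the pipeline above are easy to make concrete: the GC content of a primer candidate and the summary rates read off a confusion matrix. The sketch below is a minimal illustration and does not use Primer3 or PSET; the function names, the example sequences, and the counts are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def confusion_metrics(tp, fp, fn, tn):
    """Sensitivity and specificity from confusion-matrix counts.

    tp: intended targets that the assay matches
    fn: intended targets that the assay misses
    fp: off-target sequences that the assay matches
    tn: off-target sequences that the assay does not match
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

# Hypothetical primer candidates and in silico hit counts.
balanced = gc_content("ATGC")      # half of the bases are G or C
gc_rich = gc_content("GGCCGC")     # every base is G or C
sens, spec = confusion_metrics(tp=8, fp=2, fn=2, tn=8)
```

In this toy case the assay matches 8 of 10 intended targets and rejects 8 of 10 off-target sequences, so both rates are 0.8.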
Loop-mediated isothermal amplification assays for the detection of antimicrobial resistance elements in Vibrio cholera. BMC Bioinformatics 25(1):384. Published 2024-12-18. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11657800/pdf/