Pub Date: 2026-01-03; DOI: 10.1093/bioinformatics/btag033
Junqi Long, Bo Liu, Jianqiang Li, Shuangtao Zhao
Motivation: Interactions among long noncoding RNAs, circular RNAs, microRNAs, and messenger RNAs form complex gene expression regulatory networks, which are of great significance for the diagnosis, prevention, and treatment of complex diseases. Although existing computational methods have been developed to predict interactions among certain molecular types, they are generally limited to single-modality perspectives, overlooking competitive specificity and co-target cooperativity across multi-omics molecules, and thereby limiting their ability to elucidate cross-omics regulatory mechanisms.
Results: We proposed a novel cross-omics adaptive multimodal contrastive learning framework (MCOAN) that learns multimodal regulatory mechanisms and effectively predicts disease-associated molecular regulatory networks. Specifically, we first constructed a five-layer heterogeneous graph architecture to comprehensively integrate the complex regulatory associations among multi-omics nodes. Then, we proposed an unsupervised multimodal contrastive learning strategy that maximizes mutual information across distinct regulatory views, thereby enhancing node representations by efficiently capturing local neighborhood structure and global semantic information. Meanwhile, we also proposed a cross-omics adaptive learning mechanism that captures complex competitive specificity and co-target cooperativity across distinct regulatory networks, thereby further enhancing the structural awareness in node representations. Furthermore, we evaluated multiple downstream classifiers to accurately predict multimodal molecular regulatory networks. Finally, extensive experiments show that MCOAN consistently outperforms existing methods, achieving strong predictive accuracy and generalization (max AUC = 0.9881; max AUPR = 0.9826), and further confirm its real-world predictive performance through case studies.
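The idea of maximizing mutual information across regulatory views can be illustrated with a small contrastive objective. The abstract does not say which loss MCOAN uses; the sketch below assumes InfoNCE, a common lower bound on mutual information, and treats the same node seen in two views as the positive pair. The function name, the temperature value, and the toy embeddings are all hypothetical.

```python
import math

def info_nce(view_a, view_b, temperature=0.5):
    """InfoNCE loss between two lists of node embeddings given in the same node order.

    Assumption: this is a generic contrastive objective, not MCOAN's actual loss.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    losses = []
    for i, anchor in enumerate(a):
        # Cosine similarity of node i in view A to every node in view B.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # The positive pair is the same node seen in the other view.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (hypothetical): two nodes, two views.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

When the two views agree on each node, the loss is lower than when the views are mismatched, which is the signal a contrastive encoder would be trained on.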
Availability and implementation: All resources are available at https://github.com/JunqiLab/MCOAN.git.
Title: MCOAN: multimodal contrastive representation learning for cross-omics adaptive disease regulatory network prediction.
Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12881826/pdf/
Pub Date: 2026-01-03; DOI: 10.1093/bioinformatics/btaf655
Zhao Li, Zaiyi Zheng, Rongbin Li, Wenbo Chen, Yuntao Yang, Meer A Ali, Jundong Li, W Jim Zheng
Motivation: Single-cell RNA sequencing (scRNA-Seq) technology enables detailed exploration of gene expression at the individual cell level, crucial for annotating cell types and understanding cellular diversity. Traditional methods for cell type annotation often rely on marker genes and manual labeling, posing challenges due to low data quality and incomplete reference datasets.
Results: We developed CeLLTra, a novel contrastive learning framework that leverages a Transformer-based model integrating biological pathway information to group genes into super tokens, effectively capturing comprehensive gene expression from scRNA-Seq data. By combining this pathway-informed Transformer with a pretrained domain-specific language model, CeLLTra accurately aligns cell-type annotations with gene expression profiles. Evaluations on a large-scale human scRNA-Seq dataset showed that CeLLTra significantly outperformed state-of-the-art methods in supervised and zero-shot cell-type prediction. Additionally, CeLLTra generalized well to external datasets, improving clustering performance and enabling better characterization of cancerous cell states in tumor-infiltrating myeloid cells from non-small cell lung cancer patients.
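Grouping genes into pathway-level super tokens can be pictured as pooling the expression of each pathway's member genes into one value per pathway. The abstract does not describe how CeLLTra builds its super tokens, so the mean pooling below is an assumption, and the function name, gene lists, and pathway names are hypothetical toy inputs.

```python
def pathway_super_tokens(expression, pathways):
    """Average the expression of each pathway's genes into one value per pathway.

    Assumption: mean pooling stands in for whatever aggregation CeLLTra uses.
    """
    tokens = {}
    for name, genes in pathways.items():
        values = [expression[g] for g in genes if g in expression]
        # Pathways with no measured genes get 0.0 so every cell yields
        # the same number of tokens.
        tokens[name] = sum(values) / len(values) if values else 0.0
    return tokens

# Hypothetical toy inputs: one cell's expression and two made-up pathway sets.
cell = {"CD3E": 4.0, "CD3D": 2.0, "MS4A1": 0.0, "CD79A": 1.0}
pathways = {
    "T_cell_receptor": ["CD3E", "CD3D"],
    "B_cell_receptor": ["MS4A1", "CD79A", "CD19"],  # CD19 is not measured here
}
tokens = pathway_super_tokens(cell, pathways)
```

The resulting fixed-length token vector is far shorter than the gene list, which is what makes a Transformer over whole expression profiles tractable.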
Availability and implementation: CeLLTra is freely available on GitHub (https://github.com/WJZheng-group/CeLLTra) and Zenodo (https://doi.org/10.5281/zenodo.17666735). The datasets underlying this article are the following: GSE201333 and GSE127465. All these datasets are publicly available and can be freely accessed on the Gene Expression Omnibus repository.
Title: CeLLTra: aligning cell names with gene expression via a pathway-informed transformer.
Journal: Bioinformatics (Oxford, England), 42(2). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12881829/pdf/
Pub Date: 2026-01-03; DOI: 10.1093/bioinformatics/btag037
Charlotte Collins, Panagiotis Fytas, İlknur Karadeniz, Huiyuan Zheng, Simon Baker, Ulla Stenius, Anna Korhonen
Motivation: Automatic information extraction from biomedical texts requires machine learning methodology that can recognize biomedical entities, characterize inter-entity relationships, and relate extracted information to specific research topics. Large language models (LLMs) excel in general tasks but perform less reliably in the biomedical domain, where texts are characterized by extensive technical terminology and semantic variations from general literature. There is an unmet need for annotated full-text datasets that can be used to fine-tune language models for significant biomedical applications. Here, we focus on extraction of the complex relationships between genes and diseases.
Results: We present BioTriplex, a corpus of 100 full-length biomedical research articles (comprising 604 subsection texts) manually annotated with disease names, genes, and 21 subtypes of disease-gene relationships. We employ BioTriplex to train the LLaMA 3.1 8B language model in gene-disease relation extraction. Our fine-tuned model outperforms zero-shot and few-shot approaches, both within the LLaMA 3.1 architecture and across the larger state-of-the-art LLMs GPT-4 and Claude Sonnet 3.7, and classifies gene-disease relation types with broader scope and greater granularity than previously described. These results validate BioTriplex as a useful full-text data resource and underscore the value of specialized datasets in fine-tuning language models for important biomedical tasks.
Availability and implementation: https://github.com/PanagiotisFytas/BioTriplex.
Title: BioTriplex: a full-text annotated corpus for fine-tuning language models in gene-disease relation extraction tasks.
Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12883087/pdf/
Motivation: Aberrant DNA methylation is a fundamental epigenetic hallmark of cancer. However, existing resources often lack technological diversity and comprehensive cancer coverage. Furthermore, most platforms fail to achieve deep multi-omics integration and tend to ignore cancer-type-specific methylation features, limiting their utility in precision oncology and drug discovery.
Results: We developed Cancer Methylation Atlas (CMAtlas), a comprehensive platform integrating 13 753 samples across 34 cancer types. By applying technology-tailored pipelines to data from various profiling technologies, we identified 830 725 tumor-specific differentially methylated elements (DMEs) and 1 480 098 differentially methylated regions (DMRs), alongside 1 154 256 cancer-type-specific DMEs and 329 154 DMRs. The platform demonstrates high cross-platform consistency and strong concordance between tumor tissues and cell lines, ensuring the robustness of our findings. All DMEs and DMRs are annotated with multi-omics data (RNA expression, somatic mutations, and chromatin accessibility) and clinical relevance (survival associations and cell-free DNA profiling). We further demonstrate the utility of CMAtlas by identifying prognostic aberrant methylation in colorectal cancer driver genes.
Availability and implementation: CMAtlas is freely accessible at https://cmatlas.renlab.cn/. The platform offers an intuitive web interface supporting gene-centric and cancer-centric queries, alongside customizable analysis modules designed to facilitate user-specific research needs.
Title: CMAtlas: a comprehensive DNA methylation atlas for exploring epigenetic alterations in 34 human cancer types.
Authors: Mengni Liu, Lizhen Jiang, Luowanyue Zhang, Tianjian Chen, Xingzhe Wang, Yuan Liang, Xianping Shi, Jian Ren, Yueyuan Zheng
Pub Date: 2026-01-03; DOI: 10.1093/bioinformatics/btag022
Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12881830/pdf/
Pub Date: 2026-01-03; DOI: 10.1093/bioinformatics/btag026
Katalin Ferenc, Lorenzo Martini, Ieva Rauluseviciute, Geir Kjetil Ferkingstad Sandve, Anthony Mathelier
Summary: The accurate development, assessment, interpretation, and benchmarking of bioinformatics frameworks for analyzing transcriptional regulatory grammars rely on controlled simulations to validate the underlying methods. However, existing simulators often lack end-to-end flexibility or ease of integration, which limits their practical use. We present inMOTIFin, a lightweight, modular, and user-friendly Python-based software that addresses these gaps by providing versatile and efficient simulation and modification of DNA regulatory sequences. inMOTIFin enables users to simulate or modify regulatory sequences efficiently for the customizable generation of motifs and insertion of motif instances with precise control over their positions, co-occurrences, and spacing, as well as direct modification of real sequences, facilitating a comprehensive evaluation of motif-based methods and interpretation tools. We demonstrate inMOTIFin applications for the assessment of de novo motif discovery, the analysis of transcription factor cooperativity, and the support of explainability analyses for deep learning models. inMOTIFin ensures robust and reproducible analyses for studying transcriptional regulatory grammars.
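The kind of controlled simulation described above, with motif instances placed at chosen positions and spacing in a background sequence, can be sketched in a few lines. This is not the inMOTIFin API: the function name, its parameters, the uniform background model, and the example motif string are assumptions made for illustration.

```python
import random

def simulate_sequence(length, motif, positions, seed=0):
    """Return a random DNA background with `motif` written in at each position.

    Assumption: a uniform A/C/G/T background and exact (non-sampled) motif
    instances; a real simulator would typically sample from a motif matrix.
    """
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    seq = [rng.choice("ACGT") for _ in range(length)]
    for start in positions:
        if start < 0 or start + len(motif) > length:
            raise ValueError(f"motif does not fit at position {start}")
        # Overwrite the background so the sequence length stays fixed.
        seq[start:start + len(motif)] = motif
    return "".join(seq)

# Two instances of a hypothetical motif, spaced 20 bp apart (start to start).
seq = simulate_sequence(40, "TGACTCA", positions=[5, 25])
```

Because the insertion positions are known, the output can serve as ground truth when checking whether a motif discovery or interpretation tool recovers the planted instances.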
Availability and implementation: inMOTIFin is available at PyPI https://pypi.org/project/inMOTIFin/ and Docker Hub https://hub.docker.com/r/cbgr/inmotifin. Detailed documentation is available at https://inmotifin.readthedocs.io/en/latest/. The code for use case analyses is available at https://bitbucket.org/CBGR/inmotifin_evaluation/src/main/. The version of the code used for this article has been uploaded to Zenodo with DOI: 10.5281/zenodo.17638579.
Title: inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences.
Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12881827/pdf/
Pub Date: 2026-01-03; DOI: 10.1093/bioinformatics/btaf337
Matthias Flotho, Philipp Flotho, Andreas Keller
Summary: Visualization of multidimensional, categorical data is a common challenge across scientific domains and, in particular, the life sciences. The goal is to create a comprehensive overview of the underlying data which enables one to assess multiple variables. One application where such visualizations are particularly useful is gene or pathway analysis, which involves checking for dysregulation in known biological mechanisms and functions across multiple conditions. Here, we propose a new visualization approach that encodes such data in an intuitive representation: DicePlots visualize up to four distinct categorical classes in a single view using elements resembling dice faces, whereas DominoPlots add an additional layer of information for binary comparison.
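The dice-face encoding can be pictured as giving each category a fixed pip slot inside a tile and filling the slot only when that category is present. The text rendering below is a rough sketch of that idea, not the diceplot or pydiceplot API; the slot layout, the limit of six categories per tile, and the tissue names are assumptions of this example.

```python
# Pip coordinates (row, col) on a 3x3 grid, following the six-face layout of a die.
SIX_FACE = [(0, 0), (1, 0), (2, 0), (0, 2), (1, 2), (2, 2)]

def render_tile(categories, present):
    """Draw one tile: each category owns a fixed pip slot, filled only if present."""
    if len(categories) > len(SIX_FACE):
        raise ValueError("this sketch supports at most six categories per tile")
    grid = [["." for _ in range(3)] for _ in range(3)]
    for category, (row, col) in zip(categories, SIX_FACE):
        if category in present:
            grid[row][col] = "o"
    return "\n".join("".join(row) for row in grid)

# Hypothetical example: a pathway dysregulated in liver and heart only.
tile = render_tile(["liver", "lung", "brain", "heart"], present={"liver", "heart"})
print(tile)
```

Keeping each category in the same slot across tiles is what lets a reader compare many tiles at a glance without consulting a legend for every one.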
Availability and implementation: The code is available as the diceplot R package and as the pydiceplot package on PyPI. All source code is available at https://github.com/maflot.
Contact: The repo is managed actively and we encourage community contributions and requests.
Title: DicePlot: a package for high-dimensional categorical data visualization.
Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866641/pdf/
Pub Date: 2026-01-03; DOI: 10.1093/bioinformatics/btag007
Alexandra M Wong, Cecile P G Meier-Scherling, Lorin Crawford
Motivation: Predicting synergistic cancer drug combinations through computational methods offers a scalable approach to creating therapies that are more effective and less toxic. However, most algorithms focus solely on synergy without considering toxicity when selecting optimal drug combinations. In the absence of combinatorial toxicity assays, a few models use toxicity penalties to balance high synergy with lower toxicity. Still, these penalties have not been explicitly validated against known drug-drug interactions.
Results: In this study, we examine whether synergy scores and toxicity metrics correlate with known adverse drug interactions. While some metrics show trends with toxicity levels, our results reveal significant limitations in using them as penalties. These findings highlight the challenges of incorporating toxicity into synergy prediction frameworks and suggest that advancing the field requires more comprehensive combination toxicity data.
Availability and implementation: The code written for this project is available at https://github.com/amw14/toxicity-cancer-drug-combination.
Title: Characterizing clinical toxicity in cancer combination therapies.
Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12865850/pdf/
Pub Date: 2026-01-03; DOI: 10.1093/bioinformatics/btag039
Daria Meyer, Emanuel Barth, Laura Wiehle, Manja Marz
Motivation: DNA methylation serves as a key biomarker in clinical diagnostics, especially in cancer detection. With methylation-specific PCR (MSP), a widely used approach, patient samples can be screened quickly and efficiently for differential methylation. During MSP, methylated regions are selectively amplified with specific primers. With nanopore sequencing, DNA methylation is detected during direct DNA sequencing, without pretreatment of the DNA. Multiple methods, mainly developed for whole-genome bisulfite sequencing (WGBS) data, exist to predict differentially methylated regions (DMRs) in the genome. However, the predicted DMRs are often very large and not sufficiently discriminating to generate meaningful results in MSP. This creates a gap between theoretical cancer marker research and practical application, as no tool currently provides methylation difference predictions tailored for PCR-based diagnostics.
Results: Here, we present diffMONT, a tool that predicts differentially methylated regions specifically suited for MSP primer design, enabling rapid translation into practical applications. diffMONT takes into account (i) the specific length of primer and amplicon regions, (ii) the fact that one condition should be unmethylated, and (iii) a minimal required amount of differentially methylated cytosines within the primer regions. We compared the results of diffMONT to metilene and DSS based on a publicly available nanopore sequencing dataset and show that the regions predicted by diffMONT are more specific toward hypermethylated regions. diffMONT accelerates the design of methylation-specific diagnostic assays, bridging the gap between theoretical research and clinical application.
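The three criteria listed above can be read as a filter applied to each candidate primer window. The sketch below is one way to express such a filter and is not diffMONT's implementation: the function name, the input layout, and every threshold (primer length, methylation cutoffs, minimum CpG count) are hypothetical, and the amplicon length constraint is left out for brevity.

```python
def primer_region_ok(cpgs, start, primer_len=20, min_diff_cpgs=3,
                     max_control=0.1, min_delta=0.5):
    """Check one candidate primer window against the criteria described above.

    cpgs: list of (position, control_methylation, case_methylation) tuples,
    with methylation given as a fraction between 0 and 1.
    All default thresholds are illustrative assumptions.
    """
    # (i) restrict attention to CpGs inside a window of the primer's length.
    window = [c for c in cpgs if start <= c[0] < start + primer_len]
    # (ii) the control condition must be essentially unmethylated in the window.
    if any(control > max_control for _, control, _ in window):
        return False
    # (iii) require enough CpGs with a strong methylation gain in the case condition.
    differential = [c for c in window if c[2] - c[1] >= min_delta]
    return len(differential) >= min_diff_cpgs

# Hypothetical per-CpG methylation fractions: (position, control, case).
cpgs = [(102, 0.02, 0.9), (108, 0.05, 0.85), (115, 0.0, 0.8), (140, 0.6, 0.9)]
good = primer_region_ok(cpgs, start=100)  # three clean, differential CpGs
bad = primer_region_ok(cpgs, start=125)   # control is methylated at position 140
```

Filtering at primer scale rather than reporting large DMRs is the design point the abstract stresses: a region is only useful for MSP if the short stretch a primer binds is itself discriminating.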
Availability and implementation: The source code for diffMONT, an open-source Python-based tool, is available at https://github.com/rnajena/diffMONT/, with an archived release under https://zenodo.org/records/17641031.
Title: diffMONT: predicting methylation-specific PCR biomarkers based on nanopore sequencing data for clinical application.
Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12881825/pdf/
Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag004
Jierui Xu, Elena I Zavala, Priya Moorjani
Summary: Sediment DNA, the recovery of genetic material from archaeological sediments, is an exciting new frontier in ancient DNA research, offering the potential to study individuals at a given archaeological site without destructive sampling. In recent years, several studies have demonstrated the promise of this approach by extracting hominin DNA from prehistoric sediments, including those dating back to the Middle or Late Pleistocene. However, a lack of open-source workflows for analysis of hominin sediment DNA samples poses a challenge for data processing and reproducibility of findings across studies. Here, we introduce a Snakemake workflow, sedimix, for processing genomic sequences from archaeological sediment DNA samples to identify hominin sequences and generate relevant summary statistics to assess the reliability of the pipeline. By performing simulations and comparing our results to two published studies with human DNA from ∼25,000 years ago (including shotgun data from a sediment sample and capture data from touch DNA recovered from a deer tooth pendant), we demonstrate that sedimix yields accurate and reliable inferences. sedimix offers a reliable and adaptable framework to aid in the analysis of sediment DNA datasets and improve reproducibility across studies.
Availability and implementation: sedimix is open-source software; the associated code, example data, and user manual with installation instructions are available at https://github.com/jierui-cell/sedimix. A permanent archived version of this release is available via Zenodo: https://doi.org/10.5281/zenodo.17244854.
{"title":"sedimix: a workflow for the analysis of hominin nuclear DNA sequences from sediments.","authors":"Jierui Xu, Elena I Zavala, Priya Moorjani","doi":"10.1093/bioinformatics/btag004","journal":{"name":"Bioinformatics (Oxford, England)"},"publicationDate":"2026-01-03","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866666/pdf/"}
Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag014
Xinyi Tang, Ran Liu
Motivation: Sequence motif identification is crucial for understanding molecular recognition, particularly in immune responses involving peptide binding to major histocompatibility complex (MHC) Class I molecules for antigen presentation to T cells. Traditionally, MHC Class I binding motifs are assumed to be contiguous and span nine amino acids. However, structural evidence suggests that binding may involve nonadjacent residues, challenging the assumptions of existing methods.
Results: In this study, we propose the Gap-Aware Motif Mining Algorithm (GAMMA), a probabilistic framework designed to identify noncontiguous motifs under conditions of incomplete labeling. GAMMA employs Bayesian inference with Markov chain Monte Carlo sampling to jointly estimate motif parameters, binding locations, and the relative spacing between binding positions. In extensive simulations and real-world applications to MHC Class I peptide datasets, GAMMA outperforms existing motif discovery tools such as GLAM2 in accurately localizing binding residues and identifying the underlying motifs. Notably, our results suggest that the true number of binding residues may be eight, fewer than the commonly assumed nine. In addition, for longer peptides, the model captures increased flexibility in the central region, consistent with structural observations that peptides may bulge in the middle.
Availability and implementation: The raw data and source code are available on GitHub (https://github.com/RanLIUaca/GAMMAmotif).
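To make the idea of a noncontiguous motif concrete, the sketch below scores a peptide against a motif whose binding positions may be separated by gaps. It is a simplified illustration only: GAMMA estimates motif parameters, binding locations, and spacing jointly with Bayesian MCMC sampling, whereas this sketch assumes the per-position residue probabilities are already known and simply enumerates every start and gap combination. The function names, the toy motif, the uniform background, and the `max_gap` limit are all hypothetical.

```python
import math
from itertools import product


def binding_positions(start, gaps):
    """Indices of binding residues: the first at `start`, each subsequent one
    placed after skipping gaps[i] non-binding residues."""
    positions = [start]
    for g in gaps:
        positions.append(positions[-1] + 1 + g)
    return positions


def log_score(peptide, positions, pwm, background, floor=1e-6):
    """Log-odds of the residues at `positions` under the motif vs. background."""
    total = 0.0
    for k, i in enumerate(positions):
        aa = peptide[i]
        total += math.log(pwm[k].get(aa, floor) / background.get(aa, floor))
    return total


def best_placement(peptide, pwm, background, max_gap=2):
    """Exhaustive search over start and gap vectors (a stand-in for sampling).

    Returns (best_score, best_positions); positions is None if nothing fits.
    """
    k = len(pwm)
    best = (-math.inf, None)
    for gaps in product(range(max_gap + 1), repeat=k - 1):
        span = k + sum(gaps)
        for start in range(len(peptide) - span + 1):
            positions = binding_positions(start, gaps)
            score = log_score(peptide, positions, pwm, background)
            if score > best[0]:
                best = (score, positions)
    return best


# Toy three-position motif (L, K, V) and a uniform amino-acid background.
pwm = [{"L": 0.9}, {"K": 0.9}, {"V": 0.9}]
background = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}

score, positions = best_placement("ALGGKV", pwm, background)
print(positions, round(score, 3))
```

In the toy peptide `ALGGKV`, the best placement skips the two glycines between the first and second binding residues, which is the kind of nonadjacent binding pattern a contiguous nine-residue model cannot represent.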
{"title":"GAMMA: gap-aware motif mining under incomplete labeling with applications to MHC motifs.","authors":"Xinyi Tang, Ran Liu","doi":"10.1093/bioinformatics/btag014","journal":{"name":"Bioinformatics (Oxford, England)"},"publicationDate":"2026-01-03","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866627/pdf/"}