Bioinformatics (Oxford, England): latest articles, page 8

MCOAN: multimodal contrastive representation learning for cross-omics adaptive disease regulatory network prediction.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag033

Junqi Long, Bo Liu, Jianqiang Li, Shuangtao Zhao

Motivation: Interactions among long noncoding RNAs, circular RNAs, microRNAs, and messenger RNAs form complex gene expression regulatory networks, which are of great significance for the diagnosis, prevention, and treatment of complex diseases. Although existing computational methods have been developed to predict interactions among certain molecular types, they are generally limited to single-modality perspectives, overlooking competitive specificity and co-target cooperativity across multi-omics molecules, and thereby limiting their ability to elucidate cross-omics regulatory mechanisms.

Results: We proposed a novel cross-omics adaptive multimodal contrastive learning framework (MCOAN) that learns multimodal regulatory mechanisms and effectively predicts disease-associated molecular regulatory networks. Specifically, we first constructed a five-layer heterogeneous graph architecture to comprehensively integrate the complex regulatory associations among multi-omics nodes. Then, we proposed an unsupervised multimodal contrastive learning strategy that maximizes mutual information across distinct regulatory views, thereby enhancing node representations by efficiently capturing local neighborhood structure and global semantic information. Meanwhile, we also proposed a cross-omics adaptive learning mechanism that captures complex competitive specificity and co-target cooperativity across distinct regulatory networks, thereby further enhancing the structural awareness in node representations. Furthermore, we evaluated multiple downstream classifiers to accurately predict multimodal molecular regulatory networks. Finally, extensive experiments show that MCOAN consistently outperforms existing methods, achieving strong predictive accuracy and generalization (max AUC = 0.9881; max AUPR = 0.9826), and further confirm its real-world predictive performance through case studies.
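The abstract does not say which contrastive objective MCOAN optimizes. As a minimal sketch, assuming an InfoNCE-style loss (a common lower bound on mutual information between two views), the idea of pulling together the two view-specific embeddings of the same node can be illustrated as follows. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration, not taken from the paper or its repository.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss; row i of view_a and row i of view_b form the positive pair."""
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor to every node embedding in the other view.
        logits = [dot(anchor, other) / temperature for other in b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings of two nodes under two hypothetical regulatory views.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(aligned, aligned)
loss_mismatched = info_nce(aligned, shuffled)
```

In this toy case the loss is lower when each node's two views agree than when they are swapped, which is the behaviour a contrastive objective of this kind rewards.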

Availability and implementation: All resources are available at https://github.com/JunqiLab/MCOAN.git.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0

CeLLTra: aligning cell names with gene expression via a pathway-informed transformer.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btaf655

Zhao Li, Zaiyi Zheng, Rongbin Li, Wenbo Chen, Yuntao Yang, Meer A Ali, Jundong Li, W Jim Zheng

Motivation: Single-cell RNA sequencing (scRNA-Seq) technology enables detailed exploration of gene expression at the individual cell level, crucial for annotating cell types and understanding cellular diversity. Traditional methods for cell type annotation often rely on marker genes and manual labeling, posing challenges due to low data quality and incomplete reference datasets.

Results: We developed CeLLTra, a novel contrastive learning framework that leverages a Transformer-based model integrating biological pathway information to group genes into super tokens, effectively capturing comprehensive gene expression from scRNA-Seq data. By combining this pathway-informed Transformer with a pretrained domain-specific language model, CeLLTra accurately aligns cell-type annotations with gene expression profiles. Evaluations on a large-scale human scRNA-Seq dataset showed that CeLLTra significantly outperformed state-of-the-art methods in supervised and zero-shot cell-type prediction. Additionally, CeLLTra generalized well to external datasets, improving clustering performance and enabling better characterization of cancerous cell states in tumor-infiltrating myeloid cells from non-small cell lung cancer patients.
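The abstract states that genes are grouped into pathway-level "super tokens" but does not describe how the grouping is computed. A minimal sketch, assuming simple mean pooling of expression values over each pathway's member genes, is shown below. The pooling rule, the function name, and the placeholder gene and pathway names are assumptions for illustration only; the published model may use learned embeddings instead.

```python
def pathway_super_tokens(expression, pathways):
    """Collapse a gene -> expression mapping into one mean value per pathway."""
    tokens = {}
    for name, genes in pathways.items():
        values = [expression[g] for g in genes if g in expression]
        # A pathway with no measured genes gets 0.0 instead of being dropped,
        # so every cell yields a token vector of the same length.
        tokens[name] = sum(values) / len(values) if values else 0.0
    return tokens


# Toy cell with placeholder gene names (not real measurements).
cell = {"GENE_A": 4.0, "GENE_B": 2.0, "GENE_C": 1.0}
toy_pathways = {
    "pathway_1": ["GENE_A", "GENE_B"],
    "pathway_2": ["GENE_C", "GENE_D"],  # GENE_D is not measured in this cell
    "pathway_3": ["GENE_E"],            # no measured genes at all
}
tokens = pathway_super_tokens(cell, toy_pathways)
```

Pooling to pathway level shortens the input sequence from one token per gene to one token per pathway, which is one plausible reason for grouping genes before a Transformer.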

Availability and implementation: CeLLTra is freely available on GitHub (https://github.com/WJZheng-group/CeLLTra) and Zenodo (https://doi.org/10.5281/zenodo.17666735). The datasets underlying this article are the following: GSE201333 and GSE127465. All these datasets are publicly available and can be freely accessed on the Gene Expression Omnibus repository.

Citations: 0

CMAtlas: a comprehensive DNA methylation atlas for exploring epigenetic alterations in 34 human cancer types.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag022

Mengni Liu, Lizhen Jiang, Luowanyue Zhang, Tianjian Chen, Xingzhe Wang, Yuan Liang, Xianping Shi, Jian Ren, Yueyuan Zheng

Motivation: Aberrant DNA methylation is a fundamental epigenetic hallmark of cancer. However, existing resources often lack technological diversity and comprehensive cancer coverage. Furthermore, most platforms fail to achieve deep multi-omics integration and tend to ignore cancer-type-specific methylation features, limiting their utility in precision oncology and drug discovery.

Results: We developed Cancer Methylation Atlas (CMAtlas), a comprehensive platform integrating 13 753 samples across 34 cancer types. By applying technology-tailored pipelines to data from various profiling technologies, we identified 830 725 tumor-specific differentially methylated elements (DMEs) and 1 480 098 differentially methylated regions (DMRs), alongside 1 154 256 cancer-type-specific DMEs and 329 154 DMRs. The platform demonstrates high cross-platform consistency and strong concordance between tumor tissues and cell lines, ensuring the robustness of our findings. All DMEs and DMRs are annotated with multi-omics data (RNA expression, somatic mutations, and chromatin accessibility) and clinical relevance (survival associations and cell-free DNA profiling). We further demonstrate the utility of CMAtlas by identifying prognostic aberrant methylation in colorectal cancer driver genes.

Availability and implementation: CMAtlas is freely accessible at https://cmatlas.renlab.cn/. The platform offers an intuitive web interface supporting gene-centric and cancer-centric queries, alongside customizable analysis modules designed to facilitate user-specific research needs.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0

inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag026

Katalin Ferenc, Lorenzo Martini, Ieva Rauluseviciute, Geir Kjetil Ferkingstad Sandve, Anthony Mathelier

Summary: The accurate development, assessment, interpretation, and benchmarking of bioinformatics frameworks for analyzing transcriptional regulatory grammars rely on controlled simulations to validate the underlying methods. However, existing simulators often lack end-to-end flexibility or ease of integration, which limits their practical use. We present inMOTIFin, a lightweight, modular, and user-friendly Python-based software that addresses these gaps by providing versatile and efficient simulation and modification of DNA regulatory sequences. inMOTIFin enables users to simulate or modify regulatory sequences efficiently for the customizable generation of motifs and insertion of motif instances with precise control over their positions, co-occurrences, and spacing, as well as direct modification of real sequences, facilitating a comprehensive evaluation of motif-based methods and interpretation tools. We demonstrate inMOTIFin applications for the assessment of de novo motif discovery, the analysis of transcription factor cooperativity, and the support of explainability analyses for deep learning models. inMOTIFin ensures robust and reproducible analyses for studying transcriptional regulatory grammars.
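To make the idea of inserting motif instances at controlled positions concrete, here is a minimal standard-library sketch. It does not use the inMOTIFin API; the function names, the uniform background model, and the toy position probability matrix are assumptions chosen only to illustrate the general technique of sampling an instance from a matrix and writing it into a random background.

```python
import random


def sample_motif_instance(pwm, rng):
    """Draw one motif instance; pwm is a list of {base: probability} columns."""
    instance = []
    for column in pwm:
        bases = list(column)
        weights = [column[b] for b in bases]
        instance.append(rng.choices(bases, weights=weights, k=1)[0])
    return "".join(instance)


def simulate_sequence(length, pwm, position, rng):
    """Uniform random background with one motif instance written at `position`."""
    if position < 0 or position + len(pwm) > length:
        raise ValueError("motif does not fit at the requested position")
    background = [rng.choice("ACGT") for _ in range(length)]
    instance = sample_motif_instance(pwm, rng)
    # Overwrite the background bases with the sampled instance.
    background[position:position + len(instance)] = instance
    return "".join(background), instance


# Toy 3-column matrix; the last column is degenerate between G and T.
toy_pwm = [{"A": 1.0}, {"C": 1.0}, {"G": 0.5, "T": 0.5}]
rng = random.Random(0)  # fixed seed so the simulation is reproducible
sequence, planted = simulate_sequence(30, toy_pwm, 10, rng)
```

Because the planted position and instance are returned alongside the sequence, a motif discovery method run on such sequences can be scored against a known ground truth.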

Availability and implementation: inMOTIFin is available at PyPI https://pypi.org/project/inMOTIFin/ and Docker Hub https://hub.docker.com/r/cbgr/inmotifin. Detailed documentation is available at https://inmotifin.readthedocs.io/en/latest/. The code for use case analyses is available at https://bitbucket.org/CBGR/inmotifin_evaluation/src/main/. The version of the code used for this article has been uploaded to Zenodo with DOI: 10.5281/zenodo.17638579.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0

BioTriplex: a full-text annotated corpus for fine-tuning language models in gene-disease relation extraction tasks.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag037

Charlotte Collins, Panagiotis Fytas, İlknur Karadeniz, Huiyuan Zheng, Simon Baker, Ulla Stenius, Anna Korhonen

Motivation: Automatic information extraction from biomedical texts requires machine learning methodology that can recognize biomedical entities, characterize inter-entity relationships, and relate extracted information to specific research topics. Large language models (LLMs) excel in general tasks but perform less reliably in the biomedical domain, where texts are characterized by extensive technical terminology and semantic variations from general literature. There is an unmet need for annotated full-text datasets that can be used to fine-tune language models for significant biomedical applications. Here, we focus on extraction of the complex relationships between genes and diseases.

Results: We present BioTriplex, a corpus of 100 full-length biomedical research articles (comprising 604 subsection texts) manually annotated with disease names, genes, and 21 subtypes of disease-gene relationships. We employ BioTriplex to train the LLaMA 3.1 8B language model in gene-disease relation extraction. Our fine-tuned model outperforms zero-shot and few-shot approaches, both within the LLaMA 3.1 architecture and across the larger state-of-the-art LLMs GPT-4 and Claude Sonnet 3.7, and classifies gene-disease relation types with broader scope and greater granularity than previously described. These results validate BioTriplex as a useful full-text data resource and underscore the value of specialized datasets in fine-tuning language models for important biomedical tasks.
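The abstract does not state which metric is used to compare the fine-tuned model with the zero-shot and few-shot baselines. As a sketch under the assumption of exact-match scoring over (gene, disease, relation) triples, precision, recall, and F1 could be computed as below. The function name and the placeholder triples are hypothetical and do not come from the corpus.

```python
def triple_prf(predicted, gold):
    """Precision, recall and F1 over sets of (gene, disease, relation) triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder annotations; relation labels are invented for this example.
gold_triples = {
    ("GENE_A", "disease_x", "relation_1"),
    ("GENE_B", "disease_x", "relation_2"),
}
predicted_triples = {
    ("GENE_A", "disease_x", "relation_1"),  # exact match
    ("GENE_B", "disease_x", "relation_1"),  # right pair, wrong relation subtype
}
precision, recall, f1 = triple_prf(predicted_triples, gold_triples)
```

Under exact matching, a prediction with the correct gene and disease but the wrong relation subtype counts as an error, which is why fine-grained relation labels make the task harder than plain gene-disease association detection.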

Availability and implementation: https://github.com/PanagiotisFytas/BioTriplex.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0

DicePlot: a package for high-dimensional categorical data visualization.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btaf337

Matthias Flotho, Philipp Flotho, Andreas Keller

Summary: Visualization of multidimensional, categorical data is a common challenge across scientific domains and, in particular, the life sciences. The goal is to create a comprehensive overview of the underlying data which enables one to assess multiple variables. One application where such visualizations are particularly useful is gene or pathway analysis, which involves checking for dysregulation in known biological mechanisms and functions across multiple conditions. Here, we propose a new visualization approach that encodes such data in an intuitive representation: DicePlots visualize up to four distinct categorical classes in a single view using elements resembling dice faces, whereas DominoPlots add an additional layer of information for binary comparison.

Availability and implementation: The code is available as the diceplot R package and as pydiceplot on PyPI. All source code is available at https://github.com/maflot.

Contact: The repo is managed actively and we encourage community contributions and requests.

Supplementary information: Supplementary data, figures, and reproducible examples are available online.

Citations: 0

Characterizing clinical toxicity in cancer combination therapies.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag007

Alexandra M Wong, Cecile P G Meier-Scherling, Lorin Crawford

Motivation: Predicting synergistic cancer drug combinations through computational methods offers a scalable approach to creating therapies that are more effective and less toxic. However, most algorithms focus solely on synergy without considering toxicity when selecting optimal drug combinations. In the absence of combinatorial toxicity assays, a few models use toxicity penalties to balance high synergy with lower toxicity. Still, these penalties have not been explicitly validated against known drug-drug interactions.

Results: In this study, we examine whether synergy scores and toxicity metrics correlate with known adverse drug interactions. While some metrics show trends with toxicity levels, our results reveal significant limitations in using them as penalties. These findings highlight the challenges of incorporating toxicity into synergy prediction frameworks and suggest that advancing the field requires more comprehensive combination toxicity data.
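The abstract does not name the statistic used to relate synergy scores or toxicity metrics to known adverse interactions. One plausible choice for comparing a continuous score with ordered interaction severity levels is a Spearman rank correlation, sketched here with the standard library only. The function names, the severity coding, and the toy numbers are assumptions for illustration and are not results from the study.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy values: one penalty score per drug pair and a hypothetical ordinal
# severity code for the known interaction (0 = none ... 3 = major).
toy_scores = [0.2, 0.9, 0.4, 0.7]
toy_severity = [0, 3, 1, 2]
rho = spearman(toy_scores, toy_severity)
```

A rank-based measure only asks whether higher scores tend to go with more severe interactions, so it does not assume that severity levels are evenly spaced.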

Availability and implementation: The code written for this project is available at https://github.com/amw14/toxicity-cancer-drug-combination.

Citations: 0

diffMONT: predicting methylation-specific PCR biomarkers based on nanopore sequencing data for clinical application.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag039

Daria Meyer, Emanuel Barth, Laura Wiehle, Manja Marz

Motivation: DNA methylation serves as a key biomarker in clinical diagnostics, especially in cancer detection. With methylation-specific PCR (MSP), a widely used approach, patient samples can be screened quickly and efficiently for differential methylation. During MSP, methylated regions are selectively amplified with specific primers. With nanopore sequencing, knowledge about DNA methylation is generated during direct DNA sequencing without needing pretreatment of the DNA. Multiple methods, mainly developed for whole-genome bisulfite sequencing (WGBS) data, exist to predict differentially methylated regions (DMRs) in the genome. However, the predicted DMRs are often very large and not sufficiently discriminating to generate meaningful results in MSP. This creates a gap between theoretical cancer marker research and practical application, as no tool currently provides methylation difference predictions tailored for PCR-based diagnostics.

Results: Here, we present diffMONT, a tool that predicts differentially methylated regions specifically suited for MSP primer design, enabling rapid translation into practical applications. diffMONT takes into account (i) the specific length of primer and amplicon regions, (ii) the fact that one condition should be unmethylated, and (iii) a minimum required number of differentially methylated cytosines within the primer regions. We compared the results of diffMONT to metilene and DSS based on a publicly available nanopore sequencing dataset and show that the regions predicted by diffMONT are more specific toward hypermethylated regions. diffMONT accelerates the design of methylation-specific diagnostic assays, bridging the gap between theoretical research and clinical application.

Availability and implementation: The source code for diffMONT, an open-source Python-based tool, is available at https://github.com/rnajena/diffMONT/, with an archived release under https://zenodo.org/records/17641031.
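The three criteria listed in the Results can be illustrated with a small filter over per-cytosine methylation calls. This is only a sketch of the idea and not diffMONT's actual implementation or interface: the `CpG` record, the function names, and every threshold (`primer_len`, `min_diff_cpgs`, `min_delta`, `max_control`, `min_len`, `max_len`) are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpG:
    pos: int             # genomic coordinate of the cytosine
    meth_case: float     # methylation fraction in the case condition (0..1)
    meth_control: float  # methylation fraction in the control condition (0..1)


def primer_windows(cpgs, primer_len=20, min_diff_cpgs=3,
                   min_delta=0.5, max_control=0.1):
    """Return half-open (start, end) windows usable as one primer site.

    All thresholds are illustrative assumptions, not diffMONT defaults.
    """
    cpgs = sorted(cpgs, key=lambda c: c.pos)
    windows = []
    for i, first in enumerate(cpgs):
        start, end = first.pos, first.pos + primer_len  # criterion (i): fixed primer length
        inside = [c for c in cpgs[i:] if c.pos < end]
        # Criterion (ii): one condition (here the control) must be unmethylated.
        if any(c.meth_control > max_control for c in inside):
            continue
        # Criterion (iii): enough differentially methylated cytosines in the primer.
        n_diff = sum(1 for c in inside
                     if c.meth_case - c.meth_control >= min_delta)
        if n_diff >= min_diff_cpgs:
            windows.append((start, end))
    return windows


def amplicons(windows, min_len=80, max_len=150):
    """Pair non-overlapping primer windows whose outer span is a valid amplicon length."""
    pairs = []
    for fwd in windows:
        for rev in windows:
            if rev[0] >= fwd[1] and min_len <= rev[1] - fwd[0] <= max_len:
                pairs.append((fwd, rev))
    return pairs


# Toy data: two hypermethylated clusters and one cluster methylated in both conditions.
calls = (
    [CpG(p, 0.9, 0.0) for p in (100, 105, 110)]
    + [CpG(p, 0.8, 0.05) for p in (200, 204, 212)]
    + [CpG(p, 0.9, 0.5) for p in (300, 305, 310)]
)
sites = primer_windows(calls)
candidates = amplicons(sites)
```

In this toy input, the third cluster is rejected because the control condition is itself methylated there, which is the situation criterion (ii) is meant to exclude.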



Citations: 0

sRACIPE 2.0: a systems biology circuit modeling toolkit for random circuit perturbation.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag019

Aidan Tillman, Daniel Ramirez, Mingyang Lu

Summary: The Random Circuit Perturbation (RACIPE) algorithm enables the exploration of the dynamical behaviors of gene regulatory circuits (GRCs) by simulating an ensemble of differential equation models via randomization of kinetic parameters. Here, we release sRACIPE 2.0, a major update to the R/Bioconductor package, as a unified platform for modeling GRCs with diverse interaction types using both deterministic and stochastic simulations. The update also introduces new features for modeling perturbation, extrinsic signaling and time-corrected noise, and a new diagnostic tool to ensure proper simulations. We hope that this release will serve as a versatile modeling toolkit for the systems biology community.

Availability and implementation: The package is available on GitHub at https://github.com/lusystemsbio/sRACIPE under the MIT license. It is also available on Bioconductor at https://www.bioconductor.org/packages/release/bioc/html/sRACIPE.html.

The package and test data are also archived on Zenodo at https://doi.org/10.5281/zenodo.18202342.

Supplementary information: Supplementary information is available at Bioinformatics online, and the package structure is described in the package vignette. Additional vignettes can be found in the GitHub repository https://github.com/dan-ramirez-23/sRACIPE-Demos.
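The core idea in the Summary, simulating an ensemble of differential equation models with randomized kinetic parameters, can be sketched for a two-gene toggle switch (A inhibits B, B inhibits A). This is an illustration in Python and not the sRACIPE API, which is an R/Bioconductor package. The plain inhibitory Hill function, the forward Euler integrator, the parameter ranges, and all function names are assumptions made for this sketch.

```python
import random


def hill_inhibition(x, threshold, n):
    """Inhibitory Hill function: 1 at x = 0, falling toward 0 as x grows."""
    return 1.0 / (1.0 + (x / threshold) ** n)


def sample_params(rng):
    """Draw one random kinetic parameter set (ranges are arbitrary assumptions)."""
    return {
        "g_a": rng.uniform(1.0, 100.0),   # maximal production rate of A
        "g_b": rng.uniform(1.0, 100.0),   # maximal production rate of B
        "k_a": rng.uniform(0.1, 1.0),     # degradation rate of A
        "k_b": rng.uniform(0.1, 1.0),     # degradation rate of B
        "th_a": rng.uniform(1.0, 50.0),   # level of A that half-represses B
        "th_b": rng.uniform(1.0, 50.0),   # level of B that half-represses A
        "n": rng.randint(1, 6),           # Hill coefficient
    }


def simulate_toggle(p, a0, b0, dt=0.01, steps=5000):
    """Integrate the deterministic toggle-switch ODEs with forward Euler."""
    a, b = a0, b0
    for _ in range(steps):
        da = p["g_a"] * hill_inhibition(b, p["th_b"], p["n"]) - p["k_a"] * a
        db = p["g_b"] * hill_inhibition(a, p["th_a"], p["n"]) - p["k_b"] * b
        a, b = a + dt * da, b + dt * db
    return a, b


def random_circuit_ensemble(n_models=50, seed=0):
    """Simulate n_models random parameterizations of the same circuit topology."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_models):
        p = sample_params(rng)
        # Random initial condition between 0 and the maximal level g / k.
        a0 = rng.uniform(0.0, p["g_a"] / p["k_a"])
        b0 = rng.uniform(0.0, p["g_b"] / p["k_b"])
        ensemble.append((p, simulate_toggle(p, a0, b0)))
    return ensemble


ensemble = random_circuit_ensemble(n_models=10, seed=1)
```

Each entry pairs a sampled parameter set with the final expression levels of A and B. Collecting these end states across many models is what allows the behaviors supported by a circuit topology to be examined independently of any single parameter choice.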

Citations: 0

sedimix: a workflow for the analysis of hominin nuclear DNA sequences from sediments.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-01-03 DOI: 10.1093/bioinformatics/btag004

Jierui Xu, Elena I Zavala, Priya Moorjani

Summary: Sediment DNA, the recovery of genetic material from archaeological sediments, is an exciting new frontier in ancient DNA research, offering the potential to study individuals at a given archaeological site without destructive sampling. In recent years, several studies have demonstrated the promise of this approach by extracting hominin DNA from prehistoric sediments, including those dating back to the Middle or Late Pleistocene. However, a lack of open-source workflows for analysis of hominin sediment DNA samples poses a challenge for data processing and reproducibility of findings across studies. Here, we introduce a Snakemake workflow, sedimix, for processing genomic sequences from archaeological sediment DNA samples to identify hominin sequences and generate relevant summary statistics to assess the reliability of the pipeline. By performing simulations and comparing our results to two published studies with human DNA from ∼25,000 years ago (including shotgun data from a sediment sample and capture data from touch DNA recovered from a deer tooth pendant), we demonstrate that sedimix yields accurate and reliable inferences. sedimix offers a reliable and adaptable framework to aid in the analysis of sediment DNA datasets and improve reproducibility across studies.

Availability and implementation: sedimix is available as open-source software; the associated code, example data, and user manual with installation instructions are available at https://github.com/jierui-cell/sedimix. A permanent archived version of this release is available via Zenodo: https://doi.org/10.5281/zenodo.17244854.

Supplementary information: Supplementary data are available at Bioinformatics online.


Citations: 0