Motivation: The increasing prevalence of antibiotic-resistant bacteria has intensified the demand for novel antimicrobial agents. Antimicrobial peptides (AMPs) have emerged as promising alternatives, yet their identification and classification remain challenging due to the lack of multi-perspective information, insufficient feature representation learning, and reliance on a single data modality.
Results: In this paper, we propose a dual diffusion model-based representation learning framework for classifying AMPs, which integrates both peptide sequence and structure information to address these issues. Specifically, our approach uses a multi-view feature construction module, which encodes peptide sequences and structures from distinct perspectives, deriving initial feature representations with enriched biological semantics. To enhance representation learning, the framework applies two diffusion models, one to sequence information and one to structure information, to capture complex semantics from both modalities. In addition, both single-modal and dual-modal contrastive learning are used to further improve representation learning. Comprehensive experiments demonstrate that our model outperforms existing methods for AMP classification, providing a feasible solution for accelerating the discovery of novel antimicrobial agents.
Availability of implementation: The data and source code are available on GitHub at https://github.com/kww567upup/DDM.
Title: A dual diffusion model-based representation learning framework for antimicrobial peptides classification.
Authors: Wen Kong, Lingling Fu, Xingpeng Jiang, Weizhong Zhao
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag077; Journal: Bioinformatics (Oxford, England)
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag112
Heng Li, Brian Li
Motivation: Low-complexity (LC) DNA sequences are compositionally repetitive sequences that are often associated with spurious homologous matches and variant calling artifacts. While algorithms for identifying LC sequences exist, they either lack a concise mathematical definition of complexity or are inefficient with long or variable context windows.
Results: Longdust is a new algorithm that efficiently identifies long LC sequences, including centromeric satellites and tandem repeats with moderately long motifs. It defines string complexity by statistically modeling the k-mer count distribution with three parameters: the k-mer length, the context window size, and a threshold on complexity. Longdust exhibits high performance on real data and high consistency with existing methods.
Availability and implementation: https://github.com/lh3/longdust.
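To make the three parameters named above concrete (k-mer length, context window size, complexity threshold), here is a minimal sketch of a k-mer-count complexity score. It is an assumption-laden illustration, not Longdust's statistical model: the scoring formula is the classic DUST-style sum of c(c-1)/2 over k-mer counts, and the function names and default values are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def dust_like_score(seq, k=3):
    """DUST-style score: repeated k-mers push the score up.

    NOTE: this is a generic illustration, not Longdust's actual definition.
    """
    counts = kmer_counts(seq, k)
    n = sum(counts.values())  # number of k-mers in the sequence
    if n <= 1:
        return 0.0
    return sum(c * (c - 1) / 2 for c in counts.values()) / (n - 1)


def low_complexity_windows(seq, k=3, window=16, threshold=2.0):
    """Return (start, end) of every fixed-size window scoring above threshold.

    k, window and threshold are hypothetical defaults chosen for the example.
    """
    hits = []
    for start in range(0, max(len(seq) - window + 1, 1)):
        w = seq[start:start + window]
        if dust_like_score(w, k) > threshold:
            hits.append((start, start + len(w)))
    return hits


if __name__ == "__main__":
    print(dust_like_score("AAAAAAAAAA", k=3))   # homopolymer: high score
    print(dust_like_score("ACGTACGGTCA", k=3))  # mixed sequence: low score
    print(low_complexity_windows("ACGTACGGTCA" + "A" * 20))
```

This naive scan recomputes the counts for every window, which is exactly the kind of inefficiency with long windows that the abstract says existing approaches suffer from; it is shown only to clarify what the parameters control.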
Title: Finding low-complexity DNA sequences with longdust.
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag071
Ke Xu, Xin Maizie Zhou, Lu Zhang
Motivation: Spatially resolved transcriptomics (SRT) and spatially resolved proteomics (SRP) data enable the study of gene expression and protein abundances within their precise spatial and cellular contexts in tissues. Certain SRT and SRP technologies also capture corresponding morphology images, adding another layer of valuable information. However, few existing methods developed for SRT data effectively leverage these supplementary images to enhance clustering performance.
Results: Here, we introduce stDyer-image, an end-to-end deep learning framework designed for clustering SRT and SRP datasets with images. Unlike existing methods that use images to complement gene expression data, stDyer-image directly links image features to cluster labels. This approach draws inspiration from pathologists, who can visually identify specific cell types or tumor regions from morphological images without relying on gene expression or protein abundances. Benchmarks against state-of-the-art tools demonstrate that stDyer-image achieves superior performance in clustering. Moreover, it is capable of handling large-scale datasets across diverse technologies, making it a versatile and powerful tool for spatial omics analysis.
Availability and implementation: The source code of stDyer-image and detailed tutorials are available at https://github.com/ericcombiolab/stDyer-image.
Title: stDyer-image improves clustering analysis of spatially resolved transcriptomics and proteomics with morphological images.
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag107
Xu Shi, Yuqi Zhai, Xianshi Yu, Xiaoou Li, Brian L Hazlehurst, Denis B Nyongesa, Daniel S Sapp, Brian D Williamson, David S Carrell, Luesa Healy, Kara L Cushing-Haugen, Jenna Wong, Shirley V Wang, James S Floyd, Kathleen Shattuck, Samuel McGown, Sarah Alam, José J Hernández-Muñoz, Jie Li, Yong Ma, Danijela Stojanovic, Sudha R Raman, Sharon E Davis, Tianxi Cai, Jennifer C Nelson, Patrick J Heagerty
Motivation: Although common data models for electronic health record (EHR) data can facilitate multi-site data organization and querying, the same medical event may still be coded differently between healthcare systems. In this paper, we present statistical methods to identify and mitigate coding discrepancies using summary-level data, and demonstrate these methods using data from two FDA Sentinel data partners: Kaiser Permanente Washington and Kaiser Permanente Northwest.
Results: We first characterize differences in coding patterns, then compute a code mapping matrix to harmonize data between systems. Our findings reveal significant heterogeneity in coded EHR data, even after adopting a common data model with the same coding system, highlighting the importance of data harmonization before downstream analyses. Our study also demonstrates the effectiveness of the data harmonization approaches, which provide a foundational data quality step to promote semantic interoperability, enhance data integration, and improve the integrity of study conclusions.
Availability and implementation: Computation prototypes, including R/Python code and examples, are included in Section 7, available as supplementary data at Bioinformatics online, and will be posted on GitHub upon publication.
Title: Statistical methods to harmonize electronic health record data across healthcare systems: case study and lessons learned.
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag092
Rémi Trimbour, Julio Saez-Rodriguez, Laura Cantini
Motivation: Chromatin 3D folding creates numerous DNA interactions, participating in gene expression regulation. Single-cell chromatin-accessibility assays now profile hundreds of thousands of cells, challenging existing methods for mapping cis-regulatory interactions.
Results: We present CIRCE, a fast and scalable Python package to predict cis-regulatory DNA interactions from single-cell chromatin accessibility data. CIRCE re-implements the Cicero workflow to analyse single-cell atlases, cutting runtime and memory use by several orders of magnitude. We also provide new options to compute metacells, grouping similar cells to reduce data sparsity. We benchmarked CIRCE against Cicero on two datasets of different sizes and demonstrated the improvement from CIRCE's metacell strategy with promoter capture Hi-C data. We also evaluated how DNA interaction predictions are affected by different pre-processing choices. We observed a negative impact of Cicero's count normalization, and the best performance was obtained by using the single-cell count matrix directly. Finally, we demonstrated the scalability of CIRCE by processing a dataset of more than 700 000 cells and 1 million DNA regions in less than an hour. CIRCE should greatly facilitate the prediction of DNA region interactions for scverse and Python users, while providing new and up-to-date pre-processing insights.
Availability and implementation: CIRCE is released as open-source software under the AGPL-3.0 licence. The package source code is available on GitHub at https://github.com/cantinilab/CIRCE, and its documentation is accessible at https://circe.readthedocs.io. The code to reproduce the presented results is available as a Snakemake pipeline at https://github.com/cantinilab/circe_reproducibility.s.
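The metacell idea mentioned in the Results (grouping similar cells to reduce data sparsity) can be illustrated with a small sketch. This is not CIRCE's API: the function names are hypothetical, the group labels are assumed to come from some prior similarity step, and pooling is done by simple summation, which is only one possible aggregation choice.

```python
def aggregate_metacells(counts, labels):
    """Sum accessibility counts of all cells that share a metacell label.

    counts: list of per-cell rows (one value per DNA region).
    labels: one metacell label per cell (assumed precomputed, hypothetical).
    Returns a dict mapping each label to its pooled row.
    """
    n_regions = len(counts[0]) if counts else 0
    meta = {}
    for row, label in zip(counts, labels):
        pooled = meta.setdefault(label, [0] * n_regions)
        for j, value in enumerate(row):
            pooled[j] += value
    return meta


def sparsity(rows):
    """Fraction of zero entries in a list of rows."""
    values = [v for row in rows for v in row]
    return sum(1 for v in values if v == 0) / len(values)


if __name__ == "__main__":
    # Toy cell-by-region matrix; values are illustrative, not real data.
    cells = [
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [0, 0, 1],
    ]
    labels = [0, 0, 1, 1]
    meta = aggregate_metacells(cells, labels)
    print(meta)
    print(sparsity(cells), sparsity(list(meta.values())))
```

In this toy case pooling four cells into two metacells lowers the fraction of zeros, which is the effect the abstract attributes to metacells. Note that the abstract also reports the best performance when using the single-cell count matrix directly without Cicero's count normalization, so aggregation and normalization are separate choices.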
Title: CIRCE: a scalable Python package to predict cis-regulatory DNA interactions from single-cell chromatin accessibility data.
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag066
Elnaz Abdollahzadeh, Ali Mortazavi
Motivation: Oxford Nanopore (ONT) sequencing allows for the direct detection of RNA and DNA modifications from unamplified nucleic acids, which is a significant advantage over other platforms. However, the rapid updates to ONT basecalling models and the evolving landscape of computational tools for modification detection pose challenges for reproducible and standardized analyses. To address these challenges, we developed Dogme to automate basecalling, alignment, modification detection, and transcript quantification. Dogme automates the reprocessing of ONT POD5 files by integrating basecalling using Dorado, read mapping using minimap2, and subsequent analysis steps such as running modkit. The pipeline supports three major types of sequencing data: direct RNA (dRNA), complementary DNA (cDNA), and genomic DNA (gDNA). Dogme facilitates detection of diverse RNA modifications supported by Dorado, such as N6-methyladenosine (m6A), 5-methylcytosine (m5C), inosine, pseudouridine, and 2'-O-methylation (Nm), as well as DNA methylation, while concurrently quantifying full-length transcript isoforms with LR-Kallisto for dRNA and cDNA.
Results: We applied Dogme to three separate mouse C2C12 myoblast replicates using direct RNA sequencing on MinION flow cells. We detected 96 603 m6A, 43 476 m5C, 8829 inosine, 10 055 pseudouridine, and 30 320 Nm sites in three biological replicates. The pipeline produced reproducible modification profiles and transcript expression levels across replicates, demonstrating its utility for integrative long-read transcriptomic and epigenomic analyses.
Availability and implementation: Dogme is implemented in Nextflow and is freely available under the MIT license at https://github.com/mortazavilab/dogme, with documentation provided for installation and usage.
Title: Dogme: a nextflow pipeline for reprocessing nanopore RNA and DNA modifications.
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag067
Ki-Hwa Kim, Avinash Yaganapu, Sai Kosaraju, Aashish Bhatt, Yun Lyna Luo, Sai Phani Parsa, Juyeon Park, Hyun Lee, Jun Hyuck Lee, Tae-Jin Oh, Mingon Kang
Motivation: Prediction of Compound-Protein Interactions (CPI) in bacteria is crucial to advance various pharmaceutical and chemical engineering fields, including biocatalysis, drug discovery, and industrial processing. However, current CPI models cannot be applied to bacterial CPI prediction due to the lack of curated negative interaction samples.
Results: We propose a novel Positive-Unlabeled (PU) learning framework, named BIN-PU, to address this limitation. BIN-PU generates pseudo-positive and pseudo-negative labels from known positive interaction data, enabling effective training of deep learning models for CPI prediction. We also propose a weighted positive loss function that assigns weights to truly positive samples. We validated BIN-PU coupled with multiple CPI backbone models, comparing its performance with existing PU models using bacterial cytochrome P450 (CYP) data. Extensive experiments demonstrate the superiority of BIN-PU over the benchmark models in predicting CPIs with only truly positive samples. Furthermore, to assess reproducibility, we validated BIN-PU on additional bacterial proteins obtained from a literature review, on human CYP datasets, and on uncurated data. We also validated the CPI predictions for the uncurated CYP data with biological and biophysical experiments. BIN-PU represents a significant advancement in CPI prediction for bacterial proteins, opening new possibilities for improving predictive models in related biological interaction tasks.
Availability and implementation: The source code and data are available at https://github.com/datax-lab/CYP.
Title: Prediction of bacterial protein-compound interactions with only positive samples.
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag078
Freya E R Woods, Emilyanne Leonard, Timothy Ebbels, Jonathan Cairns, Rhiannon David
Motivation: Flow cytometry (FC) is a widely used technique for analysing cells or particles based on the fluorescence of specific markers. Thresholds for fluorescence are typically set manually, a laborious, subjective process that scales poorly as FC technology advances. Machine learning (ML) methods can address these issues but often require technical expertise many bench scientists do not possess. Thus, accessible, open-source, and cross-domain ML-based FC tools are needed.
Results: We present AutoFlow, an easy-to-use, adaptable R Shiny application for automated FC analysis. AutoFlow supports two workflows: supervised and unsupervised learning. The application automates key preprocessing steps including fluorescence compensation, debris exclusion, single-cell identification, viability marker gating, and downstream classification or clustering. Across three datasets, two publicly available (Mosmann Rare and Nilsson Rare) and a novel bone marrow microphysiological system (BM-MPS) dataset, AutoFlow demonstrated robust performance. In the supervised workflow, multiclass classification on BM-MPS achieved 97.2% accuracy under a single-timepoint training and multi-timepoint testing scheme, with high sensitivity and specificity across major lineages. For rare populations, performance was strong: Mosmann Rare (0.03% prevalence) achieved 87.5% sensitivity and 100% specificity, while Nilsson Rare (0.08% prevalence) achieved 87.9% sensitivity and 99.9% specificity. The unsupervised workflow accurately grouped cells into biologically meaningful clusters, recovering known populations and identifying additional candidate populations with marker profiles consistent with true biology. AutoFlow offers a fast, reproducible, and scalable solution for FC analysis, enabling high-throughput studies and improving the discovery of rare or unexpected cell types.
Availability and implementation: The application is available at https://github.com/FERWoods/AutoFlow for download using R. An archived version is available at DOI: 10.5281/zenodo.18235796.
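Because the rare-population results above are reported as sensitivity and specificity rather than accuracy alone, a short worked example may help show why. The sketch below uses the standard definitions; the confusion counts are invented for illustration and are not taken from the paper or its datasets.

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of real positives that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: fraction of real negatives correctly rejected."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Fraction of all events classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # Hypothetical counts: 10 rare cells among 10 000 events (0.1% prevalence).
    tp, fn, fp, tn = 9, 1, 2, 9988
    print(f"sensitivity = {sensitivity(tp, fn):.3f}")
    print(f"specificity = {specificity(tn, fp):.4f}")
    print(f"accuracy    = {accuracy(tp, tn, fp, fn):.4f}")

    # A classifier that never calls the rare population still looks accurate,
    # but its sensitivity is zero.
    print(f"all-negative accuracy    = {accuracy(0, 9990, 0, 10):.4f}")
    print(f"all-negative sensitivity = {sensitivity(0, 10):.3f}")
```

At very low prevalence, accuracy is dominated by the abundant negative events, so sensitivity is the number that shows whether the rare cells were actually found.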
Title: AutoFlow: an interactive Shiny app for supervised and unsupervised flow cytometry analysis.
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag089
Jacob T Mitchell, Orian Stapleton, Kavita Krishnan, Sushma Nagaraj, Dmitrijs Lvovs, Christopher Cherry, Amanda Poissonnier, Wesley Horton, Andrew Adey, Varun Rao, Amanda Huff, Jacquelyn W Zimmerman, Luciane T Kagohara, Neeha Zaidi, Lisa M Coussens, Elizabeth M Jaffee, Jennifer H Elisseeff, Elana J Fertig
Motivation: Algorithms for ligand-receptor network inference have emerged as commonly used tools to estimate cell-cell communication from reference single-cell data. Many studies employ these algorithms to compare signaling between conditions but lack methods to statistically identify signals that are significantly different. We previously developed the cell communication inference algorithm Domino, which considers ligand and receptor gene expression in association with downstream transcription factor activity scoring. We developed the dominoSignal software to build upon Domino and extend its functionality to statistically test for differential cellular signaling.
Results: This new functionality includes the compilation of active signals as linkages from multiple subjects in a single-cell data set and testing for condition-dependent signaling linkages. The software is applicable to the analysis of single-cell data sets with multiple subjects as biological replicates as well as with bootstrapped replicates from data sets with few or pooled subjects. We use simulation studies to benchmark how many subjects per compared group and how many cells per annotated cell type are sufficient to accurately identify differential linkages. We demonstrate the application of the Differential Cell Signaling Test (DCST) in the dominoSignal software to investigate the consequences of cancer cell phenotypes and immunotherapy on cell-cell communication in tumor microenvironments. These applications in cancer studies demonstrate the ability of differential cell signaling analysis to infer changes to cell communication networks from therapeutic or experimental perturbations, an approach that is broadly applicable across biological systems.
Availability: dominoSignal is available through Bioconductor at https://www.bioconductor.org/packages/release/bioc/html/dominoSignal.html.
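The idea of testing whether a signaling linkage depends on condition can be illustrated with a small sketch. The abstract does not state which statistic the DCST uses, so the example below assumes a two-sided Fisher's exact test on per-subject counts (linkage active or inactive, by condition). The function names, the data layout, and the linkage label `ligandA->receptorB` are hypothetical and are not part of the dominoSignal API.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x "active" subjects in group 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def differential_linkage(active_by_subject, condition_by_subject, linkage, groups):
    """Test whether `linkage` is active in different proportions of subjects
    between the two conditions named in `groups` (assumed test, see lead-in)."""
    g1, g2 = groups
    counts = {g1: [0, 0], g2: [0, 0]}  # [active, inactive] per condition
    for subject, linkages in active_by_subject.items():
        group = condition_by_subject[subject]
        counts[group][0 if linkage in linkages else 1] += 1
    (a, b), (c, d) = counts[g1], counts[g2]
    return fisher_exact_two_sided(a, b, c, d)


# Hypothetical toy data: four treated and four control subjects
active = {
    "s1": {"ligandA->receptorB"}, "s2": {"ligandA->receptorB"},
    "s3": {"ligandA->receptorB"}, "s4": {"ligandA->receptorB"},
    "s5": set(), "s6": set(), "s7": set(), "s8": set(),
}
condition = {s: "treated" for s in ("s1", "s2", "s3", "s4")}
condition.update({s: "control" for s in ("s5", "s6", "s7", "s8")})

p = differential_linkage(active, condition, "ligandA->receptorB", ("treated", "control"))
print(f"p = {p:.4f}")
```

In this toy case the linkage is active in all four treated subjects and in none of the controls, so the p-value is 2/70, roughly 0.029. With so few subjects per group even a perfect split barely passes a 0.05 threshold, which is one reason benchmarking the required number of subjects matters.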
Title: Differential cell signaling testing for cell-cell communication inference from single-cell data by dominoSignal. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12998610/pdf/
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag103
Chen-Hao Huang, Yen-Jen Oyang, Hsuan-Cheng Huang, Hsueh-Fen Juan
Summary: Identifying drugs that target intercellular communication networks represents a promising therapeutic strategy, yet linking single-cell RNA sequencing (scRNA-seq) analysis to structure-based drug screening remains technically challenging and requires substantial bioinformatics expertise. We present scDock, an integrated and user-friendly pipeline that seamlessly connects scRNA-seq data processing, cell-cell communication inference, and molecular docking-based drug discovery. Through a single configuration file, users can execute the complete workflow, from raw scRNA-seq data to ranked drug candidates, without programming skills. scDock automates the identification of disease-relevant ligand-receptor interactions from scRNA-seq data and performs structure-based virtual screening against these communication targets using Protein Data Bank (PDB) or AlphaFold-predicted protein structures. The pipeline generates comprehensive outputs at each stage, enabling users to explore intercellular signaling alterations and discover therapeutic compounds targeting specific cell-cell communications. scDock addresses a critical gap by providing an accessible end-to-end solution for communication-targeted drug discovery from single-cell data.
Availability and implementation: scDock is freely available at https://doi.org/10.6084/m9.figshare.31370368 and https://github.com/Andrewneteye4343/scDock. It is implemented in R, Python, and shell scripts and supports Linux systems, including Ubuntu and Debian.
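The final step of such a workflow, turning docking results into ranked drug candidates, might look like the sketch below. It assumes docking scores are binding energies in kcal/mol where lower (more negative) is better, as is conventional for many docking tools; the abstract does not specify scDock's scoring or output format. The `DockingResult` type, the `rank_candidates` function, and the compound and target names are hypothetical.

```python
from typing import NamedTuple


class DockingResult(NamedTuple):
    compound: str
    target: str               # receptor or ligand protein from a communication pair
    affinity_kcal_mol: float  # assumed convention: lower means stronger binding


def rank_candidates(results, top_n=3):
    """Keep the best pose per (compound, target) pair, then rank by affinity."""
    best = {}
    for r in results:
        key = (r.compound, r.target)
        if key not in best or r.affinity_kcal_mol < best[key].affinity_kcal_mol:
            best[key] = r
    return sorted(best.values(), key=lambda r: r.affinity_kcal_mol)[:top_n]


# Hypothetical docking output with several poses per compound
results = [
    DockingResult("compound_1", "receptor_X", -7.2),
    DockingResult("compound_1", "receptor_X", -8.1),
    DockingResult("compound_2", "receptor_X", -9.4),
    DockingResult("compound_3", "receptor_Y", -6.0),
]

for r in rank_candidates(results):
    print(r.compound, r.target, r.affinity_kcal_mol)
```

Keeping only the best pose per compound and target pair prevents a compound with many mediocre poses from crowding the ranked list.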
Title: scDock: streamlining drug discovery targeting cell-cell communication via scRNA-seq analysis and molecular docking. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12996892/pdf/