Bioinformatics (Oxford, England): latest articles, page 10

GenomeDecoder: inferring segmental duplications in highly repetitive genomic regions.

Bioinformatics (Oxford, England)

Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf058

Zhenmiao Zhang, Ishaan Gupta, Pavel A Pevzner

Motivation: The emergence of 'telomere-to-telomere' genomics brought the challenge of identifying segmental duplications (SDs) in complete genomes. It also opened the possibility of identifying differences in SDs across individual human genomes and of studying SD evolution. These new challenges require algorithms for reconstructing SDs in the most complex genomic regions, such as the rapidly evolving immunoglobulin loci, whose architecture has evaded all previous attempts at analysis.

Results: We describe the GenomeDecoder algorithm for inferring SDs and apply it to analyzing genomic architectures of various loci in primate genomes. Our analysis revealed that multiple duplications/deletions led to a rapid birth/death of immunoglobulin genes within the human population and large changes in genomic architecture of immunoglobulin loci across primate genomes. Comparison of immunoglobulin loci across primate genomes suggests that they are subjected to diversifying selection.

Availability and implementation: GenomeDecoder is available at https://github.com/ZhangZhenmiao/GenomeDecoder. The software version and test data used in this paper are uploaded to https://doi.org/10.5281/zenodo.14753844.


Citations: 0

VirDetector: a bioinformatic pipeline for virus surveillance using nanopore sequencing.

Bioinformatics (Oxford, England)

Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf029

Nick Laurenz Kaiser, Martin H Groschup, Balal Sadeghi

Summary: Virus surveillance programmes are designed to counter the growing threat of viral outbreaks to human health. Nanopore sequencing, in particular, has proven to be suitable for this purpose, as it is readily available and provides rapid results. However, as special bioinformatic programs are required to extract the relevant information from the sequencing data, applications are needed that allow users without extensive bioinformatics knowledge to carry out the relevant analysis steps. We present VirDetector, a bioinformatic pipeline for virus surveillance using nanopore sequencing. The pipeline automatically installs all required programs and databases and allows all its steps to be executed with a single console command. After preprocessing the samples, including the possibility for basecalling, the pipeline classifies each sample taxonomically and reconstructs the viral consensus genomes, which are then used in phylogenetic analyses. This streamlined workflow provides a user-friendly and efficient solution for monitoring viral pathogens.

Availability and implementation: VirDetector is freely available at https://github.com/NLKaiser/VirDetector and https://zenodo.org/records/14637302 (10.5281/zenodo.14637302).

Supplementary information: Supplementary data are available at Bioinformatics online.


Citations: 0

CRAmed: a conditional randomization test for high-dimensional mediation analysis in sparse microbiome data.

Bioinformatics (Oxford, England)

Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf038

Tiantian Liu, Xiangnan Xu, Tao Wang, Peirong Xu

Motivation: Numerous microbiome studies have revealed significant associations between the microbiome and human health and disease. These findings have motivated researchers to explore the causal role of the microbiome in human complex traits and diseases. However, the complexities of microbiome data pose challenges for statistical analysis and interpretation of causal effects.

Results: We introduced a novel statistical framework, CRAmed, for inferring the mediating role of the microbiome between treatment and outcome. CRAmed improved the interpretability of the mediation analysis by decomposing the natural indirect effect into two parts, corresponding to the presence-absence and the abundance of a microbe, respectively. Comprehensive simulations demonstrated the superior performance of CRAmed in recall, precision, and F1 score, with a notable level of robustness, compared to existing mediation analysis methods. Furthermore, two real data applications illustrated the effectiveness and interpretability of CRAmed. Our research revealed that CRAmed holds promise for uncovering the mediating role of the microbiome and for understanding the factors influencing host health.
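The title names a conditional randomization test, but the abstract does not spell out how CRAmed applies it. As a rough illustration of the general technique only, the sketch below tests whether a variable x is independent of y given z by redrawing x from an assumed conditional model of x given z. The function name `crt_pvalue`, the Gaussian linear model for x given z, and the covariance statistic are all assumptions made for this example; none of them is taken from the CRAmed package.

```python
import numpy as np

def crt_pvalue(x, y, z, n_draws=500, seed=0):
    """Generic conditional randomization test of 'x independent of y given z'.

    Illustrative assumption: x | z is Gaussian with a mean linear in z.
    This is NOT the model used by CRAmed.
    """
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    rng = np.random.default_rng(seed)
    design = np.column_stack([np.ones_like(z), z])

    # Fit the assumed conditional model x | z, and adjust y for z as well.
    x_coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    y_coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    x_fitted = design @ x_coef
    y_resid = y - design @ y_coef
    sigma = np.std(x - x_fitted, ddof=2)

    def statistic(x_vec):
        # Absolute covariance between z-adjusted x and z-adjusted y.
        return abs(np.mean((x_vec - x_fitted) * y_resid))

    observed = statistic(x)
    # Redraw x from the fitted conditional model; y and z stay fixed.
    null_stats = np.array([
        statistic(x_fitted + sigma * rng.standard_normal(x.size))
        for _ in range(n_draws)
    ])
    # The +1 terms keep the p-value strictly positive.
    return (1 + np.sum(null_stats >= observed)) / (1 + n_draws)
```

The redraws keep the relationship between x and z intact while breaking any link between x and y beyond what z explains, which is what makes the resulting p-value conditional on z.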

Availability and implementation: The R package CRAmed implementing the proposed methods is available online at https://github.com/liudoubletian/CRAmed.


Citations: 0

PhyloMix: enhancing microbiome-trait association prediction through phylogeny-mixing augmentation.

Bioinformatics (Oxford, England)

Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf014

Yifan Jiang, Disen Liao, Qiyun Zhu, Yang Young Lu

Motivation: Understanding the associations between traits and microbial composition is a fundamental objective in microbiome research. Recently, researchers have turned to machine learning (ML) models to achieve this goal with promising results. However, the effectiveness of advanced ML models is often limited by the unique characteristics of microbiome data, which are typically high-dimensional, compositional, and imbalanced. These characteristics can hinder the models' ability to fully explore the relationships among taxa in predictive analyses. To address this challenge, data augmentation has become crucial. It involves generating synthetic samples with artificial labels based on existing data and incorporating these samples into the training set to improve ML model performance.

Results: Here, we propose PhyloMix, a novel data augmentation method specifically designed for microbiome data to enhance predictive analyses. PhyloMix leverages the phylogenetic relationships among microbiome taxa as an informative prior to guide the generation of synthetic microbial samples. Leveraging phylogeny, PhyloMix creates new samples by removing a subtree from one sample and combining it with the corresponding subtree from another sample. Notably, PhyloMix is designed to address the compositional nature of microbiome data, effectively handling both raw counts and relative abundances. This approach introduces sufficient diversity into the augmented samples, leading to improved predictive performance. We empirically evaluated PhyloMix on six real microbiome datasets across five commonly used ML models. PhyloMix significantly outperforms distinct baseline methods including sample-mixing-based data augmentation techniques like vanilla mixup and compositional cutmix, as well as the phylogeny-based method TADA. We also demonstrated the wide applicability of PhyloMix in both supervised learning and contrastive representation learning.
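The subtree exchange described above can be sketched on a toy example. This is a minimal illustration, not the PhyloMix implementation: the dictionary encoding of the tree, the function names `leaves_under` and `subtree_swap`, and the choice to renormalize the mixed sample to sum to one are all assumptions introduced here.

```python
import numpy as np

def leaves_under(tree, node):
    """Collect leaf indices below `node`.

    Hypothetical encoding: leaves are ints (taxon indices); internal nodes
    are keys of `tree` mapping to a list of children.
    """
    if isinstance(node, int):
        return [node]
    leaves = []
    for child in tree[node]:
        leaves.extend(leaves_under(tree, child))
    return leaves

def subtree_swap(sample_a, sample_b, subtree_leaves, renormalize=True):
    """Copy sample A, then overwrite one subtree's taxa with sample B's values."""
    mixed = np.asarray(sample_a, dtype=float).copy()
    donor = np.asarray(sample_b, dtype=float)
    idx = np.asarray(subtree_leaves, dtype=int)
    mixed[idx] = donor[idx]
    # Assumption: for relative abundances, rescale so the sample sums to 1 again.
    if renormalize and mixed.sum() > 0:
        mixed = mixed / mixed.sum()
    return mixed

# Toy phylogeny with four taxa in two clades.
toy_tree = {"root": ["clade_x", "clade_y"], "clade_x": [0, 1], "clade_y": [2, 3]}
sample_a = [0.4, 0.4, 0.1, 0.1]
sample_b = [0.1, 0.1, 0.4, 0.4]
mixed = subtree_swap(sample_a, sample_b, leaves_under(toy_tree, "clade_y"))
```

For raw counts, passing `renormalize=False` leaves the swapped values untouched. How PhyloMix itself reconciles the two samples' totals and assigns labels to the synthetic sample is not stated in the abstract.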

Availability and implementation: The Apache-licensed source code is available at (https://github.com/batmen-lab/phylomix).

Supplementary information: Supplementary data are available at Bioinformatics online.


Citations: 0

MEGA-GO: functions prediction of diverse protein sequence length using Multi-scalE Graph Adaptive neural network.

Bioinformatics (Oxford, England)

Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf032

Yujian Lee, Peng Gao, Yongqi Xu, Ziyang Wang, Shuaicheng Li, Jiaxing Chen

Motivation: The increasing accessibility of large-scale protein sequences through advanced sequencing technologies has necessitated the development of efficient and accurate methods for predicting protein function. Computational prediction models have emerged as a promising solution to expedite the annotation process. However, despite making significant progress in protein research, graph neural networks face challenges in capturing long-range structural correlations and identifying critical residues in protein graphs. Furthermore, existing models have limitations in effectively predicting the function of newly sequenced proteins that are not included in protein interaction networks. This highlights the need for novel approaches integrating protein structure and sequence data.

Results: We introduce Multi-scalE Graph Adaptive neural network (MEGA-GO), highlighting the capability of capturing diverse protein sequence length features from multiple scales. The unique graph adaptive neural network architecture of MEGA-GO enables a more nuanced extraction of graph structure features, effectively capturing intricate relationships within biological data. Experimental results demonstrate that MEGA-GO outperforms mainstream protein function prediction models in the accuracy of Gene Ontology term classification, yielding 33.4%, 68.9%, and 44.6% of area under the precision-recall curve on biological process, molecular function, and cellular component domains, respectively. The rest of the experimental results reveal that our model consistently surpasses the state-of-the-art methods.

Availability and implementation: The source code and data of MEGA-GO are available at https://github.com/Cheliosoops/MEGA-GO.


Citations: 0

Multidimensional scaling improves distance-based clustering for microbiome data.

Bioinformatics (Oxford, England)

Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf042

Guanhua Chen, Xinyue Wang, Qiang Sun, Zheng-Zheng Tang

Motivation: Clustering patients into subgroups based on their microbial compositions can greatly enhance our understanding of the role of microbes in human health and disease etiology. Distance-based clustering methods, such as partitioning around medoids (PAM), are popular due to their computational efficiency and absence of distributional assumptions. However, the performance of these methods can be suboptimal when true cluster memberships are driven by differences in the abundance of only a few microbes, a situation known as the sparse signal scenario.

Results: We demonstrate that classical multidimensional scaling (MDS), a widely used dimensionality reduction technique, effectively denoises microbiome data and enhances the clustering performance of distance-based methods. We propose a two-step procedure that first applies MDS to project high-dimensional microbiome data into a low-dimensional space, followed by distance-based clustering using the low-dimensional data. Our extensive simulations demonstrate that our procedure offers superior performance compared to directly conducting distance-based clustering under the sparse signal scenario. The advantage of our procedure is further showcased in several real data applications.
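The two-step procedure can be sketched as follows, assuming a precomputed pairwise distance matrix between samples. This is an illustrative outline, not the MDSMClust package: the function names are invented for the example, and the clustering step is a simple alternating k-medoids with a deterministic farthest-point start, used here as a lightweight stand-in for PAM.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS from a pairwise distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering  # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

def pairwise_euclidean(points):
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def k_medoids(dist, k, n_iter=100):
    """Alternating k-medoids; a simplified stand-in for PAM."""
    # Deterministic start: most central point, then farthest-point additions.
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        updated = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                within = dist[np.ix_(members, members)].sum(axis=0)
                updated[c] = members[np.argmin(within)]
        if np.array_equal(updated, medoids):
            break
        medoids = updated
    return np.argmin(dist[:, medoids], axis=1), medoids

def mds_then_cluster(dist, k, n_components=2):
    """Step 1: embed with MDS. Step 2: cluster on low-dimensional distances."""
    embedding = classical_mds(dist, n_components)
    return k_medoids(pairwise_euclidean(embedding), k)
```

The number of retained MDS dimensions (`n_components`) controls how much of the original distance structure is discarded as noise. The abstract does not say how the authors choose it, so the default of 2 here is arbitrary.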

Availability and implementation: The R package MDSMClust is available at https://github.com/wxy929/MDS-project.


Citations: 0

Trajectory Inference with Cell-Cell Interactions (TICCI): intercellular communication improves the accuracy of trajectory inference methods.

Bioinformatics (Oxford, England)

Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf027

Yifeng Fu, Hong Qu, Dacheng Qu, Min Zhao

Motivation: Understanding cell differentiation and development dynamics is key for single-cell transcriptome analysis. Current cell differentiation trajectory inference algorithms face challenges such as high dimensionality, noise, and the need for users to have prior biological information about the datasets in order to use the algorithms effectively. Here, we introduce Trajectory Inference with Cell-Cell Interactions (TICCI), a novel way to address these challenges by integrating intercellular communication information. Recognizing that intercellular communication is crucial during development, TICCI proposes Cell-Cell Interactions (CCI) at single-cell resolution. We posit that cells exhibiting more similar gene expression patterns are more likely to exchange information via biomolecular mediators.

Results: TICCI is initiated by constructing a cell-neighborhood matrix using edge weights composed of intercellular similarity and CCI information. Louvain partitioning identifies trajectory branches, attenuating noise, while single-cell entropy (scEntropy) is used to assess differentiation status. The Chu-Liu algorithm constructs a directed least-square model to identify trajectory branches, and an improved diffusion fitted time algorithm computes cell-fitted time in nonconnected topologies. TICCI validation on single-cell RNA sequencing (scRNA-seq) datasets confirms the accuracy of cell trajectories, aligning with genealogical branching and gene markers. Verification using extrinsic information labels demonstrates CCI information utility in enhancing accurate trajectory inference. A comparative analysis establishes TICCI proficiency in accurate temporal ordering.

Availability and implementation: Source code and binaries are freely available for download at https://github.com/mine41/TICCI, implemented in R (version 4.32) and Python (version 3.7.16) and supported on MS Windows. The authors ensure that the software will remain available for a full two years following publication.


Citations: 0

ESPClust: unsupervised identification of modifiers for the effect size profile in omics association studies.

Bioinformatics (Oxford, England)

Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf065

Francisco J Pérez-Reche, Nathan J Cheetham, Ruth C E Bowyer, Ellen J Thompson, Francesca Tettamanzi, Cristina Menni, Claire J Steves

Motivation: High-throughput omics technologies have revolutionized the identification of associations between individual traits and underlying biological characteristics, but still use 'one effect-size fits all' approaches. While covariates are often used, their potential as effect modifiers often remains unexplored.

Results: We propose ESPClust, a novel unsupervised method designed to identify covariates that modify the effect size of associations between sets of omics variables and outcomes. By extending the concept of moderators to encompass multiple exposures, ESPClust analyses the effect size profile (ESP) to identify regions in covariate space with different ESP, enabling the discovery of subpopulations with distinct associations. Applying ESPClust to synthetic data, insulin resistance and COVID-19 symptom manifestation, we demonstrate its versatility and ability to uncover nuanced effect size modifications that traditional analyses may overlook. By integrating information from multiple exposures, ESPClust identifies effect size modifiers in datasets that are too small for traditional univariate stratified analyses. This method provides a robust framework for understanding complex omics data and holds promise for personalised medicine.

Availability and implementation: The ESPClust source code is available at https://github.com/fjpreche/ESPClust.git. It can be installed from Python package repositories with 'pip install ESPClust==1.1.0'.
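The effect size profile idea in the Results above can be illustrated with a small sketch. This is not the ESPClust package's API, which the abstract does not describe. The function names (`effect_size_profile`, `split_windows`), the equal-count windowing of the covariate, the univariate least-squares slope used as the effect size, and the two-group split are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np

def effect_size_profile(exposures, outcome, covariate, n_windows=4):
    """Return an (n_windows, n_exposures) array of per-window slopes.

    Windows are equal-count bins of the covariate (an assumption of this sketch).
    """
    order = np.argsort(covariate)
    bins = np.array_split(order, n_windows)
    profile = np.empty((n_windows, exposures.shape[1]))
    for w, idx in enumerate(bins):
        y = outcome[idx]
        for j in range(exposures.shape[1]):
            x = exposures[idx, j]
            # Univariate least-squares slope of the outcome on exposure j.
            xc = x - x.mean()
            profile[w, j] = np.dot(xc, y - y.mean()) / np.dot(xc, xc)
    return profile

def split_windows(profile):
    """Label windows 0/1 by the closer of the two most distant profiles."""
    d = np.linalg.norm(profile[:, None, :] - profile[None, :, :], axis=2)
    a, b = np.unravel_index(np.argmax(d), d.shape)
    return (d[:, b] < d[:, a]).astype(int)

# Synthetic data: the sign of every exposure's effect flips for age > 50,
# so age acts as an effect size modifier by construction.
rng = np.random.default_rng(0)
n = 400
age = rng.uniform(20, 80, n)
X = rng.normal(size=(n, 3))
sign = np.where(age > 50, -1.0, 1.0)
y = sign * X.sum(axis=1) + 0.1 * rng.normal(size=n)

esp = effect_size_profile(X, y, age, n_windows=4)
labels = split_windows(esp)
print(esp.round(2))
print(labels)
```

In this toy setting the two youngest age windows and the two oldest age windows end up in different groups, which is the kind of covariate region with a distinct profile that the abstract refers to. A real analysis would need a principled choice of windows and clustering method.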


Citations: 0

ipd: an R package for conducting inference on predicted data.

Bioinformatics (Oxford, England)

Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf055

Stephen Salerno, Jiacheng Miao, Awan Afiaz, Kentaro Hoffman, Anna Neufeld, Qiongshi Lu, Tyler H McCormick, Jeffrey T Leek

Summary: ipd is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning prediction algorithm. The package implements several recently proposed methods for inference on predicted data with a single, user-friendly wrapper function, ipd. The package also provides custom print, summary, tidy, glance, and augment methods to facilitate easy model inspection. This document introduces the ipd software package and provides a demonstration of its basic usage.

Availability: ipd is freely available on CRAN or as a developer version at our GitHub page: github.com/ipd-tools/ipd. Full documentation, including detailed instructions and a usage 'vignette', is available at github.com/ipd-tools/ipd.

Citations: 0

Enhanced O-glycosylation site prediction using explainable machine learning technique with spatial local environment.

Bioinformatics (Oxford, England)

Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf034

Seokyoung Hong, Krishna Gopal Chattaraj, Jing Guo, Bernhardt L Trout, Richard D Braatz

Motivation: The accurate prediction of O-GlcNAcylation sites is crucial for understanding disease mechanisms and developing effective treatments. Previous machine learning (ML) models primarily relied on primary or secondary protein structural and related properties, which have limitations in capturing the spatial interactions of neighboring amino acids. This study introduces local environmental features as a novel approach that incorporates three-dimensional spatial information, significantly improving model performance by considering the spatial context around the target site. Additionally, we utilize sparse recurrent neural networks to effectively capture the sequential nature of the proteins and to identify key factors influencing O-GlcNAcylation as an explainable ML model.

Results: Our findings demonstrate the effectiveness of the proposed features, with the model achieving an F1 score of 28.3%, and of feature selection, with the model using only the top 20% of features achieving the highest F1 score of 32.02%, a 1.4-fold improvement over existing PTM models. Statistical analysis of the top 20 features confirmed their consistency with the literature. This method not only boosts prediction accuracy but also paves the way for further research in understanding and targeting O-GlcNAcylation.

Availability and implementation: All code, data, and features used in this study are available in the GitHub repository: https://github.com/pseokyoung/o-glcnac-prediction.
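For readers less familiar with the metric reported in the Results above, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion counts. The counts are hypothetical and chosen only to show the arithmetic; they are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R), equivalently 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    # Return 0.0 when there are no positives and no positive predictions.
    return 2 * tp / denom if denom else 0.0

# Hypothetical confusion counts for a rare positive class.
example = f1_score(tp=30, fp=80, fn=60)
print(f"F1 = {example:.3f}")
```

Because true negatives do not enter the formula, F1 reflects performance on the positive class only, so it can stay modest even when most sites are classified correctly.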


Citations: 0