Pub Date: 2024-10-01 DOI: 10.1186/s12859-024-05944-x
Liang Bai, Boya Ji, Shulin Wang
Background: Single-cell RNA sequencing (scRNA-seq) technology has emerged as a crucial tool for studying cellular heterogeneity. However, dropout events are inherent to the sequencing process and pose challenges for downstream analysis and interpretation. Imputing dropout data is therefore a critical concern in scRNA-seq data analysis. Present imputation methods predominantly rely on statistical or machine learning approaches and often overlook inter-sample correlations.
Results: To address this limitation, we introduce SAE-Impute, a new computational method that imputes single-cell data by combining subspace regression and autoencoders to improve the accuracy and reliability of imputation. Specifically, SAE-Impute assesses sample correlations via subspace regression, predicts potential dropout values, and then leverages these predictions within an autoencoder framework for interpolation. To validate the performance of SAE-Impute, we systematically conducted experiments on both simulated and real scRNA-seq datasets. The results show that SAE-Impute effectively reduces false negative signals in single-cell data and improves the recovery of dropout values, gene-gene correlations, and cell-cell correlations. Finally, we conducted several downstream analyses on the imputed scRNA-seq data, including identification of differential gene expression, cell clustering and visualization, and cell trajectory construction.
Conclusions: These results again demonstrate that SAE-Impute effectively reduces dropouts in single-cell datasets, thereby improving the functional interpretability of the data.
Title: SAE-Impute: imputation for single-cell data via subspace regression and auto-encoders. Journal: BMC Bioinformatics (impact factor 2.9). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11443887/pdf/
Pub Date: 2024-10-01 DOI: 10.1186/s12859-024-05935-y
Rhalena A Thomas, Michael R Fiorini, Saeid Amiri, Edward A Fon, Sali M K Farhan
Background: Single-cell RNA sequencing (scRNAseq) offers powerful insights, but the surge in sample sizes demands more computational power than local workstations can provide. Consequently, high-performance computing (HPC) systems have become imperative. Existing web apps designed to analyze scRNAseq data lack scalability and integration capabilities, while analysis packages demand coding expertise, hindering accessibility.
Results: In response, we introduce scRNAbox, an scRNAseq analysis pipeline designed for HPC systems. This end-to-end solution, executed via the SLURM workload manager, efficiently processes raw data from standard and Hashtag samples. It incorporates quality control filtering, sample integration, clustering, and cluster annotation tools, and it facilitates cell type-specific differential gene expression analysis between two groups. We demonstrate the application of scRNAbox by analyzing two publicly available datasets.
Conclusion: ScRNAbox is a comprehensive end-to-end pipeline designed to streamline the processing and analysis of scRNAseq data. By responding to the pressing demand for a user-friendly HPC solution, scRNAbox bridges the gap between the growing computational demands of scRNAseq analysis and the coding expertise required to meet them.
Title: ScRNAbox: empowering single-cell RNA sequencing on high performance computing systems. Journal: BMC Bioinformatics (impact factor 2.9). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11443813/pdf/
Pub Date: 2024-10-01 DOI: 10.1186/s12859-024-05943-y
Yunfei Gao, Albert No
Background: Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of $h$ consecutive identical bases (homopolymer constraint $h$), and 2) a GC ratio within $[0.5 - c_{\mathrm{GC}},\ 0.5 + c_{\mathrm{GC}}]$ (GC content constraint $c_{\mathrm{GC}}$). Sequencing or synthesis errors tend to increase when these constraints are violated.
Results: In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity as the block length increases and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when $h = 4$ and $c_{\mathrm{GC}} = 0.05$, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub.
Conclusion: We propose a variable-to-variable-length encoding method that achieves near-optimal rates without relying on the concatenation of short predefined sequences.
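The two constraints named in this abstract can be made concrete with a small checker. The sketch below is only an illustration of the constraint definitions, not the authors' coding technique; the function names `max_homopolymer_run`, `gc_ratio`, and `satisfies_constraints` are hypothetical and chosen here for readability.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def gc_ratio(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int, c_gc: float) -> bool:
    """Check the homopolymer constraint h and the GC content constraint c_gc.

    Hypothetical helper: a sequence passes when no run exceeds h bases
    and its GC ratio lies in [0.5 - c_gc, 0.5 + c_gc].
    """
    return max_homopolymer_run(seq) <= h and abs(gc_ratio(seq) - 0.5) <= c_gc
```

With $h = 4$ and $c_{\mathrm{GC}} = 0.05$, a balanced sequence such as `ACGTACGT` passes, while `AAAAACGT` fails on the run of five A bases and `GGCCGGCC` fails on GC content.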
Title: Efficient and low-complexity variable-to-variable length coding for DNA storage. Journal: BMC Bioinformatics (impact factor 2.9). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446080/pdf/
Pub Date: 2024-10-01 DOI: 10.1186/s12859-024-05947-8
Sushma Koirala, Harman Sharma, Yee Lian Chew, Anna Konopka
Background: Growing interest in DNA damage in neurodegeneration has created a need for tools dedicated to analyzing DNA damage in neurons. Double-stranded breaks (DSBs) are among the most detrimental types of DNA damage and have become a subject of intensive research. DSBs result in DNA damage foci, which are detectable with the marker γH2AX. Manual counting of DNA damage foci is challenging and biased, and there is a lack of open-source programs optimized specifically for neurons. Thus, we developed a new, fully automated application, SimplySmart_v1, for DNA damage quantification and optimized its performance specifically in primary neurons cultured in vitro.
Results: SimplySmart_v1 accurately detects the induction of DNA damage by etoposide in primary neurons relative to control neurons. It also accurately quantifies DNA damage in the desired fraction of cells and processes a batch of images within a few seconds. SimplySmart_v1 quantified DNA damage effectively regardless of the cell type (neuron or NSC-34). A comparison of SimplySmart_v1 with other open-source tools, such as Fiji, CellProfiler and Focinator, showed that SimplySmart_v1 is the most user-friendly and the fastest of these tools and provides highly accurate results free of variability between measurements. In the context of neurodegenerative research, SimplySmart_v1 revealed an increase in DNA damage in primary neurons expressing abnormal TAR DNA/RNA binding protein (TDP-43).
Conclusions: These findings showed that SimplySmart_v1 is a new and effective tool for research on DNA damage and can successfully replace other available software.
Title: SimplySmart_v1, a new tool for the analysis of DNA damage optimized in primary neuronal cultures. Journal: BMC Bioinformatics (impact factor 2.9). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11443846/pdf/
Pub Date: 2024-09-30 DOI: 10.1186/s12859-024-05916-1
You Zhou, Giulia Pedrielli, Fei Zhang, Teresa Wu
Background: The active functionalities of RNA are recognized to depend heavily on its structure and sequence. Therefore, a model that can accurately evaluate a design given RNA sequence-structure pairs would be a valuable tool for many researchers. Machine learning methods have been explored to develop such tools, showing promising results. However, two key issues remain. First, the performance of machine learning models is affected by the features used to characterize RNA, and there is currently no consensus on which features are most effective for characterizing RNA sequence-structure pairs. Second, most existing machine learning methods extract features describing the entire RNA molecule. We argue that it is essential to define additional features that characterize nucleotides and specific sections of RNA structure to enhance the overall efficacy of the RNA design process.
Results: We develop two deep learning models for evaluating RNA sequence-secondary structure pairs. The first model, NU-ResNet, uses a convolutional neural network architecture that addresses the aforementioned problems by explicitly encoding RNA sequence-structure information into a 3D matrix. Building upon NU-ResNet, our second model, NUMO-ResNet, incorporates additional information derived from the characterizations of RNA, specifically the 2D folding motifs. In this work, we introduce an automated method to extract these motifs based on fundamental secondary structure descriptions. We evaluate the performance of both models on an independent testing dataset, where our proposed models outperform models from the literature. To assess the robustness of our models, we conduct 10-fold cross-validation. To evaluate the generalization ability of NU-ResNet and NUMO-ResNet, we train and test our proposed models on different RNA families. Our proposed models show superior performance compared to models from the literature when tested across different independent RNA families.
Conclusions: In this study, we propose two deep learning models, NU-ResNet and NUMO-ResNet, to evaluate RNA sequence-secondary structure pairs. These two models expand the field of data-driven approaches for learning RNA. Furthermore, they provide a new method to encode RNA sequence-secondary structure pairs.
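To give a sense of what encoding a sequence-structure pair into a 3D matrix can look like, here is a minimal sketch. It is not the NU-ResNet encoding, which the abstract does not specify. It assumes the secondary structure is given in dot-bracket notation without pseudoknots, and the 9-channel layout (one-hot base at row position, one-hot base at column position, pairing flag) is a hypothetical choice made only for illustration.

```python
BASES = "ACGU"


def pair_table(structure: str) -> list[int]:
    """Map each position to its paired partner index, or -1 if unpaired."""
    partners = [-1] * len(structure)
    stack = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            partners[i], partners[j] = j, i
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return partners


def encode_pair(sequence: str, structure: str) -> list[list[list[int]]]:
    """Build an L x L x 9 nested list for a sequence-structure pair.

    Hypothetical channel layout: 0-3 one-hot base at row i,
    4-7 one-hot base at column j, 8 set to 1 when i and j are paired.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    partners = pair_table(structure)
    n = len(sequence)
    tensor = []
    for i in range(n):
        row = []
        for j in range(n):
            channels = [0] * 9
            channels[BASES.index(sequence[i])] = 1
            channels[4 + BASES.index(sequence[j])] = 1
            channels[8] = int(partners[i] == j)
            row.append(channels)
        tensor.append(row)
    return tensor
```

For the hairpin `GCAAGC` with structure `((..))`, `pair_table` pairs position 0 with 5 and position 1 with 4, and the pairing channel is set only at those coordinates. A grid like this is the kind of input a convolutional network can consume directly.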
Title: Predicting RNA sequence-structure likelihood via structure-aware deep learning. Journal: BMC Bioinformatics (impact factor 2.9). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11443715/pdf/
Pub Date: 2024-09-28 DOI: 10.1186/s12859-024-05937-w
Yan Zheng, Xuequn Shang
Background: Structural variations play a significant role in genetic diseases and evolutionary mechanisms. Extensive research has been conducted over the past decade to detect simple structural variations, leading to the development of well-established detection methods. However, recent studies have highlighted the potentially greater impact of complex structural variations on individuals compared to simple structural variations. Despite this, the field still lacks precise detection methods specifically designed for complex structural variations. Therefore, the development of a highly efficient and accurate detection method is of utmost importance.
Result: In response to this need, we propose a novel method called FindCSV, which leverages deep learning techniques and consensus sequences to enhance the detection of structural variations (SVs) using long-read sequencing data. Compared to current methods, FindCSV performs better in detecting both complex and simple structural variations.
Conclusions: FindCSV is a new method to detect complex and simple structural variations with reasonable accuracy in real and simulated data. The source code for the program is available at https://github.com/nwpuzhengyan/FindCSV.
Title: FindCSV: a long-read based method for detecting complex structural variations. Journal: BMC Bioinformatics (impact factor 2.9). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11439270/pdf/
Pub Date: 2024-09-27 DOI: 10.1186/s12859-024-05928-x
Teng Li, Yiran Zou, Xianghan Li, Thomas K F Wong, Allen G Rodrigo
Background: The application of Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction and visualization has revolutionized the analysis of single-cell RNA expression and population genetics. However, its potential in single-cell DNA sequencing data analysis, particularly for visualizing gene mutation information, has not been fully explored.
Results: We introduce Mugen-UMAP, a novel Python-based program that extends UMAP's utility to single-cell DNA sequencing data. The tool provides a comprehensive pipeline that runs from processing gene annotation files of single-cell somatic single-nucleotide variants and metadata to visualizing UMAP projections for cluster identification, along with various statistical analyses. Employing Mugen-UMAP, we analyzed whole-exome sequencing (WES) data from 365 single-cell samples across 12 non-small cell lung cancer (NSCLC) patients, revealing distinct clusters associated with histological subtypes of NSCLC. Moreover, to demonstrate the general utility of Mugen-UMAP, we applied the program to 9 additional single-cell WES datasets from various cancer types, uncovering patterns of cell clusters that warrant further investigation. In summary, Mugen-UMAP provides a quick and effective visualization method to uncover cell cluster patterns based on the gene mutation information from single-cell DNA sequencing data.
Conclusions: The application of Mugen-UMAP demonstrates its capacity to provide valuable insights into the visualization and interpretation of single-cell DNA sequencing data. Mugen-UMAP can be found at https://github.com/tengchn/Mugen-UMAP.
Title: Mugen-UMAP: UMAP visualization and clustering of mutated genes in single-cell DNA sequencing data. Journal: BMC Bioinformatics (impact factor 2.9). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11437917/pdf/
Pub Date: 2024-09-27 DOI: 10.1186/s12859-024-05938-9
Harley Edwards, Joseph Zavorskas, Walker Huso, Alexander G Doan, Caton Silbiger, Steven Harris, Ranjan Srivastava, Mark R Marten
Background: Derivative profiling is a novel approach for identifying differential signals in dynamic omics data sets. It applies variable step-size differentiation to time-dynamic omics data, on the assumption that a general omics derivative is a useful and descriptive feature of dynamic omics experiments. We assert that this omics derivative, or omics flux, is a valuable descriptor that can be used instead of, or alongside, fold change calculations.
Results: Derivative profiling is compared with established methods such as Multivariate Adaptive Regression Splines, significance versus fold change (Volcano) analysis, and adjusted ratio over intensity (M/A) analysis, and the results show statistically significant similarity. This comparison is repeated for transcriptomic and phosphoproteomic expression profiles previously characterized in Aspergillus nidulans. The method has been packaged as an open-source, GUI-based MATLAB app, the Derivative Profiling omics Package (DPoP). Gene Ontology (GO) term enrichment is included in the app so that a user can programmatically describe the over- and under-represented GO terms in the derivative profiling results using the domain-specific knowledge in their organism's GO database file. The advantages of DPoP analysis are that it is computationally inexpensive, does not require fold change calculations, describes both instantaneous and overall behavior, and achieves statistical confidence with signal trajectories of a single bio-replicate over four or more points.
Conclusions: While we apply this method to time-dynamic transcriptomic and phosphoproteomic datasets, it is a numerically generalizable technique that can be applied to any organism and any field interested in time series data analysis. The app described in this work enables omics researchers with no computer science background to apply derivative profiling to their data sets, while also allowing multidisciplinary users to build on the nascent idea of profiling derivatives in omics.
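The abstract above mentions variable step-size differentiation of time-course signals. As a rough illustration only, the sketch below estimates a derivative at unevenly spaced time points with a standard three-point finite-difference formula. It is not the DPoP MATLAB implementation, whose details are not given in the abstract; the function name `nonuniform_derivative` and the sample values are hypothetical.

```python
from typing import List, Sequence


def nonuniform_derivative(t: Sequence[float], y: Sequence[float]) -> List[float]:
    """Estimate dy/dt at each sample of an unevenly spaced time series.

    Hypothetical helper for illustration; not taken from DPoP.
    """
    n = len(t)
    if n != len(y):
        raise ValueError("t and y must have the same length")
    if n < 2:
        raise ValueError("need at least two time points")
    if any(t[i + 1] <= t[i] for i in range(n - 1)):
        raise ValueError("time points must be strictly increasing")

    d = [0.0] * n
    # One-sided first-order differences at the two ends of the series.
    d[0] = (y[1] - y[0]) / (t[1] - t[0])
    d[-1] = (y[-1] - y[-2]) / (t[-1] - t[-2])

    # Interior points: three-point formula that allows unequal step sizes.
    # It is exact for quadratic signals.
    for i in range(1, n - 1):
        h_prev = t[i] - t[i - 1]
        h_next = t[i + 1] - t[i]
        d[i] = (
            h_prev ** 2 * y[i + 1]
            + (h_next ** 2 - h_prev ** 2) * y[i]
            - h_next ** 2 * y[i - 1]
        ) / (h_prev * h_next * (h_prev + h_next))
    return d


# Made-up expression values sampled at irregular times (hours).
times = [0.0, 1.0, 3.0, 4.0, 8.0]
signal = [x * x for x in times]
flux = nonuniform_derivative(times, signal)
```

With five time points this yields one derivative estimate per sample, which fits the abstract's statement that a single replicate measured over four or more points is enough for the analysis; how DPoP derives statistical confidence from those estimates is not shown here.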
{"title":"Using flux theory in dynamic omics data sets to identify differentially changing signals using DPoP.","authors":"Harley Edwards, Joseph Zavorskas, Walker Huso, Alexander G Doan, Caton Silbiger, Steven Harris, Ranjan Srivastava, Mark R Marten","doi":"10.1186/s12859-024-05938-9","DOIUrl":"https://doi.org/10.1186/s12859-024-05938-9","url":null,"PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11437665/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Autism spectrum disorder (ASD) is a class of complex neurodevelopmental disorders with high genetic heterogeneity. Long non-coding RNAs (lncRNAs) are vital regulators that perform specific functions within diverse cell types and play pivotal roles in neurological diseases including ASD. Therefore, exploring lncRNA regulation would contribute to deciphering ASD molecular mechanisms. Existing computational methods use bulk transcriptomics data to identify lncRNA regulation across all samples, which can reveal the commonalities of lncRNA regulation in ASD but ignores its specificity across cell types.
Results: Here, we present Cycle (Cell type-specific lncRNA regulatory network) to construct the landscape of cell type-specific lncRNA regulation in ASD. We have found that each ASD cell type is unique in lncRNA regulation, that more than one-third of the cell type-specific lncRNA regulatory networks are scale-free, and that all of them are small-world. Across 17 ASD cell types, we have discovered 19 rewired and 11 stable modules, along with eight rewired and three stable hubs, within the constructed cell type-specific lncRNA regulatory networks. Enrichment analysis reveals that the discovered rewired and stable modules and hubs are closely related to ASD. Furthermore, more similar ASD cell types tend to be connected with higher strength in the constructed cell similarity network. Finally, the comparison results demonstrate that Cycle is a potential method for uncovering cell type-specific lncRNA regulation.
Conclusion: Overall, these results illustrate that Cycle is a promising method for modelling the landscape of cell type-specific lncRNA regulation and provides insights into the heterogeneity of lncRNA regulation across ASD cell types.
{"title":"Modelling cell type-specific lncRNA regulatory network in autism with Cycle.","authors":"Chenchen Xiong, Mingfang Zhang, Haolin Yang, Xuemei Wei, Chunwen Zhao, Junpeng Zhang","doi":"10.1186/s12859-024-05933-0","DOIUrl":"https://doi.org/10.1186/s12859-024-05933-0","url":null,"PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11430139/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-27 DOI: 10.1186/s12859-024-05932-1
George Luo, Toby Chen, John J Letterio
Background: The interpretation of large datasets, such as The Cancer Genome Atlas (TCGA), for scientific and research purposes, remains challenging despite their public availability. In this study, we focused on identifying gene expression profiles most relevant to patient prognosis and aimed to develop a method and database to address this issue. To achieve this, we introduced Luo's Optimization Categorization Curve (LOCC), an innovative tool for visualizing and scoring continuous variables against dichotomous outcomes. To demonstrate the efficacy of LOCC using real-world data, we analyzed gene expression profiles and patient data from TCGA hepatocellular carcinoma samples.
Results: To showcase LOCC, we demonstrate an optimal cutoff for E2F1 expression in hepatocellular carcinoma, which was subsequently validated in an independent cohort. Compared to ROC curves and their AUC, LOCC offered a superior description of the predictive value of E2F1 expression across various cancer types. The LOCC score, composed of factors representing the significance, range, and impact of the biomarker, facilitated the ranking of all gene expression profiles in hepatocellular carcinoma, aiding in the evaluation and understanding of previously published prognostic gene signatures. We also demonstrate that LOCC does not rely on the assumptions that Cox proportional hazards modeling requires for accurate analysis. Repeated sampling demonstrated that LOCC scores outperformed ROC's AUC in discriminating predictors from non-predictors. Additionally, gene set enrichment analysis revealed significant associations between certain gene sets and prognosis, such as E2F target genes and G2M checkpoint with poor prognosis, and bile acid metabolism and oxidative phosphorylation with good prognosis.
Conclusion: In summary, we present LOCC as a novel visualization tool for the analysis of gene expression in cancer, particularly for understanding and selecting cutoffs. Our findings suggest that LOCC scores, which effectively rank genes based on their prognostic potential, represent a more suitable approach than ROC curves and Cox proportional hazards modeling for prognostic modeling and understanding in cancer gene expression analysis. LOCC holds promise as an invaluable tool for advancing precision medicine and furthering biomarker research. Further research regarding multivariable integration and validation will help LOCC reach its full potential and establish its utility across diverse cancer types and clinical settings.
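For context on the baseline that LOCC is compared against, the sketch below shows a conventional ROC-style analysis of a continuous variable against a dichotomous outcome: the AUC computed by pairwise comparison, and a cutoff chosen by Youden's J statistic. This is not the LOCC score, whose formula is not given in the abstract; the function names and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def roc_auc(values: Sequence[float], outcomes: Sequence[int]) -> float:
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [v for v, o in zip(values, outcomes) if o == 1]
    neg = [v for v, o in zip(values, outcomes) if o == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def best_cutoff_by_youden(
    values: Sequence[float], outcomes: Sequence[int]
) -> Tuple[float, float]:
    """Scan every observed value as a cutoff; return (cutoff, Youden's J).

    A case is called positive when its value is >= the cutoff.
    """
    n_pos = sum(1 for o in outcomes if o == 1)
    n_neg = len(outcomes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")
    best_cut, best_j = 0.0, float("-inf")
    for cut in sorted(set(values)):
        tp = sum(1 for v, o in zip(values, outcomes) if o == 1 and v >= cut)
        fp = sum(1 for v, o in zip(values, outcomes) if o == 0 and v >= cut)
        # J = sensitivity + specificity - 1 = TPR - FPR
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Toy data: expression level and a binary outcome (1 = event observed).
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
event = [0, 0, 1, 1, 0, 1]
auc = roc_auc(expression, event)
cutoff, youden_j = best_cutoff_by_youden(expression, event)
```

A scan like this returns one cutoff and one summary number. The abstract describes the LOCC score as additionally combining the significance, range, and impact of the biomarker, which this sketch does not attempt.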
{"title":"LOCC: a novel visualization and scoring of cutoffs for continuous variables with hepatocellular carcinoma prognosis as an example.","authors":"George Luo, Toby Chen, John J Letterio","doi":"10.1186/s12859-024-05932-1","DOIUrl":"10.1186/s12859-024-05932-1","url":null,"PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.9,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11438210/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}