BMC Bioinformatics: Latest Articles

SAE-Impute: imputation for single-cell data via subspace regression and auto-encoders.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-10-01 | DOI: 10.1186/s12859-024-05944-x
Liang Bai, Boya Ji, Shulin Wang

Background: Single-cell RNA sequencing (scRNA-seq) technology has emerged as a crucial tool for studying cellular heterogeneity. However, data losses inherent to the sequencing process, known as dropout events, pose challenges for downstream analysis and interpretation. Imputing dropout values has therefore become a critical concern in scRNA-seq data analysis. Present imputation methods predominantly rely on statistical or machine learning approaches and often overlook inter-sample correlations.

Results: To address this limitation, we introduce SAE-Impute, a new computational method that imputes single-cell data by combining subspace regression and auto-encoders to enhance the accuracy and reliability of the imputation process. Specifically, SAE-Impute assesses sample correlations via subspace regression, predicts potential dropout values, and then leverages these predictions within an autoencoder framework for interpolation. To validate the performance of SAE-Impute, we systematically conducted experiments on both simulated and real scRNA-seq datasets. The results show that SAE-Impute effectively reduces false negative signals in single-cell data and improves the recovery of dropout values and of gene-gene and cell-cell correlations. Finally, we conducted several downstream analyses on the imputed scRNA-seq data, including the identification of differential gene expression, cell clustering and visualization, and cell trajectory construction.

Conclusions: These results further demonstrate that SAE-Impute effectively reduces dropouts in single-cell datasets, thereby improving the functional interpretability of the data.

Citations: 0
ScRNAbox: empowering single-cell RNA sequencing on high performance computing systems.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-10-01 | DOI: 10.1186/s12859-024-05935-y
Rhalena A Thomas, Michael R Fiorini, Saeid Amiri, Edward A Fon, Sali M K Farhan

Background: Single-cell RNA sequencing (scRNAseq) offers powerful insights, but the surge in sample sizes demands more computational power than local workstations can provide. Consequently, high-performance computing (HPC) systems have become imperative. Existing web apps designed to analyze scRNAseq data lack scalability and integration capabilities, while analysis packages demand coding expertise, hindering accessibility.

Results: In response, we introduce scRNAbox, a scRNAseq analysis pipeline built for HPC systems. This end-to-end solution, executed via the SLURM workload manager, efficiently processes raw data from standard and Hashtag samples. It incorporates quality control filtering, sample integration, clustering, and cluster annotation tools, and it facilitates cell type-specific differential gene expression analysis between two groups. We demonstrate the application of scRNAbox by analyzing two publicly available datasets.

Conclusion: ScRNAbox is a comprehensive end-to-end pipeline designed to streamline the processing and analysis of scRNAseq data. By responding to the pressing demand for a user-friendly HPC solution, scRNAbox bridges the gap between the growing computational demands of scRNAseq analysis and the coding expertise required to meet them.

Citations: 0
Efficient and low-complexity variable-to-variable length coding for DNA storage.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-10-01 | DOI: 10.1186/s12859-024-05943-y
Yunfei Gao, Albert No

Background: Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of $h$ consecutive identical bases (homopolymer constraint $h$), and 2) a GC ratio within $[0.5 - c_{\mathrm{GC}},\ 0.5 + c_{\mathrm{GC}}]$ (GC content constraint $c_{\mathrm{GC}}$). Sequencing or synthesis errors tend to increase when these constraints are violated.

Results: In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity for increased block lengths and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when $h = 4$ and $c_{\mathrm{GC}} = 0.05$, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub.
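
The two constraints stated in the Background can be checked mechanically. The following is a minimal sketch, not the authors' encoder: it only tests whether a given sequence satisfies the homopolymer limit h and the GC content window 0.5 ± c_GC. The function names are hypothetical, and the defaults simply reuse the example values h = 4 and c_GC = 0.05 quoted above.

```python
def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of identical consecutive bases."""
    longest = current = 0
    previous = None
    for base in seq:
        # Extend the current run if the base repeats, otherwise start a new run.
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
        previous = base
    return longest


def gc_ratio(seq: str) -> float:
    """Return the fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int = 4, c_gc: float = 0.05) -> bool:
    """Check the homopolymer constraint (runs <= h) and the GC window 0.5 +/- c_gc."""
    # A tiny tolerance guards against floating-point error at the window edges.
    within_gc_window = abs(gc_ratio(seq) - 0.5) <= c_gc + 1e-12
    return max_homopolymer_run(seq) <= h and within_gc_window
```

A checker like this is useful for validating encoder output; it says nothing about how to construct a code that reaches the reported rate.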

Conclusion: We propose a variable-to-variable-length encoding method that achieves near-optimal rates without relying on the concatenation of short predefined sequences.

Citations: 0
SimplySmart_v1, a new tool for the analysis of DNA damage optimized in primary neuronal cultures.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-10-01 | DOI: 10.1186/s12859-024-05947-8
Sushma Koirala, Harman Sharma, Yee Lian Chew, Anna Konopka

Background: The increased interest in research on DNA damage in neurodegeneration has created a need for tools dedicated to the analysis of DNA damage in neurons. Double-stranded breaks (DSBs) are among the most detrimental types of DNA damage and have become a subject of intensive research. DSBs result in DNA damage foci, which are detectable with the marker γH2AX. Manual counting of DNA damage foci is challenging and biased, and there is a lack of open-source programs optimized specifically for neurons. Thus, we developed a new, fully automated application, SimplySmart_v1, for DNA damage quantification and optimized its performance specifically in primary neurons cultured in vitro.

Results: SimplySmart_v1 accurately identifies the induction of DNA damage by etoposide in primary neurons relative to control neurons. It also accurately quantifies DNA damage in the desired fraction of cells and processes a batch of images within a few seconds. SimplySmart_v1 was also capable of quantifying DNA damage effectively regardless of the cell type (neuron or NSC-34). A comparative analysis of SimplySmart_v1 with other open-source tools, such as Fiji, CellProfiler and Focinator, revealed that SimplySmart_v1 is the most user-friendly and the quickest of these tools and provides highly accurate results free of variability between measurements. In the context of neurodegenerative research, SimplySmart_v1 revealed an increase in DNA damage in primary neurons expressing abnormal TAR DNA/RNA binding protein (TDP-43).

Conclusions: These findings show that SimplySmart_v1 is a new and effective tool for research on DNA damage and can successfully replace other available software.

Citations: 0
Predicting RNA sequence-structure likelihood via structure-aware deep learning.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-30 | DOI: 10.1186/s12859-024-05916-1
You Zhou, Giulia Pedrielli, Fei Zhang, Teresa Wu

Background: The active functionalities of RNA are recognized to be heavily dependent on its structure and sequence. Therefore, a model that can accurately evaluate a design given RNA sequence-structure pairs would be a valuable tool for many researchers. Machine learning methods have been explored to develop such tools, showing promising results. However, two key issues remain. Firstly, the performance of machine learning models is affected by the features used to characterize RNA. Currently, there is no consensus on which features are the most effective for characterizing RNA sequence-structure pairs. Secondly, most existing machine learning methods extract features describing the entire RNA molecule. We argue that it is essential to define additional features that characterize nucleotides and specific sections of RNA structure to enhance the overall efficacy of the RNA design process.

Results: We develop two deep learning models for evaluating RNA sequence-secondary structure pairs. The first model, NU-ResNet, uses a convolutional neural network architecture that solves the aforementioned problems by explicitly encoding RNA sequence-structure information into a 3D matrix. Building upon NU-ResNet, our second model, NUMO-ResNet, incorporates additional information derived from the characterizations of RNA, specifically the 2D folding motifs. In this work, we introduce an automated method to extract these motifs based on fundamental secondary structure descriptions. We evaluate the performance of both models on an independent testing dataset, on which our proposed models outperform the models from the literature. To assess the robustness of our models, we conduct 10-fold cross validation. To evaluate the generalization ability of NU-ResNet and NUMO-ResNet across different RNA families, we train and test our proposed models on different RNA families. Our proposed models show superior performance compared to the models from the literature when tested across different independent RNA families.
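
The 10-fold cross validation mentioned above partitions the data so that every sample is held out exactly once. The following is a generic sketch of such a split using only the standard library. It is not taken from the paper, and the function name, the seed parameter, and the shuffling strategy are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists; each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical usage: 25 sequence-structure pairs split into 10 folds.
for train_idx, test_idx in k_fold_indices(25, k=10):
    assert not set(train_idx) & set(test_idx)  # folds never overlap
```

Note that a plain random split like this mixes RNA families across folds; the cross-family evaluation described in the abstract would instead require grouping samples by family before splitting.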

Conclusions: In this study, we propose two deep learning models, NU-ResNet and NUMO-ResNet, to evaluate RNA sequence-secondary structure pairs. These two models expand the field of data-driven approaches for learning RNA. Furthermore, they provide a new method to encode RNA sequence-secondary structure pairs.

Citations: 0
FindCSV: a long-read based method for detecting complex structural variations.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-28 | DOI: 10.1186/s12859-024-05937-w
Yan Zheng, Xuequn Shang

Background: Structural variations play a significant role in genetic diseases and evolutionary mechanisms. Extensive research has been conducted over the past decade to detect simple structural variations, leading to the development of well-established detection methods. However, recent studies have highlighted the potentially greater impact of complex structural variations on individuals compared to simple structural variations. Despite this, the field still lacks precise detection methods specifically designed for complex structural variations. Therefore, the development of a highly efficient and accurate detection method is of utmost importance.

Results: In response to this need, we propose a novel method called FindCSV, which leverages deep learning techniques and consensus sequences to enhance the detection of structural variations (SVs) using long-read sequencing data. Compared to current methods, FindCSV performs better in detecting both complex and simple structural variations.

Conclusions: FindCSV is a new method to detect complex and simple structural variations with reasonable accuracy in real and simulated data. The source code for the program is available at https://github.com/nwpuzhengyan/FindCSV.

Citations: 0
Mugen-UMAP: UMAP visualization and clustering of mutated genes in single-cell DNA sequencing data.
IF 2.9 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-27 | DOI: 10.1186/s12859-024-05928-x
Teng Li, Yiran Zou, Xianghan Li, Thomas K F Wong, Allen G Rodrigo

Background: The application of Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction and visualization has revolutionized the analysis of single-cell RNA expression and population genetics. However, its potential in single-cell DNA sequencing data analysis, particularly for visualizing gene mutation information, has not been fully explored.

Results: We introduce Mugen-UMAP, a novel Python-based program that extends UMAP's utility to single-cell DNA sequencing data. The tool provides a comprehensive pipeline that runs from processing gene annotation files of single-cell somatic single-nucleotide variants and metadata through to visualizing UMAP projections for cluster identification, along with various statistical analyses. Employing Mugen-UMAP, we analyzed whole-exome sequencing (WES) data from 365 single-cell samples across 12 non-small cell lung cancer (NSCLC) patients, revealing distinct clusters associated with histological subtypes of NSCLC. Moreover, to demonstrate the general utility of Mugen-UMAP, we applied the program to 9 additional single-cell WES datasets from various cancer types, uncovering interesting patterns of cell clusters that warrant further investigation. In summary, Mugen-UMAP provides a quick and effective visualization method to uncover cell cluster patterns based on the gene mutation information from single-cell DNA sequencing data.
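
A dimensionality-reduction step such as UMAP needs a numeric matrix as input. One plausible representation of gene mutation information is a binary cell-by-gene matrix. The sketch below builds such a matrix with the standard library only. It is an illustrative assumption, not Mugen-UMAP's actual code or input format: the function name, the (cell_id, gene) record layout, and the toy cell identifiers are all hypothetical.

```python
def build_mutation_matrix(records):
    """Build a binary cell-by-gene matrix from (cell_id, gene) mutation records.

    Returns (cells, genes, matrix) where matrix[i][j] is 1 if cells[i]
    carries at least one mutation in genes[j], else 0.
    """
    records = list(records)  # allow any iterable, since we pass over it three times
    cells = sorted({cell for cell, _ in records})
    genes = sorted({gene for _, gene in records})
    cell_row = {cell: i for i, cell in enumerate(cells)}
    gene_col = {gene: j for j, gene in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in cells]
    for cell, gene in records:
        matrix[cell_row[cell]][gene_col[gene]] = 1
    return cells, genes, matrix


# Toy records (hypothetical cells; gene symbols chosen only as examples).
toy_records = [
    ("cell_1", "TP53"),
    ("cell_1", "KRAS"),
    ("cell_2", "TP53"),
    ("cell_3", "EGFR"),
]
cells, genes, matrix = build_mutation_matrix(toy_records)
# genes is sorted alphabetically: ['EGFR', 'KRAS', 'TP53']
# matrix rows follow cells:       [[0, 1, 1], [0, 0, 1], [1, 0, 0]]
```

Each row of the resulting matrix is one cell's mutation profile, which is the kind of per-sample feature vector a UMAP implementation would project into two dimensions.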

Conclusions: The application of Mugen-UMAP demonstrates its capacity to provide valuable insights into the visualization and interpretation of single-cell DNA sequencing data. Mugen-UMAP can be found at https://github.com/tengchn/Mugen-UMAP.

Citations: 0
Using flux theory in dynamic omics data sets to identify differentially changing signals using DPoP.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-27 DOI: 10.1186/s12859-024-05938-9
Harley Edwards, Joseph Zavorskas, Walker Huso, Alexander G Doan, Caton Silbiger, Steven Harris, Ranjan Srivastava, Mark R Marten

Background: Derivative profiling is a novel approach to identify differential signals from dynamic omics data sets. This approach applies variable step-size differentiation to time dynamic omics data. This work assumes that there is a general omics derivative that is a useful and descriptive feature of dynamic omics experiments. We assert that this omics derivative, or omics flux, is a valuable descriptor that can be used instead of, or with, fold change calculations.

Results: The results of derivative profiling are compared with those of established methods such as Multivariate Adaptive Regression Splines, significance versus fold change analysis (Volcano), and an adjusted ratio over intensity (M/A) analysis; the comparison shows a statistically significant similarity between the results. This comparison is repeated for transcriptomic and phosphoproteomic expression profiles previously characterized in Aspergillus nidulans. This method has been packaged in an open-source, GUI-based MATLAB app, the Derivative Profiling omics Package (DPoP). Gene Ontology (GO) term enrichment has been included in the app so that a user can automatically or programmatically describe the over- and under-represented GO terms in the derivative profiling results using domain-specific knowledge found in their organism's specific GO database file. The advantages of the DPoP analysis are that it is computationally inexpensive, it does not require fold change calculations, it describes both instantaneous and overall behavior, and it achieves statistical confidence with signal trajectories of a single bio-replicate over four or more points.

Conclusions: While we apply this method to time dynamic transcriptomic and phosphoproteomic datasets, it is a numerically generalizable technique that can be applied to any organism and any field interested in time series data analysis. The app described in this work enables omics researchers with no computer science background to apply derivative profiling to their data sets, while also allowing multidisciplined users to build on the nascent idea of profiling derivatives in omics.
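
To make the idea of variable step-size differentiation concrete, here is a minimal sketch that estimates the derivative of one signal sampled at unevenly spaced time points. It uses one-sided differences at the two ends and, in the interior, a three-point formula that accounts for unequal steps. The function name and the toy data are hypothetical, and this is a generic finite-difference scheme chosen for illustration, not necessarily the one implemented in DPoP.

```python
def derivative_nonuniform(times, values):
    """Estimate d(values)/d(times) at each sample; times must be strictly increasing."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two samples and matching lengths")
    n = len(times)
    deriv = [0.0] * n
    # One-sided differences at the boundaries.
    deriv[0] = (values[1] - values[0]) / (times[1] - times[0])
    deriv[-1] = (values[-1] - values[-2]) / (times[-1] - times[-2])
    # Three-point formula for unequal steps (exact for quadratic signals).
    for i in range(1, n - 1):
        h1 = times[i] - times[i - 1]
        h2 = times[i + 1] - times[i]
        deriv[i] = (
            -h2 / (h1 * (h1 + h2)) * values[i - 1]
            + (h2 - h1) / (h1 * h2) * values[i]
            + h1 / (h2 * (h1 + h2)) * values[i + 1]
        )
    return deriv


# Hypothetical signal sampled at uneven time points: value = t^2.
times = [0.0, 1.0, 3.0, 4.0]
values = [t * t for t in times]
flux = derivative_nonuniform(times, values)
```

Because the toy signal is quadratic, the interior estimates match the true derivative 2t at t = 1 and t = 3, while the end points carry the usual first-order error of one-sided differences.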

Citations: 0
Modelling cell type-specific lncRNA regulatory network in autism with Cycle.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-27 DOI: 10.1186/s12859-024-05933-0
Chenchen Xiong, Mingfang Zhang, Haolin Yang, Xuemei Wei, Chunwen Zhao, Junpeng Zhang

Background: Autism spectrum disorder (ASD) is a class of complex neurodevelopmental disorders with high genetic heterogeneity. Long non-coding RNAs (lncRNAs) are vital regulators that perform specific functions within diverse cell types and play pivotal roles in neurological diseases including ASD. Therefore, exploring lncRNA regulation would contribute to deciphering ASD molecular mechanisms. Existing computational methods use bulk transcriptomics data to identify lncRNA regulation across all samples, which can reveal the commonalities of lncRNA regulation in ASD but ignores its specificity across cell types.

Results: Here, we present Cycle (Cell type-specific lncRNA regulatory network) to construct the landscape of cell type-specific lncRNA regulation in ASD. We have found that each ASD cell type is unique in lncRNA regulation; more than one-third of the cell type-specific lncRNA regulatory networks are scale-free, and all of them are small-world. Across 17 ASD cell types, we have discovered 19 rewired and 11 stable modules, along with eight rewired and three stable hubs within the constructed cell type-specific lncRNA regulatory networks. Enrichment analysis reveals that the discovered rewired and stable modules and hubs are closely related to ASD. Furthermore, more similar ASD cell types tend to be connected with higher strength in the constructed cell similarity network. Finally, the comparison results demonstrate that Cycle is a potential method for uncovering cell type-specific lncRNA regulation.

Conclusion: Overall, these results illustrate that Cycle is a promising method to model the landscape of cell type-specific lncRNA regulation, and provides insights into understanding the heterogeneity of lncRNA regulation between various ASD cell types.
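
For readers unfamiliar with the small-world property mentioned above, the sketch below computes the two quantities usually inspected for it on an undirected network: the average clustering coefficient and the average shortest path length. A network is typically called small-world when clustering is high while paths stay short relative to a comparable random graph; that comparison, and any thresholds, are omitted here. The toy network and node labels are hypothetical, and this is a generic calculation, not the procedure used by Cycle.

```python
from collections import deque


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with fewer than two neighbours count as 0."""
    total = 0.0
    for neighbours in adj.values():
        nb = list(neighbours)
        k = len(nb)
        if k < 2:
            continue
        links = sum(
            1 for i in range(k) for j in range(i + 1, k) if nb[j] in adj[nb[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS from each node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical undirected network: a triangle (A, B, C) with D attached to C.
network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
clustering = average_clustering(network)
path_length = average_path_length(network)
```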

Citations: 0
LOCC: a novel visualization and scoring of cutoffs for continuous variables with hepatocellular carcinoma prognosis as an example.
IF 2.9 Zone 3 Biology Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-27 DOI: 10.1186/s12859-024-05932-1
George Luo, Toby Chen, John J Letterio

Background: The interpretation of large datasets, such as The Cancer Genome Atlas (TCGA), for scientific and research purposes, remains challenging despite their public availability. In this study, we focused on identifying gene expression profiles most relevant to patient prognosis and aimed to develop a method and database to address this issue. To achieve this, we introduced Luo's Optimization Categorization Curve (LOCC), an innovative tool for visualizing and scoring continuous variables against dichotomous outcomes. To demonstrate the efficacy of LOCC using real-world data, we analyzed gene expression profiles and patient data from TCGA hepatocellular carcinoma samples.

Results: To showcase LOCC, we demonstrate an optimal cutoff for E2F1 expression in hepatocellular carcinoma, which was subsequently validated in an independent cohort. Compared to ROC curves and their AUC, LOCC offered a superior description of the predictive value of E2F1 expression across various cancer types. The LOCC score, composed of factors representing the significance, range, and impact of the biomarker, facilitated the ranking of all gene expression profiles in hepatocellular carcinoma, aiding in the evaluation and understanding of previously published prognostic gene signatures. We also demonstrate that LOCC does not require the same assumptions as Cox proportional hazards modeling for accurate analysis. Repeated sampling demonstrated that LOCC scores outperformed ROC's AUC in discriminating predictors from non-predictors. Additionally, gene set enrichment analysis revealed significant associations between certain gene sets and prognosis, such as E2F target genes and G2M checkpoint with poor prognosis, and bile acid metabolism and oxidative phosphorylation with good prognosis.

Conclusion: In summary, we present LOCC as a novel visualization tool for the analysis of gene expression in cancer, particularly for understanding and selecting cutoffs. Our findings suggest that LOCC scores, which effectively rank genes based on their prognostic potential, represent a more suitable approach than ROC curves and Cox proportional hazards modeling for prognostic modeling and understanding in cancer gene expression analysis. LOCC holds promise as a tool for advancing precision medicine and furthering biomarker research. Further research regarding multivariable integration and validation will help LOCC reach its full potential and establish its utility across diverse cancer types and clinical settings.
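
The general idea of evaluating cutoffs for a continuous variable against a dichotomous outcome can be sketched as follows: sort the samples, try every cutoff that separates two distinct values, and record how much the outcome rate differs between the resulting high and low groups. This is a deliberately simplified stand-in; the function names, the minimum group size, the toy data, and the use of a plain difference in event rates are all assumptions, and the sketch does not compute the LOCC score described above.

```python
def scan_cutoffs(values, outcomes, min_group=2):
    """Return [(cutoff, rate_high - rate_low), ...] for every admissible cutoff.

    outcomes are 0/1 event indicators; each group must hold at least min_group samples.
    """
    paired = sorted(zip(values, outcomes))
    results = []
    for k in range(min_group, len(paired) - min_group + 1):
        if paired[k][0] == paired[k - 1][0]:
            continue  # tied values cannot be separated by a cutoff
        low = [o for _, o in paired[:k]]
        high = [o for _, o in paired[k:]]
        cutoff = (paired[k - 1][0] + paired[k][0]) / 2
        results.append((cutoff, sum(high) / len(high) - sum(low) / len(low)))
    return results


def best_cutoff(values, outcomes, min_group=2):
    """Pick the cutoff with the largest absolute difference in event rate."""
    results = scan_cutoffs(values, outcomes, min_group)
    if not results:
        raise ValueError("no admissible cutoff for this data and min_group")
    return max(results, key=lambda r: abs(r[1]))


# Hypothetical expression values and event indicators (1 = event occurred).
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
event = [0, 0, 0, 1, 1, 1]
curve = scan_cutoffs(expression, event)
best = best_cutoff(expression, event)
```

Plotting the second element of each pair in `curve` against the cutoff gives a simple picture of how sensitive the group separation is to the chosen threshold, which is the kind of question a cutoff visualization is meant to answer.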

Citations: 0