Qi Dai, Hu Chen, Wen-Jing Yi, Jia-Ning Zhao, Wei Zhang, Ping-An He, Xiao-Qing Liu, Ying-Feng Zheng, Zhuo-Xing Shi
Decoding DNA methylation sites through nanopore sequencing has emerged as a cutting-edge technology in the field of DNA methylation research, as it enables direct sequencing of native DNA molecules without the need for prior enzymatic or chemical treatments. During nanopore sequencing, methylation modifications on DNA bases cause changes in electrical current intensity. Therefore, constructing deep neural network models to decode the electrical signals of nanopore sequencing has become a crucial step in methylation site identification. In this study, we utilized nanopore sequencing data containing diverse DNA methylation types and motif sequence diversity. We proposed a feature encoding method based on current signal clustering and leveraged the powerful attention mechanism in the Transformer framework to construct the PoreFormer model for identifying DNA methylation sites in nanopore sequencing. The model demonstrated excellent performance under conditions of multi-class methylation and motif sequence diversity, offering new insights into related research fields.
{"title":"Precision DNA methylation typing via hierarchical clustering of Nanopore current signals and attention-based neural network.","authors":"Qi Dai, Hu Chen, Wen-Jing Yi, Jia-Ning Zhao, Wei Zhang, Ping-An He, Xiao-Qing Liu, Ying-Feng Zheng, Zhuo-Xing Shi","doi":"10.1093/bib/bbae596","DOIUrl":"10.1093/bib/bbae596","url":null,"abstract":"<p><p>Decoding DNA methylation sites through nanopore sequencing has emerged as a cutting-edge technology in the field of DNA methylation research, as it enables direct sequencing of native DNA molecules without the need for prior enzymatic or chemical treatments. During nanopore sequencing, methylation modifications on DNA bases cause changes in electrical current intensity. Therefore, constructing deep neural network models to decode the electrical signals of nanopore sequencing has become a crucial step in methylation site identification. In this study, we utilized nanopore sequencing data containing diverse DNA methylation types and motif sequence diversity. We proposed a feature encoding method based on current signal clustering and leveraged the powerful attention mechanism in the Transformer framework to construct the PoreFormer model for identifying DNA methylation sites in nanopore sequencing. The model demonstrated excellent performance under conditions of multi-class methylation and motif sequence diversity, offering new insights into related research fields.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562827/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142614931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Medication recommendation is a crucial application of artificial intelligence in healthcare. Current methodologies mostly depend on patient-level longitudinal representation, which utilizes the entirety of historical electronic health records for making predictions. However, they tend to overlook a few key elements: (1) The need to analyze the impact of past medications on previous conditions. (2) Similarity in patient visits is more common than similarity in the complete medical histories of patients. (3) It is difficult to accurately represent patient-level longitudinal data due to the varying numbers of visits. To our knowledge, current models face difficulties in dealing with initial patient visits (i.e. in cold-start scenarios) which are common in clinical practice. This paper introduces DrugDoctor, an innovative drug recommendation model crafted to emulate the decision-making mechanics of human doctors. Unlike previous methods, DrugDoctor explores the visit-level relationship between prescriptions and diseases while considering the impact of past prescriptions on the patient's condition to provide more accurate recommendations. We design a plug-and-play block to effectively capture drug substructure-aware disease information and effectiveness-aware medication information, employing cross-attention and multi-head self-attention mechanisms. Furthermore, DrugDoctor adopts a fundamentally new visit-level training strategy, aligning more closely with the practices of doctors. Extensive experiments conducted on the MIMIC-III and MIMIC-IV datasets demonstrate that DrugDoctor outperforms 10 other state-of-the-art methods in terms of Jaccard, F1-score, and PRAUC. Moreover, DrugDoctor exhibits strong robustness in handling patients with varying numbers of visits and effectively tackles "cold-start" issues in medication combination recommendations.
{"title":"DrugDoctor: enhancing drug recommendation in cold-start scenario via visit-level representation learning and training.","authors":"Yabin Kuang, Minzhu Xie","doi":"10.1093/bib/bbae464","DOIUrl":"10.1093/bib/bbae464","url":null,"abstract":"<p><p>Medication recommendation is a crucial application of artificial intelligence in healthcare. Current methodologies mostly depend on patient-level longitudinal representation, which utilizes the entirety of historical electronic health records for making predictions. However, they tend to overlook a few key elements: (1) The need to analyze the impact of past medications on previous conditions. (2) Similarity in patient visits is more common than similarity in the complete medical histories of patients. (3) It is difficult to accurately represent patient-level longitudinal data due to the varying numbers of visits. To our knowledge, current models face difficulties in dealing with initial patient visits (i.e. in cold-start scenarios) which are common in clinical practice. This paper introduces DrugDoctor, an innovative drug recommendation model crafted to emulate the decision-making mechanics of human doctors. Unlike previous methods, DrugDoctor explores the visit-level relationship between prescriptions and diseases while considering the impact of past prescriptions on the patient's condition to provide more accurate recommendations. We design a plug-and-play block to effectively capture drug substructure-aware disease information and effectiveness-aware medication information, employing cross-attention and multi-head self-attention mechanisms. Furthermore, DrugDoctor adopts a fundamentally new visit-level training strategy, aligning more closely with the practices of doctors. Extensive experiments conducted on the MIMIC-III and MIMIC-IV datasets demonstrate that DrugDoctor outperforms 10 other state-of-the-art methods in terms of Jaccard, F1-score, and PRAUC. Moreover, DrugDoctor exhibits strong robustness in handling patients with varying numbers of visits and effectively tackles \"cold-start\" issues in medication combination recommendations.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11418268/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142280436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
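The Jaccard and F1 scores named in the DrugDoctor abstract are standard set-overlap metrics. The following is a minimal sketch, assuming each visit's prescription is represented as a set of drug codes; the codes, the function names, and the per-visit scoring are illustrative assumptions, since the abstract does not say how the scores are averaged across visits or patients.

```python
def jaccard(predicted: set, actual: set) -> float:
    """Intersection over union of two drug sets (1.0 when both are empty)."""
    union = predicted | actual
    if not union:
        return 1.0
    return len(predicted & actual) / len(union)


def f1_score(predicted: set, actual: set) -> float:
    """Harmonic mean of precision and recall over two drug sets."""
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical drug codes for a single visit (illustrative only).
predicted = {"A01", "B02", "C03"}
actual = {"B02", "C03", "D04"}

visit_jaccard = jaccard(predicted, actual)  # 2 shared codes / 4 codes in the union
visit_f1 = f1_score(predicted, actual)      # precision and recall are both 2/3
```

Scoring per visit rather than per patient history mirrors the visit-level framing of the abstract, but it is only one possible evaluation choice.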
Messenger RNA (mRNA) vaccines represent a groundbreaking advancement in immunology and public health, particularly highlighted by their role in combating the COVID-19 pandemic. Optimizing mRNA-based antigen expression is a crucial focus in this emerging industry. We have developed a bioinformatics tool named AntigenBoost to address the challenge posed by destabilizing dipeptides that hinder ribosomal translation. AntigenBoost identifies these dipeptides within specific antigens and provides a range of potential amino acid substitution strategies using a two-dimensional scoring system. Through a combination of bioinformatics analysis and experimental validation, we significantly enhanced the in vitro expression of mRNA-derived Respiratory Syncytial Virus fusion glycoprotein and Influenza A Hemagglutinin antigen. Notably, a single amino acid substitution improved the immune response in mice, underscoring the effectiveness of AntigenBoost in mRNA vaccine design.
{"title":"AntigenBoost: enhanced mRNA-based antigen expression through rational amino acid substitution.","authors":"Yumiao Gao, Siran Zhu, Huichun Li, Xueting Hao, Wen Chen, Deng Pan, Zhikang Qian","doi":"10.1093/bib/bbae468","DOIUrl":"https://doi.org/10.1093/bib/bbae468","url":null,"abstract":"<p><p>Messenger RNA (mRNA) vaccines represent a groundbreaking advancement in immunology and public health, particularly highlighted by their role in combating the COVID-19 pandemic. Optimizing mRNA-based antigen expression is a crucial focus in this emerging industry. We have developed a bioinformatics tool named AntigenBoost to address the challenge posed by destabilizing dipeptides that hinder ribosomal translation. AntigenBoost identifies these dipeptides within specific antigens and provides a range of potential amino acid substitution strategies using a two-dimensional scoring system. Through a combination of bioinformatics analysis and experimental validation, we significantly enhanced the in vitro expression of mRNA-derived Respiratory Syncytial Virus fusion glycoprotein and Influenza A Hemagglutinin antigen. Notably, a single amino acid substitution improved the immune response in mice, underscoring the effectiveness of AntigenBoost in mRNA vaccine design.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11472322/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marco Ruscone, Andrea Checcoli, Randy Heiland, Emmanuel Barillot, Paul Macklin, Laurence Calzone, Vincent Noël
Multiscale models provide a unique tool for analyzing complex processes that study events occurring at different scales across space and time. In the context of biological systems, such models can simulate mechanisms happening at the intracellular level such as signaling, and at the extracellular level where cells communicate and coordinate with other cells. These models aim to understand the impact of genetic or environmental deregulation observed in complex diseases, describe the interplay between a pathological tissue and the immune system, and suggest strategies to revert the diseased phenotypes. The construction of these multiscale models remains a very complex task, including the choice of the components to consider, the level of detail of the processes to simulate, or the fitting of the parameters to the data. One additional difficulty is the expert knowledge needed to program these models in languages such as C++ or Python, which may discourage the participation of non-experts. Simplifying this process through structured description formalisms, coupled with a graphical interface, is crucial in making modeling more accessible to the broader scientific community, as well as streamlining the process for advanced users. This article introduces three examples of multiscale models which rely on the framework PhysiBoSS, an add-on of PhysiCell that includes intracellular descriptions as continuous time Boolean models to the agent-based approach. The article demonstrates how to construct these models more easily, relying on PhysiCell Studio, the PhysiCell Graphical User Interface. A step-by-step tutorial is provided as Supplementary Material and all models are provided at https://physiboss.github.io/tutorial/.
{"title":"Building multiscale models with PhysiBoSS, an agent-based modeling tool.","authors":"Marco Ruscone, Andrea Checcoli, Randy Heiland, Emmanuel Barillot, Paul Macklin, Laurence Calzone, Vincent Noël","doi":"10.1093/bib/bbae509","DOIUrl":"10.1093/bib/bbae509","url":null,"abstract":"<p><p>Multiscale models provide a unique tool for analyzing complex processes that study events occurring at different scales across space and time. In the context of biological systems, such models can simulate mechanisms happening at the intracellular level such as signaling, and at the extracellular level where cells communicate and coordinate with other cells. These models aim to understand the impact of genetic or environmental deregulation observed in complex diseases, describe the interplay between a pathological tissue and the immune system, and suggest strategies to revert the diseased phenotypes. The construction of these multiscale models remains a very complex task, including the choice of the components to consider, the level of detail of the processes to simulate, or the fitting of the parameters to the data. One additional difficulty is the expert knowledge needed to program these models in languages such as C++ or Python, which may discourage the participation of non-experts. Simplifying this process through structured description formalisms, coupled with a graphical interface, is crucial in making modeling more accessible to the broader scientific community, as well as streamlining the process for advanced users. This article introduces three examples of multiscale models which rely on the framework PhysiBoSS, an add-on of PhysiCell that includes intracellular descriptions as continuous time Boolean models to the agent-based approach. The article demonstrates how to construct these models more easily, relying on PhysiCell Studio, the PhysiCell Graphical User Interface. A step-by-step tutorial is provided as Supplementary Material and all models are provided at https://physiboss.github.io/tutorial/.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11489466/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yourui Han, Bolin Chen, Jun Bian, Ruiming Kang, Xuequn Shang
The evolution of lung adenocarcinoma is accompanied by a multitude of gene mutations and dysfunctions, rendering its phenotypic state and evolutionary direction highly complex. To interpret the evolution of lung adenocarcinoma, various methods have been developed to elucidate the molecular pathogenesis and functional evolution processes. However, most of these methods are constrained by the absence of cancerous temporal information, and the challenges of heterogeneous characteristics. To handle these problems, in this study, a patient quasi-potential landscape method was proposed to estimate the cancerous time of phenotypic states' emergence during the evolutionary process. Subsequently, a total of 39 different oncogenetic paths were identified based on cancerous time and mutations, reflecting the molecular pathogenesis of the evolutionary process of lung adenocarcinoma. To interpret the evolution patterns of lung adenocarcinoma, three oncogenetic graphs were obtained as the common evolutionary patterns by merging the oncogenetic paths. Moreover, patients were evenly re-divided into early, middle, and late evolutionary stages according to cancerous time, and a feasible framework was developed to construct the functional evolution network of lung adenocarcinoma. A total of six significant functional evolution processes were identified from the functional evolution network based on the pathway enrichment analysis, which play critical roles in understanding the development of lung adenocarcinoma.
{"title":"Cancerous time estimation for interpreting the evolution of lung adenocarcinoma.","authors":"Yourui Han, Bolin Chen, Jun Bian, Ruiming Kang, Xuequn Shang","doi":"10.1093/bib/bbae520","DOIUrl":"https://doi.org/10.1093/bib/bbae520","url":null,"abstract":"<p><p>The evolution of lung adenocarcinoma is accompanied by a multitude of gene mutations and dysfunctions, rendering its phenotypic state and evolutionary direction highly complex. To interpret the evolution of lung adenocarcinoma, various methods have been developed to elucidate the molecular pathogenesis and functional evolution processes. However, most of these methods are constrained by the absence of cancerous temporal information, and the challenges of heterogeneous characteristics. To handle these problems, in this study, a patient quasi-potential landscape method was proposed to estimate the cancerous time of phenotypic states' emergence during the evolutionary process. Subsequently, a total of 39 different oncogenetic paths were identified based on cancerous time and mutations, reflecting the molecular pathogenesis of the evolutionary process of lung adenocarcinoma. To interpret the evolution patterns of lung adenocarcinoma, three oncogenetic graphs were obtained as the common evolutionary patterns by merging the oncogenetic paths. Moreover, patients were evenly re-divided into early, middle, and late evolutionary stages according to cancerous time, and a feasible framework was developed to construct the functional evolution network of lung adenocarcinoma. A total of six significant functional evolution processes were identified from the functional evolution network based on the pathway enrichment analysis, which play critical roles in understanding the development of lung adenocarcinoma.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11483137/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tingting Hou, Xiaoxi Shen, Shan Zhang, Muxuan Liang, Li Chen, Qing Lu
The recent development of artificial intelligence (AI) technology, especially the advance of deep neural network (DNN) technology, has revolutionized many fields. While DNN plays a central role in modern AI technology, it has rarely been used in genetic data analysis due to analytical and computational challenges brought by high-dimensional genetic data and an increasing number of samples. To facilitate the use of AI in genetic data analysis, we developed a C++ package, AIGen, based on two newly developed neural networks (i.e. kernel neural networks and functional neural networks) that are capable of modeling complex genotype-phenotype relationships (e.g. interactions) while providing robust performance against high-dimensional genetic data. Moreover, computationally efficient algorithms (e.g. a minimum norm quadratic unbiased estimation approach and batch training) are implemented in the package to accelerate the computation, making them computationally efficient for analyzing large-scale datasets with thousands or even millions of samples. By applying AIGen to the UK Biobank dataset, we demonstrate that it can efficiently analyze large-scale genetic data, attain improved accuracy, and maintain robust performance. Availability: AIGen is developed in C++ and its source code, along with reference libraries, is publicly accessible on GitHub at https://github.com/TingtHou/AIGen.
{"title":"AIGen: an artificial intelligence software for complex genetic data analysis.","authors":"Tingting Hou, Xiaoxi Shen, Shan Zhang, Muxuan Liang, Li Chen, Qing Lu","doi":"10.1093/bib/bbae566","DOIUrl":"10.1093/bib/bbae566","url":null,"abstract":"<p><p>The recent development of artificial intelligence (AI) technology, especially the advance of deep neural network (DNN) technology, has revolutionized many fields. While DNN plays a central role in modern AI technology, it has rarely been used in genetic data analysis due to analytical and computational challenges brought by high-dimensional genetic data and an increasing number of samples. To facilitate the use of AI in genetic data analysis, we developed a C++ package, AIGen, based on two newly developed neural networks (i.e. kernel neural networks and functional neural networks) that are capable of modeling complex genotype-phenotype relationships (e.g. interactions) while providing robust performance against high-dimensional genetic data. Moreover, computationally efficient algorithms (e.g. a minimum norm quadratic unbiased estimation approach and batch training) are implemented in the package to accelerate the computation, making them computationally efficient for analyzing large-scale datasets with thousands or even millions of samples. By applying AIGen to the UK Biobank dataset, we demonstrate that it can efficiently analyze large-scale genetic data, attain improved accuracy, and maintain robust performance. Availability: AIGen is developed in C++ and its source code, along with reference libraries, is publicly accessible on GitHub at https://github.com/TingtHou/AIGen.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11568876/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142643854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mass spectrometry (MS)-based proteomics has become instrumental in comprehensively investigating complex biological systems. Data-independent acquisition (DIA)-MS, utilizing hybrid spectral library search strategies, allows for the simultaneous quantification of thousands of proteins, showing promise in enhancing protein identification and quantification precision. However, low-quality profiles can considerably undermine quantitative precision, resulting in inaccurate protein quantification. To tackle this challenge, we introduced STAVER, a novel algorithm that leverages standardized benchmark datasets to reduce non-biological variation in large-scale DIA-MS analyses. By eliminating unwanted noise in MS signals, STAVER significantly improved protein quantification precision, especially in hybrid spectral library searches. Moreover, we validated STAVER's robustness and applicability across multiple large-scale DIA datasets, demonstrating significantly enhanced precision and reproducibility of protein quantification. STAVER offers an innovative and effective approach for enhancing the quality of large-scale DIA proteomic data, facilitating cross-platform and cross-laboratory comparative analyses. This advancement significantly enhances the consistency and reliability of findings in clinical research. The complete package is available at https://github.com/Ran485/STAVER.
{"title":"STAVER: a standardized benchmark dataset-based algorithm for effective variation reduction in large-scale DIA-MS data.","authors":"Peng Ran, Yunzhi Wang, Kai Li, Shiman He, Subei Tan, Jiacheng Lv, Jiajun Zhu, Shaoshuai Tang, Jinwen Feng, Zhaoyu Qin, Yan Li, Lin Huang, Yanan Yin, Lingli Zhu, Wenjun Yang, Chen Ding","doi":"10.1093/bib/bbae553","DOIUrl":"10.1093/bib/bbae553","url":null,"abstract":"<p><p>Mass spectrometry (MS)-based proteomics has become instrumental in comprehensively investigating complex biological systems. Data-independent acquisition (DIA)-MS, utilizing hybrid spectral library search strategies, allows for the simultaneous quantification of thousands of proteins, showing promise in enhancing protein identification and quantification precision. However, low-quality profiles can considerably undermine quantitative precision, resulting in inaccurate protein quantification. To tackle this challenge, we introduced STAVER, a novel algorithm that leverages standardized benchmark datasets to reduce non-biological variation in large-scale DIA-MS analyses. By eliminating unwanted noise in MS signals, STAVER significantly improved protein quantification precision, especially in hybrid spectral library searches. Moreover, we validated STAVER's robustness and applicability across multiple large-scale DIA datasets, demonstrating significantly enhanced precision and reproducibility of protein quantification. STAVER offers an innovative and effective approach for enhancing the quality of large-scale DIA proteomic data, facilitating cross-platform and cross-laboratory comparative analyses. This advancement significantly enhances the consistency and reliability of findings in clinical research. The complete package is available at https://github.com/Ran485/STAVER.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11540132/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142590018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chaorui Yan, Aoyun Geng, Zhuoyu Pan, Zilong Zhang, Feifei Cui
Inflammatory responses may lead to tissue or organ damage, and proinflammatory peptides (PIPs) are signaling peptides that can induce such responses. Many diseases have been redefined as inflammatory diseases. To identify PIPs more efficiently, we expanded the dataset and designed an ensemble learning model with manually encoded features. Specifically, we adopted a more comprehensive feature encoding method and considered the actual impact of certain features to filter them. Identification and prediction of PIPs were performed using an ensemble learning model based on five different classifiers. The results show that the model's sensitivity, specificity, accuracy, and Matthews correlation coefficient are all higher than those of the state-of-the-art models. We named this model MultiFeatVotPIP, and both the model and the data can be accessed publicly at https://github.com/ChaoruiYan019/MultiFeatVotPIP. Additionally, we have developed a user-friendly web interface for users, which can be accessed at http://www.bioai-lab.com/MultiFeatVotPIP.
{"title":"MultiFeatVotPIP: a voting-based ensemble learning framework for predicting proinflammatory peptides.","authors":"Chaorui Yan, Aoyun Geng, Zhuoyu Pan, Zilong Zhang, Feifei Cui","doi":"10.1093/bib/bbae505","DOIUrl":"https://doi.org/10.1093/bib/bbae505","url":null,"abstract":"<p><p>Inflammatory responses may lead to tissue or organ damage, and proinflammatory peptides (PIPs) are signaling peptides that can induce such responses. Many diseases have been redefined as inflammatory diseases. To identify PIPs more efficiently, we expanded the dataset and designed an ensemble learning model with manually encoded features. Specifically, we adopted a more comprehensive feature encoding method and considered the actual impact of certain features to filter them. Identification and prediction of PIPs were performed using an ensemble learning model based on five different classifiers. The results show that the model's sensitivity, specificity, accuracy, and Matthews correlation coefficient are all higher than those of the state-of-the-art models. We named this model MultiFeatVotPIP, and both the model and the data can be accessed publicly at https://github.com/ChaoruiYan019/MultiFeatVotPIP. Additionally, we have developed a user-friendly web interface for users, which can be accessed at http://www.bioai-lab.com/MultiFeatVotPIP.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11479713/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142486031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
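The MultiFeatVotPIP abstract describes a voting-based ensemble of five classifiers evaluated by sensitivity, specificity, accuracy, and the Matthews correlation coefficient. The sketch below shows how those pieces fit together, assuming hard (majority) voting over binary labels; the abstract does not say whether the authors use hard or soft voting, and the five prediction lists, the labels, and the function names are made up for illustration.

```python
import math


def majority_vote(votes_per_model):
    """Hard voting: a sample is labelled 1 when more than half of the models say 1."""
    n_models = len(votes_per_model)
    return [1 if 2 * sum(sample_votes) > n_models else 0
            for sample_votes in zip(*votes_per_model)]


def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, and MCC from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical outputs of five classifiers on six peptides (1 = proinflammatory).
model_votes = [
    [1, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 0, 0],
]
y_true = [1, 1, 1, 0, 0, 0]

y_pred = majority_vote(model_votes)
metrics = binary_metrics(y_true, y_pred)
```

An odd number of voters, such as the five classifiers mentioned in the abstract, avoids ties under hard voting on a binary task.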
Tianyi Chen, Xindian Wei, Lianxin Xie, Yunfei Zhang, Cheng Liu, Wenjun Shen, Si Wu, Hau-San Wong
The spatial reconstruction of single-cell RNA sequencing (scRNA-seq) data into spatial transcriptomics (ST) is a rapidly evolving field that addresses the significant challenge of aligning gene expression profiles to their spatial origins within tissues. This task is complicated by the inherent batch effects and the need for precise gene expression characterization to accurately reflect spatial information. To address these challenges, we developed SELF-Former, a transformer-based framework that utilizes multi-scale structures to learn gene representations, while designing spatial correlation constraints for the reconstruction of corresponding ST data. SELF-Former excels in recovering the spatial information of ST data and effectively mitigates batch effects between scRNA-seq and ST data. A novel aspect of SELF-Former is the introduction of a gene filtration module, which significantly enhances the spatial reconstruction task by selecting genes that are crucial for accurate spatial positioning and reconstruction. The superior performance and effectiveness of SELF-Former's modules have been validated across four benchmark datasets, establishing it as a robust and effective method for spatial reconstruction tasks. SELF-Former demonstrates its capability to extract meaningful gene expression information from scRNA-seq data and accurately map it to the spatial context of real ST data. Our method represents a significant advancement in the field, offering a reliable approach for spatial reconstruction.
{"title":"SELF-Former: multi-scale gene filtration transformer for single-cell spatial reconstruction.","authors":"Tianyi Chen, Xindian Wei, Lianxin Xie, Yunfei Zhang, Cheng Liu, Wenjun Shen, Si Wu, Hau-San Wong","doi":"10.1093/bib/bbae523","DOIUrl":"https://doi.org/10.1093/bib/bbae523","url":null,"abstract":"<p><p>The spatial reconstruction of single-cell RNA sequencing (scRNA-seq) data into spatial transcriptomics (ST) is a rapidly evolving field that addresses the significant challenge of aligning gene expression profiles to their spatial origins within tissues. This task is complicated by the inherent batch effects and the need for precise gene expression characterization to accurately reflect spatial information. To address these challenges, we developed SELF-Former, a transformer-based framework that utilizes multi-scale structures to learn gene representations, while designing spatial correlation constraints for the reconstruction of corresponding ST data. SELF-Former excels in recovering the spatial information of ST data and effectively mitigates batch effects between scRNA-seq and ST data. A novel aspect of SELF-Former is the introduction of a gene filtration module, which significantly enhances the spatial reconstruction task by selecting genes that are crucial for accurate spatial positioning and reconstruction. The superior performance and effectiveness of SELF-Former's modules have been validated across four benchmark datasets, establishing it as a robust and effective method for spatial reconstruction tasks. SELF-Former demonstrates its capability to extract meaningful gene expression information from scRNA-seq data and accurately map it to the spatial context of real ST data. Our method represents a significant advancement in the field, offering a reliable approach for spatial reconstruction.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11483138/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A transcriptional regulatory network (TRN) is a graph framework for understanding the complex mechanisms of transcriptional regulation. Identifying phenotype-specific transcription regulators is vital for revealing the functional roles of transcription elements associated with specific phenotypes. Although many methods developed over the past decade detect phenotype-specific transcription elements from a static TRN, most cannot satisfactorily elucidate the phenotype-related functional roles of transcription regulators at multiple levels, because static models usually ignore the dynamic characteristics of transcription regulators. In this study, we introduce a novel framework called DTGN to identify phenotype-specific transcription factors (TFs) and pathways by constructing dynamic TRNs. We first design a graph autoencoder model that integrates phenotype-oriented time-series gene expression data with the static TRN to learn temporal representations of genes. Then, based on these learned temporal representations, we develop a statistical method to construct a series of dynamic TRNs associated with the development of specific phenotypes. Finally, we identify the phenotype-specific TFs and pathways from the constructed dynamic TRNs. Results from multiple phenotypic datasets show that the proposed DTGN framework outperforms most existing methods in identifying phenotype-specific TFs and pathways. Our framework offers a new approach to exploring, in a dynamic model, the functional roles of transcription regulators associated with specific phenotypes.
{"title":"Constructing the dynamic transcriptional regulatory networks to identify phenotype-specific transcription regulators.","authors":"Yang Guo, Zhiqiang Xiao","doi":"10.1093/bib/bbae542","DOIUrl":"https://doi.org/10.1093/bib/bbae542","url":null,"abstract":"<p><p>The transcriptional regulatory network (TRN) is a graph framework that helps understand the complex transcriptional regulation mechanisms in the transcription process. Identifying the phenotype-specific transcription regulators is vital to reveal the functional roles of transcription elements in associating the specific phenotypes. Although many methods have been developed towards detecting the phenotype-specific transcription elements based on the static TRN in the past decade, most of them are not satisfactory for elucidating the phenotype-related functional roles of transcription regulators in multiple levels, as the dynamic characteristics of transcription regulators are usually ignored in static models. In this study, we introduce a novel framework called DTGN to identify the phenotype-specific transcription factors (TFs) and pathways by constructing dynamic TRNs. We first design a graph autoencoder model to integrate the phenotype-oriented time-series gene expression data and static TRN to learn the temporal representations of genes. Then, based on the learned temporal representations of genes, we develop a statistical method to construct a series of dynamic TRNs associated with the development of specific phenotypes. Finally, we identify the phenotype-specific TFs and pathways from the constructed dynamic TRNs. Results from multiple phenotypic datasets show that the proposed DTGN framework outperforms most existing methods in identifying phenotype-specific TFs and pathways. Our framework offers a new approach to exploring the functional roles of transcription regulators that associate with specific phenotypes in a dynamic model.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11503644/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142495337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}