An estimated 10 million deaths could be attributed to drug-resistant bacterial infections in 2050. Identifying new-generation antibiotics is one effective way to address this concern. Antimicrobial peptides (AMPs), a class of innate immune effectors, have received significant attention for their capacity to eliminate drug-resistant pathogens, including viruses, bacteria, and fungi. Recent years have witnessed widespread application of computational methods, especially machine learning (ML) and deep learning (DL), to AMP discovery. However, existing methods rely only on features such as the compositional, physicochemical, and structural properties of peptides, which cannot fully capture the sequence information in AMPs. Here, we present SAMP, an ensemble random projection (RP) based computational model that leverages a new type of feature called proportionalized split amino acid composition (PSAAC), in addition to conventional sequence-based features, for AMP prediction. With this new feature set, SAMP captures residue patterns such as sorting signals at both the N-terminus and the C-terminus, while also retaining sequence-order information from the middle peptide fragments. Benchmarking tests on different balanced and imbalanced datasets demonstrate that SAMP consistently outperforms existing state-of-the-art methods, such as iAMPpred and AMPScanner V2, in terms of accuracy, Matthews correlation coefficient (MCC), G-measure, and F1-score. In addition, by leveraging an ensemble RP architecture, SAMP scales to large-scale AMP identification and achieves further performance improvement compared to models without RP. To facilitate the use of SAMP, we have developed a Python package that is freely available at https://github.com/wan-mlab/SAMP.
SAMP: Identifying antimicrobial peptides by an ensemble learning model based on proportionalized split amino acid composition. Junxi Feng, Mengtao Sun, Cong Liu, Weiwei Zhang, Changmou Xu, Jieqiong Wang, Guangshun Wang, Shibiao Wan. Briefings in Functional Genomics, pp. 879-890, published 2024-12-06. doi:10.1093/bfgp/elae046. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11631067/pdf/
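The SAMP abstract describes PSAAC as capturing residue patterns at the N- and C-terminal ends while keeping information from the middle of the peptide. The sketch below illustrates that general idea only: it splits a peptide into three parts by proportion of its length and concatenates the amino acid composition of each part. The function names and the `n_frac` / `c_frac` proportions are hypothetical choices for illustration; the published PSAAC definition and the SAMP package may differ.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order for the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """Return the 20-dimensional amino acid frequency vector of a segment."""
    total = len(segment)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(segment)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def split_composition(peptide, n_frac=0.25, c_frac=0.25):
    """Concatenate compositions of the N-terminal, middle, and C-terminal parts.

    n_frac and c_frac are hypothetical proportions of the peptide length;
    they are assumptions for this sketch, not values taken from the paper.
    """
    peptide = peptide.upper()
    length = len(peptide)
    # Each terminal part has at least one residue when the peptide is non-empty.
    n_len = max(1, round(length * n_frac))
    c_len = max(1, round(length * c_frac))
    n_term = peptide[:n_len]
    c_term = peptide[length - c_len:]
    middle = peptide[n_len:length - c_len]
    return composition(n_term) + composition(middle) + composition(c_term)
```

For any input, `split_composition` returns a 60-element list (three 20-dimensional composition vectors), which could be appended to conventional sequence-based features before training a classifier.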
Matheus Sanita Lima, Douglas Silva Domingues, Alexandre Rossi Paschoal, David Roy Smith
Forty years ago, organelle genomes were assumed to be streamlined and, perhaps, unexciting remnants of their prokaryotic past. However, the field of organelle genomics has exposed an unparalleled diversity in genome architecture (i.e. genome size, structure, and content). The transcription of these eccentric genomes can be just as elaborate: organelle genomes are pervasively transcribed into a plethora of RNA types. However, while organelle protein-coding genes are known to produce polycistronic transcripts that undergo heavy posttranscriptional processing, the nature of organelle noncoding transcriptomes is still poorly resolved. Here, we review how wet-lab experiments and second-generation sequencing data (i.e. short reads) have been useful to determine certain types of organelle RNAs, particularly noncoding RNAs. We then explain how third-generation (long-read) RNA-Seq data represent the new frontier in organelle transcriptomics. We show that public repositories (e.g. NCBI SRA) already contain enough data for inter-phyla comparative studies and argue that organelle biologists can benefit from such data. We discuss the prospects of using publicly available sequencing data for organelle-focused studies and examine the challenges of such an approach. We highlight that the lack of a comprehensive database dedicated to organelle genomics/transcriptomics is a major impediment to the development of a field with implications in basic and applied science.
Long-read RNA sequencing can probe organelle genome pervasive transcription. Matheus Sanita Lima, Douglas Silva Domingues, Alexandre Rossi Paschoal, David Roy Smith. Briefings in Functional Genomics, pp. 695-701, published 2024-12-06. doi:10.1093/bfgp/elae026.
Issa Keerthi, Vishnu Shukla, Sudhamani Kalluru, Lal Ahamed Mohammad, P Lavanya Kumari, Eswarayya Ramireddy, Lakshminarayana R Vemireddy
Rapidly identifying candidate genes underlying major QTLs is crucial for improving rice (Oryza sativa L.). In this study, we developed a workflow to rapidly prioritize candidate genes underpinning 99 major QTLs governing yield component traits. This workflow integrates multiomics databases, including sequence variation, gene expression, gene ontology, co-expression analysis, and protein-protein interaction. We predicted 206 candidate genes for 99 reported QTLs governing ten economically important yield-contributing traits using this approach. Among these, transcription factors belonging to families of MADS-box, WRKY, helix-loop-helix, TCP, MYB, GRAS, auxin response factor, and nuclear transcription factor Y subunit were promising. Validation of key prioritized candidate genes in contrasting rice genotypes for sequence variation and differential expression identified Leucine-Rich Repeat family protein (LOC_Os03g28270) and cytochrome P450 (LOC_Os02g57290) as candidate genes for the major QTLs GL1 and pl2.1, which govern grain length and panicle length, respectively. In conclusion, this study demonstrates that our workflow can significantly narrow down a large number of annotated genes in a QTL to a very small number of the most probable candidates, achieving approximately a 21-fold reduction. These candidate genes have potential implications for enhancing rice yield.
Prioritization of candidate genes for major QTLs governing yield traits employing integrated multi-omics approach in rice (Oryza sativa L.). Issa Keerthi, Vishnu Shukla, Sudhamani Kalluru, Lal Ahamed Mohammad, P Lavanya Kumari, Eswarayya Ramireddy, Lakshminarayana R Vemireddy. Briefings in Functional Genomics, pp. 843-857, published 2024-12-06. doi:10.1093/bfgp/elae035.
Next-generation sequencing and other sequencing approaches have made significant progress in DNA analysis. However, nonsequencing methods retain indispensable advantages: they are fast, cost-effective, widely applicable, and straightforward. Among the nonsequencing methods, genome profiling merits review because of its high potential. This article first reviews its basic properties, highlights the key concept of species identification dots (spiddos), and then summarizes its various applications.
Discoveries by the genome profiling, symbolic powers of non-next generation sequencing methods. Koichi Nishigaki. Briefings in Functional Genomics, pp. 775-797, published 2024-12-06. doi:10.1093/bfgp/elae047.
Gene regulatory networks (GRNs) contribute toward understanding the function of genes, the development of cancer, and the impact of key genes on diseases. Hence, this study proposes an ensemble method based on 13 basic classification methods and a flexible neural tree (FNT) to improve GRN identification accuracy. The basic classification methods comprise ridge classification, stochastic gradient descent, Gaussian process classification, Bernoulli Naive Bayes, adaptive boosting, gradient boosting decision tree, hist gradient boosting classification, eXtreme gradient boosting (XGBoost), multilayer perceptron, light gradient boosting machine, random forest, support vector machine, and k-nearest neighbor algorithm, which together serve as the input variable set of the FNT model. Additionally, a hybrid evolutionary algorithm based on a gene programming variant and particle swarm optimization is developed to search for the optimal FNT model. Experiments on three simulation datasets and three real single-cell RNA-seq datasets demonstrate that the proposed ensemble method outperforms 13 supervised algorithms, seven unsupervised algorithms (ARACNE, CLR, GENIE3, MRNET, PCACMI, GENECI, and EPCACMI), and four single cell-specific methods (SCODE, BiRGRN, LEAP, and BiGBoost) based on the area under the receiver operating characteristic curve, area under the precision-recall curve, and F1 metrics.
Gene regulatory network inference based on novel ensemble method. Bin Yang, Jing Li, Xiang Li, Sanrong Liu. Briefings in Functional Genomics, pp. 866-878, published 2024-12-06. doi:10.1093/bfgp/elae036.
Yuanfeng Xu, Fan Yu, Wenrong Feng, Jia Wei, Shengyan Su, Jianlin Li, Guoan Hua, Wenjing Li, Yongkai Tang
At present, public databases house an extensive repository of transcriptome data, with the volume continuing to grow at an accelerated pace. Utilizing these data effectively is a shared interest within the scientific community. In this study, we introduced a novel strategy that harnesses SNPs and InDels identified from transcriptome data, combined with sample metadata from databases, to effectively screen for molecular markers correlated with traits. We utilized 228 transcriptome datasets of Eriocheir sinensis from the NCBI database and employed the Genome Analysis Toolkit software to identify 96 388 SNPs and 20 645 InDels. Employing the genome-wide association study analysis, in conjunction with the gender information from databases, we identified 3456 sex-biased SNPs and 639 sex-biased InDels. The KOG and KEGG annotations of the sex-biased SNPs and InDels revealed that these genes were primarily involved in the metabolic processes of E. sinensis. Combined with SnpEff annotation and PCR experimental validation, a highly sex-biased SNP located in the Kelch domain containing 4 (Klhdc4) gene, CHR67-6415071, was found to alter the splicing sites of Klhdc4, generating two splice variants, Klhdc4_a and Klhdc4_b. Additionally, Klhdc4 exhibited robust expression across the ovaries, testes, and accessory glands. The sex-biased SNPs and InDels identified in this study are conducive to the development of unisexual cultivation methods for E. sinensis, and the alternative splicing event caused by the sex-biased SNP in Klhdc4 may serve as a potential mechanism for sex regulation in E. sinensis. The analysis strategy employed in this study represents a new direction for the rational exploitation and utilization of transcriptome data in public databases.
Genetic variation mining of the Chinese mitten crab (Eriocheir sinensis) based on transcriptome data from public databases. Yuanfeng Xu, Fan Yu, Wenrong Feng, Jia Wei, Shengyan Su, Jianlin Li, Guoan Hua, Wenjing Li, Yongkai Tang. Briefings in Functional Genomics, pp. 816-827, published 2024-12-06. doi:10.1093/bfgp/elae030.
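The E. sinensis abstract describes screening variants for association with sex by combining called SNPs and InDels with sample metadata. As a minimal illustration of what a per-variant association test can look like, the sketch below computes a Pearson chi-square statistic on allele counts split by sex. This is not the authors' pipeline (they report using the Genome Analysis Toolkit followed by a genome-wide association analysis); the function names and the VCF-style genotype strings such as `0/1` are assumptions made for this example.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # A table with an empty row or column carries no association signal.
    if 0 in (row1, row2, col1, col2):
        return 0.0
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def allele_counts(genotypes):
    """Count reference ('0') and alternate ('1') alleles in genotype strings."""
    ref = alt = 0
    for genotype in genotypes:
        for allele in genotype.replace("|", "/").split("/"):
            if allele == "0":
                ref += 1
            elif allele == "1":
                alt += 1
            # Missing calls ('.') and other alleles are ignored.
    return ref, alt


def sex_bias_statistic(female_genotypes, male_genotypes):
    """Chi-square statistic comparing allele counts between females and males."""
    f_ref, f_alt = allele_counts(female_genotypes)
    m_ref, m_alt = allele_counts(male_genotypes)
    return chi_square_2x2(f_ref, f_alt, m_ref, m_alt)
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05 for a single test; a genome-wide scan over tens of thousands of variants would need a multiple-testing correction on top of this.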
Diego A Forero, Diego A Bonilla, Yeimy González-Giraldo, George P Patrinos
Recent advances in high-throughput molecular methods have led to an extraordinary volume of genomics data. Simultaneously, the progress in the computational implementation of novel algorithms has facilitated the creation of hundreds of freely available online tools for their advanced analyses. However, a general overview of the most commonly used tools for the in silico analysis of genomics data is still missing. In the current article, we present an overview of commonly used online resources for genomics research, including over 50 tools. This selection will be helpful for scientists with basic or intermediate skills in the in silico analyses of genomics data, such as researchers and students from wet labs seeking to strengthen their computational competencies. In addition, we discuss current needs and future perspectives within this field.
An overview of key online resources for human genomics: a powerful and open toolbox for in silico research. Diego A Forero, Diego A Bonilla, Yeimy González-Giraldo, George P Patrinos. Briefings in Functional Genomics, pp. 754-764, published 2024-12-06. doi:10.1093/bfgp/elae029.
In recent years, the application of single-cell transcriptomics and spatial transcriptomics analysis techniques has become increasingly widespread. Whether dealing with single-cell transcriptomic or spatial transcriptomic data, dimensionality reduction and clustering are indispensable. Both single-cell and spatial transcriptomic data are often high-dimensional, making the analysis and visualization of such data challenging. Through dimensionality reduction, it becomes possible to visualize the data in a lower-dimensional space, allowing for the observation of relationships and differences between cell subpopulations. Clustering enables the grouping of similar cells into the same cluster, aiding in the identification of distinct cell subpopulations and revealing cellular diversity, providing guidance for downstream analyses. In this review, we systematically summarized the most widely recognized algorithms employed for the dimensionality reduction and clustering analysis of single-cell transcriptomic and spatial transcriptomic data. This endeavor provides valuable insights and ideas that can contribute to the development of novel tools in this rapidly evolving field.
A comprehensive survey of dimensionality reduction and clustering methods for single-cell and spatial transcriptomics data. Yidi Sun, Lingling Kong, Jiayi Huang, Hongyan Deng, Xinling Bian, Xingfeng Li, Feifei Cui, Lijun Dou, Chen Cao, Quan Zou, Zilong Zhang. Briefings in Functional Genomics, pp. 733-744, published 2024-12-06. doi:10.1093/bfgp/elae023.
RNA interference (RNAi) technology is widely used in the biological prevention and control of terrestrial insects. One of the main limiting factors in applying RNAi to insects is the variation in RNAi efficiency, which may differ not only between insects, but also between genes of the same insect, and even between double-stranded RNAs (dsRNAs) targeting the same gene. This work focuses on the last question and establishes bioinformatics software that can help researchers screen for the most efficient dsRNA for a given target gene. Among insects, the red flour beetle (Tribolium castaneum) is known to be one of the most sensitive to RNAi. From iBeetle-Base, we extracted 12 027 efficient dsRNA sequences with a lethality rate of ≥20% or with experimentation-induced phenotypic changes and processed these data to correspond to specific silencing efficiencies. Based on this first compiled benchmark dataset, we specifically designed a deep neural network to identify and characterize efficient dsRNA for RNAi in insects. The dna2vec word embedding model was trained to extract distributed feature representations, and three powerful modules, namely a convolutional neural network, a bidirectional long short-term memory network, and a self-attention mechanism, were integrated to form our predictor model to characterize the extracted dsRNAs and their silencing efficiencies for T. castaneum. Our model dsRNAPredictor showed reliable performance in multiple independent tests based on different species, including both T. castaneum and Aedes aegypti. This indicates that dsRNAPredictor can facilitate prescreening when designing high-efficiency dsRNAs against insect target genes.
{"title":"Characterization of double-stranded RNA and its silencing efficiency for insects using hybrid deep-learning framework.","authors":"Han Cheng, Liping Xu, Cangzhi Jia","doi":"10.1093/bfgp/elae027","DOIUrl":"10.1093/bfgp/elae027","PeriodicalId":55323,"journal":{"name":"Briefings in Functional Genomics","pages":"858-865"},"PeriodicalIF":2.5,"publicationDate":"2024-12-06","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"141443790","RegionNum":3,"RegionCategory":"Biology","Total":0}
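The dsRNAPredictor abstract above mentions dna2vec embeddings as the feature-extraction step. dna2vec learns vectors for k-mers, so a sequence first has to be split into k-mer tokens. The following is a minimal sketch of that tokenization only, not the authors' implementation: the function name `kmer_tokens`, the fixed `k=3`, the stride of 1, and the mapping of U to T are all illustrative assumptions that the abstract does not specify.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mers.

    Hypothetical helper: k, stride, and the U->T mapping are assumptions
    made for illustration, not settings reported in the abstract.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    # Normalize case and map RNA (U) onto the DNA alphabet (T).
    seq = sequence.upper().replace("U", "T")
    # Slide a window of width k over the sequence; sequences shorter
    # than k yield an empty token list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    # Each token would then be looked up in a trained k-mer embedding table.
    print(kmer_tokens("AUGGCUAAG", k=3))
```

Each resulting token could then be mapped to its embedding vector, giving the sequence of vectors that a downstream network (such as the CNN, BiLSTM, and self-attention modules named in the abstract) would consume.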
Sesame (Sesamum indicum L.) is a globally cultivated oilseed crop renowned for its historical significance and widespread growth in tropical and subtropical regions. With notable nutritional and medicinal attributes, sesame has shown promising effects in combating malnutrition, cancer, diabetes, and other diseases such as cardiovascular problems. However, sesame production faces significant challenges from environmental threats such as charcoal rot, drought, salinity, and waterlogging stress, resulting in economic losses for farmers. The scarcity of information on stress-resistance genes and pathways exacerbates these challenges. Despite the crop's importance, there is currently no platform that provides comprehensive information on sesame, which significantly hinders the mining of stress-associated genes and the molecular breeding of sesame. To address this gap, a free, web-accessible, and user-friendly genomic web resource (SesameGWR, http://backlin.cabgrid.res.in/sesameGWR/) has been developed. This platform provides key insights into differentially expressed genes, transcription factors, miRNAs, and molecular markers such as simple sequence repeats, single nucleotide polymorphisms, and insertions and deletions associated with both biotic and abiotic stresses. The functional genomics information and annotations embedded in this web resource were predicted through RNA-seq data analysis. Given the impact of climate change and the nutritional and medicinal importance of sesame, this study contributes to the understanding of stress responses in the crop. SesameGWR will serve as a valuable tool for developing climate-resilient sesame varieties, thereby enhancing the productivity of this ancient oilseed crop.
{"title":"Sesame Genomic Web Resource (SesameGWR): a well-annotated data resource for transcriptomic signatures of abiotic and biotic stress responses in sesame (Sesamum indicum L.).","authors":"Himanshu Avashthi, Ulavappa Basavanneppa Angadi, Divya Chauhan, Anuj Kumar, Dwijesh Chandra Mishra, Parimalan Rangan, Rashmi Yadav, Dinesh Kumar","doi":"10.1093/bfgp/elae022","DOIUrl":"10.1093/bfgp/elae022","PeriodicalId":55323,"journal":{"name":"Briefings in Functional Genomics","pages":"828-842"},"PeriodicalIF":2.5,"publicationDate":"2024-12-06","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"141238358","RegionNum":3,"RegionCategory":"Biology","Total":0}