Making sense of complexity: Advances in bioinformatics for plant biology
Katie Emelianova, Diego Mauricio Riaño-Pachón, Maria Fernanda Torres Jimenez
DOI: 10.1002/aps3.11538
Published: 2023-08-09
Abstract
Coined by Dutch theoretical biologists in the 1970s, the term bioinformatics originally denoted a broad concept relating to the study of information processing in biological systems, such as ecosystem interaction, neuronal messaging, and transfer of genetic information (Hogeweg, 2011). Subsequently co-opted to describe the sequencing and analysis of molecules (from nucleic acids to proteins), bioinformatics has diverse applications including the analysis, visualization, storage, and generation of data relating to living organisms and the molecular information they carry. Plant biology has reaped dividends from the development and maturation of bioinformatics; it has not only extended our understanding of model plant species such as Arabidopsis thaliana (Cantó-Pastor et al., 2021) but also driven innovative solutions to characterize non-model species (Nevado et al., 2014). Both avenues of discovery contribute to key objectives in improving food security, conservation, and biotechnology.
The size and complexity of many plant genomes have historically made their analysis financially and computationally difficult. Frequent polyploidy and repeat element expansion make the elucidation of plant genome sequences challenging (Soltis et al., 2015). Furthermore, high heterozygosity in wild populations, pervasive hybridization, and a lack of inbred lines present roadblocks to analyses such as read mapping and assembly (Kajitani et al., 2019). Long-read technologies have become ever more accessible in recent years, and algorithmic advances have accommodated sequential updates to error models, read lengths, and library types (Michael and VanBuren, 2020). Moreover, novel methods to scaffold contigs and obtain long-range interaction information have driven impressive improvements in genome assembly quality, making telomere-to-telomere genome sequencing projects an achievable goal for many labs (Kress et al., 2022).
Long-read technologies paired with novel mapping algorithms have fueled discovery of new transposable element (TE) dynamics, and there has been an associated resurgence of interest in their role in adaptive trait evolution and phenotypic variation (Schrader and Schmitz, 2019; Pimpinelli and Piacentini, 2020). Bioinformatics developments in this field have led to vast improvements in our ability to detect complex TE mobilization patterns such as nested insertions and structural variants (Bree et al., 2022; Lemay et al., 2022). Despite these advancements, characterization and annotation of genomic features such as genes and repetitive elements remain challenging due to species-specific genomic configurations, taxonomically patchy reference databases, and a lack of robust benchmarking and quality control. While structural and functional annotation methods still have significant obstacles to overcome, many important contributions have been made to improve the comparison and optimization of these approaches (Caballero and Wegrzyn, 2019). Moreover, the extension and aggregation of existing gene, variant, and repeat annotation software is beginning to allow researchers to combine and curate different algorithmic approaches and databases (Nelson et al., 2017; Kirsche et al., 2023).
The scale of plant diversity to be characterized remains a challenge, however, and incorporating samples from preserved, non-model, or difficult-to-access material requires innovative wet lab and bioinformatics solutions (Lang et al., 2020). Reduced representation sequencing (RRS) methods represent a crucial tool for the study of non-model plants; this adaptation of emerging sequencing technologies has allowed for cost-effective population studies, analyses of historical diversity using herbarium specimens, and phylogenomic explorations on a large scale (Kersey, 2019; One Thousand Plant Transcriptomes Initiative, 2019). Limitations associated with RRS such as paralogous genes, different selection landscapes of coding and non-coding sequences, and missing data are increasingly accounted for with the continuous improvement of software and methodology (Johnson et al., 2016), and integration of -omics data for non-model taxa in online portals creates an ever more accessible environment for researchers to characterize the world's flora (Goodstein et al., 2012).
Bioinformatics, since its inception in biological applications, has been a field in constant flux, with a high turnover of technologies, sequencing platforms, algorithms, and techniques, and the current landscape of bioinformatics in plant sciences is no different. This special issue of Applications in Plant Sciences presents five papers that explore bioinformatics approaches to address issues in plant biology, such as genome assembly, reduced representation sequencing, and structural and functional annotation. We summarize these papers here.
Reduced representation sequencing methods such as target capture, RAD sequencing, and genome skimming provide powerful tools for phylogenomic studies, especially in cases where whole genome analyses are infeasible or many non-model organisms must be sampled cost-efficiently. Bioinformatic methods such as probe design and resolution of paralogous sequences have critical impacts on downstream analyses and interpretations; therefore, clear guidelines and accessible implementation are important to ensure that maximum benefits are reaped by the scientific community. Two papers in this issue discuss aspects of RRS.
Despite recent advances in whole genome sequencing, RRS approaches continue to be of great importance in biodiversity and evolutionary studies, particularly in situations where obtaining fresh plant material is not feasible or the number of samples is very large. In their contribution, Pezzini et al. (2023) provide a comprehensive review of genome skimming and target capture, two techniques commonly used for the study of non-model organisms and difficult material such as herbarium specimens. This review is timely, because while the design of target capture probes (i.e., bait sets) for specific taxa has historically been hindered by the limited availability of genomic resources for non-model organisms, this is likely to change in the next few years thanks to ambitious whole genome sequencing efforts such as the Earth Biogenome Project (Lewin et al., 2022). The rapid growth in the number and taxonomic resolution of bait sets is making analysis of non-model plant species easier by using probes that are universal or cover larger clades. Pezzini and co-authors discuss a variety of approaches that make use of existing resources, such as combining universal and taxon-specific bait sets for use in non-model organisms, or combining new results with legacy data to enable broader taxon sampling. Genome skimming and target capture raise similar considerations; however, because genome skimming is untargeted, the resulting sequence data are highly dependent on copy number, favoring more frequently represented regions such as those in chloroplasts and mitochondria. Including both project planning and downstream analysis considerations, the authors review the merits and drawbacks of both target capture and genome skimming approaches, providing a valuable resource for researchers who may have a variety of data, taxa, and tissue types at hand.
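The copy-number dependence of genome skimming can be illustrated with a short calculation. The Python sketch below is only an idealized model and is not taken from Pezzini et al. (2023): it assumes reads are sampled in proportion to the amount of DNA each compartment contributes, and the function name `expected_read_fractions` and all genome sizes and copy numbers are hypothetical values chosen for illustration.

```python
def expected_read_fractions(compartments):
    """Estimate the share of untargeted reads expected from each compartment.

    `compartments` maps a name to (genome_size_bp, copies_per_cell).
    Idealized assumption: reads are drawn in proportion to total DNA,
    i.e. genome size multiplied by copy number.
    """
    dna = {name: size * copies for name, (size, copies) in compartments.items()}
    total = sum(dna.values())
    if total == 0:
        raise ValueError("at least one compartment must contribute DNA")
    return {name: amount / total for name, amount in dna.items()}


# Hypothetical numbers, for illustration only (not measured values).
example = {
    "nuclear": (500_000_000, 2),      # large genome, two copies per cell
    "plastid": (150_000, 1_000),      # small genome, many copies per cell
    "mitochondrial": (400_000, 100),
}
for name, fraction in expected_read_fractions(example).items():
    print(f"{name}: {fraction:.1%} of reads")
```

Under these assumed values, the plastid genome receives a share of reads far larger than its size alone would suggest, which is why organellar regions tend to be well covered even at shallow sequencing depth.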
In their contribution to this issue, Jackson et al. (2023) build on the existing bioinformatic pipelines HybPiper (Johnson et al., 2016) and ParaGone (Yang and Smith, 2014), providing a streamlined version of both pipelines within a Singularity container, vastly simplifying dependency installation and implementation. These two pipelines perform target capture read assembly and paralogy resolution, respectively, and the use of both is a common workflow employed by phylogeneticists prior to species tree inference. Within the containerized pipeline, the authors implement two Nextflow workflows, hybpiper-nf and paragone-nf, which offer improved sample handling and methodological refinements. Hybpiper-nf addresses organization and tractability of large sample sizes, automatically detecting sequence types in BLAST (Altschul et al., 1990) and Diamond (Buchfink et al., 2015) runs and parsing sequence names from read files. Further improvements over the previous standalone implementations of HybPiper include new options to control the resolution of chimeric locus assemblies, giving the user greater insight into and control over the processing of target capture data. The process of phylogenomic inference is streamlined by the production of correctly formatted files from hybpiper-nf that are directly compatible with paragone-nf, where four different paralog inference algorithms are implemented (originally described in Yang and Smith [2014]). The authors test their workflow using the Angiosperms353 and Compositae1061 bait sets applied to data sets including Asteraceae and Orchidaceae, demonstrating greatly improved usability and streamlining of the target capture workflow. This new, containerized workflow will provide the non-model plant biology community with more accessible bioinformatic tools to analyze RRS data and greatly streamline new phylogenomic projects.
Transposable elements are a ubiquitous feature of plant genomes, and the revival of interest in TEs and their role in genome dynamics, trait evolution, and evolutionary trajectories has coincided with the emergence of long-read sequencing technologies, which can allow researchers to capture 5′ and 3′ insertion sites in a single read, a feat not previously possible with short reads. Popular TE annotation software, however, remains computationally inaccessible for some researchers due to long run times and high computational demands. Gonzalez-García et al. (2023) leverage algorithmic advances in long-read mapping techniques to annotate TEs, using a computationally efficient homology-based method employing minimizers. The comparatively high error rate of long reads is a useful proxy for the imperfect sequence conservation between members of TE families, and the authors build on the long-read alignment method used by Minimap2 (Li, 2018) to reduce run time from hours to minutes, marking an improvement of orders of magnitude in computational efficiency. Moreover, the authors make use of alternatives to commonly used de novo TE annotation pipelines (Orozco-Arias et al., 2023), broadening the diversity of bioinformatic resources for TE annotation, a field which, despite its age, still presents significant challenges in model and non-model organisms alike.
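To make the idea of minimizers concrete, a minimal Python sketch follows. It is not the implementation of Gonzalez-García et al. (2023) or of Minimap2: the function names `minimizers` and `shared_minimizer_fraction` and the parameters `k` and `w` are hypothetical, and k-mers are ranked lexicographically rather than by the hash functions that production tools typically use.

```python
def minimizers(seq, k=5, w=4):
    """Return the set of (k-mer, position) minimizers of `seq`.

    For every window of `w` consecutive k-mers, keep the lexicographically
    smallest k-mer (the leftmost one on ties). Sequences too short to
    hold one full window yield an empty set.
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Tuples compare by k-mer first, then by position.
        selected.add(min(window))
    return selected


def shared_minimizer_fraction(a, b, k=5, w=4):
    """Fraction of the minimizer k-mers of `a` that also occur among those of `b`.

    A crude similarity score for illustration, not an alignment.
    """
    kmers_a = {kmer for kmer, _ in minimizers(a, k, w)}
    kmers_b = {kmer for kmer, _ in minimizers(b, k, w)}
    return len(kmers_a & kmers_b) / len(kmers_a) if kmers_a else 0.0
```

Because only one k-mer per window is retained, two sequences can be compared through a much smaller set of seeds than the full k-mer list, and diverged copies still tend to share some minimizers in their unchanged stretches.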
The annotation of gene features is a fundamental step in ascribing context to genomic data sets, paving the way for further studies such as expression assays, comparative genomics, and population dynamics. Despite advances in genome assembly methods, genome annotation remains one of the most challenging bottlenecks facing plant genome science, with intron length variation, divergent TE dynamics, and low sequence conservation hampering the annotation efforts of non-model genome projects. In their contribution, Vuruputoor et al. (2023) address the need for better quantitative assessment of structural genome annotation, employing a mixture of existing and emerging metrics to benchmark annotation methods. They approach the issue in a robust manner by using a broad diversity of taxa with challenging genomic features such as variable ploidy, high TE content, and large genomes. As well as commonly used metrics such as BUSCO, the authors draw attention to equally informative measures of annotation quality such as the ratio of mono-exonic to multi-exonic genes to detect unlikely gene models and false positive genes resulting from incomplete repeat masking. That the problem of genome annotation is not solved, even in model plant species, is testament to the importance of benchmarking studies such as this, and the inclusion of challenging taxa during software design is vital to ensure non-model plant species can equally benefit from bioinformatic innovations.
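The mono-exonic to multi-exonic ratio mentioned above can be computed directly from an annotation file. The Python sketch below is an assumption-laden illustration, not the benchmarking code of the paper: the function name `mono_multi_ratio` is hypothetical, the input is assumed to be GFF3 text with `exon` features carrying a `Parent` attribute, and exons are counted per transcript rather than per gene for simplicity.

```python
from collections import Counter


def mono_multi_ratio(gff3_lines):
    """Count mono- and multi-exonic transcripts in GFF3 text lines.

    Returns (mono, multi, ratio); ratio is None when there are no
    multi-exonic transcripts. Simplification: exons are tallied per
    transcript ID found in the Parent attribute, not per gene.
    """
    exons_per_transcript = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        # An exon may list several parents separated by commas.
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                exons_per_transcript[parent] += 1
    mono = sum(1 for n in exons_per_transcript.values() if n == 1)
    multi = sum(1 for n in exons_per_transcript.values() if n > 1)
    return mono, multi, (mono / multi if multi else None)
```

An unusually high ratio for a given lineage would then flag the annotation for closer inspection, for example for single-exon gene models that overlap unmasked repeats.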
Upstream bioinformatics analyses frequently produce an extensive list of genes of interest, for example, transcripts that are differentially expressed between control and perturbed conditions, genes that show signals of accelerated rates of evolution, or particularly duplication-rich gene families. To make these results statistically meaningful and human readable, further contextualization is required through categorizing the genes employing the widely used system of gene ontology (GO). In GO, hierarchical structures of molecular functions, cellular locations, or biological processes are arranged from the general to the specific, and these categories represent a universal way to describe gene function. GO annotation results in a large amount of data that is difficult to synthesize manually, precluding quick insights into the results of upstream applications. Here, Sessa et al. (2023) describe and test GOgetter, an easy-to-use pipeline for the summarization and visualization of GO annotations that takes a set of FASTA files and a GO slim mapping file as input. GOgetter combines functionalities for transferring annotations via homology searches, calculating summaries for every data set, and producing publication-ready graphs. GOgetter is flexible, allowing users to apply different quality and similarity filters as well as use different reference databases to accommodate non-model organisms. Three case studies demonstrate GOgetter's flexibility, wide applicability from bryophytes to angiosperms, and robustness. We anticipate that this software will facilitate the rapid exploration of new transcriptomes and genomes by streamlining the GO annotation process.
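The summarization step can be pictured as collapsing detailed GO terms into broader GO slim categories and counting genes per category. The Python sketch below is a generic illustration and does not reproduce GOgetter's code or interface: the function name `summarize_go_slim` is hypothetical, and the gene names and the small GO-to-slim mapping are toy values written for this example rather than taken from an official GO slim file.

```python
from collections import Counter


def summarize_go_slim(gene_to_go, go_to_slim):
    """Count genes per GO slim category.

    `gene_to_go` maps a gene ID to its set of GO term IDs;
    `go_to_slim` maps a GO term ID to a slim category label.
    Each gene is counted at most once per category, and GO terms
    absent from the mapping are ignored.
    """
    counts = Counter()
    for go_ids in gene_to_go.values():
        categories = {go_to_slim[go_id] for go_id in go_ids if go_id in go_to_slim}
        counts.update(categories)
    return counts


# Toy input, for illustration only.
gene_to_go = {
    "geneA": {"GO:0006412"},                # translation
    "geneB": {"GO:0015979", "GO:0009765"},  # photosynthesis terms
    "geneC": {"GO:0015979"},
}
go_to_slim = {
    "GO:0006412": "translation",
    "GO:0015979": "photosynthesis",
    "GO:0009765": "photosynthesis",
}
for category, n_genes in sorted(summarize_go_slim(gene_to_go, go_to_slim).items()):
    print(category, n_genes)
```

Counting each gene once per category keeps a gene with several closely related GO terms from inflating a single slim category, which matters when the counts are later plotted side by side across data sets.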
Bioinformatics has revolutionized plant biology, enabling researchers to harness analytical advancements and reveal the enormous complexity of plant genomes, relationships, and biology. As technological innovations promise to provide us with ever greater insights, our bioinformatic analyses of novel data types must keep pace by supporting techniques to further our understanding of plant biology, benchmarking methods for complex bioinformatic operations such as genome annotation, and contextualizing biological data in functional or structural terms. This special issue reflects the diversity of approaches to new and old problems in plant biology, showcasing the wide range of bioinformatics applications in the field, and we hope that it will support the continuing development of bioinformatics tools and methods for a new generation of technological advances.
K.E. prepared the initial draft of the manuscript. All authors contributed to article summaries, reviewed and edited subsequent drafts, and approved the final version of the manuscript.