Background: Chiridota heheva is a cosmopolitan holothurian well adapted to diverse deep-sea ecosystems, especially chemosynthetic environments. Besides high hydrostatic pressure and limited light, high concentrations of metal ions also represent harsh conditions in hydrothermal environments. Few holothurian species can live in such extreme conditions. Therefore, it is valuable to elucidate the adaptive genetic mechanisms of C. heheva in hydrothermal environments.
Findings: Herein, we report a high-quality reference genome assembly of C. heheva from the Kairei vent, which is the first chromosome-level genome of Apodida. The chromosome-level genome size was 1.43 Gb, with a scaffold N50 of 53.24 Mb and BUSCO completeness score of 94.5%. Contig sequences were clustered, ordered, and assembled into 19 natural chromosomes. Comparative genome analysis found that the expanded gene families and positively selected genes of C. heheva were involved in the DNA damage repair process. The expanded gene families and the unique genes contributed to maintaining iron homeostasis in an iron-enriched environment. The positively selected gene RFC2 with 10 positively selected sites played an essential role in DNA repair under extreme environments.
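For readers unfamiliar with the scaffold N50 statistic quoted above, the following is a minimal sketch of how it is conventionally computed: the length of the scaffold at which the cumulative length, summed from longest to shortest, first reaches half of the total assembly size. The function name and the toy scaffold lengths are hypothetical and are not taken from the C. heheva assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths in Mb (illustrative only).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

With these toy values the total is 300 Mb, and the cumulative sum reaches 150 Mb at the second scaffold, so the N50 is 70 Mb.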
Conclusions: This first chromosome-level genome assembly of C. heheva reveals the hydrothermal adaptation of holothurians. As the first chromosome-level genome of the order Apodida, this genome will provide a resource for investigating the evolution of the class Holothuroidea.
Background: Structural variants (SVs) are genomic polymorphisms defined by their length (>50 bp). The usual types of SVs are deletions, insertions, translocations, inversions, and copy number variants. SV detection and genotyping are fundamental given the role of SVs in phenomena such as phenotypic variation and evolutionary events. Thus, methods to identify SVs using long-read sequencing data have recently been developed.
Findings: We present an accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts by collecting evidence (signatures) of SVs from read alignments. Then, signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions. Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model allows precise genotyping of SVs based on their supporting evidence. This algorithm is integrated into the single sample variants detector of the Next Generation Sequencing Experience Platform, which facilitates integration with other functionalities for genomics analysis. We performed multiple benchmark experiments, including simulation and real data, representing different genome profiles, sequencing technologies (PacBio HiFi, ONT), and read depths.
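The clustering step described above can be illustrated with a minimal pure-Python DBSCAN. This is a sketch under stated assumptions, not the authors' implementation: each signature is reduced to a (genomic position, length) point used directly as Euclidean coordinates, whereas the abstract does not specify how the coordinates are derived. The `eps` and `min_pts` values, the toy signatures, and the median-based summary of each cluster are all hypothetical.

```python
from math import hypot
from statistics import median


def dbscan(points, eps, min_pts):
    """Label each 2D point with a cluster id; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(reach)
    return labels


def summarize(points, labels):
    """Collapse each cluster into (median position, median length, support)."""
    clusters = {}
    for point, label in zip(points, labels):
        if label != -1:
            clusters.setdefault(label, []).append(point)
    return {label: (median(p for p, _ in members),
                    median(l for _, l in members),
                    len(members))
            for label, members in clusters.items()}


# Hypothetical deletion signatures as (position, length) pairs.
signatures = [(1000, 300), (1005, 310), (998, 295), (1002, 305),
              (5000, 60), (5003, 62), (5001, 58), (9000, 400)]
labels = dbscan(signatures, eps=25, min_pts=3)
calls = summarize(signatures, labels)
print(labels)
print(calls)
```

In this toy input the isolated signature at position 9000 has no neighbors within `eps`, so it is labeled as noise instead of being forced into a cluster. This is the property of density-based clustering that makes it attractive for separating nearby SV events.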
Conclusion: The results show that our approach outperformed state-of-the-art tools on germline SV calling and genotyping, especially at low depths, and in error-prone repetitive regions. We believe this work significantly contributes to the development of bioinformatic strategies to maximize the use of long-read sequencing technologies.
Over the past decade, deep learning (DL) research in computer vision has been growing rapidly, with many advances in DL-based image analysis methods for biomedical problems. In this work, we introduce MMV_Im2Im, a new open-source Python package for image-to-image transformation in bioimaging applications. MMV_Im2Im is designed with a generic image-to-image transformation framework that can be used for a wide range of tasks, including semantic segmentation, instance segmentation, image restoration, and image generation. Our implementation takes advantage of state-of-the-art machine learning engineering techniques, allowing researchers to focus on their research without worrying about engineering details. We demonstrate the effectiveness of MMV_Im2Im on more than 10 different biomedical problems, showcasing its broad potential and applicability. For computational biomedical researchers, MMV_Im2Im provides a starting point for developing new biomedical image analysis or machine learning algorithms, where they can either reuse the code in this package or fork and extend this package to facilitate the development of new methods. Experimental biomedical researchers can benefit from this work by gaining a comprehensive view of the image-to-image transformation concept through diversified examples and use cases. We hope this work can inspire the community on how DL-based image-to-image transformation can be integrated into the assay development process, enabling new biomedical studies that cannot be done with traditional experimental assays alone. To help researchers get started, we have provided source code, documentation, and tutorials for MMV_Im2Im at https://github.com/MMV-Lab/mmv_im2im under the MIT license.
Background: In clinical research, data have to be accessible and reproducible, but the generated data are becoming larger and their analysis more complex. Here we propose a platform for Findable, Accessible, Interoperable, and Reusable (FAIR) data access and for creating reproducible findings. Standardized access to a major genomic repository, the European Genome-Phenome Archive (EGA), has been achieved with API services like PyEGA3. We aim to provide a FAIR data analysis service in Galaxy by retrieving genomic data from the EGA and to provide a generalized "omics" platform for FAIR data analysis.
Results: To demonstrate this, we implemented an end-to-end Galaxy workflow to replicate the findings from an RD-Connect synthetic dataset, Beyond the 1 Million Genomes (synB1MG), available from the EGA. We developed the PyEGA3 connector within Galaxy to easily download multiple datasets from the EGA. We added the gene.iobio tool, a diagnostic environment for precision genomics, to Galaxy and demonstrate that it provides a more dynamic and interpretable view of trio analysis results. We developed a Galaxy trio analysis workflow to determine the pathogenic variants from the synB1MG trios using the GEMINI and gene.iobio tools. The complete workflow is available at WorkflowHub, and an associated tutorial was created in the Galaxy Training Network, which helps researchers unfamiliar with Galaxy to run the workflow.
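To make the idea of trio analysis concrete, the sketch below flags candidate variants in a child according to two common inheritance patterns. It is a simplified illustration and not the Galaxy workflow or the GEMINI implementation: genotypes are assumed to be encoded as alternate-allele counts (0, 1, or 2), and the variant identifiers, field names, and pattern labels are hypothetical.

```python
def classify_variant(child, mother, father):
    """Classify a site from alt-allele counts (0 hom-ref, 1 het, 2 hom-alt).

    Returns a pattern label, or None when neither simple pattern applies.
    """
    if child == 1 and mother == 0 and father == 0:
        return "de_novo"  # allele absent from both parents
    if child == 2 and mother == 1 and father == 1:
        return "autosomal_recessive"  # one copy inherited from each carrier parent
    return None


# Hypothetical trio genotypes; identifiers are placeholders, not real variants.
variants = [
    {"id": "var1", "child": 1, "mother": 0, "father": 0},
    {"id": "var2", "child": 2, "mother": 1, "father": 1},
    {"id": "var3", "child": 1, "mother": 1, "father": 0},
]

candidates = {}
for v in variants:
    pattern = classify_variant(v["child"], v["mother"], v["father"])
    if pattern is not None:
        candidates[v["id"]] = pattern

print(candidates)
```

A real workflow would additionally filter on genotype quality, read depth, population frequency, and predicted impact before calling a variant pathogenic.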
Conclusions: We showed the feasibility of reusing data from the EGA in Galaxy via PyEGA3 and validated the workflow by rediscovering spiked-in variants in synthetic data. Finally, we improved existing tools in Galaxy and created a workflow for trio analysis to demonstrate the value of FAIR genomics analysis in Galaxy.
Marine vertebrate biodiversity is fundamental to ocean ecosystem health but is threatened by climate change, overharvesting, and habitat degradation. High-quality reference genomes are valuable foundational scientific resources that can inform conservation efforts. Consequently, global consortia are striving to produce reference genomes for representatives of all life. Here, we summarize the current landscape of available marine vertebrate reference genomes, including their phylogenetic diversity and geographic hotspots of production. We discuss key logistical and technical challenges that remain to be overcome if we are to realize the vision of a comprehensive reference genome library of all marine vertebrates.
Background: With advancements in sequencing and mass spectrometry technologies, multiomics data can now be easily acquired for understanding complex biological systems. Nevertheless, substantial challenges remain in determining the association between gene-metabolite pairs due to the nonlinear and multifactorial interactions within cellular networks. The complexity arises from the interplay of multiple genes and metabolites, often involving feedback loops and time-dependent regulatory mechanisms that are not easily captured by traditional analysis methods.
Findings: Here, we introduce Compounds And Transcripts Bridge (abbreviated as CAT Bridge, available at https://catbridge.work), a free, user-friendly platform for longitudinal multiomics analysis that efficiently identifies transcripts associated with metabolites using time-series omics data. To evaluate the association of gene-metabolite pairs, CAT Bridge benchmarks a set of statistical methods spanning causality estimation and correlation coefficient calculation for multiomics analysis. Additionally, CAT Bridge features an artificial intelligence agent to assist users in interpreting the association results.
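As an illustration of the correlation end of the method spectrum mentioned above, the following sketch ranks transcripts by their best time-lagged Pearson correlation with a metabolite series. It is a deliberately simple stand-in and is not CAT Bridge code; in particular, it does not implement convergent cross-mapping or any causality estimator. All function names, the `max_lag` parameter, and the toy time series are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def best_lagged_correlation(transcript, metabolite, max_lag=2):
    """Correlate transcript[t] with metabolite[t + lag]; return (r, lag) with largest |r|."""
    best = (0.0, 0)
    for lag in range(max_lag + 1):
        x = transcript[:len(transcript) - lag]
        y = metabolite[lag:]
        if len(x) < 3:  # too few time points to correlate
            break
        r = pearson(x, y)
        if abs(r) > abs(best[0]):
            best = (r, lag)
    return best


def rank_transcripts(transcripts, metabolite, max_lag=2):
    """Sort transcripts by the strength of their best lagged correlation."""
    scored = {name: best_lagged_correlation(series, metabolite, max_lag)
              for name, series in transcripts.items()}
    return sorted(scored.items(), key=lambda item: abs(item[1][0]), reverse=True)


# Toy data: geneA at time t equals the metabolite level at time t + 1.
metabolite = [0, 1, 3, 2, 5, 4]
transcripts = {
    "geneA": [1, 3, 2, 5, 4, 6],
    "geneB": [5, 5, 4, 6, 5, 5],
}
ranked = rank_transcripts(transcripts, metabolite)
for name, (r, lag) in ranked:
    print(name, round(r, 3), lag)
```

Allowing a lag reflects the expectation that a change in transcript abundance may precede the change in the metabolite it influences, which a same-time-point correlation would understate.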
Conclusions: We applied CAT Bridge to experimentally obtained Capsicum chinense (chili pepper) and public human and Escherichia coli time-series transcriptome and metabolome datasets. CAT Bridge successfully identified genes involved in the biosynthesis of capsaicin in C. chinense. Furthermore, case study results showed that the convergent cross-mapping method outperforms traditional approaches in longitudinal multiomics analyses. CAT Bridge simplifies access to various established methods for longitudinal multiomics analysis and enables researchers to swiftly identify associated gene-metabolite pairs for further validation.
The rapidly growing collection of public single-cell sequencing data has become a valuable resource for molecular, cellular, and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our benchmarking experiments, Vulture is 66% to 88% faster than local tools (PathogenTrack and Venus) and 41% faster than the state-of-the-art cloud-based tool Cumulus, while achieving comparable microbial read identification. In terms of the cost on cloud computing systems, Vulture also shows a cost reduction of 83% ($12 vs. $70). We applied Vulture to 2 coronavirus disease 2019, 3 hepatocellular carcinoma (HCC), and 2 gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell type-specific enrichment of severe acute respiratory syndrome coronavirus 2, hepatitis B virus (HBV), and Helicobacter pylori-positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.
Background: Understanding the genotype of pest species provides an important baseline for designing integrated pest management (IPM) strategies. Recently developed long-read sequencing technologies make it possible to compare genomic features of nonmodel pest species to disclose the evolutionary path underlying the pest species profiles. Here we sequenced and assembled genomes for 3 agricultural pest gelechiid moths: Phthorimaea absoluta (tomato leafminer), Keiferia lycopersicella (tomato pinworm), and Scrobipalpa atriplicella (goosefoot groundling moth). We also compared the genomes of tomato leafminer and tomato pinworm with published genomes of Phthorimaea operculella and Pectinophora gossypiella to investigate the gene family evolution related to the pest species profiles.
Results: We found that the 3 solanaceous feeding species, P. absoluta, K. lycopersicella, and P. operculella, are clustered together. Gene family evolution analyses with the 4 species show clear gene family expansions in host plant-associated genes for the 3 solanaceous feeding species. These genes are involved in host compound sensing (e.g., gustatory receptors), detoxification (e.g., ABC transporter C family, cytochrome P450, glucose-methanol-choline oxidoreductase, insect cuticle proteins, and UDP-glucuronosyl), and digestion (e.g., serine proteases and peptidase family S1). A gene ontology enrichment analysis of rapidly evolving genes also suggests enriched functions in host sensing and immunity.
Conclusions: The results of our gene family evolution analyses indicate that host plant adaptation and pathogen defense could be important drivers of species diversification among gelechiid moths.