Deep learning models can accurately reconstruct genome-wide epigenetic tracks from the reference genome sequence alone. But it is unclear what predictive power they have on sequence diverging from the reference, such as disease- and trait-associated variants or engineered sequences. Recent work has applied synthetic regulatory genomics to characterize dozens of deletions, inversions, and rearrangements of DNase I hypersensitive sites (DHSs). Here, we use the state-of-the-art model Enformer to predict DNA accessibility and RNA transcription across these engineered sequences when delivered at their endogenous loci. At a high level, we observe a good correlation between accessibility predicted by Enformer and experimental data. But model performance is best for sequences that more closely resemble the reference, such as single deletions or combinations of multiple DHSs. Predictive power is poorer for rearrangements affecting DHS order or orientation. We use these data to fine-tune Enformer, yielding a significant reduction in prediction error. We show that this fine-tuning retains strong predictive performance for other tracks. Our results show that current deep learning models perform poorly when presented with novel sequences diverging in certain critical features from their training set. Thus, an iterative approach incorporating profiling of synthetic constructs can improve model generalizability and ultimately enable functional classification of regulatory variants identified by population studies.
Transcription is regulated at multiple levels, including initiation, elongation, and termination. Whereas much research has focused on the initiation of transcription, regulation of elongation plays an important role not only in transcription dynamics but also in cotranscriptional RNA processing and genome stability. Despite advances in high-throughput approaches for global quantification of RNA polymerase II (RNAPII) speed, RNAPII elongation rate studies have been limited to a relatively small number of long genes or to velocity estimates inferred indirectly from RNAPII occupancy data. Here, we present DRB/TTchem-seq2, a modified version of the DRB/TTchem-seq method, to directly measure gene-specific elongation rates of more than 3000 genes. By combining short time point sampling after synchronized RNAPII release into the gene body and a new computational framework to track the distance traveled by RNAPII, we greatly increase the number of genes for which it is possible to obtain elongation rates. Our direct RNAPII elongation rate quantification reveals that elongation rates vary not only among genes but also within genes. Additionally, we describe how specific histone modifications and elongation factor occupancy correlate with subclasses of genes based on their elongation rates. Together, we present a robust and powerful method for RNAPII transcription elongation rate measurement.
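As a rough illustration of the idea of tracking the distance traveled by RNAPII over short time points, the sketch below estimates a per-gene elongation rate as the least-squares slope of wavefront position against time. It is only a minimal sketch under stated assumptions: the function name, the input layout, and the example numbers are hypothetical and are not taken from DRB/TTchem-seq2 or its computational framework.

```python
# Hypothetical sketch: elongation rate as the slope of RNAPII wavefront
# position (kb from the release point) versus time since release (min).
# This is NOT the DRB/TTchem-seq2 framework, only the underlying arithmetic.

def elongation_rate_kb_per_min(times_min, front_kb):
    """Return the least-squares slope of front position against time."""
    n = len(times_min)
    if n < 2 or n != len(front_kb):
        raise ValueError("need at least two matched (time, position) points")
    mean_t = sum(times_min) / n
    mean_d = sum(front_kb) / n
    # Sum of squared deviations in time; zero means all time points coincide.
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(times_min, front_kb))
    return sxy / sxx


# Made-up illustrative values for a single gene (not experimental data).
times = [5, 10, 15, 20]            # minutes after synchronized release
fronts = [10.0, 20.0, 30.0, 40.0]  # wavefront position in kb
rate = elongation_rate_kb_per_min(times, fronts)
print(f"estimated rate: {rate:.2f} kb/min")
```

Fitting a slope over several time points, rather than dividing one distance by one time, makes the estimate less sensitive to an offset in when polymerases actually leave the promoter-proximal region. Fitting separate slopes over different time windows would be one simple way to look at rate variation within a gene.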
Integration of single-cell and spatial transcriptomes represents a fundamental strategy to enhance spatial data quality. However, existing methods for mapping single-cell data to spatial coordinates struggle with large-scale data sets comprising millions of cells. Here, we introduce Polyomino, an intelligent region-allocation method inspired by the region-of-interest (ROI) concept from image processing. By using gradient descent, Polyomino allocates cells to structured spatial regions that match the most significant biological information, optimizing the integration of data and improving speed and accuracy. Polyomino excels in integrating data even in the presence of various sequencing artifacts, such as cell segmentation errors and imbalanced cell-type representations. Polyomino outperforms state-of-the-art methods by 10 to 1000 times in speed, and it is the only approach capable of integrating data sets containing millions of cells in a single run. As a result, Polyomino uncovers originally hidden gene expression patterns in brain sections and offers new insights into organogenesis and tumor microenvironments, all with exceptional efficiency and accuracy.
A hallmark of heart disease is gene dysregulation and reactivation of fetal gene programs. Reactivation of these fetal programs has compensatory effects during heart failure, depending on the type and stage of the underlying cardiomyopathy. Thousands of putative cardiac gene regulatory elements have been identified that may control these programs, but their functions are largely unknown. Here, we profile genome-wide changes to gene expression and chromatin structure in cardiomyocytes derived from human pluripotent stem cells. We identify and characterize a gene regulatory element essential for regulating expression of MYH6, which encodes human fetal myosin. Using chromatin conformation assays in combination with epigenome editing, we find that gene regulation is mediated by a direct interaction between MYH6 and the enhancer. We also find that enhancer activation alters cardiomyocyte response to the hypertrophy-inducing peptide endothelin-1. Enhancer activation prevents polyploidization as well as changes in calcium dynamics and metabolism following stress with endothelin-1. Collectively, these results identify regulatory mechanisms of cardiac gene programs that modulate cardiomyocyte maturation, affect cellular stress response, and could serve as potential therapeutic targets.
In epigenetic analysis, the identification of differentially methylated regions (DMRs) typically involves the detection of groups of consecutive CpGs that show significant changes in their average methylation levels. However, the methylation state of a genomic region can also be characterized by a mixture of patterns (epialleles) with variable frequencies, and the relative proportions of such patterns can provide insights into its mechanisms of formation. Because of their short read size (150 bp), traditional methods based on bisulfite conversion and high-throughput sequencing, such as Illumina, allow epiallele frequency analysis only in high CpG density regions, limiting differential methylation studies to just 50% of the human methylome. Nanopore sequencing, with its long reads, enables the analysis of epiallele frequency across both high and low CpG density regions. Here, we introduce a novel computational approach, PoreMeth2, an R library that integrates epiallelic diversity and methylation frequency changes from nanopore data to identify DMRs, provides insights into their possible mechanisms of formation, and annotates them to genic and regulatory elements. We apply PoreMeth2 to cancer and glial cell data sets, providing evidence of its advance over other state-of-the-art methods and demonstrating its ability to distinguish epigenomic alterations with a strong impact on gene expression from those with weaker effects on transcriptional activity.
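To make the distinction between average methylation and epiallele composition concrete, the sketch below computes both for two toy regions. It is a minimal sketch under stated assumptions: it is written in Python rather than R, it does not use or reproduce the PoreMeth2 API, the read patterns are invented, and Shannon entropy is used here only as one common diversity summary, not necessarily the measure PoreMeth2 implements.

```python
# Hypothetical sketch: epiallele frequencies versus average methylation.
# Each read is a string over the CpGs of one region: "1" = methylated,
# "0" = unmethylated. Entropy is an assumed, illustrative diversity measure.
from collections import Counter
import math


def epiallele_frequencies(reads):
    """Return the relative frequency of each distinct methylation pattern."""
    if not reads:
        raise ValueError("at least one read is required")
    counts = Counter(reads)
    total = sum(counts.values())
    return {pattern: c / total for pattern, c in counts.items()}


def shannon_entropy(freqs):
    """Shannon entropy (bits) of an epiallele frequency distribution."""
    return -sum(f * math.log2(f) for f in freqs.values() if f > 0)


def mean_methylation(reads):
    """Fraction of methylated CpG calls across all reads."""
    return sum(r.count("1") for r in reads) / sum(len(r) for r in reads)


# Two invented regions with the same average methylation (0.5) but
# different epiallele mixtures.
region_a = ["1111", "1111", "0000", "0000"]  # two epialleles, 50/50
region_b = ["1100", "1100", "1100", "1100"]  # a single epiallele

for name, reads in (("A", region_a), ("B", region_b)):
    freqs = epiallele_frequencies(reads)
    print(name, mean_methylation(reads), shannon_entropy(freqs), freqs)
```

Both toy regions have the same average methylation, so a test on averages alone would not separate them, while the epiallele frequencies (and the entropy summary) differ.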
The emergence of nanopore sequencing technology has the potential to transform metagenomics by offering low-cost, portable, and long-read sequencing capabilities. Furthermore, these platforms enable real-time data generation, which could significantly reduce the time from sample collection to result, a crucial factor for point-of-care diagnostics and biosurveillance. However, the full potential of real-time metagenomics remains largely unfulfilled due to a lack of accessible, open-source bioinformatic tools. We present Metagenomic Analysis in Real-Time (MARTi), an open-source software tool designed for the real-time analysis, visualization, and exploration of metagenomic data. MARTi supports various classification methods, including BLAST, Centrifuge, and Kraken2, letting users customize parameters and utilize their own databases for taxonomic classification and antimicrobial resistance (AMR) analysis. With a user-friendly, browser-based graphical interface, MARTi provides dynamic, real-time updates on community composition and AMR gene identification. MARTi's architecture and operational flexibility make it suitable for diverse research applications, ranging from in-field analysis to large-scale metagenomic studies. Using both simulated and real-world data, we demonstrate MARTi's performance in read classification, taxon detection, and relative abundance estimation. By bridging the gap between sequencing and actionable insights, MARTi marks a significant advance in the accessibility and functionality of real-time metagenomic analysis.

