Inferring which biological pathways and gene sets change, and how, is a key question in many studies that use single-cell RNA sequencing. Typically, these questions are addressed by quantifying the enrichment of known gene sets in lists of genes derived from global analysis. Here we present SiPSiC, a new method to infer pathway activity in every single cell. This allows more sensitive differential analysis and the use of pathway scores to cluster cells and compute UMAP or similar projections. We apply our method to COVID-19, lung adenocarcinoma, and glioma data sets and demonstrate its utility. In many cases, SiPSiC results are consistent with findings reported in previous studies, but SiPSiC also reveals the differential activity of novel pathways, enabling us to suggest new mechanisms underlying the pathophysiology of these diseases and demonstrating SiPSiC's high accuracy and sensitivity in detecting biological function and traits. In addition, we demonstrate that SiPSiC can be used to better classify cells based on the activity of biological pathways instead of single genes, and that it can overcome patient-specific artifacts.
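To make the idea of a per-cell pathway score concrete, the following is a minimal sketch. It is not the SiPSiC algorithm, whose normalization and weighting are not described in this abstract; it assumes a simpler stand-in in which each pathway gene is z-scored across cells and the z-scores are averaged within each cell. The function name `pathway_scores` and its arguments are hypothetical.

```python
import numpy as np

def pathway_scores(expr, gene_names, pathway_genes):
    """Return one score per cell for a single pathway.

    expr          : genes x cells matrix of normalized expression
    gene_names    : row labels of expr
    pathway_genes : genes belonging to the pathway

    Assumption: the score is the mean z-score of the pathway genes.
    This is an illustrative stand-in, not the published method.
    """
    expr = np.asarray(expr, dtype=float)
    index = {g: i for i, g in enumerate(gene_names)}
    rows = [index[g] for g in pathway_genes if g in index]
    if not rows:
        raise ValueError("no pathway genes found in the expression matrix")
    sub = expr[rows, :]
    mean = sub.mean(axis=1, keepdims=True)
    std = sub.std(axis=1, keepdims=True)
    std[std == 0] = 1.0          # constant genes contribute a z-score of 0
    z = (sub - mean) / std       # standardize each gene across cells
    return z.mean(axis=0)        # average over pathway genes, per cell
```

The resulting vector has one entry per cell, so it can be stacked across pathways into a pathways-by-cells matrix and passed to clustering or a UMAP projection in place of gene-level expression.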
Multiomics requires the concerted recording of independent information, ideally from a single experiment. In this study, we introduce RIMS-seq2, a high-throughput technique that simultaneously sequences genomes and overlays methylation information. It requires only a small modification of the experimental protocol for high-throughput DNA sequencing: the addition of a controlled deamination step. Importantly, the rate of deamination of 5-methylcytosine is negligible and thus does not interfere with standard DNA sequencing and data processing. As a result, RIMS-seq2 libraries from whole- or targeted-genome sequencing show the same germline variation calling accuracy and sensitivity as standard DNA-seq. Additionally, regional methylation levels provide an accurate map of the human methylome.
Species-specific genes, also known as orphans, are ubiquitous across life's domains. In prokaryotes, species-specific orphan genes (SSOGs) are mostly thought to originate in external elements such as viruses and to be acquired through horizontal gene transfer, whereas the scenario of native origination, through rapid divergence or de novo, is mostly dismissed. However, quantitative evidence supporting either scenario is lacking. Here, we systematically analyzed genomes from 4644 human gut microbiome species and identified more than 600,000 unique SSOGs, representing an average of 2.6% of a given species' pangenome. These sequences are mostly rare within each species yet show signs of purifying selection. Overall, SSOGs use optimal codons less frequently, and their proteins are more disordered than those of conserved genes (i.e., non-SSOGs). Importantly, across species, the GC content of SSOGs closely matches that of conserved genes. In contrast, the ∼5% of SSOGs that share similarity to known viral sequences have distinct characteristics, including lower GC content. Thus, SSOGs with similarity to viruses differ from the remaining SSOGs, which argues against an external origin for most SSOGs. By examining the orthologous genomic region in closely related species, we show that a small subset of SSOGs likely evolved natively de novo and find that these genes also differ in their properties from the remaining SSOGs. Our results challenge the notion that external elements are the dominant source of prokaryotic genetic novelty and will enable future studies into the biological role and relevance of species-specific genes in the human gut.
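The GC content comparison between SSOGs and conserved genes can be illustrated with a short sketch. It assumes that the gene sequences of one species are available as plain strings and that unweighted per-gene GC fractions are averaged; the function names are hypothetical, and the study's actual pipeline may differ.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return float("nan")      # no unambiguous bases to count
    return (seq.count("G") + seq.count("C")) / acgt

def mean_gc(seqs):
    """Unweighted mean GC content over a collection of gene sequences."""
    values = [gc_content(s) for s in seqs]
    values = [v for v in values if v == v]   # drop NaN entries
    return sum(values) / len(values) if values else float("nan")

def gc_difference(ssog_seqs, conserved_seqs):
    """Mean GC of SSOGs minus mean GC of conserved genes in one species."""
    return mean_gc(ssog_seqs) - mean_gc(conserved_seqs)
```

Under this sketch, a difference near zero within a species corresponds to the close match described above, and a clearly negative value corresponds to the lower GC content reported for SSOGs with viral similarity.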
Transposable elements (TEs) and other repetitive regions have been shown to contain gene regulatory elements, including transcription factor binding sites. However, regulatory elements harbored by repeats have proven difficult to characterize using short-read sequencing assays such as ChIP-seq or ATAC-seq. Most regulatory genomics analysis pipelines discard "multimapped" reads that align equally well to multiple genomic locations. Because multimapped reads arise predominantly from repeats, current analysis pipelines fail to detect a substantial portion of regulatory events that occur in repetitive regions. To address this shortcoming, we developed Allo, a new approach to allocate multimapped reads in an efficient, accurate, and user-friendly manner. Allo combines probabilistic mapping of multimapped reads with a convolutional neural network that recognizes the read distribution features of potential peaks, offering enhanced accuracy in multimapping read assignment. Allo also provides read-level output in the form of a corrected alignment file, making it compatible with existing regulatory genomics analysis pipelines and downstream peak-finders. In a demonstration application on CTCF ChIP-seq data, we show that Allo results in the discovery of thousands of new CTCF peaks. Many of these peaks contain the expected cognate motif and/or serve as TAD boundaries. We additionally apply Allo to a diverse collection of ENCODE ChIP-seq data sets, resulting in multiple previously unidentified interactions between transcription factors and repetitive element families. Finally, we show that Allo may be particularly beneficial in identifying ChIP-seq peaks at centromeres, near segmentally duplicated genes, and in younger TEs, enabling new regulatory analyses in these regions.
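The probabilistic allocation of a multimapped read can be sketched as follows. This is not Allo's implementation: it omits the convolutional neural network entirely and assumes, for illustration only, that each candidate location is weighted by the number of uniquely mapped reads in its neighborhood. The function names and the `pseudocount` parameter are hypothetical.

```python
import random

def allocation_probabilities(unique_counts, pseudocount=1.0):
    """Turn nearby unique-read counts into assignment probabilities.

    Assumption: a candidate location with more uniquely mapped reads
    nearby is a more likely origin for the multimapped read.
    """
    weights = [count + pseudocount for count in unique_counts]
    total = sum(weights)
    if total == 0:
        # no evidence anywhere: fall back to a uniform choice
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]

def allocate_read(candidates, unique_counts, pseudocount=1.0, rng=None):
    """Assign one multimapped read to a single candidate location."""
    rng = rng or random.Random()
    probs = allocation_probabilities(unique_counts, pseudocount)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

Because each read ends up at exactly one location, the output can be written back as an ordinary alignment record, which is what allows a corrected alignment file to feed existing peak-finders unchanged.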
DEAD box (DDX) RNA helicases are a large family of ATPases, many of which have unknown functions. There is emerging evidence that besides their role in RNA biology, DDX proteins may stimulate protein kinases. To investigate if protein kinase-DDX interaction is a more widespread phenomenon, we conducted three orthogonal large-scale screens, including proteomics analysis with 32 RNA helicases, protein array profiling, and kinome-wide in vitro kinase assays. We retrieved Ser/Thr protein kinases as prominent interactors of RNA helicases and report hundreds of binary interactions. We identified members of ten protein kinase families, which bind to, and are stimulated by, DDX proteins, including CDK, CK1, CK2, DYRK, MARK, NEK, PRKC, SRPK, STE7/MAP2K, and STE20/PAK family members. We identified MARK1 in all screens and validated that DDX proteins accelerate the MARK1 catalytic rate. These findings indicate pervasive interactions between protein kinases and DEAD box RNA helicases, and provide a rich resource to explore their regulatory relationships.
The zoonotic parasite Cryptosporidium parvum is a global cause of gastrointestinal disease in humans and ruminants. Sequence analysis of the highly polymorphic gp60 gene enabled the classification of C. parvum isolates into multiple groups (e.g., IIa, IIc, Id) and a large number of subtypes. In Europe, subtype IIaA15G2R1 is largely predominant and has been associated with many water- and food-borne outbreaks. In this study, we generated new whole-genome sequence (WGS) data from 123 human- and ruminant-derived isolates collected in 13 European countries and included other available WGS data from Europe, Egypt, China, and the United States (n = 72) in the largest comparative genomics study to date. We applied rigorous filters to exclude mixed infections and analyzed a data set of 141 isolates from the zoonotic groups IIa (n = 119) and IId (n = 22). Based on 28,047 high-quality, biallelic genomic SNPs, we identified three distinct and strongly supported populations: isolates from China (IId) and Egypt (IIa and IId) formed population 1; a minority of European isolates (IIa and IId) formed population 2; and the majority of European isolates (IIa, including all IIaA15G2R1 isolates) and all isolates from the United States (IIa) clustered in population 3. Based on analyses of the population structure, population genetics, and recombination, we show that population 3 has recently emerged and expanded throughout Europe and then, possibly from the United Kingdom, reached the United States, where it also expanded. The reasons for the successful spread of population 3 remain elusive, although genes under selective pressure uniquely in this population were identified.