Summary: Structure determination is a key step in the functional characterization of many non-coding RNA molecules. High-resolution RNA 3D structure determination efforts, however, are not keeping up with the pace of discovery of new non-coding RNA sequences. This increases the importance of computational approaches and of low-resolution experimental data, such as those from small-angle X-ray scattering (SAXS) experiments. We present RNA Masonry, a computer program and web service for fully automated modeling of RNA 3D structures. It assembles RNA fragments into geometrically plausible models that meet user-provided secondary structure constraints, restraints on tertiary contacts, and SAXS data. We illustrate the method with detailed benchmarks and its application to structural studies of viral RNAs with SAXS restraints.
Availability and implementation: The program web server is available at http://iimcb.genesilico.pl/rnamasonry. The source code is available at https://gitlab.com/gchojnowski/rnamasonry.
Motivation: Mediation analysis is performed to evaluate the effects of a hypothetical causal mechanism that marks the progression from an exposure, through mediators, to an outcome. In the age of high-throughput technologies, it has become routine to assess numerous potential mechanisms at the genome or proteome scale. With this comes the need to address multiple testing. In a sparse scenario where only a few genes or proteins are causally involved, conventional methods for assessing mediation effects lose statistical power because the composite null distribution behind this experiment cannot be attained. This power loss in turn reduces the number of true mechanisms identified after multiple testing correction. To fairly delineate a uniform distribution under the composite null, Huang (Genome-wide analyses of sparse mediation effects under composite null hypotheses. Ann Appl Stat 2019a;13:60-84; AoAS) proposed the composite test to provide adjusted P-values for single-mediator analyses.
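The power loss under the composite null can be illustrated with a small simulation. The sketch below assumes one conventional method, the joint significance (max-P) test, which the abstract does not name, and assumes that the exposure-mediator and mediator-outcome tests are independent with uniform P-values when both effects are absent. Under these assumptions the combined P-value follows P(max <= t) = t^2 rather than a uniform distribution, so the test rejects far less often than its nominal level.

```python
import numpy as np

# Assumption: both paths are null (no exposure->mediator effect and no
# mediator->outcome effect), the dominant case in a sparse scenario.
rng = np.random.default_rng(1)
n_tests = 200_000

# Independent uniform P-values for the two path tests under the null.
p_alpha = rng.uniform(size=n_tests)  # exposure -> mediator
p_beta = rng.uniform(size=n_tests)   # mediator -> outcome

# Joint significance (max-P) test: reject only if both paths are significant.
p_max = np.maximum(p_alpha, p_beta)

level = 0.05
rejection_rate = float(np.mean(p_max <= level))

print(f"nominal level:           {level}")
print(f"empirical rejection:     {rejection_rate:.4f}")
print(f"theoretical (level**2):  {level**2:.4f}")
```

The empirical rejection rate sits near 0.0025 instead of 0.05, which is the conservativeness that an adjusted P-value under the composite null is meant to remove.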
Results: Our contribution is to extend the method to multimediator analyses, which are commonly encountered in genomic studies and can accommodate various biological interests. Using the generalized Berk-Jones statistics with the composite test, we propose a multivariate approach that favors dense and diverse mediation effects, a decorrelation approach that favors sparse and consistent effects, and a hybrid approach that captures the advantages of both. Our analysis suite has been implemented as an R package, MACtest. Its utility is demonstrated by analyzing the lung adenocarcinoma datasets from The Cancer Genome Atlas and Clinical Proteomic Tumor Analysis Consortium. We further investigate the genes and networks whose expression may be regulated by smoking-induced epigenetic aberrations.
Availability and implementation: The R package MACtest is available at https://github.com/roqe/MACtest.
Motivation: Single-cell DNA methylation sequencing can assay DNA methylation at single-cell resolution. However, incomplete coverage compromises related downstream analyses, outlining the importance of imputation techniques. With a rising number of cell samples in recent large datasets, scalable and efficient imputation models are critical to addressing the sparsity for genome-wide analyses.
Results: We propose a novel graph-based deep learning approach that imputes methylation matrices from locus-aware neighboring subgraphs, with locus-aware encoding oriented toward a single cell type. Using only the CpG methylation matrix, the resulting model, GraphCpG, outperforms previous methods on datasets containing hundreds of cells or more and achieves competitive performance on smaller datasets; the subgraphs of predicted sites can be visualized as retrievable bipartite graphs. In addition to imputation performance that improves with increasing cell number, it significantly reduces computation time and improves downstream analysis.
Availability and implementation: The source code is freely available at https://github.com/yuzhong-deng/graphcpg.git.
Motivation: Genome-wide association studies (GWAS) present several computational and statistical challenges for their data analysis, including knowledge discovery, interpretability, and translation to clinical practice.
Results: We develop, apply, and comparatively evaluate an automated machine learning (AutoML) approach, customized for genomic data, that delivers reliable predictive and diagnostic models, the set of genetic variants that are important for predictions (called a biosignature), and an estimate of the out-of-sample predictive power. This AutoML approach discovers variants with higher predictive performance compared to standard GWAS methods, computes an individual risk prediction score, generalizes to new, unseen data, is shown to better differentiate causal variants from other highly correlated variants, and enhances knowledge discovery and interpretability by reporting multiple equivalent biosignatures.
Availability and implementation: Code for this study is available at: https://github.com/mensxmachina/autoML-GWAS. JADBio offers a free version at: https://jadbio.com/sign-up/. SNP data can be downloaded from the EGA repository (https://ega-archive.org/). PRS data are found at: https://www.aicrowd.com/challenges/opensnp-height-prediction. Simulation data to study population structure can be found at: https://easygwas.ethz.ch/data/public/dataset/view/1/.
Motivation: Federated Learning (FL) is gaining traction in various fields as it enables integrative data analysis without sharing sensitive data, such as in healthcare. However, the risk of data leakage caused by malicious attacks must be considered. In this study, we introduce a novel attack algorithm that relies on being able to compute sample means, sample covariances, and construct known linearly independent vectors on the data owner side.
Results: We show that these basic functionalities, which are available in several established FL frameworks, are sufficient to reconstruct privacy-protected data. Additionally, the attack algorithm is robust to defense strategies that involve adding random noise. We demonstrate the limitations of existing frameworks and propose potential defense strategies, analyzing the implications of using differential privacy. The novel insights presented in this study will aid in the improvement of FL frameworks.
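Why means, covariances, and known linearly independent vectors suffice can be seen from a small linear-algebra sketch. This is an illustration of the general idea only, not the authors' algorithm: the function names, the least-squares solve, and the toy data are assumptions, and noise-based defenses are ignored. Each sample covariance between a private vector x and a known vector v is a linear equation in x, and the sample mean supplies the one direction that centering removes.

```python
import numpy as np

def sample_cov(a, b):
    """Unbiased sample covariance between two equal-length vectors."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

def reconstruct(private_mean, covs, known_vectors):
    """Recover a private vector from its mean and its covariances with
    known, linearly independent vectors (rows of `known_vectors`).

    Uses (n - 1) * cov(x, v) = (v - mean(v)) . x, plus mean(x) = 1.x / n.
    """
    n = known_vectors.shape[1]
    centered = known_vectors - known_vectors.mean(axis=1, keepdims=True)
    # Centering drops the rank by one; the mean equation restores it.
    system = np.vstack([centered, np.full((1, n), 1.0 / n)])
    rhs = np.concatenate([(n - 1) * covs, [private_mean]])
    solution, *_ = np.linalg.lstsq(system, rhs, rcond=None)
    return solution

# Toy data standing in for a privacy-protected column (hypothetical values).
rng = np.random.default_rng(0)
x_private = rng.normal(loc=50.0, scale=10.0, size=8)

# Known vectors the attacker is assumed able to construct on the owner side.
V = rng.normal(size=(8, 8))

# Only aggregate statistics leave the data owner.
released_mean = x_private.mean()
released_covs = np.array([sample_cov(x_private, v) for v in V])

x_recovered = reconstruct(released_mean, released_covs, V)
print(np.allclose(x_recovered, x_private))
```

In this noise-free setting the recovered vector matches the private one up to numerical precision, which is why restricting such aggregate queries or adding calibrated noise becomes relevant.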
Availability and implementation: The code examples are provided at GitHub (https://github.com/manuhuth/Data-Leakage-From-Covariances.git). The CNSIM1 dataset, which we used in the manuscript, is available within the DSData R package (https://github.com/datashield/DSData/tree/main/data).
Motivation: Understanding RNA folding at the level of secondary structures can give important insights concerning the function of a molecule. We are interested in learning how secondary structures change dynamically during transcription, as well as whether particular secondary structures form during transcription or only afterwards. While different approaches exist to simulate cotranscriptional folding, the current strategies for visualization are lagging behind. New, more suitable approaches are necessary to help explore the data generated by cotranscriptional folding simulations.
Results: We present DrForna, an interactive visualization app for viewing the time course of a cotranscriptional RNA folding simulation. Specifically, users can scroll along the time axis and see the population of structures that are present at any particular time point.
Availability and implementation: DrForna is a JavaScript project available on GitHub at https://github.com/ViennaRNA/drforna and deployed at https://viennarna.github.io/drforna.
Motivation: The application of machine learning approaches in phylogenetics has been impeded by the vast model space associated with inference. Supervised machine learning approaches require data from across this space to train models. Because of this, previous approaches have typically been limited to inferring relationships among unrooted quartets of taxa, where there are only three possible topologies. Here, we explore the potential of generative adversarial networks (GANs) to address this limitation. GANs consist of a generator and a discriminator: at each step, the generator aims to create data that is similar to real data, while the discriminator attempts to distinguish generated and real data. By using an evolutionary model as the generator, we use GANs to make evolutionary inferences. Since a new model can be considered at each iteration, heuristic searches of complex model spaces are possible. Thus, GANs offer a potential solution to the challenges of applying machine learning in phylogenetics.
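The size of the model space mentioned above can be made concrete with the standard count of unrooted bifurcating tree topologies, (2n - 5)!! for n taxa. The short sketch below (the function name is illustrative) reproduces the three topologies of a quartet and shows how quickly the count grows.

```python
def unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted bifurcating tree topologies for n_taxa labeled
    taxa, given by the double factorial (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 6, 15):
    print(n, unrooted_topologies(n))
# 4 3
# 6 105
# 15 7905853580625
```

Training data that cover every topology are feasible for quartets but not for trees of even moderate size, which motivates a search that proposes one candidate model per iteration.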
Results: We developed phyloGAN, a GAN that infers phylogenetic relationships among species. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and infers a phylogenetic tree either considering or ignoring gene tree heterogeneity. We explored the performance of phyloGAN for up to 15 taxa in the concatenation case and 6 taxa when considering gene tree heterogeneity. Error rates are relatively low in these simple cases. However, run times are long, and performance metrics suggest issues during training. Future work should explore novel architectures that may result in more stable and efficient GANs for phylogenetics.
Availability and implementation: phyloGAN is available on GitHub: https://github.com/meganlsmith/phyloGAN/.