RNA sequence analysis of emerging SARS-CoV-2 infection is valuable for tracking viral evolution and developing novel diagnostic tools. Furthermore, SARS-CoV-2 sequence analysis can provide insight into potential antigenic drift events that lead to strain speciation and changing clinical outcomes. In this work, we aim to develop a pipeline using next-generation sequencing (NGS) technology in addition to machine learning/bioinformatics to track the accumulation of mutations and viral evolution.
{"title":"Machine Learning Model to Track SARS-CoV-2 Viral Mutation Evolution and Speciation Using Next-generation Sequencing Data","authors":"I. Derecichei, G. Atikukke","doi":"10.1145/3388440.3415991","DOIUrl":"https://doi.org/10.1145/3388440.3415991","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131701527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computational drug design has the potential to save time and money by providing a better starting point for new drugs with a complete computational evaluation. We propose a peptide design system for protein targets based on a Generative Adversarial Network (GAN) called GANDALF (Generative Adversarial Network Drug-tArget Ligand Fructifier). GAN-based methods have been developed for computational drug design, but these can only generate small molecules, not peptides. Peptides are very complex macromolecules, which makes them much more difficult to generate than small molecules. Our GANDALF methodology uses two networks to generate a new peptide sequence and structure. It also incorporates data, such as active atoms, that are not used in other methods. Active atoms are important because they interact via electron sharing when a target protein and a peptide bind to each other. We identify the active atoms using our electron structure calculation (eCADD) program and the rules of interaction we have developed. Our method goes further than comparable methods by generating a full peptide structure as well as predicting binding affinity. The results were validated using a multi-step process that compares them with FDA-approved drugs and with our initial prototype method. We generated multiple peptides for three targets of interest (PD-1, PD-L1, and CTLA-4) and found that the best generated peptide for each target was comparable to the FDA-approved drugs in binding affinity and fitness of 3D binding, and that the generated peptides were distinct from the existing FDA-approved drugs.
{"title":"GANDALF: Peptide Generation for Drug Design using Sequential and Structural Generative Adversarial Networks","authors":"Allison M. Rossetto, Wenjin Zhou","doi":"10.1145/3388440.3412487","DOIUrl":"https://doi.org/10.1145/3388440.3412487","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132167428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effectively integrating and mining multi-view, high-dimensional omics data is instrumental to precision medicine. Numerous methods have been proposed for addressing this problem. However, they neglect, to varying degrees, the challenges (e.g., interpretability, stability and consistency) pertaining to this integration process, and thereby suffer from unstable or inconsistent variable selection and deteriorating prediction accuracy. In this paper, we introduce a novel Fusion Lasso (FL) framework in which variable selection and data integration are formulated as a weighted constrained optimization problem. Specifically, four regularization constraints, i.e., sparsity, fusion penalty, instability and inconsistency, are simultaneously taken into account in the fusion model using multi-view data, while sparse features are revealed from the data of each individual view through ℓ1-norm minimization. We use the ADMM and Accelerated ADMM (AADMM) schemes to solve this optimization problem, leading to scalable model convergence with solid theoretical guarantees. By applying FL to five multi-omics cancer datasets collected by The Cancer Genome Atlas (TCGA), we demonstrate that FL outperforms popular variable selection and data integration approaches, such as Elastic Net, Precision Lasso, B-RAIL and MDBN, in cancer subtype and/or stage prediction. The proposed method is useful and can be further adapted to systems biology and other advanced clinical research areas where multi-view data integration is a necessity.
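The ℓ1-norm minimization mentioned in this abstract is what drives coefficients to exactly zero. In ADMM-style lasso solvers, the ℓ1 term is commonly handled by the soft-thresholding operator S_λ(v) = sign(v)·max(|v| − λ, 0). The sketch below shows only that generic operator. It is not the Fusion Lasso update, whose fusion, instability and inconsistency terms are not specified in the abstract, and the function name and the numbers are illustrative assumptions.

```python
def soft_threshold(values, lam):
    """Element-wise proximal operator of lam * ||x||_1 (generic lasso building block)."""
    shrunk = []
    for v in values:
        if v > lam:
            shrunk.append(v - lam)   # shrink positive values toward zero
        elif v < -lam:
            shrunk.append(v + lam)   # shrink negative values toward zero
        else:
            shrunk.append(0.0)       # small coefficients become exactly zero
    return shrunk


# Illustrative coefficients, not values from the paper.
coefficients = [3.0, -0.5, -2.0, 0.25]
sparse = soft_threshold(coefficients, lam=1.0)
print(sparse)
```

Coefficients whose magnitude is below the threshold are zeroed out, which is how an ℓ1 penalty performs variable selection.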
{"title":"Fusion Lasso and Its Applications to Cancer Subtype and Stage Prediction","authors":"Zhong Chen, Andrea Edwards, Kun Zhang","doi":"10.1145/3388440.3412461","DOIUrl":"https://doi.org/10.1145/3388440.3412461","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"167 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132030650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Wolfe, Othmane Belhoussine, Anais Dawson, Maxwell Lisaius, F. Jagodzinski
We explore how amino acid mutations affect the stability of the 306-residue main protease (Mpro) of the SARS-CoV-2 proteome. We employ two computational approaches, Site Directed Mutagenesis (SDM) and short runs of Molecular Dynamics. We focus our attention on residues 25-32, which make up a beta sheet of a canonical beta barrel close to an active site that includes Histidine 41. We considered this region a good candidate for mutations because a large perturbation of a highly structured region close to the active site may prove highly detrimental to the protein's stability and may affect catalytic efficiency. Understanding how amino acid mutations affect the stability of the protein can inform efforts to develop pharmacological interventions. We mutated each of the 8 residues in silico to all 19 other amino acids and analyzed the resulting 152 mutants. Both computational methods predict that only a few specific mutations to some of the 8 residues have a major effect on the structural stability of the protein.
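The count of 152 mutants follows from substituting each of the 8 residues (positions 25-32) with each of the 19 other canonical amino acids. A minimal sketch of that enumeration is given below. The wild-type residue identities are placeholders, since the abstract does not list them, and the function name is hypothetical.

```python
# One-letter codes for the 20 canonical amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_point_mutants(wild_type):
    """List every single-residue substitution as (position, wild-type, new residue).

    `wild_type` maps residue position -> one-letter wild-type code.
    """
    return [
        (position, original, replacement)
        for position, original in sorted(wild_type.items())
        for replacement in AMINO_ACIDS
        if replacement != original
    ]


# Placeholder residues: only the positions 25-32 come from the abstract.
# The letter "A" is NOT the real Mpro sequence at these positions.
placeholder_region = {position: "A" for position in range(25, 33)}
mutants = enumerate_point_mutants(placeholder_region)
print(len(mutants))  # 8 positions x 19 substitutions each
```

Each tuple could then be passed to a stability predictor or used to set up a short simulation; that step is outside the scope of this sketch.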
{"title":"Impactful Mutations in Mpro of the SARS-CoV-2 Proteome","authors":"G. Wolfe, Othmane Belhoussine, Anais Dawson, Maxwell Lisaius, F. Jagodzinski","doi":"10.1145/3388440.3414706","DOIUrl":"https://doi.org/10.1145/3388440.3414706","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132699316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Attention-deficit/hyperactivity disorder (ADHD) is the most pervasive neurodevelopmental disorder among children and adolescents. Current clinical diagnosis, however, is inaccurate and inefficient, hindering the administration of proper treatment regimens. Clinical assessments are based on qualitative observations of perceived behavior. They are time-consuming and costly, preventing individuals from gaining the support they need to succeed academically, socially, and occupationally. A more accurate and efficient method of detection is necessary to ensure that all children can be diagnosed and given proper treatment. This research proposes a novel machine learning-based method that analyzes pupil-dynamics data as an objective biomarker to characterize ADHD. After visualizing and engineering pupillometric features, an evaluation of state-of-the-art machine learning algorithms showed that an Ensemble Voting Classifier yielded the best binary classification metrics under leave-one-out cross-validation (LOOCV). The model classified ADHD with 82.1% sensitivity, 72.7% specificity, and 85.6% AUROC. Moreover, novel insights into associations between pupillometric features and the presence of ADHD were obtained and statistically validated. The best-performing model was implemented in a web application that administers a memory task and captures pupil biometrics in real time to output a probabilistic risk score of a patient having ADHD. This application is the first to use pupil-size dynamics as a biomarker, and it offers a time-efficient and accurate approach to detecting ADHD in children.
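Leave-one-out cross-validation holds out one subject at a time, fits the model on the remaining subjects, and predicts the held-out one. Sensitivity and specificity are then computed from the pooled predictions. The sketch below illustrates the procedure with a 1-nearest-neighbour classifier standing in for the Ensemble Voting Classifier, whose members the abstract does not describe. The feature values are invented toy numbers, not pupillometric data from the study.

```python
def loocv_predictions(features, labels):
    """Leave-one-out CV using a 1-nearest-neighbour stand-in classifier."""
    predictions = []
    for held_out, x in enumerate(features):
        # "Train" on all samples except the held-out one.
        training = [
            (abs(x - f), label)
            for index, (f, label) in enumerate(zip(features, labels))
            if index != held_out
        ]
        # Predict the label of the closest remaining sample.
        predictions.append(min(training, key=lambda pair: pair[0])[1])
    return predictions


def sensitivity_specificity(labels, predictions):
    """Return (true-positive rate, true-negative rate) for binary labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy one-dimensional feature values, invented for illustration only.
features = [20, 23, 27, 6, 1, 4, 9, 11]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = ADHD, 0 = control (toy labels)
predictions = loocv_predictions(features, labels)
sensitivity, specificity = sensitivity_specificity(labels, predictions)
print(sensitivity, specificity)
```

LOOCV is a common choice for small cohorts because every subject is used for testing exactly once while the training set stays as large as possible.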
{"title":"A Novel Pupillometric-Based Application for the Automated Detection of ADHD Using Machine Learning","authors":"William Das, S. Khanna","doi":"10.1145/3388440.3412427","DOIUrl":"https://doi.org/10.1145/3388440.3412427","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131688949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Patterns of two molecules across biological systems are often labeled as conserved or differential. We argue that this classification is insufficient. Here, we introduce three types of relationships across systems. Upon stimuli, a type-0 pattern arises from conserved circuitry with an active conserved trajectory; a type-1 pattern arises from conserved circuitry with an active differential trajectory; a type-2 pattern arises from rewired circuitry with an active trajectory. We present a first-order marginal change test, prove its optimality, and establish its asymptotic chi-squared distribution under the null hypothesis of identical marginals across conditions. The test outperformed other methods in detecting first-order differences in simulation studies. We also introduce a zeroth-order strength test to assess the association of two variables across systems. We compared gene co-expression networks of planktonic microbial communities in cold California coastal water against those in the warm water of the North Pacific Subtropical Gyre. The frequency of type-1 patterns is much higher than those of type-2 and type-0 patterns, revealing that the microbial communities are mostly conserved in molecular circuitry but respond differentially to ocean habitats. Type-1 and type-2 patterns are enriched with genes known to respond to environmental changes or stress; type-0 patterns involve genes with essential functions such as photosynthesis and general transcription. Our work provides a deep understanding of the effects of the environment on gene regulation in microbial communities. The method is generally applicable to other biological systems. All tests are provided in the R package 'DiffXTables' at https://cran.r-project.org/package=DiffXTables. Other source code and lists of significant gene patterns are available at https://www.cs.nmsu.edu/~joemsong/ACM-BCB-2020/Plankton
{"title":"Three Co-expression Pattern Types across Microbial Transcriptional Networks of Plankton in Two Oceanic Waters","authors":"Ruby Sharma, Xuye Luo, Sajal Kumar, Mingzhou Song","doi":"10.1145/3388440.3412485","DOIUrl":"https://doi.org/10.1145/3388440.3412485","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132719698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although next-generation sequencing technologies have made it possible to quickly generate large collections of sequences, current genomic data still suffer from small data sizes, imbalances, and biases due to various factors including disease rarity, test affordability, and concerns about privacy and security. To address these limitations, we develop Population-scale Genomic Data Augmentation based on Conditional Generative Adversarial Networks (PG-cGAN), which enhances the amount and diversity of genomic data by transforming samples already in the data rather than collecting new samples. Both the generator and the discriminator in PG-cGAN are built from stacked convolutional layers to capture the underlying population structure. Our results for augmenting genotypes in human leukocyte antigen (HLA) regions show that PG-cGAN can generate new genotypes with similar population structure, variant frequency distributions and linkage disequilibrium (LD) patterns. Since the input for PG-cGAN is the original genomic data without assumptions about prior knowledge, it can be extended to enrich many other types of biomedical data and beyond.
{"title":"Population-scale Genomic Data Augmentation Based on Conditional Generative Adversarial Networks","authors":"Junjie Chen, M. Mowlaei, Xinghua Shi","doi":"10.1145/3388440.3412475","DOIUrl":"https://doi.org/10.1145/3388440.3412475","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124672307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a computational technique that leverages the asymptotic-stabilization behavior of the transition probabilities of a two-state Markov chain. These asymptotic probabilities help the technique protect the privacy of functional magnetic resonance imaging (fMRI) data shared over a public distributed computer network. In general, fMRI signals reveal a large number of correlated brain features that can be used to develop predictive models for extracting brain networks and inferring private information about an individual. These features make fMRI data highly vulnerable to privacy attacks. To conceal these features for privacy protection, we transform them to an asymptotic state of an fMRI signal using the concepts of asymptotic stabilization with a two-state Markov chain, together with compressed sensing and compressed learning techniques. The proposed predictive model is built using the asymptotically stabilized fMRI signals rather than the original signals, which enhances privacy protection. Hence, the transformed signal, instead of the original signal, may be shared in public computer networks, such as a cloud computing network. The computer simulations show that the proposed predictive model provides very high prediction accuracy while providing very strong privacy protection.
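The asymptotic stabilization referred to here is a standard property of a two-state Markov chain. With transition probabilities a (state 0 to 1) and b (state 1 to 0), both strictly between 0 and 1, the n-step transition matrix converges to a matrix whose rows both equal the stationary distribution (b/(a+b), a/(a+b)), regardless of the starting state. The sketch below demonstrates only this convergence. It is not the authors' privacy transformation, and the values of a and b are arbitrary illustrative choices.

```python
def two_state_matrix(a, b):
    """Transition matrix with P(0 -> 1) = a and P(1 -> 0) = b."""
    return [[1.0 - a, a], [b, 1.0 - b]]


def multiply(x, y):
    """Product of two 2x2 matrices given as nested lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def n_step(matrix, steps):
    """Return the n-step transition matrix by repeated multiplication."""
    result = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix
    for _ in range(steps):
        result = multiply(result, matrix)
    return result


def stationary(a, b):
    """Closed-form stationary distribution of the two-state chain."""
    return [b / (a + b), a / (a + b)]


a, b = 0.3, 0.1  # illustrative transition probabilities
limit = n_step(two_state_matrix(a, b), 50)
pi = stationary(a, b)
print(limit[0], limit[1], pi)
```

After enough steps, both rows of the transition matrix agree with the stationary distribution, so the dependence on the initial state has effectively vanished. The convergence rate is governed by |1 − a − b|.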
{"title":"Asymptotically Stable Privacy Protection Technique for fMRI Shared Data over Distributed Computer Networks","authors":"Naseeb Thapaliya, Lavanya Goluguri, S. Suthaharan","doi":"10.1145/3388440.3414863","DOIUrl":"https://doi.org/10.1145/3388440.3414863","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125252850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shibiao Wan, Junil Kim, Yiping Fan, Kyoung-Jae Won
Single-cell technologies have received extensive attention from the bioinformatics and computational biology communities due to their evolutionary impact on uncovering novel cell types and intra-population heterogeneity in various domains of biology and medicine. Recent advances in single-cell RNA-sequencing (scRNA-seq) technologies have enabled parallel transcriptomic profiling of millions of cells. However, existing scRNA-seq clustering methods lack scalability, are time-consuming, and are prone to information loss during dimension reduction. To address these concerns, we present SHARP [1], an ensemble random projection-based algorithm that scales to clustering 10 million cells. By adopting a divide-and-conquer strategy, sparse random projection and two-layer meta-clustering, SHARP has the following advantages: (1) it is much faster than existing algorithms; (2) it scales to 10 million cells; (3) it is accurate in terms of clustering performance; (4) it preserves cell-to-cell distances during dimension reduction; and (5) it is robust to dropouts in scRNA-seq data. Comprehensive benchmarking tests on 20 scRNA-seq datasets demonstrate that SHARP remarkably outperforms state-of-the-art methods in terms of speed and accuracy. To the best of our knowledge, SHARP is the only R-based tool that scales to clustering 10 million cells. With an avalanche of single cells from different tissues to be sequenced in multiple international projects such as The Human Cell Atlas, we believe SHARP will serve as a useful and important tool for large-scale single-cell data analysis. Potential future directions include extending SHARP to rare cell type detection and to the integration of single-cell data from different platforms, experiments and conditions, while preserving its scalability and speed.
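The claim that random projection preserves cell-to-cell distances can be illustrated with a generic sparse random projection. The sketch below uses an Achlioptas-style matrix whose entries are sqrt(s/k)·{+1, 0, −1} with probabilities 1/(2s), 1 − 1/s and 1/(2s), which keeps squared distances unchanged in expectation. This is an assumption chosen for illustration: the abstract does not give SHARP's projection parameters, SHARP itself is R-based, and the dimensions, seeds and synthetic "cells" below are arbitrary.

```python
import math
import random


def sparse_projection_matrix(k, d, s=3, seed=0):
    """Build a k x d sparse random matrix (roughly 1/s of entries are non-zero)."""
    rng = random.Random(seed)
    scale = math.sqrt(s / k)
    matrix = []
    for _ in range(k):
        row = []
        for _ in range(d):
            u = rng.random()
            if u < 1 / (2 * s):
                row.append(scale)
            elif u < 1 / s:
                row.append(-scale)
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix


def project(matrix, vector):
    """Map a d-dimensional vector to k dimensions, skipping zero weights."""
    return [sum(w * x for w, x in zip(row, vector) if w) for row in matrix]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(1)
d, k = 500, 200  # illustrative "gene" and reduced dimensions
cell_a = [rng.gauss(0, 1) for _ in range(d)]  # synthetic expression profile
cell_b = [rng.gauss(0, 1) for _ in range(d)]  # synthetic expression profile
R = sparse_projection_matrix(k, d)
ratio = euclidean(project(R, cell_a), project(R, cell_b)) / euclidean(cell_a, cell_b)
print(round(ratio, 3))  # close to 1 when distances are approximately preserved
```

Because most entries of the matrix are zero, the projection is cheap to apply, which is one reason sparse random projection suits very large cell counts.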
{"title":"Processing Millions of Single Cells by SHARP","authors":"Shibiao Wan, Junil Kim, Yiping Fan, Kyoung-Jae Won","doi":"10.1145/3388440.3414214","DOIUrl":"https://doi.org/10.1145/3388440.3414214","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125338531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hamidreza Mohebbi, Nurit Haspel, D. Simovici, Joyce Quach
Gene fusion events are quite common in prostate, lymphoid, soft tissue, breast, gastric, and lung cancers, which calls for fast and accurate fusion detection methods. However, accurate identification requires whole genome sequencing. Current state-of-the-art methods suffer from inefficiency, insufficient accuracy, and high false positive rates. In this research, we present a parallel method that converts an inefficient categorical space into a compact binary array, thereby reducing the dimensionality of the data and speeding up the computation. The FDJD pipeline contains three steps: general alignment, fusion candidate generation, and refinement. Jaccard distance is used, alongside a fast KNN implementation, to find the nearest neighbors of a given query binary fingerprint. We benchmarked fusion prediction accuracy using both simulated and genuine RNA-Seq data sets. Fusion detection results are compared with those of the state-of-the-art methods STAR-Fusion, InFusion, and TopHat-Fusion. The genuine paired-end Illumina RNA-Seq data were obtained from 60 publicly available cancer cell line data sets. FDJD showed superior performance compared to popular alternative fusion detection methods on both simulated and genuine data sets. It attained 90% accuracy on simulated fusion transcript inputs. Of a total of 86 fusions predicted by at least three methods, we found 44 experimentally validated fusions using a wisdom-of-crowds approach. FDJD is not the fastest among the studied methods; however, it achieved the highest accuracy.
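The nearest-neighbor search over binary fingerprints can be sketched as follows. The abstract does not describe how FDJD builds its fingerprints or which KNN implementation it uses, so the k-mer encoding, the brute-force search, and all function names here are hypothetical stand-ins, shown only to make the Jaccard distance computation concrete.

```python
import heapq

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_fingerprint(sequence, k=3):
    """Hypothetical encoding: set one bit per distinct k-mer in the sequence.

    The fingerprint is a Python int used as a bit array of 4**k positions.
    """
    bits = 0
    for start in range(len(sequence) - k + 1):
        index = 0
        for base in sequence[start:start + k]:
            index = index * 4 + BASE_CODE[base]
        bits |= 1 << index
    return bits


def popcount(value):
    """Count the set bits of a non-negative int."""
    return bin(value).count("1")


def jaccard_distance(a, b):
    """1 - |A and B| / |A or B| for two bit-array fingerprints."""
    union = popcount(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - popcount(a & b) / union


def nearest_neighbors(query, references, k=2):
    """Brute-force search: the k closest (distance, name) pairs."""
    scored = ((jaccard_distance(query, fp), name) for name, fp in references.items())
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    # Toy reference sequences; the names are placeholders, not real genes.
    references = {
        "geneA": kmer_fingerprint("ACGTACGT"),
        "geneB": kmer_fingerprint("TTTTGGGG"),
    }
    query = kmer_fingerprint("ACGTAC")
    print(nearest_neighbors(query, references, k=1))
```

Storing each fingerprint as a single integer keeps the intersection and union to one bitwise operation each, which is the kind of saving a compact binary representation offers over comparing categorical values directly.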
{"title":"Fusion Transcript Detection from RNA-Seq using Jaccard Distance","authors":"Hamidreza Mohebbi, Nurit Haspel, D. Simovici, Joyce Quach","doi":"10.1145/3388440.3415585","DOIUrl":"https://doi.org/10.1145/3388440.3415585","url":null,"abstract":"Gene fusion events are quite common in prostate, lymphoid, soft tissue, breast, gastric, and lung cancers. This requires fast and accurate fusion detection methods. However, accurate identification requires whole genome sequencing. Current state of the art methods suffer from inefficiency, lack of sufficient accuracy, and generation of high false positive rate. In this research we present a parallel method to convert inefficient categorical space into a compact binary array and therefore, reduce the dimensionality of the data and speed up the computation. FDJD pipeline contains three steps: general alignment, fusion candidate generation, and refinement. In our research, Jaccard distance is used as a similarity measure to find the nearest neighbors of a given query binary fingerprint alongside a fast KNN implementation. We benchmarked our fusion prediction accuracy using both simulated and genuine RNA-Seq data sets. Fusion detection results are compared with the state-of-the-art-methods STAR-Fusion, InFusion and TopHat-Fusion. The paired-end Illumina RNA-Seq genuine data were obtained from 60 publicly available cancer cell line data sets. FDJD showed superior performance compared to popular alternative fusion detection methods in both simulated and genuine data sets. It attained 90% accuracy on simulated fusion transcript inputs. Of a total of 86 fusions predicted by at least three methods, we found 44 experimentally validated fusions using wisdom of crowds approach. FDJD is not the fastest among the studied methods. 
However, it achieved the highest accuracy.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121442753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}