This paper presents a computational technique that leverages the asymptotic-stabilization behavior of the transition probabilities of a two-state Markov chain. These asymptotic probabilities help the technique protect the privacy of functional magnetic resonance imaging (fMRI) data shared over a public distributed computer network. In general, fMRI signals reveal a large number of correlated brain features that can be used to develop predictive models for extracting brain networks and inferring private information about an individual. These features make fMRI data highly vulnerable to privacy attacks. To conceal these features for privacy protection, we transform them to an asymptotic state of an fMRI signal using the concepts of asymptotic stabilization with a two-state Markov chain, together with compressed sensing and compressed learning techniques. The proposed predictive model is built using the asymptotically stabilized fMRI signals rather than the original signals, which enhances privacy protection. Hence, the transformed signal, instead of the original signal, may be shared in public computer networks, such as a cloud computing network. The computer simulations show that the proposed predictive model provides very high prediction accuracy while providing very strong privacy protection.
{"title":"Asymptotically Stable Privacy Protection Technique for fMRI Shared Data over Distributed Computer Networks","authors":"Naseeb Thapaliya, Lavanya Goluguri, S. Suthaharan","doi":"10.1145/3388440.3414863","DOIUrl":"https://doi.org/10.1145/3388440.3414863","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
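To illustrate the asymptotic stabilization that the fMRI abstract above relies on, the following minimal Python sketch iterates a generic two-state Markov chain. It is not the authors' transformation, which the abstract does not specify; the transition probabilities `a` and `b` and all function names are assumptions chosen for illustration. It shows only the textbook fact that, for 0 < a + b < 2, the state distribution converges to (b/(a+b), a/(a+b)) whatever the starting state.

```python
def step(dist, a, b):
    """Advance a two-state distribution by one step.

    a = P(state 0 -> state 1), b = P(state 1 -> state 0).
    """
    p0, p1 = dist
    return (p0 * (1 - a) + p1 * b, p0 * a + p1 * (1 - b))


def iterate(dist, a, b, n):
    """Apply `step` n times."""
    for _ in range(n):
        dist = step(dist, a, b)
    return dist


def stationary(a, b):
    """Closed-form limiting distribution of the two-state chain."""
    return (b / (a + b), a / (a + b))


# Hypothetical transition probabilities, not taken from the paper.
a, b = 0.3, 0.1
limit = stationary(a, b)

# Start once in state 0 and once in state 1.
from_state0 = iterate((1.0, 0.0), a, b, 200)
from_state1 = iterate((0.0, 1.0), a, b, 200)

print(limit)
print(from_state0)
print(from_state1)
```

Both starting distributions end up at the same limit, so the limit alone does not reveal where the chain started. That gives some intuition for why a stabilized signal might conceal features of the original, although the abstract does not state the mechanism in these terms.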
G. Wolfe, Othmane Belhoussine, Anais Dawson, Maxwell Lisaius, F. Jagodzinski
We explore how amino acid mutations affect the stability of the 306-residue main protease (Mpro) of the SARS-CoV-2 proteome. We employ two computational approaches: Site Directed Mutagenesis (SDM) and short runs of Molecular Dynamics. We focus on residues 25-32, which make up a beta sheet of a canonical beta barrel close to an active site that includes Histidine 41. We considered this region a good candidate for mutations because a large perturbation of a highly structured region close to the active site may prove highly detrimental to the protein's stability and may affect catalytic efficiency. Understanding how amino acid mutations affect the stability of the protein can inform efforts to develop pharmacological interventions. We mutated the 8 residues in silico to all other possible amino acids and analyzed the resulting 152 mutants (8 positions × 19 alternative amino acids). Both computational methods predict that only a few specific mutations to some of the 8 residues have a major effect on the structural stability of the protein.
{"title":"Impactful Mutations in Mpro of the SARS-CoV-2 Proteome","authors":"G. Wolfe, Othmane Belhoussine, Anais Dawson, Maxwell Lisaius, F. Jagodzinski","doi":"10.1145/3388440.3414706","DOIUrl":"https://doi.org/10.1145/3388440.3414706","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
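The mutant count in the Mpro abstract above follows from simple enumeration: each of the 8 positions can be changed to any of the 19 other standard amino acids. The sketch below reproduces that count. The wild-type string is a deliberate placeholder, not the real Mpro sequence, and the function name and mutation labels are illustrative assumptions.

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_point_mutants(wild_type, first_position):
    """List every single-residue substitution as '<original><position><replacement>'."""
    mutants = []
    for offset, original in enumerate(wild_type):
        position = first_position + offset
        for replacement in AMINO_ACIDS:
            if replacement != original:
                mutants.append(f"{original}{position}{replacement}")
    return mutants


# Placeholder residues for positions 25-32 (hypothetical, NOT the real sequence).
PLACEHOLDER_WILD_TYPE = "AAAAAAAA"

mutants = enumerate_point_mutants(PLACEHOLDER_WILD_TYPE, first_position=25)
print(len(mutants))  # 152
```

The total does not depend on which residues occupy the 8 positions, only on there being 8 positions and 20 standard amino acids.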
Patterns of two molecules across biological systems are often labeled as conserved or differential. We argue that this classification is insufficient. Here, we introduce three types of relationships across systems. Upon stimuli, a type-0 pattern arises from conserved circuitry with an active conserved trajectory; a type-1 pattern arises from conserved circuitry with an active differential trajectory; a type-2 pattern arises from rewired circuitry with an active trajectory. We present a 1st-order marginal change test, prove its optimality, and establish its asymptotic chi-squared distribution under the null hypothesis of identical marginals across conditions. The test outperformed other methods in detecting 1st-order differences in simulation studies. We also introduce a zeroth-order strength test to assess the association of two variables across systems. We compared gene co-expression networks of planktonic microbial communities in cold California coastal water against those in the warm water of the North Pacific Subtropical Gyre. The frequency of type-1 patterns is much higher than that of type-2 and type-0 patterns, revealing that the microbial communities are mostly conserved in molecular circuitry but respond differentially to ocean habitats. Type-1 and type-2 patterns are enriched with genes known to respond to environmental changes or stress; type-0 patterns involve genes with essential functions such as photosynthesis and general transcription. Our work provides a deeper understanding of the effects of the environment on gene regulation in microbial communities. The method is generally applicable to other biological systems. All tests are provided in the R package 'DiffXTables' at https://cran.r-project.org/package=DiffXTables. Other source code and lists of significant gene patterns are available at https://www.cs.nmsu.edu/~joemsong/ACM-BCB-2020/Plankton
{"title":"Three Co-expression Pattern Types across Microbial Transcriptional Networks of Plankton in Two Oceanic Waters","authors":"Ruby Sharma, Xuye Luo, Sajal Kumar, Mingzhou Song","doi":"10.1145/3388440.3412485","DOIUrl":"https://doi.org/10.1145/3388440.3412485","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
Effectively integrating and mining multi-view, high-dimensional omics data is instrumental to precision medicine. Numerous methods have been proposed for addressing this problem. However, they more or less neglect the challenges (e.g., interpretability, stability and consistency) pertaining to this integration process, and thereby suffer from unstable or inconsistent variable selection and deteriorating prediction accuracy. In this paper, we introduce a novel Fusion Lasso (FL) framework in which variable selection and data integration are formulated as a weighted constrained optimization problem. Specifically, four regularization constraints, i.e., sparsity, fusion penalty, instability and inconsistency, are simultaneously taken into account in the fusion model using multi-view data, while sparse features are revealed from the data of each individual view through ℓ1-norm minimization. We use the ADMM and Accelerated ADMM (AADMM) schemes to solve this optimization problem, leading to scalable model convergence with solid theoretical guarantees. By applying FL to five multi-omics cancer datasets collected by The Cancer Genome Atlas (TCGA), we demonstrate that FL outperforms popular variable selection and data integration approaches, such as Elastic Net, Precision Lasso, B-RAIL and MDBN, in cancer subtype and/or stage prediction. The proposed method is useful and can be further adopted in systems biology and other advanced clinical research areas where multi-view data integration is a necessity.
{"title":"Fusion Lasso and Its Applications to Cancer Subtype and Stage Prediction","authors":"Zhong Chen, Andrea Edwards, Kun Zhang","doi":"10.1145/3388440.3412461","DOIUrl":"https://doi.org/10.1145/3388440.3412461","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
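The ℓ1-norm term mentioned in the Fusion Lasso abstract above is what produces sparse coefficients. In ADMM-style lasso solvers, the ℓ1 term is commonly handled by a soft-thresholding step, and the sketch below shows only that standard operator. It is not the Fusion Lasso algorithm, whose update rules the abstract does not give; the function name, the example coefficients, and the threshold are assumptions for illustration.

```python
def soft_threshold(values, threshold):
    """Shrink each value toward zero by `threshold`; values within the band become 0."""
    result = []
    for v in values:
        if v > threshold:
            result.append(v - threshold)
        elif v < -threshold:
            result.append(v + threshold)
        else:
            result.append(0.0)
    return result


# Hypothetical coefficient vector and threshold.
coefficients = [2.5, -0.3, 0.0, -1.75, 0.4]
shrunk = soft_threshold(coefficients, threshold=0.5)
print(shrunk)  # [2.0, 0.0, 0.0, -1.25, 0.0]
```

Small coefficients are set exactly to zero, which is how an ℓ1 penalty performs variable selection rather than merely shrinking every coefficient.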
Computational drug design has the potential to save time and money by providing a better starting point for new drugs with a complete computational evaluation. We propose a peptide design system for protein targets based on a Generative Adversarial Network (GAN) called GANDALF (Generative Adversarial Network Drug-tArget Ligand Fructifier). GAN-based methods have been developed for computational drug design, but these can only generate small molecules, not peptides. Peptides are very complex macromolecules, which makes them much more difficult to generate than small molecules. Our GANDALF methodology uses two networks to generate a new peptide sequence and structure. It also incorporates data, such as active atoms, that are not used in other methods. Active atoms are important because they interact via electron sharing when a target protein and a peptide bind to each other. We can identify the active atoms using our electron structure calculation (eCADD) program and the rules of interaction we have developed. Our method goes further than comparable methods by generating a full peptide structure as well as predicting binding affinity. The results were validated using a multi-step process comparing the results with FDA-approved drugs and our initial prototype method. We have generated multiple peptides for three targets of interest (PD-1, PD-L1, and CTLA-4) and have found that the best generated peptide for each target was comparable to the FDA-approved drugs in binding affinity and fitness of 3D binding, and that the generated peptides were distinct from the existing FDA-approved drugs.
{"title":"GANDALF: Peptide Generation for Drug Design using Sequential and Structural Generative Adversarial Networks","authors":"Allison M. Rossetto, Wenjin Zhou","doi":"10.1145/3388440.3412487","DOIUrl":"https://doi.org/10.1145/3388440.3412487","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
Aly A. Valliani, J. Schwartz, Varun Arvind, A. Taree, Jun S. Kim, Samuel K. Cho
Pediatric bone age assessment is clinically valuable for the evaluation of a variety of pediatric endocrine and orthopedic conditions. Recent studies have explored automated methods for bone age assessment using machine learning techniques, yielding impressive results. However, many state-of-the-art methods rely on manual, fine-grained segmentation of phalanges and have not been validated on an external hospital site. The purpose of this study was to examine the efficacy of a deep learning algorithm for pediatric bone age assessment without the need for time-intensive segmentation. We utilize a novel training regime to achieve results on par with existing approaches, present a systematic analysis of experimental findings via an ablation study, and evaluate generalizability on an external dataset as a function of training data size. The final optimized model achieves a mean absolute error of 7.59 months upon internal validation and 11.02 months upon validation with data from an external hospital site.
{"title":"Multi-Site Assessment of Pediatric Bone Age Using Deep Learning","authors":"Aly A. Valliani, J. Schwartz, Varun Arvind, A. Taree, Jun S. Kim, Samuel K. Cho","doi":"10.1145/3388440.3412429","DOIUrl":"https://doi.org/10.1145/3388440.3412429","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
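The bone age abstract above reports its results as mean absolute error (MAE) in months. For readers unfamiliar with the metric, here is a minimal sketch of how MAE is computed. The predicted and reference ages are made-up values for illustration, not data from the study, and the function name is an assumption.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over paired observations."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical bone ages in months (illustrative only).
predicted_months = [120.0, 96.0, 150.0, 60.0]
actual_months = [126.0, 90.0, 141.0, 63.0]

mae = mean_absolute_error(predicted_months, actual_months)
print(mae)  # 6.0
```

An MAE of 7.59 months therefore means that, on average, a prediction differs from the reference bone age by about seven and a half months in either direction.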
Attention-deficit/hyperactivity disorder (ADHD) is the most pervasive neurodevelopmental disorder among children and adolescents. Current clinical diagnosis, however, is inaccurate and inefficient, hindering the administration of proper treatment regimens. Clinical assessments are based on qualitative observations of perceived behavior. They are time-consuming and costly, preventing individuals from gaining the support they need to succeed academically, socially, and occupationally. A more accurate and efficient method of detection is necessary to ensure that all children can be diagnosed and given proper treatment regimens. This research proposes a novel machine learning-based method to analyze pupil-dynamics data as an objective biomarker to characterize ADHD. After visualizing and engineering pupillometric features, an evaluation of state-of-the-art machine learning algorithms showed that an Ensemble Voting Classifier yielded the optimal binary classification metrics using leave-one-out cross-validation (LOOCV). The model classified ADHD with 82.1% sensitivity, 72.7% specificity, and 85.6% AUROC. Moreover, novel insights into associations between pupillometric features and the presence of ADHD were garnered and statistically validated. The optimal machine learning model was implemented in a web application that administers a memory task and captures pupil biometrics in real time to output a probabilistic risk score of a patient having ADHD. This application is the first to use pupil-size dynamics as a biomarker, and offers a time-efficient and accurate approach to detect ADHD in children.
{"title":"A Novel Pupillometric-Based Application for the Automated Detection of ADHD Using Machine Learning","authors":"William Das, S. Khanna","doi":"10.1145/3388440.3412427","DOIUrl":"https://doi.org/10.1145/3388440.3412427","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
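The sensitivity and specificity figures in the ADHD abstract above are ratios taken from a confusion matrix. The sketch below shows how they are computed. The counts are hypothetical, chosen only so that the ratios round to the reported percentages; the abstract does not report the cohort size or the actual confusion matrix, and the function names are assumptions.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual positive cases that the classifier flags."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of actual negative cases that the classifier clears."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts (NOT from the study).
tp, fn = 23, 5
tn, fp = 8, 3

print(round(100 * sensitivity(tp, fn), 1))  # 82.1
print(round(100 * specificity(tn, fp), 1))  # 72.7
```

Under leave-one-out cross-validation, each subject is predicted by a model trained on all the others, and the pooled predictions fill a confusion matrix of this kind.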
RNA sequence analysis of emerging SARS-CoV-2 infection is valuable for tracking viral evolution and developing novel diagnostic tools. Furthermore, SARS-CoV-2 sequence analysis can provide insight into potential antigenic drift events that lead to strain speciation and changing clinical outcomes. In this work, we aim to develop a pipeline using next-generation sequencing (NGS) technology in addition to machine learning/bioinformatics to track the accumulation of mutations and viral evolution.
{"title":"Machine Learning Model to Track SARS-CoV-2 Viral Mutation Evolution and Speciation Using Next-generation Sequencing Data","authors":"I. Derecichei, G. Atikukke","doi":"10.1145/3388440.3415991","DOIUrl":"https://doi.org/10.1145/3388440.3415991","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
Shibiao Wan, Junil Kim, Yiping Fan, Kyoung-Jae Won
Single-cell technologies have received extensive attention from bioinformatics and computational biology communities due to their evolutionary impacts on uncovering novel cell types and intra-population heterogeneity in various domains of biology and medicine. Recent advances in single-cell RNA-sequencing (scRNA-seq) technologies have enabled parallel transcriptomic profiling of millions of cells. However, existing scRNA-seq clustering methods lack scalability, are time-consuming, and are prone to information loss during dimension reduction. To address these concerns, we present SHARP [1], an ensemble random projection-based algorithm that scales to clustering 10 million cells. By adopting a divide-and-conquer strategy, a sparse random projection and two-layer meta-clustering, SHARP has the following advantages: it is (1) much faster than existing algorithms; (2) scalable to 10 million cells; (3) accurate in terms of clustering performance; (4) able to preserve cell-to-cell distances during dimension reduction; and (5) robust to dropouts in scRNA-seq data. Comprehensive benchmarking tests on 20 scRNA-seq datasets demonstrate that SHARP remarkably outperforms state-of-the-art methods in terms of speed and accuracy. To the best of our knowledge, SHARP is the only R-based tool that is scalable to clustering 10 million cells. With an avalanche of single cells in different tissues to be sequenced in multiple international projects like The Human Cell Atlas, we believe SHARP will serve as a useful and important tool for large-scale single-cell data analysis. Potential future directions include extending SHARP, while keeping its scalability and speed, to rare cell type detection and to the integration of single-cell data from different platforms, experiments and conditions.
{"title":"Processing Millions of Single Cells by SHARP","authors":"Shibiao Wan, Junil Kim, Yiping Fan, Kyoung-Jae Won","doi":"10.1145/3388440.3414214","DOIUrl":"https://doi.org/10.1145/3388440.3414214","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
Torin Adamson, Selina M. Bauernfeind, Bruna Jacobson, Lydia Tapia
Molecular docking prediction, used to design new drugs and identify biological function, can be represented as a motion planning problem in high-dimensional space. This complex task can benefit from human guidance, i.e., interactive simulations combining visual and haptic feedback from atomic forces [4]. Human-generated data has been used to find solutions to complex problems, as in protein folding with FoldIt [3], showing that crowdsourced data collection is a viable way to help solve difficult problems in biology. However, existing interactive molecular docking systems typically rely on haptic devices that are not widely available to the public, limiting the potential to crowdsource solutions [1]. This user study tested 4 input devices to find potentially bound ligand-receptor states and biologically feasible paths via motion planning. On a PC laptop, players used one of the following with our in-house interactive rigid-body molecular docking program, DockAnywhere [1, 2]: a 6 degree-of-freedom (DOF) haptic feedback device, a 3 DOF haptic feedback device, a game controller with vibration feedback, or a mouse and keyboard. Players tried to dock a known inhibitor of HIV Protease (PDB ID 1AJX). The goal is to move the ligand around the receptor to find the lowest potential energy state possible. Players get feedback as a positive integer score based on the potential energy and, for haptic devices, as force feedback calculated from atomic forces. During gameplay, the ligand's positions and orientations are recorded as it is moved through the environment. In total, 439,832 states were collected from 32 players (8 per device). These states were divided into 4 sets, one per device. Motion planning queries found a feasible path to the goal state in all sets, with the mouse showing more pronounced energy barriers. We found that exploration varied among devices, with players on haptic devices exploring the interaction energy landscape more uniformly.
However, all 4 devices yielded low-energy ligand states with comparable values and similar closest distances to the binding site. In summary, force feedback showed no clear improvement in finding low potential energy states or in getting closer to the known binding state. However, haptic devices appear to enable a more thorough exploration of the state space, creating more samples and generating smoother paths. The nature of this guidance should be the subject of future investigation.
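One way to read the motion planning query over recorded states is as a graph search for the path with the lowest energy barrier. The sketch below is an assumption, not the planner used in the study: it links recorded states that lie within a hypothetical radius and runs a minimax variant of Dijkstra's algorithm. The function names, the 2D positions (real ligand states would also carry orientation), and the energies are all invented for illustration.

```python
import heapq
import math


def build_roadmap(states, radius):
    """Connect every pair of recorded states closer than `radius`.

    `states` is a list of (position, energy) pairs. The connection rule
    is a hypothetical stand-in for a real feasibility check.
    """
    edges = {i: [] for i in range(len(states))}
    for i, (pos_i, _) in enumerate(states):
        for j in range(i + 1, len(states)):
            if math.dist(pos_i, states[j][0]) <= radius:
                edges[i].append(j)
                edges[j].append(i)
    return edges


def lowest_barrier_path(states, edges, start, goal):
    """Return (barrier, path) minimizing the highest energy along the path.

    Returns None when the goal cannot be reached from the start.
    """
    best = {start: states[start][1]}
    parent = {start: None}
    heap = [(states[start][1], start)]
    while heap:
        barrier, node = heapq.heappop(heap)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return barrier, path[::-1]
        if barrier > best[node]:
            continue  # stale heap entry
        for nxt in edges[node]:
            candidate = max(barrier, states[nxt][1])
            if candidate < best.get(nxt, math.inf):
                best[nxt] = candidate
                parent[nxt] = node
                heapq.heappush(heap, (candidate, nxt))
    return None


# Toy data: five states on a grid, with invented energies.
toy_states = [
    ((0, 0), 5.0),
    ((1, 0), 9.0),
    ((0, 1), 6.0),
    ((1, 1), 4.0),
    ((2, 1), 1.0),
]
toy_edges = build_roadmap(toy_states, radius=1.0)
result = lowest_barrier_path(toy_states, toy_edges, start=0, goal=4)
# result == (6.0, [0, 2, 3, 4]): the route through state 1 would cross energy 9.0
```

Under this reading, a device whose recorded states yield a higher barrier value would correspond to the "more pronounced energy barriers" reported for the mouse, although the study's actual path metric may differ.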
{"title":"Using player generated data to elucidate molecular docking","authors":"Torin Adamson, Selina M. Bauernfeind, Bruna Jacobson, Lydia Tapia","doi":"10.1145/3388440.3414704","DOIUrl":"https://doi.org/10.1145/3388440.3414704","url":null,"abstract":"Molecular docking prediction, used to design new drugs and identify biological function, can be represented as a motion planning problem in high dimensional space. This complex task can benefit from human guidance, i.e., interactive simulations combining visual and haptic feedback from atomic forces [4]. Using human-generated data to find solutions to complex problems has been used in protein folding with FoldIt [3], showing that crowdsourced data collection is viable to help solve difficult problems in biology. However, existing interactive molecular docking systems that typically use haptic devices employ devices that are not widely available to the public, limiting the potential to crowdsource solutions [1]. This user study tested 4 input devices to find potentially bound ligand-receptor states and biologically feasible paths via motion planning. On a PC laptop, players used one of the following: a 6 degree-of-freedom (DOF) haptic feedback device, a 3 DOF haptic feedback device, a game controller with vibration feedback, and a mouse and keyboard, with our in-house interactive rigid body molecular docking program, DockAnywhere [1, 2]. Players tried to dock a known inhibitor of HIV Protease (PDB ID 1AJX). The goal is to move the ligand around the receptor to find the lowest potential energy state possible. Players get feedback as a positive integer score based on the potential energy and, for haptic devices, a force feedback calculated from atomic forces. During gameplay, ligand positions and orientations are recorded as it is moved through the environment. In total, 439,832 states were collected from 32 players (8 per device). These states were divided into 4 sets, one per device. 
and Motion planning queries found a feasible path to the goal state in all sets, with the mouse showing more pronounced energy barriers. We found that exploration varied among different devices, with players on haptic devices exploring the interaction energy landscape more uniformly. However, all 4 devices yielded low energy ligand states with comparable values and similar closest distance to the binding site. In summary, force feedback showed no clear improvement to finding low potential energy states or getting closer to the known binding state. However, haptic devices appear to enable a more thorough exploration of the state space, creating more samples and generating smoother paths. The nature of this guidance should be the subject of future investigation.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125438905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}