STARR-Seq is a high-throughput technique for directly identifying genomic regions with enhancer activity [1]. Genomic DNA is sheared, inserted into artificial plasmids designed so that DNA with enhancer activity triggers its own transcription, and transfected into cultured cells. The resulting RNA is reverse-transcribed into cDNA, sequenced, and aligned to a reference genome. "Peaks" are called by comparing the observed read depth at each position to the read depth expected from control DNA using a statistical test. Examples of read-depth-based peak callers include MACS2 [4], basicSTARRSeq, and STARRPeaker [3]. It is challenging to accurately distinguish real peaks from artifacts in regions where the mean read depth is low but the variance is high. Fortunately, enhancer activity is strongly correlated with sequence content. We propose using sequence-based machine learning models in a semi-supervised framework to filter peaks.

501-bp sequences centered on the ≈11k STARR peaks from [1] were extracted from the Drosophila melanogaster dm3 genome. Randomly sampled 501-bp sequences were used as a negative set. Peaks were filtered using a Bonferroni-corrected significance threshold (α = 0.05) to create a "high-confidence" subset of ≈2.2k peaks. A logistic regression model with k-mer count features was trained on the high-confidence peak sequences and their negatives and used to classify the remaining ≈8.8k peak sequences. The self-trained, sequence-based model identified an additional ≈3.7k candidate enhancers ("medium confidence"); the remaining ≈5k STARR peaks were considered "low confidence". We plotted histograms of the read-depth log-fold change for the three sets of peaks (see Figure 1). The distributions for the medium- and low-confidence peaks overlapped substantially, so the sequence-based model identified enhancer candidates that would otherwise be filtered out using read depth alone.

We called peaks for the four D. melanogaster FAIRE-Seq data sets from [2]. Sequencing data were cleaned with Trimmomatic, aligned to the dm3 genome with bwa backtrack, and filtered with samtools to remove reads with mapping quality below 10. MACS2 called ≈61k FAIRE peaks. The STARR peaks overlapped the FAIRE peaks with precisions of 52.7% (high-confidence peaks), 40.6% (medium-confidence peaks), and 22.5% (low-confidence peaks).
Title: Filtering STARR-Seq Peaks for Enhancers with Sequence Models
Authors: R. J. Nowling, Rafael Reple Geromel, B. Halligan
DOI: https://doi.org/10.1145/3388440.3414905
Published: 2020-09-21, in Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
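The peak-filtering approach above combines k-mer count features with a self-trained classifier. As a minimal illustration in pure Python (the sequences and helper names here are made up for demonstration), the k-mer featurization step can be sketched as:

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_featurize(seq, k=2):
    """Map a DNA sequence to a fixed-order vector of overlapping k-mer
    counts, the feature representation for the logistic regression."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get("".join(p), 0) for p in product(ALPHABET, repeat=k)]

# Toy 2-mer features for a short sequence (real inputs are 501 bp).
features = kmer_featurize("ACGTACGT", k=2)
```

A logistic regression trained on such vectors for the high-confidence peaks and random negatives can then score the remaining peaks; peaks scoring above a chosen threshold would form the medium-confidence set.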
Arrays of repeat domains are critical to the proper function of a significant fraction of protein families. These repeats are easily identified in sequence, and are thought to have arisen primarily through the simultaneous duplication of multiple domains. However, for most repeat domain protein families, very little is typically known about the specific domain duplication events that occurred in their evolutionary histories. Here we extend existing reconciliation formulations that use domain trees and sequence trees to infer domain duplication and loss events to additionally consider simultaneous domain duplications under arbitrary cost models. We develop a novel integer linear programming (ILP) solution to this reconciliation problem, and demonstrate the accuracy and robustness of our approach on simulated datasets. Finally, as proof of principle, we apply our approach to an orthogroup containing the C2H2 zinc finger repeat domain, and identify simultaneous domain duplications that occurred at the onset of the primate lineage. Simulation and ILP code is available at https://github.com/Singh-Lab/treeSim.
Title: Identifying Evolutionary Origins of Repeat Domains in Protein Families
Authors: Chaitanya Aluru, Mona Singh
DOI: https://doi.org/10.1145/3388440.3412416
Published: 2020-09-21
Skin cancer is one of the most common forms of cancer and is widespread around the world. With early, accurate diagnosis, the chances of successfully treating skin cancer are high. This has inspired us to design a deep learning model that uses a convolutional neural network to automatically classify and detect different types of skin cancer from images. In this way, the system supports prevention and early detection of skin cancer, potentially leading to the best approach for treatment. The goal of this research is to apply systematic metaheuristic optimization and image detection techniques based on a convolutional neural network to efficiently and accurately detect and classify different types of skin lesions.
Title: Convolutional Neural Network Strategy for Skin Cancer Lesions Classifications and Detections
Authors: Abdala Nour, B. Boufama
DOI: https://doi.org/10.1145/3388440.3415988
Published: 2020-09-21
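The core operation of the convolutional network named above is a learned sliding-window filter. A minimal, dependency-free sketch of a single valid-mode 2D convolution on a grayscale image given as nested lists (illustrative only, not the paper's architecture):

```python
def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image
    and sum the element-wise products at each position."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(iw - kw + 1)
        ]
        for r in range(ih - kh + 1)
    ]
```

A CNN stacks many such filters (with learned kernels), nonlinearities, and pooling, ending in a classifier head over the lesion categories.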
Gene expression data for multiple biological and environmental conditions are being collected for multiple species. Discovery of functional modules and subnetwork biomarkers has traditionally been based on analyzing a single gene expression dataset. Research has since focused on discovering modules from multiple gene expression datasets. Gene coexpression network mining methods have been proposed for mining frequent functional modules, and biclustering algorithms have been proposed to allow for missing coexpression links. Existing approaches report a large number of edgesets with high overlap. In this work, we propose an algorithm to mine frequent dense modules from multiple coexpression networks using a post-processing data summarization method. Our algorithm mines a succinct set of representative subgraphs with little overlap, which reduces the downstream analysis of the reported modules. Experiments on human gene expression data show that the reported modules are biologically significant, as evidenced by enrichment in Gene Ontology molecular functions and KEGG pathways.
Title: Post-Processing Summarization for Mining Frequent Dense Subnetworks
Authors: Sangmin Seo, Saeed Salem
DOI: https://doi.org/10.1145/3388440.3415989
Published: 2020-09-21
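The post-processing summarization described above can be pictured as a greedy filter over the mined edgesets: keep a module only if it overlaps every already-kept representative by at most a threshold. The paper's actual selection criterion and threshold are not given here, so the following is an illustrative sketch only:

```python
def jaccard(a, b):
    """Jaccard overlap between two edge sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def summarize(edgesets, max_overlap=0.5):
    """Greedy post-processing: visit modules largest-first and keep one
    only if its Jaccard overlap with every kept representative is small."""
    reps = []
    for es in sorted(edgesets, key=len, reverse=True):
        if all(jaccard(es, r) <= max_overlap for r in reps):
            reps.append(es)
    return reps
```

On a toy input like `[{1,2,3,4}, {1,2,3}, {5,6}]`, the near-duplicate `{1,2,3}` is absorbed by its larger representative while the disjoint `{5,6}` survives.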
Cryo-electron microscopy is a biophysics technique that produces volume images of a given molecule and can visualize large molecules and protein complexes. At high resolution (<5Å), the structure can be modeled directly. When the resolution drops below 5Å, computational techniques are used to overcome the inaccuracy inherent in volume images. In this paper, we propose a segmentation-based approach that extracts important features to overcome this inherent inaccuracy in medium-resolution volume images. The features are volume components that represent local peak regions of the image. The volume components are then classified into one of the main secondary structure elements found in protein molecules. Specifically, we built four models to classify volume components: Helix-Sheet-Loop, Helix-Binary, Sheet-Binary, and Loop-Binary. We used machine learning-based classifiers; seven classification models were used to classify the volume components. The work in this paper is a preliminary approach to detecting secondary structure elements in medium-resolution volume images. The four machine-learning models were trained using authentic volume images from the Electron Microscopy Data Bank; no simulated or synthesized images were used for either training or testing. This is important because all existing methods train on simulated images, and due to the noise inherent in authentic images, simulated images are not the best representatives. The procedure includes feature extraction, model selection, fine-tuning, and model ensembling. We tested our four models on a held-out 20% of a dataset of 3,400 volume components. The methods achieved accuracies of 80% for the Sheet-Binary model, 77% for Helix-Binary, 71% for Loop-Binary, and 67% for Helix-Sheet-Loop.
Title: Segmentation-based Feature Extraction for Cryo-Electron Microscopy at Medium Resolution
Authors: Lin Chen, Ruba Jebril, K. Al Nasr
DOI: https://doi.org/10.1145/3388440.3414711
Published: 2020-09-21
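The model-ensembling step mentioned above can be as simple as majority voting over the individual classifiers' predicted labels. A minimal sketch (the label names and tie-breaking rule are illustrative assumptions, not the paper's):

```python
from collections import Counter

def ensemble_vote(model_outputs):
    """Majority vote over per-model predicted labels for one volume
    component; ties go to the label predicted by the earliest model."""
    counts = Counter(model_outputs)
    best = max(counts.values())
    for label in model_outputs:  # preserves model order on ties
        if counts[label] == best:
            return label
```

For example, if two of three classifiers label a component "helix", the ensemble reports "helix".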
The DNA regulatory code of gene expression is encoded in the gene regulatory structure spanning the coding and adjacent non-coding regulatory DNA regions. Deciphering this regulatory code, and how the whole gene structure interacts to produce mRNA transcripts and regulate mRNA abundance, can greatly improve our capabilities for controlling gene expression. Here, we consider that natural systems offer the most accurate information on gene expression regulation and apply deep learning to over 20,000 mRNA datasets to learn the DNA-encoded regulatory code across a variety of model organisms, from bacteria to human [1]. We find that up to 82% of the variation in gene expression is encoded in the gene regulatory structure across all model organisms. Coding and regulatory regions carry both overlapping and new, orthogonal information and contribute additively to gene expression prediction. By mining the gene expression models for the relevant DNA regulatory motifs, we uncover that motif interactions across the whole gene regulatory structure define over three orders of magnitude of gene expression levels. Finally, we experimentally verify the usefulness of our AI-guided approach for protein expression engineering. Our results suggest that single motifs or regulatory regions might not be solely responsible for regulating gene expression levels. Instead, the whole gene regulatory structure, which contains the DNA regulatory grammar of interacting DNA motifs across the protein-coding and non-coding regulatory regions, forms a coevolved transcriptional regulatory unit. This provides a route by which whole gene systems with pre-specified expression patterns can be designed.
Title: Learning the regulatory grammar of DNA for gene expression engineering
Authors: Jan Zrimec, Aleksej Zelezniak
DOI: https://doi.org/10.1145/3388440.3414922
Published: 2020-09-21
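Deep models of the kind described above typically consume DNA as a one-hot matrix, one 4-channel row per base. A minimal encoder in pure Python (the A, C, G, T channel ordering is an assumption; unknown bases map to all zeros):

```python
def one_hot(seq):
    """One-hot encode a DNA sequence: each base becomes a 4-channel
    row in channel order A, C, G, T; other characters become zeros."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]
```

Convolutional layers over such a matrix can then learn motif detectors across the coding and non-coding regulatory regions.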
The ability to estimate protein-protein binding free energy in a computationally efficient manner via a physics-based approach is beneficial to research focused on the mechanism of viruses binding to their target proteins. Implicit solvation methodology may be particularly useful in the early stages of such research, as it can quickly offer valuable insights into the binding process. Here we evaluate the potential of the molecular mechanics generalized Born surface area (MMGB/SA) approach to estimate the binding free energy between the SARS-CoV-2 spike receptor-binding domain and the human ACE2 receptor. The calculations are based on a recent flavor of the generalized Born model, GBNSR6, shown to be effective in protein-ligand binding estimates but never before used in the MMGB/SA context. Two options for representing the dielectric boundary of the molecule are evaluated: one based on standard Bondi radii, and the other based on a newly developed set of atomic radii (OPT1) optimized specifically for protein-ligand binding. We first test the entire computational pipeline on the well-studied Ras-Raf protein-protein complex, which has a binding free energy similar to that of the SARS-CoV-2/ACE2 complex. Predictions based on both radii sets are closer to experiment than a previously published MMGB/SA estimate. The two estimates for SARS-CoV-2/ACE2 also provide a "bound" on the experimental ΔGbind: -14.7 (Bondi) < -10.6 (Exp.) < -4.1 (OPT1) kcal/mol. Both estimates point to the expected near cancellation of the relatively large enthalpy and entropy contributions, suggesting that the proposed MMGB/SA protocol may be trustworthy, at least qualitatively, for analysis of SARS-CoV-2/ACE2 in light of the need to move forward fast.
Title: Binding Free Energy of the Novel Coronavirus Spike Protein and the Human ACE2 Receptor: An MMGB/SA Computational Study
Authors: Negin Forouzesh
DOI: https://doi.org/10.1145/3388440.3414712
Published: 2020-09-21
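In the MMGB/SA scheme described above, the binding free energy is the free energy of the complex minus those of the unbound receptor and ligand, with each free energy a sum of molecular-mechanics, generalized Born, and surface-area terms. A sketch of the bookkeeping with toy numbers (not values from the study; the entropy term is omitted, as it is estimated separately):

```python
def mmgbsa_delta_g(complex_e, receptor_e, ligand_e):
    """dG_bind ~= G(complex) - G(receptor) - G(ligand), where each
    G = E_MM + G_GB + G_SA (kcal/mol); the -T*S term is handled separately."""
    g = lambda t: t["E_MM"] + t["G_GB"] + t["G_SA"]
    return g(complex_e) - g(receptor_e) - g(ligand_e)

# Toy per-species energies only, to show how the terms combine.
dg = mmgbsa_delta_g(
    {"E_MM": -120.0, "G_GB": 40.0, "G_SA": -5.0},
    {"E_MM": -70.0, "G_GB": 25.0, "G_SA": -3.0},
    {"E_MM": -30.0, "G_GB": 12.0, "G_SA": -2.0},
)
```

In practice each term is averaged over many MD snapshots, and the choice of atomic radii (Bondi vs. OPT1) enters through the G_GB dielectric boundary.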
Diane Uwacu, Abigail Ren, Shawna L. Thomas, N. Amato
Computational methods are commonly used to predict protein-ligand interactions. These methods typically search for regions with favorable energy that geometrically fit the ligand and then rank them as potential binding sites. While this general strategy can provide good predictions in some cases, it does not do well when the binding site is not accessible to the ligand. In addition, recent research has shown that in some cases protein access tunnels play a major role in the activity and stability of the protein's binding interactions. Hence, to fully understand the binding behavior of such proteins, it is imperative to identify and study their access tunnels. In this work, we present a motion planning algorithm that scores protein binding site accessibility for a particular ligand. This method can be used to screen ligand candidates for a protein by eliminating those that cannot access the binding site. The method was tested on two case studies: analyzing the effects of modifying a protein's access tunnels to increase activity and/or stability, and studying how a ligand inhibitor blocks access to the protein binding site.
Title: Using Guided Motion Planning to Study Binding Site Accessibility
DOI: https://doi.org/10.1145/3388440.3414707
Published: 2020-09-21
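As a toy stand-in for the accessibility question the planner above answers (can the ligand reach the binding site at all?), consider breadth-first search on a 2D occupancy grid. The real method plans in the protein's 3D space with a sampling-based motion planner, so everything below is illustrative only:

```python
from collections import deque

def accessible(grid, start, goal):
    """BFS on a 2D occupancy grid: True if a probe can reach `goal`
    from `start` through free cells (0 = free, 1 = blocked)."""
    rows, cols = len(grid), len(grid[0])
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False
```

A ligand candidate whose probe cannot reach the site under any tested orientation would be screened out, mirroring the elimination step described above.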
The universe of protein structures contains many dark regions beyond the reach of experimental techniques. Yet knowledge of the tertiary structure(s) that a protein employs to interact with partners in the cell is critical to understanding its biological function(s) and dysfunction(s). Great progress has been made in silico by methods that generate structures as part of an optimization. Recently, generative models based on neural networks have debuted for generating protein structures, but such work is typically limited to showing that some generated structures are credible. In this paper, we go beyond this objective. We design variational autoencoders and evaluate whether they can replace existing, established methods, comparing various architectures via rigorous metrics against the popular Rosetta framework. The presented results are promising and show that, once seeded with sufficient physically-realistic structures, variational autoencoders are efficient models for generating realistic tertiary structures.
Title: Variational Autoencoders for Protein Structure Prediction
Authors: F. Alam, Amarda Shehu
DOI: https://doi.org/10.1145/3388440.3412471
Published: 2020-09-21
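Two ingredients distinguish the variational autoencoders above from plain autoencoders: sampling the latent code via the reparameterization trick, and a KL term that pulls the latent distribution toward a standard normal. A per-dimension sketch in pure Python (encoder/decoder networks are omitted; this is generic VAE machinery, not the paper's specific architecture):

```python
import math
import random

def reparameterize(mu, log_var):
    """z = mu + sigma * eps: the trick that lets a VAE backpropagate
    through the sampling of its latent code."""
    eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL(q(z|x) || N(0,1)) for one diagonal-Gaussian latent dimension;
    summed over dimensions, this regularizes the VAE latent space."""
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)
```

Training minimizes reconstruction error plus the summed KL term; sampling the trained decoder at random latent codes then yields new candidate structures.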