E. Thompson, Tess Thackray, Cecilia Kalthoff, Ryan Rapoport, F. Jagodzinski
A mutation to the amino acid sequence of a protein can cause a biomolecule to be resistant to the intended effects of a drug. Assessing changes in a drug's efficacy in response to mutations via mutagenesis wet-lab experiments is prohibitively time-consuming for even a single point mutation, let alone for all possible mutations. Existing approaches for inferring mutation-induced drug resistance are available, but all of them reason about mutations of residues at or very near the protein-drug interface. However, there are examples of mutations far from the region where the ligand binds that nonetheless render a protein resistant to the effects of the drug. We present a proof-of-concept computational pipeline that generates in silico the set of all possible single point mutations in a protein-ligand complex. We assess drug resistance via energy profiles for short runs of molecular dynamics of the mutants. We assess the impact of mutations far from the protein-drug interface and provide case studies exploring how amino acid substitutions both near and far from where the ligand interacts with a protein target have a stabilizing or destabilizing effect on the protein-drug complex.
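The exhaustive enumeration step described in this abstract can be illustrated at the sequence level. The following is a minimal sketch, not the authors' pipeline: it only lists the 19 possible substitutions at each position of a sequence, and the function name `single_point_mutants` and the toy sequence are assumptions made for illustration. Building mutant structures and running molecular dynamics are outside its scope.

```python
# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_point_mutants(sequence):
    """Yield (label, mutant_sequence) for every single amino acid substitution.

    Labels follow the common wild-type/position/mutant style, e.g. "M1A",
    with positions numbered from 1.
    """
    for pos, wild_type in enumerate(sequence):
        for residue in AMINO_ACIDS:
            if residue == wild_type:
                continue  # skip the identity "mutation"
            label = f"{wild_type}{pos + 1}{residue}"
            yield label, sequence[:pos] + residue + sequence[pos + 1:]


# Hypothetical three-residue sequence, used only to show the call.
toy_sequence = "MKT"
mutants = dict(single_point_mutants(toy_sequence))
print(len(mutants))  # 19 substitutions for each of the 3 positions
```

The number of mutants grows linearly with sequence length (19 per residue), which is what makes an exhaustive in silico scan feasible where wet-lab mutagenesis is not.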
{"title":"Using Energy-Minimization Profiles to Measure Protein Resistance to Drugs","authors":"E. Thompson, Tess Thackray, Cecilia Kalthoff, Ryan Rapoport, F. Jagodzinski","doi":"10.1145/3388440.3414703","DOIUrl":"https://doi.org/10.1145/3388440.3414703","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
{"title":"Prefix/Suffix Variation in Retinoic Acid Response Elements","authors":"Y. Zhuang, Kara L. Cerveny, Anna M. Ritz","doi":"10.1145/3388440.3414914","DOIUrl":"https://doi.org/10.1145/3388440.3414914","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
The conformational space of proteins is complex and high dimensional, which makes its analysis a highly challenging task. Understanding the structure and dynamics of proteins is essential in order to understand their cellular function. It is often hard to experimentally characterize intermediate structures and conformational trajectories, due to the rapid dynamics of some proteins. Conformational pathways, which describe how proteins transition from one conformation to another as a result of a shift in conditions, are hard to describe experimentally. Computationally, it is a challenging problem as well, since physics-based simulations are time-consuming and often do not span sufficient time scales to capture a full pathway. In previous work, we combined evolutionary information or rigidity analysis obtained from proteins' sequence and structure with an efficient tree-based conformational search to elucidate the conformational trajectory of proteins. We incorporated backbone + C-β resolution and helped limit the search space by identifying mobile regions in a molecule. In this work, we use a hybrid algorithm that combines MC sampling and RRT*, a version of the Rapidly Exploring Random Trees (RRT) robotics-based method, to make the search more accurate and efficient and to produce smooth conformational pathways.
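To make the idea of pairing a tree-based search with Monte Carlo sampling concrete, here is a minimal sketch under strong simplifying assumptions. It grows a plain RRT (without the rewiring step that distinguishes RRT*) over a toy 2D energy surface and filters each extension with the standard Metropolis acceptance test. The energy function, parameter values, and all function names are hypothetical and are not taken from the paper.

```python
import math
import random


def energy(point):
    """Toy 2D double-well surface, a hypothetical stand-in for a molecular energy."""
    x, y = point
    return (x * x - 1.0) ** 2 + y * y


def metropolis_accept(delta_e, temperature, rng):
    """Metropolis test: always accept downhill moves, sometimes accept uphill ones."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def grow_tree(start, n_iter=500, step=0.1, temperature=0.5, seed=0):
    """Grow a tree from start; returns (nodes, parents) as parallel lists."""
    rng = random.Random(seed)
    nodes = [start]    # sampled "conformations" (here: 2D points)
    parents = [None]   # index of each node's parent; the root has none
    for _ in range(n_iter):
        target = (rng.uniform(-2, 2), rng.uniform(-2, 2))
        # Find the existing node closest to the random target.
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], target))
        near = nodes[i_near]
        dist = math.dist(near, target)
        if dist == 0:
            continue
        # Move at most one step from the nearest node toward the target.
        scale = min(step, dist) / dist
        new = (near[0] + scale * (target[0] - near[0]),
               near[1] + scale * (target[1] - near[1]))
        if metropolis_accept(energy(new) - energy(near), temperature, rng):
            nodes.append(new)
            parents.append(i_near)
    return nodes, parents


def path_to_root(index, nodes, parents):
    """Follow parent links back to the root and return the path root-first."""
    path = []
    while index is not None:
        path.append(nodes[index])
        index = parents[index]
    return path[::-1]


nodes, parents = grow_tree(start=(-1.0, 0.0))
pathway = path_to_root(len(nodes) - 1, nodes, parents)
```

The Metropolis filter biases tree growth toward low-energy regions while still allowing barrier crossings; a full RRT* would additionally reconnect nearby nodes to shorten paths.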
{"title":"Efficient Exploration of Protein Conformational Pathways using RRT* and MC","authors":"Fatemeh Afrasiabi, Nurit Haspel","doi":"10.1145/3388440.3414705","DOIUrl":"https://doi.org/10.1145/3388440.3414705","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
The increasing popularity of metagenomic sequencing has resulted in a plethora of available 16S rRNA and whole-genome sequence data. Microbes play an important role in the health and disease of humans, pets, and livestock. Characterizing such microbes and their relative abundances is important for identifying sample phenotypes such as disease. In the past, machine-learning-based methods have been applied to predict host disease status and overall health from taxonomic abundance profiles. Here we utilize deep neural network modeling with taxonomic profiles for faster, more precise, and more effective prediction of metagenomic sample phenotypes.
{"title":"Deep Neural Network Modeling for Phenotypic Prediction of Metagenomic Samples","authors":"Yassin Mreyoud, Tae-Hyuk Ahn","doi":"10.1145/3388440.3414921","DOIUrl":"https://doi.org/10.1145/3388440.3414921","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
R. J. Nowling, C. R. Beal, Scott J. Emrich, S. Behura, M. Halfon, M. Duman-Scheel
When reference genome assemblies are updated, the peaks from DNA enrichment assays such as ChIP-Seq and FAIRE-Seq need to be called again using the new genome assembly. PeakMatcher is an open-source package that aids in validation by matching peaks across two genome assemblies, or within the same genome, using read alignments. PeakMatcher calculates recall and precision while also outputting lists of peak-to-peak matches. To match peaks across genome assemblies, PeakMatcher first finds all reads aligned to one genome that overlap with a given list of peaks. It then uses the read names to locate where those reads are aligned against a second genome. Lastly, all peaks called against the second genome that overlap with the aligned reads are found and output. PeakMatcher uses the peak-read-peak relationships to discover 1-to-1, 1-to-many, and many-to-many relationships. Overlap queries are performed with interval trees for maximum efficiency. We evaluated PeakMatcher on two data sets. The first data set was FAIRE-Seq (Formaldehyde-Assisted Isolation of Regulatory Elements Sequencing) of DNA isolated from embryos of the mosquito Aedes aegypti [2, 4]. We implemented a peak-calling pipeline and validated it on the older (highly fragmented) AaegL3 assembly [5]. PeakMatcher matched 92.9% (precision) of the 121,594 previously called peaks from [2, 4] with 89.4% (recall) of the 124,959 peaks called with our new pipeline. Next, we applied the peak-calling pipeline to call FAIRE peaks using the newer, chromosome-complete AaegL5 assembly [3]. PeakMatcher found matches for 14 of the 16 experimentally validated AaegL3 FAIRE peaks from [2, 4]. We validated the matches by comparing nearby genes across the genomes. Nearby genes were consistent for 11 of the 14 peaks; inconsistencies for at least two of the remaining peaks were clearly attributable to differences in assemblies. 
When applied to all of the peaks, PeakMatcher matched 78.8% (precision) of the 124,959 AaegL3 peaks with 76.7% (recall) of the 128,307 AaegL5 peaks. The second data set was STARR-Seq (Self-Transcribing Active Regulatory Region Sequencing) of Drosophila melanogaster DNA in S2 culture cells [1]. We called STARR peaks against two versions (dm3 and r5.53) of the D. melanogaster genome [6]. PeakMatcher matched 77.4% (precision) of the 4,195 dm3 peaks with 94.8% (recall) of the 3,114 r5.53 peaks. PeakMatcher and associated documentation are available on GitHub (https://github.com/rnowling/peak-matcher) under the open-source Apache Software License v2. PeakMatcher was written in Python 3 using the intervaltree library.
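The peak-read-peak matching described in this abstract can be sketched in a few lines. This is an illustrative reconstruction from the abstract only, not PeakMatcher's actual code or API: the function names and data layout are assumptions, all coordinates are invented toy values, and a plain linear scan stands in for the interval trees that the tool uses for efficiency.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals [start, end) overlap when each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def reads_per_peak(peaks, alignments):
    """Map each peak name to the names of reads whose alignment overlaps it.

    peaks:      {peak_name: (chrom, start, end)}
    alignments: {read_name: (chrom, start, end)}
    A linear scan is used for brevity; an interval tree would replace it at scale.
    """
    hits = defaultdict(set)
    for peak, (p_chrom, p_start, p_end) in peaks.items():
        for read, (r_chrom, r_start, r_end) in alignments.items():
            if p_chrom == r_chrom and overlaps(p_start, p_end, r_start, r_end):
                hits[peak].add(read)
    return hits


def match_peaks(peaks_a, aln_a, peaks_b, aln_b):
    """Return (peak_a, peak_b) pairs linked by at least one shared read name."""
    hits_a = reads_per_peak(peaks_a, aln_a)
    hits_b = reads_per_peak(peaks_b, aln_b)
    # Invert the second mapping: read name -> peaks in assembly B.
    read_to_b = defaultdict(set)
    for peak_b, reads in hits_b.items():
        for read in reads:
            read_to_b[read].add(peak_b)
    pairs = set()
    for peak_a, reads in hits_a.items():
        for read in reads:
            for peak_b in read_to_b.get(read, ()):
                pairs.add((peak_a, peak_b))
    return pairs


def matched_fraction(peaks, matched_names):
    """Fraction of peaks in one set that take part in at least one match."""
    return len(set(peaks) & set(matched_names)) / len(peaks)


# Invented toy data: the same reads aligned against two assemblies.
peaks_a = {"a1": ("chr1", 100, 200), "a2": ("chr1", 500, 600)}
aln_a = {"r1": ("chr1", 150, 186), "r2": ("chr1", 900, 936)}
peaks_b = {"b1": ("scaf7", 40, 140), "b2": ("scaf9", 0, 50)}
aln_b = {"r1": ("scaf7", 90, 126), "r2": ("scaf8", 10, 46)}

pairs = match_peaks(peaks_a, aln_a, peaks_b, aln_b)
frac_a = matched_fraction(peaks_a, [a for a, _ in pairs])
frac_b = matched_fraction(peaks_b, [b for _, b in pairs])
```

Following the abstract's usage, `frac_a` would play the role of the figure reported as precision and `frac_b` the figure reported as recall; that mapping is an interpretation of the text. Grouping the pairs by shared peaks would then expose the 1-to-1, 1-to-many, and many-to-many relationships.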
{"title":"PeakMatcher","authors":"R. J. Nowling, C. R. Beal, Scott J. Emrich, S. Behura, M. Halfon, M. Duman-Scheel","doi":"10.1145/3388440.3414907","DOIUrl":"https://doi.org/10.1145/3388440.3414907","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
A. Bagheri, T. K. Groenhof, W. B. Veldhuis, P. A. Jong, F. Asselbergs, D. Oberski
Electronic health records (EHRs) contain structured and unstructured data of significant clinical and research value. Various machine learning approaches have been developed to employ information in EHRs for risk prediction. The majority of these attempts, however, focus on structured EHR fields and lose the vast amount of information in the unstructured texts. Deep neural networks, on the other hand, have gained tremendous momentum in knowledge discovery from EHR texts, yet very few studies have used both free text and the structured information in EHRs for clinical prediction. To exploit the potential information captured in EHRs, in this study we propose MI-BiLSTM, a multimodal bidirectional long short-term memory-based framework for cardiovascular risk prediction that integrates medical texts and structured clinical information. The MI-BiLSTM framework concatenates word embeddings from x-ray reports with classical clinical predictors from the Second Manifestations of ARTerial disease (SMART) study [1], before passing them to a final fully connected neural network. In the experiments, by employing the proposed framework, we compared the performance of different deep neural network architectures on data of 5603 patients using 5-fold cross-validation. Evaluated on the SMART study, we demonstrate the clinical relevance of integrating text features and classical predictors for cardiovascular risk prediction for patients with manifest vascular disease or at high risk for cardiovascular disease. Our results show that the MI-BiLSTM framework using text data in addition to laboratory values outperforms deep learning models using only known clinical predictors. In the future, we will focus on expanding our multimodal framework to import knowledge from available medical ontologies to enhance the quality of clinical decision making in risk prediction models. 
An open-source implementation of the proposed framework is publicly available at https://github.com/bagheria/CardioRisk-TextMining
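The evaluation protocol mentioned in this abstract, 5-fold cross-validation over 5603 patients, can be sketched without any machine learning library. The following is a generic illustration of k-fold index splitting, not code from the linked repository; the function name `k_fold_indices` and the fixed seed are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Indices are shuffled once, then dealt round-robin into k folds so that
    fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 5603 is the cohort size stated in the abstract.
splits = list(k_fold_indices(5603, k=5))
fold_sizes = [len(test) for _, test in splits]
```

Each patient appears in exactly one test fold, so every model is scored on data it was not trained on, and the five scores can be averaged for comparison across architectures.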
{"title":"Multimodal Learning for Cardiovascular Risk Prediction using EHR Data","authors":"A. Bagheri, T. K. Groenhof, W. B. Veldhuis, P. A. Jong, F. Asselbergs, D. Oberski","doi":"10.1145/3388440.3414924","DOIUrl":"https://doi.org/10.1145/3388440.3414924","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-08-27"}
Alexander F. B. Carmichael, Deepayan Bhowmik, J. Baily, A. Brownlow, G. Gunn, A. Reeves
This paper proposes Ir-Man (Information Retrieval for Marine Animal Necropsies), a framework for retrieving discrete information from marine mammal post-mortem reports for statistical analysis. When a marine mammal is reported dead after stranding in Scotland, the carcass is examined by the Scottish Marine Animal Strandings Scheme (SMASS) to establish the circumstances of the animal's death. This involves the creation of a "post-mortem" (or necropsy) report, which systematically describes the body. These semi-structured reports record lesions (damage or abnormalities to anatomical regions) as well as other observations. Observations embedded within these texts are used to determine cause of death. While a cause of death is recorded separately, many other descriptions may be of pathological and epidemiological significance when aggregated and analysed collectively. As manual extraction of these descriptions is costly, time-consuming and at times erroneous, there is a need for an automated information retrieval mechanism, which is a non-trivial task given the wide variety of possible descriptions, pathologies and species. The Ir-Man framework consists of a new ontology, a lexicon of observations and anatomical terms, and an entity relation engine for information retrieval and statistics generation from a pool of necropsy reports. We demonstrate the effectiveness of our framework by creating a rule-based binary classifier for identifying bottlenose dolphin attacks (BDA) in harbour porpoise gross pathology reports, achieving an accuracy of 83.4%.
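A rule-based binary classifier over report text, as mentioned at the end of this abstract, can be as simple as counting lexicon hits. The sketch below is a generic illustration and does not reproduce the Ir-Man ontology, lexicon, or rules: the terms, the `min_hits` threshold, and the three example reports are all invented for demonstration.

```python
# Hypothetical observation terms; not taken from the Ir-Man lexicon.
BDA_TERMS = ("rake marks", "rib fracture", "blunt trauma")


def predict_bda(report_text, terms=BDA_TERMS, min_hits=2):
    """Flag a report as a suspected BDA when at least min_hits terms occur in it."""
    text = report_text.lower()
    hits = sum(term in text for term in terms)
    return hits >= min_hits


def accuracy(predictions, labels):
    """Fraction of predictions that agree with the reference labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Invented toy reports with invented labels, only to exercise the functions.
reports = [
    ("Parallel rake marks on the left flank; multiple rib fractures.", True),
    ("Emaciated animal, heavy parasite burden in the lungs.", False),
    ("Single rib fracture, no skin lesions.", False),
]
predictions = [predict_bda(text) for text, _ in reports]
toy_accuracy = accuracy(predictions, [label for _, label in reports])
```

Substring matching is the weakest possible rule; an entity relation engine of the kind the abstract describes would instead tie each observation to the anatomical region it refers to before applying rules.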
{"title":"Ir-Man: An Information Retrieval Framework for Marine Animal Necropsy Analysis","authors":"Alexander F. B. Carmichael, Deepayan Bhowmik, J. Baily, A. Brownlow, G. Gunn, A. Reeves","doi":"10.1145/3388440.3412417","DOIUrl":"https://doi.org/10.1145/3388440.3412417","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-08-07"}
Hao-Ren Yao, D. Chang, O. Frieder, Wendy Huang, I. Liang, C. Hung
We present an end-to-end, interpretable, deep-learning architecture to learn a graph kernel that predicts the outcome of chronic disease drug prescription. This is achieved through a deep metric learning collaborative with a Support Vector Machine objective using a graphical representation of Electronic Health Records. We formulate the predictive model as a binary graph classification problem with an adaptive learned graph kernel through novel cross-global attention node matching between patient graphs, simultaneously computing on multiple graphs without training pair or triplet generation. Results using the Taiwanese National Health Insurance Research Database demonstrate that our approach outperforms current state-of-the-art models in terms of both accuracy and interpretability.
{"title":"Cross-Global Attention Graph Kernel Network Prediction of Drug Prescription","authors":"Hao-Ren Yao, D. Chang, O. Frieder, Wendy Huang, I. Liang, C. Hung","doi":"10.1145/3388440.3412459","DOIUrl":"https://doi.org/10.1145/3388440.3412459","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-08-04"}
The nucleotide sequence representation of DNA can be inadequate for resolving protein-DNA binding sites and regulatory substrates, such as those involved in gene expression and horizontal gene transfer. Considering that sequence-like representations are algorithmically very useful, here we fused over 60 currently available DNA physicochemical and conformational variables into compact structural representations that can encode single DNA binding sites to whole regulatory regions. We find that the main structural components reflect key properties of protein-DNA interactions and can be condensed to the amount of information found in a single nucleotide position. The most accurate structural representations compress functional DNA sequence variants by 30% to 50%, as each instance encodes from tens to thousands of sequences. We show that a structural distance function discriminates among groups of DNA substrates more accurately than nucleotide sequence-based metrics. As this opens up a variety of implementation possibilities, we develop and test a distance-based alignment algorithm, demonstrating the potential of using the structural representations to enhance sequence-based algorithms. Due to the bias of most current bioinformatic methods to nucleotide sequence representations, it is possible that considerable performance increases might still be achievable with such solutions.
{"title":"Structural representations of DNA regulatory substrates can enhance sequence-based algorithms by associating functional sequence variants","authors":"Jan Zrimec","doi":"10.1145/3388440.3412482","DOIUrl":"https://doi.org/10.1145/3388440.3412482","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-07-29"}
Niharika Pandala, C. Cole, D. McFarland, A. Nag, H. Valafar
Recent events leading to the worldwide pandemic of COVID-19 have demonstrated the effective use of genomic sequencing technologies to establish the genetic sequence of this virus. In contrast, the pandemic has also exposed the lack of computational approaches for rapidly understanding the molecular basis of this infection. Here we present an integrated approach to the study of the nsp1 protein in SARS-CoV-1, which plays an essential role in maintaining the expression of viral proteins and in disabling host protein expression, also known as the host shutoff mechanism. We present three independent methods of evaluating two potential binding sites speculated to participate in host shutoff by nsp1. We combined results from computed models of nsp1 with deep mining of all existing protein structures (using PDBMine) and binding site recognition (using msTALI) to examine the two sites, consisting of residues 55–59 and 73–80. Based on our preliminary results, we conclude that residues 73–80 appear to form the region that facilitates the critical initial steps in the function of nsp1. Given the 90% sequence identity between nsp1 from SARS-CoV-1 and SARS-CoV-2, we conjecture that the same critical initiation step occurs in the function of SARS-CoV-2 nsp1.
{"title":"A Preliminary Investigation in the Molecular Basis of Host Shutoff Mechanism in SARS-CoV","authors":"Niharika Pandala, C. Cole, D. McFarland, A. Nag, H. Valafar","doi":"10.1145/3388440.3412483","DOIUrl":"https://doi.org/10.1145/3388440.3412483","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-23","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"133567970","RegionNum":0,"Total":0}