Fatemeh Rouzbeh, A. Grama, Paul M. Griffin, Mohammad Adibuzzaman
The proliferation of sensor technologies and advances in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, designing a system architecture that achieves high performance in terms of parallelization, query processing time, and aggregation of heterogeneous data types (e.g., time series, images, and structured data), while keeping scientific research reproducible, remains a major challenge. This is especially true for health sciences research, where systems must be i) easy to use, with the flexibility to manipulate data at the most granular level, ii) agnostic of the programming-language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature on big data systems for scientific research in the health sciences and identify gaps in the current system landscape. We propose a novel architecture for a software-hardware-data ecosystem built on open-source technologies such as Apache Hadoop, Kubernetes, and JupyterHub in a distributed environment. We also evaluate the system using a large clinical dataset of 69M patients.
"Collaborative Cloud Computing Framework for Health Data with Open Source Technologies." DOI: 10.1145/3388440.3412460. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-07-20.
S. Legrand, A. Scheinberg, A. F. Tillack, M. Thavappiragasam, J. Vermaas, Rupesh Agarwal, J. Larkin, D. Poole, Diogo Santos-Martins, Leonardo Solis-Vasquez, Andreas Koch, Stefano Forli, Oscar R. Hernandez, Jeremy C. Smith, A. Sedova
Protein-ligand docking is an in silico tool used to screen potential drug compounds for their ability to bind to a given protein receptor within a drug-discovery campaign. Experimental drug screening is expensive and time-consuming, and it is desirable to carry out large-scale docking calculations in a high-throughput manner to narrow the experimental search space. Few existing computational docking tools were designed with high-performance computing in mind. Optimizations that maximize use of the computational resources available at leadership-class computing facilities therefore enable these facilities to be leveraged for drug discovery. Here we present the porting, optimization, and validation of the AutoDock-GPU program for the Summit supercomputer, and its application to initial compound-screening efforts targeting proteins of the SARS-CoV-2 virus responsible for the COVID-19 pandemic.
"GPU-Accelerated Drug Discovery with Docking on the Summit Supercomputer: Porting, Optimization, and Application to COVID-19 Research." DOI: 10.1145/3388440.3412472. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-07-06.
COVID-19 (2019 Novel Coronavirus) has resulted in an ongoing pandemic and, as of 26 July 2020, has caused more than 15.7 million cases and over 640,000 deaths. The highly dynamic and rapidly evolving situation with COVID-19 has made it difficult to access accurate, on-demand information regarding the disease. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or to post questions and seek answers from other members. However, due to the nature of such sites, there are always a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately. With the advancements in the field of natural language processing, particularly in the domain of language models, it has become possible to design chatbots that can automatically answer consumer questions. However, such models are rarely applied and evaluated in the healthcare domain to meet information needs with accurate and up-to-date healthcare data. In this paper, we propose to apply a language model for automatically answering questions related to COVID-19 and qualitatively evaluate the generated responses. We utilized the GPT-2 language model and applied transfer learning to retrain it on the COVID-19 Open Research Dataset (CORD-19) corpus. In order to improve the quality of the generated responses, we applied four different approaches, namely tf-idf (Term Frequency - Inverse Document Frequency), Bidirectional Encoder Representations from Transformers (BERT), Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT), and Universal Sentence Encoder (USE), to filter and retain relevant sentences in the responses. In the performance evaluation step, we asked two medical experts to rate the responses. We found that BERT and BioBERT, on average, outperform both tf-idf and USE in relevance-based sentence filtering tasks.
Additionally, based on the chatbot, we created a user-friendly interactive web application to be hosted online and made its source code available free of charge to anyone interested in running it locally, online, or just for experimental purposes. Overall, our work has yielded significant results in both designing a chatbot that produces high-quality responses to COVID-19-related questions and comparing several embedding generation techniques.
David Oniani and Yanshan Wang. "A Qualitative Evaluation of Language Models on Automatic Question-Answering for COVID-19." DOI: 10.1145/3388440.3412413. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-06-19.
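Of the four filtering approaches, tf-idf is simple enough to sketch in a few lines of plain Python (the toy question and candidate sentences below are invented; the actual system filtered CORD-19-trained responses, and the BERT/BioBERT/USE variants use learned sentence embeddings instead of term weights):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute tf-idf weight dicts for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_sentences(question, sentences, keep=2):
    """Rank candidate response sentences by tf-idf similarity to the question
    and retain the top `keep` sentences."""
    docs = [question.lower().split()] + [s.lower().split() for s in sentences]
    vecs = tfidf_vectors(docs)
    scored = sorted(zip(sentences, vecs[1:]),
                    key=lambda p: cosine(vecs[0], p[1]), reverse=True)
    return [s for s, _ in scored[:keep]]
```

The embedding-based variants follow the same retrieve-and-rank pattern, only with dense vectors in place of the term-weight dicts.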
D. Lu, S. Bauer, V. Neubert, L. Costard, F. Rosenow, J. Triesch
Epilepsy is a common neurological disorder characterized by recurrent seizures accompanied by excessive synchronous brain activity. The process of structural and functional brain alterations leading to increased seizure susceptibility and eventually spontaneous seizures is called epileptogenesis (EPG) and can span months or even years. Detecting and monitoring the progression of EPG could allow for targeted early interventions that could slow down disease progression or even halt its development. Here, we propose an approach for staging EPG using deep neural networks and identify potential electroencephalography (EEG) biomarkers to distinguish different phases of EPG. Specifically, continuous intracranial EEG recordings were collected from a rodent model in which epilepsy is induced by electrical perforant pathway stimulation (PPS). A deep neural network (DNN) is trained to distinguish EEG signals from before stimulation (baseline), shortly after the PPS, and long after the PPS but before the first spontaneous seizure (FSS). Experimental results show that our proposed method can classify EEG signals from the three phases with average AUC (area under the curve) values of 0.93, 0.89, and 0.86.
"Staging Epileptogenesis with Deep Neural Networks." DOI: 10.1145/3388440.3412480. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-06-17.
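For reference, the AUC metric reported for each binary phase-vs-phase comparison can be computed directly from classifier scores with the rank (Mann-Whitney) formulation. This is a generic sketch, not the authors' evaluation code:

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation: the
    probability that a randomly chosen positive example is scored
    above a randomly chosen negative one (ties count as 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranker yields 1.0, a random one about 0.5, which is the scale on which the reported 0.93/0.89/0.86 values sit.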
We study the target control problem of asynchronous Boolean networks, to identify a set of nodes, the perturbation of which can drive the dynamics of the network from any initial state to the desired steady state (or attractor). We are particularly interested in temporary perturbations, which are applied for sufficient time and then released to retrieve the original dynamics. Temporary perturbations have the apparent advantage of averting unforeseen consequences, which might be induced by permanent perturbations. Despite the infamous state-space explosion problem, in this work, we develop an efficient method to compute the temporary target control for a given target attractor of a Boolean network. We apply our method to a number of real-life biological networks and compare its performance with the stable motif-based control method to demonstrate its efficacy and efficiency.
Cui Su and Jun Pang. "A Dynamics-based Approach for the Target Control of Boolean Networks." DOI: 10.1145/3388440.3412464. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-06-03.
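The asynchronous update semantics underlying this control problem can be sketched on a toy network (the two-node rules below are invented for illustration; the paper's method additionally computes which temporary node perturbations steer the dynamics between attractors):

```python
from itertools import product

# A toy 2-node asynchronous Boolean network (invented for illustration;
# real applications use published biological network models).
# Rules: x0' = x1, x1' = x0, so (0,0) and (1,1) are the steady states.
rules = [lambda s: s[1], lambda s: s[0]]

def async_successors(state):
    """Asynchronous update: each step changes the value of at most one node."""
    succ = set()
    for i, f in enumerate(rules):
        v = int(f(state))
        if v != state[i]:
            succ.add(state[:i] + (v,) + state[i + 1:])
    return succ or {state}  # no enabled update -> fixed point (self-loop)

def fixed_points(n):
    """Enumerate steady states: states whose only successor is themselves."""
    return [s for s in product((0, 1), repeat=n)
            if async_successors(s) == {s}]
```

Target control then asks which node values must be held temporarily so that, from any state, the asynchronous dynamics can only reach the desired attractor.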
Tengel Ekrem Skar, Einar J. Holsbø, K. Svendsen, L. A. Bongo
Population-scale drug prescription data linked with adverse drug reaction (ADR) data support the fitting of models large enough to detect drug-use and ADR patterns that are not detectable using traditional methods on smaller datasets. However, detecting ADR patterns in large datasets requires tools for scalable data processing, machine learning for data analysis, and interactive visualization. To our knowledge, no existing pharmacoepidemiology tool supports all three requirements. We have therefore created a tool for interactive exploration of patterns in prescription datasets with millions of samples. We use Spark to preprocess the data for machine learning and for analyses using SQL queries. We have implemented models in Keras and the scikit-learn framework. The model results are visualized and interpreted using live Python coding in Jupyter. We apply our tool to explore a dataset of 384 million prescriptions from the Norwegian Prescription Database, combined with 62 million prescriptions for hospitalized elderly patients. We preprocess the data in two minutes, train models in seconds, and plot the results in milliseconds. Our results show the value of combining computational power, short computation times, and ease of use for the analysis of population-scale pharmacoepidemiology datasets.
"Interactive exploration of population scale pharmacoepidemiology datasets." DOI: 10.1145/3388440.3414862. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-05-20.
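The SQL-query analysis pattern such a tool supports can be illustrated with a minimal stand-in, here using Python's built-in sqlite3 in place of Spark SQL (the schema and rows are invented; the study used Norwegian Prescription Database records):

```python
import sqlite3

# In-memory table of prescription records: one row per dispensed drug.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prescriptions (patient_id INT, atc_code TEXT, year INT)")
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?)",
    [(1, "N02BE01", 2015), (1, "C07AB02", 2015), (2, "N02BE01", 2016)],
)

# Count distinct patients per drug code -- a typical aggregation step
# before fitting models on the resulting features.
rows = conn.execute(
    "SELECT atc_code, COUNT(DISTINCT patient_id) AS n_patients "
    "FROM prescriptions GROUP BY atc_code ORDER BY n_patients DESC"
).fetchall()
```

On population-scale data the same query runs as Spark SQL over a distributed DataFrame; the aggregation logic is unchanged.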
MOTIVATION One of the main challenges in applying graph convolutional neural networks to gene-interaction data is the lack of understanding of the vector space to which they belong, together with the inherent difficulty of representing those interactions in spaces of significantly lower dimension, viz., Euclidean spaces. The challenge becomes more pronounced when dealing with various types of heterogeneous data. We introduce a systematic, generalized method, called iSOM-GSN, to transform high-dimensional "multi-omic" data onto a two-dimensional grid. Afterwards, we apply a convolutional neural network to predict disease states of various types. Based on the idea of Kohonen's self-organizing map, we generate a two-dimensional grid for each sample for a given set of genes that represent a gene similarity network. RESULTS We tested the model on predicting breast and prostate cancer using gene expression, DNA methylation, and copy number alteration data. Prediction accuracies in the 94-98% range were obtained for tumor stages of breast cancer and calculated Gleason scores of prostate cancer, with just 14 input genes in both cases. The scheme not only yields nearly perfect classification accuracy, but also provides an enhanced scheme for representation learning, visualization, dimensionality reduction, and interpretation of multi-omic data. AVAILABILITY The source code and sample data are available via a Github project at https://github.com/NaziaFatima/iSOM_GSN. SUPPLEMENTARY INFORMATION Supplementary figures and data availability are in the Supplementary Material file.
Nazia Fatima and L. Rueda. "iSOM-GSN: An Integrative Approach for Transforming Multi-omic Data into Gene Similarity Networks via Self-organizing Maps." DOI: 10.1145/3388440.3414206. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-05-14.
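The Kohonen self-organizing map at the core of iSOM-GSN can be sketched in plain Python. This minimal version trains a 1-D map on 2-D points, whereas iSOM-GSN maps high-dimensional multi-omic similarity data onto a 2-D grid; all parameters and names here are illustrative:

```python
import math
import random

def train_som(data, grid=3, epochs=30, lr=0.5, seed=0):
    """Minimal 1-D self-organizing map (Kohonen) sketch: each step moves
    the best-matching unit (BMU) and its grid neighbours toward the input,
    with a neighbourhood radius and learning rate that shrink over time."""
    rng = random.Random(seed)
    w = [[rng.random(), rng.random()] for _ in range(grid)]
    for e in range(epochs):
        sigma = max(grid / 2.0 * (1 - e / epochs), 0.5)  # neighbourhood radius
        alpha = lr * (1 - e / epochs)                    # learning rate decay
        for x in data:
            bmu = min(range(grid),
                      key=lambda i: sum((w[i][d] - x[d]) ** 2 for d in range(2)))
            for i in range(grid):
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(2):
                    w[i][d] += alpha * h * (x[d] - w[i][d])
    return w
```

After training, each sample's BMU coordinates give its position on the grid, which is the image-like representation the downstream convolutional network consumes.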
Lorraine A. K. Ayad, P. Charalampopoulos, S. Pissis
Finding repetitive nucleic acid elements is a crucial step in many sequence analysis tasks. These include the challenging task of sequence assembly, the linkage of repeats to genetic disorders, and the identification of gene transfer. The most widely-used tool for finding repeats de novo is REPuter [2]. REPuter relies on extending maximal repeated pairs in order to enumerate all maximal k-mismatch repeats. Unfortunately, the number of these pairs can be quadratic in n, the length of the input sequence, and thus greedy heuristics are applied by its successor Vmatch to speed up the extension process. In this talk, we will introduce the concept of supermaximal k-mismatch repeats, whose number is linear in n, and capture all maximal k-mismatch repeats: every maximal k-mismatch repeat is a substring of some supermaximal k-mismatch repeat. We will present SMART, a tool based on recent algorithmic advances implemented in C++ to compute supermaximal k-mismatch repeats directly. We will also show that the elements SMART outputs are statistically much more significant than the output of the state-of-the-art tools. The full paper describing SMART appeared as [1].
"SMART: SuperMaximal approximate repeats tool." DOI: 10.1145/3388440.3414210. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2019-12-24.
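For exact matches (k = 0), the supermaximal-repeat definition can be checked with a naive sketch, useful for understanding the concept on short strings (SMART itself uses efficient string algorithms and handles k mismatches):

```python
def count_occ(s, p):
    """Number of (possibly overlapping) occurrences of p in s."""
    return sum(1 for i in range(len(s) - len(p) + 1) if s.startswith(p, i))

def supermaximal_repeats(s):
    """Naive exact (k = 0) supermaximal repeats: substrings that occur at
    least twice in s and are not contained in any longer substring that
    also occurs at least twice. Every maximal repeat of s is a substring
    of one of these. (Cubic-time illustration only.)"""
    n = len(s)
    reps = {s[i:j] for i in range(n) for j in range(i + 1, n + 1)
            if count_occ(s, s[i:j]) >= 2}
    return sorted(r for r in reps
                  if not any(r != t and r in t for t in reps))
```

Because every maximal repeat is covered by some supermaximal one, reporting only the supermaximal set (linear in n) loses no repeat content while avoiding the quadratic blow-up of maximal repeated pairs.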
Molecule generation aims to design new molecules with specific chemical properties and to further optimize those properties. Following previous work, we encode molecules into continuous vectors in the latent space and then decode the embedding vectors back into molecules under the variational autoencoder (VAE) framework. We investigate the posterior collapse problem of the widely used RNN-based VAEs for molecule sequence generation. For the first time, we point out that the underestimated reconstruction loss of VAEs leads to posterior collapse, and we provide both analytical and experimental evidence to support this finding. To fix the problem and avoid posterior collapse, we propose an effective and efficient solution. Without bells and whistles, our method achieves state-of-the-art reconstruction accuracy and a competitive validity score on the ZINC 250K dataset. When generating 10,000 unique valid molecule sequences by sampling from the prior, JT-VAE takes 1450 seconds while our method needs only 9 seconds on a regular desktop machine.
{"title":"Re-balancing Variational Autoencoder Loss for Molecule Sequence Generation","authors":"Chao-chao Yan, Sheng Wang, Jinyu Yang, Tingyang Xu, Junzhou Huang","doi":"10.1145/3388440.3412458","DOIUrl":"https://doi.org/10.1145/3388440.3412458","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2019-10-01"}
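The abstract above attributes posterior collapse to an underestimated reconstruction loss. A minimal NumPy sketch of that imbalance, using hypothetical numbers (the `elbo_loss` helper, the 60-token length, and the KL value are illustrative assumptions, not the paper's method):

```python
import numpy as np

# Toy illustration of how an underestimated reconstruction term can trigger
# posterior collapse in an RNN-based VAE: if the per-token negative
# log-likelihood is averaged over sequence length instead of summed, the
# reconstruction term shrinks relative to the KL term, so the optimizer can
# minimize the loss simply by pushing KL -> 0 (a collapsed posterior).

def elbo_loss(token_nll, kl, reduce="sum"):
    """Negative ELBO for one sequence; token_nll holds per-token NLLs."""
    recon = token_nll.sum() if reduce == "sum" else token_nll.mean()
    return recon + kl

token_nll = np.full(60, 0.5)   # a 60-token SMILES string, 0.5 nats per token
kl = 10.0                      # KL(q(z|x) || p(z))

summed = elbo_loss(token_nll, kl, reduce="sum")     # 30.0 + 10.0 = 40.0
averaged = elbo_loss(token_nll, kl, reduce="mean")  # 0.5 + 10.0 = 10.5
# Under the averaged loss the KL term dominates (10.0 of 10.5), so driving
# KL to zero is the easiest descent direction -- the reconstruction signal
# is effectively 60x weaker than under the summed loss.
```

Re-balancing the two terms (e.g., restoring the summed reconstruction loss or re-weighting it) removes this incentive to collapse.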
Pharmaceutical drug design is a difficult and costly endeavor. Computational drug design has the potential to save time and money by providing a better starting point for new drugs, with an initial computational evaluation already completed. We propose a new application of Generative Adversarial Networks (GANs), called GANDALF (Generative Adversarial Network Drug-tArget Ligand Fructifier), to design new peptides for protein targets. Other GAN-based methods for computational drug design can generate only small molecules, not peptides. GANDALF also incorporates data not used by other methods, such as active atoms, which allows us to precisely identify where interaction occurs between a protein and a ligand. Our method goes further than comparable methods by generating a peptide and predicting its binding affinity. We compare results for a protein of interest, PD-1, using GANDALF, PepComposer, and FDA-approved drugs. We find that our method produces a peptide comparable to the FDA-approved drugs and better than that of PepComposer. Further work will improve the GANDALF system by deepening the GAN architecture to improve the binding affinity and 3D fit of the generated peptides. We are also exploring uses of transfer learning.
{"title":"GANDALF","authors":"Allison M. Rossetto, Wenjin Zhou","doi":"10.1145/3307339.3342183","DOIUrl":"https://doi.org/10.1145/3307339.3342183","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2019-09-04"}
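The GANDALF abstract describes a generator that produces peptides and a discriminator that judges them. A minimal, untrained NumPy sketch of such a sequence-GAN forward pass (all dimensions, weights, and function names are illustrative assumptions; GANDALF's actual architecture, active-atom features, and binding-affinity prediction are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
AA, PEP_LEN, NOISE_DIM = 20, 12, 16  # amino acids, peptide length, noise size

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Generator: noise vector -> "soft" one-hot peptide that a discriminator
# could back-propagate through (random, untrained weights for illustration).
W_gen = rng.normal(scale=0.1, size=(NOISE_DIM, PEP_LEN * AA))

def generate(z):
    return softmax((z @ W_gen).reshape(-1, PEP_LEN, AA))

# Discriminator: soft peptide -> single real/fake score per sequence.
W_disc = rng.normal(scale=0.1, size=(PEP_LEN * AA, 1))

def discriminate(x):
    return x.reshape(x.shape[0], -1) @ W_disc

z = rng.normal(size=(4, NOISE_DIM))
fake = generate(z)               # shape (4, 12, 20); each position sums to 1
scores = discriminate(fake)      # shape (4, 1) real/fake scores
peptides = fake.argmax(axis=-1)  # integer amino-acid index per position
```

In adversarial training, the generator's weights would be updated to raise the discriminator's scores on `fake` while the discriminator learns to separate generated peptides from real binders.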