Fatemeh Rouzbeh, A. Grama, Paul M. Griffin, Mohammad Adibuzzaman
The proliferation of sensor technologies and advances in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, designing a system architecture that achieves high performance in terms of parallelization, query processing time, and aggregation of heterogeneous data types (e.g., time series, images, and structured data), while keeping scientific research reproducible, remains a major challenge. This is especially true for health sciences research, where systems must be i) easy to use, with the flexibility to manipulate data at the most granular level, ii) agnostic of the programming-language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature on big data systems for scientific research in the health sciences and identify gaps in the current system landscape. We propose a novel architecture for a software-hardware-data ecosystem built on open-source technologies such as Apache Hadoop, Kubernetes, and JupyterHub in a distributed environment. We also evaluate the system using a large clinical dataset of 69M patients.
"Collaborative Cloud Computing Framework for Health Data with Open Source Technologies." DOI: 10.1145/3388440.3412460. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-07-20.
S. Legrand, A. Scheinberg, A. F. Tillack, M. Thavappiragasam, J. Vermaas, Rupesh Agarwal, J. Larkin, D. Poole, Diogo Santos-Martins, Leonardo Solis-Vasquez, Andreas Koch, Stefano Forli, Oscar R. Hernandez, Jeremy C. Smith, A. Sedova
Protein-ligand docking is an in silico tool used to screen potential drug compounds for their ability to bind to a given protein receptor within a drug-discovery campaign. Experimental drug screening is expensive and time-consuming, and it is desirable to carry out large-scale docking calculations in a high-throughput manner to narrow the experimental search space. Few existing computational docking tools were designed with high-performance computing in mind. Optimizations that maximize use of the computational resources available at leadership-class computing facilities therefore enable these facilities to be leveraged for drug discovery. Here we present the porting, optimization, and validation of the AutoDock-GPU program for the Summit supercomputer, and its application to initial compound-screening efforts targeting proteins of the SARS-CoV-2 virus responsible for the COVID-19 pandemic.
"GPU-Accelerated Drug Discovery with Docking on the Summit Supercomputer: Porting, Optimization, and Application to COVID-19 Research." DOI: 10.1145/3388440.3412472. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-07-06.
COVID-19 (2019 Novel Coronavirus) has resulted in an ongoing pandemic and, as of 26 July 2020, has caused more than 15.7 million cases and over 640,000 deaths. The highly dynamic and rapidly evolving situation with COVID-19 has made it difficult to access accurate, on-demand information regarding the disease. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or to post questions and seek answers from other members. However, due to the nature of such sites, there are always a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately. With the advancements in the field of natural language processing, particularly in the domain of language models, it has become possible to design chatbots that can automatically answer consumer questions. However, such models are rarely applied and evaluated in the healthcare domain to meet information needs with accurate and up-to-date healthcare data. In this paper, we propose to apply a language model for automatically answering questions related to COVID-19 and qualitatively evaluate the generated responses. We utilized the GPT-2 language model and applied transfer learning to retrain it on the COVID-19 Open Research Dataset (CORD-19) corpus. In order to improve the quality of the generated responses, we applied four different approaches, namely tf-idf (Term Frequency - Inverse Document Frequency), Bidirectional Encoder Representations from Transformers (BERT), Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT), and Universal Sentence Encoder (USE), to filter and retain relevant sentences in the responses. In the performance evaluation step, we asked two medical experts to rate the responses. We found that BERT and BioBERT, on average, outperform both tf-idf and USE in relevance-based sentence filtering tasks.
Additionally, based on the chatbot, we created a user-friendly interactive web application to be hosted online and made its source code available free of charge to anyone interested in running it locally, online, or just for experimental purposes. Overall, our work has yielded significant results in both designing a chatbot that produces high-quality responses to COVID-19-related questions and comparing several embedding generation techniques.
David Oniani and Yanshan Wang. "A Qualitative Evaluation of Language Models on Automatic Question-Answering for COVID-19." DOI: 10.1145/3388440.3412413. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-06-19.
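Of the four filtering approaches, tf-idf is simple enough to sketch in a few lines of plain Python (the toy question and candidate sentences below are invented; the actual system filtered CORD-19-trained responses, and the BERT/BioBERT/USE variants use learned sentence embeddings instead of term weights):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute tf-idf weight dicts for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_sentences(question, sentences, keep=2):
    """Rank candidate response sentences by tf-idf similarity to the question
    and retain the top `keep` sentences."""
    docs = [question.lower().split()] + [s.lower().split() for s in sentences]
    vecs = tfidf_vectors(docs)
    scored = sorted(zip(sentences, vecs[1:]),
                    key=lambda p: cosine(vecs[0], p[1]), reverse=True)
    return [s for s, _ in scored[:keep]]
```

The embedding-based variants follow the same retrieve-and-rank pattern, only with dense vectors in place of the term-weight dicts.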
D. Lu, S. Bauer, V. Neubert, L. Costard, F. Rosenow, J. Triesch
Epilepsy is a common neurological disorder characterized by recurrent seizures accompanied by excessive synchronous brain activity. The process of structural and functional brain alterations leading to increased seizure susceptibility and eventually spontaneous seizures is called epileptogenesis (EPG) and can span months or even years. Detecting and monitoring the progression of EPG could allow for targeted early interventions that could slow down disease progression or even halt its development. Here, we propose an approach for staging EPG using deep neural networks and identify potential electroencephalography (EEG) biomarkers to distinguish different phases of EPG. Specifically, continuous intracranial EEG recordings were collected from a rodent model in which epilepsy is induced by electrical perforant pathway stimulation (PPS). A deep neural network (DNN) is trained to distinguish EEG signals from before stimulation (baseline), shortly after the PPS, and long after the PPS but before the first spontaneous seizure (FSS). Experimental results show that our proposed method can classify EEG signals from the three phases with average AUC (area under the curve) values of 0.93, 0.89, and 0.86.
"Staging Epileptogenesis with Deep Neural Networks." DOI: 10.1145/3388440.3412480. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-06-17.
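For reference, the AUC metric reported for each binary phase-vs-phase comparison can be computed directly from classifier scores with the rank (Mann-Whitney) formulation. This is a generic sketch, not the authors' evaluation code:

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation: the
    probability that a randomly chosen positive example is scored
    above a randomly chosen negative one (ties count as 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranker yields 1.0, a random one about 0.5, which is the scale on which the reported 0.93/0.89/0.86 values sit.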
We study the target control problem of asynchronous Boolean networks, to identify a set of nodes, the perturbation of which can drive the dynamics of the network from any initial state to the desired steady state (or attractor). We are particularly interested in temporary perturbations, which are applied for sufficient time and then released to retrieve the original dynamics. Temporary perturbations have the apparent advantage of averting unforeseen consequences, which might be induced by permanent perturbations. Despite the infamous state-space explosion problem, in this work, we develop an efficient method to compute the temporary target control for a given target attractor of a Boolean network. We apply our method to a number of real-life biological networks and compare its performance with the stable motif-based control method to demonstrate its efficacy and efficiency.
Cui Su and Jun Pang. "A Dynamics-based Approach for the Target Control of Boolean Networks." DOI: 10.1145/3388440.3412464. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-06-03.
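The asynchronous update semantics underlying this control problem can be sketched on a toy network (the two-node rules below are invented for illustration; the paper's method additionally computes which temporary node perturbations steer the dynamics between attractors):

```python
from itertools import product

# A toy 2-node asynchronous Boolean network (invented for illustration;
# real applications use published biological network models).
# Rules: x0' = x1, x1' = x0, so (0,0) and (1,1) are the steady states.
rules = [lambda s: s[1], lambda s: s[0]]

def async_successors(state):
    """Asynchronous update: each step changes the value of at most one node."""
    succ = set()
    for i, f in enumerate(rules):
        v = int(f(state))
        if v != state[i]:
            succ.add(state[:i] + (v,) + state[i + 1:])
    return succ or {state}  # no enabled update -> fixed point (self-loop)

def fixed_points(n):
    """Enumerate steady states: states whose only successor is themselves."""
    return [s for s in product((0, 1), repeat=n)
            if async_successors(s) == {s}]
```

Target control then asks which node values must be held temporarily so that, from any state, the asynchronous dynamics can only reach the desired attractor.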
Tengel Ekrem Skar, Einar J. Holsbø, K. Svendsen, L. A. Bongo
Population-scale drug prescription data linked with adverse drug reaction (ADR) data support the fitting of models large enough to detect drug-use and ADR patterns that are not detectable using traditional methods on smaller datasets. However, detecting ADR patterns in large datasets requires tools for scalable data processing, machine learning for data analysis, and interactive visualization. To our knowledge, no existing pharmacoepidemiology tool supports all three requirements. We have therefore created a tool for interactive exploration of patterns in prescription datasets with millions of samples. We use Spark to preprocess the data for machine learning and for analyses using SQL queries. We have implemented models in Keras and the scikit-learn framework. The model results are visualized and interpreted using live Python coding in Jupyter. We apply our tool to explore a dataset of 384 million prescriptions from the Norwegian Prescription Database, combined with 62 million prescriptions for hospitalized elderly patients. We preprocess the data in two minutes, train models in seconds, and plot the results in milliseconds. Our results show the value of combining computational power, short computation times, and ease of use for the analysis of population-scale pharmacoepidemiology datasets.
"Interactive exploration of population scale pharmacoepidemiology datasets." DOI: 10.1145/3388440.3414862. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-05-20.
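The SQL-query analysis pattern such a tool supports can be illustrated with a minimal stand-in, here using Python's built-in sqlite3 in place of Spark SQL (the schema and rows are invented; the study used Norwegian Prescription Database records):

```python
import sqlite3

# In-memory table of prescription records: one row per dispensed drug.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prescriptions (patient_id INT, atc_code TEXT, year INT)")
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?)",
    [(1, "N02BE01", 2015), (1, "C07AB02", 2015), (2, "N02BE01", 2016)],
)

# Count distinct patients per drug code -- a typical aggregation step
# before fitting models on the resulting features.
rows = conn.execute(
    "SELECT atc_code, COUNT(DISTINCT patient_id) AS n_patients "
    "FROM prescriptions GROUP BY atc_code ORDER BY n_patients DESC"
).fetchall()
```

On population-scale data the same query runs as Spark SQL over a distributed DataFrame; the aggregation logic is unchanged.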
MOTIVATION One of the main challenges in applying graph convolutional neural networks to gene-interaction data is the lack of understanding of the vector space to which they belong, together with the inherent difficulty of representing those interactions in spaces of significantly lower dimension, viz., Euclidean spaces. The challenge becomes more pronounced when dealing with various types of heterogeneous data. We introduce a systematic, generalized method, called iSOM-GSN, to transform high-dimensional "multi-omic" data onto a two-dimensional grid. Afterwards, we apply a convolutional neural network to predict disease states of various types. Based on the idea of Kohonen's self-organizing map, we generate a two-dimensional grid for each sample for a given set of genes that represent a gene similarity network. RESULTS We tested the model on predicting breast and prostate cancer using gene expression, DNA methylation, and copy number alteration data. Prediction accuracies in the 94-98% range were obtained for tumor stages of breast cancer and calculated Gleason scores of prostate cancer, with just 14 input genes in both cases. The scheme not only yields nearly perfect classification accuracy, but also provides an enhanced scheme for representation learning, visualization, dimensionality reduction, and interpretation of multi-omic data. AVAILABILITY The source code and sample data are available via a Github project at https://github.com/NaziaFatima/iSOM_GSN. SUPPLEMENTARY INFORMATION Supplementary figures and data availability are in the Supplementary Material file.
Nazia Fatima and L. Rueda. "iSOM-GSN: An Integrative Approach for Transforming Multi-omic Data into Gene Similarity Networks via Self-organizing Maps." DOI: 10.1145/3388440.3414206. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-05-14.
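The Kohonen self-organizing map at the core of iSOM-GSN can be sketched in plain Python. This minimal version trains a 1-D map on 2-D points, whereas iSOM-GSN maps high-dimensional multi-omic similarity data onto a 2-D grid; all parameters and names here are illustrative:

```python
import math
import random

def train_som(data, grid=3, epochs=30, lr=0.5, seed=0):
    """Minimal 1-D self-organizing map (Kohonen) sketch: each step moves
    the best-matching unit (BMU) and its grid neighbours toward the input,
    with a neighbourhood radius and learning rate that shrink over time."""
    rng = random.Random(seed)
    w = [[rng.random(), rng.random()] for _ in range(grid)]
    for e in range(epochs):
        sigma = max(grid / 2.0 * (1 - e / epochs), 0.5)  # neighbourhood radius
        alpha = lr * (1 - e / epochs)                    # learning rate decay
        for x in data:
            bmu = min(range(grid),
                      key=lambda i: sum((w[i][d] - x[d]) ** 2 for d in range(2)))
            for i in range(grid):
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(2):
                    w[i][d] += alpha * h * (x[d] - w[i][d])
    return w
```

After training, each sample's BMU coordinates give its position on the grid, which is the image-like representation the downstream convolutional network consumes.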
Lorraine A. K. Ayad, P. Charalampopoulos, S. Pissis
Finding repetitive nucleic acid elements is a crucial step in many sequence analysis tasks. These include the challenging task of sequence assembly, the linkage of repeats to genetic disorders, and the identification of gene transfer. The most widely-used tool for finding repeats de novo is REPuter [2]. REPuter relies on extending maximal repeated pairs in order to enumerate all maximal k-mismatch repeats. Unfortunately, the number of these pairs can be quadratic in n, the length of the input sequence, and thus greedy heuristics are applied by its successor Vmatch to speed up the extension process. In this talk, we will introduce the concept of supermaximal k-mismatch repeats, whose number is linear in n, and capture all maximal k-mismatch repeats: every maximal k-mismatch repeat is a substring of some supermaximal k-mismatch repeat. We will present SMART, a tool based on recent algorithmic advances implemented in C++ to compute supermaximal k-mismatch repeats directly. We will also show that the elements SMART outputs are statistically much more significant than the output of the state-of-the-art tools. The full paper describing SMART appeared as [1].
"SMART: SuperMaximal approximate repeats tool." DOI: 10.1145/3388440.3414210. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2019-12-24.
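For exact matches (k = 0), the supermaximal-repeat definition can be checked with a naive sketch, useful for understanding the concept on short strings (SMART itself uses efficient string algorithms and handles k mismatches):

```python
def count_occ(s, p):
    """Number of (possibly overlapping) occurrences of p in s."""
    return sum(1 for i in range(len(s) - len(p) + 1) if s.startswith(p, i))

def supermaximal_repeats(s):
    """Naive exact (k = 0) supermaximal repeats: substrings that occur at
    least twice in s and are not contained in any longer substring that
    also occurs at least twice. Every maximal repeat of s is a substring
    of one of these. (Cubic-time illustration only.)"""
    n = len(s)
    reps = {s[i:j] for i in range(n) for j in range(i + 1, n + 1)
            if count_occ(s, s[i:j]) >= 2}
    return sorted(r for r in reps
                  if not any(r != t and r in t for t in reps))
```

Because every maximal repeat is covered by some supermaximal one, reporting only the supermaximal set (linear in n) loses no repeat content while avoiding the quadratic blow-up of maximal repeated pairs.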
Molecule generation aims to design new molecules with specific chemical properties and to further optimize those properties. Following previous work, we encode molecules into continuous vectors in the latent space and then decode the embedding vectors back into molecules under the variational autoencoder (VAE) framework. We investigate the posterior collapse problem of the widely used RNN-based VAEs for molecule sequence generation. For the first time, we point out that the underestimated reconstruction loss of VAEs leads to posterior collapse, and we provide both analytical and experimental evidence to support this finding. To fix the problem and avoid posterior collapse, we propose an effective and efficient solution. Without bells and whistles, our method achieves state-of-the-art reconstruction accuracy and a competitive validity score on the ZINC 250K dataset. When generating 10,000 unique valid molecule sequences by sampling from the prior, JT-VAE takes 1450 seconds while our method needs only 9 seconds on a regular desktop machine.
{"title":"Re-balancing Variational Autoencoder Loss for Molecule Sequence Generation","authors":"Chao-chao Yan, Sheng Wang, Jinyu Yang, Tingyang Xu, Junzhou Huang","doi":"10.1145/3388440.3412458","DOIUrl":"https://doi.org/10.1145/3388440.3412458","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2019-10-01"}
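The abstract above attributes posterior collapse to an underestimated reconstruction loss. A minimal NumPy sketch of that imbalance, using hypothetical numbers (the `elbo_loss` helper, the 60-token length, and the KL value are illustrative assumptions, not the paper's method):

```python
import numpy as np

# Toy illustration of how an underestimated reconstruction term can trigger
# posterior collapse in an RNN-based VAE: if the per-token negative
# log-likelihood is averaged over sequence length instead of summed, the
# reconstruction term shrinks relative to the KL term, so the optimizer can
# minimize the loss simply by pushing KL -> 0 (a collapsed posterior).

def elbo_loss(token_nll, kl, reduce="sum"):
    """Negative ELBO for one sequence; token_nll holds per-token NLLs."""
    recon = token_nll.sum() if reduce == "sum" else token_nll.mean()
    return recon + kl

token_nll = np.full(60, 0.5)   # a 60-token SMILES string, 0.5 nats per token
kl = 10.0                      # KL(q(z|x) || p(z))

summed = elbo_loss(token_nll, kl, reduce="sum")     # 30.0 + 10.0 = 40.0
averaged = elbo_loss(token_nll, kl, reduce="mean")  # 0.5 + 10.0 = 10.5
# Under the averaged loss the KL term dominates (10.0 of 10.5), so driving
# KL to zero is the easiest descent direction -- the reconstruction signal
# is effectively 60x weaker than under the summed loss.
```

Re-balancing the two terms (e.g., restoring the summed reconstruction loss or re-weighting it) removes this incentive to collapse.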
Pharmaceutical drug design is a difficult and costly endeavor. Computational drug design has the potential to save time and money by providing a better starting point for new drugs, with an initial computational evaluation already completed. We propose a new application of Generative Adversarial Networks (GANs), called GANDALF (Generative Adversarial Network Drug-tArget Ligand Fructifier), to design new peptides for protein targets. Other GAN-based methods for computational drug design can generate only small molecules, not peptides. GANDALF also incorporates data not used by other methods, such as active atoms, which allows us to precisely identify where interaction occurs between a protein and a ligand. Our method goes further than comparable methods by generating a peptide and predicting its binding affinity. We compare results for a protein of interest, PD-1, using GANDALF, PepComposer, and FDA-approved drugs. We find that our method produces a peptide comparable to the FDA-approved drugs and better than that of PepComposer. Further work will improve the GANDALF system by deepening the GAN architecture to improve the binding affinity and 3D fit of the generated peptides. We are also exploring uses of transfer learning.
{"title":"GANDALF","authors":"Allison M. Rossetto, Wenjin Zhou","doi":"10.1145/3307339.3342183","DOIUrl":"https://doi.org/10.1145/3307339.3342183","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2019-09-04"}
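The GANDALF abstract describes a generator that produces peptides and a discriminator that judges them. A minimal, untrained NumPy sketch of such a sequence-GAN forward pass (all dimensions, weights, and function names are illustrative assumptions; GANDALF's actual architecture, active-atom features, and binding-affinity prediction are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
AA, PEP_LEN, NOISE_DIM = 20, 12, 16  # amino acids, peptide length, noise size

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Generator: noise vector -> "soft" one-hot peptide that a discriminator
# could back-propagate through (random, untrained weights for illustration).
W_gen = rng.normal(scale=0.1, size=(NOISE_DIM, PEP_LEN * AA))

def generate(z):
    return softmax((z @ W_gen).reshape(-1, PEP_LEN, AA))

# Discriminator: soft peptide -> single real/fake score per sequence.
W_disc = rng.normal(scale=0.1, size=(PEP_LEN * AA, 1))

def discriminate(x):
    return x.reshape(x.shape[0], -1) @ W_disc

z = rng.normal(size=(4, NOISE_DIM))
fake = generate(z)               # shape (4, 12, 20); each position sums to 1
scores = discriminate(fake)      # shape (4, 1) real/fake scores
peptides = fake.argmax(axis=-1)  # integer amino-acid index per position
```

In adversarial training, the generator's weights would be updated to raise the discriminator's scores on `fake` while the discriminator learns to separate generated peptides from real binders.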