A. Jazayeri, Sara Pajouhanfar, Sadaf Saba, Christopher C. Yang
Down Syndrome (DS) is one of the most common disorders caused by the presence of an extra copy of chromosome 21. It has been shown that the expression of various genes located on chromosomes other than chromosome 21 is also affected in DS. Given the practical and ethical difficulties of human tissue studies, the Ts65Dn mouse model has been widely used in DS research. In this study, we propose a pipeline composed of a supervised learning approach, modularity analysis of a bipartite network, and multivariate analysis of variance (MANOVA) for the identification of differentially expressed proteins (DEPs) among different classes of mice. The proposed pipeline is tested using the expression levels of 77 proteins in eight classes of mice. The data include protein expression measurements for 34 trisomic Ts65Dn mice and 38 control mice. Each group is divided into four classes according to whether the mice were stimulated to learn and whether they were injected with memantine or saline. Previously proposed approaches have been unable to identify DEPs among all eight classes simultaneously. Here, we show that our proposed pipeline can successfully identify the set of proteins expressed differently among all eight classes. The findings of this study can inform the study of learning responses to different treatments and of protein-treatment associations in DS. The proposed pipeline can also be adopted to identify DEPs in DS or in other diseases and health conditions, which can in turn inform the development of improved personalized treatment and management strategies.
{"title":"Modularity Analysis of Bipartite Networks and Multivariate ANOVA for Identification of Differentially Expressed Proteins in a Mouse Model of Down Syndrome","authors":"A. Jazayeri, Sara Pajouhanfar, Sadaf Saba, Christopher C. Yang","doi":"10.1145/3388440.3412421","url":"https://doi.org/10.1145/3388440.3412421","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2020-09-21"}
Rajitha Yasas Wijesekara, Ashwin Lahorkar, Kunal Rathore, J. Valadi
Protein function identification has become an important task because of the large number of new genomes being sequenced. Recently, distributed representation [1] of words in the form of continuous vectors has been found to be a very efficient way to represent semantic and syntactic information. In this representation, each word is embedded in an n-dimensional space, with similar words having nearby vectors in the embedding space. In the popular skip-gram configuration, the model uses the current word to predict its surrounding words. In this work, we introduce a distributed representation for protein sequences based on reduced amino acid alphabets. In our RA2Vec (Reduced Alphabets to Vectors) implementation, we first map all Swiss-Prot sequences to reduced forms based on hydropathy and on conformational similarity. Then, using a skip-gram based method, reduced-alphabet embedding vectors (RA2Vec) were created for each set. Embedding vectors for sequences in the original ProtVec representation [2] were also created. These vectors were created for various combinations of k-gram sizes and vector sizes. All seven combinations of the original ProtVec embedding vectors, the hydropathy-based embedding vectors, and the conformational-similarity-based embedding vectors were then used as input to Support Vector Machine classifiers, and classification models were built. The embedding vectors were further reduced using the recursive feature elimination (RFE) method to maximize fivefold cross-validation (CV) accuracy. We assessed the validity and utility of the new representations on five different data sets. Our results on all data sets indicate that certain synergistic combinations of the new representations, with and without the ProtVec embedding, can result in significantly improved performance.
{"title":"RA2Vec","authors":"Rajitha Yasas Wijesekara, Ashwin Lahorkar, Kunal Rathore, J. Valadi","doi":"10.1145/3388440.3414925","url":"https://doi.org/10.1145/3388440.3414925","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2020-09-21"}
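The preprocessing described in the RA2Vec abstract (map each sequence to a reduced alphabet, split it into k-grams, and form skip-gram training pairs) can be sketched as follows. This is only an illustration under stated assumptions: the three residue groups in `HYPOTHETICAL_GROUPS`, the use of overlapping k-grams, and the function names are all invented here, since the abstract does not list the actual hydropathy or conformational-similarity groupings or the tokenization details.

```python
# Hypothetical three-letter reduced alphabet; the groups actually used by
# RA2Vec are not given in the abstract.
HYPOTHETICAL_GROUPS = {
    "h": "AVLIMFWC",  # treated here as hydrophobic
    "n": "GSTPYHNQ",  # treated here as neutral or polar
    "c": "DEKR",      # treated here as charged
}
RESIDUE_TO_GROUP = {
    residue: label
    for label, residues in HYPOTHETICAL_GROUPS.items()
    for residue in residues
}


def reduce_sequence(sequence):
    """Rewrite a protein sequence in the reduced alphabet ('x' = unknown residue)."""
    return "".join(RESIDUE_TO_GROUP.get(residue, "x") for residue in sequence)


def kgrams(sequence, k=3):
    """Split a sequence into overlapping k-grams (an assumed tokenization)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs: each token paired with its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


reduced = reduce_sequence("MKVLD")
tokens = kgrams(reduced, k=3)
pairs = skipgram_pairs(tokens, window=1)
```

The resulting pairs are the kind of input a skip-gram model is trained on; the embedding training itself and the SVM classification step are not shown.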
Fatima Zare, J. Noorbakhsh, Tianyu Wang, Jeffrey H. Chuang, S. Nabavi
Deep learning has recently become a key methodology for the study and interpretation of cancer histology images. The ability of convolutional neural networks (CNNs) to automatically learn features from raw data without the need for pathologist expert knowledge, as well as the availability of annotated histopathology datasets, has contributed to a growing interest in deep learning applications to histopathology. In clinical practice for cancer, histopathological images are commonly used for diagnosis, prognosis, and treatment. Recently, molecular subtype classification has gained significant attention for predicting the outcomes of standard chemotherapy and for creating personalized targeted cancer therapy. Genomic profiles, especially gene expression data, are mostly used for molecular subtyping. In this study, we developed a novel PanCancer CNN model, based on transfer learning with Google Inception V3, to classify molecular subtypes using histopathological images. We used 22,484 Hematoxylin and Eosin (H&E) slides from 32 cancer types provided by The Cancer Genome Atlas (TCGA) to train and evaluate the model. We showed that, by employing deep learning, H&E slides can be used to classify molecular subtypes of solid tumor samples with high areas under the curve (AUCs; micro-average = 0.90, macro-average = 0.90). In cancer studies, combining histopathological images with genomic data has rarely been explored. We investigated the relationship between features extracted from H&E images and features extracted from gene expression profiles. We observed that the features from these two modalities (H&E images and gene expression values) for molecular subtyping are highly correlated. We therefore developed an integrative deep learning model that combines histological images and gene expression profiles. We showed that the integrative model improves the overall performance of molecular subtype classification (AUCs: micro-average = 0.99, macro-average = 0.97). These results show that integrating H&E images and gene expression profiles can enhance the accuracy of molecular subtype classification.
{"title":"Integrative Deep Learning for PanCancer Molecular Subtype Classification Using Histopathological Images and RNAseq Data","authors":"Fatima Zare, J. Noorbakhsh, Tianyu Wang, Jeffrey H. Chuang, S. Nabavi","doi":"10.1145/3388440.3412414","url":"https://doi.org/10.1145/3388440.3412414","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2020-09-21"}
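The abstract above reports both micro-average and macro-average AUCs. As a reminder of how the two differ, here is a minimal pure-Python sketch using one-vs-rest scoring. It is not the authors' evaluation code; the function names and the pairwise (Mann-Whitney style) AUC computation are choices made for this illustration.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes both classes are present.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_micro_auc(y_true, y_score, n_classes):
    """Return (macro, micro) one-vs-rest AUC for multi-class scores.

    y_true:  list of class indices.
    y_score: list of per-sample score lists, one score per class.
    """
    per_class = []
    pooled_labels, pooled_scores = [], []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_score]
        per_class.append(binary_auc(labels, scores))
        # Micro-averaging pools every (sample, class) decision into one problem.
        pooled_labels.extend(labels)
        pooled_scores.extend(scores)
    macro = sum(per_class) / n_classes  # unweighted mean over classes
    micro = binary_auc(pooled_labels, pooled_scores)
    return macro, micro
```

The macro-average weights every subtype equally, while the micro-average is dominated by the more frequent subtypes, so the two can diverge when class sizes are imbalanced.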
Torin Adamson, Selina M. Bauernfeind, Bruna Jacobson, Lydia Tapia
Molecular docking prediction, used to design new drugs and identify biological function, can be represented as a motion planning problem in a high-dimensional space. This complex task can benefit from human guidance, i.e., interactive simulations combining visual and haptic feedback from atomic forces [4]. Human-generated data has already been used to tackle protein folding with FoldIt [3], showing that crowdsourced data collection is a viable way to help solve difficult problems in biology. However, existing interactive molecular docking systems typically rely on haptic devices that are not widely available to the public, limiting the potential to crowdsource solutions [1]. This user study tested 4 input devices for finding potentially bound ligand-receptor states and biologically feasible paths via motion planning. On a PC laptop, players used one of the following with our in-house interactive rigid-body molecular docking program, DockAnywhere [1, 2]: a 6 degree-of-freedom (DOF) haptic feedback device, a 3 DOF haptic feedback device, a game controller with vibration feedback, or a mouse and keyboard. Players tried to dock a known inhibitor of HIV protease (PDB ID 1AJX). The goal was to move the ligand around the receptor to find the lowest possible potential energy state. Players received feedback as a positive integer score based on the potential energy and, for haptic devices, force feedback calculated from atomic forces. During gameplay, the ligand's positions and orientations were recorded as it moved through the environment. In total, 439,832 states were collected from 32 players (8 per device). These states were divided into 4 sets, one per device. Motion planning queries found a feasible path to the goal state in all sets, with the mouse set showing more pronounced energy barriers. We found that exploration varied among devices, with players on haptic devices exploring the interaction energy landscape more uniformly. However, all 4 devices yielded low-energy ligand states with comparable energy values and similar closest distances to the binding site. In summary, force feedback showed no clear improvement in finding low potential energy states or in getting closer to the known binding state. However, haptic devices appear to enable a more thorough exploration of the state space, creating more samples and generating smoother paths. The nature of this guidance should be the subject of future investigation.
{"title":"Using player generated data to elucidate molecular docking","authors":"Torin Adamson, Selina M. Bauernfeind, Bruna Jacobson, Lydia Tapia","doi":"10.1145/3388440.3414704","url":"https://doi.org/10.1145/3388440.3414704","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2020-09-21"}
F. Giuste, M. Venkatesan, Conan Y. Zhao, L. Tong, Yuanda Zhu, S. Deshpande, May D. Wang
Heart transplant rejection must be quickly and accurately identified to optimize anti-rejection therapies and prevent organ loss. Expert evaluation of endomyocardial biopsies is labor-intensive and prone to human bias, and it suffers from low inter-rater agreement. Additionally, the increasing use of digital pathology for biopsy examination has heightened the need for additional image quality control. To meet these challenges, we developed a novel transplant rejection detection pipeline that automatically identifies histology slides in need of rescanning and highlights biopsy regions showing potential signs of rejection. Our system combines a fast and effective automated patch-level quality filter with state-of-the-art feature extraction techniques to provide quality whole-slide-level labeling of early rejection signs. We successfully identified digital pathology images with poor image quality and used this quality gain to improve our novel weakly supervised learning model, leading to a transplant rejection classification performance of AUC 70.12% (±20.74%).
{"title":"Automated Classification of Acute Rejection from Endomyocardial Biopsies","authors":"F. Giuste, M. Venkatesan, Conan Y. Zhao, L. Tong, Yuanda Zhu, S. Deshpande, May D. Wang","doi":"10.1145/3388440.3412430","url":"https://doi.org/10.1145/3388440.3412430","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2020-09-21"}
Recently, wearable mobility monitoring devices have gained a great deal of attention for collecting movement and gait-related data. Moreover, wearable movement monitoring devices combined with machine learning techniques have been shown to be successful in a variety of healthcare applications, including diagnosis, prognosis, and rehabilitation. However, advanced studies are needed to create accurate and robust models that can differentiate between populations based on their mobility signatures. This is particularly critical for monitoring the movement and gait patterns of individuals affected by neurodegenerative conditions such as Parkinson's Disease (PD). To achieve this goal, it is critical to employ a robust approach to model the available data and identify the optimal set of movement parameters for the classification process. In this work, we propose a computational approach to identify the best feature selection method for spatiotemporal gait parameters. We investigate several feature selection approaches and analyze their performance on the mobility classification problem: maximum information gain with minimum correlation (MIGMC), maximum signal-to-noise ratio with minimum correlation (MSNR&MC), genetic algorithms (GA), decision trees (DT), and principal component analysis (PCA). These methods, along with newly proposed variations, are assessed in terms of classification accuracy, number of selected features, and computation time. Data collected from triaxial accelerometers attached to the ankles of individuals with PD, geriatrics (GE), and healthy elderly (HE) were used to train and test a set of six different machine learning techniques. Our results indicate that three of the six feature selection methods (GA, MSNR&MC, and a modified version of MIGMC) perform best in terms of classification accuracy. We also show that more robust performance is achieved when multiple algorithms, such as decision trees and genetic algorithms, are employed together. This study provides a critical first step towards the much-needed goal of using data collected from wearable devices to extract important information for the diagnosis and rehabilitation of many movement-related medical conditions.
{"title":"Smart Computational Approaches with Advanced Feature Selection Algorithms for Optimizing the Classification of Mobility Data in Health Informatics","authors":"E. Rastegari, D. Orn, H. Ali","doi":"10.1145/3388440.3412426","url":"https://doi.org/10.1145/3388440.3412426","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2020-09-21"}
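The idea behind the "maximum signal-to-noise ratio with minimum correlation" family of filters named in the abstract above can be illustrated with a small greedy sketch. This is not the authors' MSNR&MC algorithm, whose details the abstract does not give. It assumes a two-class problem, a signal-to-noise ratio defined as the absolute mean difference divided by the sum of the class standard deviations, and a hypothetical correlation cutoff `max_corr`; the gait feature names in the usage example are invented.

```python
from statistics import mean, pstdev


def snr(class_a, class_b):
    """Two-class signal-to-noise ratio (assumed definition).

    Undefined (division by zero) if both classes have zero spread.
    """
    return abs(mean(class_a) - mean(class_b)) / (pstdev(class_a) + pstdev(class_b))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def select_features(features, labels, max_corr=0.9):
    """Greedy filter: rank by SNR, skip features correlated with chosen ones.

    features: dict mapping feature name -> list of values (one per subject).
    labels:   list of 0/1 class labels, aligned with the value lists.
    """
    def score(name):
        values = features[name]
        a = [v for v, label in zip(values, labels) if label == 0]
        b = [v for v, label in zip(values, labels) if label == 1]
        return snr(a, b)

    chosen = []
    for name in sorted(features, key=score, reverse=True):
        if all(abs(pearson(features[name], features[c])) <= max_corr for c in chosen):
            chosen.append(name)
    return chosen


# Invented toy data: "step_time" is nearly redundant with "stride".
labels = [0, 0, 0, 1, 1, 1]
gait = {
    "stride": [1.0, 1.1, 0.9, 2.0, 2.1, 1.9],
    "step_time": [1.0, 1.2, 0.8, 2.0, 2.2, 1.8],
    "cadence": [5, 7, 6, 6, 5, 7],
}
selected = select_features(gait, labels)
```

On this toy data the filter keeps `stride`, drops the highly correlated `step_time`, and keeps `cadence` because it is uncorrelated with `stride`, even though its own ratio is zero. A practical filter would also cap the number of features or require a minimum ratio.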
Designing molecules with specific structural and functional properties (e.g., drug-likeness and water solubility) is central to advancing drug discovery and material science, but it poses outstanding challenges in both wet and dry laboratories. The search space is vast and rugged. Recent advances in deep generative models are motivating new computational approaches that build on deep learning to tackle the molecular space. Despite rapid advancements, state-of-the-art deep generative models for molecule generation have many limitations, including a lack of interpretability. In this paper, we address this limitation by proposing a generic framework for interpretable molecule generation based on novel disentangled deep graph generative models with property control. Specifically, we propose a disentanglement enhancement strategy for graphs. We also propose a new deep neural architecture that achieves this learning objective efficiently for inference and generation on variable-size graphs. Extensive experimental evaluation demonstrates the superiority of our approach in various critical aspects, such as accuracy, novelty, and disentanglement.
{"title":"Interpretable Molecule Generation via Disentanglement Learning","authors":"Yuanqi Du, Xiaojie Guo, Amarda Shehu, Liang Zhao","doi":"10.1145/3388440.3414709","url":"https://doi.org/10.1145/3388440.3414709","journal":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","publicationDate":"2020-09-21"}
The complexity, dynamics, and scale of data acquired by modern biotechnology increasingly favor model-free computational methods that make minimal assumptions about underlying biological mechanisms. For example, single-cell transcriptome and proteome data have a throughput several orders of magnitude higher than that of bulk methods. However, many model-free statistical methods for pattern discovery, such as mutual information and chi-squared tests, require discrete data. Most discretization methods minimize squared errors for each variable independently, not necessarily retaining joint patterns. To address this issue, we present a joint grid discretization algorithm that preserves clusters in the original data. We evaluated this algorithm on simulated data to show its advantage over other methods in maintaining clusters, as measured by the adjusted Rand index. We also show that it promotes global functional patterns over independent patterns. On single-cell proteome and transcriptome data from leukemia and healthy blood, joint grid discretization captured known protein-to-RNA regulatory relationships while revealing previously unknown interactions. As such, joint grid discretization is applicable as a data transformation step in associative, functional, and causal inference of molecular interactions fundamental to systems biology. The developed software is publicly available at https://cran.r-project.org/package=GridOnClusters
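As an illustration of the evaluation criterion mentioned in the abstract above, the following Python sketch computes the adjusted Rand index between known cluster labels and the grid cells that a set of cut points assigns to each point. It is not the GridOnClusters algorithm: the helper names (`adjusted_rand_index`, `grid_cell`), the toy points, and both sets of cut points are hypothetical and chosen only to show how a grid that separates the clusters scores higher than one that splits them.

```python
from bisect import bisect_right
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    # Pairs of items placed together under both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together under each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put all items together
    return (sum_cells - expected) / (max_index - expected)


def grid_cell(point, cuts_per_dim):
    """Map a point to a grid cell: one bin index per dimension."""
    return tuple(bisect_right(cuts, x) for x, cuts in zip(point, cuts_per_dim))


# Hypothetical toy data: three well-separated clusters in two dimensions.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),   # cluster 0
    (3.0, 3.0), (3.1, 2.9), (2.8, 3.2),   # cluster 1
    (1.0, 3.0), (1.1, 3.2), (0.8, 2.9),   # cluster 2
]
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Assumed cut points: one grid falls between the clusters, the other cuts through them.
separating_cuts = [[2.0], [2.0]]
splitting_cuts = [[1.05], [3.05]]

ari_separating = adjusted_rand_index(clusters, [grid_cell(p, separating_cuts) for p in points])
ari_splitting = adjusted_rand_index(clusters, [grid_cell(p, splitting_cuts) for p in points])
print(ari_separating, ari_splitting)
```

A grid whose lines fall between clusters reproduces the cluster labels exactly, whereas lines that pass through clusters scatter their members across cells and lower the index.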
{"title":"Joint Grid Discretization for Biological Pattern Discovery","authors":"Jiandong Wang, Sajal Kumar, Mingzhou Song","doi":"10.1145/3388440.3412415","DOIUrl":"https://doi.org/10.1145/3388440.3412415","url":null,"abstract":"The complexity, dynamics, and scale of data acquired by modern biotechnology increasingly favor model-free computational methods that make minimal assumptions about underlying biological mechanisms. For example, single-cell transcriptome and proteome data have a throughput several orders of magnitude higher than that of bulk methods. However, many model-free statistical methods for pattern discovery, such as mutual information and chi-squared tests, require discrete data. Most discretization methods minimize squared errors for each variable independently, not necessarily retaining joint patterns. To address this issue, we present a joint grid discretization algorithm that preserves clusters in the original data. We evaluated this algorithm on simulated data to show its advantage over other methods in maintaining clusters, as measured by the adjusted Rand index. We also show that it promotes global functional patterns over independent patterns. On single-cell proteome and transcriptome data from leukemia and healthy blood, joint grid discretization captured known protein-to-RNA regulatory relationships while revealing previously unknown interactions. As such, joint grid discretization is applicable as a data transformation step in associative, functional, and causal inference of molecular interactions fundamental to systems biology. The developed software is publicly available at https://cran.r-project.org/package=GridOnClusters","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131251913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large for gestational age (LGA) births are associated with many maternal and perinatal complications. As overweight and obesity are risk factors for LGA, we aimed to predict LGA in overweight and obese women at approximately 20 gestational weeks, so that we can identify women at risk of LGA early to allow for appropriate interventions. A random forest algorithm was applied to maternal characteristics and blood biomarkers at baseline and 20 gestational weeks' ultrasound scan findings to develop a prediction model. Here we present our preliminary results demonstrating potential for use in clinical decision support for identifying patients early in pregnancy at risk of an LGA birth.
{"title":"Prediction of Large for Gestational Age Infants in Overweight and Obese Women at Approximately 20 Gestational Weeks","authors":"Yuhan Du, J. Mehegan, F. Mcauliffe, C. Mooney","doi":"10.1145/3388440.3414906","DOIUrl":"https://doi.org/10.1145/3388440.3414906","url":null,"abstract":"Large for gestational age (LGA) births are associated with many maternal and perinatal complications. As overweight and obesity are risk factors for LGA, we aimed to predict LGA in overweight and obese women at approximately 20 gestational weeks, so that we can identify women at risk of LGA early to allow for appropriate interventions. A random forest algorithm was applied to maternal characteristics and blood biomarkers at baseline and 20 gestational weeks' ultrasound scan findings to develop a prediction model. Here we present our preliminary results demonstrating potential for use in clinical decision support for identifying patients early in pregnancy at risk of an LGA birth.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114551987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A major obstacle to predicting cell line interactions with chemical compounds is the lack of a computational representation for cell lines. We describe a method for characterizing cell lines from scientific literature. We retrieve and process cell line-related scientific papers, apply a document classification algorithm, and then obtain a description of the information space of each cell line. We have successfully characterized a set of more than 300 cell lines.
{"title":"Representing Cellular Lines with SVM and Text Processing","authors":"I. Carrera, I. Dutra, E. Tejera","doi":"10.1145/3388440.3414912","DOIUrl":"https://doi.org/10.1145/3388440.3414912","url":null,"abstract":"A major obstacle to predicting cell line interactions with chemical compounds is the lack of a computational representation for cell lines. We describe a method for characterizing cell lines from scientific literature. We retrieve and process cell line-related scientific papers, apply a document classification algorithm, and then obtain a description of the information space of each cell line. We have successfully characterized a set of more than 300 cell lines.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114740122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}