Rajitha Yasas Wijesekara, Ashwin Lahorkar, Kunal Rathore, J. Valadi
Protein function identification has become an important task because of the plethora of new genomes being sequenced. Recently, distributed representations [1] of words in the form of continuous vectors have been found to be a very efficient way to represent semantic and syntactic information. In this representation, each word is embedded in an n-dimensional space, with similar words having proximate vectors in the embedding space. In the popular skip-gram configuration, the model uses the current word to predict its surrounding words. In this work we introduce a distributed representation for protein sequences based on reduced amino acid alphabets. In our RA2Vec (Reduced Alphabets to Vectors) implementation, we first map all Swiss-Prot sequences to reduced forms based on hydropathy and on conformational similarity. Then, using a skip-gram based method, we create reduced alphabet embedding vectors (RA2Vec) for each set. Embedding vectors for sequences in the original ProtVec representation [2] were also created. These vectors were created for various combinations of k-gram sizes and vector sizes. All seven combinations of the original ProtVec embedding vectors, hydropathy based embedding vectors, and conformational similarity based embedding vectors were then used as input to Support Vector Machine classifiers to build classification models. The embedding vectors were further reduced using the Recursive Feature Elimination (RFE) method to maximize fivefold cross-validation accuracy. We assessed the validity and utility of the new representations on five different data sets. Our results on all data sets indicate that certain synergistic combinations of the new representations, with and without ProtVec embeddings, can significantly improve performance.
{"title":"RA2Vec","authors":"Rajitha Yasas Wijesekara, Ashwin Lahorkar, Kunal Rathore, J. Valadi","doi":"10.1145/3388440.3414925","DOIUrl":"https://doi.org/10.1145/3388440.3414925","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114045246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
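The RA2Vec abstract above describes mapping sequences to a reduced alphabet and then splitting them into k-grams before skip-gram training. The following is a minimal sketch of those two preprocessing steps only. The three-group `HYDROPATHY_GROUPS` table is a hypothetical grouping chosen for illustration (the abstract does not list the groups actually used), and the shifted, non-overlapping k-gram splitting follows the scheme commonly described for ProtVec, which is assumed here rather than confirmed by the abstract.

```python
from typing import Dict, List

# Hypothetical hydropathy-style grouping of the 20 standard amino acids.
# The real RA2Vec groups are not given in the abstract; this is illustrative.
HYDROPATHY_GROUPS: Dict[str, str] = {
    "h": "AGILMFPWV",  # treated here as hydrophobic
    "p": "CNQSTY",     # treated here as polar
    "c": "DEHKR",      # treated here as charged
}

# Invert the table: residue letter -> reduced-alphabet symbol.
RESIDUE_TO_GROUP: Dict[str, str] = {
    aa: label for label, members in HYDROPATHY_GROUPS.items() for aa in members
}


def reduce_sequence(sequence: str, unknown: str = "x") -> str:
    """Rewrite a protein sequence in the reduced alphabet.

    Residues missing from the table map to the `unknown` symbol.
    """
    return "".join(RESIDUE_TO_GROUP.get(aa, unknown) for aa in sequence.upper())


def kgram_sentences(sequence: str, k: int = 3) -> List[List[str]]:
    """Split a sequence into k 'sentences' of non-overlapping k-grams.

    Sentence number `offset` starts reading at position `offset`, so every
    reading frame is covered once (assumed ProtVec-style splitting).
    """
    return [
        [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]
        for offset in range(k)
    ]


reduced = reduce_sequence("MKVLAS")
sentences = kgram_sentences(reduced, k=3)
```

The resulting lists of k-gram "words" could then be passed to any skip-gram trainer; the choice of trainer, window size, and vector size is left open here, as the abstract only states that several combinations were tried.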
Fatima Zare, J. Noorbakhsh, Tianyu Wang, Jeffrey H. Chuang, S. Nabavi
Deep learning has recently become a key methodology for the study and interpretation of cancer histology images. The ability of convolutional neural networks (CNNs) to automatically learn features from raw data without the need for pathologist expert knowledge, together with the availability of annotated histopathology datasets, has contributed to a growing interest in deep learning applications to histopathology. In clinical practice for cancer, histopathological images are commonly used for diagnosis, prognosis, and treatment. Recently, molecular subtype classification has gained significant attention for predicting the outcomes of standard chemotherapy and creating personalized targeted cancer therapy. Genomic profiles, especially gene expression data, are mostly used for molecular subtyping. In this study, we developed a novel PanCancer CNN model based on Google Inception V3 transfer learning to classify molecular subtypes using histopathological images. We used 22,484 Hematoxylin and Eosin (H&E) slides from 32 cancer types provided by The Cancer Genome Atlas (TCGA) to train and evaluate the model. We showed that, by employing deep learning, H&E slides can be used to classify molecular subtypes of solid tumor samples with high areas under the curve (AUCs; micro-average = 0.90, macro-average = 0.90). In cancer studies, combining histopathological images with genomic data has rarely been explored. We investigated the relationship between features extracted from H&E images and features extracted from gene expression profiles. We observed that the features from these two modalities (H&E images and gene expression values) for molecular subtyping are highly correlated. We therefore developed an integrative deep learning model that combines histological images and gene expression profiles. We showed that the integrative model improves the overall performance of molecular subtype classification (AUCs: micro-average = 0.99, macro-average = 0.97). These results show that integrating H&E images and gene expression profiles can enhance the accuracy of molecular subtype classification.
{"title":"Integrative Deep Learning for PanCancer Molecular Subtype Classification Using Histopathological Images and RNAseq Data","authors":"Fatima Zare, J. Noorbakhsh, Tianyu Wang, Jeffrey H. Chuang, S. Nabavi","doi":"10.1145/3388440.3412414","DOIUrl":"https://doi.org/10.1145/3388440.3412414","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115506470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
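The abstract above reports both micro-averaged and macro-averaged AUCs for multi-class subtype classification. The sketch below shows one common way the two averages are computed from one-vs-rest scores, so that the difference between them is concrete. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say how the averages were computed, and the function names and toy scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_micro_auc(
    y_true: Sequence[int], y_score: Sequence[Sequence[float]], n_classes: int
) -> Tuple[float, float]:
    """Return (macro, micro) one-vs-rest AUC.

    Macro: unweighted mean of the per-class AUCs.
    Micro: a single AUC over all (sample, class) pairs pooled together.
    """
    per_class: List[float] = []
    pooled_labels: List[int] = []
    pooled_scores: List[float] = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_score]
        per_class.append(binary_auc(labels, scores))
        pooled_labels.extend(labels)
        pooled_scores.extend(scores)
    macro = sum(per_class) / n_classes
    micro = binary_auc(pooled_labels, pooled_scores)
    return macro, micro


# Toy example: four samples, three hypothetical subtypes.
y_true = [0, 1, 2, 1]
y_score = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
]
macro, micro = macro_micro_auc(y_true, y_score, n_classes=3)
```

In this toy case every class is ranked perfectly on its own (macro = 1.0), while pooling exposes one mis-ordered pair across classes (micro = 31/32), which is why the two averages can differ.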
A. Jazayeri, Sara Pajouhanfar, Sadaf Saba, Christopher C. Yang
Down Syndrome (DS) is one of the most common disorders caused by the presence of an extra copy of chromosome 21. It has been shown that the expression of various genes located on chromosomes other than chromosome 21 is also affected in DS. Given the practical and ethical difficulties of human tissue studies, the Ts65Dn mouse model has been widely used in DS research. In this study, we propose a pipeline composed of a supervised learning approach, modularity analysis of a bipartite network, and multivariate analysis of variance (MANOVA) for the identification of differentially expressed proteins (DEP) among different classes of mouse models. The proposed pipeline is tested using the expression levels of 77 proteins in eight different classes of mouse models. The data include protein expression measurements for 34 trisomic Ts65Dn mice and 38 control mice. Each group is divided into four classes according to whether the mice were stimulated to learn and whether they were injected with memantine or saline. Previously proposed approaches have been unable to identify DEP among all eight classes simultaneously. Here, we show that our proposed pipeline can successfully identify the set of proteins expressed differently among all eight classes. The findings of this study can inform the study of learning responses to different treatments and of protein-treatment associations in DS. The proposed pipeline can also be adopted to identify DEP in DS or other diseases and health conditions, which can in turn inform the development of improved personalized treatment and management strategies.
{"title":"Modularity Analysis of Bipartite Networks and Multivariate ANOVA for Identification of Differentially Expressed Proteins in a Mouse Model of Down Syndrome","authors":"A. Jazayeri, Sara Pajouhanfar, Sadaf Saba, Christopher C. Yang","doi":"10.1145/3388440.3412421","DOIUrl":"https://doi.org/10.1145/3388440.3412421","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126247530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hamidreza Mohebbi, Nurit Haspel, D. Simovici, Joyce Quach
Gene fusion events are quite common in prostate, lymphoid, soft tissue, breast, gastric, and lung cancers. Detecting them requires fast and accurate fusion detection methods. However, accurate identification requires whole genome sequencing. Current state-of-the-art methods are inefficient, insufficiently accurate, and produce high false-positive rates. In this research we present a parallel method that converts an inefficient categorical space into a compact binary array, thereby reducing the dimensionality of the data and speeding up the computation. The FDJD pipeline contains three steps: general alignment, fusion candidate generation, and refinement. In our research, Jaccard distance is used, alongside a fast KNN implementation, to find the nearest neighbors of a given query binary fingerprint. We benchmarked our fusion prediction accuracy using both simulated and genuine RNA-Seq data sets. Fusion detection results are compared with the state-of-the-art methods STAR-Fusion, InFusion, and TopHat-Fusion. The genuine paired-end Illumina RNA-Seq data were obtained from 60 publicly available cancer cell line data sets. FDJD showed superior performance compared to popular alternative fusion detection methods on both simulated and genuine data sets. It attained 90% accuracy on simulated fusion transcript inputs. Of a total of 86 fusions predicted by at least three methods, we found 44 experimentally validated fusions using a wisdom-of-crowds approach. FDJD is not the fastest among the studied methods. However, it achieved the highest accuracy.
{"title":"Fusion Transcript Detection from RNA-Seq using Jaccard Distance","authors":"Hamidreza Mohebbi, Nurit Haspel, D. Simovici, Joyce Quach","doi":"10.1145/3388440.3415585","DOIUrl":"https://doi.org/10.1145/3388440.3415585","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121442753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
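The fusion detection abstract above relies on two ideas: encoding categorical features as a compact binary fingerprint, and ranking reference fingerprints by Jaccard distance to a query. A minimal sketch of both follows. It stores each fingerprint as a Python integer bitset and uses a brute-force neighbor search; the vocabulary, the reference names, and the helper functions are hypothetical, and the paper's own fast KNN implementation is not reproduced here.

```python
from typing import Dict, Iterable, List, Tuple


def fingerprint(tokens: Iterable[str], vocabulary: Dict[str, int]) -> int:
    """Encode a set of categorical tokens as an integer bitset.

    Bit `vocabulary[token]` is set for every known token; unknown tokens
    are ignored. The vocabulary is a hypothetical token -> bit index map.
    """
    bits = 0
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:
            bits |= 1 << index
    return bits


def jaccard_distance(a: int, b: int) -> float:
    """Jaccard distance 1 - |A & B| / |A | B| between two bitsets."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - bin(a & b).count("1") / union


def nearest_neighbors(
    query: int, references: Dict[str, int], k: int = 1
) -> List[Tuple[str, float]]:
    """Brute-force k nearest references, ties broken by name."""
    scored = [(name, jaccard_distance(query, fp)) for name, fp in references.items()]
    scored.sort(key=lambda item: (item[1], item[0]))
    return scored[:k]


# Hypothetical 3-mer vocabulary and reference fingerprints.
vocab = {"ACG": 0, "CGT": 1, "GTA": 2, "TAC": 3}
refs = {
    "geneA": fingerprint(["ACG", "CGT", "TAC"], vocab),  # bits 0, 1, 3
    "geneB": fingerprint(["GTA"], vocab),                # bit 2
    "geneC": fingerprint(["ACG", "CGT"], vocab),         # bits 0, 1
}
query = fingerprint(["ACG", "CGT"], vocab)
hits = nearest_neighbors(query, refs, k=2)
```

Bitwise AND and OR on integers give the intersection and union sizes directly, which is the main reason a binary fingerprint is cheaper to compare than the original categorical representation.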
E. Rastegari, D. Orn, H. Ali
Recently, wearable mobility monitoring devices have gained a great deal of attention for collecting movement and gait-related data. Moreover, wearable movement monitoring devices together with machine learning techniques have been shown to be successful in a variety of healthcare applications, including diagnosis, prognosis, and rehabilitation. However, advanced studies are needed to create accurate and robust models that can differentiate between populations based on their mobility signatures. This is particularly critical for monitoring the movement and gait patterns of individuals affected by neurodegenerative conditions such as Parkinson's Disease (PD). To achieve this goal, it is critical to employ a robust approach to model the available data and identify the optimal set of movement parameters for the classification process. In this work, we propose a computational approach to identify the best feature selection method for spatiotemporal gait parameters. We investigate several feature selection approaches and analyze their performance on the mobility classification problem, including maximum information gain with minimum correlation (MIGMC), maximum signal-to-noise ratio with minimum correlation (MSNR&MC), genetic algorithms (GA), decision trees (DT), and principal component analysis (PCA). These methods, along with newly proposed variations, are assessed in terms of classification accuracy, number of selected features, and computation time. Data collected from triaxial accelerometers attached to the ankles of individuals with PD, geriatrics (GE), and healthy elderly (HE) were used to train and test a set of six different machine learning techniques. Our results indicate that three out of six feature selection methods, namely GA, MSNR&MC, and a modified version of MIGMC, are the best performers in terms of classification accuracy. We also show that more robust performance is achieved when employing multiple algorithms, such as decision trees and genetic algorithms. This study provides a critical first step towards the much-needed goal of using data collected from wearable devices to extract important information for the diagnosis and rehabilitation of many movement-related medical conditions.
{"title":"Smart Computational Approaches with Advanced Feature Selection Algorithms for Optimizing the Classification of Mobility Data in Health Informatics","authors":"E. Rastegari, D. Orn, H. Ali","doi":"10.1145/3388440.3412426","DOIUrl":"https://doi.org/10.1145/3388440.3412426","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131972175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
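Among the feature selection methods named in the abstract above, "maximum signal-to-noise ratio with minimum correlation" lends itself to a short illustration. The sketch below is one plausible greedy reading of that idea for a two-class problem: rank features by a signal-to-noise score, then keep a feature only if it is not strongly correlated with any feature already kept. The scoring formula, the `max_corr` threshold, and the toy gait features are assumptions made for illustration; the abstract does not define the authors' exact MSNR&MC procedure.

```python
from math import sqrt
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def signal_to_noise(values: Sequence[float], labels: Sequence[int]) -> float:
    """Two-class score |mean1 - mean0| / (std1 + std0) (assumed definition)."""
    ones = [v for v, y in zip(values, labels) if y == 1]
    zeros = [v for v, y in zip(values, labels) if y == 0]
    gap = abs(mean(ones) - mean(zeros))
    noise = pstdev(ones) + pstdev(zeros)
    if noise == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / noise


def select_features(
    columns: Dict[str, Sequence[float]], labels: Sequence[int], max_corr: float = 0.8
) -> List[str]:
    """Greedy selection: best score first, skip near-duplicates."""
    ranked = sorted(
        columns, key=lambda name: signal_to_noise(columns[name], labels), reverse=True
    )
    selected: List[str] = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[s])) <= max_corr for s in selected):
            selected.append(name)
    return selected


# Hypothetical gait features for six subjects (label 1 = one group, 0 = other).
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "stride_time": [1.0, 1.2, 1.1, 2.0, 2.1, 1.9],
    "stride_time_noisy": [1.1, 1.1, 1.2, 1.9, 2.2, 1.9],  # near-duplicate
    "cadence": [100, 98, 103, 99, 104, 101],
}
chosen = select_features(columns, labels)
```

Here the near-duplicate feature is discarded despite its high score, while the weakly discriminative but uncorrelated feature is kept, which is the trade-off a "maximum relevance, minimum correlation" criterion is meant to capture.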
F. Giuste, M. Venkatesan, Conan Y. Zhao, L. Tong, Yuanda Zhu, S. Deshpande, May D. Wang
Heart transplant rejection must be quickly and accurately identified to optimize anti-rejection therapies and prevent organ loss. Expert evaluation of endomyocardial biopsies is labor-intensive, prone to human bias, and suffers from low inter-rater agreement. Additionally, the increasing use of digital pathology for biopsy examination has heightened the need for additional image quality control. To meet these challenges, we developed a novel transplant rejection detection pipeline that automatically identifies histology slides in need of rescanning and highlights biopsy regions showing potential signs of rejection. Our system leverages a fast and effective automated patch-level quality filter as well as state-of-the-art feature extraction techniques to provide quality whole-slide level labeling of early rejection signs. We successfully identified digital pathology images with poor image quality and leveraged this quality gain to improve our novel weakly-supervised learning model, achieving a transplant rejection classification AUC of 70.12% (±20.74%).
{"title":"Automated Classification of Acute Rejection from Endomyocardial Biopsies","authors":"F. Giuste, M. Venkatesan, Conan Y. Zhao, L. Tong, Yuanda Zhu, S. Deshpande, May D. Wang","doi":"10.1145/3388440.3412430","DOIUrl":"https://doi.org/10.1145/3388440.3412430","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132088949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuanqi Du, Xiaojie Guo, Amarda Shehu, Liang Zhao
Designing molecules with specific structural and functional properties (e.g., drug-likeness and water solubility) is central to advancing drug discovery and material science, but it poses outstanding challenges in both wet and dry laboratories. The search space is vast and rugged. Recent advances in deep generative models are motivating new computational approaches that build on deep learning to tackle the molecular space. Despite rapid advancements, state-of-the-art deep generative models for molecule generation have many limitations, including a lack of interpretability. In this paper we address this limitation by proposing a generic framework for interpretable molecule generation based on novel disentangled deep graph generative models with property control. Specifically, we propose a disentanglement enhancement strategy for graphs. We also propose a new deep neural architecture that achieves this learning objective efficiently for inference and generation on variable-size graphs. Extensive experimental evaluation demonstrates the superiority of our approach in various critical aspects, such as accuracy, novelty, and disentanglement.
{"title":"Interpretable Molecule Generation via Disentanglement Learning","authors":"Yuanqi Du, Xiaojie Guo, Amarda Shehu, Liang Zhao","doi":"10.1145/3388440.3414709","DOIUrl":"https://doi.org/10.1145/3388440.3414709","url":null,"PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130601349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiandong Wang, Sajal Kumar, Mingzhou Song
The complexity, dynamics, and scale of data acquired by modern biotechnology increasingly favor model-free computational methods that make minimal assumptions about underlying biological mechanisms. For example, single-cell transcriptome and proteome data have a throughput several orders of magnitude higher than bulk methods. Many model-free statistical methods for pattern discovery, such as mutual information and chi-squared tests, however, require discrete data. Most discretization methods minimize squared errors for each variable independently and do not necessarily retain joint patterns. To address this issue, we present a joint grid discretization algorithm that preserves clusters in the original data. We evaluated this algorithm on simulated data to show its advantage over other methods in maintaining clusters, as measured by the adjusted Rand index. We also show that it promotes global functional patterns over independent patterns. On single-cell proteome and transcriptome data from leukemia and healthy blood, joint grid discretization captured known protein-to-RNA regulatory relationships while revealing previously unknown interactions. As such, joint grid discretization is applicable as a data transformation step in associative, functional, and causal inference of molecular interactions fundamental to systems biology. The developed software is publicly available at https://cran.r-project.org/package=GridOnClusters
{"title":"Joint Grid Discretization for Biological Pattern Discovery","authors":"Jiandong Wang, Sajal Kumar, Mingzhou Song","doi":"10.1145/3388440.3412415","url":"https://doi.org/10.1145/3388440.3412415","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
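The evaluation criterion named in the abstract above, cluster preservation measured by the adjusted Rand index (ARI), can be illustrated with a minimal sketch. This is not the GridOnClusters algorithm: the toy points, the cluster labels, and both sets of cut points are hypothetical values chosen for illustration, and `grid_cells` is an assumed helper that only applies given cut points. The sketch shows how a grid whose lines fall between clusters keeps the ARI at 1, while cuts placed without regard to the clusters lower it.

```python
from bisect import bisect_right
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points."""
    n = len(labels_a)
    # Contingency counts: how many points share each (label_a, label_b) pair.
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case: both labelings are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def grid_cells(points, cuts_per_dim):
    """Map each point to a tuple of bin indices, one index per dimension."""
    return [
        tuple(bisect_right(cuts, x) for x, cuts in zip(point, cuts_per_dim))
        for point in points
    ]


# Hypothetical toy data: three well-separated clusters in two dimensions.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (1.0, 5.0), (1.1, 5.2), (0.9, 4.8)]
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Hypothetical cut points: one set lies between clusters, one cuts through them.
joint_cuts = [[3.0], [3.0]]
poor_cuts = [[1.05], [4.95]]

joint_ari = adjusted_rand_index(clusters, grid_cells(points, joint_cuts))
poor_ari = adjusted_rand_index(clusters, grid_cells(points, poor_cuts))
print(f"ARI with cluster-aware grid: {joint_ari:.2f}")
print(f"ARI with cluster-agnostic grid: {poor_ari:.2f}")
```

In this sketch the grid cell of each point is treated as its discrete label, so an ARI of 1 means the grid reproduces the cluster partition exactly.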
Large for gestational age (LGA) births are associated with many maternal and perinatal complications. As overweight and obesity are risk factors for LGA, we aimed to predict LGA in overweight and obese women at approximately 20 gestational weeks, so that women at risk of LGA can be identified early enough to allow for appropriate interventions. To develop a prediction model, a random forest algorithm was applied to maternal characteristics and blood biomarkers at baseline, together with ultrasound scan findings at 20 gestational weeks. Here we present our preliminary results, which demonstrate potential for use in clinical decision support for identifying patients early in pregnancy who are at risk of an LGA birth.
{"title":"Prediction of Large for Gestational Age Infants in Overweight and Obese Women at Approximately 20 Gestational Weeks","authors":"Yuhan Du, J. Mehegan, F. Mcauliffe, C. Mooney","doi":"10.1145/3388440.3414906","url":"https://doi.org/10.1145/3388440.3414906","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}
A major obstacle to predicting cell line interactions with chemical compounds is the lack of a computational representation for cell lines. We describe a method for characterizing cell lines from scientific literature. We retrieve and process cell line-related scientific papers, apply a document classification algorithm, and then obtain a description of the information space of each cell line. We have successfully characterized a set of 300+ cell lines.
{"title":"Representing Cellular Lines with SVM and Text Processing","authors":"I. Carrera, I. Dutra, E. Tejera","doi":"10.1145/3388440.3414912","url":"https://doi.org/10.1145/3388440.3414912","journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics"},"publicationDate":"2020-09-21"}