{"title":"Session details: Session 19: Automated Diagnosys and Prediction II","authors":"Dong Si","doi":"10.1145/3254562","DOIUrl":"https://doi.org/10.1145/3254562","url":null,"abstract":"","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125466437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Associating Genomics and Clinical Information by Means of Semantic Based Ranking
F. Cristiano, G. Tradigo, P. Veltri
DOI: 10.1145/3107411.3107436

Relating genomic data to clinical and disease information is a new challenge for life sciences research. High-performance computational platforms and new technologies (e.g., next-generation sequencing) enable the production of huge quantities of biological data. Genomic ontologies describing genes and their functions, as well as databases of disease groups, are now available. We focus on the problem of enriching genomic datasets containing miRNA genes by adding related disease information. The enrichment is performed by using ontologies to find gene-to-disease associations. Ontologies are used to describe molecular genomic processes and functions, as well as disease classes and experimental details. The International Classification of Diseases (ICD) is used for the classification of diseases and clinical information. Diseases are ranked using an algorithm based on Google's PageRank. An application tool called Surf App! has been developed in R and tested on a neurological disease dataset.
Antidote Application: An Educational System for Treatment of Common Toxin Overdose
Jon Long, Yingyuan Zhang, V. Brusic, Lubomir T. Chitkushev, Guanglan Zhang
DOI: 10.1145/3107411.3107415

Poisonings account for almost 1% of emergency room visits each year. Time is a critical factor in dealing with a toxicologic emergency: delay in dispensing the first antidote dose can lead to life-threatening sequelae. Current toxicological resources that support treatment decisions are broad in scope, time-consuming to read, or at times unavailable. Our review of these resources revealed a gap in their ability to provide expedient calculations and recommendations about the appropriate course of treatment. To bridge this gap, we developed the Antidote Application (AA), a computational system that automatically provides patient-specific antidote treatment recommendations and individualized dose calculations. We implemented 27 algorithms that describe US Food and Drug Administration (FDA) approved uses and evidence-based practices from the primary literature for the treatment of common toxin exposures. The AA covers 29 antidotes recommended by Poison Control and toxicology experts, 19 poison classes, and 31 poisons, which together represent over 200 toxic entities. To the best of our knowledge, the AA is the first educational decision support system in toxicology that provides patient-specific treatment recommendations and drug dose calculations. The AA is publicly available at http://projects.met-hilab.org/antidote/.
Predicting Breast Cancer Outcome under Different Treatments by Feature Selection Approaches
H. Pham, L. Rueda, A. Ngom
DOI: 10.1145/3107411.3108226

Gene expression data have been used in many studies to help reveal the underlying mechanisms of disease. In this study, we applied feature selection techniques to data from breast cancer patients in the METABRIC study to predict whether patients will remain disease-free under different treatments. Our prediction models perform well; the genes they select may therefore help reveal the mechanism of the disease, and these potential biomarkers could become targets for new therapies.
SeqyClean: A Pipeline for High-throughput Sequence Data Preprocessing
I. Zhbannikov, Samuel S. Hunter, J. Foster, M. Settles
DOI: 10.1145/3107411.3107446

Modern high-throughput sequencing instruments produce massive amounts of data, which often contain noise in the form of sequencing errors, sequencing adaptors, and contaminating reads. This noise complicates genomics studies. Although many preprocessing software tools have been developed to reduce sequence noise, many of them cannot handle data from multiple technologies, and few address more than one type of noise. We present SeqyClean, a comprehensive preprocessing software pipeline. SeqyClean effectively removes multiple sources of noise in high-throughput sequence data and, according to our tests, outperforms other available preprocessing tools. We show that preprocessing data with SeqyClean prior to analysis improves both de novo genome assembly and genome mapping. We have used SeqyClean extensively in the genomics core at the Institute for Bioinformatics and Evolutionary STudies (IBEST) at the University of Idaho, so it has been validated with both test and production data. SeqyClean is available as open-source software under the MIT License at http://github.com/ibest/seqyclean
Integrative Sufficient Dimension Reduction Methods for Multi-Omics Data Analysis
Yashita Jain, Shanshan Ding
DOI: 10.1145/3107411.3108225

With the advent of high-throughput genome-wide assays, it has become possible to measure multiple types of genomic data simultaneously. Several projects, such as TCGA, ICGC, and NCI-60, have generated comprehensive, multi-dimensional maps of key genomic changes (e.g., miRNA, mRNA, and protein expression) in cancer samples [2,4]. These genomic data can be used to classify tumor types [5]. Integrative analysis of such data from multiple sources can potentially provide additional biological insights, but methods for such analysis are lacking. One widely used way to handle high-dimensional data is to remove redundant information from the integrated sample: much of the expression information overlaps and can be projected onto a lower-dimensional space, which can then be used to classify tumor types with little or no loss of information. Sufficient dimension reduction (SDR) [1], a supervised dimension reduction approach, is well suited to this goal. In this paper, we propose a novel integrative SDR method that reduces the dimensions of multiple data types simultaneously while sharing common latent structures, to improve prediction and interpretation. In particular, we extend the sliced inverse regression (SIR) technique, a major SDR method, to integrate multiple omics data types for simultaneous dimension reduction. SIR is a supervised dimension reduction method that assumes the outcome variable Y depends on the predictor variable X through d unknown linear combinations of the predictors [3]. The predictor is replaced by its projection onto a lower-dimensional subspace of the predictor space without loss of information. The aim is to find the central subspace (CS), the intersection of all subspaces δ of the predictor space satisfying Y ⫫ X | P_δX. To integrate multiple types of data, we propose and implement a new integrative sufficient dimension reduction method extending SIR [3], called integrative SIR. The main idea is to take all the multi-omics data into account simultaneously while finding a basis matrix for each data type with some shared latent structures. Finally, we obtain d-dimensional data, where d is much smaller than the original dimension; the reduced dimension d is chosen by cross-validation. To demonstrate integrated analysis of multi-omics data, we applied and compared conventional SIR and integrative SIR on mRNA, miRNA, and proteomic expression profiles of a subset of cell lines from the NCI-60 panel. The data are taken from [6]. The outcome classes are CNS, leukemia, and melanoma tumor types. We pre-screened 400 variables from each data type by the criterion of high variance. To estimate classification error, we performed random forest classification after applying each method, with leave-one-out cross-validation. We found that integrative SIR yields a lower classification error than conventional SIR. To summarize, we propo
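For readers unfamiliar with SIR, here is a compact sketch of the conventional (non-integrative) version that the paper extends: slice the samples by the outcome and take the top eigenvectors of the between-slice covariance of the whitened predictors. The data are simulated, and the small ridge term is an implementation convenience, not part of the paper's method.

```python
import numpy as np

def sir_directions(X, y, d):
    """Conventional SIR for a categorical outcome: top eigenvectors of the
    between-slice covariance of the whitened predictors."""
    n, p = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(p)   # ridge: numerical aid
    vals, vecs = np.linalg.eigh(cov)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T   # cov^{-1/2}
    Z = (X - X.mean(axis=0)) @ inv_sqrt                # whitened predictors
    M = np.zeros((p, p))
    for c in np.unique(y):                             # one slice per class
        m = Z[y == c].mean(axis=0)
        M += np.mean(y == c) * np.outer(m, m)
    _, evecs = np.linalg.eigh(M)                       # ascending eigenvalues
    return inv_sqrt @ evecs[:, -d:]                    # back to original scale

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))                 # stand-in "omics" features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # outcome via one linear index
B = sir_directions(X, y, d=1)
print(B[:4, 0])  # loads mainly on coordinates 0 and 1, the true index
```
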
Nirvana: Clinical Grade Variant Annotator
Michael P. Strömberg, R. Roy, J. Lajugie, Yu Jiang, Haochen Li, E. Margulies
DOI: 10.1145/3107411.3108204

Sequencing an individual genome typically produces approximately three million variants relative to the human reference genome. The consequence of each variant depends on its location and nature, and is a key question for genetic analysts performing clinical diagnosis. Variant annotation describes how a variant affects the sample's genome. These annotations include the functional consequence on the different transcripts of a gene or on proximal regulatory regions. Annotation also includes additional data on what is known about a given variant that can help in understanding its relevance to a given line of investigation. These data are often provided by different sources and contain allele frequencies for different populations, clinical implications, relevance to cancer types, additional studies, etc. Ultimately, this information helps clinicians interpret variants when providing a diagnosis. The three most widely used open-source annotation tools are VEP, SnpEff, and AnnoVar. VEP is widely considered the most accurate of the three, but it is also slower than both SnpEff and AnnoVar: when annotating the variants from a 30x genome (NA12878), VEP finished in 18 hours, whereas SnpEff 4.3g and AnnoVar finished in 15 min and 67 min, respectively, using one core. We present Nirvana, an open-source clinical variant annotator that is both accurate (over 99.9% concordance with VEP) and fast (7 min to annotate NA12878). Nirvana is used in all of Illumina's relevant analysis pipelines and is tested rigorously to ensure adherence to clinical standards.
Analysis of Single Cells on a Pseudotime Scale along Postnatal Pancreatic Beta Cell Development
F. Mulas, Chun Zeng, Yinghui Sui, Tiffany Guan, Nathanael Miller, Yuliang Tan, Fenfen Liu, Wen Jin, Andrea C. Carrano, M. Huising, O. Shirihai, Gene W. Yeo, M. Sander
DOI: 10.1145/3107411.3107458

Single-cell RNA-seq generates gene expression profiles of individual cells and has furthered our understanding of the developmental and cellular hierarchy within complex tissues. One computational challenge in analyzing single-cell data sets is reconstructing the progression of individual cells with respect to the gradual transition of their transcriptomes. While a number of single-cell ordering tools have been proposed, many of them require knowledge of progression markers or time delineators. Here, we adapted an algorithm previously developed for temporally ordering bulk microarray samples [1] to reconstruct the postnatal developmental trajectory of pancreatic beta cells. To accomplish this, we applied a multi-step pipeline to analyze single-cell RNA-seq data sets from beta cells isolated at five time points between birth and post-weaning. Specifically, we (i) ordered cells along a linear trajectory (the Pseudotime Scale) by applying one-dimensional principal component analysis to the normalized data matrix; (ii) identified annotated and de novo gene sets significantly regulated along the trajectory; (iii) built a network of top-regulated genes using protein interaction repositories; and (iv) scored genes for their network connectivity to transcription factors [2]. A systematic comparison showed that our approach ordered the cells in our data set more accurately than previously reported methods and allowed direct comparisons with external data sets. Importantly, our analysis revealed previously unobserved changes in beta-cell metabolism and in levels of mitochondrial reactive oxygen species, and we demonstrated experimentally a role for these changes in the regulation of postnatal beta-cell proliferation. Our pipeline identified maturation-related changes in gene expression that are not captured when evaluating bulk gene expression data across the same developmental time course. The proposed methodology has broad applicability beyond the context described here and could be used to examine the trajectories of other cell types along a continuous course of cell state changes.
Investigating Rigidity Properties of Protein Cavities
Stephanie Mason, T. Woods, B. Chen, F. Jagodzinski
DOI: 10.1145/3107411.3107502

Cavities in proteins facilitate a variety of biochemical processes. The shapes and sizes of cavities are factors that contribute to specificity in ligand binding and docking with other biomolecules. A deep understanding of cavity properties may enable new insights into protein-protein interactions, ligand binding, and structure-based drug design. In this work, we explore how biological properties of protein cavities, such as size and residue membership, correlate with the flexibility of the cavity as computed by an efficient graph-theoretic rigidity algorithm. We hypothesize that various rigidity properties of protein cavities depend on cavity surface area. We enumerate a set of cavity rigidity metrics and demonstrate their use in characterizing over 120,000 cavities from approximately 2,500 protein chains. We show that cavity size indeed correlates with some -- but not all -- cavity rigidity metrics.
Bias and Noise Cancellation for Robust Copy Number Variation Detection
Fatima Zare, Sardar Ansari, K. Najarian, S. Nabavi
DOI: 10.1145/3107411.3108199

High-throughput next-generation sequencing (NGS) technologies have created an opportunity to detect copy number variations (CNVs) more accurately. In this work, we introduce a novel preprocessing pipeline to improve the detection accuracy of CNVs in heterogeneous NGS data, such as cancer whole-exome sequencing data. We employ several normalizations to reduce biases due to GC content, mappability, and tumor contamination. We also utilize the taut string method as an efficient and effective smoothing approach to reduce noise.