Xinjun Zhang, Dharanesh Gangaiah, Robert S Munson, Stanley M Spinola, Yunlong Liu
Correcting imbalanced reads coverage in bacterial transcriptome sequencing with extreme deep coverage.
International Journal of Computational Biology and Drug Design, 7(2-3), 195-213, 2014 (Epub 28 May 2014). DOI: 10.1504/IJCBDD.2014.061646, https://doi.org/10.1504/IJCBDD.2014.061646
Abstract
High-throughput bacterial RNA-Seq experiments can generate extremely high and imbalanced sequencing coverage. Over- or under-estimation of gene expression levels hinders accurate differential gene expression analysis. Here we evaluated strategies for identifying expression differences of genes with high coverage in bacterial transcriptome data, using either raw sequence reads or unique reads with duplicate fragments removed. In addition, we proposed a generalised linear model (GLM)-based approach to identify imbalance in read coverage based on sequence composition. Our results show that analysis using raw reads identifies more differentially expressed genes, with more accurate fold changes, than analysis using unique reads. We also demonstrate the presence of sequence-composition-related biases that are independent of gene expression levels and experimental conditions. Finally, genes that still showed strong coverage imbalance after correction were tagged using a statistical approach.
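One way to see why unique reads might distort fold changes at extreme depth is a small back-of-the-envelope sketch. It is not the authors' pipeline. It assumes, purely for illustration, that read start positions fall uniformly at random along a gene and that a "duplicate" is any read sharing a start position with another read. The function names, the 1 kb gene, and the 4-fold difference are all hypothetical. Under these assumptions, the number of unique reads is capped by the number of possible start positions, so the unique-read fold change shrinks toward 1 as coverage grows, while the raw-read fold change is unaffected.

```python
def expected_unique_reads(n_reads, n_positions):
    """Expected number of distinct start positions hit by n_reads reads.

    Assumption (illustrative only): starts are uniform over n_positions,
    and reads with the same start position count as duplicates.
    """
    return n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)


def fold_changes(n_a, n_b, n_positions):
    """Return (raw-read fold change, expected unique-read fold change)
    for a gene with n_a reads in condition A and n_b reads in condition B."""
    raw = n_b / n_a
    unique = (expected_unique_reads(n_b, n_positions)
              / expected_unique_reads(n_a, n_positions))
    return raw, unique


if __name__ == "__main__":
    # Hypothetical 1 kb gene with a true 4-fold difference between conditions,
    # evaluated at increasing sequencing depth.
    for n_a in (10, 100, 1_000, 10_000):
        raw, unique = fold_changes(n_a, 4 * n_a, n_positions=1_000)
        print(f"reads in A={n_a:>6}  raw FC={raw:.2f}  unique-read FC={unique:.2f}")
```

At low depth the two estimates nearly agree. Once the read count approaches the number of available start positions, duplicate removal discards mostly genuine signal. Real data are not uniform, which is where a sequence-composition model such as the GLM mentioned in the abstract would come in.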