A Sequence Data Mining Protocol to Identify Best Representative Sequence for Protein Domain Families
V. S. Gowri, K. Shameer, C. C. S. Reddy, P. Shingate, R. Sowdhamini
2010 IEEE International Conference on Data Mining Workshops, 2010-12-13
DOI: 10.1109/ICDMW.2010.153 (https://doi.org/10.1109/ICDMW.2010.153)
Citations: 2
Abstract
Protein domains are compact, evolutionarily conserved units of proteins that can be used to assign function to the large number of gene products arising from whole-genome sequencing projects. Homology, inferred from sequence similarity, is usually the basis for transferring functional annotation from pre-existing domain families to gene products. In such large-scale data mining, sequence analysis protocols are driven by a reference sequence for each family, which is used in homology searches to reduce computational time. Because protein domain families are diverse, it is important to identify the single best representative sequence of a family using a well-defined, reproducible bioinformatics protocol. We report a new bioinformatics protocol for identifying the best representative sequence (BRS) of a protein domain family. The method is based on a “coverage analysis” score computed with three different sequence search programs, and we assess the trends in the BRS reported by each. The highest average coverage for BRSs was 66%, obtained when searching with Hidden Markov Models. Further, when searching large sequence databases, it is crucial to select a BRS specific to the sequence search method in use.
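The abstract does not define the coverage score, so the following is only a minimal sketch under a stated assumption: that the coverage of a query sequence is the fraction of the other family members it retrieves in a search, and that the BRS for a given search method is the member with the highest coverage. The function names, the method labels (`method_1`, `method_2`), and the toy hit lists are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple


def coverage(query: str, hits: Set[str], family: Set[str]) -> float:
    """Assumed definition: fraction of the other family members found by the query."""
    others = family - {query}
    if not others:
        return 0.0
    # Hits outside the family are ignored; only retrieved family members count.
    return len(hits & others) / len(others)


def best_representative(
    hits_by_query: Dict[str, Set[str]], family: Set[str]
) -> Tuple[str, Dict[str, float]]:
    """Return the member with the highest coverage, plus all coverage scores."""
    # Iterate in sorted order so that ties are broken deterministically.
    scores = {
        q: coverage(q, hits_by_query.get(q, set()), family) for q in sorted(family)
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical family and hypothetical search results for two search methods.
family = {"seqA", "seqB", "seqC", "seqD"}
hits_by_method = {
    "method_1": {
        "seqA": {"seqB", "seqC", "seqD", "unrelated1"},
        "seqB": {"seqA"},
        "seqC": {"seqA", "seqB"},
        "seqD": set(),
    },
    "method_2": {
        "seqA": {"seqB"},
        "seqB": {"seqA", "seqC", "seqD"},
        "seqC": {"seqD"},
        "seqD": {"seqC"},
    },
}

# One BRS per search method, since the best query can differ between methods.
brs_by_method = {
    method: best_representative(hits, family)[0]
    for method, hits in hits_by_method.items()
}
print(brs_by_method)
```

In this toy data the two methods pick different representatives, which illustrates why a BRS may need to be chosen separately for each search method.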