Predicting functional impact of single amino acid polymorphisms by integrating sequence and structural features

Mingjun Wang, Hongbin Shen, T. Akutsu, Jiangning Song
2011 IEEE International Conference on Systems Biology (ISB). Published 2011-10-03. DOI: 10.1109/ISB.2011.6033115 (https://doi.org/10.1109/ISB.2011.6033115)
Citations: 3

Abstract

Single amino acid polymorphisms (SAPs) are the most abundant form of known genetic variation associated with human diseases, and the sequence-structure-function relationship underlying them is of great interest. In this work, we collected human variant data from three databases and divided them into three categories: cancer somatic mutations (CSM), Mendelian disease-related variants (SVD) and neutral polymorphisms (SVP). We built support vector machine (SVM) classifiers to predict these three classes of SAPs, using the optimal features selected by a random forest algorithm. In total, 280 sequence-derived and structural features were initially extracted from the curated datasets, from which 18 optimal candidate features were then selected by random forest. We further performed a stepwise feature selection to identify the characteristic sequence and structural features that are important for predicting each SAP class. As a result, our predictors achieved prediction accuracies (ACC) of 84.97, 96.93, 86.98 and 88.24% for the three classes, CSM, SVD and SVP, respectively. A performance comparison with previously developed tools such as SIFT, SNAP and PolyPhen-2 indicates that our method performs favorably, with higher sensitivity scores and Matthews correlation coefficients (MCC). These results indicate that the prediction performance of SAP classifiers can be effectively improved by feature selection. Moreover, dividing SAPs into three categories and constructing an accurate SVM-based classifier for each class provides a practically useful way to investigate the differences between Mendelian disease-related variants and cancer somatic mutations.
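The abstract reports accuracy (ACC), sensitivity and the Matthews correlation coefficient (MCC) without defining them. The sketch below shows the standard binary definitions, computed from a confusion matrix with TP, FP, TN and FN counts: ACC = (TP + TN) / N, sensitivity = TP / (TP + FN), and MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function names and the toy labels are hypothetical and are not taken from the paper. Treating one class as "positive" against another is also an assumption for illustration, since the abstract does not say how the three classes were paired for evaluation.

```python
import math


def confusion_counts(y_true, y_pred, positive):
    """Count (tp, fp, tn, fn), treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, mcc) for one binary confusion matrix.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sens, mcc


# Hypothetical toy labels (not data from the paper): SVD vs. SVP.
y_true = ["SVD", "SVD", "SVD", "SVP", "SVP", "SVP", "SVP", "SVD"]
y_pred = ["SVD", "SVD", "SVP", "SVP", "SVP", "SVD", "SVP", "SVD"]

counts = confusion_counts(y_true, y_pred, positive="SVD")
acc, sens, mcc = binary_metrics(*counts)
print(f"ACC={acc:.2f} sensitivity={sens:.2f} MCC={mcc:.2f}")
```

MCC uses all four cells of the confusion matrix, so it is less affected by class imbalance than accuracy alone. That makes it a common companion metric when disease-related and neutral variants are unevenly represented.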