Classification and Prediction of Antimicrobial Peptides Using N-gram Representation and Machine Learning
M. Othman, Sujay Ratna, Anant Tewari, Anthony M. Kang, Katherine Du, I. Vaisman
Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2017-08-20
DOI: https://doi.org/10.1145/3107411.3108215
Citations: 1
Abstract
Current antibiotic treatments for infectious diseases are drastically losing effectiveness, as the organisms they target have developed resistance to the drugs over time. In the United States, antibiotic-resistant bacterial infections result in more than 23,000 deaths annually, and morbidity rates are much higher. A promising alternative to current antibiotic treatments is antimicrobial peptides (AMPs), short sequences of amino acid residues that have been shown experimentally to inhibit the propagation of pathogens. In this study, we demonstrate that an N-gram representation of AMP sequences over a reduced amino acid alphabet, combined with machine learning (ML) methods, provides simple and efficient AMP classification with performance comparable to that of more complex algorithms. All AMP sequences were retrieved from public data sources. Our AMP set consists of 7760 sequences, regardless of AMP subclass. We also used class-specific AMP sets (antibacterial, antiviral, antifungal, and antiparasitic). We created a raw negative set of 20258 non-antimicrobial peptides (non-AMPs) using sequence fragments from annotated protein sequence databases. Models classifying all AMP sequences against non-AMP sequences achieved a maximum accuracy of 85.0% using frequency N-gram analysis and a Random Forest (RF) model with 10-fold cross-validation. The datasets ranged from 200 to 7760 sequences per class. Classification using more specific classes of AMPs was conducted next. Classification of antibacterial peptides (ABPs) against non-ABP sequences achieved an accuracy of up to 100%, depending on the ML algorithm and alphabet reduction used. Classification of ABPs against antiviral peptides (AVPs) yielded a maximum accuracy of 81.8%, AVPs against non-AVP sequences 80.7%, and AVPs against antifungal peptides (AFPs) 80.5%. Several trends recur across the experiment series. Random Forest frequently outperforms the other algorithms. The optimal size of the reduced alphabet is either 3 or 4 letters: reduction to 2 letters leads to a significant drop in accuracy, while reduction to 5 or more letters does not provide any noticeable gain in classification accuracy. The results of this study indicate that N-gram based classification of AMPs is a promising approach with strong potential for providing important insights into AMP mechanisms and for computationally designing new AMPs.
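The feature construction described in the abstract (reduce the 20-letter amino acid alphabet, then count N-grams) can be illustrated with a minimal Python sketch. The abstract does not give the actual residue groupings, the value of N, or the normalization, so the 4-letter grouping `REDUCED_GROUPS`, the default `n=2`, and the function names below are hypothetical choices made only for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import product

# Hypothetical 4-letter reduction by rough physicochemical class.
# The groupings used in the paper are not stated in the abstract.
REDUCED_GROUPS = {
    "h": "AVLIMFWCP",  # hydrophobic
    "p": "STNQGY",     # polar, uncharged
    "b": "KRH",        # basic
    "a": "DE",         # acidic
}
AA_TO_GROUP = {aa: g for g, aas in REDUCED_GROUPS.items() for aa in aas}


def reduce_sequence(seq):
    """Map each standard residue to its group letter; drop anything else."""
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper() if aa in AA_TO_GROUP)


def ngram_frequencies(seq, n=2):
    """Relative frequency of every possible N-gram, in a fixed sorted order.

    The vector length is len(REDUCED_GROUPS) ** n, so every peptide maps
    to a feature vector of the same size regardless of its length.
    """
    reduced = reduce_sequence(seq)
    alphabet = sorted(REDUCED_GROUPS)
    vocab = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    total = sum(counts.values())
    # Sequences shorter than n yield an all-zero vector.
    return [counts[g] / total if total else 0.0 for g in vocab]
```

Calling `ngram_frequencies(peptide, n=2)` on each sequence gives a 16-dimensional vector under this assumed alphabet. Such vectors could then be passed to any standard classifier, for example scikit-learn's `RandomForestClassifier` evaluated with `cross_val_score(..., cv=10)`, to mirror the RF with 10-fold cross-validation setup mentioned above.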