Efficient Feature Set for Spam Email Filtering
Reshma Varghese, K. Dhanya
2017 IEEE 7th International Advance Computing Conference (IACC)
DOI: 10.1109/IACC.2017.0152
Citations: 8
Abstract
Spam is one of the major problems affecting the quality of Internet services, especially electronic mail. Classifying emails into spam and ham without misclassification is the focus of this study. The objective is to find the best feature set for spam email filtering. To this end, four categories of features are extracted: Bag-of-Words (BoW), bigram Bag-of-Words, part-of-speech (PoS) tags, and bigram PoS tags. Rare features are eliminated based on their Naive Bayes score. Information Gain is used as the feature selection technique, and a feature occurrence matrix is constructed and weighted by term frequency-inverse document frequency (TF-IDF) values. Singular Value Decomposition (SVD) is used as the matrix factorization technique. AdaBoostJ48, Random Forest, and a popular linear Support Vector Machine (SVM) called Sequential Minimal Optimization (SMO) are used as classifiers for model generation. Experiments are carried out on individual feature models as well as ensemble models. An ROC value of 1 and a false positive rate (FPR) of 0 were obtained for both the individual feature models and the ensemble models.
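The feature-selection and weighting steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes whitespace tokenization, treats each feature as a binary "occurs in the email" event when computing Information Gain, and uses length-normalized term frequency with a natural-log IDF. The four toy emails, the function names, and the cutoff of four features are all hypothetical. PoS tagging, the Naive Bayes rare-feature filter, SVD, and the classifiers are omitted.

```python
import math
from collections import Counter

def tokens_with_bigrams(text):
    """Lowercased whitespace tokens plus adjacent-token bigrams (assumed tokenizer)."""
    words = text.lower().split()
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(docs, labels, feature):
    """Class-entropy reduction from knowing whether `feature` occurs in a document."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    n = len(labels)
    remainder = ((len(present) / n) * entropy(present)
                 + (len(absent) / n) * entropy(absent))
    return entropy(labels) - remainder

def tfidf_matrix(docs, vocabulary):
    """Document-by-feature matrix weighted by tf * ln(N / df) (assumed variant)."""
    n = len(docs)
    doc_freq = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    matrix = []
    for d in docs:
        counts = Counter(d)
        length = len(d) or 1  # avoid division by zero for empty documents
        matrix.append([
            (counts[t] / length) * math.log(n / doc_freq[t]) if doc_freq[t] else 0.0
            for t in vocabulary
        ])
    return matrix

# Hypothetical toy corpus, made up for illustration only.
emails = [
    "win free money now",
    "free prize win now",
    "meeting agenda for monday",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

docs = [tokens_with_bigrams(e) for e in emails]
vocabulary = sorted({t for d in docs for t in d})

# Rank features by Information Gain (ties broken alphabetically), keep the top four.
ranked = sorted(vocabulary, key=lambda t: (-information_gain(docs, labels, t), t))
selected = ranked[:4]

# TF-IDF weighted feature occurrence matrix over the selected features.
weights = tfidf_matrix(docs, selected)
```

On this toy corpus, the features that occur in every email of one class and in none of the other receive the maximum Information Gain of 1 bit. A feature such as "money", which appears in only one spam email, scores lower. In a fuller pipeline the weighted matrix would then go to a matrix factorization step and on to a classifier.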