{"title":"基于网络广告用户点击数据的特征提取与累积选择,实现虚假发布者自动分类","authors":"D. Sisodia, Dilip Singh Sisodia","doi":"10.1108/dta-09-2021-0233","DOIUrl":null,"url":null,"abstract":"PurposeThe problem of choosing the utmost useful features from hundreds of features from time-series user click data arises in online advertising toward fraudulent publisher's classification. Selecting feature subsets is a key issue in such classification tasks. Practically, the use of filter approaches is common; however, they neglect the correlations amid features. Conversely, wrapper approaches could not be applied due to their complexities. Moreover, in particular, existing feature selection methods could not handle such data, which is one of the major causes of instability of feature selection.Design/methodology/approachTo overcome such issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing the publisher's fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where we enumerated an accumulated evaluation of relevant feature subset to search for an optimal feature subset using effective machine learning (ML) models.FindingsEmpirical results prove enhanced classification performance with proposed features in average precision, recall, f1-score and AUC in publisher identification and classification.Originality/valueThe FDAS is evaluated on FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics, first, considering original features, second, with relevant feature subsets selected by feature selection (FS) methods, third, with optimal feature subset obtained by the proposed approach. ANOVA significance test is conducted to demonstrate significant differences between independent features.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"101 1","pages":"602-625"},"PeriodicalIF":1.7000,"publicationDate":"2022-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Feature distillation and accumulated selection for automated fraudulent publisher classification from user click data of online advertising\",\"authors\":\"D. Sisodia, Dilip Singh Sisodia\",\"doi\":\"10.1108/dta-09-2021-0233\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"PurposeThe problem of choosing the utmost useful features from hundreds of features from time-series user click data arises in online advertising toward fraudulent publisher's classification. Selecting feature subsets is a key issue in such classification tasks. Practically, the use of filter approaches is common; however, they neglect the correlations amid features. Conversely, wrapper approaches could not be applied due to their complexities. Moreover, in particular, existing feature selection methods could not handle such data, which is one of the major causes of instability of feature selection.Design/methodology/approachTo overcome such issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing the publisher's fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where we enumerated an accumulated evaluation of relevant feature subset to search for an optimal feature subset using effective machine learning (ML) models.FindingsEmpirical results prove enhanced classification performance with proposed features in average precision, recall, f1-score and AUC in publisher identification and classification.Originality/valueThe FDAS is evaluated on FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics, first, considering original features, second, with relevant feature subsets selected by feature selection (FS) methods, third, with optimal feature subset obtained by the proposed approach. ANOVA significance test is conducted to demonstrate significant differences between independent features.\",\"PeriodicalId\":56156,\"journal\":{\"name\":\"Data Technologies and Applications\",\"volume\":\"101 1\",\"pages\":\"602-625\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2022-01-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data Technologies and Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1108/dta-09-2021-0233\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Technologies and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1108/dta-09-2021-0233","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Feature distillation and accumulated selection for automated fraudulent publisher classification from user click data of online advertising
PurposeThe problem of choosing the utmost useful features from hundreds of features from time-series user click data arises in online advertising toward fraudulent publisher's classification. Selecting feature subsets is a key issue in such classification tasks. Practically, the use of filter approaches is common; however, they neglect the correlations amid features. Conversely, wrapper approaches could not be applied due to their complexities. Moreover, in particular, existing feature selection methods could not handle such data, which is one of the major causes of instability of feature selection.Design/methodology/approachTo overcome such issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing the publisher's fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where we enumerated an accumulated evaluation of relevant feature subset to search for an optimal feature subset using effective machine learning (ML) models.FindingsEmpirical results prove enhanced classification performance with proposed features in average precision, recall, f1-score and AUC in publisher identification and classification.Originality/valueThe FDAS is evaluated on FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics, first, considering original features, second, with relevant feature subsets selected by feature selection (FS) methods, third, with optimal feature subset obtained by the proposed approach. ANOVA significance test is conducted to demonstrate significant differences between independent features.