A PDF Malware Detection Method Using Extremely Small Training Sample Size
Ran Liu, Cynthia Matuszek, Charles Nicholas
Proceedings of the ACM Symposium on Document Engineering 2023. Published 2023-08-22.
DOI: https://doi.org/10.1145/3573128.3609352
Machine learning-based methods for PDF malware detection have grown in popularity because of their high accuracy. However, many well-known ML-based detectors require a large number of specimen features to be collected before they can make a decision, which can be time-consuming. In this study, we present a novel, distance-based method for detecting PDF malware. Notably, our approach needs significantly less training data than traditional machine learning or neural network models. We evaluated our method on the Contagio dataset and found that it detects 90.50% of malware samples with only 20 benign PDF files used for model training. To indicate the statistical reliability of these results, we report them with 95% confidence intervals (CIs). We evaluated the model's performance across multiple metrics: accuracy, F1 score, precision, and recall, alongside the false positive, false negative, true positive, and true negative rates. This paper highlights the feasibility of distance-based methods for PDF malware detection, even with limited training data, and offers a promising direction for future research.
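The abstract does not say which features, distance function, or decision rule the method uses, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes each PDF has already been reduced to a numeric feature vector (hypothetical), uses Euclidean distance (an assumption), and flags a sample when its nearest benign training vector is farther away than a threshold derived from the benign set itself (also an assumption, including the `slack` parameter). The helpers `confusion_metrics` and `mean_ci95` show the standard definitions of the reported metrics and one common way to form a 95% confidence interval over repeated runs (a normal approximation, which may differ from the paper's procedure).

```python
import math
from statistics import mean, stdev


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_threshold(benign, slack=1.5):
    """Assumed rule: slack times the largest nearest-neighbour distance
    found among the benign training vectors (needs at least two vectors)."""
    nearest = [
        min(euclidean(v, w) for j, w in enumerate(benign) if j != i)
        for i, v in enumerate(benign)
    ]
    return slack * max(nearest)


def is_malicious(sample, benign, threshold):
    """Flag a sample whose closest benign neighbour is farther than threshold."""
    return min(euclidean(sample, b) for b in benign) > threshold


def _ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def confusion_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (malware = positive)."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # recall is the true positive rate
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "fpr": _ratio(fp, fp + tn),
        "fnr": _ratio(fn, fn + tp),
        "tpr": recall,
        "tnr": _ratio(tn, tn + fp),
    }


def mean_ci95(values):
    """Normal-approximation 95% CI for the mean of repeated-run scores."""
    m = mean(values)
    half = 1.96 * stdev(values) / math.sqrt(len(values))
    return m - half, m + half


if __name__ == "__main__":
    # Toy 2-D vectors standing in for features of benign PDFs (made up).
    benign = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    threshold = fit_threshold(benign)
    print(is_malicious((0.5, 0.5), benign, threshold))
    print(is_malicious((5.0, 5.0), benign, threshold))
    print(confusion_metrics(tp=90, fp=5, tn=95, fn=10))
```

Because a detector of this kind trains on benign samples only, the size of the malicious training set is zero by construction. The threshold is the only fitted quantity, so the choice of `slack` directly trades false positives against false negatives.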