Classification of Low- and High-Entropy File Fragments Using Randomness Measures and Discrete Fourier Transform Coefficients
K. Skracic, J. Petrović, P. Pale
Vietnam Journal of Computer Science, published 2023-07-28. DOI: 10.1142/s2196888823500070
This paper proposes new features for file fragment classification and evaluates them on a dataset that includes both low- and high-entropy file fragments. High-entropy fragments, which belong to compressed and encrypted files, are particularly challenging to classify because they lack exploitable patterns. To address this challenge, the proposed feature vectors are built from the byte frequency distribution (BFD) of each file fragment, together with discrete Fourier transform coefficients and several randomness measures. These feature vectors are tested with three machine learning models: support vector machines (SVMs), artificial neural networks (ANNs), and random forests (RFs). The approach is evaluated on the govdocs1 dataset, which is freely available and widely used in this field, to enable reproducibility and fair comparison with other published research. The results show that the proposed approach outperforms existing methods, achieving better classification accuracy for both low- and high-entropy file fragments.
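To make the shape of such a feature vector concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not say which randomness measures are used, what signal the discrete Fourier transform is applied to, or how many coefficients are kept. The sketch therefore assumes Shannon entropy and a chi-square statistic against a uniform byte distribution as stand-in randomness measures, takes the DFT over the raw byte sequence of the fragment, and keeps the magnitudes of the first 8 non-DC coefficients. All function names and the `num_coeffs` parameter are hypothetical.

```python
import cmath
import math
import os
from collections import Counter


def byte_frequency_distribution(fragment: bytes) -> list[float]:
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(fragment)
    n = len(fragment)
    return [counts.get(b, 0) / n for b in range(256)]


def shannon_entropy(bfd: list[float]) -> float:
    """Entropy in bits per byte, between 0.0 and 8.0 (assumed randomness measure)."""
    return -sum(p * math.log2(p) for p in bfd if p > 0)


def chi_square_uniform(fragment: bytes) -> float:
    """Chi-square statistic against a uniform byte distribution (assumed measure)."""
    counts = Counter(fragment)
    expected = len(fragment) / 256
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(256))


def dft_magnitudes(fragment: bytes, num_coeffs: int = 8) -> list[float]:
    """Magnitudes of DFT coefficients 1..num_coeffs of the byte sequence.

    Assumption: the transform is taken over the raw bytes; the abstract does
    not specify the input signal. A direct sum is used instead of an FFT
    because only a few coefficients are needed.
    """
    n = len(fragment)
    magnitudes = []
    for k in range(1, num_coeffs + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(fragment)
        )
        magnitudes.append(abs(acc) / n)
    return magnitudes


def feature_vector(fragment: bytes, num_coeffs: int = 8) -> list[float]:
    """Concatenate BFD, DFT magnitudes, and the two randomness measures."""
    if not fragment:
        raise ValueError("fragment must not be empty")
    bfd = byte_frequency_distribution(fragment)
    return (
        bfd
        + dft_magnitudes(fragment, num_coeffs)
        + [shannon_entropy(bfd), chi_square_uniform(fragment)]
    )


# Usage: a 4096-byte fragment of random data stands in for a high-entropy fragment.
features = feature_vector(os.urandom(4096))
print(len(features))  # 256 BFD values + 8 DFT magnitudes + 2 randomness measures
```

A vector of this form could then be passed to any of the three model families named in the abstract. Feature scaling would likely matter for SVMs and ANNs, since the chi-square value is on a very different scale from the BFD entries.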