{"title":"在大型语料库中识别法医上不感兴趣的文件","authors":"N. Rowe","doi":"10.4108/eai.8-12-2016.151725","DOIUrl":null,"url":null,"abstract":"For digital forensics, eliminating the uninteresting is often more critical than finding the interesting since there is so much more of it. Published software-file hash values like those of the National Software Reference Library (NSRL) have limited scope. We discuss methods based on analysis of file context using the metadata of a large corpus. Tests were done with an international corpus of 262.7 million files obtained from 4018 drives. For malware investigations, we identify clues to malware in context, and show that using a Bayesian ranking formula on metadata can increase recall by 5.1 while increasing precision by 1.7 times over inspecting executables alone. For more general investigations, we show that using together two of nine criteria for uninteresting files, with exceptions for some special interesting files, can exclude 77.4% of our corpus instead of the 23.8% that were excluded by NSRL. For a test set of 19,784 randomly selected files from our corpus that were manually inspected, false positives after file exclusion (interesting files identified as uninteresting) were 0.18% and false negatives (uninteresting files identified as interesting) were 29.31% using our methods. The generality of the methods was confirmed by separately testing two halves of our corpus. Few of our excluded files were matched in two commercial hash sets. This work provides both new uninteresting hash values and programs for finding more.","PeriodicalId":335727,"journal":{"name":"EAI Endorsed Trans. Security Safety","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Identifying forensically uninteresting files in a large corpus\",\"authors\":\"N. Rowe\",\"doi\":\"10.4108/eai.8-12-2016.151725\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For digital forensics, eliminating the uninteresting is often more critical than finding the interesting since there is so much more of it. Published software-file hash values like those of the National Software Reference Library (NSRL) have limited scope. We discuss methods based on analysis of file context using the metadata of a large corpus. Tests were done with an international corpus of 262.7 million files obtained from 4018 drives. For malware investigations, we identify clues to malware in context, and show that using a Bayesian ranking formula on metadata can increase recall by 5.1 while increasing precision by 1.7 times over inspecting executables alone. For more general investigations, we show that using together two of nine criteria for uninteresting files, with exceptions for some special interesting files, can exclude 77.4% of our corpus instead of the 23.8% that were excluded by NSRL. For a test set of 19,784 randomly selected files from our corpus that were manually inspected, false positives after file exclusion (interesting files identified as uninteresting) were 0.18% and false negatives (uninteresting files identified as interesting) were 29.31% using our methods. The generality of the methods was confirmed by separately testing two halves of our corpus. Few of our excluded files were matched in two commercial hash sets. 
This work provides both new uninteresting hash values and programs for finding more.\",\"PeriodicalId\":335727,\"journal\":{\"name\":\"EAI Endorsed Trans. Security Safety\",\"volume\":\"18 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-12-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"EAI Endorsed Trans. Security Safety\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4108/eai.8-12-2016.151725\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"EAI Endorsed Trans. Security Safety","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4108/eai.8-12-2016.151725","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Identifying forensically uninteresting files in a large corpus
For digital forensics, eliminating the uninteresting is often more critical than finding the interesting, since there is so much more of it. Published software-file hash values, such as those of the National Software Reference Library (NSRL), have limited scope. We discuss methods based on analysis of file context using the metadata of a large corpus. Tests were done on an international corpus of 262.7 million files obtained from 4,018 drives. For malware investigations, we identify clues to malware in context, and show that applying a Bayesian ranking formula to metadata can increase recall by 5.1 times and precision by 1.7 times over inspecting executables alone. For more general investigations, we show that combining two of nine criteria for uninteresting files, with exceptions for some special interesting files, can exclude 77.4% of our corpus, compared with the 23.8% excluded by NSRL. For a manually inspected test set of 19,784 files randomly selected from our corpus, our methods gave a false-positive rate after exclusion (interesting files identified as uninteresting) of 0.18% and a false-negative rate (uninteresting files identified as interesting) of 29.31%. The generality of the methods was confirmed by separately testing the two halves of our corpus. Few of our excluded files were matched in two commercial hash sets. This work provides both new uninteresting hash values and programs for finding more.
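The abstract describes two complementary ideas: excluding files whose hashes appear in a published set such as the NSRL, and ranking the remaining files by a Bayesian formula applied to metadata clues. The Python sketch below is a minimal, hypothetical illustration of that general workflow; it is not the paper's actual formula, and the clue names, weights, hash values, and example records are all invented for illustration.

# Minimal illustrative sketch, not the paper's method: hash-set exclusion
# followed by a naive-Bayes-style score that sums assumed log-likelihood
# weights for metadata clues suggestive of malware.

CLUE_WEIGHTS = {
    "executable_extension": 1.2,     # e.g. .exe, .dll, .scr (assumed weight)
    "double_extension": 1.5,         # e.g. invoice.pdf.exe (assumed weight)
    "in_temp_directory": 0.8,        # assumed weight
    "atypical_path_for_type": 0.6,   # assumed weight
}

def clue_score(clues):
    # Sum the weights of the metadata clues present for a file.
    return sum(CLUE_WEIGHTS.get(c, 0.0) for c in clues)

def filter_and_rank(files, known_hashes, threshold=1.0):
    # Drop files whose hash appears in the published uninteresting set,
    # then rank the remainder by descending clue score.
    kept = [f for f in files if f["sha1"] not in known_hashes]
    kept.sort(key=lambda f: clue_score(f["clues"]), reverse=True)
    return [f for f in kept if clue_score(f["clues"]) >= threshold]

if __name__ == "__main__":
    known = {"da39a3ee5e6b4b0d3255bfef95601890afd80709"}  # stand-in for an NSRL-style hash set
    sample = [
        {"sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
         "clues": ["executable_extension"]},
        {"sha1": "ffff0000ffff0000ffff0000ffff0000ffff0000",
         "clues": ["executable_extension", "double_extension", "in_temp_directory"]},
    ]
    for f in filter_and_rank(sample, known):
        print(f["sha1"], round(clue_score(f["clues"]), 2))

In the paper's setting, the clue weights would be estimated from corpus statistics rather than chosen by hand, and the exclusion set would draw on the NSRL plus the new uninteresting hash values the work provides.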