Optimising data set creation in the cybersecurity landscape with a special focus on digital forensics: Principles, characteristics, and use cases
Authors: Thomas Göbel, Frank Breitinger, Harald Baier
Journal: Forensic Science International-Digital Investigation, volume 52, Article 301882
Publication date: 2025-01-29
Publication type: Journal Article
DOI: 10.1016/j.fsidi.2025.301882
URL: https://www.sciencedirect.com/science/article/pii/S2666281725000216
Journal metrics: impact factor 2.0; JCR Q3 (Computer Science, Information Systems)
Citation count: 0
Abstract
Data sets (samples) are important for research, training, and tool development. While the FAIR principles and data repositories and archives such as Zenodo and NIST's Computer Forensic Reference Data Sets (CFReDS) enhance the accessibility and reusability of data sets, standardised practices for crafting and describing them require further attention. This paper analyses the existing literature to identify key characteristics of data sets and their generation, common issues, desirable attributes, and use cases. Although our findings apply to the cybersecurity domain in general, our special focus is on digital forensics. We define principles and properties for cybersecurity-relevant data sets and describe their implications for the data creation process, with the goal of maximising quality, utility and applicability while taking into account specific data set use cases and data origin. We aim to guide data set creators in enhancing the value of their data sets for the cybersecurity and digital forensics fields.