Lena L. Voigt , Felix Freiling , Christopher J. Hargreaves
Re-imagen: Generating coherent background activity in synthetic scenario-based forensic datasets using large language models
Forensic Science International-Digital Investigation, published 2024-10-01
DOI: 10.1016/j.fsidi.2024.301805
https://www.sciencedirect.com/science/article/pii/S266628172400129X
Abstract
Due to legal and privacy-related restrictions, the generation of synthetic data is recommended for creating datasets for digital forensic education and training. One challenge when synthesizing scenario-based forensic data is the creation of coherent background activity besides evidential actions. This work leverages the creative writing abilities of large language models (LLMs) to generate personas and actions that describe the background usage of a device consistent with the created persona. These actions are subsequently converted into a machine-readable format and executed on a virtualized device using VM control automation. We introduce Re-imagen, a framework that combines state-of-the-art LLMs and a recent unintrusive GUI automation tool to produce synthetic disk images that contain arguably coherent “wear-and-tear” artifacts that current synthesis platforms lack. While, for now, the focus is on the coherence of the generated background activity, we believe that the proposed approach is a step toward more realistic synthetic disk image generation.
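The conversion step mentioned in the abstract, turning LLM-written actions into a machine-readable format, might look roughly like the following minimal sketch. It is not Re-imagen's actual schema or code: the `Action` fields, the pipe-separated line layout, and the `parse_actions` helper are all assumptions made for illustration, and the LLM output is a hard-coded stand-in rather than a real model response.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Action:
    """One background action; all field names are hypothetical."""
    time: str         # wall-clock time as zero-padded HH:MM
    application: str  # program the persona uses
    operation: str    # what is done in that program
    target: str       # URL, file name, or search term


def parse_actions(text: str) -> list[Action]:
    """Parse lines of the assumed form 'HH:MM | application | operation | target'."""
    actions = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        actions.append(Action(*parts))
    # Zero-padded HH:MM strings sort chronologically as plain text.
    return sorted(actions, key=lambda a: a.time)


# Stand-in for LLM output; a real run would obtain this text from a model.
llm_output = """
09:40 | editor | create_file | shopping_list.txt
09:15 | browser | visit | https://example.org/news
"""

actions = parse_actions(llm_output)
print(json.dumps([asdict(a) for a in actions], indent=2))
```

A structured record of this kind could then be handed to whatever VM automation layer executes the actions; how the framework actually does this is not described in the abstract.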