Pub Date: 2024-10-01, DOI: 10.1016/j.fsidi.2024.301814
Sang-Hyun Cho, Dohyun Kim, Hyuk-Chul Kwon, Minho Kim
The rapid advancement of large language models (LLMs) has opened up new possibilities for various natural language processing tasks. This study explores the potential of LLMs for author profiling in digital text forensics, which involves identifying characteristics such as age and gender from writing style—a crucial task in forensic investigations of anonymous or pseudonymous communications. Experiments were conducted using state-of-the-art LLMs, including Polyglot, EEVE, and Bllossom, to evaluate their performance in author profiling. Different fine-tuning strategies, such as full fine-tuning, Low-Rank Adaptation (LoRA), and Quantized LoRA (QLoRA), were compared to determine the most effective methods for adapting LLMs to the specific needs of this task. The results show that fine-tuned LLMs can effectively predict authors’ age and gender based on their writing styles, with Polyglot-based models generally outperforming EEVE and Bllossom models. Additionally, LoRA and QLoRA strategies significantly reduce computational costs and memory requirements while maintaining performance comparable to full fine-tuning. However, error analysis reveals limitations in the current LLM-based approach, including difficulty in capturing subtle linguistic variations across age groups and potential biases from pre-training data. These challenges are discussed and future research directions to address them are proposed. This study underscores the potential of LLMs in author profiling for digital text forensics, suggesting promising avenues for further exploration and refinement.
Exploring the potential of large language models for author profiling tasks in digital text forensics. Forensic Science International-Digital Investigation, vol. 50, Article 301814.
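The reduction in trainable parameters that the abstract attributes to LoRA can be illustrated with a little arithmetic. The sketch below is not the authors' training code: it assumes the standard LoRA formulation, in which a frozen weight matrix W is adapted as W + (alpha / r) · B A with low-rank factors A (r × d_in) and B (d_out × r). The helper names `matmul`, `lora_effective_weight`, and `trainable_params` are hypothetical and exist only for this illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A)."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update with shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Compare trainable parameter counts for one weight matrix."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}
```

For a 4096 × 4096 matrix at rank 8, `trainable_params(4096, 4096, 8)` gives 16,777,216 parameters for full fine-tuning against 65,536 for LoRA. QLoRA, as generally described, keeps the same low-rank update but stores the frozen base weights in quantized form, which is where the extra memory saving comes from.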
Computer and console-based video games are an important part of the entertainment industry. Gaming devices may be found in evidence lockers as part of investigations, or may be overlooked because their value to an investigation is not well understood. Modern games consoles provide network connectivity and functionality that allows a significant degree of interaction via peer-to-peer connections and/or the Internet. These consoles store settings, user preferences, and user information, and can capture photos, audio, and video, all of which potentially contain forensic artifacts about a person of interest. Games consoles have a fixed lifespan and are eventually superseded by newer models with an expanded range of capabilities. The significant number of consoles available on the secondhand market is clear evidence that older consoles remain in circulation even after production has ceased. What is unclear, however, is the actual extent of forensic data available within these consoles. This paper shares the results of a digital forensic case study undertaken to assess what artifacts are retrievable from a ‘real-world’ dataset drawn from the aging but popular Nintendo 3DS series. A total of 47 Nintendo 3DS/2DS handheld systems were purchased secondhand. They were forensically imaged and then examined to identify what artifacts are commonly found ‘in the wild’ on these often overlooked systems. The results presented in this paper provide guidance to digital forensic investigators on what may realistically be obtained from these non-traditional devices.
Nintendo 3DS forensics: A secondhand case study. Huw O.L. Read, Konstantinos Xynos, Iain Sutherland, Matthew Bovee, Clyde Tamburro. Forensic Science International-Digital Investigation, vol. 50, Article 301815. Pub Date: 2024-10-01, DOI: 10.1016/j.fsidi.2024.301815
Pub Date: 2024-10-01, DOI: 10.1016/j.fsidi.2024.301808
Fuqiang Du, Min Yu, Boquan Li, Kam Pui Chow, Jianguo Jiang, Yixin Zhang, Yachao Liang, Min Li, Weiqing Huang
Deepfake detection is attracting increasing attention due to the serious security issues caused by facial manipulation techniques. Recently, deep learning-based detectors have achieved promising performance. However, these detectors are hard to trust because they lack interpretability. It is therefore essential to work on the interpretability of deepfake detectors to improve the reliability and traceability of digital evidence. In this work, we propose a two-branch autoencoder network named TAENet for interpretable deepfake detection. TAENet is composed of Content Feature Disentanglement (CFD), Content Map Generation (CMG), and Classification. CFD extracts latent features of real and forged content with a dual encoder and a feature discriminator. CMG employs a Pixel-level Content Map Generation Loss (PCMGL) to guide the dual decoder in visualizing the latent representations of real and forged content as a real-map and a fake-map. In the classification module, the Auxiliary Classifier (AC) serves as a map amplifier to improve the accuracy of real-map image extraction. Finally, the learned model decouples the input image into two maps that have the same size as the input, providing visualized evidence for deepfake detection. Extensive experiments demonstrate that TAENet can offer interpretability in deepfake detection without compromising accuracy.
TAENet: Two-branch Autoencoder Network for Interpretable Deepfake Detection. Forensic Science International-Digital Investigation, vol. 50, Article 301808.
Pub Date: 2024-10-01, DOI: 10.1016/j.fsidi.2024.301806
Zainab Khalid, Farkhund Iqbal, Benjamin C.M. Fung
Explainable Artificial Intelligence (XAI) aims to alleviate the black-box AI conundrum in the field of Digital Forensics (DF) (and others) by providing layman-interpretable explanations for predictions made by AI models. It also handles the increasing volumes of forensic images that are impossible to investigate via manual methods, or even with automated forensic tools. A holistic, generalized, yet exhaustive framework detailing the workflow of XAI for DF is proposed for standardization. A case study examining the implementation of the framework in a network forensics investigative scenario is presented for demonstration. In addition, the XAI-DF project lays the basis for a collaborative effort from the forensics community, aimed at creating an open-source forensic database that may be employed to train AI models for the digital forensics domain. As an initial contribution to the project, we create a memory forensics database of 27 memory dumps (Windows 7, 10, and 11) that simulate malware activity, and extract relevant features (specific to processes, injected code, network connections, API hooks, and process privileges) that may be used for training, testing, and validating AI models in keeping with the XAI-DF framework.
Towards a unified XAI-based framework for digital forensic investigations. Forensic Science International-Digital Investigation, vol. 50, Article 301806.
Pub Date: 2024-10-01, DOI: 10.1016/j.fsidi.2024.301816
Kumarakrishna Valeti, Hemant Rathore
Android stands as the predominant operating system within the mobile ecosystem. Users can download applications from official sources like Google Play Store and other third-party platforms. However, malicious actors can attempt to compromise user device integrity through malicious applications. Traditionally, signatures, rules, and other methods have been employed to detect malware attacks and protect device integrity. However, the growing number and complexity of malicious applications have prompted the exploration of newer techniques like machine learning (ML) and deep learning (DL). Many recent studies have demonstrated promising results in detecting malicious applications using ML and DL solutions. However, research in other fields, such as computer vision, has shown that ML and DL solutions are vulnerable to targeted adversarial attacks. Malicious actors can develop malicious adversarial applications that bypass ML- and DL-based antivirus software. The study of adversarial techniques related to malware detection has now captured the security community’s attention. In this work, we utilise Android permissions and intents to construct 28 distinct malware detection models using 14 classification algorithms. We then introduce a novel targeted false-negative evasion attack, Gradient Based K Perturbation Attack (GBKPA), designed for grey-box knowledge scenarios to assess the robustness of these models. GBKPA attempts to craft malicious adversarial samples by making minimal perturbations without violating the syntactic and functional structure of the application. GBKPA achieved an average fooling rate (FR) of 77% with only five perturbations across the 28 detection models. Additionally, we identified the most vulnerable Android permissions and intents that malicious actors can exploit for evasion attacks. Furthermore, we analyse the transferability of adversarial samples across different classes of models and explain the observed behaviour. Finally, we propose the AuxShield defence mechanism to develop robust detection models. AuxShield reduced the average FR to 3.25% across the 28 detection models. Our findings underscore the need to understand the causation of adversarial samples, their transferability, and robust defence strategies before deploying ML and DL solutions in the real world.
GBKPA and AuxShield: Addressing adversarial robustness and transferability in android malware detection. Forensic Science International-Digital Investigation, vol. 50, Article 301816.
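The general idea of a gradient-guided attack limited to k perturbations can be sketched on a toy linear detector. This is not the GBKPA algorithm itself, whose details the abstract does not give. The sketch assumes binary permission/intent features, a logistic-regression detector (so the gradient of the logit with respect to feature i is simply `weights[i]`), and an add-only constraint (features may be switched from 0 to 1 but never removed) as one common way to keep an application functional. The fooling rate is likewise assumed to be the fraction of originally detected samples that evade detection after perturbation. All function names are hypothetical.

```python
import math


def malware_score(weights, bias, x):
    """Probability of 'malicious' under a logistic-regression detector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def k_perturbation(weights, bias, x, k):
    """Add up to k absent features whose gradient lowers the score the most."""
    adv = list(x)
    # Only features that are currently 0 and push the score down are candidates.
    candidates = sorted(
        (i for i, xi in enumerate(x) if xi == 0 and weights[i] < 0),
        key=lambda i: weights[i],
    )
    for i in candidates[:k]:
        if malware_score(weights, bias, adv) < 0.5:
            break  # already evades detection; keep the perturbation minimal
        adv[i] = 1
    return adv


def fooling_rate(weights, bias, samples, k):
    """Fraction of detected samples that evade the detector after perturbation."""
    detected = [x for x in samples if malware_score(weights, bias, x) >= 0.5]
    if not detected:
        return 0.0
    fooled = sum(
        1
        for x in detected
        if malware_score(weights, bias, k_perturbation(weights, bias, x, k)) < 0.5
    )
    return fooled / len(detected)
```

For a non-linear detector the gradient would have to be computed or estimated per sample rather than read off the weights, which is where the grey-box assumption mentioned in the abstract would come into play.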
Pub Date: 2024-10-01, DOI: 10.1016/j.fsidi.2024.301804
Luke Jennings, Matthew Sorell, Hugo G. Espinosa
Fitness tracking smart watches are becoming more prevalent in investigations, and practitioners and researchers need to understand and document their forensic potential and limitations. Such fitness devices have undergone several hardware and software upgrades, changing the way they operate and evolving into more sophisticated pieces of technology. One example is the Apple Watch, which works in conjunction with the Apple iPhone to measure and record a vast amount of health information in the Apple Health database, healthdb_secure.sqlite. Over time, an end user will update their devices, but their health data, uniquely, carries over from one device to the next. In this paper, we investigate and analyse the hardware and software provenance of a real 5+ year Apple Health dataset to determine changes, patterns and anomalies over time. This provenance investigation provides insights in the form of (1) a timeline representing the dataset's history of device and firmware updates, which can be used in the context of investigation validation, (2) anomaly detection, and (3) insights into cyber hygiene. Analysis of the non-health data recorded in the health database arguably provides just as much insightful information as the health data itself.
The provenance of Apple Health data: A timeline of update history. Forensic Science International-Digital Investigation, vol. 50, Article 301804.
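A device and firmware timeline of the kind described can be derived with a single grouped SQL query. The sketch below runs against an in-memory SQLite database with a deliberately simplified, hypothetical table named `provenance` and placeholder device and build values. It does not reproduce the actual schema of healthdb_secure.sqlite, which would need to be checked against a real extraction.

```python
import sqlite3


def build_timeline(conn):
    """Return (device, build, first_seen, last_seen, samples), ordered by first use."""
    return conn.execute(
        "SELECT product_type, os_build, MIN(sample_time), MAX(sample_time), COUNT(*) "
        "FROM provenance "
        "GROUP BY product_type, os_build "
        "ORDER BY MIN(sample_time)"
    ).fetchall()


# Hypothetical stand-in for the provenance records stored alongside health samples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE provenance (sample_time TEXT, product_type TEXT, os_build TEXT)"
)
conn.executemany(
    "INSERT INTO provenance VALUES (?, ?, ?)",
    [
        ("2019-01-01", "WatchA", "build-1"),
        ("2019-03-01", "WatchA", "build-1"),
        ("2019-03-02", "WatchA", "build-2"),
        ("2020-06-15", "WatchB", "build-3"),
    ],
)

for row in build_timeline(conn):
    print(row)
```

Each output row marks one device and firmware combination with the first and last time it contributed data. Gaps or overlaps between consecutive rows are the kind of pattern an examiner might then inspect as a possible anomaly.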
Pub Date: 2024-10-01, DOI: 10.1016/j.fsidi.2024.301807
Jan-Niclas Hilgert, Axel Mahr, Martin Lambertz
File system and network forensics are fundamental in forensic investigations, but are often treated as distinct disciplines. This work seeks to unify these fields by introducing a novel framework capable of mounting network captures, enabling investigators to seamlessly browse data using conventional tools. Although our implementation supports various protocols such as HTTP, TLS, and FTP, this work will particularly focus on the complexities of the Server Message Block (SMB) protocol, which is fundamental for shared file system access, especially within local networks.
To this end, we present a detailed methodology for extracting essential file system data from SMB network traffic, aiming to reconstruct the share's file system as faithfully to the original as possible. Our approach goes beyond traditional tools like Wireshark, which typically only extract individual files from SMB transmissions. Instead, we reconstruct the entire file system hierarchy, retrieve all associated metadata, and handle multiple versions of files captured within the same network traffic. We also investigate how file operations impact SMB commands and show how these can be used to accurately recreate user activities on an SMB share based solely on network traffic. Although both methodologies and implementations can be applied independently, their combination provides investigators with a comprehensive view of the reconstructed file system along with the corresponding user activities extracted from network traffic.
Mount SMB.pcap: Reconstructing file systems and file operations from network traffic. Forensic Science International-Digital Investigation, vol. 50, Article 301807.
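The difference between extracting individual files and rebuilding a hierarchy with file versions can be shown on a toy event stream. The sketch below is not the authors' framework. It assumes the SMB traffic has already been parsed into simplified, hypothetical events of the form `(timestamp, command, path, payload)`, with file handles already resolved to paths and every write carrying a whole file. Real SMB2 traffic works with file identifiers, offsets, and partial writes.

```python
def reconstruct(events):
    """Rebuild a directory tree and per-file version history from parsed events."""
    tree = {}      # nested dicts are directories; None marks a file
    versions = {}  # share-relative path -> payloads in capture order
    for _timestamp, command, path, payload in events:
        parts = path.split("\\")  # SMB paths use backslash separators
        node = tree
        for name in parts[:-1]:
            node = node.setdefault(name, {})  # create parent directories on demand
        if command == "CREATE_DIR":
            node.setdefault(parts[-1], {})
        elif command == "WRITE":
            node[parts[-1]] = None
            versions.setdefault(path, []).append(payload)
    return tree, versions


# Hypothetical, already-parsed events from a single capture.
events = [
    (1, "CREATE_DIR", "docs", None),
    (2, "WRITE", "docs\\a.txt", b"v1"),
    (3, "WRITE", "docs\\a.txt", b"v2"),
    (4, "WRITE", "readme.txt", b"hi"),
]
tree, versions = reconstruct(events)
```

Here `tree` holds the hierarchy and `versions` keeps both captured states of `docs\a.txt`, which is the property a per-file export would lose.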
Pub Date: 2024-10-01, DOI: 10.1016/j.fsidi.2024.301805
Lena L. Voigt, Felix Freiling, Christopher J. Hargreaves
Due to legal and privacy-related restrictions, the generation of synthetic data is recommended for creating datasets for digital forensic education and training. One challenge when synthesizing scenario-based forensic data is the creation of coherent background activity besides evidential actions. This work leverages the creative writing abilities of large language models (LLMs) to generate personas and actions that describe the background usage of a device consistent with the created persona. These actions are subsequently converted into a machine-readable format and executed on a virtualized device using VM control automation. We introduce Re-imagen, a framework that combines state-of-the-art LLMs and a recent unintrusive GUI automation tool to produce synthetic disk images that contain arguably coherent “wear-and-tear” artifacts that current synthesis platforms lack. While, for now, the focus is on the coherence of the generated background activity, we believe that the proposed approach is a step toward more realistic synthetic disk image generation.
Re-imagen: Generating coherent background activity in synthetic scenario-based forensic datasets using large language models. Forensic Science International-Digital Investigation, vol. 50, Article 301805.
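The step in which generated actions are converted into a machine-readable format can be pictured as parsing and validating LLM output before handing it to an automation layer. The sketch below is an assumption-heavy illustration, not Re-imagen's actual format: the pipe-separated line layout, the `Action` fields, and the `ALLOWED_VERBS` set are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical set of verbs that an automation layer would know how to execute.
ALLOWED_VERBS = {"open_url", "create_file", "send_email"}


@dataclass
class Action:
    time: str
    app: str
    verb: str
    argument: str


def parse_actions(text):
    """Turn 'time | app | verb | argument' lines into validated Action records."""
    actions = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 4 or fields[2] not in ALLOWED_VERBS:
            continue  # drop lines that could not be executed
        actions.append(Action(*fields))
    return actions


def to_json(actions):
    """Serialize actions for consumption by a VM automation script."""
    return json.dumps([asdict(action) for action in actions], indent=2)


# Hypothetical LLM output describing one persona's morning.
llm_output = """
09:15 | browser | open_url | https://example.org/news
09:40 | editor | create_file | notes.txt
10:05 | browser | daydream | nothing
"""
actions = parse_actions(llm_output)
```

Validating against a closed vocabulary is one way to keep free-form model output from reaching the automation layer unchecked; the third line above is dropped because its verb is not recognised.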