Pub Date: 2025-03-01 | Epub Date: 2025-03-24 | DOI: 10.1016/j.fsidi.2025.301867
Hannes Spichiger , Frank Adelstein
Preservation is generally considered the step in the forensic process that stops evidence from decaying. In this paper, we argue that the traditional scope of preservation in digital forensic science, focused on the trace, is not sufficient to halt decay in the context of evolving systems. Instead, insufficiently preserved reference material may lead to a loss of meaning, resulting in an overall increase in the uncertainty of the presented evidence. An expanded definition of Preservation and a definition of Reference Data are proposed. We suggest future avenues of research into ways of preserving reference data so that the meaning of the trace data is not lost.
"Preserving meaning of evidence from evolving systems," Forensic Science International: Digital Investigation, vol. 52, Article 301867.
Pub Date: 2025-03-01 | Epub Date: 2025-03-24 | DOI: 10.1016/j.fsidi.2025.301870
Romke van Dijk , Judith van de Wetering , Ranieri Argentini , Leonie Gorka , Anne Fleur van Luenen , Sieds Minnema , Edwin Rijgersberg , Mattijs Ugen , Zoltán Ádám Mann , Zeno Geradts
In digital forensic investigations, the ability to identify passwords in cleartext within digital evidence is often essential for the acquisition of data from encrypted devices. Passwords may be stored in cleartext, knowingly or accidentally, in various locations within a device, e.g., in text messages, notes, or system log files. Finding those passwords is a challenging task, as devices typically contain a substantial amount and a wide variety of textual data. This paper explores the performance of several different types of machine learning models trained to distinguish passwords from non-passwords and to rank candidate strings by their likelihood of being a human-generated password. Three deep learning models (PassGPT, CodeBERT and DistilBERT) were fine-tuned, and two traditional machine learning models (a feature-based XGBoost and a TF/IDF-based XGBoost) were trained. These were compared to the existing state-of-the-art technology, a password recognition model based on probabilistic context-free grammars. Our research shows that the fine-tuned PassGPT model outperforms the other models. We show that a combination of multiple different types of training datasets, carefully chosen based on the context, is needed to achieve good results. In particular, it is important to train not only on dictionary words and leaked credentials, but also on data scraped from chats and websites. Our approach was evaluated with realistic hardware that could fit inside an investigator's workstation. The evaluation was conducted on the publicly available RockYou and MyHeritage leaks, but also on a dataset derived from real casework, showing that these innovations can indeed be used in a real forensic context.
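The feature-based XGBoost model presumably operates on character-composition statistics of candidate strings. The sketch below illustrates the general idea of such feature extraction; the feature set (length, character-class counts, Shannon entropy) is a hypothetical illustration, not the paper's actual model.

```python
import math
from collections import Counter

def password_features(s: str) -> dict:
    """Extract simple character-composition features from a candidate
    string. The feature set here is illustrative only."""
    classes = {
        "lower": sum(c.islower() for c in s),
        "upper": sum(c.isupper() for c in s),
        "digit": sum(c.isdigit() for c in s),
        "symbol": sum(not c.isalnum() for c in s),
    }
    counts = Counter(s)
    # Shannon entropy over characters: tends to be higher for
    # mixed-class strings like passwords than for dictionary words.
    entropy = 0.0
    if s:
        entropy = -sum((n / len(s)) * math.log2(n / len(s))
                       for n in counts.values())
    return {"length": len(s), "entropy": round(entropy, 3), **classes}

feats = password_features("PaSSw0rdVib3s!")
```

Features like these would then be fed to a classifier such as XGBoost, which scores each string's likelihood of being a human-generated password.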
"PaSSw0rdVib3s!: AI-assisted password recognition for digital forensic investigations," Forensic Science International: Digital Investigation, vol. 52, Article 301870.
Pub Date: 2025-03-01 | Epub Date: 2024-12-06 | DOI: 10.1016/j.fsidi.2024.301845
Thiago J. Silva , Ana H.B. Mazur , Edson OliveiraJr , Avelino F. Zorzo , Monalessa P. Barcellos
Experimentation is a crucial method of empirical inquiry and is widely applied in Computer Science. Controlled experimentation ensures the reproducibility, transparency, and reliability of findings, making the process more formal. Digital forensics (DF) lacks a formalized process for controlled experiments, leading to inadequate and informal research whose findings are less transparent, reproducible, and reliable. Furthermore, existing works in this area often lack detailed descriptions of the decision-making procedures behind controlled experiments. To address these issues, we developed an ontology that formalizes the concepts and terms used in controlled DF experiments. The ontology was constructed on the basis of an existing conceptual model for controlled DF experiments. Its conceptual model is represented by UML class diagrams, and the OWL language was employed to code it. Moreover, the ontology was evaluated by researchers and experts in DF experimentation, with the results indicating that it is capable of formalizing DF experimental concepts. The contribution of this ontology is to assist DF researchers and practitioners in properly documenting their controlled experiments. This will enhance the formality of the experimental process and promote the reproducibility, transparency, and reliability of findings. For researchers, the ontology's main contribution lies in influencing how these experiments are conducted, potentially easing their transfer to industry. Practitioners stand to benefit by adopting formal experimental procedures for testing, assessing, and acquiring DF-related technology.
"An ontology for promoting controlled experimentation in digital forensics," Forensic Science International: Digital Investigation, vol. 52, Article 301845.
Pub Date: 2025-03-01 | Epub Date: 2025-03-24 | DOI: 10.1016/j.fsidi.2025.301874
Lena L. Voigt , Felix Freiling , Christopher Hargreaves
There is currently no systematic method for evaluating digital forensic datasets. This makes it difficult to judge their suitability for specific use cases in digital forensic education and training. Additionally, there is limited comparability in the quality of synthetic datasets or the strengths and weaknesses of different data synthesis approaches. In this paper, we propose the concept of a quantitative, metrics-based assessment of forensic datasets as a first step toward a systematic evaluation approach. As a concrete implementation of this approach, we introduce Mass Disk Processor, a tool that automates the collection of metrics from large sets of disk images. It enables a privacy-preserving retrieval of high-level disk image characteristics, facilitating the assessment of not only synthetic but also real-world disk images. We demonstrate two applications of our tool. First, we create a comprehensive datasheet for publicly available, scenario-based synthetic disk images. Second, we propose a formal definition of synthetic data realism that compares properties of synthetic data to properties of real-world data and present results from an examination of the realism of current scenario-based disk images.
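Automated metric collection over many disk images can be pictured as a walk over each mounted filesystem that records only content-free aggregates. The sketch below is a hypothetical illustration of that idea, not Mass Disk Processor itself; the specific metrics (file count, total size, extension histogram) are assumptions for the example.

```python
import os
from collections import Counter

def collect_metrics(root: str) -> dict:
    """Walk a directory tree (e.g. a mounted disk image) and gather
    high-level, privacy-preserving metrics: file count, total size,
    and a file-extension histogram. The metric set is illustrative."""
    ext_hist = Counter()
    n_files = 0
    total_bytes = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip unreadable entries rather than abort
            n_files += 1
            total_bytes += size
            ext = os.path.splitext(name)[1].lower() or "<none>"
            ext_hist[ext] += 1
    return {"files": n_files, "bytes": total_bytes,
            "extensions": dict(ext_hist)}
```

Running such a collector over both synthetic and real-world images yields comparable per-image summaries without exposing any file contents.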
"A metrics-based look at disk images: Insights and applications," Forensic Science International: Digital Investigation, vol. 52, Article 301874.
Pub Date: 2025-03-01 | Epub Date: 2025-01-15 | DOI: 10.1016/j.fsidi.2025.301880
Johnny Bengtsson
Sensor and actuator event log analyses within the context of digital forensics are crucial for understanding events in automated buildings, such as in a building automation and control system (BACS) or a home automation system (HAS). Conclusions drawn from erroneous, misleading, or corrupted log data may adversely affect crime scene investigations and reconstructions. This work aims to raise awareness of the potential risk of misinterpretation due to corrupted or tampered data from BACS or HAS event log systems.
A series of non-invasive sensor and actuator attacks on such systems was designed and conducted to determine the feasibility of: 1) injecting spoofed pyroelectric infrared (PIR) and carbon dioxide (CO2) sensor event log records, 2) becoming invisible to PIR and CO2 sensors, and 3) mimicking the behaviour of an actuator with the aim of injecting spoofed event log records. The study also concludes that sensor fusion can reveal activities that were concealed from CO2 sensors. Furthermore, this work discusses adversarial perspectives in the cyber-physical systems (CPS) domain in relation to these findings.
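The sensor-fusion finding can be illustrated with a minimal cross-check: occupancy should show up in both modalities, so an interval where CO2 climbs while the PIR log records no motion is suspicious. The data layout and rise threshold below are illustrative assumptions, not values from the study.

```python
def fusion_flags(pir_events, co2_ppm, rise_threshold=50):
    """Compare per-interval PIR motion counts against CO2 rise.
    Flags interval indices where CO2 climbs (suggesting occupancy)
    but the PIR log shows no motion -- a possible indicator of
    sensor evasion or spoofed/suppressed log records.
    Threshold and sampling layout are illustrative only."""
    flags = []
    for i in range(1, len(co2_ppm)):
        rise = co2_ppm[i] - co2_ppm[i - 1]
        if rise >= rise_threshold and pir_events[i] == 0:
            flags.append(i)
    return flags
```

For example, PIR counts [0, 0, 3, 0] against CO2 readings [400, 470, 530, 545] flag only the second interval, where CO2 rose sharply with no recorded motion.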
"The ghost in the building: Non-invasive spoofing and covert attacks on automated buildings," Forensic Science International: Digital Investigation, vol. 52, Article 301880.
Pub Date: 2025-03-01 | Epub Date: 2025-03-24 | DOI: 10.1016/j.fsidi.2025.301864
Christopher Hargreaves , Harm van Beek , Eoghan Casey
This work presents SOLVE-IT (Systematic Objective-based Listing of Various Established (Digital) Investigation Techniques), a digital forensics knowledge base inspired by the MITRE ATT&CK cybersecurity resource. Several applications of the knowledge base are demonstrated: strengthening tool testing by scoping error-focused data sets for a technique, reinforcing digital forensic techniques by cataloguing available mitigations for weaknesses (a systematic approach to performing Error Mitigation Analysis), bolstering quality assurance by identifying potential weaknesses in a specific digital forensic investigation or standard processes, structuring consideration of potential uses of AI in digital forensics, augmenting automation by highlighting relevant CASE ontology classes and identifying ontology gaps, and prioritizing innovation by identifying academic research opportunities. The paper provides the structure and partial implementation of a knowledge base comprising 104 digital forensic techniques, organised under 17 objectives, with detailed descriptions, errors, and mitigations provided for 33 of them. The knowledge base is hosted on an open platform (GitHub) to allow crowdsourced contributions to evolve the contents. Tools are also provided to export the machine-readable back-end data into usable formats such as spreadsheets to support many applications, including systematic error mitigation and quality assurance documentation.
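The ATT&CK-style layout of objectives, techniques, weaknesses, and mitigations lends itself to simple machine queries. The miniature below is a hypothetical sketch of that shape (the technique ids, names, and entries are invented for illustration, not taken from SOLVE-IT), showing the kind of quality-assurance query the paper describes: finding techniques whose known weaknesses have no catalogued mitigation.

```python
# Hypothetical miniature of an objective/technique knowledge base;
# ids and contents are invented for illustration.
KB = {
    "T1001": {"name": "Disk imaging", "objective": "Acquire",
              "weaknesses": ["bad sectors skipped silently"],
              "mitigations": ["log and hash per-block read errors"]},
    "T1002": {"name": "File carving", "objective": "Recover",
              "weaknesses": ["false positives from fragmented files"],
              "mitigations": []},
}

def unmitigated(kb):
    """Return technique ids that list weaknesses but no mitigation,
    i.e. candidates for quality-assurance attention."""
    return [tid for tid, t in kb.items()
            if t["weaknesses"] and not t["mitigations"]]
```

Because the real back-end data is machine readable, queries like this can feed tool-testing scopes, QA checklists, or research-gap reports directly.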
"SOLVE-IT: A proposed digital forensic knowledge base inspired by MITRE ATT&CK," Forensic Science International: Digital Investigation, vol. 52, Article 301864.
Pub Date: 2025-03-01 | Epub Date: 2025-03-24 | DOI: 10.1016/j.fsidi.2025.301875
Hongseok Yang, Sanghyug Han, Mindong Kim, Gibum Kim
With the advancement of Offline Finding Network (OFN) technology, tracking tags are being utilized in various fields, including locating elderly individuals with dementia, caring for children, and managing lost items. Recently, however, tracking tags have been misused in stalking, surveillance, and debt collection, highlighting the growing importance of digital forensics in proving criminal acts. While there has been some research on Apple AirTag and Tile products, studies focusing on Samsung's tracking tag have been lacking. Therefore, this paper proposes digital forensic techniques for law enforcement agencies to analyze Samsung tracking tag applications to identify perpetrators and substantiate criminal activities. We analyzed six tags and three applications, recognizing tag identifiers, and confirmed that location data is stored in both plaintext and encrypted forms within SQLite databases and XML files. Additionally, we conducted experiments on five different anti-forensics scenarios: 1) deletion of a registered tracking tag, 2) deletion of location data, 3) account logout, 4) service withdrawal, and 5) application synchronization, finding meaningful results to substantiate criminal actions. Furthermore, we developed S.TASER (Smart Tag Parser), a Python-based tool that identifies deleted tags, recovers identification data, and visualizes the collected location data for each tag. S.TASER's code, experimental scenarios, and raw data are publicly available for further verification. This study aims to contribute to the global digital forensic industry by suggesting additional options for investigating and gathering evidence of crimes that make use of offline finding networks.
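Since the location data sits in SQLite databases, a parser in the spirit of S.TASER can pull per-tag location rows with the standard `sqlite3` module. The table and column names below are hypothetical stand-ins; the actual schemas in Samsung's applications differ and are documented in the paper, not here.

```python
import sqlite3

def load_locations(db_path: str, tag_id: str):
    """Return (timestamp, latitude, longitude) rows for one tag,
    oldest first, from a SQLite store. The 'tag_locations' schema
    is a hypothetical stand-in for the real application database."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT timestamp, latitude, longitude FROM tag_locations "
            "WHERE tag_id = ? ORDER BY timestamp",
            (tag_id,),  # parameterized query avoids SQL injection
        ).fetchall()
    finally:
        con.close()
    return rows
```

Ordering by timestamp gives a per-tag movement timeline, which is the shape of data a visualization step would then plot.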
"Samsung tracking tag application forensics in criminal investigations," Forensic Science International: Digital Investigation, vol. 52, Article 301875.
Pub Date: 2025-03-01 | Epub Date: 2025-03-24 | DOI: 10.1016/j.fsidi.2025.301878
Sean McKeown
Forensic analysts are often tasked with analysing large volumes of data in modern investigations, and frequently make use of hashing technologies to identify previously encountered images. Perceptual hashes, which seek to model the semantic (visual) content of images, are typically compared by way of Normalised Hamming Distance, counting the ratio of bits which differ between two hashes. However, this global measure of difference may overlook structural information, such as the position and relative clustering of these differences. This paper investigates the relationship between localised/positional changes in an image and the extent to which this information is encoded in various perceptual hashes. Our findings indicate that the relative position of bits in the hash does encode useful information. Consequently, we prototype and evaluate three alternative perceptual hashing distance metrics: Normalised Convolution Distance, Hatched Matrix Distance, and 2-D Ngram Cosine Distance. Results demonstrate that there is room for improvement over Hamming Distance. In particular, the worst-case image mirroring transform for DCT-based hashes can be completely mitigated without needing to change the mechanism for generating the hash. Indeed, perceived hash weaknesses may actually be deficits in the distance metric being used, and large-scale providers could potentially benefit from modifying their approach.
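Normalised Hamming Distance, and the structural information it discards, can be shown in a few lines. The toy 8-bit hashes below are illustrative, not drawn from the paper: both differ from zero in exactly four bits, so Hamming Distance scores them identically even though one difference is a clustered block and the other is scattered.

```python
def normalised_hamming(h1: int, h2: int, bits: int = 64) -> float:
    """Fraction of differing bits between two fixed-length
    perceptual hashes, each given as an integer."""
    return bin((h1 ^ h2) & ((1 << bits) - 1)).count("1") / bits

# Same bit count, different spatial structure: Hamming Distance
# cannot tell a clustered change from a scattered one, which is
# the positional information alternative metrics try to exploit.
clustered = normalised_hamming(0b11110000, 0b00000000, bits=8)
scattered = normalised_hamming(0b10101010, 0b00000000, bits=8)
```

Both calls return 0.5, so any position-aware metric that separates these two cases is already encoding information the global measure throws away.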
"Beyond Hamming Distance: Exploring spatial encoding in perceptual hashes," Forensic Science International: Digital Investigation, vol. 52, Article 301878.