Langdon Holmes, Scott Crossley, Harshvardhan Sikka, Wesley Morris
{"title":"PIILO: an open-source system for personally identifiable information labeling and obfuscation","authors":"Langdon Holmes, Scott Crossley, Harshvardhan Sikka, Wesley Morris","doi":"10.1108/ils-04-2023-0032","DOIUrl":null,"url":null,"abstract":"Purpose This study aims to report on an automatic deidentification system for labeling and obfuscating personally identifiable information (PII) in student-generated text. Design/methodology/approach The authors evaluate the performance of their deidentification system on two data sets of student-generated text. Each data set was human-annotated for PII. The authors evaluate using two approaches: per-token PII classification accuracy and a simulated reidentification attack design. In the reidentification attack, two reviewers attempted to recover student identities from the data after PII was obfuscated by the authors’ system. In both cases, results are reported in terms of recall and precision. Findings The authors’ deidentification system recalled 84% of student name tokens in their first data set (96% of full names). On the second data set, it achieved a recall of 74% for student name tokens (91% of full names) and 75% for all direct identifiers. After the second data set was obfuscated by the authors’ system, two reviewers attempted to recover the identities of students from the obfuscated data. They performed below chance, indicating that the obfuscated data presents a low identity disclosure risk. Research limitations/implications The two data sets used in this study are not representative of all forms of student-generated text, so further work is needed to evaluate performance on more data. Practical implications This paper presents an open-source and automatic deidentification system appropriate for student-generated text with technical explanations and evaluations of performance. Originality/value Previous study on text deidentification has shown success in the medical domain. This paper develops on these approaches and applies them to text in the educational domain.","PeriodicalId":44588,"journal":{"name":"Information and Learning Sciences","volume":"161 1","pages":"0"},"PeriodicalIF":1.6000,"publicationDate":"2023-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information and Learning Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1108/ils-04-2023-0032","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"INFORMATION SCIENCE & LIBRARY SCIENCE","Score":null,"Total":0}
引用次数: 1
Abstract
Purpose This study aims to report on an automatic deidentification system for labeling and obfuscating personally identifiable information (PII) in student-generated text. Design/methodology/approach The authors evaluate the performance of their deidentification system on two data sets of student-generated text. Each data set was human-annotated for PII. The authors evaluate using two approaches: per-token PII classification accuracy and a simulated reidentification attack design. In the reidentification attack, two reviewers attempted to recover student identities from the data after PII was obfuscated by the authors’ system. In both cases, results are reported in terms of recall and precision. Findings The authors’ deidentification system recalled 84% of student name tokens in their first data set (96% of full names). On the second data set, it achieved a recall of 74% for student name tokens (91% of full names) and 75% for all direct identifiers. After the second data set was obfuscated by the authors’ system, two reviewers attempted to recover the identities of students from the obfuscated data. They performed below chance, indicating that the obfuscated data presents a low identity disclosure risk. Research limitations/implications The two data sets used in this study are not representative of all forms of student-generated text, so further work is needed to evaluate performance on more data. Practical implications This paper presents an open-source and automatic deidentification system appropriate for student-generated text with technical explanations and evaluations of performance. Originality/value Previous study on text deidentification has shown success in the medical domain. This paper develops on these approaches and applies them to text in the educational domain.
期刊介绍:
Information and Learning Sciences advances inter-disciplinary research that explores scholarly intersections shared within 2 key fields: information science and the learning sciences / education sciences. The journal provides a publication venue for work that strengthens our scholarly understanding of human inquiry and learning phenomena, especially as they relate to design and uses of information and e-learning systems innovations.