PIILO: an open-source system for personally identifiable information labeling and obfuscation

Information and Learning Sciences (IF 2.2, Q2, Information Science & Library Science) · Pub Date: 2023-10-18 · DOI: 10.1108/ils-04-2023-0032

Langdon Holmes, Scott Crossley, Harshvardhan Sikka, Wesley Morris


Citations: 1

Abstract

Purpose: This study aims to report on an automatic deidentification system for labeling and obfuscating personally identifiable information (PII) in student-generated text.

Design/methodology/approach: The authors evaluate the performance of their deidentification system on two data sets of student-generated text. Each data set was human-annotated for PII. The authors evaluate using two approaches: per-token PII classification accuracy and a simulated reidentification attack design. In the reidentification attack, two reviewers attempted to recover student identities from the data after PII was obfuscated by the authors’ system. In both cases, results are reported in terms of recall and precision.

Findings: The authors’ deidentification system recalled 84% of student name tokens in their first data set (96% of full names). On the second data set, it achieved a recall of 74% for student name tokens (91% of full names) and 75% for all direct identifiers. After the second data set was obfuscated by the authors’ system, two reviewers attempted to recover the identities of students from the obfuscated data. They performed below chance, indicating that the obfuscated data presents a low identity disclosure risk.

Research limitations/implications: The two data sets used in this study are not representative of all forms of student-generated text, so further work is needed to evaluate performance on more data.

Practical implications: This paper presents an open-source and automatic deidentification system appropriate for student-generated text with technical explanations and evaluations of performance.

Originality/value: Previous study on text deidentification has shown success in the medical domain. This paper develops on these approaches and applies them to text in the educational domain.
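The abstract reports per-token recall and precision for PII labels. As a rough illustration only, the sketch below shows how such token-level metrics are commonly computed from aligned gold and predicted labels. The function name `token_precision_recall`, the boolean label encoding, and the toy sentence are assumptions made for this example; they are not taken from the PIILO code base, and the authors' exact scoring procedure may differ.

```python
from typing import Sequence, Tuple


def token_precision_recall(
    gold: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (precision, recall) for per-token PII labels.

    Hypothetical helper: True marks a token labeled as PII.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align token by token")

    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)        # PII correctly flagged
    false_pos = sum(1 for g, p in pairs if not g and p)   # non-PII flagged as PII
    false_neg = sum(1 for g, p in pairs if g and not p)   # PII that was missed

    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Toy example (invented data): "Jane" and "Doe" are the gold name tokens.
tokens = ["My", "name", "is", "Jane", "Doe", "and", "I", "like", "May"]
gold = [False, False, False, True, True, False, False, False, False]
# The hypothetical system finds "Jane", misses "Doe", and wrongly flags "May".
predicted = [False, False, False, True, False, False, False, False, True]

precision, recall = token_precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For deidentification, recall is usually the more safety-relevant number, because every missed PII token remains in the released text, whereas a false positive only obscures a harmless word.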


Source journal

Information and Learning Sciences (Information Science & Library Science)

CiteScore: 9.50

Self-citation rate: 2.90%

About the journal: Information and Learning Sciences advances interdisciplinary research that explores scholarly intersections shared within two key fields: information science and the learning sciences / education sciences. The journal provides a publication venue for work that strengthens our scholarly understanding of human inquiry and learning phenomena, especially as they relate to the design and use of information and e-learning system innovations.
