An Open Source Python Library for Anonymizing Sensitive Data.

Scientific Data | IF 6.9 | Zone 2 (multidisciplinary journals) | Q1, Multidisciplinary Sciences | Pub Date: 2024-11-26 | DOI: 10.1038/s41597-024-04019-z

Judith Sáinz-Pardo Díaz, Álvaro López García

Citations: 0

Abstract

Open science is a fundamental pillar of scientific progress and collaboration, built on the principles of open data, open source and open access. However, the requirements for publishing and sharing open data are often difficult to meet while complying with strict data protection regulations. Consequently, researchers need proven methods that allow them to anonymize their data without sharing it with third parties. To this end, this paper presents the implementation of a Python library for anonymizing sensitive tabular data. The framework provides a wide range of anonymization methods that can be applied to a given dataset. The user supplies the set of identifiers, the quasi-identifiers, the generalization hierarchies and the allowed level of suppression, along with the sensitive attribute and the required level of anonymity. The library has been implemented following best practices for continuous integration and development, and uses workflows to check code coverage through unit and functional tests.
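
To make the inputs named in the abstract concrete, the following is a minimal pure-Python sketch of k-anonymity by generalization and suppression. It is not the library's API: the toy records, the column names, the two hierarchy functions and the helper `k_anonymize` are all hypothetical, and the greedy strategy for choosing which quasi-identifier to generalize next is an assumption made for illustration.

```python
from collections import Counter

# Toy records: every column name and value is invented for illustration.
records = [
    {"name": "Ana", "age": 23, "zip": "39005", "diagnosis": "flu"},
    {"name": "Ben", "age": 27, "zip": "39012", "diagnosis": "asthma"},
    {"name": "Cruz", "age": 34, "zip": "39007", "diagnosis": "flu"},
    {"name": "Dana", "age": 38, "zip": "39011", "diagnosis": "diabetes"},
    {"name": "Eli", "age": 61, "zip": "28001", "diagnosis": "flu"},
]

IDENTIFIERS = ["name"]             # dropped outright
QUASI_IDENTIFIERS = ["age", "zip"]  # generalized step by step
# "diagnosis" plays the role of the sensitive attribute. k-anonymity leaves it
# untouched; techniques such as l-diversity would also inspect it.


def generalize_age(age, level):
    """Hypothetical hierarchy: exact age -> decade -> fully masked."""
    if level == 0:
        return str(age)
    if level == 1:
        low = age // 10 * 10
        return f"{low}-{low + 9}"
    return "*"


def generalize_zip(code, level):
    """Hypothetical hierarchy: full code -> two-digit prefix -> fully masked."""
    if level == 0:
        return code
    if level == 1:
        return code[:2] + "***"
    return "*"


GENERALIZERS = {"age": generalize_age, "zip": generalize_zip}
MAX_LEVEL = {"age": 2, "zip": 2}


def qi_key(row):
    """Tuple of quasi-identifier values that defines an equivalence class."""
    return tuple(row[qi] for qi in QUASI_IDENTIFIERS)


def k_anonymize(rows, k, max_suppression):
    """Generalize until at most `max_suppression` (a fraction) of rows must be dropped."""
    levels = {qi: 0 for qi in QUASI_IDENTIFIERS}
    while True:
        generalized = []
        for row in rows:
            out = {c: v for c, v in row.items() if c not in IDENTIFIERS}
            for qi in QUASI_IDENTIFIERS:
                out[qi] = GENERALIZERS[qi](row[qi], levels[qi])
            generalized.append(out)
        sizes = Counter(qi_key(r) for r in generalized)
        # Suppress (drop) rows whose equivalence class has fewer than k members.
        kept = [r for r in generalized if sizes[qi_key(r)] >= k]
        if len(rows) - len(kept) <= max_suppression * len(rows):
            return kept, levels
        candidates = [qi for qi in QUASI_IDENTIFIERS if levels[qi] < MAX_LEVEL[qi]]
        if not candidates:
            raise ValueError("cannot reach k-anonymity within the suppression limit")
        # Greedy choice (an assumption): generalize the most diverse column next.
        worst = max(candidates, key=lambda qi: len({r[qi] for r in generalized}))
        levels[worst] += 1


anonymized, levels = k_anonymize(records, k=2, max_suppression=0.2)
for row in anonymized:
    print(row)
print("generalization levels:", levels)
```

In this toy run, each remaining row shares its quasi-identifier values with at least one other row, and the single record that cannot be grouped is suppressed because that stays within the allowed 20% suppression level.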

Source journal: Scientific Data (Social Sciences-Education)

CiteScore: 11.20

Self-citation rate: 4.10%

Articles published: 689

Review time: 16 weeks

Journal description: Scientific Data is an open-access journal focused on data, publishing descriptions of research datasets and articles on data sharing across the natural sciences, medicine, engineering, and the social sciences. Its goal is to enhance the sharing and reuse of scientific data, encourage broader data sharing, and acknowledge those who share their data. The journal primarily publishes Data Descriptors, which offer detailed descriptions of research datasets, including data collection methods and technical analyses validating data quality. These descriptors aim to facilitate data reuse rather than to test hypotheses or present new interpretations, methods, or in-depth analyses.