灵活和可扩展的个人数据生成和损坏

Proceedings of the 22nd ACM international conference on Information & Knowledge Management Pub Date : 2013-10-27 DOI:10.1145/2505515.2507815

P. Christen, Dinusha Vatsalan

{"title":"灵活和可扩展的个人数据生成和损坏","authors":"P. Christen, Dinusha Vatsalan","doi":"10.1145/2505515.2507815","DOIUrl":null,"url":null,"abstract":"With much of today's data being generated by people or referring to people, researchers increasingly require data that contain personal identifying information to evaluate their new algorithms. In areas such as record matching and de-duplication, fraud detection, cloud computing, and health informatics, issues such as data entry errors, typographical mistakes, noise, or recording variations, can all significantly affect the outcomes of data integration, processing, and mining projects. However, privacy concerns make it challenging to obtain real data that contain personal details. An alternative to using sensitive real data is to create synthetic data which follow similar characteristics. The advantages of synthetic data are that (1) they can be generated with well defined characteristics; (2) it is known which records represent an individual created entity (this is often unknown in real data); and (3) the generated data and the generator program itself can be published. We present a sophisticated data generation and corruption tool that allows the creation of various types of data, ranging from names and addresses, dates, social security and credit card numbers, to numerical values such as salary or blood pressure. Our tool can model dependencies between attributes, and it allows the corruption of values in various ways. We describe the overall architecture and main components of our tool, and illustrate how a user can easily extend this tool with novel functionalities.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"21 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"48","resultStr":"{\"title\":\"Flexible and extensible generation and corruption of personal data\",\"authors\":\"P. Christen, Dinusha Vatsalan\",\"doi\":\"10.1145/2505515.2507815\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With much of today's data being generated by people or referring to people, researchers increasingly require data that contain personal identifying information to evaluate their new algorithms. In areas such as record matching and de-duplication, fraud detection, cloud computing, and health informatics, issues such as data entry errors, typographical mistakes, noise, or recording variations, can all significantly affect the outcomes of data integration, processing, and mining projects. However, privacy concerns make it challenging to obtain real data that contain personal details. An alternative to using sensitive real data is to create synthetic data which follow similar characteristics. The advantages of synthetic data are that (1) they can be generated with well defined characteristics; (2) it is known which records represent an individual created entity (this is often unknown in real data); and (3) the generated data and the generator program itself can be published. We present a sophisticated data generation and corruption tool that allows the creation of various types of data, ranging from names and addresses, dates, social security and credit card numbers, to numerical values such as salary or blood pressure. Our tool can model dependencies between attributes, and it allows the corruption of values in various ways. We describe the overall architecture and main components of our tool, and illustrate how a user can easily extend this tool with novel functionalities.\",\"PeriodicalId\":20528,\"journal\":{\"name\":\"Proceedings of the 22nd ACM international conference on Information & Knowledge Management\",\"volume\":\"21 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-10-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"48\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 22nd ACM international conference on Information & Knowledge Management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2505515.2507815\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2505515.2507815","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 48

摘要

由于今天的许多数据都是由人产生或涉及人，研究人员越来越多地需要包含个人识别信息的数据来评估他们的新算法。在记录匹配和重复数据删除、欺诈检测、云计算和健康信息学等领域，数据输入错误、排版错误、噪音或记录变化等问题都可能严重影响数据集成、处理和挖掘项目的结果。然而，隐私问题使得获取包含个人详细信息的真实数据变得困难。使用敏感真实数据的替代方法是创建具有类似特征的合成数据。合成数据的优势在于:(1)生成的数据具有明确的特征;(2)知道哪些记录代表一个单独创建的实体(这在实际数据中通常是未知的);(3)生成的数据和生成器程序本身可以被发布。我们提供了一个复杂的数据生成和破坏工具，允许创建各种类型的数据，从姓名和地址，日期，社会保障和信用卡号码，到数值，如工资或血压。我们的工具可以对属性之间的依赖关系进行建模，并且它允许以各种方式破坏值。我们描述了工具的总体体系结构和主要组件，并说明了用户如何使用新功能轻松扩展此工具。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Flexible and extensible generation and corruption of personal data

With much of today's data being generated by people or referring to people, researchers increasingly require data that contain personal identifying information to evaluate their new algorithms. In areas such as record matching and de-duplication, fraud detection, cloud computing, and health informatics, issues such as data entry errors, typographical mistakes, noise, or recording variations, can all significantly affect the outcomes of data integration, processing, and mining projects. However, privacy concerns make it challenging to obtain real data that contain personal details. An alternative to using sensitive real data is to create synthetic data which follow similar characteristics. The advantages of synthetic data are that (1) they can be generated with well defined characteristics; (2) it is known which records represent an individual created entity (this is often unknown in real data); and (3) the generated data and the generator program itself can be published. We present a sophisticated data generation and corruption tool that allows the creation of various types of data, ranging from names and addresses, dates, social security and credit card numbers, to numerical values such as salary or blood pressure. Our tool can model dependencies between attributes, and it allows the corruption of values in various ways. We describe the overall architecture and main components of our tool, and illustrate how a user can easily extend this tool with novel functionalities.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助