Preserving Missing Data Distribution in Synthetic Data

Proceedings of the ACM Web Conference 2023. Pub Date: 2023-04-30. DOI: 10.1145/3543507.3583297

Xinyu Wang, H. Asif, Jaideep Vaidya


Citations: 0

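Abstract

Data from Web artifacts and from the Web is often sensitive and cannot be directly shared for data analysis. Therefore, synthetic data generated from the real data is increasingly used as a privacy-preserving substitute. In many cases, real data from the Web has missing values where the missingness itself carries important informational content, which domain experts leverage to improve their analysis. However, this informational content is lost if either imputation or deletion is applied before synthetic data generation. In this paper, we propose several methods to generate synthetic data that preserve both the observable and the missing data distributions. An extensive empirical evaluation over a range of carefully fabricated and real-world datasets demonstrates the effectiveness of our approach.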
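To make the abstract's point concrete, here is a minimal sketch in Python. It is not the paper's method: the toy dataset, the column meanings (age, income), and every function name are assumptions introduced for illustration. It shows that deleting or imputing missing values before generation erases the missingness signal, while a naive generator that samples a missingness pattern first roughly keeps the missing rate.

```python
import random


def missing_rate(rows, col):
    """Fraction of rows whose value in column `col` is missing (None)."""
    return sum(r[col] is None for r in rows) / len(rows)


def make_toy_data(n, rng):
    """Hypothetical data: income is withheld more often by older users."""
    rows = []
    for _ in range(n):
        age = rng.randint(18, 80)
        p_missing = 0.6 if age >= 50 else 0.1
        income = None if rng.random() < p_missing else rng.randint(20, 120)
        rows.append((age, income))
    return rows


def drop_incomplete(rows):
    """Deletion: keep only rows with no missing value."""
    return [r for r in rows if None not in r]


def impute_mean(rows):
    """Imputation: replace each missing value with its column mean."""
    means = []
    for c in range(len(rows[0])):
        observed = [r[c] for r in rows if r[c] is not None]
        means.append(sum(observed) / len(observed))
    return [tuple(means[c] if v is None else v for c, v in enumerate(r))
            for r in rows]


def synthesize_with_mask(rows, n, rng):
    """Naive generator (an assumption, not the authors' approach):
    sample a whole-row missingness pattern from the real data, then fill
    the observed cells from each column's observed marginal."""
    patterns = [tuple(v is None for v in r) for r in rows]
    observed = [[r[c] for r in rows if r[c] is not None]
                for c in range(len(rows[0]))]
    out = []
    for _ in range(n):
        pattern = rng.choice(patterns)
        out.append(tuple(None if miss else rng.choice(observed[c])
                         for c, miss in enumerate(pattern)))
    return out


rng = random.Random(0)
real = make_toy_data(5000, rng)
deleted = drop_incomplete(real)
imputed = impute_mean(real)
synthetic = synthesize_with_mask(real, 5000, rng)

for name, data in [("real", real), ("deleted", deleted),
                   ("imputed", imputed), ("synthetic", synthetic)]:
    print(name, "income missing rate:", round(missing_rate(data, 1), 3))
```

The sketch only preserves how often each missingness pattern occurs. It samples values independently of the pattern, so it drops the link between age and missing income that the toy data builds in; preserving that dependence is the harder problem the abstract describes.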


