{"title":"合成数据中缺失数据分布的保护","authors":"Xinyu Wang, H. Asif, Jaideep Vaidya","doi":"10.1145/3543507.3583297","DOIUrl":null,"url":null,"abstract":"Data from Web artifacts and from the Web is often sensitive and cannot be directly shared for data analysis. Therefore, synthetic data generated from the real data is increasingly used as a privacy-preserving substitute. In many cases, real data from the web has missing values where the missingness itself possesses important informational content, which domain experts leverage to improve their analysis. However, this information content is lost if either imputation or deletion is used before synthetic data generation. In this paper, we propose several methods to generate synthetic data that preserve both the observable and the missing data distributions. An extensive empirical evaluation over a range of carefully fabricated and real world datasets demonstrates the effectiveness of our approach.","PeriodicalId":296351,"journal":{"name":"Proceedings of the ACM Web Conference 2023","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Preserving Missing Data Distribution in Synthetic Data\",\"authors\":\"Xinyu Wang, H. Asif, Jaideep Vaidya\",\"doi\":\"10.1145/3543507.3583297\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data from Web artifacts and from the Web is often sensitive and cannot be directly shared for data analysis. Therefore, synthetic data generated from the real data is increasingly used as a privacy-preserving substitute. In many cases, real data from the web has missing values where the missingness itself possesses important informational content, which domain experts leverage to improve their analysis. However, this information content is lost if either imputation or deletion is used before synthetic data generation. In this paper, we propose several methods to generate synthetic data that preserve both the observable and the missing data distributions. An extensive empirical evaluation over a range of carefully fabricated and real world datasets demonstrates the effectiveness of our approach.\",\"PeriodicalId\":296351,\"journal\":{\"name\":\"Proceedings of the ACM Web Conference 2023\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-04-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM Web Conference 2023\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3543507.3583297\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Web Conference 2023","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3543507.3583297","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Preserving Missing Data Distribution in Synthetic Data
Data from Web artifacts and from the Web is often sensitive and cannot be directly shared for data analysis. Therefore, synthetic data generated from the real data is increasingly used as a privacy-preserving substitute. In many cases, real data from the web has missing values where the missingness itself possesses important informational content, which domain experts leverage to improve their analysis. However, this information content is lost if either imputation or deletion is used before synthetic data generation. In this paper, we propose several methods to generate synthetic data that preserve both the observable and the missing data distributions. An extensive empirical evaluation over a range of carefully fabricated and real world datasets demonstrates the effectiveness of our approach.