Synthetic Data by Principal Component Analysis

Natsuki Sano
Published in: 2020 International Conference on Data Mining Workshops (ICDMW), 2020-11-01
DOI: 10.1109/ICDMW51313.2020.00023 (https://doi.org/10.1109/ICDMW51313.2020.00023)
Citations: 1

Abstract

In statistical disclosure control, releasing synthetic data makes individual records difficult to identify, because the synthetic values differ from the original values. We propose two methods of generating synthetic data based on principal component analysis: an orthogonal transformation (linear method) and a sandglass-type neural network (nonlinear method). The typical approach, generating synthetic data by multiple imputation, requires variables common to the population and survey data; the proposed methods generate synthetic data without common variables. In addition, the linear method allows information loss to be evaluated explicitly as the ratio of discarded eigenvalues. We apply the proposed methods to decathlon data and evaluate four information loss measures: our proposed measure, the mean absolute error for each record, the mean absolute error of the mean of each variable, and the mean absolute error of the covariance between variables. We find that the linear method loses less information than the nonlinear method.
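
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of the linear method: project the centered data onto the leading k principal components, map back with the same orthogonal basis, and report the share of total variance carried by the discarded eigenvalues. The function names `pca_synthetic` and `error_measures`, the choice of k, and the random stand-in data (used here in place of the decathlon data) are all assumptions for illustration, not the author's implementation.

```python
import numpy as np


def pca_synthetic(X, k):
    """Hypothetical helper: rebuild X from its leading k principal components.

    Returns the reconstructed (synthetic) data and the information loss,
    taken here as sum(discarded eigenvalues) / sum(all eigenvalues).
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                          # orthonormal basis of kept components
    X_syn = (Xc @ W) @ W.T + mean               # project, then map back
    loss = eigvals[k:].sum() / eigvals.sum()
    return X_syn, loss


def error_measures(X, X_syn):
    """Assumed forms of the three mean-absolute-error measures named in the abstract."""
    return {
        # one value per record (row)
        "mae_record": np.abs(X - X_syn).mean(axis=1),
        # difference between per-variable means
        "mae_mean": np.abs(X.mean(axis=0) - X_syn.mean(axis=0)).mean(),
        # difference between covariance matrices
        "mae_cov": np.abs(
            np.cov(X, rowvar=False) - np.cov(X_syn, rowvar=False)
        ).mean(),
    }


# Random stand-in data: 50 records, 10 correlated variables (not the decathlon data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10)) @ rng.normal(size=(10, 10))

X_syn, loss = pca_synthetic(X, k=4)
measures = error_measures(X, X_syn)
print(f"information loss (discarded eigenvalue ratio): {loss:.3f}")
print(f"MAE of variable means: {measures['mae_mean']:.2e}")
print(f"MAE of covariances:    {measures['mae_cov']:.3f}")
```

Under this reading, the per-variable means are preserved up to floating-point error, and the total squared reconstruction error divided by n - 1 equals the sum of the discarded eigenvalues, which is why the eigenvalue ratio serves as an explicit loss measure for the linear method.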