Synthetic Data by Principal Component Analysis

2020 International Conference on Data Mining Workshops (ICDMW); Pub Date: 2020-11-01; DOI: 10.1109/ICDMW51313.2020.00023

Natsuki Sano

Citations: 1

Abstract

In statistical disclosure control, releasing synthetic data makes it difficult to identify individual records, since the values in synthetic data differ from those in the original data. We propose two methods of generating synthetic data using principal component analysis: orthogonal transformation (linear method) and sandglass-type neural networks (nonlinear method). While the typical method of generating synthetic data by multiple imputation requires variables common to the population and survey data, our proposed methods can generate synthetic data without common variables. Additionally, the linear method can explicitly evaluate information loss as the ratio of discarded eigenvalues. We generate synthetic data with the proposed methods for decathlon data and evaluate four information loss measures: our proposed information loss measure, the mean absolute error for each record, the mean absolute error of the mean of each variable, and the mean absolute error of the covariance between variables. We find that information loss in the linear method is less than that in the nonlinear method.
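The linear method described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: it assumes that "orthogonal transformation" means projecting centered data onto the leading k principal components and mapping back, and that the "ratio of discarded eigenvalues" is the sum of the discarded eigenvalues divided by the sum of all eigenvalues. The function names `pca_synthetic` and `error_measures`, the parameter `k`, and the exact form of the three error measures are hypothetical, and random data stands in for the decathlon data.

```python
import numpy as np


def pca_synthetic(X, k):
    """Reconstruct X from its top-k principal components (assumed linear method).

    Returns the reconstructed data and the share of total variance carried by
    the discarded eigenvalues (an assumed reading of the information loss).
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = np.cov(Xc, rowvar=False)

    # eigh returns eigenvalues in ascending order; reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    W = eigvecs[:, :k]            # orthonormal basis of the retained subspace
    scores = Xc @ W               # principal component scores
    X_syn = scores @ W.T + mean   # map back to the original variable space

    loss = eigvals[k:].sum() / eigvals.sum()
    return X_syn, loss


def error_measures(X, X_syn):
    """Three mean-absolute-error measures (assumed definitions)."""
    cov_diff = np.cov(X, rowvar=False) - np.cov(X_syn, rowvar=False)
    return {
        "mae_record": np.abs(X - X_syn).mean(),
        "mae_mean": np.abs(X.mean(axis=0) - X_syn.mean(axis=0)).mean(),
        "mae_cov": np.abs(cov_diff).mean(),
    }


# Random correlated data as a stand-in; NOT the decathlon data from the paper.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 10)) @ rng.normal(size=(10, 10))
X_demo_syn, demo_loss = pca_synthetic(X_demo, k=5)
print("information loss:", demo_loss)
print(error_measures(X_demo, X_demo_syn))
```

Under these assumptions the information loss is known before any synthetic record is generated, because it depends only on the eigenvalues of the covariance matrix and on the choice of `k`. The nonlinear sandglass-type network is not sketched here, since the abstract gives no details of its architecture.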


