A comparative analysis of the principal component method and parallel analysis in working with official statistical data

Statistics in Transition (JCR Q4, Mathematics) | Pub Date: 2023-02-24 | DOI: 10.59170/stattrans-2023-011

Halyna Holubova



Abstract

The dynamic development of the digitized society generates large-scale flows of information. Data therefore need to be compressed in a way that keeps their content complete and informative. To achieve this, it is advisable to use the principal component method, whose main task is to reduce the dimension of a multidimensional space with minimal loss of information. The article describes the basic conceptual approaches to the definition of principal components and presents the methodological principles for selecting them. Among the many ways to select principal components, the easiest are to retain the first k components with the largest eigenvalues or to determine the percentage of the total variance explained by each component. Many statistical packages use the Kaiser criterion for this purpose, retaining components with eigenvalues greater than one. However, this criterion fails to take into account that even random data (noise) can yield components with eigenvalues greater than one; in other words, it can select redundant components. We conclude that the classical mechanisms should be used with caution when selecting principal components. Parallel analysis uses multiple data simulations to overcome the problem of random errors. The method assumes that the components of real data must have greater eigenvalues than the parallel components derived from simulated data with the same sample size and design, variance and number of variables. The eigenvalues obtained with two methods, the Kaiser criterion and Horn's parallel analysis, were compared on several data sets. The study shows that parallel analysis produces more valid results with actual data sets. We believe that the main advantage of parallel analysis is its ability to model the process of selecting the required number of principal components by determining the point at which they cannot be distinguished from those generated by simulated noise.
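The contrast between the two selection rules described in the abstract can be illustrated with a minimal Python sketch. This is not the article's own implementation. The function names, the use of standard normal noise for the simulated data, the 95th-percentile threshold and the synthetic two-factor data set are all assumptions chosen for illustration (Horn's original proposal compared against the mean of the simulated eigenvalues).

```python
import numpy as np


def correlation_eigenvalues(data):
    """Eigenvalues of the correlation matrix of `data`, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]


def kaiser_count(eigenvalues):
    """Kaiser criterion: count the components with eigenvalue > 1."""
    return int(np.sum(eigenvalues > 1.0))


def parallel_analysis_count(data, n_sim=200, percentile=95, seed=0):
    """Sketch of Horn's parallel analysis.

    Assumption: the simulated data are standard normal noise with the same
    number of rows and columns as `data`; the k-th observed eigenvalue is
    compared with the chosen percentile of the k-th simulated eigenvalues.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape
    observed = correlation_eigenvalues(data)

    simulated = np.empty((n_sim, n_cols))
    for i in range(n_sim):
        noise = rng.standard_normal((n_rows, n_cols))
        simulated[i] = correlation_eigenvalues(noise)
    threshold = np.percentile(simulated, percentile, axis=0)

    # Retain leading components until one fails to exceed its noise threshold.
    count = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        count += 1
    return count, observed, threshold


# Hypothetical demonstration data (not from the article).
rng = np.random.default_rng(42)

# Pure noise: 100 observations of 20 unrelated variables.
noise_data = rng.standard_normal((100, 20))
print("noise, Kaiser:", kaiser_count(correlation_eigenvalues(noise_data)))
print("noise, parallel analysis:", parallel_analysis_count(noise_data)[0])

# Two latent factors, each driving four of eight observed variables.
factors = rng.standard_normal((300, 2))
structured = 0.8 * np.repeat(factors, 4, axis=1) \
    + 0.6 * rng.standard_normal((300, 8))
print("structured, Kaiser:", kaiser_count(correlation_eigenvalues(structured)))
print("structured, parallel analysis:", parallel_analysis_count(structured)[0])
```

On pure noise, the Kaiser criterion necessarily retains at least one component, because the eigenvalues of a correlation matrix sum to the number of variables and so cannot all fall below one. Parallel analysis instead compares each observed eigenvalue with what noise of the same shape would produce. Because the sketch works on correlation matrices, the scale of the simulated variables does not affect the result.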




Source journal

Statistics in Transition (Decision Sciences: Statistics, Probability and Uncertainty)

CiteScore: 1.00

Self-citation rate: 0.00%

Review time: 9 weeks

Journal description: Statistics in Transition (SiT) is an international journal published jointly by the Polish Statistical Association (PTS) and the Central Statistical Office of Poland (CSO/GUS), which sponsors the publication. Launched in 1993, it was issued twice a year until 2006; since then it has appeared, under the slightly changed title Statistics in Transition new series, three times a year, and after 2013 as a regular quarterly journal. The journal provides a forum for the exchange of ideas and experience among members of the international community of statisticians, data producers and users, including researchers, teachers, policy makers and the general public. Its initially dominant focus on statistical issues pertinent to the transition from a centrally planned to a market-oriented economy has gradually been extended to embrace statistical problems related to the development and modernization of the system of public (official) statistics in general.