{"title":"A comparative analysis of the principal component method and parallel analysis in\n working with official statistical data","authors":"Halyna Holubova","doi":"10.59170/stattrans-2023-011","DOIUrl":null,"url":null,"abstract":"The dynamic development of the digitized society generates large-scale information\n data flows. Therefore, data need to be compressed in a way allowing its content to\n remain complete and informative. In order for the above to be achieved, it is advisable\n to use the principal component method whose main task is to reduce the dimension of\n multidimensional space with a minimal loss of information. The article describes the\n basic conceptual approaches to the definition of principle components. Moreover, the\n methodological principles of selecting the main components are presented. Among the many\n ways to select principle components, the easiest way is selecting the first k-number of\n components with the largest eigenvalues or to determine the percentage of the total\n variance explained by each component. Many statistical data packages often use the\n Kaiser method for this purpose. However, this method fails to take into account the fact\n that when dealing with random data (noise), it is possible to identify components with\n eigenvalues greater than one, or in other words, to select redundant components. We\n conclude that when selecting the main components, the classical mechanisms should be\n used with caution. The Parallel analysis method uses multiple data simulations to\n overcome the problem of random errors. This method assumes that the components of real\n data must have greater eigenvalues than the parallel components derived from simulated\n data which have the same sample size and design, variance and number of variables. A\n comparative analysis of the eigenvalues was performed by means of two methods: the\n Kaiser criterion and the parallel Horn analysis on the example of several data sets. 
The\n study shows that the method of parallel analysis produces more valid results with actual\n data sets. We believe that the main advantage of Parallel analysis is its ability to\n model the process of selecting the required number of main components by determining the\n point at which they cannot be distinguished from those generated by simulated\n noise.","PeriodicalId":37985,"journal":{"name":"Statistics in Transition","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistics in Transition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.59170/stattrans-2023-011","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"Mathematics","Score":null,"Total":0}
Citations: 0
Abstract
The dynamic development of the digitized society generates large-scale flows of information. Data therefore need to be compressed in a way that keeps their content complete and informative. To achieve this, it is advisable to use the principal component method, whose main task is to reduce the dimensionality of a multidimensional space with minimal loss of information. The article describes the basic conceptual approaches to the definition of principal components and presents the methodological principles for selecting them. Among the many ways to select principal components, the simplest are to retain the first k components with the largest eigenvalues or to assess the percentage of the total variance explained by each component. Many statistical software packages use the Kaiser criterion for this purpose. However, this method fails to account for the fact that even purely random data (noise) can yield components with eigenvalues greater than one; in other words, it can select redundant components. We conclude that the classical mechanisms for selecting principal components should be used with caution. The parallel analysis method uses repeated data simulations to overcome the problem of random errors. It assumes that components of real data must have greater eigenvalues than the corresponding components derived from simulated data with the same sample size, design, variance and number of variables. A comparative analysis of the eigenvalues was performed with two methods, the Kaiser criterion and Horn's parallel analysis, on several example data sets. The study shows that parallel analysis produces more valid results on actual data sets. We believe that the main advantage of parallel analysis is its ability to model the process of selecting the required number of principal components by determining the point at which they can no longer be distinguished from components generated by simulated noise.
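The contrast the abstract draws can be illustrated with a minimal sketch in Python/NumPy (this is an illustrative reconstruction of the two selection rules, not code from the paper): on purely random data, the Kaiser criterion (retain eigenvalues of the correlation matrix greater than one) keeps several spurious components, while Horn's parallel analysis, which compares each observed eigenvalue against the same-rank eigenvalues of simulated noise data of identical shape, keeps essentially none. The sample size, number of variables, number of simulations and the 95th-percentile threshold are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pure noise: 200 observations of 10 uncorrelated standard-normal variables.
n_obs, n_vars = 200, 10
X = rng.standard_normal((n_obs, n_vars))

# Eigenvalues of the correlation matrix of the observed data, largest first.
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

# Kaiser criterion: retain every component with eigenvalue > 1.
kaiser_k = int(np.sum(eigvals > 1))

# Horn's parallel analysis: simulate many noise data sets of the same
# shape, and retain only components whose observed eigenvalue exceeds
# the 95th percentile of the simulated eigenvalues at the same rank.
n_sim = 500
sim_eigvals = np.empty((n_sim, n_vars))
for i in range(n_sim):
    Z = rng.standard_normal((n_obs, n_vars))
    sim_eigvals[i] = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]
threshold = np.percentile(sim_eigvals, 95, axis=0)
pa_k = int(np.sum(eigvals > threshold))

# On pure noise, Kaiser typically retains several "components",
# while parallel analysis retains few or none.
print("Kaiser retains:", kaiser_k, "| parallel analysis retains:", pa_k)
```

Applied to real data instead of noise, the same comparison marks the point the abstract describes: components stop being retained exactly where their eigenvalues become indistinguishable from those of simulated noise.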
Journal description:
Statistics in Transition (SiT) is an international journal published jointly by the Polish Statistical Association (PTS) and the Central Statistical Office of Poland (CSO/GUS), which sponsors the publication. Launched in 1993, it was issued twice a year until 2006; since then it has appeared, under the slightly changed title Statistics in Transition new series, three times a year, and since 2013 as a regular quarterly journal. The journal provides a forum for the exchange of ideas and experience among the international community of statisticians, data producers and users, including researchers, teachers, policy makers and the general public. Its initially dominant focus on statistical issues pertinent to the transition from a centrally planned to a market-oriented economy has gradually been extended to embrace statistical problems related to the development and modernization of the system of public (official) statistics in general.