主成分分析中的最小成本压缩风险

IF 0.8 4区数学 Q3 STATISTICS & PROBABILITY Australian & New Zealand Journal of Statistics Pub Date : 2022-12-28 DOI:10.1111/anzs.12378

Bhargab Chattopadhyay, Swarnali Banerjee

{"title":"主成分分析中的最小成本压缩风险","authors":"Bhargab Chattopadhyay, Swarnali Banerjee","doi":"10.1111/anzs.12378","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>Principal Component Analysis (PCA) is a popular multivariate analytic tool which can be used for dimension reduction without losing much information. Data vectors containing a large number of features arriving sequentially may be correlated with each other. An effective algorithm for such situations is online PCA. Existing Online PCA research works revolve around proposing efficient scalable updating algorithms focusing on compression loss only. They do not take into account the size of the dataset at which further arrival of data vectors can be terminated and dimension reduction can be applied. It is well known that the dataset size contributes to reducing the compression loss – the smaller the dataset size, the larger the compression loss while larger the dataset size, the lesser the compression loss. However, the reduction in compression loss by increasing dataset size will increase the total data collection cost. In this paper, we move beyond the scalability and updation problems related to Online PCA and focus on optimising a cost-compression loss which considers the compression loss and data collection cost. We minimise the corresponding risk using a two-stage PCA algorithm. The resulting two-stage algorithm is a fast and an efficient alternative to Online PCA and is shown to exhibit attractive convergence properties with no assumption on specific data distributions. Experimental studies demonstrate similar results and further illustrations are provided using real data. As an extension, a multi-stage PCA algorithm is discussed as well. Given the time complexity, the two-stage PCA algorithm is emphasised over the multi-stage PCA algorithm for online data.</p>\n </div>","PeriodicalId":55428,"journal":{"name":"Australian & New Zealand Journal of Statistics","volume":"64 4","pages":"422-441"},"PeriodicalIF":0.8000,"publicationDate":"2022-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Minimum cost-compression risk in principal component analysis\",\"authors\":\"Bhargab Chattopadhyay, Swarnali Banerjee\",\"doi\":\"10.1111/anzs.12378\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>Principal Component Analysis (PCA) is a popular multivariate analytic tool which can be used for dimension reduction without losing much information. Data vectors containing a large number of features arriving sequentially may be correlated with each other. An effective algorithm for such situations is online PCA. Existing Online PCA research works revolve around proposing efficient scalable updating algorithms focusing on compression loss only. They do not take into account the size of the dataset at which further arrival of data vectors can be terminated and dimension reduction can be applied. It is well known that the dataset size contributes to reducing the compression loss – the smaller the dataset size, the larger the compression loss while larger the dataset size, the lesser the compression loss. However, the reduction in compression loss by increasing dataset size will increase the total data collection cost. In this paper, we move beyond the scalability and updation problems related to Online PCA and focus on optimising a cost-compression loss which considers the compression loss and data collection cost. We minimise the corresponding risk using a two-stage PCA algorithm. The resulting two-stage algorithm is a fast and an efficient alternative to Online PCA and is shown to exhibit attractive convergence properties with no assumption on specific data distributions. Experimental studies demonstrate similar results and further illustrations are provided using real data. As an extension, a multi-stage PCA algorithm is discussed as well. Given the time complexity, the two-stage PCA algorithm is emphasised over the multi-stage PCA algorithm for online data.</p>\\n </div>\",\"PeriodicalId\":55428,\"journal\":{\"name\":\"Australian & New Zealand Journal of Statistics\",\"volume\":\"64 4\",\"pages\":\"422-441\"},\"PeriodicalIF\":0.8000,\"publicationDate\":\"2022-12-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Australian & New Zealand Journal of Statistics\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/anzs.12378\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"STATISTICS & PROBABILITY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Australian & New Zealand Journal of Statistics","FirstCategoryId":"100","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/anzs.12378","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}

引用次数: 0

摘要

主成分分析(PCA)是一种流行的多元分析工具，它可以在不丢失太多信息的情况下进行降维。包含大量顺序到达的特征的数据向量可能彼此相关。在线PCA是一种有效的算法。现有的在线PCA研究工作围绕着提出有效的可扩展更新算法，只关注压缩损失。它们没有考虑数据集的大小，数据向量的进一步到达可以被终止，并且可以应用降维。众所周知，数据集大小有助于减少压缩损失——数据集大小越小，压缩损失越大，而数据集大小越大，压缩损失越小。然而，通过增加数据集大小来减少压缩损失将增加总数据收集成本。在本文中，我们超越了与在线PCA相关的可扩展性和更新问题，并专注于优化考虑压缩损失和数据收集成本的成本-压缩损失。我们使用两阶段PCA算法最小化相应的风险。所得到的两阶段算法是一种快速而有效的在线PCA替代方案，并且在不假设特定数据分布的情况下显示出有吸引力的收敛特性。实验研究表明了类似的结果，并利用实际数据提供了进一步的说明。作为扩展，本文还讨论了一种多阶段PCA算法。考虑到在线数据的时间复杂度，两阶段主成分分析算法比多阶段主成分分析算法更受重视。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Minimum cost-compression risk in principal component analysis

Principal Component Analysis (PCA) is a popular multivariate analytic tool which can be used for dimension reduction without losing much information. Data vectors containing a large number of features arriving sequentially may be correlated with each other. An effective algorithm for such situations is online PCA. Existing Online PCA research works revolve around proposing efficient scalable updating algorithms focusing on compression loss only. They do not take into account the size of the dataset at which further arrival of data vectors can be terminated and dimension reduction can be applied. It is well known that the dataset size contributes to reducing the compression loss – the smaller the dataset size, the larger the compression loss while larger the dataset size, the lesser the compression loss. However, the reduction in compression loss by increasing dataset size will increase the total data collection cost. In this paper, we move beyond the scalability and updation problems related to Online PCA and focus on optimising a cost-compression loss which considers the compression loss and data collection cost. We minimise the corresponding risk using a two-stage PCA algorithm. The resulting two-stage algorithm is a fast and an efficient alternative to Online PCA and is shown to exhibit attractive convergence properties with no assumption on specific data distributions. Experimental studies demonstrate similar results and further illustrations are provided using real data. As an extension, a multi-stage PCA algorithm is discussed as well. Given the time complexity, the two-stage PCA algorithm is emphasised over the multi-stage PCA algorithm for online data.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Australian & New Zealand Journal of Statistics 数学-统计学与概率论

CiteScore

1.30

自引率

9.10%

发文量

审稿时长

>12 weeks

期刊介绍： The Australian & New Zealand Journal of Statistics is an international journal managed jointly by the Statistical Society of Australia and the New Zealand Statistical Association. Its purpose is to report significant and novel contributions in statistics, ranging across articles on statistical theory, methodology, applications and computing. The journal has a particular focus on statistical techniques that can be readily applied to real-world problems, and on application papers with an Australasian emphasis. Outstanding articles submitted to the journal may be selected as Discussion Papers, to be read at a meeting of either the Statistical Society of Australia or the New Zealand Statistical Association. The main body of the journal is divided into three sections. The Theory and Methods Section publishes papers containing original contributions to the theory and methodology of statistics, econometrics and probability, and seeks papers motivated by a real problem and which demonstrate the proposed theory or methodology in that situation. There is a strong preference for papers motivated by, and illustrated with, real data. The Applications Section publishes papers demonstrating applications of statistical techniques to problems faced by users of statistics in the sciences, government and industry. A particular focus is the application of newly developed statistical methodology to real data and the demonstration of better use of established statistical methodology in an area of application. It seeks to aid teachers of statistics by placing statistical methods in context. The Statistical Computing Section publishes papers containing new algorithms, code snippets, or software descriptions (for open source software only) which enhance the field through the application of computing. Preference is given to papers featuring publically available code and/or data, and to those motivated by statistical methods for practical problems.