Jianhua Zhao, Changchun Shang, Shulan Li, Ling Xin, Philip L. H. Yu
Advances in Data Analysis and Classification. Journal article, published 2024-03-07. DOI: 10.1007/s11634-024-00582-w (https://doi.org/10.1007/s11634-024-00582-w)
Choosing the number of factors in factor analysis with incomplete data via a novel hierarchical Bayesian information criterion
The Bayesian information criterion (BIC), defined as the observed-data log-likelihood minus a penalty term based on the sample size N, is a popular model selection criterion for factor analysis with complete data. This definition has also been suggested for incomplete data. However, the penalty term, being based on the ‘complete’ sample size N, is the same whether the data are complete or incomplete. For incomplete data, there are often only \(N_i<N\) observations for variable i, so using the ‘complete’ sample size N implausibly ignores the amount of missing information inherent in incomplete data. Given this observation, a novel hierarchical BIC (HBIC) criterion, denoted HBICinc, is proposed for factor analysis with incomplete data. The novelty is that HBICinc uses only the actual amounts of observed information, namely the \(N_i\), in the penalty term. Theoretically, it is shown that HBICinc is a large-sample approximation of the variational Bayesian (VB) lower bound and that BIC is a further approximation of HBICinc, which means that HBICinc shares the theoretical consistency of BIC. Experiments on synthetic and real data sets are conducted to assess the finite-sample performance of HBICinc, BIC, and related criteria under various missing rates. The results show that HBICinc and BIC perform similarly when the missing rate is small, but HBICinc is more accurate when the missing rate is not small.
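To make the contrast between the two penalty terms concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's HBICinc formula, which the abstract does not give. It assumes each variable contributes q + 2 parameters (a mean, q loadings, and a uniqueness), ignores the rotational correction, marks missing entries with None, and assumes every variable has at least one observed value. The function names are hypothetical.

```python
import math

def observed_counts(rows):
    """Return N_i, the number of non-missing entries (None = missing) in each column."""
    n_vars = len(rows[0])
    return [sum(row[i] is not None for row in rows) for i in range(n_vars)]

def penalty_complete_n(n_rows, n_vars, q):
    """BIC-style penalty: every variable's q + 2 parameters are charged log N."""
    return 0.5 * n_vars * (q + 2) * math.log(n_rows)

def penalty_per_variable(counts, q):
    """Hierarchical-style penalty (assumed form): variable i's q + 2
    parameters are charged log N_i instead of log N."""
    return 0.5 * sum((q + 2) * math.log(n_i) for n_i in counts)

# Toy data: 4 rows, 3 variables, None marks a missing value.
rows = [
    [1.0, 2.0, None],
    [0.5, None, 1.5],
    [1.2, 1.9, 0.7],
    [0.9, 2.1, None],
]
counts = observed_counts(rows)  # [4, 3, 2]
print(counts)
print(penalty_complete_n(len(rows), n_vars=3, q=1))
print(penalty_per_variable(counts, q=1))
```

With no missing values every N_i equals N and the two penalties coincide in this sketch. As more entries go missing, the per-variable penalty falls below the one based on N, which is the effect the abstract describes.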
About the journal:
The international journal Advances in Data Analysis and Classification (ADAC) is designed as a forum for high-standard publications on research and applications concerning the extraction of knowable aspects from many types of data. It publishes articles on such topics as structural, quantitative, or statistical approaches for the analysis of data; advances in classification, clustering, and pattern recognition methods; strategies for modeling complex data and mining large data sets; methods for the extraction of knowledge from data; and applications of advanced methods in specific domains of practice. Articles illustrate how new domain-specific knowledge can be made available from data by skillful use of data analysis methods. The journal also publishes survey papers that outline and illuminate the basic ideas and techniques of special approaches.