Pub Date: 2024-02-24 | DOI: 10.1007/s00180-024-01470-9
Larissa C. Alves, Ronaldo Dias, Helio S. Migon
This work presents a new scalable automatic Bayesian Lasso methodology with variational inference for non-parametric spline regression that can capture the non-linear relationship between a response variable and predictor variables. From the non-parametric point of view, the regression curve is assumed to lie in an infinite-dimensional space. Regression splines use a finite approximation of this infinite space, representing the regression function by a linear combination of basis functions. The crucial point of the approach is determining the appropriate number of basis functions, or equivalently the number of knots, to avoid over- and under-fitting. A decision-theoretic approach was devised for knot selection. Comprehensive simulation studies were conducted in challenging scenarios to compare alternative criteria for knot selection, thereby ensuring the efficacy of the proposed algorithms. Additionally, the performance of the proposed method was assessed using real-world datasets. The novel procedure demonstrated good performance in capturing the underlying data structure by selecting the appropriate number of knots/bases.
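As a rough illustration of the knot-selection problem described above (not the authors' variational Bayesian Lasso), one can fit regression splines with varying numbers of knots and pick the count by an information criterion. Everything below, including the truncated-power basis and the use of AIC, is an illustrative sketch:

```python
import numpy as np

def spline_basis(x, knots):
    # Truncated-power cubic basis: 1, x, x^2, x^3, and (x - k)_+^3 per knot k
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None)**3 for k in knots]
    return np.column_stack(cols)

def fit_spline(x, y, n_knots):
    # Interior knots placed at equally spaced quantiles of x
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    B = spline_basis(x, knots)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    resid = y - B @ coef
    n, p = B.shape
    aic = n * np.log(np.mean(resid**2)) + 2 * p  # Gaussian AIC up to a constant
    return coef, knots, aic

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)
# Pick the number of interior knots minimizing AIC
best = min(range(2, 15), key=lambda k: fit_spline(x, y, k)[2])
```

Too few knots under-fits (large residuals), too many over-fits (penalized by the 2p term); the criterion trades the two off, which is the balance the abstract refers to.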
Title: Variational Bayesian Lasso for spline regression (Computational Statistics)
In this article, we propose the Poisson-Lindley distribution as a stochastic abundance model in which the sample is drawn according to an independent Poisson process. Jeffreys' and Bernardo's reference priors are obtained, and Bayes estimators of the number of species are proposed for this model. The proposed Bayes estimators are compared with the corresponding profile and conditional maximum likelihood estimators in terms of the square roots of their risks under the squared error loss function (SELF). The Bayesian approaches under the two reference priors are also compared using biological data.
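The Poisson-Lindley probability mass function underlying this abundance model has a closed form; the sketch below uses the standard form P(X = x) = θ²(x + θ + 2)/(θ + 1)^(x+3) and is an illustration only, not the authors' estimation code (the value θ = 1.5 is an arbitrary choice):

```python
def poisson_lindley_pmf(x, theta):
    # P(X = x) = theta^2 * (x + theta + 2) / (theta + 1)**(x + 3), x = 0, 1, 2, ...
    return theta**2 * (x + theta + 2) / (theta + 1)**(x + 3)

def poisson_lindley_mean(theta):
    # E[X] = (theta + 2) / (theta * (theta + 1)), the mean of the mixing Lindley law
    return (theta + 2) / (theta * (theta + 1))

theta = 1.5  # illustrative value, not taken from the paper
total = sum(poisson_lindley_pmf(k, theta) for k in range(300))
```

Summing the pmf over a long range and matching the closed-form mean gives a quick sanity check on the implementation.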
Title: Bayesian estimation of the number of species from Poisson-Lindley stochastic abundance model using non-informative priors
Authors: Anurag Pathak, Manoj Kumar, Sanjay Kumar Singh, Umesh Singh, Sandeep Kumar
Pub Date: 2024-02-23 | DOI: 10.1007/s00180-024-01464-7 (Computational Statistics)
Pub Date: 2024-02-23 | DOI: 10.1007/s00180-024-01468-3
Takayuki Umeda
Normally distributed random numbers are commonly used in scientific computing across various fields. It is important to generate a set of random numbers that is as close to a normal distribution as possible in order to reduce initial fluctuations. Two types of samples from a uniform distribution are examined as source samples for inverse transform sampling methods. Three types of inverse transform sampling methods with new approximations of inverse cumulative distribution functions are also discussed for converting uniformly distributed source samples into normally distributed samples.
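A minimal sketch of inverse transform sampling for the normal distribution: the inverse cumulative distribution function is obtained here by Newton iteration on `math.erf` (an illustrative stand-in for the paper's closed-form approximations), and the source sample is an equiprobable stratified grid, one way a finite sample can match the target distribution closely:

```python
import math

def norm_ppf(p, tol=1e-10):
    # Invert Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) by Newton's method.
    x = 0.0
    for _ in range(100):
        cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        step = (cdf - p) / pdf
        x -= step
        if abs(step) < tol:
            break
    return x

# Stratified source sample: midpoints of n equal-probability strata of (0, 1),
# pushed through the inverse CDF to give near-ideal normal samples.
n = 1000
samples = [norm_ppf((i + 0.5) / n) for i in range(n)]
```

Compared with i.i.d. uniform draws, the stratified source sample removes most of the sampling fluctuation in the empirical mean and variance, which is the motivation the abstract describes.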
Title: Generation of normal distributions revisited (Computational Statistics)
Pub Date: 2024-02-21 | DOI: 10.1007/s00180-024-01466-5
Luca Pedini
This article presents the gretl package BayTool, which augments the software's functionality, mostly concerned with frequentist approaches, with Bayesian estimation methods for commonly used econometric models. Computational efficiency is achieved by pairing extensive use of Gibbs sampling for posterior simulation with the possibility of splitting single-threaded experiments across multiple cores or machines by means of parallelization. From the user's perspective, the package requires only basic knowledge of gretl scripting to fully access its functionality, while providing a point-and-click solution in the form of a graphical interface for a less experienced audience. These features, in particular, make BayTool stand out as an excellent teaching device without sacrificing more advanced or complex applications.
Title: Bayesian regression models in gretl: the BayTool package (Computational Statistics)
Pub Date: 2024-02-20 | DOI: 10.1007/s00180-024-01458-5
Erina Paul, Santosh Sutradhar, Jonathan Hartzel, Devan V. Mehrotra
Designing vaccine efficacy (VE) trials often requires recruiting large numbers of participants when the diseases of interest have a low incidence. When developing novel vaccines, such as those for COVID-19, the plausible range of VE is quite large at the design stage. Thus, the number of events needed to demonstrate efficacy above a pre-defined regulatory threshold can be difficult to predict, and the time needed to accrue the necessary events can often be long. It is therefore advantageous to evaluate efficacy at earlier interim analyses, potentially allowing the trial to stop early for overwhelming VE or for futility. In such cases, incorporating interim analyses through the sequential probability ratio test (SPRT) allows for multiple analyses while controlling both type-I and type-II error rates. In this article, we propose a Bayesian SPRT for designing a vaccine trial that compares a test vaccine with a control, assuming two Poisson incidence rates. We provide guidance on how to choose the prior distribution and how to optimize the number of events for interim analyses to maximize the efficiency of the design. Through simulations, we demonstrate that the proposed Bayesian SPRT performs better than the corresponding frequentist SPRT. An R repository implementing the proposed method is available at: https://github.com/Merck/bayesiansprt.
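The paper's test is Bayesian, but the classical SPRT machinery it builds on is easy to sketch. With equal person-time in the two arms and Poisson event counts, each event falls in the vaccine arm with probability p = (1 - VE)/(2 - VE) conditional on the total, so the SPRT reduces to a binomial likelihood-ratio walk. The VE hypotheses and error rates below are illustrative assumptions, not the paper's design:

```python
import math

def sprt_poisson(events_vaccine, events_control, ve0=0.3, ve1=0.6,
                 alpha=0.025, beta=0.1):
    """Classical (frequentist) SPRT sketch, not the paper's Bayesian test.

    Tests H0: VE = ve0 against H1: VE = ve1 using the conditional binomial
    likelihood, with Wald's approximate stopping boundaries.
    """
    p0 = (1 - ve0) / (2 - ve0)  # vaccine-arm event probability under H0
    p1 = (1 - ve1) / (2 - ve1)  # and under H1
    s, n = events_vaccine, events_vaccine + events_control
    llr = s * math.log(p1 / p0) + (n - s) * math.log((1 - p1) / (1 - p0))
    lower = math.log(beta / (1 - alpha))   # cross below: stop for futility
    upper = math.log((1 - beta) / alpha)   # cross above: stop for efficacy
    if llr >= upper:
        return "stop: efficacy"
    if llr <= lower:
        return "stop: futility"
    return "continue"
```

For example, 5 vaccine-arm events out of 50 total is strong evidence of high VE, while an even split points toward futility; intermediate splits keep the trial running to the next interim look.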
Title: Bayesian sequential probability ratio test for vaccine efficacy trials (Computational Statistics)
Pub Date: 2024-02-19 | DOI: 10.1007/s00180-024-01457-6
Claudio Conversano, Luca Frigau, Giulia Contu
Network-based Semi-Supervised Clustering (NeSSC) is a semi-supervised approach for clustering in the presence of an outcome variable. It uses a classification or regression model on resampled versions of the original data to produce a proximity matrix that indicates the magnitude of the similarity between pairs of observations measured with respect to the outcome. This matrix is transformed into a complex network on which a community detection algorithm is applied to search for an underlying community structure, that is, a partition of the instances into highly homogeneous clusters to be evaluated in terms of the outcome. In this paper, we focus on the case where the outcome variable used in NeSSC is numeric and propose an alternative criterion for selecting the optimal partition, based on a measure of overlap between density curves, as well as a penalization criterion that accounts for the number of clusters in a candidate partition. Next, we assess the performance of the proposed method on some artificial datasets and on 20 different real datasets, and compare NeSSC with three other popular methods of semi-supervised clustering with a numeric outcome. Results show that NeSSC with the overlap criterion works particularly well when a small number of clusters are scattered and localized.
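The overlap measure between density curves can be illustrated with a simple histogram-based estimate of the overlapping coefficient OVL = ∫ min(f, g); this is a generic sketch, not the exact estimator used inside NeSSC:

```python
import numpy as np

def overlap_coefficient(a, b, bins=50):
    # OVL = integral of min(f_a, f_b), approximated with histogram densities
    # evaluated on a shared grid covering both samples.
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    fa, edges = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    fb, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    return float(np.sum(np.minimum(fa, fb) * np.diff(edges)))

rng = np.random.default_rng(3)
near = rng.normal(0.0, 1.0, 5000)   # two well-separated clusters give OVL near 0,
far = rng.normal(10.0, 1.0, 5000)   # identical distributions give OVL of 1
```

Low overlap between the outcome densities of two candidate clusters means they are well separated with respect to the outcome, which is what the selection criterion rewards.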
Title: Overlapping coefficient in network-based semi-supervised clustering (Computational Statistics)
Pub Date: 2024-02-15 | DOI: 10.1007/s00180-024-01462-9
Xing Liu, Weihua Deng
This paper discusses the first exit and Dirichlet problems of the nonisotropic tempered $\alpha$-stable process $X_t$. The upper bounds of all moments of the first exit position $\left|X_{\tau_D}\right|$ and the first exit time $\tau_D$ are explicitly obtained. It is found that the probability density function of $\left|X_{\tau_D}\right|$ or $\tau_D$ decays exponentially with the increase of $\left|X_{\tau_D}\right|$ or $\tau_D$, and that $\mathrm{E}\left[\tau_D\right] \sim \mathrm{E}\left[\left|X_{\tau_D}-\mathrm{E}\left[X_{\tau_D}\right]\right|^2\right]$ and $\mathrm{E}\left[\tau_D\right] \sim \left|\mathrm{E}\left[X_{\tau_D}\right]\right|$. Next, we obtain the Feynman-Kac representation of the Dirichlet problem by employing semigroup theory. Furthermore, averaging the generated trajectories of the stochastic process leads to the solution of the Dirichlet problem, which is also verified by numerical experiments.
Title: First exit and Dirichlet problem for the nonisotropic tempered $\alpha$-stable processes (Computational Statistics)
Pub Date: 2024-02-13 | DOI: 10.1007/s00180-024-01463-8
Wolfgang Kössler, Hans-J. Lenz, Xing D. Wang
Benford's law is used world-wide for detecting non-conformance or data fraud in numerical data. It states that the significand of a data set from the universe is not uniformly but logarithmically distributed. In particular, the first non-zero digit is 1 with an approximate probability of 0.3. Several tests are available for testing conformance with Benford's law; the best known are Pearson's $\chi^2$-test, the Kolmogorov-Smirnov test and a modified version of the MAD-test. In the present paper we propose some tests; three of the four invariant sum tests are new, and they are motivated by the sum-invariance property of Benford's law. Two distance measures are investigated: the Euclidean and the Mahalanobis distance of the standardized sums to the origin. We use the significands corresponding to the first significant digit as well as the second significant digit, respectively. Moreover, we suggest improved versions of the MAD-test and obtain critical values that are independent of the sample sizes. For illustration, the tests are applied to specifically selected data sets for which prior knowledge is available about whether or not they are Benford. Furthermore, we discuss the role of truncation of distributions.
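The first-digit probabilities and the classical Pearson $\chi^2$-test mentioned above are easy to state concretely; a minimal sketch of that baseline test (the invariant sum tests proposed in the paper are not reproduced here):

```python
import math
from collections import Counter

# Benford first-digit probabilities: P(d) = log10(1 + 1/d), d = 1..9
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    # The first character of the scientific-notation mantissa is the leading digit.
    return int(f"{abs(x):.15e}"[0])

def benford_chi2(data):
    # Pearson chi-square statistic against the Benford first-digit distribution
    # (8 degrees of freedom; 5% critical value is about 15.51).
    counts = Counter(first_digit(x) for x in data if x != 0)
    n = sum(counts.values())
    return sum((counts.get(d, 0) - n * p)**2 / (n * p) for d, p in BENFORD.items())

# Powers of 2 are a classic example of a Benford-conforming sequence.
stat = benford_chi2([2.0**k for k in range(1, 201)])
```

Note P(1) = log10(2) ≈ 0.301, matching the "approximately 0.3" figure in the abstract, and the nine probabilities telescope to exactly 1.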
Title: Some new invariant sum tests and MAD tests for the assessment of Benford's law (Computational Statistics)
Pub Date: 2024-02-12 | DOI: 10.1007/s00180-024-01465-6
Yi Wu, Wei Wang, Xuejun Wang
Let $\{X_i, 1 \le i \le n\}$ be a linear process based on dependent random variables with random coefficients, which has a mean shift at an unknown location. The cumulative sum (CUSUM, for short) estimator of the change point is studied. The strong convergence, $L_r$ convergence, complete convergence and the rate of strong convergence are established for the CUSUM estimator under some mild conditions. These results improve and extend corresponding ones in the literature. Simulation studies and two real data examples are also provided to support the theoretical results.
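The CUSUM change-point estimator has a compact form: with $C_k = \sum_{i \le k} X_i - (k/n)\sum_{i \le n} X_i$, the estimated change point maximizes $|C_k|$. A minimal sketch on i.i.d. data (the paper's setting of dependent linear processes with random coefficients is not reproduced here):

```python
import numpy as np

def cusum_changepoint(x):
    # Estimate a single mean-shift location as argmax_k |C_k|, where
    # C_k = sum_{i<=k} x_i - (k/n) * sum_i x_i.
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = np.arange(1, n)
    c = np.cumsum(x)[:-1] - k / n * x.sum()
    return int(np.argmax(np.abs(c)) + 1)  # last index of the first segment

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
tau_hat = cusum_changepoint(x)
```

With a shift of two standard deviations at observation 100, the estimate lands very close to the true change point.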
Title: Convergence of the CUSUM estimation for a mean shift in linear processes with random coefficients (Computational Statistics)
Pub Date: 2024-02-10 | DOI: 10.1007/s00180-023-01447-0
Abstract
Semi-supervised learning approaches have been successfully applied in a wide range of engineering and scientific fields. This paper investigates the generative model framework with a missingness mechanism for unclassified observations, as introduced by Ahfock and McLachlan (Stat Comput 30:1–12, 2020). We show that in a partially classified sample, a classifier using Bayes’ rule of allocation with a missing-data mechanism can surpass a fully supervised classifier in a two-class normal homoscedastic model, especially with moderate to low overlap and proportion of missing class labels, or with large overlap but few missing labels. It also outperforms a classifier with no missing-data mechanism regardless of the overlap region or the proportion of missing class labels. Our exploration of two- and three-component normal mixture models with unequal covariances through simulations further corroborates our findings. Finally, we illustrate the use of the proposed classifier with a missing-data mechanism on interneuronal and skin lesion datasets.
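For intuition, the fully supervised plug-in Bayes rule in the two-class homoscedastic normal model reduces to a linear discriminant. The sketch below is that supervised baseline only, not the missing-data-mechanism classifier studied in the paper; the simulated class means and sample sizes are arbitrary:

```python
import numpy as np

def fit_lda(X, y):
    # Plug-in Bayes rule for two homoscedastic Gaussian classes: allocate to
    # class 1 when the linear discriminant exceeds zero (priors estimated by
    # class proportions).
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled covariance estimate from the two class scatter matrices
    S = (np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)) / (len(X) - 2)
    w = np.linalg.solve(S, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1) + np.log(len(X1) / len(X0))
    return lambda Xnew: (Xnew @ w + b > 0).astype(int)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1, (200, 2)), rng.normal([3, 3], 1, (200, 2))])
y = np.repeat([0, 1], 200)
predict = fit_lda(X, y)
accuracy = (predict(X) == y).mean()
```

The paper's point is that, with partially missing labels and an explicit missingness mechanism, a semi-supervised estimator of this rule can beat the fully supervised version above in the regimes described.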
Title: Analysis of estimating the Bayes rule for Gaussian mixture models with a specified missing-data mechanism (Computational Statistics)