Mixture of linear mixed models for clustering gene expression profiles from repeated microarray experiments
Pub Date : 2005-10-01 | DOI: 10.1191/1471082X05st096oa
G. Celeux, O. Martin, C. Lavergne
Data variability can be important in microarray data analysis. Thus, when clustering gene expression profiles, it can be judicious to make use of repeated data. In this paper, the problem of analysing repeated data in the model-based cluster analysis context is considered. Linear mixed models are chosen to take data variability into account, and mixtures of these models are considered. This leads to a large range of possible models, depending on the assumptions made on both the covariance structure of the observations and the mixture model. Maximum likelihood estimation of this family of models through the EM algorithm is presented. The problem of selecting a particular mixture of linear mixed models is addressed using penalized likelihood criteria. Illustrative Monte Carlo experiments are presented and an application to the clustering of gene expression profiles is detailed. These experiments highlight the value of mixtures of linear mixed models for taking data variability into account in a cluster analysis context.
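Schematically (notation introduced here for illustration only, not taken from the paper), the marginal density of the repeated expression profile y_i of gene i under a K-component mixture of linear mixed models, and a penalized likelihood criterion of the kind used for selection, take the form

    f(y_i) = \sum_{k=1}^{K} \pi_k \, \phi\bigl( y_i ;\, X_i \beta_k ,\; Z_i D_k Z_i^{\top} + \sigma_k^{2} I \bigr),
    \qquad
    \mathrm{BIC} = -2 \log \hat{L} + \nu \log N,

where \phi denotes the multivariate normal density, the structures allowed for D_k and \sigma_k^2 generate the family of candidate covariance models, and \nu counts the free parameters.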
{"title":"Mixture of linear mixed models for clustering gene expression profiles from repeated microarray experiments","authors":"G. Celeux, O. Martin, C. Lavergne","doi":"10.1191/1471082X05st096oa","DOIUrl":"https://doi.org/10.1191/1471082X05st096oa","url":null,"abstract":"Data variability can be important in microarray data analysis. Thus, when clustering gene expression profiles, it could be judicious to make use of repeated data. In this paper, the problem of analysing repeated data in the model-based cluster analysis context is considered. Linear mixed models are chosen to take into account data variability and mixture of these models are considered. This leads to a large range of possible models depending on the assumptions made on both the covariance structure of the observations and the mixture model. The maximum likelihood estimation of this family of models through the EM algorithm is presented. The problem of selecting a particular mixture of linear mixed models is considered using penalized likelihood criteria. Illustrative Monte Carlo experiments are presented and an application to the clustering of gene expression profiles is detailed. All those experiments highlight the interest of linear mixed model mixtures to take into account data variability in a cluster analysis context.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"152 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132911325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A latent variable scorecard for neonatal baby frailty
Pub Date : 2005-07-01 | DOI: 10.1191/1471082X05st093oa
J. Bowden, J. Whittaker
A latent variable frailty model is built for data from a neonatal study conducted to investigate whether a particular hospital service given to families with premature babies has a positive effect on their care requirements within the first year of life. The latent frailty term, predicted from information obtained from the family in advance of the birth, furnishes an overall measure of the baby's quality of health and thereby identifies families at risk. Maximum likelihood and Bayesian approaches are used to estimate the effects of the variables on the latent baby frailty and to predict health complications. It is found that these give much the same estimates of the regression coefficients, but that the variance components are more difficult to estimate. We indicate how the findings from the model may be presented as a scorecard for predicting frailty, and so be useful to doctors working in hospital neonatal units. New information about a baby is automatically combined with the current score to provide an up-to-date score, so that rapid decisions about appropriate action become more feasible. A diagnostic procedure is proposed to assess how well the independence assumptions of the model are met in fitting the model to these data. It is concluded that the frailty model provides an informative summary of the data from this neonatal study.
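The automatic score updating has the flavour of a precision-weighted Bayesian combination. Purely as an illustration (a generic normal-normal update, not necessarily the authors' exact rule): if the current frailty score is distributed N(m, v) and a new measurement x arrives with sampling variance s^2, the updated score and its variance are

    m' = \frac{m/v + x/s^{2}}{1/v + 1/s^{2}},
    \qquad
    v' = \left( \frac{1}{v} + \frac{1}{s^{2}} \right)^{-1}.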
{"title":"A latent variable scorecard for neonatal baby frailty","authors":"J. Bowden, J. Whittaker","doi":"10.1191/1471082X05st093oa","DOIUrl":"https://doi.org/10.1191/1471082X05st093oa","url":null,"abstract":"A latent variable frailty model is built for data coming from a neonatal study conducted to investigate whether the presence of a particular hospital service given to families with premature babies has a positive effect on their care requirements within the first year of life. The predicted value of the latent frailty term from information obtained from the family in advance of the birth furnishes an overall measure of the quality of health of the baby. This identifies families at risk. Maximum likelihood and Bayesian approaches are used to estimate the effect of the variables on the value of the latent baby frailty and for prediction of health complications. It is found that these give much the same estimates of regression coefficients, but that the variance components are the more difficult to estimate. We indicate how the findings from the model may be presented as a scorecard for predicting frailty, and so be useful to doctors working in hospital neonatal units. New information about a baby is automatically combined with the current score to provide an up-to-date score, so that rapid decisions for taking appropriate action are made more possible. A diagnostic procedure is proposed to assess how well the independence assumptions of the model are met in fitting to this data. It is concluded that the frailty model provides an informative summary of the data from this neonatal study.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130692825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Measuring customer quality in retail banking
Pub Date : 2005-07-01 | DOI: 10.1191/1471082X05st092oa
D. Hand, M. Crowder
The retail banking sector makes heavy use of statistical models to predict various aspects of customer behaviour. These models are built using data from earlier customers, but have several weaknesses. An alternative approach, widely used in social measurement, but apparently not yet applied in the retail banking sector, is to use latent-variable techniques to measure the underlying key aspect of customer behaviour. This paper describes such a model that separates the observed variables for a customer into primary characteristics on the one hand, and indicators of previous behaviour on the other, and links the two via a latent variable that we identify as ‘customer quality’. We describe how to estimate the conditional distribution of customer quality, given the observed values of primary characteristics and past behaviour.
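To make the final sentence concrete in a generic linear-Gaussian formulation (the notation is mine, not the paper's): if customer quality q has a standard normal prior and the behavioural indicators satisfy w = \lambda q + \varepsilon with \varepsilon \sim N(0, \Psi) independent of q, then

    q \mid w \;\sim\; N\!\left( \frac{\lambda^{\top} \Psi^{-1} w}{1 + \lambda^{\top} \Psi^{-1} \lambda},\; \frac{1}{1 + \lambda^{\top} \Psi^{-1} \lambda} \right),

with the primary characteristics entering, for example, through the prior mean of q.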
{"title":"Measuring customer quality in retail banking","authors":"D. Hand, M. Crowder","doi":"10.1191/1471082X05st092oa","DOIUrl":"https://doi.org/10.1191/1471082X05st092oa","url":null,"abstract":"The retail banking sector makes heavy use of statistical models to predict various aspects of customer behaviour. These models are built using data from earlier customers, but have several weaknesses. An alternative approach, widely used in social measurement, but apparently not yet applied in the retail banking sector, is to use latent-variable techniques to measure the underlying key aspect of customer behaviour. This paper describes such a model that separates the observed variables for a customer into primary characteristics on the one hand, and indicators of previous behaviour on the other, and links the two via a latent variable that we identify as ‘customer quality’. We describe how to estimate the conditional distribution of customer quality, given the observed values of primary characteristics and past behaviour.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133761420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The role of perturbation in compositional data analysis
Pub Date : 2005-07-01 | DOI: 10.1191/1471082X05st091oa
J. Aitchison, K. Ng
In standard multivariate statistical analysis, common hypotheses of interest concern changes in mean vectors and subvectors. In compositional data analysis it is now well established that compositional change is most readily described in terms of the simplicial operation of perturbation and that subcompositions replace the marginal concept of subvectors. Against the background of two motivating experimental studies in the food industry, involving the compositions of cow’s milk and chicken carcasses, this paper emphasizes the importance of recognizing this fundamental operation of change in the associated simplex sample space. Well-defined hypotheses about the nature of any compositional effect can be expressed, for example, in terms of perturbation values and subcompositional stability, and testing procedures can be developed for them. These procedures are applied to lattices of such hypotheses in the two practical situations. In the jargon of standard multivariate analysis, we identify the two problems as the counterparts of the analysis of paired comparison or split-plot experiments and of separate-sample comparative experiments.
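As a concrete illustration of the perturbation operation the paper is built around, here is a minimal sketch using the standard compositional-data definitions (the function names are mine):

    import numpy as np

    def closure(x):
        # Rescale a vector of positive parts so that they sum to one (the simplex constraint).
        x = np.asarray(x, dtype=float)
        return x / x.sum()

    def perturb(x, p):
        # Perturbation: component-wise multiplication of two compositions followed by closure.
        return closure(np.asarray(x, dtype=float) * np.asarray(p, dtype=float))

    # Example: a three-part composition perturbed by a vector of relative changes.
    x = closure([0.2, 0.3, 0.5])
    p = [1.1, 0.9, 1.0]
    print(perturb(x, p))   # still a composition: non-negative parts summing to one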
{"title":"The role of perturbation in compositional data analysis","authors":"J. Aitchison, K. Ng","doi":"10.1191/1471082X05st091oa","DOIUrl":"https://doi.org/10.1191/1471082X05st091oa","url":null,"abstract":"In standard multivariate statistical analysis, common hypotheses of interest concern changes in mean vectors and subvectors. In compositional data analysis it is now well established that compositional change is most readily described in terms of the simplicial operation of perturbation and that subcompositions replace the marginal concept of subvectors. Against the background of two motivating experimental studies in the food industry, involving the compositions of cow’s milk and chicken carcasses, this paper emphasizes the importance of recognizing this fundamental operation of change in the associated simplex sample space. Well-defined hypotheses about the nature of any compositional effect can be expressed, for example, in terms of perturbation values and subcompositional stability and testing procedures developed. These procedures are applied to lattices of such hypotheses in the two practical situations. We identify the two problems as being the counterpart of the analysis of paired comparison or split plot experiments and of separate sample comparative experiments in the jargon of standard multivariate analysis.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131008130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graphical chain models for the analysis of complex genetic diseases: an application to hypertension
Pub Date : 2005-07-01 | DOI: 10.1191/1471082X05st088oa
C. Serio, Paola Vicard
A crucial task in modern genetic medicine is the understanding of complex genetic diseases. The main complicating features are that a combination of genetic and environmental risk factors is involved, and that the phenotype of interest may be complex. Traditional statistical techniques based on lod-scores fail when the disease is no longer monogenic and the underlying disease transmission model is not defined. Different kinds of association tests have proved to be an appropriate and powerful statistical tool for detecting a ‘candidate gene’ for a complex disorder. However, statistical techniques able to investigate direct and indirect influences among phenotypes, genotypes and environmental risk factors are required to analyse the association structure of complex diseases. In this paper, we propose graphical models as a natural tool for analysing the multifactorial structure of complex genetic diseases. An application of this approach to a primary hypertension data set is illustrated.
{"title":"Graphical chain models for the analysis of complex genetic diseases: an application to hypertension","authors":"C. Serio, Paola Vicard","doi":"10.1191/1471082X05st088oa","DOIUrl":"https://doi.org/10.1191/1471082X05st088oa","url":null,"abstract":"A crucial task in modern genetic medicine is the understanding of complex genetic diseases. The main complicating features are that a combination of genetic and environmental risk factors is involved, and the phenotype of interest may be complex. Traditional statistical techniques based on lod-scores fail when the disease is no longer monogenic and the underlying disease transmission model is not defined. Different kinds of association tests have been proved to be an appropriate and powerful statistical tool to detect a ‘candidate gene’ for a complex disorder. However, statistical techniques able to investigate direct and indirect influences among phenotypes, genotypes and environmental risk factors, are required to analyse the association structure of complex diseases. In this paper, we propose graphical models as a natural tool to analyse the multifactorial structure of complex genetic diseases. An application of this model to primary hypertension data set is illustrated.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125312573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The practical utility of incorporating model selection uncertainty into prognostic models for survival data
Pub Date : 2005-07-01 | DOI: 10.1191/1471082X05st089oa
N. Augustin, W. Sauerbrei, M. Schumacher
Predictions of disease outcome in prognostic factor models are usually based on one selected model. However, often several models fit the data equally well, but these models might differ substantially in terms of included explanatory variables and might lead to different predictions for individual patients. For survival data, we discuss two approaches to account for model selection uncertainty in two data examples, with the main emphasis on variable selection in a proportional hazards Cox model. The main aim of our investigation is to establish the ways in which either of the two approaches is useful in such prognostic models. The first approach is Bayesian model averaging (BMA) adapted for the proportional hazards model, termed ‘approx. BMA’ here. As a new approach, we propose a method which averages over a set of possible models using weights estimated from bootstrap resampling, as proposed by Buckland et al.; in addition, we perform an initial screening of variables based on the inclusion frequency of each variable to reduce the set of variables and corresponding models. Sensible choices for some of the necessary parameters of the procedure still require investigation. The main objective of prognostic models is prediction, but the interpretation of single effects is also important and models should be general enough to ensure transportability to other clinical centres. In the data examples, we compare predictions of our new approach with approx. BMA, with ‘conventional’ predictions from one selected model and with predictions from the full model. Confidence intervals are compared in one example. Comparisons are based on the partial predictive score and the Brier score. We conclude that the two model averaging methods yield similar results and are especially useful when there is a large number of potential prognostic factors, many of which are likely to have no influence in a multivariable context. Although the method based on bootstrap resampling lacks formal justification and requires some ad hoc decisions, it has the additional positive effect of achieving model parsimony by reducing the number of explanatory variables and dealing with correlated variables in an automatic fashion.
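The bootstrap-based part of the procedure can be sketched as follows; this is a rough, illustrative stand-in only (the Cox fits with variable selection, the Buckland et al. weighting details and the choice of screening threshold are simplified or left out):

    from collections import Counter

    def screen_and_weight(selected_sets, threshold=0.3):
        # selected_sets: one frozenset of selected variable names per bootstrap fit,
        # produced by whatever Cox variable-selection routine is used.
        B = len(selected_sets)
        inclusion = Counter(v for s in selected_sets for v in s)
        keep = {v for v, c in inclusion.items() if c / B >= threshold}   # screening step
        # Weight each distinct (screened) model by how often it was selected.
        model_counts = Counter(frozenset(s & keep) for s in selected_sets)
        return ({v: c / B for v, c in inclusion.items()},
                {m: c / B for m, c in model_counts.items()})

    # Toy usage with three bootstrap replicates:
    inclusion, weights = screen_and_weight(
        [frozenset({"age", "stage"}), frozenset({"age"}), frozenset({"age", "grade"})],
        threshold=0.5)
    print(inclusion)   # per-variable inclusion frequencies
    print(weights)     # weights attached to the screened candidate models

Averaged predictions would then combine each screened model's predicted risk or survival using these weights.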
{"title":"The practical utility of incorporating model selection uncertainty into prognostic models for survival data","authors":"N. Augustin, W. Sauerbrei, M. Schumacher","doi":"10.1191/1471082X05st089oa","DOIUrl":"https://doi.org/10.1191/1471082X05st089oa","url":null,"abstract":"Predictions of disease outcome in prognostic factor models are usually based on one selected model. However, often several models fit the data equally well, but these models might differ substantially in terms of included explanatory variables and might lead to different predictions for individual patients. For survival data, we discuss two approaches to account for model selection uncertainty in two data examples, with the main emphasis on variable selection in a proportional hazard Cox model. The main aim of our investigation is to establish the ways in which either of the two approaches is useful in such prognostic models. The first approach is Bayesian model averaging (BMA) adapted for the proportional hazard model, termed ‘approx. BMA’ here. As a new approach, we propose a method which averages over a set of possible models using weights estimated from bootstrap resampling as proposed by Buckland et al., but in addition, we perform an initial screening of variables based on the inclusion frequency of each variable to reduce the set of variables and corresponding models. For some necessary parameters of the procedure, investigations concerning sensible choices are still required. The main objective of prognostic models is prediction, but the interpretation of single effects is also important and models should be general enough to ensure transportability to other clinical centres. In the data examples, we compare predictions of our new approach with approx. BMA, with ‘conventional’ predictions from one selected model and with predictions from the full model. Confidence intervals are compared in one example. Comparisons are based on the partial predictive score and the Brier score. We conclude that the two model averaging methods yield similar results and are especially useful when there is a high number of potential prognostic factors, most likely some of them without influence in a multivariable context. Although the method based on bootstrap resampling lacks formal justification and requires some ad hoc decisions, it has the additional positive effect of achieving model parsimony by reducing the number of explanatory variables and dealing with correlated variables in an automatic fashion.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116844288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient models for correlated data via convolutions of intrinsic processes
Pub Date : 2005-04-01 | DOI: 10.1191/1471082X05st085oa
Herbert K. H. Lee, D. Higdon, Catherine A. Calder, C. Holloman
Gaussian processes (GPs) have proven to be useful and versatile stochastic models in a wide variety of applications including computer experiments, environmental monitoring, hydrology and climate modeling. A GP model is determined by its mean and covariance functions. In most cases, the mean is specified to be a constant, or some other simple linear function, whereas the covariance function is governed by a few parameters. A Bayesian formulation is attractive as it allows for formal incorporation of uncertainty regarding the parameters governing the GP. However, estimation of these parameters can be problematic. Large datasets, posterior correlation and inverse problems can all lead to difficulties in exploring the posterior distribution. Here, we propose an alternative model which is quite tractable computationally, even with large datasets or indirectly observed data, while still maintaining the flexibility and adaptiveness of traditional GP models. This model is based on convolving simple Markov random fields with a smoothing kernel. We consider applications in hydrology and aircraft prototype testing.
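A toy one-dimensional illustration of the construction, assuming a Gaussian smoothing kernel and a random-walk draw standing in for the intrinsic Markov random field (illustrative only, not the authors' code):

    import numpy as np

    rng = np.random.default_rng(0)

    # Latent values on a coarse grid; a random-walk draw stands in for an
    # intrinsic Markov random field on the grid.
    grid = np.linspace(0.0, 10.0, 30)
    latent = np.cumsum(rng.normal(scale=0.5, size=grid.size))

    def convolved_process(x, grid, latent, bandwidth=1.0):
        # Process value at locations x: kernel-weighted sum of the latent grid values.
        w = np.exp(-0.5 * ((x[:, None] - grid[None, :]) / bandwidth) ** 2)
        return (w * latent).sum(axis=1)

    x = np.linspace(0.0, 10.0, 200)
    y = convolved_process(x, grid, latent)   # a smooth realization driven by only 30 latent values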
{"title":"Efficient models for correlated data via convolutions of intrinsic processes","authors":"Herbert K. H. Lee, D. Higdon, Catherine A. Calder, C. Holloman","doi":"10.1191/1471082X05st085oa","DOIUrl":"https://doi.org/10.1191/1471082X05st085oa","url":null,"abstract":"Gaussian processes (GP) have proven to be useful and versatile stochastic models in a wide variety of applications including computer experiments, environmental monitoring, hydrology and climate modeling. A GP model is determined by its mean and covariance functions. In most cases, the mean is specified to be a constant, or some other simple linear function, whereas the covariance function is governed by a few parameters. A Bayesian formulation is attractive as it allows for formal incorporation of uncertainty regarding the parameters governing the GP. However, estimation of these parameters can be problematic. Large datasets, posterior correlation and inverse problems can all lead to difficulties in exploring the posterior distribution. Here, we propose an alternative model which is quite tractable computationally - even with large datasets or indirectly observed data - while still maintaining the flexibility and adaptiveness of traditional GP models. This model is based on convolving simple Markov random fields with a smoothing kernel. We consider applications in hydrology and aircraft prototype testing.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"258 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120896224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two-component mixtures of generalized linear mixed effects models for cluster correlated data
Pub Date : 2005-04-01 | DOI: 10.1191/1471082X05st090oa
D. Hall, Lihua Wang
Finite mixtures of generalized linear mixed effects models are presented to handle situations where within-cluster correlation and heterogeneity (subpopulations) exist simultaneously. For this class of model, we consider maximum likelihood (ML) as our main approach to estimation. Owing to the complexity of the marginal log-likelihood of this model, the EM algorithm is employed to facilitate computation. The major obstacle in this procedure is to integrate over the random effects’ distribution to evaluate the expectation in the E step. When assuming normally distributed random effects, we consider adaptive Gaussian quadrature to perform this integration numerically. We also discuss nonparametric ML estimation under a relaxation of the normality assumption on the random effects. Two real data sets are analysed to compare our proposed model with other existing models and illustrate our estimation methods.
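The integration over a normal random effect can be approximated by Gauss-Hermite quadrature. The sketch below shows the ordinary (non-adaptive) rule for one cluster of Poisson counts with a random intercept; it conveys the idea even though the paper uses the adaptive variant within a two-component mixture:

    import numpy as np
    from numpy.polynomial.hermite import hermgauss
    from scipy.stats import poisson

    def cluster_marginal_loglik(y, eta_fixed, sigma, n_nodes=20):
        # Log marginal likelihood of one cluster's counts y under a Poisson GLMM with
        # fixed-effect linear predictor eta_fixed and a N(0, sigma^2) random intercept,
        # the random effect being integrated out by Gauss-Hermite quadrature.
        nodes, weights = hermgauss(n_nodes)          # rule for integrals against exp(-t^2)
        b = np.sqrt(2.0) * sigma * nodes             # change of variable b = sqrt(2)*sigma*t
        lik = np.array([poisson.pmf(y, np.exp(eta_fixed + bj)).prod() for bj in b])
        return np.log((weights * lik).sum() / np.sqrt(np.pi))

    # Toy usage: four repeated counts in one cluster, intercept-only fixed effect.
    y = np.array([2, 3, 1, 4])
    print(cluster_marginal_loglik(y, eta_fixed=np.full(4, 0.8), sigma=0.5))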
{"title":"Two-component mixtures of generalized linear mixed effects models for cluster correlated data","authors":"D. Hall, Lihua Wang","doi":"10.1191/1471082X05st090oa","DOIUrl":"https://doi.org/10.1191/1471082X05st090oa","url":null,"abstract":"Finite mixtures of generalized linear mixed effect models are presented to handle situations where within-cluster correlation and heterogeneity (subpopulations) exist simultaneously. For this class of model, we consider maximum likelihood (ML) as our main approach to estimation. Owing to the complexity of the marginal loglikelihood of this model, the EM algorithm is employed to facilitate computation. The major obstacle in this procedure is to integrate over the random effects’ distribution to evaluate the expectation in the E step. When assuming normally distributed random effects, we consider adaptive Gaussian quadrature to perform this integration numerically. We also discuss nonparametric ML estimation under a relaxation of the normality assumption on the random effects. Two real data sets are analysed to compare our proposed model with other existing models and illustrate our estimation methods.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131756557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random effect models for repeated measures of zero-inflated count data
Pub Date : 2005-04-01 | DOI: 10.1191/1471082X05st084oa
Yongyi Min, A. Agresti
For count responses, the situation of excess zeros (relative to what standard models allow) often occurs in biomedical and sociological applications. Modeling repeated measures of zero-inflated count data presents special challenges: in addition to the problem of extra zeros, the correlation between measurements on the same subject on different occasions needs to be taken into account. This article discusses random effect models for repeated measurements on this type of response variable. A useful model is the hurdle model with random effects, which separately handles the zero observations and the positive counts. In maximum likelihood model fitting, we consider both a normal distribution and a nonparametric approach for the random effects. A special case of the hurdle model can be used to test for zero inflation. Random effects can also be introduced in a zero-inflated Poisson or negative binomial model, but such a model may encounter fitting problems if there is zero deflation at any settings of the explanatory variables. A simple alternative approach adapts the cumulative logit model with random effects, which has a single set of parameters for describing effects. We illustrate the proposed methods with examples.
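As a minimal illustration of the hurdle structure, here is the log-likelihood of a Poisson hurdle model with a single hurdle probability and rate, without the covariates and random effects that the paper adds (a sketch, with names of my choosing):

    import numpy as np
    from scipy.stats import poisson

    def hurdle_poisson_loglik(y, p_pos, lam):
        # With probability 1 - p_pos the count is zero; with probability p_pos it is
        # drawn from a zero-truncated Poisson(lam). In the paper's setting, p_pos and
        # lam would depend on covariates and subject-specific random effects.
        y = np.asarray(y)
        zero = (y == 0)
        ll_zero = zero.sum() * np.log(1.0 - p_pos)
        ll_pos = (np.log(p_pos)
                  + poisson.logpmf(y[~zero], lam)
                  - np.log1p(-np.exp(-lam))).sum()   # zero-truncation of the Poisson
        return ll_zero + ll_pos

    print(hurdle_poisson_loglik([0, 0, 1, 3, 0, 2], p_pos=0.5, lam=1.7))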
{"title":"Random effect models for repeated measures of zero-inflated count data","authors":"Yongyi Min, A. Agresti","doi":"10.1191/1471082X05st084oa","DOIUrl":"https://doi.org/10.1191/1471082X05st084oa","url":null,"abstract":"For count responses, the situation of excess zeros (relative to what standard models allow) often occurs in biomedical and sociological applications. Modeling repeated measures of zero-inflated count data presents special challenges. This is because in addition to the problem of extra zeros, the correlation between measurements upon the same subject at different occasions needs to be taken into account. This article discusses random effect models for repeated measurements on this type of response variable. A useful model is the hurdle model with random effects, which separately handles the zero observations and the positive counts. In maximum likelihood model fitting, we consider both a normal distribution and a nonparametric approach for the random effects. A special case of the hurdle model can be used to test for zero inflation. Random effects can also be introduced in a zero-inflated Poisson or negative binomial model, but such a model may encounter fitting problems if there is zero deflation at any settings of the explanatory variables. A simple alternative approach adapts the cumulative logit model with random effects, which has a single set of parameters for describing effects. We illustrate the proposed methods with examples.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131889444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software reliability modelling and prediction with hidden Markov chains
Pub Date : 2005-04-01 | DOI: 10.1191/1471082X05st087oa
Jean-Baptiste Durand, O. Gaudoin
The purpose of this paper is to use the framework of hidden Markov chains (HMCs) for the modelling of the failure and debugging process of software, and for the prediction of software reliability. The model parameters are estimated using the forward-backward expectation-maximization algorithm, and model selection is carried out with the Bayesian information criterion. The advantages and drawbacks of this approach, relative to the usual models, are analysed. Comparisons are also made on real software failure data. The main contribution of HMC modelling is that it highlights the existence of homogeneous periods in the debugging process, which allow one to identify major corrections or version updates. In terms of reliability predictions, the HMC model performs well, on average, compared with the usual models, especially when the reliability is not growing regularly.
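For concreteness, the forward recursion at the heart of such estimation can be sketched as follows, here with Poisson emissions standing in for per-period failure counts (an illustrative assumption; the paper's observation model may differ):

    import numpy as np
    from scipy.stats import poisson

    def forward_loglik(y, pi0, A, lambdas):
        # Log-likelihood of the count sequence y under a hidden Markov chain with
        # initial distribution pi0, transition matrix A and one Poisson emission
        # rate per hidden state, computed by the scaled forward recursion.
        y = np.asarray(y)
        emis = poisson.pmf(y[:, None], lambdas[None, :])   # T x K emission probabilities
        alpha = pi0 * emis[0]
        loglik = np.log(alpha.sum())
        alpha = alpha / alpha.sum()
        for t in range(1, len(y)):
            alpha = (alpha @ A) * emis[t]
            loglik += np.log(alpha.sum())
            alpha = alpha / alpha.sum()
        return loglik

    # Toy usage: two hidden states, e.g. 'unstable' and 'stabilized' debugging periods.
    A = np.array([[0.9, 0.1], [0.05, 0.95]])
    ll = forward_loglik([5, 4, 6, 1, 0, 2, 0, 1],
                        pi0=np.array([0.5, 0.5]), A=A,
                        lambdas=np.array([5.0, 1.0]))
    print(ll)   # candidate numbers of states could then be compared via BIC = -2*ll + k*log(T)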
{"title":"Software reliability modelling and prediction with hidden Markov chains","authors":"Jean-Baptiste Durand, O. Gaudoin","doi":"10.1191/1471082X05st087oa","DOIUrl":"https://doi.org/10.1191/1471082X05st087oa","url":null,"abstract":"The purpose of this paper is to use the framework of hidden Markov chains (HMCs) for the modelling of the failure and debugging process of software, and the prediction of software reliability. The model parameters are estimated using the forward-backward expectation maximization algorithm, and model selection is done with the Bayesian information criterion. The advantages and drawbacks of this approach, with respect to usual modelling, are analysed. Comparison is also done on real software failure data. The main contribution of HMC modelling is that it highlights the existence of homogeneous periods in the debugging process, which allow one to identify major corrections or version updates. In terms of reliability predictions, the HMC model performs well, on average, with respect to usual models, especially when the reliability is not regularly growing.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129340136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}