Estimation in generalised linear mixed models with binary outcomes by simulated maximum likelihood
Pub Date: 2006-04-01 | DOI: 10.1191/1471082X06st106oa
E. S. Ng, J. Carpenter, H. Goldstein, J. Rasbash
Fitting multilevel models to discrete outcome data is problematic because the discrete distribution of the response variable implies an analytically intractable log-likelihood function. Among a number of approximate methods proposed, second-order penalised quasi-likelihood (PQL) is commonly used and is one of the most accurate. Unfortunately, even the second-order PQL approximation has been shown to produce estimates biased toward zero in certain circumstances. This bias can be marked, especially when the data are sparse. One option to reduce this bias is to use Monte Carlo simulation. A bootstrap bias correction method proposed by Kuk has been implemented in MLwiN. However, a similar technique based on the Robbins-Monro (RM) algorithm is potentially more efficient. An alternative is to use simulated maximum likelihood (SML), either alone or to refine estimates identified by other methods. In this article, we first compare bias correction using the RM algorithm, Kuk's method and SML. We find that SML performs as efficiently as the other two methods and also yields standard errors of the bias-corrected parameter estimates and an estimate of the log-likelihood at the maximum, with which nested models can be compared. Second, using simulated and real data examples, we compare SML, second-order Laplace approximation (as implemented in HLM), Markov chain Monte Carlo (MCMC) (in MLwiN) and numerical integration using adaptive quadrature methods (in Stata's GLLAMM and in SAS's proc NLMIXED). We find that when the data are sparse, the second-order Laplace approximation produces markedly lower parameter estimates, whereas the MCMC method produces estimates that are noticeably higher than those from the SML and quadrature methods. Although proc NLMIXED is much faster than GLLAMM, it is not designed to fit models of more than two levels. SML produces parameter estimates and log-likelihoods very similar to those from quadrature methods. Further, our SML approach extends to handle other link functions, discrete data distributions, non-normal random effects and higher-level models.
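For intuition, a generic simulated maximum likelihood estimator for a two-level binary-response model replaces each cluster's intractable likelihood contribution with an importance-sampling average. The sketch below uses standard notation and is not necessarily the authors' exact implementation:

\[
\hat{L}_j(\theta) \;=\; \frac{1}{K}\sum_{k=1}^{K} \frac{\prod_{i} f\!\left(y_{ij}\mid u_j^{(k)};\theta\right)\,\phi\!\left(u_j^{(k)};0,\sigma_u^2\right)}{g\!\left(u_j^{(k)}\right)},
\qquad u_j^{(1)},\dots,u_j^{(K)} \sim g,
\]

where $f$ is the Bernoulli likelihood for cluster $j$ under the chosen link, $\phi$ is the random-effect density and $g$ is an importance density (often a normal centred near the cluster's posterior mode). Maximising the simulated log-likelihood $\sum_j \log \hat{L}_j(\theta)$ over $\theta$ yields both standard errors and a log-likelihood value for comparing nested models, as described above.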
{"title":"Estimation in generalised linear mixed models with binary outcomes by simulated maximum likelihood","authors":"E. S. Ng, J. Carpenter, H. Goldstein, J. Rasbash","doi":"10.1191/1471082X06st106oa","DOIUrl":"https://doi.org/10.1191/1471082X06st106oa","url":null,"abstract":"Fitting multilevel models to discrete outcome data is problematic because the discrete distribution of the response variable implies an analytically intractable log-likelihood function. Among a number of approximate methods proposed, second-order penalised quasi-likelihood (PQL) is commonly used and is one of the most accurate. Unfortunately, even the second-order PQL approximation has been shown to produce estimates biased toward zero in certain circumstances. This bias can be marked especially when the data are sparse. One option to reduce this bias is to use Monte-Carlo simulation. A bootstrap bias correction method proposed by Kuk has been implemented in MLwiN. However, a similar technique based on the Robbins-Monro (RM) algorithm is potentially more efficient. An alternative is to use simulated maximum likelihood (SML), either alone or to refine estimates identified by other methods. In this article, we first compare bias correction using the RM algorithm, Kuk’s method and SML. We find that SML performs as efficiently as the other two methods and also yields standard errors of the bias-corrected parameter estimates and an estimate of the log-likelihood at the maximum, with which nested models can be compared. Secondly, using simulated and real data examples, we compare SML, second-order Laplace approximation (as implemented in HLM), Markov Chain Monte-Carlo (MCMC) (in MLwiN) and numerical integration using adaptive quadrature methods (in Stata’s GLLAMM and in SAS’s proc NLMIXED). We find that when the data are sparse, the second-order Laplace approximation produces markedly lower parameter estimates, whereas the MCMC method produces estimates that are noticeably higher than those from the SML and quadrature methods. Although proc NLMIXED is much faster than GLLAMM, it is not designed to fit models of more than two levels. SML produces parameter estimates and log-likelihoods very similar to those from quadrature methods. Further our SML approach extends to handle other link functions, discrete data distributions, non-normal random effects and higher-level models.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128747363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modelling of repeated ordered measurements by isotonic sequential regression
Pub Date: 2005-12-01 | DOI: 10.1191/1471082X05st101oa
G. Tutz
This article introduces a simple model for repeated observations of an ordered categorical response variable that is isotonic over time. It is assumed that the measurements represent an irreversible process, such that the response at time t is never lower than the response observed at the previous time point t − 1. Observations of this type occur, for example, in treatment studies where improvement is measured on an ordinal scale. Because the response at time t depends on the previous outcome, the number of available ordered response categories also depends on the previous outcome, which leads to severe problems when simple threshold models for ordered data are used. To avoid these problems, the isotonic sequential model is introduced. It accounts for the irreversible process by considering the binary transitions to higher scores and allows a parsimonious parameterization. It is shown how the model may easily be estimated using existing software. Moreover, the model is extended to a random effects version which explicitly takes heterogeneity of individuals and potential correlations into account.
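As a sketch of the sequential form described above (generic continuation-ratio notation; the paper's exact parameterization may differ): given the previous outcome $Y_{t-1} = s$, the move to higher categories can be decomposed into conditional binary steps

\[
P\!\left(Y_t > r \mid Y_t \ge r,\; Y_{t-1} = s\right) \;=\; F\!\left(\theta_{r} + x_{t}^{\top}\beta\right), \qquad r = s, s+1, \dots,
\]

where $F$ is a binary response distribution function such as the logistic. Each step is an ordinary binary regression, which is why the model can be fitted with existing software for binary data.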
{"title":"Modelling of repeated ordered measurements by isotonic sequential regression","authors":"G. Tutz","doi":"10.1191/1471082X05st101oa","DOIUrl":"https://doi.org/10.1191/1471082X05st101oa","url":null,"abstract":"This article introduces a simple model for repeated observations of an ordered categorical response variable which is isotonic over time. It is assumed that the measurements represent an irreversible process such that the response at time t is never lower than the response observed at the previous time point t − 1. Observations of this type occur, for example, in treatment studies when improvement is measured on an ordinal scale. As the response at time t depends on the previous outcome, the number of ordered response categories depends on the previous outcome leading to severe problems when simple threshold models for ordered data are used. To avoid these problems, the isotonic sequential model is introduced. It accounts for the irreversible process by considering the binary transitions to higher scores and allows a parsimonious parameterization. It is shown how the model may easily be estimated using existing software. Moreover, the model is extended to a random effects version which explicitly takes heterogeneity of individuals and potential correlations into account.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129030605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mining epidemiological time series: an approach based on dynamic regression
Pub Date: 2005-12-01 | DOI: 10.1191/1471082X05st103oa
M. Chiogna, C. Gaetan
In epidemiology, time-series regression models are especially suitable for evaluating short-term effects of time-varying exposures to pollution. To summarize findings from different studies on different cities, techniques of designed meta-analysis have been employed. In this context, city-specific findings are summarized by an 'effect size' measured on a common scale; such effects are then pooled at a second level of analysis. The objective of this article is to exploit exploratory analysis of the city-specific time series. When dealing with many sources of data, that is, many cities, conventional exploratory analysis becomes almost unaffordable. Our idea is to explore the time series by fitting complete dynamic regression models, which are easier to fit than the models usually employed and allow very fast automated model selection algorithms. This analysis highlights common features across cities, which can then be used to design the meta-analysis. The proposal is illustrated by analysing data on the relationship between daily non-accidental deaths and air pollution in the 20 largest US cities.
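A complete dynamic regression model in this setting might take the generic form (an illustrative specification, not the authors' exact one)

\[
y_t \;=\; \alpha \;+\; \sum_{j=0}^{q} \beta_j x_{t-j} \;+\; \sum_{i=1}^{p} \phi_i y_{t-i} \;+\; \varepsilon_t,
\]

where $y_t$ is the daily death count (suitably transformed), $x_t$ the pollution exposure, and the lagged terms capture short-term dynamics; automated model selection then reduces largely to choosing the lag orders $p$ and $q$ for each city.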
{"title":"Mining epidemiological time series: an approach based on dynamic regression","authors":"M. Chiogna, C. Gaetan","doi":"10.1191/1471082X05st103oa","DOIUrl":"https://doi.org/10.1191/1471082X05st103oa","url":null,"abstract":"In epidemiology, time-series regression models are specially suitable for evaluating short-term effects of time-varying exposures to pollution. To summarize findings from different studies on different cities, the techniques of designed meta-analyses have been employed. In this context, city-specific findings are summarized by an ‘effect size’ measured on a common scale. Such effects are then pooled together on a second hierarchy of analysis. The objective of this article is to exploit exploratory analysis of city-specific time series. In fact, when dealing with many sources of data, that is, many cities, an exploratory analysis becomes almost unaffordable. Our idea is to explore the time series by fitting complete dynamic regression models. These models are easier to fit than models usually employed and allow implementation of very fast automated model selection algorithms. The idea is to highlight the common features across cities through this analysis, which might then be used to design the meta-analysis. The proposal is illustrated by analysing data on the relationship between daily nonaccidental deaths and air pollution in the 20 US largest cities.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128458898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Latent variable models for mixed categorical and survival responses, with an application to fertility preferences and family planning in Bangladesh
Pub Date: 2005-12-01 | DOI: 10.1191/1471082X05st100oa
I. Moustaki, F. Steele
In this article, we discuss a latent variable model with continuous latent variables for manifest variables that are a mixture of categorical and survival outcomes. Models for censored and uncensored survival data are discussed. The model allows for covariate effects both on the manifest variables (direct effects) and on the latent variable(s) (indirect effects). The methodological developments are motivated by a demographic application: an exploration of women’s fertility preferences and family planning behaviour in Bangladesh.
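In generic notation (an illustrative one-factor sketch, not the paper's exact specification), each response type can be linked to a common continuous latent variable $z_i$ with both direct and indirect covariate effects:

\[
\operatorname{logit} P\!\left(y_{ij}=1 \mid z_i\right) \;=\; \alpha_{j} + \lambda_{j} z_i + \beta_{j}^{\top} x_i,
\qquad
h_i(t \mid z_i) \;=\; h_0(t)\exp\!\left(\delta z_i + \gamma^{\top} x_i\right),
\]

with $z_i = \kappa^{\top} x_i + u_i$, so that $\beta_j$ and $\gamma$ carry the direct effects on the manifest variables and $\kappa$ the indirect effects through the latent variable; the hazard component $h_i$ is where censored and uncensored survival responses enter.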
{"title":"Latent variable models for mixed categorical and survival responses, with an application to fertility preferences and family planning in Bangladesh","authors":"I. Moustaki, F. Steele","doi":"10.1191/1471082X05st100oa","DOIUrl":"https://doi.org/10.1191/1471082X05st100oa","url":null,"abstract":"In this article, we discuss a latent variable model with continuous latent variables for manifest variables that are a mixture of categorical and survival outcomes. Models for censored and uncensored survival data are discussed. The model allows for covariate effects both on the manifest variables (direct effects) and on the latent variable(s) (indirect effects). The methodological developments are motivated by a demographic application: an exploration of women’s fertility preferences and family planning behaviour in Bangladesh.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115144227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analyzing lifetime data with long-tailed skewed distribution: the logistic-sinh family
Pub Date: 2005-12-01 | DOI: 10.1191/1471082X05st099oa
Kahadawala Cooray
A new two-parameter family of distributions is presented. It is derived to model highly negatively skewed data with extreme observations. The new family is referred to as the logistic-sinh distribution, as it is derived from the logistic distribution by appropriately replacing an exponential term with a hyperbolic sine term. The resulting family provides not only negatively skewed densities with thick tails but also a variety of monotonic density shapes. The shape-parameter space, λ > 0, is divided by the boundary λ = 1 into two regions over which the hazard function is, respectively, increasing and bathtub-shaped. Maximum likelihood parameter estimation is discussed, and approximate coverage probabilities for uncensored samples are provided. The advantages of using the new family are demonstrated and compared using well-known examples.
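To make the construction concrete, one common way this family is written (our reconstruction from the description above; the exact parameterization should be checked against the paper) uses the survival function

\[
S(t) \;=\; \frac{2}{1 + \exp\{\lambda \sinh(t/\sigma)\}}, \qquad t > 0,\ \lambda > 0,\ \sigma > 0,
\]

which satisfies $S(0) = 1$ and decays doubly exponentially in the upper tail, producing the thick-tailed, negatively skewed lifetime densities described above; the boundary $\lambda = 1$ separates the region with an increasing hazard from the region with a bathtub-shaped hazard.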
{"title":"Analyzing lifetime data with long-tailed skewed distribution: the logistic-sinh family","authors":"Kahadawala Cooray","doi":"10.1191/1471082X05st099oa","DOIUrl":"https://doi.org/10.1191/1471082X05st099oa","url":null,"abstract":"A new two-parameter family of distribution is presented. It is derived to model the highly negatively skewed data with extreme observations. The new family of distribution is referred to as the logistic-sinh distribution, as it is derived from the logistic distribution by appropriately replacing an exponential term with a hyperbolic sine term. The resulting family provides not only negatively skewed densities with thick tails but also variety of monotonic density shapes. The space of shape parameter, lambda greater than zero is divided by boundary line of lambda equals one, into two regions over which the hazard function is, respectively, increasing and bathtub shaped. The maximum likelihood parameter estimation techniques are discussed by providing approximate coverage probabilities for uncensored samples. The advantages of using the new family are demonstrated and compared by illustrating well known examples.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114800234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding past ocean circulations: a nonparametric regression case study
Pub Date: 2005-12-01 | DOI: 10.1191/1471082X05st102oa
R. Samworth, H. Poore
Oceanographers study past ocean circulations and their effect on global climate through carbon isotope records obtained from microfossils deposited on the ocean floor. An initial goal is to estimate the carbon isotope levels for the Pacific, Southern and North Atlantic Oceans over the last 23 million years and to provide confidence bands. We consider a nonparametric regression model and demonstrate how several recent developments in methodology make local linear kernel regression an attractive approach for tackling the problem. The results are used to estimate a quantity called the proportion of Northern Component Water and its effect on global climate. Several interesting and important geophysical and oceanographic conclusions are suggested by the study.
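For reference, the local linear kernel estimator of the regression function $m$ at a point $x$ solves a locally weighted least-squares problem (standard notation):

\[
\left(\hat{a},\hat{b}\right) \;=\; \arg\min_{a,b} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)\left\{Y_i - a - b\,(X_i - x)\right\}^2,
\qquad \hat{m}(x) = \hat{a},
\]

where $K$ is a kernel and $h$ a bandwidth; data-driven bandwidth selection and bias-corrected confidence bands are among the recent methodological developments the authors draw on.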
{"title":"Understanding past ocean circulations: a nonparametric regression case study","authors":"R. Samworth, H. Poore","doi":"10.1191/1471082X05st102oa","DOIUrl":"https://doi.org/10.1191/1471082X05st102oa","url":null,"abstract":"Oceanographers study past ocean circulations and their effect on global climate through carbon isotope records obtained from microfossils deposited on the ocean floor. An initial goal is to estimate the carbon isotope levels for the Pacific, Southern and North Atlantic Oceans over the last 23 million years and to provide confidence bands. We consider a nonparametric regression model and demonstrate how several recent developments in methodology make local linear kernel regression an attractive approach for tackling the problem. The results are used to estimate a quantity called the proportion of Northern Component Water and its effect on global climate. Several interesting and important geophysical and oceanographic conclusions are suggested by the study.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"62 22","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113933249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A statistical framework for the analysis of multivariate infectious disease surveillance counts
Pub Date: 2005-10-01 | DOI: 10.1191/1471082X05st098oa
L. Held, M. Höhle, Mathias W. Hofmann
A framework for the statistical analysis of counts from infectious disease surveillance databases is proposed. In its simplest form, the model can be seen as a Poisson branching process model with immigration. Extensions to include seasonal effects, time trends and overdispersion are outlined. The model is shown to provide an adequate fit and reliable one-step-ahead prediction intervals for a typical infectious disease time series. In addition, a multivariate formulation is proposed, which is well suited to capture space-time dependence caused by the spatial spread of a disease over time. An analysis of two multivariate time series is described. All analyses have been done using general optimization routines, where ML estimates and corresponding standard errors are readily available.
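In generic notation, the simplest univariate form of such a Poisson branching process with immigration is (a sketch following the abstract's description; details may differ from the paper)

\[
y_t \mid y_{t-1} \;\sim\; \operatorname{Po}\!\left(\lambda\, y_{t-1} + \nu_t\right),
\]

where $\lambda y_{t-1}$ is the epidemic (branching) component and $\nu_t$ the endemic (immigration) component, with $\log \nu_t$ carrying the seasonal effects and time trends mentioned above, and an overdispersed variant replacing the Poisson by a negative binomial.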
{"title":"A statistical framework for the analysis of multivariate infectious disease surveillance counts","authors":"L. Held, M. Höhle, Mathias W. Hofmann","doi":"10.1191/1471082X05st098oa","DOIUrl":"https://doi.org/10.1191/1471082X05st098oa","url":null,"abstract":"A framework for the statistical analysis of counts from infectious disease surveillance databases is proposed. In its simplest form, the model can be seen as a Poisson branching process model with immigration. Extensions to include seasonal effects, time trends and overdispersion are outlined. The model is shown to provide an adequate fit and reliable one-step-ahead prediction intervals for a typical infectious disease time series. In addition, a multivariate formulation is proposed, which is well suited to capture space-time dependence caused by the spatial spread of a disease over time. An analysis of two multivariate time series is described. All analyses have been done using general optimization routines, where ML estimates and corresponding standard errors are readily available.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131417747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two slice-EM algorithms for fitting generalized linear mixed models with binary response
Pub Date: 2005-10-01 | DOI: 10.1191/1471082X05st097oa
F. Vaida, X. Meng
The celebrated simplicity of the EM algorithm is somewhat lost in its common use for generalized linear mixed models (GLMMs) because of its analytically intractable E-step. A natural and typical strategy in practice is to implement the E-step via Monte Carlo by drawing the unobserved random effects from their conditional distribution as specified by the E-step. In this paper, we show that further augmenting the missing data (e.g., the random effects) used by the M-step leads to a quite attractive and general slice sampler for implementing the Monte Carlo E-step. The slice sampler scheme is straightforward to implement, and it is restricted neither to a particular choice of link function (e.g., probit) nor to a particular distribution of the random effects (e.g., normal). We apply this scheme to the standard EM algorithm as well as to an alternative EM algorithm which treats the variance-standardized random effects, rather than the random effects themselves, as missing data. The alternative EM algorithm not only has faster convergence but also leads to generalized linear model-like variance estimation, because it converts the random-effect standard deviations into linear regression parameters. Using the well-known salamander mating problem, we compare these two algorithms with each other, as well as with a variety of methods given in the literature in terms of the resulting point and interval estimates.
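To make the slice-sampling ingredient concrete, below is a minimal generic univariate slice sampler with stepping-out and shrinkage, as one might apply to each random-effect coordinate within a Monte Carlo E-step. It is a sketch of the standard algorithm, not the authors' specific augmentation scheme; the function name and defaults are ours.

    import random

    def slice_sample(log_density, x0, w=1.0, max_steps=50):
        """One slice-sampling update for an unnormalized log density.

        log_density: unnormalized log density of the target, e.g., the
            conditional density of one random effect given the data.
        x0: current value; w: initial bracket width.
        """
        # Auxiliary "height": log u = log f(x0) - Exp(1).
        log_y = log_density(x0) - random.expovariate(1.0)

        # Step out to find an interval [left, right] covering the slice.
        left = x0 - w * random.random()
        right = left + w
        for _ in range(max_steps):
            if log_density(left) < log_y:
                break
            left -= w
        for _ in range(max_steps):
            if log_density(right) < log_y:
                break
            right += w

        # Sample uniformly from the interval, shrinking on rejection.
        while True:
            x1 = left + (right - left) * random.random()
            if log_density(x1) >= log_y:
                return x1
            if x1 < x0:
                left = x1
            else:
                right = x1

    # Example: draw from a standard normal target (placeholder conditional).
    x, draws = 0.0, []
    for _ in range(1000):
        x = slice_sample(lambda t: -0.5 * t * t, x)
        draws.append(x)
    print(sum(draws) / len(draws))  # should be near 0

Because only log-density evaluations are needed, the same update works for any link function and any random-effect distribution, which is the flexibility the abstract emphasizes.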
{"title":"Two slice-EM algorithms for fitting generalized linear mixed models with binary response","authors":"F. Vaida, X. Meng","doi":"10.1191/1471082X05st097oa","DOIUrl":"https://doi.org/10.1191/1471082X05st097oa","url":null,"abstract":"The celebrated simplicity of the EM algorithm is somewhat lost in its common use for generalized linear mixed models (GLMMs) because of its analytically intractable E-step. A natural and typical strategy in practice is to implement the E-step via Monte Carlo by drawing the unobserved random effects from their conditional distribution as specified by the E-step. In this paper, we show that further augmenting the missing data (e.g., the random effects) used by the M-step leads to a quite attractive and general slice sampler for implementing the Monte Carlo E-step. The slice sampler scheme is straightforward to implement, and it is neither restricted to the particular choice of the link function (e.g., probit) nor to the distribution of the random effects (e.g., normal). We apply this scheme to the standard EM algorithm as well as to an alternative EM algorithm which treats the variance-standardized random effects, rather than the random effects themselves, as missing data. The alternative EM algorithm does not only have faster convergence, but also leads to generalized linear model-like variance estimation, because it converts the random-effect standard deviations into linear regression parameters. Using the well-known salamander mating problem, we compare these two algorithms with each other, as well as with a variety of methods given in the literature in terms of the resulting point and interval estimates.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127214479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A pairwise likelihood approach to generalized linear models with crossed random effects
Pub Date: 2005-10-01 | DOI: 10.1191/1471082X05st095oa
R. Bellio, C. Varin
Inference in generalized linear models with crossed effects is often made cumbersome by the high-dimensional intractable integrals involved in the likelihood function. We propose an inferential strategy based on the pairwise likelihood, which only requires the computation of bivariate distributions. The benefits of our approach are the simplicity of implementation and the potential to handle large data sets. The estimators based on the pairwise likelihood are generally consistent and asymptotically normally distributed. The pairwise likelihood makes it possible to improve on standard inferential procedures by means of bootstrap methods. The performance of the proposed methodology is illustrated by simulations and application to the well-known salamander mating data set.
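In generic notation, the pairwise log-likelihood replaces the full joint likelihood by a sum over bivariate margins:

\[
p\ell(\theta) \;=\; \sum_{i<j} \log f\!\left(y_i, y_j;\theta\right),
\]

where each bivariate density $f(y_i, y_j; \theta)$ involves at most a two-dimensional integral over the crossed random effects shared by observations $i$ and $j$, rather than the high-dimensional integral of the full likelihood; maximising $p\ell$ gives the consistent, asymptotically normal estimators referred to above under the usual composite-likelihood regularity conditions.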
{"title":"A pairwise likelihood approach to generalized linear models with crossed random effects","authors":"R. Bellio, C. Varin","doi":"10.1191/1471082X05st095oa","DOIUrl":"https://doi.org/10.1191/1471082X05st095oa","url":null,"abstract":"Inference in generalized linear models with crossed effects is often made cumbersome by the high-dimensional intractable integrals involved in the likelihood function. We propose an inferential strategy based on the pairwise likelihood, which only requires the computation of bivariate distributions. The benefits of our approach are the simplicity of implementation and the potential to handle large data sets. The estimators based on the pairwise likelihood are generally consistent and asymptotically normally distributed. The pairwise likelihood makes it possible to improve on standard inferential procedures by means of bootstrap methods. The performance of the proposed methodology is illustrated by simulations and application to the well-known salamander mating data set.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"56 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132026665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint modelling of recurrence and progression of adenomas: a latent variable approach
Pub Date: 2005-10-01 | DOI: 10.1191/1471082X05st094oa
Chiu-Hsieh Hsu
In this paper, we treat the number of recurrent adenomatous polyps as a latent variable and then use a mixture distribution to model the number of observed recurrent adenomatous polyps. This approach is equivalent to zero-inflated Poisson regression, a method used to analyse count data with excess zeros. In a zero-inflated Poisson model, a count response variable is assumed to be distributed as a mixture of a Poisson distribution and a distribution with a point mass at zero. In many cancer studies, patients often have variable follow-up. When the disease of interest is subject to late onset, ignoring the length of follow-up will underestimate the recurrence rate. In this paper, we modify zero-inflated Poisson regression through a weight function to incorporate the length of follow-up into the analysis. We motivate, develop, and illustrate the methods described here with an example from a colon cancer study.
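For reference, the zero-inflated Poisson model described above has the probability mass function (standard notation):

\[
P(Y=0) \;=\; \pi + (1-\pi)\,e^{-\mu},
\qquad
P(Y=y) \;=\; (1-\pi)\,\frac{e^{-\mu}\mu^{y}}{y!}, \quad y = 1, 2, \dots,
\]

where $\pi$ is the probability of belonging to the degenerate zero component and $\mu$ is the Poisson mean; one natural role for a follow-up-dependent weight function, as the authors propose, is to discount zeros observed under short follow-up, which would otherwise understate the recurrence rate.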
{"title":"Joint modelling of recurrence and progression of adenomas: a latent variable approach","authors":"Chiu-Hsieh Hsu","doi":"10.1191/1471082X05st094oa","DOIUrl":"https://doi.org/10.1191/1471082X05st094oa","url":null,"abstract":"In this paper, we treat the number of recurrent adenomatous polyps as a latent variable and then use a mixture distribution to model the number of observed recurrent adenomatous polyps. This approach is equivalent to zero-inflated Poisson regression, which is a method used to analyse count data with excess zeros. In a zero-inflated Poisson model, a count response variable is assumed to be distributed as a mixture of a Poisson distribution and a distribution with point mass of one at zero. In many cancer studies, patients often have variable follow-up. When the disease of interest is subject to late onset, ignoring the length of follow-up will underestimate the recurrence rate. In this paper, we modify zero-inflated Poisson regression through a weight function to incorporate the length of follow-up into analysis. We motivate, develop, and illustrate the methods described here with an example from a colon cancer study.","PeriodicalId":354759,"journal":{"name":"Statistical Modeling","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130698551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}