Variable Inclusion Strategies Through Directed Acyclic Graphs to Adjust Health Surveys Subject to Selection Bias for Producing National Estimates
Yan Li, Katherine E. Irimata, Yulei He, Jennifer Parker
Journal of Official Statistics, 2022-09-01. DOI: https://doi.org/10.2478/jos-2022-0038
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9490791/pdf/nihms-1807439.pdf

Abstract: Along with the rapid emergence of web surveys to address time-sensitive priority topics, various propensity score (PS)-based adjustment methods have been developed to improve population representativeness for nonprobability- or probability-sampled web surveys subject to selection bias. Conventional PS-based methods construct pseudo-weights for the web sample using a higher-quality reference probability sample. The bias reduction achieved, however, depends on the outcome and on the variables collected in both the web and reference samples. A central issue is identifying which variables to include in the PS adjustment. In this paper, the directed acyclic graph (DAG), a graphical tool common in causal studies but largely under-utilized in survey research, is used to examine and elucidate how different types of variables in the causal pathways affect the performance of the PS adjustment. While past literature generally recommends including all available variables, our research demonstrates that only certain types of variables are needed in the PS adjustment. The approach is illustrated with NCHS' Research and Development Survey, a probability-sampled web survey with potential selection bias, PS-adjusted to the National Health Interview Survey to estimate U.S. asthma prevalence. The findings can be used by national statistical offices to design questionnaires with variables that improve web samples' population representativeness and to release more timely and accurate estimates for priority topics.
Construction of Databases for Small Area Estimation
Emily J. Berg
Journal of Official Statistics, 2022-09-01. DOI: https://doi.org/10.2478/jos-2022-0031

Abstract: The demand for small area estimates can conflict with the objective of producing a multi-purpose data set. We use donor imputation to construct a database that supports small area estimation. Appropriately weighted sums of observed and imputed values produce model-based small area estimates. We develop imputation procedures for both unit-level and area-level models. For area-level models, we restrict attention to linear models. We assume a single vector of covariates is used for a possibly multivariate response. Each record in the imputed data set has complete data, an estimation weight, and a set of replicate weights for mean square error (MSE) estimation. We compare imputation procedures based on area-level models to those based on unit-level models through simulation. We apply the methods to the Iowa Seat-Belt Use Survey, a survey designed to produce state-level estimates of the proportions of vehicle occupants who wear a seat belt. We develop a bivariate unit-level model for prediction of county-level proportions of belted drivers and total occupants. We impute values for the proportions of belted drivers and vehicle occupants onto the full population of road segments in the sampling frame. The resulting imputed data set returns approximations for the county-level predictors based on the bivariate model.
Timely Estimates of the Monthly Mexican Economic Activity
F. Corona, G. González-Farías, J. López-Pérez
Journal of Official Statistics, 2022-09-01. DOI: https://doi.org/10.2478/jos-2022-0033

Abstract: In this article, we present a new approach based on dynamic factor models (DFMs) to produce accurate nowcasts of the annual percentage variation of the Mexican Global Economic Activity Indicator (IGAE), the variable commonly used as an approximation of monthly GDP. The procedure exploits the contemporaneous relationship between the IGAE and timely traditional macroeconomic time series as well as nontraditional variables such as Google Trends. We evaluate the performance of the approach in a pseudo-real-time framework that includes the COVID-19 pandemic and conclude that the procedure obtains accurate one- and two-step-ahead estimates, above all when Google Trends variables are used. A further contribution to economic nowcasting is that the approach makes it possible to disentangle the key variables in the DFM by estimating confidence intervals for the factor loadings, and hence to evaluate the statistical significance of the variables in the model. This approach is used in official statistics to obtain preliminary and accurate estimates of the IGAE up to 40 days before the official data release.
Hierarchical Bayesian Model with Inequality Constraints for US County Estimates
Lu Chen, B. Nandram, Nathan B. Cruze
Journal of Official Statistics, 2022-09-01. DOI: https://doi.org/10.2478/jos-2022-0032

Abstract: In the production of US agricultural official statistics, certain inequality and benchmarking constraints must be satisfied. For example, available administrative data provide an accurate lower bound for the county-level estimates of planted acres produced by the U.S. Department of Agriculture's (USDA) National Agricultural Statistics Service (NASS). In addition, the county-level estimates within a state must sum to the state-level estimate. A sub-area hierarchical Bayesian model with inequality constraints that produces county-level estimates satisfying these important relationships is discussed, along with associated measures of uncertainty. The model combines County Agricultural Production Survey (CAPS) data with administrative data. The inequality constraints add complexity to fitting the model and present a computational challenge for a full Bayesian approach. To evaluate the inclusion of these constraints, the models with and without inequality constraints were compared using 2014 corn planted acres estimates for three states. The performance of the constrained model illustrates the improvement in accuracy and precision of the county-level estimates while preserving the required relationships.
{"title":"In Memory of Dr. Lars Lyberg Remembering a Giant in Survey Research 1944–2021","authors":"","doi":"10.2478/jos-2022-0018","DOIUrl":"https://doi.org/10.2478/jos-2022-0018","url":null,"abstract":"","PeriodicalId":51092,"journal":{"name":"Journal of Official Statistics","volume":null,"pages":null},"PeriodicalIF":1.1,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46047412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatial Sampling Design to Improve the Efficiency of the Estimation of the Critical Parameters of the SARS-CoV-2 Epidemic
G. Alleva, G. Arbia, P. D. Falorsi, V. Nardelli, A. Zuliani
Journal of Official Statistics, 2022-06-01. DOI: https://doi.org/10.2478/jos-2022-0019

Abstract: Given the urgent need for information on the spread of infection during the COVID-19 pandemic, we propose in this article a sampling design for building a continuous-time surveillance system. Compared with other observational strategies, the proposed method has three important elements of strength and originality: (1) it provides a snapshot of the phenomenon at a single moment in time while being designed as a continuous survey repeated in several waves, taking account of different target variables at different stages of the epidemic; (2) the statistical optimality properties of the proposed estimators are formally derived and tested with a Monte Carlo experiment; and (3) it is rapidly operational, a property required by the emergency connected with the diffusion of the virus. The sampling design was conceived with the diffusion of SARS-CoV-2 in Italy during the spring of 2020 in mind, but it is very general, and we are confident that it can easily be extended to other geographical areas and to possible future epidemic outbreaks. Formal proofs and a Monte Carlo exercise show that the estimators are unbiased and more efficient than those under a simple random sampling scheme.
Book Review
Ann-Marie Flygare, Ingegerd Jansson
Journal of Official Statistics, 2022-06-01. DOI: https://doi.org/10.2478/jos-2022-0030

Viscusi may be the most prominent academic critic of current public health approaches to smoking, often serving as an expert witness for cigarette companies. He is perhaps best known for his conclusion that smokers provide governments with net economic benefits because they pay more in taxes than do nonsmokers and, thanks to their smoking-shortened lives, consume fewer government benefits. Viscusi's new book, Smoke-Filled Rooms, presents itself as a critical analysis of the states' settlements of their lawsuits against the cigarette companies. This framework serves Viscusi well, because it supports the narrow, dollars-and-cents approach he favors and excludes important public health considerations. As he writes, "the focus of the litigation is solely on whether the government incurred financial costs as a result of the cigarettes." Not only are private costs ignored, but so are the suffering and loss caused by smoking and other undesirable effects that are not reflected in government expenditures. "Framing the question in this manner may seem narrow, which it is," he writes, and he blames the "anti-smoking forces and the governmental lawsuits" for creating such a framework. But Viscusi does not go beyond this kind of sterile and limited economic view.

This narrow approach might make sense if Viscusi's book discussed only the state tobacco settlements, but it clearly does much more. Besides criticizing all other litigation against the tobacco companies (and similar litigation against other businesses), Viscusi evaluates current government and public health initiatives for reducing tobacco use, finds them lacking, and offers a controversial alternative approach. Viscusi's analysis is often superficial and incomplete, even within the narrow framework he has chosen. In his accounting of smoking-related costs and savings, for example, Viscusi states that "this comprehensive review reflects all cost components that have been recognized in the professional economics literature." But he later, without explanation, indicates that he "will omit influences such as costs associated with low-birthweight babies," despite estimates that the costs resulting from smoking-affected pregnancies are as high as $2 billion per year. Other overlooked costs include Social Security survivors' insurance payments to spouses and children of adults who die early because of smoking, cleaning and maintenance costs related to smoking, and costs related to secondhand smoke. Although Viscusi considers the costs of secondhand smoke in a separate chapter, he does not provide any estimate or substantial discussion of the costs of treating ailments caused or exacerbated by secondhand smoke, nor does he cite the published research that does so. It is also impossible to evaluate the subtotals of costs and savings that Viscusi does present, because he reveals very little about his underlying calculations, data, and assumptions.

Viscusi's conclusion that smoking has a net positive effect …
{"title":"Rejoinder: Measuring Inflation under Pandemic Conditions","authors":"W. Diewert, Kevin J. Fox","doi":"10.2478/jos-2022-0029","DOIUrl":"https://doi.org/10.2478/jos-2022-0029","url":null,"abstract":"","PeriodicalId":51092,"journal":{"name":"Journal of Official Statistics","volume":null,"pages":null},"PeriodicalIF":1.1,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44661399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Iterative Kernel Density Estimation Applied to Grouped Data: Estimating Poverty and Inequality Indicators from the German Microcensus
Paul Walter, Marcus Gross, T. Schmid, K. Weimer
Journal of Official Statistics, 2022-06-01. DOI: https://doi.org/10.2478/jos-2022-0027

Abstract: The estimation of poverty and inequality indicators based on survey data is straightforward as long as the variable of interest (e.g., income or consumption) is measured on a metric scale. However, estimation is not directly possible, using standard formulas, when the income variable is grouped for confidentiality reasons or in order to decrease item nonresponse. We propose an iterative kernel density algorithm that generates metric pseudo samples from the grouped variable for the estimation of indicators. The corresponding standard errors are estimated by a non-parametric bootstrap that accounts for the additional uncertainty due to the grouping. The algorithm accommodates survey weights and household equivalence scales. The proposed method is applied to the German Microcensus to estimate the regional distribution of poverty and inequality in Germany.
Some Thoughts on Official Statistics and its Future (with discussion)
Yves Tillé, M. Debusschere, Henri Luomaranta, Martin Axelson, E. Elvers, A. Holmberg, R. Valliant
Journal of Official Statistics, 2022-06-01. DOI: https://doi.org/10.2478/jos-2022-0026

Abstract: In this article, we share some reflections on the state of statistical science and its evolution in the production systems of official statistics. We first attempt a synthesis of the evolution of statistical thinking. We then examine the evolution of practices in official statistics, which very early on had to face a diversification of sources: first censuses, then sample surveys, and finally administrative files. At each stage, a profound revision of methods was necessary. We show that since the middle of the 20th century, one of the major challenges of statistics has been to produce estimates from a variety of sources. To this end, a large number of methods have been proposed, based on very different foundations. The term "big data" encompasses a set of sources and new statistical methods. We first examine the potential for valorization of big data in official statistics. Some applications, such as image analysis for agricultural prediction, are very old and will be developed further. However, we report our skepticism towards web-scraping methods. We then examine the use of new deep learning methods. With access to more and more sources, the great challenge will remain the valorization and harmonization of these sources.