Multiple crises, including the COVID-19 pandemic and the increased frequency and intensity of disasters related to climate change, have demonstrated the critical importance of timely and open access to trusted data. Open data principles and practices that facilitate data access and use, ensure relevance to policy needs, and increase the impact and value of data are central to building trust in data. The paper outlines four trends that present opportunities for expanding the adoption and use of open data principles and practices and for building data trust: the modernization of data governance; increased attention to the role of citizens in building trust, increasing the relevance of data, and contributing to data throughout the data value chain; the adoption of open data principles; and the work of watchdog organizations that monitor the progress of countries and agencies and identify areas of data governance that still need attention.
{"title":"Building trust and facilitating use of data","authors":"Francesca Perucci, Eric Swanson","doi":"10.3233/sji-240006","DOIUrl":"https://doi.org/10.3233/sji-240006","url":null,"PeriodicalId":55877,"journal":{"name":"Statistical Journal of the IAOS","volume":"177 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140469907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Asatryan, V. Aleksanyan, Samvel Asatryan, M. Manucharyan
The purpose of this paper is to provide an empirical assessment of the economic efficiency of grape-producing farms in Armenia. After a review of related studies in the field, frontier analysis was selected as the methodological basis of the study. More specifically, a two-stage empirical analysis was performed: the efficiency levels of grape farms were first measured with the DEA technique, and the determinants of the resulting efficiency scores were then assessed with Tobit modeling. To obtain the necessary data, 365 grape farms in the Armavir region were surveyed. The main findings suggest that the average efficiency score for grape farms is 0.72, which leaves room to improve the economic performance of farms by 28%. The main determinants of farm efficiency were the grape varieties cultivated, farm size, and the selling prices of grapes. The results largely support the findings of similar studies carried out for viticulture regions across the world. The study provides a methodological basis for extending similar studies to other Armenian viticulture regions and to different years, in order to explore changes in the efficiency of grape farms over time. The article also provides a base of knowledge for policymakers, scholars, researchers, investors, and credit companies to draw on in their decision-making processes and for other purposes.
{"title":"Analyzing commercial grape farm efficiency in Armavir region (Armenia) by using two-stage empirical approach","authors":"H. Asatryan, V. Aleksanyan, Samvel Asatryan, M. Manucharyan","doi":"10.3233/sji-230064","DOIUrl":"https://doi.org/10.3233/sji-230064","url":null,"PeriodicalId":55877,"journal":{"name":"Statistical Journal of the IAOS","volume":"28 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140499035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
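The link between the 0.72 average efficiency score and the 28% room for improvement in the grape-farm abstract can be illustrated with a minimal sketch. It covers only the single-input, single-output special case of constant-returns DEA, where each farm's score reduces to its productivity divided by the best observed productivity. The paper's own model presumably uses several inputs and a linear program, and the second-stage Tobit regression is not shown. The function name and all farm figures below are hypothetical.

```python
from statistics import mean


def crs_efficiency(inputs, outputs):
    """Constant-returns efficiency for one input and one output.

    In this special case the DEA score of a farm equals its
    output/input ratio divided by the best ratio in the sample.
    """
    if len(inputs) != len(outputs) or not inputs:
        raise ValueError("inputs and outputs must be non-empty and equal in length")
    productivity = [y / x for x, y in zip(inputs, outputs)]
    best = max(productivity)
    return [p / best for p in productivity]


# Hypothetical farms: cultivated land (ha) and grape output (tonnes).
land_ha = [2.0, 4.0, 5.0, 10.0]
grapes_t = [20.0, 32.0, 30.0, 50.0]

scores = crs_efficiency(land_ha, grapes_t)
average_score = mean(scores)
# Room for improvement is the gap between the average score and the frontier (1.0).
improvement_room = 1.0 - average_score

print([round(s, 2) for s in scores])
print(round(average_score, 3), round(improvement_room, 3))
```

With these made-up figures the average score is 0.725, so the implied room for improvement is 27.5%, the same arithmetic that turns 0.72 into 28% in the abstract.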
Charlotte Juul Hansen, Lina Maria Sanchez Cespedes, Leonardo Trujillo Oyola, X. K. Dimakos, Bianca Walsh, Renata Souza Bueno, Amos T. Kabo-Bah, Omar Seidu, Vibeke Oestreich Nielsen
National statistical offices (NSOs) and academia benefit from establishing partnerships and collaborating in different ways by bringing together their respective expertise. Collaborative alliances of this nature appear to offer numerous advantages for both the partners and the public and seem to be essential for unlocking opportunities within the evolving data ecosystem. Establishing good and fruitful collaboration between academia and NSOs requires a collaborative environment where each partner can see the benefits of the collaboration and how they could contribute. Different areas of collaboration are presented within four categories: education and learning, research, promotion of data use in society and providing services to each other. The article further discusses the benefits and conditions of a successful partnership. Examples from Brazil, Colombia, Ghana, and Norway showcase practical-level experiences and some lessons learned at the country level.
{"title":"Collaboration between national statistical offices and academia: Benefits, conditions, areas of collaboration and practical level experience in countries","authors":"Charlotte Juul Hansen, Lina Maria Sanchez Cespedes, Leonardo Trujillo Oyola, X. K. Dimakos, Bianca Walsh, Renata Souza Bueno, Amos T. Kabo-Bah, Omar Seidu, Vibeke Oestreich Nielsen","doi":"10.3233/sji-230117","DOIUrl":"https://doi.org/10.3233/sji-230117","url":null,"PeriodicalId":55877,"journal":{"name":"Statistical Journal of the IAOS","volume":"37 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140498342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jonas Klingwort, Sven Alexander Brocker, Christian Borgs
German official statistics include statistics on personal insolvency. These statistics have recently been enhanced using web scraping to extract additional information from a public website on which insolvency announcements are published. The scraped data is currently used for quality assurance and to derive an early indicator of personal insolvency. This paper provides novel methodological analyses of the same administrative database and presents further opportunities to improve the detail and timeliness of the current official statistics using web scraping and text mining. The newly derived statistics describe several aspects of the demographic and spatial distribution of personal insolvency.
{"title":"Spatial and demographic distributions of personal insolvency: An opportunity for official statistics","authors":"Jonas Klingwort, Sven Alexander Brocker, Christian Borgs","doi":"10.3233/sji-230072","DOIUrl":"https://doi.org/10.3233/sji-230072","url":null,"PeriodicalId":55877,"journal":{"name":"Statistical Journal of the IAOS","volume":"30 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138997362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we focus on respondent-driven sampling (RDS), which is a valuable survey methodology to estimate the size and the characteristics of hidden or hard-to-measure population groups. The RDS methodology makes it possible to gather information on these populations by exploiting the relationships between their components. However, RDS suffers from the lack of an estimation methodology that is sufficiently robust to accommodate the varying conditions under which it is applied. In this paper, we address the estimation problem of the RDS methodology and, by approaching it as a particular indirect sampling technique, we propose three unbiased estimation methods as possible solutions.
{"title":"Unbiased estimation strategies for respondent driven sampling","authors":"P. D. Falorsi, G. Alleva, Francesca Petrarca","doi":"10.3233/sji-230087","DOIUrl":"https://doi.org/10.3233/sji-230087","url":null,"PeriodicalId":55877,"journal":{"name":"Statistical Journal of the IAOS","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139254869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The paper deals with the concept and definitions of hard-to-reach groups and the ways of capturing them in administrative sources. It provides a detailed discussion of the meaning of hard-to-reach in the context of administrative sources and in relation to the traditional hard-to-count groups in censuses and surveys. The review of country practices shows that hard-to-reach populations in administrative data can be interpreted in different ways and that their definition depends on countries’ circumstances, though there are two main reasons for identifying a group as hard-to-reach in administrative sources. One interpretation selects groups typically considered difficult to reach with traditional survey methods (such as homeless people, illegal immigrants or indigenous people) and then tries to capture them in registers, either to overcome the challenges of traditional field collection or to obtain more complete information. At first glance, administrative data might offer the potential to improve frame coverage for some target populations, but it may also create other hard-to-reach or “hidden” populations among different population groups. Indeed, the other interpretation refers to the incompleteness of registers or linked administrative databases, which makes some groups, such as children or elders, hard to reach and hence hard to describe with data, because of time lags in the reporting of some events or other accuracy problems with the source itself. The paper summarizes the experience of national statistical offices in accessing hard-to-reach groups and describes the problems and challenges in capturing them. It also proposes possible further work to improve access to hard-to-reach groups using administrative data.
{"title":"Hard-to-reach population groups in administrative sources: main challenges and future work","authors":"Donatella Zindato, Maciej Truszczynski","doi":"10.3233/sji-230074","DOIUrl":"https://doi.org/10.3233/sji-230074","url":null,"PeriodicalId":55877,"journal":{"name":"Statistical Journal of the IAOS","volume":"16 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139254707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we formulate the problem of estimating the resident population, i.e. correcting for over-counts in administrative register data, as a binary classification problem. We propose a solution based on machine learning algorithms. The selection and optimisation of the best algorithm are shown to depend on the goal of prediction. We illustrate the method for two important cases in official statistics: the Census resident population and survey design with minimum non-response. The performance of the algorithms and the uncertainty of the estimates and of the evaluation metrics are described in detail, and their computation is implemented in shared, open source code. We illustrate with results obtained by applying the method to Icelandic register and survey data.
{"title":"Machine learning estimation of the resident population","authors":"Violeta Calian, Margherita Zuppardo, Omar Hardarson","doi":"10.3233/sji-230090","DOIUrl":"https://doi.org/10.3233/sji-230090","url":null,"PeriodicalId":55877,"journal":{"name":"Statistical Journal of the IAOS","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139260061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Manuel Leonard F. Albis, Sabrina O. Romasoc, Shushimita G. Pelayo, Bea Andrea C. Gavira, Jazzen Paul J. Asombrado
Official price statistics in the Philippines are mainly sourced from regular surveys and censuses, which entail high costs. As businesses move onto digital platforms, alternatives to these traditional data sources have become more available; one of them is web scraping, a process of collecting information from the web. As digital and online platforms are increasingly used for commerce, web scraping offers a way to increase the frequency of data collection while reducing its cost relative to price surveys. This paper surveys the experiences of various government statistical agencies in web scraping for the Consumer Price Index (CPI). It also details the Philippines’ experience of using web-scraped data to estimate the food and alcoholic beverages CPI of the National Capital Region, and compares the estimate with the official CPI of the Philippine Statistics Authority. Finally, the paper discusses the challenges encountered and recommendations for enhancing the approach.
{"title":"Web scraping for price statistics in the Philippines","authors":"Manuel Leonard F. Albis, Sabrina O. Romasoc, Shushimita G. Pelayo, Bea Andrea C. Gavira, Jazzen Paul J. Asombrado","doi":"10.3233/sji-230030","DOIUrl":"https://doi.org/10.3233/sji-230030","url":null,"PeriodicalId":55877,"journal":{"name":"Statistical Journal of the IAOS","volume":"30 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139264876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
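To make the step from scraped prices to an index number concrete, the sketch below computes an elementary price index as the geometric mean of price relatives (a Jevons index) over products observed in both periods. The abstract on web scraping for the Philippine CPI does not state which index formula the authors used, so the Jevons choice is an assumption, and the function name, product keys and prices are all hypothetical.

```python
import math


def jevons_index(base_prices, current_prices):
    """Geometric mean of price relatives, scaled so the base period is 100.

    Only products present in both periods are compared; scraped products
    that appear or disappear between periods are ignored in this sketch.
    """
    matched = sorted(set(base_prices) & set(current_prices))
    if not matched:
        raise ValueError("no products are observed in both periods")
    log_relatives = [
        math.log(current_prices[name] / base_prices[name]) for name in matched
    ]
    return 100.0 * math.exp(sum(log_relatives) / len(log_relatives))


# Hypothetical scraped prices in pesos for two collection periods.
base = {"rice_1kg": 50.0, "eggs_dozen": 100.0, "beer_330ml": 40.0}
current = {
    "rice_1kg": 55.0,
    "eggs_dozen": 100.0,
    "beer_330ml": 44.0,
    "instant_noodles": 12.0,  # new listing, no base price, so it is skipped
}

index = jevons_index(base, current)
print(round(index, 2))
```

Matching products across periods is the fragile part in practice, since scraped listings change names and package sizes; the dictionary keys here stand in for whatever product identifier a real pipeline would maintain.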
Like many countries, Ireland has been researching new systems of population estimates compiled from administrative data. Ireland does not have a Central Population Register from which the estimates can be compiled. The first step in compiling population estimates from administrative data is to build a Statistical Population Dataset (SPD). Ideally, an SPD has one record for each person in the population, containing the relevant attributes. The ideal SPD then allows statistics to be compiled by simply counting over records. In practice, the compilation of SPDs is prone to error. These errors can be classified into four types: overcoverage, undercoverage, domain misclassification and linkage error. To date, Ireland has investigated two different approaches to compiling population estimates from administrative data. The first, labeled in this paper the simple count method, builds an SPD that minimises the overall number of individual record errors, so that simple counts from the SPD provide population estimates. The second, labeled in this paper the estimation method, builds an SPD that aims to eliminate all error types except undercoverage and then adjusts counts for undercoverage using Dual System Estimation (DSE) methods to obtain population estimates. This paper explores the advantages and disadvantages of both methods before considering how they could be integrated to eliminate the disadvantages. Many national statistical institutes (NSIs) will face similar challenges when compiling annual census-like population estimates, and this paper aims to contribute to that discussion.
{"title":"To count or to estimate: A note on compiling population estimates from administrative data","authors":"John Dunne, Francesca Kay, Timothy Linehan","doi":"10.3233/sji-230067","DOIUrl":"https://doi.org/10.3233/sji-230067","url":null,"PeriodicalId":55877,"journal":{"name":"Statistical Journal of the IAOS","volume":"6 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139271001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
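The undercoverage adjustment in the estimation method described in the Irish abstract can be sketched with the classical two-list (Lincoln-Petersen) form of Dual System Estimation, together with Chapman's small-sample variant. This is a minimal illustration that assumes the two lists are independent and free of overcoverage and linkage error. The abstract does not say which DSE variant or second list the authors use, so the function names and all counts below are hypothetical.

```python
def lincoln_petersen(n_spd, n_second, n_matched):
    """Classical dual system estimate of the population size.

    n_spd:     persons counted in the SPD
    n_second:  persons counted in an independent second list
    n_matched: persons found in both lists
    """
    if n_matched <= 0:
        raise ValueError("at least one matched record is required")
    return n_spd * n_second / n_matched


def chapman(n_spd, n_second, n_matched):
    """Chapman's variant, which stays defined when no records match."""
    return (n_spd + 1) * (n_second + 1) / (n_matched + 1) - 1


# Hypothetical counts for one domain (for example one age group in one county).
spd_count = 900
second_list_count = 500
matched_count = 450

estimate = lincoln_petersen(spd_count, second_list_count, matched_count)
# Share of the estimated population that the SPD misses.
undercoverage_rate = 1 - spd_count / estimate

print(estimate, round(undercoverage_rate, 3))
print(round(chapman(spd_count, second_list_count, matched_count), 1))
```

The simple count method would report 900 for this domain, while the dual system estimate scales the count up by the inverse of the matched share of the second list (450 of 500), which is the trade-off between counting and estimating that the abstract discusses.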
To improve the analysis of respondent comments from the Canadian Census of Population, data scientists at Statistics Canada compared and evaluated traditional machine learning, deep learning and transformer-based techniques. Cross-lingual Language Model-Robustly Optimized Bidirectional Encoder Representations from Transformers (XLM-R), a cross-lingual language model, fine-tuned on census respondent comments, yielded the best overall result, an F1 score of 89.91%, despite language and class imbalances. Following the evaluation, the fine-tuned model was successfully deployed to categorize comments from the 2021 Census of Population objectively and with high accuracy. As a result, feedback from respondents was directed to the appropriate subject matter analysts for post-collection analysis.
{"title":"Classifying respondent comments from the 2021 Canadian Census of Population using machine learning methods","authors":"Joanne Yoon","doi":"10.3233/sji-230063","DOIUrl":"https://doi.org/10.3233/sji-230063","url":null,"PeriodicalId":55877,"journal":{"name":"Statistical Journal of the IAOS","volume":"46 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139276268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
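Because the census-comment abstract reports an overall F1 score under class imbalance, it helps to see how the averaging choice affects such a figure. The sketch below computes per-class F1 from true and predicted labels and then both the unweighted (macro) and support-weighted averages. The abstract does not say which average the 89.91% refers to, so neither is claimed to be the authors' metric, and the class names and labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per class, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean over classes: rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean over classes weighted by the number of true examples per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in support) / len(y_true)


# Hypothetical, imbalanced comment categories.
y_true = ["content"] * 6 + ["technical"] * 3 + ["other"]
y_pred = ["content"] * 5 + ["technical"] * 3 + ["content", "other"]

print({c: round(s, 3) for c, s in per_class_f1(y_true, y_pred).items()})
print(round(macro_f1(y_true, y_pred), 3), round(weighted_f1(y_true, y_pred), 3))
```

On this toy data the macro average is about 0.833 and the weighted average is 0.8. The two differ because the single "other" comment is classified correctly and carries a third of the macro average but only a tenth of the weighted one.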