Using machine learning algorithms to identify farms on the 2022 Census of Agriculture

Statistical Journal of the IAOS (JCR Q3, Decision Sciences) | Pub Date: 2024-05-08 | DOI: 10.3233/sji-230089
Gavin Corral, Luca Sartore, Katherine Vande Pol, Denise A. Abreu, Linda J Young
Citations: 0

Abstract

As is the case for many National Statistics Institutes, the United States Department of Agriculture’s (USDA’s) National Agricultural Statistics Service (NASS) has observed dwindling survey response rates, and the requests for more information at finer temporal and spatial scales have led to increased response burdens. Non-survey data are becoming increasingly abundant and accessible. Consequently, NASS is exploring the potential to complete some or all of a survey record using non-survey data, which would reduce respondent burden and potentially lead to increased response rates. In this paper, the focus is on a large set of records associated with potential farms, which are operations with undetermined farm status (farm/non-farm) and are referred to here as operations with unknown status (OUS). Although they usually have some agriculture, most OUS records are eventually classified as non-farms. Those OUS that are classified as farms tend to have higher proportions of producers from under-represented groups compared to other records. Determining the probability that an OUS record is a farm is an important step in the imputation process. The OUS records that responded to the 2017 U.S. Census of Agriculture were used to develop models to predict farm status using multiple data sources. Evaluated models include bootstrap random forest (RF), logistic regression (LR), neural network (NN), and support vector machine (SVM). Although the SVM had the best outcomes for three of the five metrics, the sensitivity for identifying farms was the lowest (13.8%). The NN model had a sensitivity of 80.5%, which was substantially higher than the other models, and its specificity of 45.3% was the lowest of all models. Because sensitivity was the primary metric of interest and the NN performed reasonably well on the other metrics, the NN was selected as the preferred model.
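The abstract's model choice rests on the trade-off between sensitivity (the share of true farms that are identified) and specificity (the share of true non-farms that are identified). The sketch below illustrates those two metrics only. It is not the authors' code: the labels, predicted probabilities, thresholds, and the function names `sensitivity_specificity` and `classify` are all hypothetical, chosen to show how a lower decision threshold can raise sensitivity at the cost of specificity.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = farm, 0 = non-farm)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # true farms correctly flagged as farms
    specificity = tn / (tn + fp)  # true non-farms correctly flagged as non-farms
    return sensitivity, specificity


def classify(probs: Sequence[float], threshold: float) -> list:
    """Label a record as a farm (1) when its predicted probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical toy data: 4 farms and 6 non-farms with made-up farm probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.45, 0.2, 0.6, 0.4, 0.35, 0.3, 0.1, 0.05]

for threshold in (0.5, 0.3):
    sens, spec = sensitivity_specificity(y_true, classify(probs, threshold))
    print(f"threshold={threshold:.1f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises sensitivity from 0.500 to 0.750 while specificity falls from 0.833 to 0.333. This is the same kind of trade-off the abstract reports between the SVM and NN models.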
Source journal: Statistical Journal of the IAOS
Subject area: Economics, Econometrics and Finance (Economics and Econometrics)
CiteScore: 1.30
Self-citation rate: 0.00%
Articles published: 116
About the journal: This is the flagship journal of the International Association for Official Statistics and is expected to be widely circulated and subscribed to by individuals and institutions in all parts of the world. The main aim of the Journal is to support the IAOS mission by publishing articles to promote the understanding and advancement of official statistics and to foster the development of effective and efficient official statistical services on a global basis. Papers are expected to be of wide interest to readers. Such papers may or may not contain strictly original material. All papers are refereed.