NAICS Code Prediction Using Supervised Methods

Statistics and Public Policy (IF 1.5; Q2, Social Sciences, Mathematical Methods). Pub Date: 2022-01-24. DOI: 10.1080/2330443X.2022.2033654
C. Oehlert, Evan T. Schulz, Anne Parker
Statistics and Public Policy, vol. 9, no. 1, pp. 58-66.
Citations: 3

Abstract

When compiling industry statistics or selecting businesses for further study, researchers often rely on North American Industry Classification System (NAICS) codes. However, these codes are self-reported on tax forms, and reporting an incorrect code, or even leaving the code blank, has no tax consequences, so the reported codes are often unusable. The IRS's Statistics of Income (SOI) program validates NAICS codes for businesses in the statistical samples used to produce official tax statistics for various filing populations, including sole proprietorships (those filing Form 1040 Schedule C) and corporations (those filing Forms 1120). In this article, we leverage these samples to explore ways to improve NAICS code reporting for all filers in the relevant populations. For sole proprietorships, we overcame several record linkage complications to combine data from SOI samples with other administrative data. Using the SOI-validated NAICS codes as ground truth, we trained classification-tree-based models (randomForest) to predict NAICS industry sector from other tax return data, including text descriptions, both for businesses that initially reported a valid NAICS code and for those that did not. For both sole proprietorships and corporations, we were able to improve slightly on the accuracy of valid self-reported industry sectors and to correctly identify the sector for over half of the businesses with no informative reported NAICS code.
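The comparison described in the abstract, scoring a reported or predicted industry sector against the SOI-validated one, can be illustrated with a minimal sketch. This is not the authors' code: the function names `naics_sector` and `sector_accuracy` and the sample codes are hypothetical, and the sketch assumes that "sector" means the standard two-digit NAICS grouping (with 31-33, 44-45, and 48-49 each treated as one sector) and that a blank or unrecognized code counts as a miss.

```python
from typing import Optional, Sequence

# Two-digit NAICS prefixes that belong to a multi-prefix sector.
_MERGED_PREFIXES = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # Manufacturing
    "44": "44-45", "45": "44-45",                  # Retail Trade
    "48": "48-49", "49": "48-49",                  # Transportation and Warehousing
}
# Two-digit prefixes that are sectors on their own.
_SINGLE_PREFIXES = {
    "11", "21", "22", "23", "42", "51", "52", "53", "54",
    "55", "56", "61", "62", "71", "72", "81", "92",
}


def naics_sector(code: Optional[str]) -> Optional[str]:
    """Return the sector label for a NAICS code, or None if the code is
    blank, non-numeric, or does not start with a recognized sector prefix."""
    if code is None:
        return None
    digits = code.strip()
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()):
        return None
    prefix = digits[:2]
    if prefix in _MERGED_PREFIXES:
        return _MERGED_PREFIXES[prefix]
    return prefix if prefix in _SINGLE_PREFIXES else None


def sector_accuracy(candidate: Sequence[Optional[str]],
                    validated: Sequence[str]) -> float:
    """Share of records whose candidate sector equals the validated sector.
    Assumption of this sketch: an uninformative candidate counts as wrong."""
    if len(candidate) != len(validated):
        raise ValueError("candidate and validated must have the same length")
    if not validated:
        raise ValueError("at least one record is required")
    hits = 0
    for cand, truth in zip(candidate, validated):
        cand_sector = naics_sector(cand)
        if cand_sector is not None and cand_sector == naics_sector(truth):
            hits += 1
    return hits / len(validated)


# Made-up records for illustration only (not data from the article).
validated_codes = ["541110", "238220", "722511", "445110"]
self_reported_codes = ["541990", "", "999999", "452000"]
accuracy = sector_accuracy(self_reported_codes, validated_codes)
```

The same `sector_accuracy` call could be applied to model predictions instead of self-reported codes, which would give the kind of side-by-side sector accuracy comparison the abstract reports.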