财政数据文本:使用自然语言处理从审计报告中提取信息

IF 1.8 Q3 PUBLIC ADMINISTRATION Data & policy Pub Date : 2023-02-28 DOI:10.1017/dap.2023.4
Alejandro Beltran
{"title":"财政数据文本:使用自然语言处理从审计报告中提取信息","authors":"Alejandro Beltran","doi":"10.1017/dap.2023.4","DOIUrl":null,"url":null,"abstract":"Abstract Supreme audit institutions (SAIs) are touted as an integral component to anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies, missing resources, or may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.","PeriodicalId":93427,"journal":{"name":"Data & policy","volume":" ","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2023-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fiscal data in text: Information extraction from audit reports using Natural Language Processing\",\"authors\":\"Alejandro Beltran\",\"doi\":\"10.1017/dap.2023.4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract Supreme audit institutions (SAIs) are touted as an integral component to anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies, missing resources, or may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.\",\"PeriodicalId\":93427,\"journal\":{\"name\":\"Data & policy\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2023-02-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data & policy\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1017/dap.2023.4\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"PUBLIC ADMINISTRATION\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data & policy","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1017/dap.2023.4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"PUBLIC ADMINISTRATION","Score":null,"Total":0}
引用次数: 0

摘要

摘要最高审计机构被吹捧为发展中国家反腐败工作的一个组成部分。SAI审查政府预算,并在公开的审计报告中报告财政差异。这些文件包含关于预算差异、资源缺失的宝贵信息,甚至可能报告欺诈和腐败行为。现有的反腐败工作研究依赖于国家一级最高审计机关发布的信息,而大多忽略了国家以下最高审计机关的审计,因为它们的信息没有以可访问的格式发布。我收集了墨西哥国家以下SAI,即锡那罗亚州高级审计署发布的公开审计报告,并建立了一个提取市政预算中发现的差异货币价值的管道。我使用光学字符识别系统地将扫描文档转换为机器可读文本,然后训练一个分类模型来识别具有相关信息的段落。从相关段落中,我通过开发一个命名实体识别器来提取预算差异的货币价值,该识别器可以自动识别这些信息。在本文中,我解释了构建管道的步骤,并详细说明了在不同上下文中复制管道的过程。由此产生的数据集包含锡那罗亚州市政预算的官方差异金额。这些信息对反腐败政策制定者很有用,因为它量化了市政支出的差异,可能会推动减少挪用的改革。尽管我关注的是墨西哥的一个州,但这种方法可以扩展到任何公开审计报告的情况。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Fiscal data in text: Information extraction from audit reports using Natural Language Processing
Abstract Supreme audit institutions (SAIs) are touted as an integral component to anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies, missing resources, or may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
3.10
自引率
0.00%
发文量
0
审稿时长
12 weeks
期刊最新文献
Determinants for university students’ location data sharing with public institutions during COVID-19: The Italian case Bus Rapid Transit: End of trend in Latin America? Accelerating and enhancing the generation of socioeconomic data to inform forced displacement policy and response “That is why users do not understand the maps we make for them”: Cartographic gaps between experts and domestic workers and the Right to the City Analysis of spatial–temporal validation patterns in Fortaleza’s public transport systems: a data mining approach
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1