财政数据文本：使用自然语言处理从审计报告中提取信息

IF 2.7 Q3 PUBLIC ADMINISTRATION Data & policy Pub Date : 2023-02-28 DOI:10.1017/dap.2023.4

Alejandro Beltran

{"title":"财政数据文本：使用自然语言处理从审计报告中提取信息","authors":"Alejandro Beltran","doi":"10.1017/dap.2023.4","DOIUrl":null,"url":null,"abstract":"Abstract Supreme audit institutions (SAIs) are touted as an integral component to anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies, missing resources, or may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.","PeriodicalId":93427,"journal":{"name":"Data & policy","volume":" ","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2023-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fiscal data in text: Information extraction from audit reports using Natural Language Processing\",\"authors\":\"Alejandro Beltran\",\"doi\":\"10.1017/dap.2023.4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract Supreme audit institutions (SAIs) are touted as an integral component to anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies, missing resources, or may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.\",\"PeriodicalId\":93427,\"journal\":{\"name\":\"Data & policy\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2023-02-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data & policy\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1017/dap.2023.4\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"PUBLIC ADMINISTRATION\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data & policy","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1017/dap.2023.4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"PUBLIC ADMINISTRATION","Score":null,"Total":0}

引用次数: 0

摘要

摘要最高审计机构被吹捧为发展中国家反腐败工作的一个组成部分。SAI审查政府预算，并在公开的审计报告中报告财政差异。这些文件包含关于预算差异、资源缺失的宝贵信息，甚至可能报告欺诈和腐败行为。现有的反腐败工作研究依赖于国家一级最高审计机关发布的信息，而大多忽略了国家以下最高审计机关的审计，因为它们的信息没有以可访问的格式发布。我收集了墨西哥国家以下SAI，即锡那罗亚州高级审计署发布的公开审计报告，并建立了一个提取市政预算中发现的差异货币价值的管道。我使用光学字符识别系统地将扫描文档转换为机器可读文本，然后训练一个分类模型来识别具有相关信息的段落。从相关段落中，我通过开发一个命名实体识别器来提取预算差异的货币价值，该识别器可以自动识别这些信息。在本文中，我解释了构建管道的步骤，并详细说明了在不同上下文中复制管道的过程。由此产生的数据集包含锡那罗亚州市政预算的官方差异金额。这些信息对反腐败政策制定者很有用，因为它量化了市政支出的差异，可能会推动减少挪用的改革。尽管我关注的是墨西哥的一个州，但这种方法可以扩展到任何公开审计报告的情况。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Fiscal data in text: Information extraction from audit reports using Natural Language Processing

Abstract Supreme audit institutions (SAIs) are touted as an integral component to anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies, missing resources, or may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Data & policy

CiteScore

3.10

自引率

0.00%

发文量

审稿时长

12 weeks