BidCorpus: A multifaceted learning dataset for public procurement.

Data in Brief (IF 1.0, JCR Q3, Multidisciplinary Sciences) | Pub Date: 2024-12-09 | eCollection Date: 2025-02-01 | DOI: 10.1016/j.dib.2024.111202
Weslley Lima, Victor Silva, Jasson Silva, Ricardo Lira, Anselmo Paiva
Data in Brief, vol. 58, article 111202. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11715116/pdf/
Citations: 0

Abstract

Digital transformation has significantly impacted public procurement, improving operational efficiency, transparency, and competition. It has also enabled the automation of data analysis and oversight in public administration. Public procurement involves several stages and generates a large number of documents. However, experts still analyze these unstructured textual documents manually, which is time-consuming and inefficient. To address this issue, we introduce BidCorpus, a novel and comprehensive dataset consisting of thousands of documents related to public procurement, specifically bidding notices from Brazilian public websites. The dataset was labeled using weak supervision techniques, manual labeling, and BERT-based language models. Models trained on these annotated data showed promising results, with metrics greater than 80% in various experiments. The models were also robust to intentional changes made to bidding notices to evade fraud detection. All the resources from this work are publicly available, including the documents, the pre-processing scripts, and the materials for training and evaluating the models. We expect the dataset and its labels to be of great value to researchers working on public procurement problems.
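The abstract does not describe the labeling pipeline in detail, so the following is only a minimal sketch of what weak supervision typically looks like, not the authors' method. Several noisy heuristic "labeling functions" vote on each document and the majority vote becomes the weak label. The function names, the English keyword rules, and the label names `restrictive` and `regular` are all hypothetical and chosen purely for illustration; the actual notices are Brazilian bidding documents with their own label scheme.

```python
from collections import Counter
from typing import Callable, List, Optional

# A labeling function returns ABSTAIN when its heuristic does not apply.
ABSTAIN = None

# Hypothetical labeling functions: each encodes one noisy keyword heuristic.
# Neither the keywords nor the label names come from the BidCorpus paper.
def lf_mentions_brand(text: str) -> Optional[str]:
    return "restrictive" if "specific brand" in text.lower() else ABSTAIN

def lf_short_deadline(text: str) -> Optional[str]:
    return "restrictive" if "within 24 hours" in text.lower() else ABSTAIN

def lf_open_competition(text: str) -> Optional[str]:
    return "regular" if "open to all bidders" in text.lower() else ABSTAIN

LABELING_FUNCTIONS: List[Callable[[str], Optional[str]]] = [
    lf_mentions_brand,
    lf_short_deadline,
    lf_open_competition,
]

def weak_label(
    text: str,
    lfs: List[Callable[[str], Optional[str]]] = LABELING_FUNCTIONS,
) -> Optional[str]:
    """Combine labeling-function votes by simple majority; abstain on ties."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    # If the top two labels have equal support, no label is assigned.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

if __name__ == "__main__":
    notice = "Supply must be of a specific brand, delivered within 24 hours."
    print(weak_label(notice))
```

In a setup like this, documents that receive a weak label could be used to train a classifier, while abstained or tied documents are natural candidates for manual labeling.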

Source journal: Data in Brief (Multidisciplinary Sciences)
CiteScore: 3.10
Self-citation rate: 0.00%
Articles published: 996
Review time: 70 days
Journal description: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.
Latest articles from this journal:
- An ecological connectivity dataset for Black Sea obtained from sea currents.
- A dataset on environmental DNA, bacterio-, phyto- and zooplankton from an emerging periglacial lagoon in Svalbard, Arctic.
- "Play by play": A dataset of handball and basketball game situations in a standardized space.
- Smartphone image dataset for radish plant leaf disease classification from Bangladesh.
- LipBengal: Pioneering Bengali lip-reading dataset for pronunciation mapping through lip gestures.