Learning Curves Prediction for a Transformers-Based Model

Q1 Multidisciplinary Emerging Science Journal Pub Date : 2023-10-01 DOI:10.28991/esj-2023-07-05-03
Francisco Cruz, Mauro Castelli
Citations: 0

Abstract

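One of the main challenges when training or fine-tuning a machine learning model is determining how many observations are needed to achieve satisfactory performance. In general, more training observations yield a better-performing model, but collecting more data can be time-consuming, expensive, or even impossible. Investigating the relationship between dataset size and model performance is therefore fundamental to deciding, with a certain likelihood, the minimum number of observations needed for training to produce a satisfactory model. The learning curve represents this relationship and is especially useful when choosing a model for a specific task or planning the annotation work for a dataset. The purpose of this paper is to find the functions that best fit the learning curves of a Transformers-based model (LayoutLM) when it is fine-tuned to extract information from invoices. Two new invoice datasets are made available for this task. Combined with a third dataset already available online, 22 sub-datasets are defined, and their learning curves are plotted from cross-validation results. The functions are fit using a non-linear least squares technique. The results show that both a bi-asymptotic function and a Morgan-Mercer-Flodin function fit the learning curves extremely well. In addition, an empirical relation is presented for predicting the learning curve from a single parameter that can be obtained easily in the early stage of the annotation process.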
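To illustrate how a fitted learning curve can answer the "minimum number of observations" question, the sketch below evaluates a Morgan-Mercer-Flodin curve in its common four-parameter form, y = (a·b + c·x^d) / (b + x^d), and inverts it for a target score. The parameter values and the helper names `mmf` and `min_observations` are hypothetical and are not taken from the paper; in practice the parameters would come from a non-linear least squares fit to cross-validation results, and the paper's exact parameterization may differ.

```python
import math


def mmf(x, a, b, c, d):
    """Morgan-Mercer-Flodin curve: equals a at x = 0 and approaches c as x grows.

    Assumes b > 0 and d > 0.
    """
    xd = x ** d
    return (a * b + c * xd) / (b + xd)


def min_observations(target, a, b, c, d):
    """Smallest integer dataset size whose predicted score reaches `target`.

    Requires a < target < c, i.e. the target lies between the two asymptotes.
    """
    if not a < target < c:
        raise ValueError("target must lie strictly between a and c")
    # Solve target = (a*b + c*x**d) / (b + x**d) for x:
    #   x**d = b * (target - a) / (c - target)
    x = (b * (target - a) / (c - target)) ** (1.0 / d)
    return math.ceil(x)


# Hypothetical parameters for illustration only (NOT results from the paper).
params = dict(a=0.10, b=50.0, c=0.95, d=1.2)

n = min_observations(0.90, **params)
print("observations needed:", n)
print("predicted score at that size:", round(mmf(n, **params), 4))
```

Because the curve saturates at c, a target at or above c can never be reached, which is why the helper rejects it rather than returning a number.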
Source journal: Emerging Science Journal (Multidisciplinary)
CiteScore: 5.40
Self-citation rate: 0.00%
Articles published: 155
Review time: 10 weeks