Learning Curve Estimation with Large Imbalanced Datasets

Aaron N. Richter, T. Khoshgoftaar
Published in: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA)
Publication date: 2019-12-01
DOI: 10.1109/ICMLA.2019.00135 (https://doi.org/10.1109/ICMLA.2019.00135)
Citations: 7

Abstract

Machine learning datasets keep growing in size, and so do the computational requirements for processing them. A useful exercise in machine learning experiments is to approximate how model performance changes as dataset size increases. This can inform application building and data collection efforts, and it can improve computational efficiency by allowing work on subsets of the data. In this paper, we evaluate a learning curve estimation method on three large imbalanced datasets. Estimation is performed by fitting an inverse power law model to a learning curve built from a small amount of data. We then examine how well the estimated curve fits the full learning curve of each dataset. The method has previously been evaluated on small datasets (hundreds or thousands of instances); in this study we show that it is also effective for larger datasets with millions of instances. This is beneficial because only a few thousand instances are required to accurately estimate the performance of models trained on millions of instances. To the best of our knowledge, this is the first study to systematically explore the use of an inverse power law curve fitting method for big data.
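
The estimation step described in the abstract can be illustrated with a minimal sketch. It assumes one common inverse power law form, score(n) = a - b * n^(-c), which may differ from the parameterization used in the paper. The function names, the grid search over c, and the synthetic data points are all hypothetical and chosen only for illustration; they are not the authors' implementation or results.

```python
# Sketch only: fit score(n) = a - b * n**(-c) to a few learning-curve points
# measured on small samples, then extrapolate to a much larger sample size.
# The model form, the grid over c, and the data below are assumptions.

def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Return (a, b, c) minimizing the squared error of a - b * n**(-c)."""
    if c_grid is None:
        c_grid = [i / 100 for i in range(1, 201)]  # candidate c in 0.01..2.00
    m = len(sizes)
    y_mean = sum(scores) / m
    best = None
    for c in c_grid:
        # For a fixed c the model is linear in (a, b): y = a + (-b) * z,
        # with z = n**(-c), so ordinary least squares gives a and b directly.
        z = [n ** (-c) for n in sizes]
        z_mean = sum(z) / m
        szz = sum((zi - z_mean) ** 2 for zi in z)
        if szz == 0:
            continue
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, scores))
        slope = szy / szz
        a = y_mean - slope * z_mean
        b = -slope
        sse = sum((a - b * zi - yi) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    if best is None:
        raise ValueError("need at least two distinct sample sizes")
    return best[1], best[2], best[3]


def predict(params, n):
    """Evaluate the fitted curve at sample size n."""
    a, b, c = params
    return a - b * n ** (-c)


# Synthetic (made-up) learning curve generated from known parameters,
# sampled only at small training-set sizes.
true_params = (0.90, 1.5, 0.5)
small_sizes = [100, 200, 500, 1000, 2000, 5000]
small_scores = [predict(true_params, n) for n in small_sizes]

params = fit_inverse_power_law(small_sizes, small_scores)
estimate_at_1m = predict(params, 1_000_000)  # extrapolated score at 1M instances
```

In practice the small-sample scores would come from models trained on random subsets of the real data, and a nonlinear least-squares routine could replace the grid search. With imbalanced data, each subset would presumably need enough minority-class instances for the measured score to be meaningful.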