Learning Curve Estimation with Large Imbalanced Datasets
Aaron N. Richter, T. Khoshgoftaar
2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019-12-01
DOI: 10.1109/ICMLA.2019.00135 (https://doi.org/10.1109/ICMLA.2019.00135)
Citations: 7
Abstract
Datasets for machine learning are constantly increasing in size, along with the computational requirements for processing them. A useful exercise in machine learning experiments is to estimate how model performance changes as dataset size increases. This can inform application building and data collection efforts, and it can improve computational efficiency by allowing work on subsets of the data. In this paper, we evaluate a learning curve estimation method on three large imbalanced datasets. Estimation is performed by fitting an inverse power law model to a learning curve created on a small amount of data. We then explore how well this estimated curve fits the full learning curve of each dataset. The method has previously been evaluated on small datasets (hundreds or thousands of instances), and in this study we show that it is also effective for larger datasets with millions of instances. This is beneficial because only a few thousand instances are required to accurately estimate the performance of models trained on millions of instances. To the best of our knowledge, this is the first study to systematically explore the use of an inverse power law curve fitting method for big data.
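The curve-fitting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact functional form, metric, or fitting procedure, so everything here is an assumption: the form score(n) = a - b * n^(-c), the hypothetical helper names `fit_inverse_power_law` and `predict`, the grid search over c, and the synthetic, noise-free scores (which are not results from the paper). For a fixed exponent c the model is linear in a and b, so each grid step reduces to ordinary least squares.

```python
def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Fit score(n) = a - b * n**(-c) to learning-curve points.

    Assumed form, not taken from the paper. For each candidate c the
    model is linear in (a, b), so a and b come from closed-form least
    squares; the c with the smallest squared error is kept.
    """
    if c_grid is None:
        c_grid = [k / 100 for k in range(1, 201)]  # c in 0.01 .. 2.00
    m = len(sizes)
    best = None
    for c in c_grid:
        z = [n ** (-c) for n in sizes]  # transformed training-set sizes
        z_mean = sum(z) / m
        y_mean = sum(scores) / m
        szz = sum((zi - z_mean) ** 2 for zi in z)
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, scores))
        slope = szy / szz               # slope equals -b
        a = y_mean - slope * z_mean     # intercept equals a (the asymptote)
        sse = sum((a + slope * zi - yi) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, -slope, c)
    _, a, b, c = best
    return a, b, c


def predict(n, a, b, c):
    """Estimated score for a training set of n instances."""
    return a - b * n ** (-c)


# Synthetic learning curve on small samples (illustrative values only).
true_a, true_b, true_c = 0.9, 2.0, 0.5
sizes = [100, 200, 500, 1000, 2000, 5000]
scores = [true_a - true_b * n ** (-true_c) for n in sizes]

a, b, c = fit_inverse_power_law(sizes, scores)
estimate_at_1m = predict(1_000_000, a, b, c)  # extrapolate to one million instances
```

In practice the scores would be noisy measurements averaged over repeated samples, and a nonlinear least-squares routine could replace the grid search; the grid is used here only to keep the sketch dependency-free.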