Authors: Yang Song; Hui Zou
Journal: IEEE Transactions on Information Theory, vol. 70, no. 12, pp. 9001-9011
Publication date: 2024-09-12
Publication type: Journal Article
DOI: 10.1109/TIT.2024.3459814
URL: https://ieeexplore.ieee.org/document/10679682/
Journal impact factor: 2.2 (JCR Q3, Computer Science, Information Systems)
Minimax Optimal Rates With Heavily Imbalanced Binary Data
In a wide range of binary prediction and estimation tasks, the data set exhibits a high degree of imbalance between the sample sizes of the two classes, which greatly hinders the performance of standard machine learning methods. In spite of a vast collection of methods aiming to achieve better performance on heavily imbalanced data, the theoretical limit of estimation with imbalanced data remains unknown. This paper provides some insights into the imbalanced classification problem by establishing the minimax risk of log-odds function estimation. Our minimax bounds reveal a notion of effective sample size. We further construct a sampling technique and prove that a minimax-rate optimal method for balanced data combined with the sampling technique achieves minimax-rate optimal performance on imbalanced data.
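As a rough illustration of what pairing a sampling step with a balanced-data estimator can look like, the sketch below uses plain majority-class undersampling followed by the standard log-odds offset correction. This is not the sampling technique constructed in the paper, which the abstract does not specify. The function names, the keep probability, and the simulated class prior are all hypothetical choices made for the example. It relies only on a well-known fact: if every majority-class (label 0) observation is kept independently with probability r, the log-odds on the subsample equal the original log-odds minus log r.

```python
import math
import random


def undersample_majority(labels, keep_prob, rng):
    """Keep every positive (label 1); keep each negative (label 0)
    independently with probability keep_prob. Returns kept indices.
    Hypothetical helper, not the paper's sampling scheme."""
    return [i for i, y in enumerate(labels) if y == 1 or rng.random() < keep_prob]


def empirical_log_odds(labels):
    """Intercept-only log-odds estimate: log(#positives / #negatives)."""
    pos = sum(labels)
    neg = len(labels) - pos
    return math.log(pos / neg)


def corrected_log_odds(subsample_log_odds, keep_prob):
    """Map log-odds estimated on the undersampled data back to the
    original population: logit_original = logit_subsample + log(keep_prob)."""
    return subsample_log_odds + math.log(keep_prob)


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated heavily imbalanced labels: about 1% positives (assumed prior).
    labels = [1 if rng.random() < 0.01 else 0 for _ in range(200_000)]

    keep_prob = 0.02  # assumed keep rate for the majority class
    kept = undersample_majority(labels, keep_prob, rng)
    subsample = [labels[i] for i in kept]

    full = empirical_log_odds(labels)
    corrected = corrected_log_odds(empirical_log_odds(subsample), keep_prob)
    print(f"kept {len(kept)} of {len(labels)} observations")
    print(f"log-odds on full data:        {full:.3f}")
    print(f"corrected subsample log-odds: {corrected:.3f}")
```

In this toy setting the corrected estimate stays close to the full-data value even though most majority-class observations are discarded, because the minority class count is unchanged by the subsampling. Whether and how this connects to the effective sample size in the paper's bounds is not something the abstract states.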
About the journal:
The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.