A Scalable Algorithm for Active Learning
Youguang Chen, Zheyu Wen, George Biros
arXiv - STAT - Machine Learning, arxiv-2409.07392, published 2024-09-11

FIRAL is a recently proposed deterministic active learning algorithm for
multiclass classification using logistic regression. It was shown to outperform
the state-of-the-art in terms of accuracy and robustness and comes with
theoretical performance guarantees. However, FIRAL scales poorly to datasets
with a large number of points $n$, dimensions $d$, and classes $c$, because it
requires $\mathcal{O}(c^2d^2+nc^2d)$ storage and
$\mathcal{O}(c^3(nd^2 + bd^3 + bn))$ computation, where $b$ is the
number of points to select in active learning. To address these challenges, we
propose an approximate algorithm with storage requirements reduced to
$\mathcal{O}(n(d+c) + cd^2)$ and a computational complexity of
$\mathcal{O}(bncd^2)$. Additionally, we present a parallel implementation on
GPUs. We demonstrate the accuracy and scalability of our approach using MNIST,
CIFAR-10, Caltech101, and ImageNet. The tests show no loss of accuracy
compared to FIRAL. We report strong and weak scaling tests on up to
12 GPUs for a synthetic dataset of three million points.
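
To get a rough feel for the complexity bounds quoted above, the following minimal sketch evaluates their leading-order terms for one problem size. It is only an illustration: the function names and the sample values of $n$, $d$, $c$, and $b$ are assumptions chosen here, not figures from the paper, and the hidden constants in the $\mathcal{O}(\cdot)$ bounds are ignored, so the ratios indicate scaling trends rather than measured memory or runtime.

```python
# Leading-order terms of the complexity bounds stated in the abstract.
# Constants hidden by the O(.) notation are ignored (assumption), so the
# numbers below are term counts, not bytes or seconds.

def firal_storage(n, d, c):
    """O(c^2 d^2 + n c^2 d) storage term for the original FIRAL."""
    return c**2 * d**2 + n * c**2 * d

def firal_compute(n, d, c, b):
    """O(c^3 (n d^2 + b d^3 + b n)) compute term for the original FIRAL."""
    return c**3 * (n * d**2 + b * d**3 + b * n)

def approx_storage(n, d, c):
    """O(n (d + c) + c d^2) storage term for the approximate algorithm."""
    return n * (d + c) + c * d**2

def approx_compute(n, d, c, b):
    """O(b n c d^2) compute term for the approximate algorithm."""
    return b * n * c * d**2

# Hypothetical problem size, chosen only for illustration.
n, d, c, b = 1_000_000, 100, 10, 10

storage_ratio = firal_storage(n, d, c) / approx_storage(n, d, c)
compute_ratio = firal_compute(n, d, c, b) / approx_compute(n, d, c, b)

print(f"storage terms: FIRAL / approximate = {storage_ratio:.1f}")
print(f"compute terms: FIRAL / approximate = {compute_ratio:.1f}")
```

Because the approximate bound grows with the product $bncd^2$ while the original grows with $c^3$, the compute ratio depends on how $b$ compares with $c^2$; trying other values of `b` and `c` in the sketch makes that trade-off visible.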