A Hybrid Sampling Methodology Based Cost Effective Active Learning for Tabular Data
Authors: Sharath M. Shankaranarayana, Anubhab Samal, Chikka Veera Raghavendra, Vijay Sankar, Arindam Dey, Sangam Kumar Singh
Published in: 2024 16th International Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 147-152
Publication date: 2024-01-03
DOI: 10.1109/COMSNETS59351.2024.10427050 (https://doi.org/10.1109/COMSNETS59351.2024.10427050)
Abstract
Active learning (AL) is an attractive machine learning paradigm for real-world domains with high labelling costs. Its main objective is to efficiently acquire data for annotation from a typically large pool of unlabelled data, thereby reducing the labelling effort required of human experts and letting them focus on the most informative data points, those that would improve model performance on the task at hand. Thus, given an unlabelled dataset and a fixed labelling budget, AL aims to select a subset of examples such that labelling them improves model performance. The central component of an AL workflow is the sampling process, which identifies the most valuable samples to label. The most common strategy is uncertainty sampling, in which each AL training iteration selects the points the model is most uncertain about. Another important strategy is diversity sampling, which selects a collection of samples that represents the entire data distribution well. In this paper, we propose a hybrid-sampling-based active learning method for classification tasks on tabular data, in which we jointly sample uncertain and diverse points at every AL iteration. This hybrid sampling is achieved by jointly training a classification model and an outlier detection model. In addition, we propose an improved cost-effective active learning (CEAL) method that automatically selects high-confidence data points and assigns them pseudo-labels based not only on the model's confidence but also on the outlier detection module.
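The hybrid query step can be illustrated with a minimal sketch. The abstract does not say how the classifier's uncertainty and the outlier detector's output are combined, so everything below is an assumption made for illustration: the function name `hybrid_query`, the use of prediction entropy as the uncertainty measure, the use of a normalised outlier score as a stand-in for diversity, and the weighted sum controlled by the hypothetical parameter `alpha`. The class probabilities and outlier scores are taken as given; in practice they would come from the jointly trained models.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_query(class_probs, outlier_scores, budget, alpha=0.5):
    """Return indices of the `budget` pool points with the highest hybrid score.

    Assumption: the hybrid score is a weighted sum of normalised entropy
    (uncertainty) and normalised outlier score (used here as a diversity proxy).
    This is an illustrative combination rule, not the one from the paper.
    """
    if len(class_probs) != len(outlier_scores):
        raise ValueError("class_probs and outlier_scores must have equal length")
    uncertainty = min_max([entropy(p) for p in class_probs])
    novelty = min_max(outlier_scores)
    scores = [alpha * u + (1.0 - alpha) * n for u, n in zip(uncertainty, novelty)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Toy pool of four unlabelled points (made-up numbers for illustration only).
pool_probs = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01], [0.6, 0.4]]
pool_outlier = [0.1, 0.9, 0.2, 0.3]
selected = hybrid_query(pool_probs, pool_outlier, budget=2)
print(selected)
```

Setting `alpha=1.0` in this sketch reduces it to plain uncertainty sampling, and `alpha=0.0` ranks points by outlier score alone, which makes it easy to compare the hybrid selection against each single strategy on the same pool.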
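The pseudo-labelling idea in the CEAL variant can be sketched in the same spirit. The abstract states only that pseudo-labels depend on both the model's confidence and the outlier detection module; the gating rule below (confidence above one threshold and outlier score below another) and the names `pseudo_label`, `conf_threshold`, and `outlier_threshold` are assumptions for illustration, not the authors' procedure.

```python
def pseudo_label(class_probs, outlier_scores,
                 conf_threshold=0.95, outlier_threshold=0.5):
    """Return {pool_index: pseudo_label} for confident, inlier-looking points.

    Assumption: a point is pseudo-labelled only when the classifier's top
    probability reaches `conf_threshold` AND its outlier score stays at or
    below `outlier_threshold`. Both thresholds are hypothetical values.
    """
    pseudo = {}
    for i, (probs, score) in enumerate(zip(class_probs, outlier_scores)):
        confidence = max(probs)
        if confidence >= conf_threshold and score <= outlier_threshold:
            # The predicted class index becomes the pseudo-label.
            pseudo[i] = probs.index(confidence)
    return pseudo


# Toy pool (made-up numbers): point 1 is confident but flagged as an outlier.
pool_probs = [[0.98, 0.02], [0.97, 0.03], [0.55, 0.45], [0.01, 0.99]]
pool_outlier = [0.1, 0.8, 0.1, 0.2]
labels = pseudo_label(pool_probs, pool_outlier)
print(labels)
```

In this sketch the outlier check acts as a second filter on top of confidence: a point the classifier is sure about is still sent to the human annotator pool if the outlier detector considers it atypical.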