Leveraging LLMs for optimised feature selection and embedding in structured data: A case study on graduate employment classification
Radiah Haque, Hui-Ngo Goh, Choo-Yee Ting, Albert Quek, M.D. Rakibul Hasan
Computers and Education: Artificial Intelligence, Vol. 8, Article 100356 (published 2024-12-22). DOI: 10.1016/j.caeai.2024.100356
https://www.sciencedirect.com/science/article/pii/S2666920X24001590
Abstract
The application of Machine Learning (ML) for predicting graduate student employability is a growing area of research, driven by the need to align educational outcomes with job market requirements. In this context, this paper investigates the application of Large Language Models (LLMs) for tabular data transformation and embedding, specifically using Bidirectional Encoder Representations from Transformers (BERT), to enhance the performance of ML models in binary classification tasks for student employability prediction. The primary objective is to determine whether converting structured data into text format improves model accuracy. The study involves several ML models, including Artificial Neural Networks (ANN), CatBoost, and a BERT classifier. The focus is on predicting the employment status of graduates based on demographic, academic, and graduate tracer study data collected from over 4000 university graduates. Feature selection methods, including Boruta and the Extra Tree Classifier (ETC), are employed to identify the optimal feature set, guided by a sliding window algorithm for automatic feature selection. The models are trained in four stages: 1) the original dataset without feature selection or word embedding, 2) the dataset with the selected optimal features, 3) the transformed data with word embedding, and 4) the transformed data with feature selection applied both before and after word embedding. In the baseline stage (without feature selection or embedding), the ANN model achieved the highest accuracy (79%). Subsequently, applying ETC for feature selection improved accuracy, with CatBoost achieving 83%. Further transformation with BERT-based embeddings raised the highest accuracy to 85%, obtained with the BERT classifier. Finally, the optimal accuracy of 88% was obtained by applying feature selection both before and after embedding, with the BERT-Boruta model. The findings from this study demonstrate that the dual-stage feature selection approach, combined with BERT embedding, significantly increases classification accuracy, highlighting the potential of LLMs for transforming tabular data to enhance graduate employment prediction.
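To make the pipeline described in the abstract concrete, the sketch below illustrates the general idea of serialising tabular records into text, embedding them with BERT, and then running a second feature-selection pass over the embedding dimensions. It is a minimal, hypothetical illustration, not the authors' implementation: the row-to-text template, the bert-base-uncased checkpoint, the helper names (row_to_text, embed_rows, select_embedding_dims), and the use of an ETC-based selector for the second stage are all assumptions, since the abstract does not specify these details (the paper's best-performing second stage uses Boruta, available via the boruta package).

```python
# Hypothetical sketch of the abstract's pipeline; details are assumptions.
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

def row_to_text(row: pd.Series) -> str:
    """Serialise one tabular record as 'column is value' clauses (assumed template)."""
    return "; ".join(f"{col} is {val}" for col, val in row.items())

# Assumed checkpoint; the paper does not name the exact BERT variant in the abstract.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed_rows(df: pd.DataFrame, batch_size: int = 32) -> np.ndarray:
    """Return one [CLS] vector (768-dim) per serialised row."""
    texts = df.apply(row_to_text, axis=1).tolist()
    vectors = []
    with torch.no_grad():
        for start in range(0, len(texts), batch_size):
            batch = tokenizer(texts[start:start + batch_size],
                              padding=True, truncation=True,
                              max_length=128, return_tensors="pt")
            out = bert(**batch)
            # Take the [CLS] token embedding as the row representation.
            vectors.append(out.last_hidden_state[:, 0, :].numpy())
    return np.vstack(vectors)

def select_embedding_dims(X_emb: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Second-stage selection over embedding dimensions (ETC ranker as a stand-in;
    the paper's reported best configuration uses Boruta at this step)."""
    etc = ExtraTreesClassifier(n_estimators=200, random_state=0)
    selector = SelectFromModel(etc).fit(X_emb, y)
    return selector.transform(X_emb)
```

In the study's dual-stage setup, a first selection pass would be applied to the raw tabular columns before serialisation, and the second pass (as sketched above) to the BERT embedding dimensions before training the downstream classifier.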