Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification

Future Internet (IF 2.8, JCR Q2, Computer Science, Information Systems) | Pub Date: 2023-11-09 | DOI: 10.3390/fi15110363

Panagiotis Skondras, Panagiotis Zervas, Giannis Tzimas



Abstract

In this article, we investigate the potential of synthetic resumes as a means of rapidly generating training data, and their effectiveness for data augmentation, especially in categories with sparse samples. The widespread use of machine learning algorithms in natural language processing (NLP) has notably streamlined resume classification, delivering time and cost savings for hiring organizations. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is equally important to ensure the availability of a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To address it, we used the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetic resumes were cleaned, preprocessed, and then used to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. Both models were evaluated on a multiclass resume classification task. When trained on an augmented dataset containing 60 percent real data (from the Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model achieved exceptional accuracy. The FFNN, as expected, achieved lower accuracy. These findings highlight the value of augmenting real-world data with ChatGPT-generated synthetic resumes, especially when training data are limited. The suitability of the BERT model for such classification tasks further supports this conclusion.
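The 60/40 mix of real and synthetic data described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `build_augmented_set`, its parameters, and the toy samples are hypothetical, and the sketch assumes all real samples are kept while synthetic ones are randomly subsampled to reach the target ratio.

```python
import random


def build_augmented_set(real, synthetic, real_fraction=0.6, seed=0):
    """Mix real and synthetic (text, label) pairs.

    Hypothetical helper: keeps every real sample and draws enough synthetic
    samples so that roughly `real_fraction` of the result is real data.
    """
    if not 0 < real_fraction <= 1:
        raise ValueError("real_fraction must be in (0, 1]")
    rng = random.Random(seed)
    # Synthetic samples needed to hit the target ratio, capped by availability.
    n_synth = round(len(real) * (1 - real_fraction) / real_fraction)
    n_synth = min(n_synth, len(synthetic))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


# Toy placeholder data (not from the paper): 6 real and 10 synthetic samples.
real = [(f"real resume {i}", "engineer") for i in range(6)]
synthetic = [(f"synthetic resume {i}", "engineer") for i in range(10)]
mixed = build_augmented_set(real, synthetic)
n_real = sum(1 for text, _ in mixed if text.startswith("real"))
print(len(mixed), n_real)
```

Fixing the seed keeps the subsample reproducible across runs, which makes it easier to compare classifiers trained on the same mixture.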


Source journal

Future Internet (Computer Science: Computer Networks and Communications)

CiteScore: 7.10
Self-citation rate: 5.90%
Articles published: 303
Review time: 11 weeks

About the journal: Future Internet is a scholarly open access journal that provides an advanced forum for science and research concerned with the evolution of Internet technologies and related smart systems for “Net-Living” development. Its general subject is therefore the evolution towards the future internet ecosystem, which is driving a continuous, intensive, artificial transformation of the lived environment, aimed at a widespread and significant improvement of well-being in all spheres of human life (private, public, professional). Included topics are: • advanced communications network infrastructures • evolution of internet basic services • internet of things • netted peripheral sensors • industrial internet • centralized and distributed data centers • embedded computing • cloud computing • software defined network functions and network virtualization • cloud-let and fog-computing • big data, open data and analytical tools • cyber-physical systems • network and distributed operating systems • web services • semantic structures and related software tools • artificial and augmented intelligence • augmented reality • system interoperability and flexible service composition • smart mission-critical system architectures • smart terminals and applications • pro-sumer tools for application design and development • cyber security compliance • privacy compliance • reliability compliance • dependability compliance • accountability compliance • trust compliance • technical quality of basic services.