Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification

Journal: Future Internet (IF 2.8, JCR Q2, Computer Science, Information Systems)
Pub Date: 2023-11-09
DOI: 10.3390/fi15110363
Panagiotis Skondras, Panagiotis Zervas, Giannis Tzimas
Citation count: 0

Abstract

In this article, we investigate the potential of synthetic resumes as a means of rapidly generating training data, and their effectiveness for data augmentation, especially in categories with sparse samples. The widespread use of machine learning algorithms in natural language processing (NLP) has notably streamlined resume classification, saving hiring organizations time and cost. However, the performance of these algorithms depends on the abundance of training data. Selecting the right model architecture is essential, but so is the availability of a robust, well-curated dataset, and for many categories in the job market data sparsity remains a challenge. To address it, we used the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetic resumes were cleaned, preprocessed, and then used to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. Both models were evaluated on multiclass resume classification. When trained on an augmented dataset containing 60 percent real data (from the Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model achieved exceptional accuracy. The FFNN, as expected, achieved lower accuracy. These findings highlight the value of augmenting real-world data with ChatGPT-generated synthetic resumes, especially when training data is limited. The suitability of the BERT model for such classification tasks further supports this conclusion.
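The 60/40 mix of real and synthetic resumes described in the abstract can be illustrated with a small sketch. The abstract does not say how the augmented dataset was assembled, so everything below is an assumption for illustration only: the function name `mix_real_and_synthetic`, the `real_percent` parameter, and the random sampling strategy are hypothetical and are not the authors' procedure.

```python
import random
from typing import List, Tuple

# (resume_text, job_category); the field layout is an assumption for this sketch.
Example = Tuple[str, str]


def mix_real_and_synthetic(
    real: List[Example],
    synthetic: List[Example],
    real_percent: int = 60,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Build a shuffled training set with `real_percent` percent real examples.

    Hypothetical helper: it samples without replacement from both pools and
    tags every item with its source ("real" or "synthetic").
    """
    if not 0 < real_percent < 100:
        raise ValueError("real_percent must be between 1 and 99")

    synth_percent = 100 - real_percent
    # Largest total size that both pools can supply at the requested ratio.
    total = min(
        len(real) * 100 // real_percent,
        len(synthetic) * 100 // synth_percent,
    )
    n_real = total * real_percent // 100
    n_synth = total - n_real

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    mixed = [(text, label, "real") for text, label in rng.sample(real, n_real)]
    mixed += [
        (text, label, "synthetic") for text, label in rng.sample(synthetic, n_synth)
    ]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    real_pool = [(f"real resume {i}", "nurse") for i in range(90)]
    synth_pool = [(f"synthetic resume {i}", "nurse") for i in range(40)]
    train = mix_real_and_synthetic(real_pool, synth_pool)
    n_real = sum(1 for _, _, source in train if source == "real")
    print(len(train), n_real, len(train) - n_real)
```

Integer percentages are used instead of a floating-point fraction so that the 60/40 split is exact whenever the pool sizes allow it. In practice the mix would likely be built per job category, since the sparse categories are the ones that need synthetic samples most.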
Source journal: Future Internet
Subject area: Computer Science, Computer Networks and Communications
CiteScore: 7.10
Self-citation rate: 5.90%
Articles published: 303
Review time: 11 weeks
About the journal: Future Internet is a scholarly open access journal which provides an advanced forum for science and research concerned with the evolution of Internet technologies and related smart systems for “Net-Living” development. The general reference subject is therefore the evolution towards the future internet ecosystem, which is feeding a continuous, intensive, artificial transformation of the lived environment, for a widespread and significant improvement of well-being in all spheres of human life (private, public, professional). Included topics are:
• advanced communications network infrastructures
• evolution of internet basic services
• internet of things
• netted peripheral sensors
• industrial internet
• centralized and distributed data centers
• embedded computing
• cloud computing
• software defined network functions and network virtualization
• cloud-let and fog-computing
• big data, open data and analytical tools
• cyber-physical systems
• network and distributed operating systems
• web services
• semantic structures and related software tools
• artificial and augmented intelligence
• augmented reality
• system interoperability and flexible service composition
• smart mission-critical system architectures
• smart terminals and applications
• pro-sumer tools for application design and development
• cyber security compliance
• privacy compliance
• reliability compliance
• dependability compliance
• accountability compliance
• trust compliance
• technical quality of basic services