A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach

Natural Language Processing Journal Pub Date : 2025-02-05 DOI:10.1016/j.nlp.2025.100131

Ghada Soliman Ph.D. , Hozaifa Zaki , Mohamed Kilany

{"title":"A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach","authors":"Ghada Soliman Ph.D. , Hozaifa Zaki , Mohamed Kilany","doi":"10.1016/j.nlp.2025.100131","DOIUrl":null,"url":null,"abstract":"<div><div>Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including Multiple-Choice Question Answering (MCQA) evaluated on benchmark datasets with few-shot prompting. Given the absence of benchmark Science, Technology, Engineering, and Mathematics (STEM) datasets on Multiple-Choice Questions (MCQs) created by LLMs, we employed various LLMs (e.g., Vicuna-13B, Bard, and GPT-3.5) to generate MCQs on STEM topics curated from Wikipedia. We evaluated open-source LLM models such as Llama 2-7B and Mistral-7B Instruct, along with an encoder model such as DeBERTa v3 Large, on inference by adding context in addition to fine-tuning with and without context. The results showed that DeBERTa v3 Large and Mistral-7B Instruct outperform Llama 2-7B, highlighting the potential of LLMs with fewer parameters in answering hard MCQs when given the appropriate context through fine-tuning. We also benchmarked the results of these models against closed-source models such as Gemini and GPT-4 on inference with context, showcasing the potential of narrowing the gap between open-source and closed-source models when context is provided. Our work demonstrates the capabilities of LLMs in creating more challenging tasks that can be used as self-evaluation for other models. It also contributes to understanding LLMs’ capabilities in STEM MCQs tasks and emphasizes the importance of context for LLMs with fewer parameters in enhancing their performance.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"10 ","pages":"Article 100131"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S294971912500007X","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including Multiple-Choice Question Answering (MCQA) evaluated on benchmark datasets with few-shot prompting. Given the absence of benchmark Science, Technology, Engineering, and Mathematics (STEM) datasets on Multiple-Choice Questions (MCQs) created by LLMs, we employed various LLMs (e.g., Vicuna-13B, Bard, and GPT-3.5) to generate MCQs on STEM topics curated from Wikipedia. We evaluated open-source LLM models such as Llama 2-7B and Mistral-7B Instruct, along with an encoder model such as DeBERTa v3 Large, on inference by adding context in addition to fine-tuning with and without context. The results showed that DeBERTa v3 Large and Mistral-7B Instruct outperform Llama 2-7B, highlighting the potential of LLMs with fewer parameters in answering hard MCQs when given the appropriate context through fine-tuning. We also benchmarked the results of these models against closed-source models such as Gemini and GPT-4 on inference with context, showcasing the potential of narrowing the gap between open-source and closed-source models when context is provided. Our work demonstrates the capabilities of LLMs in creating more challenging tasks that can be used as self-evaluation for other models. It also contributes to understanding LLMs’ capabilities in STEM MCQs tasks and emphasizes the importance of context for LLMs with fewer parameters in enhancing their performance.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

求助全文

约1分钟内获得全文去求助

来源期刊

Natural Language Processing Journal

自引率

0.00%

发文量