{"title":"评估大型语言模型的研究质量:不同设置和输入下 ChatGPT 的有效性分析","authors":"Mike Thelwall","doi":"arxiv-2408.06752","DOIUrl":null,"url":null,"abstract":"Evaluating the quality of academic journal articles is a time consuming but\ncritical task for national research evaluation exercises, appointments and\npromotion. It is therefore important to investigate whether Large Language\nModels (LLMs) can play a role in this process. This article assesses which\nChatGPT inputs (full text without tables, figures and references; title and\nabstract; title only) produce better quality score estimates, and the extent to\nwhich scores are affected by ChatGPT models and system prompts. The results\nshow that the optimal input is the article title and abstract, with average\nChatGPT scores based on these (30 iterations on a dataset of 51 papers)\ncorrelating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is\nslightly better than 3.5-turbo (0.66), and 4o-mini (0.66). The results suggest\nthat article full texts might confuse LLM research quality evaluations, even\nthough complex system instructions for the task are more effective than simple\nones. Thus, whilst abstracts contain insufficient information for a thorough\nassessment of rigour, they may contain strong pointers about originality and\nsignificance. Finally, linear regression can be used to convert the model\nscores into the human scale scores, which is 31% more accurate than guessing.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"4 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs\",\"authors\":\"Mike Thelwall\",\"doi\":\"arxiv-2408.06752\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Evaluating the quality of academic journal articles is a time consuming but\\ncritical task for national research evaluation exercises, appointments and\\npromotion. It is therefore important to investigate whether Large Language\\nModels (LLMs) can play a role in this process. This article assesses which\\nChatGPT inputs (full text without tables, figures and references; title and\\nabstract; title only) produce better quality score estimates, and the extent to\\nwhich scores are affected by ChatGPT models and system prompts. The results\\nshow that the optimal input is the article title and abstract, with average\\nChatGPT scores based on these (30 iterations on a dataset of 51 papers)\\ncorrelating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is\\nslightly better than 3.5-turbo (0.66), and 4o-mini (0.66). The results suggest\\nthat article full texts might confuse LLM research quality evaluations, even\\nthough complex system instructions for the task are more effective than simple\\nones. Thus, whilst abstracts contain insufficient information for a thorough\\nassessment of rigour, they may contain strong pointers about originality and\\nsignificance. 
Finally, linear regression can be used to convert the model\\nscores into the human scale scores, which is 31% more accurate than guessing.\",\"PeriodicalId\":501285,\"journal\":{\"name\":\"arXiv - CS - Digital Libraries\",\"volume\":\"4 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Digital Libraries\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.06752\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Digital Libraries","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.06752","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs
Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process. This article assesses which ChatGPT inputs (full text without tables, figures and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts. The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66). The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores to the human scale, which is 31% more accurate than guessing.
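The analysis pipeline the abstract describes (repeated scoring, averaging over iterations, correlation with human scores, and a linear regression mapping onto the human scale) can be illustrated with a short sketch. This is not the paper's own code: the data below is synthetic, the 1-4 human score scale and the use of SciPy/scikit-learn are assumptions, and only the shape of the calculation follows the abstract.

```python
# Minimal sketch of the scoring pipeline described in the abstract.
# All numbers here are illustrative; only the dataset size (51 papers)
# and the number of repeated queries (30) come from the abstract.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

n_papers, n_iterations = 51, 30
# Hypothetical human quality scores on a 1-4 scale (assumed, e.g. REF-style).
human_scores = rng.integers(1, 5, size=n_papers).astype(float)

# Hypothetical ChatGPT scores: one row per paper, one column per iteration.
chatgpt_scores = human_scores[:, None] + rng.normal(0.0, 1.0, size=(n_papers, n_iterations))

# Step 1: average the repeated queries for each paper.
mean_scores = chatgpt_scores.mean(axis=1)

# Step 2: Pearson correlation between averaged model scores and human scores.
r, p = pearsonr(mean_scores, human_scores)
print(f"correlation with human scores: r={r:.2f} (p={p:.3f})")

# Step 3: linear regression converting averaged model scores to the human scale.
reg = LinearRegression().fit(mean_scores.reshape(-1, 1), human_scores)
converted = reg.predict(mean_scores.reshape(-1, 1))

# Compare the converted scores against a baseline of always guessing the mean.
mae_regression = np.abs(converted - human_scores).mean()
mae_guessing = np.abs(human_scores.mean() - human_scores).mean()
print(f"MAE regression: {mae_regression:.2f}, MAE guessing: {mae_guessing:.2f}")
```

Averaging over the 30 repeated queries reduces the run-to-run variability of the model's scores before the correlation and regression steps are applied.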