{"title":"Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs","authors":"Guillermo Marco, Luz Rello, Julio Gonzalo","doi":"arxiv-2409.11547","DOIUrl":null,"url":null,"abstract":"In this paper, we evaluate the creative fiction writing abilities of a\nfine-tuned small language model (SLM), BART Large, and compare its performance\nto humans and two large language models (LLMs): GPT-3.5 and GPT-4o. Our\nevaluation consists of two experiments: (i) a human evaluation where readers\nassess the stories generated by the SLM compared to human-written stories, and\n(ii) a qualitative linguistic analysis comparing the textual characteristics of\nthe stories generated by the different models. In the first experiment, we\nasked 68 participants to rate short stories generated by the models and humans\nalong dimensions such as grammaticality, relevance, creativity, and\nattractiveness. BART Large outperformed human writers in most aspects, except\ncreativity, with an overall score of 2.11 compared to 1.85 for human-written\ntexts -- a 14% improvement. In the second experiment, the qualitative analysis\nrevealed that, while GPT-4o exhibited near-perfect internal and external\ncoherence, it tended to produce more predictable narratives, with only 3% of\nits stories seen as novel. In contrast, 15% of BART's stories were considered\nnovel, indicating a higher degree of creativity despite its smaller model size.\nThis study provides both quantitative and qualitative insights into how model\nsize and fine-tuning influence the balance between creativity, fluency, and\ncoherence in creative writing tasks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"3 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11547","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART Large, and compare its performance to humans and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human evaluation in which readers assess stories generated by the SLM against human-written stories, and (ii) a qualitative linguistic analysis comparing the textual characteristics of the stories generated by the different models. In the first experiment, we asked 68 participants to rate short stories generated by the models and by humans along dimensions such as grammaticality, relevance, creativity, and attractiveness. BART Large outperformed the human writers in most aspects except creativity, with an overall score of 2.11 compared to 1.85 for human-written texts -- a 14% improvement. In the second experiment, the qualitative analysis revealed that, while GPT-4o exhibited near-perfect internal and external coherence, it tended to produce more predictable narratives: only 3% of its stories were seen as novel. In contrast, 15% of BART Large's stories were considered novel, indicating a higher degree of creativity despite the model's smaller size. This study provides both quantitative and qualitative insights into how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks.