Tim Havers, Lukas Masur, Eduard Isenmann, Stephan Geisler, Christoph Zinner, Billy Sperlich, Peter Düking
Biology of Sport, vol. 42, no. 2, pp. 289-329. Published 2025-04-01 (Epub 2024-12-18). DOI: 10.5114/biolsport.2025.145911
Journal Article. Impact factor: 6.4. JCR: Q1 (Sport Sciences). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11963122/pdf/
Reproducibility and quality of hypertrophy-related training plans generated by GPT-4 and Google Gemini as evaluated by coaching experts.
Large Language Models (LLMs) are increasingly utilized in various domains, including the generation of training plans. However, the reproducibility and quality of training plans produced by different LLMs have not been studied extensively. This study aims to: i) investigate and compare the quality of muscle hypertrophy-related resistance training (RT) plans generated by Google Gemini (GG) and GPT-4, and ii) assess the reproducibility of the RT plans when the same prompts are provided multiple times concomitantly. Two distinct prompts were used, one providing little information about the training plan requirements and the other providing detailed information. These prompts were input into GG and GPT-4 by two different individuals, resulting in the generation of eight RT plans. These plans were evaluated by 12 coaching experts using a 5-point Likert scale, based on quality criteria derived from the literature. The results indicated a high degree of reproducibility, as judged by the coaching experts, when the same prompts were provided multiple times to the LLMs of interest, with 27 out of 28 items showing no differences (p > 0.05). Overall, GPT-4 was rated higher on several aspects of RT quality criteria (p = 0.000-0.043). Additionally, compared to little information, higher information density within the prompts resulted in higher-rated RT quality (p = 0.000-0.037). Our findings show that RT plans can be generated reproducibly with the same quality when using the same prompts. Furthermore, quality improves with more detailed input, and GPT-4 outperformed GG in generating higher-quality plans. These results suggest that detailed information input is crucial for LLM performance.
About the journal:
Biology of Sport is the official journal of the Institute of Sport in Warsaw, Poland, published since 1984.
Biology of Sport is an international scientific peer-reviewed journal, published quarterly in both paper and electronic formats. The journal publishes articles concerning basic and applied sciences in sport: sports and exercise physiology, sports immunology and medicine, sports genetics, training and testing, pharmacology, as well as other biological aspects related to sport. Priority is given to interdisciplinary papers.