Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran
{"title":"LLM4VV:探索用于验证和核查测试套件的 LLM 即法官","authors":"Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran","doi":"arxiv-2408.11729","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLM) are evolving and have significantly\nrevolutionized the landscape of software development. If used well, they can\nsignificantly accelerate the software development cycle. At the same time, the\ncommunity is very cautious of the models being trained on biased or sensitive\ndata, which can lead to biased outputs along with the inadvertent release of\nconfidential information. Additionally, the carbon footprints and the\nun-explainability of these black box models continue to raise questions about\nthe usability of LLMs. With the abundance of opportunities LLMs have to offer, this paper explores\nthe idea of judging tests used to evaluate compiler implementations of\ndirective-based programming models as well as probe into the black box of LLMs.\nBased on our results, utilizing an agent-based prompting approach and setting\nup a validation pipeline structure drastically increased the quality of\nDeepSeek Coder, the LLM chosen for the evaluation purposes.","PeriodicalId":501197,"journal":{"name":"arXiv - CS - Programming Languages","volume":"10 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites\",\"authors\":\"Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran\",\"doi\":\"arxiv-2408.11729\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models (LLM) are evolving and have significantly\\nrevolutionized the landscape of software development. If used well, they can\\nsignificantly accelerate the software development cycle. At the same time, the\\ncommunity is very cautious of the models being trained on biased or sensitive\\ndata, which can lead to biased outputs along with the inadvertent release of\\nconfidential information. Additionally, the carbon footprints and the\\nun-explainability of these black box models continue to raise questions about\\nthe usability of LLMs. 
With the abundance of opportunities LLMs have to offer, this paper explores\\nthe idea of judging tests used to evaluate compiler implementations of\\ndirective-based programming models as well as probe into the black box of LLMs.\\nBased on our results, utilizing an agent-based prompting approach and setting\\nup a validation pipeline structure drastically increased the quality of\\nDeepSeek Coder, the LLM chosen for the evaluation purposes.\",\"PeriodicalId\":501197,\"journal\":{\"name\":\"arXiv - CS - Programming Languages\",\"volume\":\"10 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Programming Languages\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.11729\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Programming Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.11729","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites
Large Language Models (LLMs) are evolving rapidly and have revolutionized the landscape of software development. If used well, they can significantly accelerate the software development cycle. At the same time, the community remains cautious about models trained on biased or sensitive data, which can lead to biased outputs as well as the inadvertent release of confidential information. Additionally, the carbon footprint and the unexplainability of these black-box models continue to raise questions about the usability of LLMs.

Given the abundance of opportunities LLMs have to offer, this paper explores the idea of using an LLM to judge tests that evaluate compiler implementations of directive-based programming models, and in doing so probes into the black box of LLMs. Based on our results, utilizing an agent-based prompting approach and setting up a validation pipeline structure drastically increased the quality of the output of DeepSeek Coder, the LLM chosen for this evaluation.
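
To make the LLM-as-a-judge idea concrete, the sketch below shows a minimal validation pipeline in Python. It is illustrative only, not the paper's actual implementation: the compiler invocation (nvc with -acc), the judge prompt wording, the JSON scoring rubric, and the query_llm callable (standing in for whatever chat-completion endpoint hosts a model such as DeepSeek Coder) are all assumptions.

```python
# Minimal sketch of an LLM-as-a-judge pipeline for compiler V&V test cases.
# Assumptions: the `nvc -acc` compile command, the prompt text, and the
# `query_llm` callable are placeholders, not the paper's actual setup.
import json
import subprocess

JUDGE_PROMPT = """You are reviewing a validation test for a directive-based
programming model (e.g., OpenACC or OpenMP offloading). Given the test source
and its compile/run results, answer in JSON: {{"valid": true, "reason": "..."}}.

Test source:
{source}

Compiler output:
{compile_log}

Runtime output (exit code {exit_code}):
{run_log}
"""

def compile_and_run(test_path: str, compiler: str = "nvc") -> dict:
    """Compile the candidate test and execute it, capturing logs for the judge."""
    build = subprocess.run([compiler, "-acc", test_path, "-o", "test.bin"],
                           capture_output=True, text=True)
    if build.returncode != 0:
        return {"compile_log": build.stderr, "run_log": "", "exit_code": None}
    run = subprocess.run(["./test.bin"], capture_output=True, text=True)
    return {"compile_log": build.stderr,
            "run_log": run.stdout + run.stderr,
            "exit_code": run.returncode}

def judge_test(test_path: str, query_llm) -> dict:
    """Ask the LLM to judge whether the test is a valid V&V test case."""
    with open(test_path) as f:
        source = f.read()
    results = compile_and_run(test_path)
    prompt = JUDGE_PROMPT.format(source=source, **results)
    # `query_llm` is any text-in/text-out model call; the judge's verdict is
    # parsed from the JSON it returns.
    return json.loads(query_llm(prompt))
```

One design point this sketch tries to capture: the judge's prompt is grounded in actual compile and run logs rather than the test source alone, so the model evaluates the test's observed behavior instead of only its text, which is in the spirit of the validation pipeline described in the abstract.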