LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites
Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran
arXiv:2408.11729 (arXiv - CS - Programming Languages), published 2024-08-21
Abstract
Large Language Models (LLMs) are evolving rapidly and have reshaped the landscape of software development. Used well, they can significantly accelerate the software development cycle. At the same time, the community remains cautious about models trained on biased or sensitive data, which can lead to biased outputs and the inadvertent release of confidential information. Additionally, the carbon footprint and the lack of explainability of these black-box models continue to raise questions about the usability of LLMs. Given the abundance of opportunities LLMs offer, this paper explores the idea of judging tests used to evaluate compiler implementations of directive-based programming models, while also probing into the black box of LLMs. Based on our results, employing an agent-based prompting approach and setting up a validation pipeline structure drastically increased the quality of the output of DeepSeek Coder, the LLM chosen for this evaluation.
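To make the idea concrete, the sketch below shows one hypothetical way an LLM-as-a-judge validation pipeline for directive-based compiler tests could be wired together: first gather mechanical evidence (does the test compile and run?), then hand that evidence plus the test source to an LLM judge. This is not the authors' implementation; the compiler invocation (gcc -fopenmp), the prompt wording, the VALID/INVALID verdict format, and the llm_complete hook are all illustrative assumptions, and a real setup would connect the hook to a DeepSeek Coder endpoint.

```python
"""Minimal sketch of an LLM-as-a-judge pipeline for compiler validation tests.

Illustrative only: the compile command, prompt, and scoring format are
assumptions, not the setup described in the paper.
"""
import shutil
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class ValidationResult:
    compiled: bool
    ran: bool
    output: str


def validate(test_source: str, compiler: str = "gcc") -> ValidationResult:
    """Stage 1: try to compile and run the candidate test case."""
    if shutil.which(compiler) is None:
        return ValidationResult(False, False, f"{compiler} not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "test.c"
        binary = Path(tmp) / "test.bin"
        src.write_text(test_source)
        build = subprocess.run(
            [compiler, "-fopenmp", str(src), "-o", str(binary)],
            capture_output=True, text=True,
        )
        if build.returncode != 0:
            return ValidationResult(False, False, build.stderr)
        run = subprocess.run([str(binary)], capture_output=True, text=True, timeout=30)
        return ValidationResult(True, run.returncode == 0, run.stdout + run.stderr)


def judge(test_source: str, result: ValidationResult,
          llm_complete: Callable[[str], str]) -> str:
    """Stage 2: ask an LLM judge to grade the test, given the pipeline evidence."""
    prompt = (
        "You are a judge for compiler validation tests of directive-based "
        "programming models.\n"
        f"Compiled: {result.compiled}. Ran cleanly: {result.ran}.\n"
        f"Compiler/runtime output:\n{result.output}\n\n"
        f"Test source:\n{test_source}\n\n"
        "Answer with VALID or INVALID and a one-sentence justification."
    )
    return llm_complete(prompt)


if __name__ == "__main__":
    sample_test = (
        "#include <omp.h>\n"
        "int main(void) { return omp_get_max_threads() > 0 ? 0 : 1; }\n"
    )
    evidence = validate(sample_test)
    # Plug a real client (e.g., a DeepSeek Coder endpoint) into llm_complete;
    # an echo stub keeps the sketch runnable without network access.
    verdict = judge(sample_test, evidence, llm_complete=lambda p: "VALID (stub response)")
    print(verdict)
```

In this sketch, the mechanical validation stage filters out tests that obviously fail to build, so the LLM judge only has to reason about semantic validity, which mirrors the division of labor a pipeline-plus-agent approach implies.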