Distilled GPT for source code summarization
Chia-Yi Su, Collin McMillan
Automated Software Engineering, vol. 31(1), published 2024-03-01 (Journal Article)
DOI: 10.1007/s10515-024-00421-4
https://link.springer.com/article/10.1007/s10515-024-00421-4
Impact factor: 2.0; JCR: Q3 (Computer Science, Software Engineering)
Citations: 0
Abstract
A code summary is a brief natural language description of source code. Summaries are usually only a single sentence long, and yet form the backbone of developer documentation. A short description such as “changes all visible polygons to the color blue” can give a programmer a high-level idea of what code does without the effort of reading the code itself. Recently, products based on Large Language Models such as ChatGPT have demonstrated a strong ability to write these descriptions automatically. However, to use these tools, programmers must send their code to untrusted third parties for processing (e.g., via an API call). This loss of custody is not acceptable to many organizations. In this paper, we present an alternative: we train an open source model using sample output generated by GPT-3.5 in a process related to knowledge distillation. Our model is small enough (350M parameters) to be run on a single 16 GB GPU, yet we show in our evaluation that it is large enough to mimic GPT-3.5 on this task.
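To make the size claim concrete, the following is a rough back-of-envelope sketch of how much memory the parameters of a 350M-parameter model occupy. It is not taken from the paper: the byte counts per parameter (2 for fp16 weights, 4 for fp32 weights, and 16 for fp32 weights plus gradients plus two Adam optimizer states) are common rule-of-thumb assumptions, the helper `param_memory_gib` is a hypothetical name, and activation memory is ignored entirely.

```python
def param_memory_gib(n_params: int, bytes_per_param: int) -> float:
    """Memory needed for per-parameter state, in GiB (activations excluded)."""
    return n_params * bytes_per_param / 2**30


# Parameter count stated in the abstract.
N_PARAMS = 350_000_000

# Assumed byte costs per parameter (rule-of-thumb values, not from the paper).
inference_fp16 = param_memory_gib(N_PARAMS, 2)       # half-precision weights only
inference_fp32 = param_memory_gib(N_PARAMS, 4)       # single-precision weights only
training_adam_fp32 = param_memory_gib(N_PARAMS, 16)  # weights + grads + 2 Adam states

print(f"fp16 inference weights: {inference_fp16:.2f} GiB")
print(f"fp32 inference weights: {inference_fp32:.2f} GiB")
print(f"fp32 Adam training state: {training_adam_fp32:.2f} GiB")
```

Under these assumptions, all three figures sit well below 16 GB, which leaves headroom for activations and batch data; the actual footprint depends on sequence length, batch size, and the training setup the authors used.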
About the journal:
This journal details research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences and workshops.