源代码分析的多角度表示学习(特邀教程)

软件产业与工程 Pub Date : 2022-11-07 DOI:10.1145/3540250.3569446

Zhi Jin

{"title":"源代码分析的多角度表示学习(特邀教程)","authors":"Zhi Jin","doi":"10.1145/3540250.3569446","DOIUrl":null,"url":null,"abstract":"Programming languages are artificial and highly restricted languages. But source code is there to tell computers as well as programmers what to do, as an act of communication. Despite its weird syntax and is riddled with different delimiters, the good news is that the very large corpus of open-source code is available. That makes it reasonable to apply machine learning techniques to source code to enable the source code analytics. Despite there are plenty of deep learning frameworks in the field of NLP, source code analytics has different features. In addition to the conventional way of coding, understanding the meaning of code involves many perspectives. The source code representation could be the token sequence, the API call sequence, the data dependency graph, and the control flow graph, as well as the program hierarchy, etc. This tutorial will tell the long, ongoing, and fruitful journey on exploiting the potential power of deep learning techniques in source code analytics. It will highlight that how code representation models can be utilized to support software engineers to perform different tasks that require proficient programming knowledge. The exploratory work show that code does imply the learnable knowledge, more precisely the learnable tacit knowledge. Although such knowledge is not easily transferrable between humans, it can be transferred between the automated programming tasks. A vision for future research will be stated for source code analytics.","PeriodicalId":68155,"journal":{"name":"软件产业与工程","volume":"5 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-perspective representation learning for source code analytics (invited tutorial)\",\"authors\":\"Zhi Jin\",\"doi\":\"10.1145/3540250.3569446\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Programming languages are artificial and highly restricted languages. But source code is there to tell computers as well as programmers what to do, as an act of communication. Despite its weird syntax and is riddled with different delimiters, the good news is that the very large corpus of open-source code is available. That makes it reasonable to apply machine learning techniques to source code to enable the source code analytics. Despite there are plenty of deep learning frameworks in the field of NLP, source code analytics has different features. In addition to the conventional way of coding, understanding the meaning of code involves many perspectives. The source code representation could be the token sequence, the API call sequence, the data dependency graph, and the control flow graph, as well as the program hierarchy, etc. This tutorial will tell the long, ongoing, and fruitful journey on exploiting the potential power of deep learning techniques in source code analytics. It will highlight that how code representation models can be utilized to support software engineers to perform different tasks that require proficient programming knowledge. The exploratory work show that code does imply the learnable knowledge, more precisely the learnable tacit knowledge. Although such knowledge is not easily transferrable between humans, it can be transferred between the automated programming tasks. A vision for future research will be stated for source code analytics.\",\"PeriodicalId\":68155,\"journal\":{\"name\":\"软件产业与工程\",\"volume\":\"5 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"软件产业与工程\",\"FirstCategoryId\":\"1089\",\"ListUrlMain\":\"https://doi.org/10.1145/3540250.3569446\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"软件产业与工程","FirstCategoryId":"1089","ListUrlMain":"https://doi.org/10.1145/3540250.3569446","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

编程语言是人工的、高度受限的语言。但是，作为一种交流行为，源代码的存在是为了告诉计算机和程序员该做什么。尽管它的语法很奇怪，并且充满了不同的分隔符，但好消息是有非常大的开源代码语料库可用。这使得将机器学习技术应用于源代码以实现源代码分析是合理的。尽管在NLP领域有很多深度学习框架，但源代码分析有不同的特点。除了传统的编码方式之外，理解代码的含义还涉及许多角度。源代码表示可以是令牌序列、API调用序列、数据依赖关系图和控制流图，以及程序层次结构等。本教程将讲述在源代码分析中利用深度学习技术的潜在力量的漫长、持续和富有成效的旅程。它将强调如何利用代码表示模型来支持软件工程师执行需要精通编程知识的不同任务。探索性的工作表明，代码确实隐含着可学习的知识，更确切地说是可学习的隐性知识。尽管这些知识在人与人之间不容易转移，但它可以在自动化编程任务之间转移。本文将对源代码分析的未来研究进行展望。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Multi-perspective representation learning for source code analytics (invited tutorial)

Programming languages are artificial and highly restricted languages. But source code is there to tell computers as well as programmers what to do, as an act of communication. Despite its weird syntax and is riddled with different delimiters, the good news is that the very large corpus of open-source code is available. That makes it reasonable to apply machine learning techniques to source code to enable the source code analytics. Despite there are plenty of deep learning frameworks in the field of NLP, source code analytics has different features. In addition to the conventional way of coding, understanding the meaning of code involves many perspectives. The source code representation could be the token sequence, the API call sequence, the data dependency graph, and the control flow graph, as well as the program hierarchy, etc. This tutorial will tell the long, ongoing, and fruitful journey on exploiting the potential power of deep learning techniques in source code analytics. It will highlight that how code representation models can be utilized to support software engineers to perform different tasks that require proficient programming knowledge. The exploratory work show that code does imply the learnable knowledge, more precisely the learnable tacit knowledge. Although such knowledge is not easily transferrable between humans, it can be transferred between the automated programming tasks. A vision for future research will be stated for source code analytics.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

软件产业与工程

自引率

0.00%

发文量

676