内存数据库支持源代码搜索和分析

2011 18th Working Conference on Reverse Engineering Pub Date : 2011-10-17 DOI:10.1109/WCRE.2011.60

O. Panchenko

{"title":"内存数据库支持源代码搜索和分析","authors":"O. Panchenko","doi":"10.1109/WCRE.2011.60","DOIUrl":null,"url":null,"abstract":"Software engineers are coerced to deal with a large amount of information about source code. Appropriate tools could assist to handle it, but existing tools are not capable of processing and presenting such a large amount of information sufficiently. With the advent of in-memory column-oriented databases the performance of some data-intensive applications could be significantly improved. This has resulted in a completely new user experience of those applications and enabled new use-cases. This PhD thesis investigates the applicability of in-memory column-oriented databases for supporting daily software engineering activities. The major research question addressed in this thesis is as follows: does in-memory column-oriented database technology provide the necessary performance advantages for working interactively with large amounts of fine-grained structural information about source code? To investigate this research question two scenarios have been selected that particularly suffer from low performance. The first selected scenario is source code search. Existing source code repositories contain a large amount of structural data. Interface definitions, abstract syntax trees, and call graphs are examples of such structural data. Existing tools have solved the performance problems either by reducing the amount of data because of using a coarse-grained representation, or by preparing answers to developers' questions in advance, or by reducing the scope of search. All currently existing alternatives result in the loss of developers' productivity. The second scenario is source code analytics. To complete reverse engineering tasks software engineers often are required to analyze a number of atomic facts that have been extracted from source code. Examples of such atomic facts are occurrences of certain syntactic patterns in code, software product metrics or violations of development guidelines. Each fact typically has several characteristics, such as the type of the fact, the location in code where found, and some attributes. Particularly, analysis of large software systems requires the ability to process a large amount of such facts efficiently. During industrial experiments conducted for this thesis it was evidenced that in-memory technology provides performance gains that improve developers' productivity and enable scenarios previously not possible. This thesis overlaps both software engineering and database technology. From the viewpoint of software engineering, it seeks to find a way to support developers in dealing with a large amount of structural data. From the viewpoint of database technology, source code search and analytics are domains for studying fundamental issues of storing and querying structural data.","PeriodicalId":350863,"journal":{"name":"2011 18th Working Conference on Reverse Engineering","volume":"73 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"In-Memory Database Support for Source Code Search and Analytics\",\"authors\":\"O. Panchenko\",\"doi\":\"10.1109/WCRE.2011.60\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Software engineers are coerced to deal with a large amount of information about source code. Appropriate tools could assist to handle it, but existing tools are not capable of processing and presenting such a large amount of information sufficiently. With the advent of in-memory column-oriented databases the performance of some data-intensive applications could be significantly improved. This has resulted in a completely new user experience of those applications and enabled new use-cases. This PhD thesis investigates the applicability of in-memory column-oriented databases for supporting daily software engineering activities. The major research question addressed in this thesis is as follows: does in-memory column-oriented database technology provide the necessary performance advantages for working interactively with large amounts of fine-grained structural information about source code? To investigate this research question two scenarios have been selected that particularly suffer from low performance. The first selected scenario is source code search. Existing source code repositories contain a large amount of structural data. Interface definitions, abstract syntax trees, and call graphs are examples of such structural data. Existing tools have solved the performance problems either by reducing the amount of data because of using a coarse-grained representation, or by preparing answers to developers' questions in advance, or by reducing the scope of search. All currently existing alternatives result in the loss of developers' productivity. The second scenario is source code analytics. To complete reverse engineering tasks software engineers often are required to analyze a number of atomic facts that have been extracted from source code. Examples of such atomic facts are occurrences of certain syntactic patterns in code, software product metrics or violations of development guidelines. Each fact typically has several characteristics, such as the type of the fact, the location in code where found, and some attributes. Particularly, analysis of large software systems requires the ability to process a large amount of such facts efficiently. During industrial experiments conducted for this thesis it was evidenced that in-memory technology provides performance gains that improve developers' productivity and enable scenarios previously not possible. This thesis overlaps both software engineering and database technology. From the viewpoint of software engineering, it seeks to find a way to support developers in dealing with a large amount of structural data. From the viewpoint of database technology, source code search and analytics are domains for studying fundamental issues of storing and querying structural data.\",\"PeriodicalId\":350863,\"journal\":{\"name\":\"2011 18th Working Conference on Reverse Engineering\",\"volume\":\"73 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 18th Working Conference on Reverse Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/WCRE.2011.60\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 18th Working Conference on Reverse Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WCRE.2011.60","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

软件工程师被迫处理大量关于源代码的信息。适当的工具可以帮助处理它，但现有的工具无法充分处理和呈现如此大量的信息。随着内存中面向列的数据库的出现，一些数据密集型应用程序的性能可以得到显著提高。这导致了这些应用程序的全新用户体验，并启用了新的用例。这篇博士论文研究了内存中面向列的数据库在支持日常软件工程活动中的适用性。本文主要研究的问题如下:内存中面向列的数据库技术是否为与大量细粒度的源代码结构信息交互提供了必要的性能优势?为了调查这个研究问题，选择了两个特别遭受低绩效的场景。第一个选择的场景是源代码搜索。现有的源代码存储库包含大量的结构数据。接口定义、抽象语法树和调用图都是这种结构数据的例子。现有的工具通过减少数据量(因为使用粗粒度表示)、提前准备对开发人员的问题的答案、或者减少搜索范围来解决性能问题。所有当前存在的替代方案都会导致开发人员生产力的损失。第二个场景是源代码分析。为了完成逆向工程任务，软件工程师通常需要分析从源代码中提取的大量原子事实。这种原子事实的例子是代码中某些语法模式的出现、软件产品度量或对开发指南的违反。每个事实通常有几个特征，比如事实的类型、在代码中找到的位置和一些属性。特别是，大型软件系统的分析需要能够有效地处理大量这样的事实。在为本论文进行的工业实验中，证明内存技术提供了性能提升，提高了开发人员的生产力，并实现了以前不可能实现的场景。本文涉及软件工程和数据库技术。从软件工程的角度来看，它试图找到一种方法来支持开发人员处理大量的结构数据。从数据库技术的角度来看，源代码搜索和分析是研究结构化数据存储和查询基本问题的领域。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

In-Memory Database Support for Source Code Search and Analytics

Software engineers are coerced to deal with a large amount of information about source code. Appropriate tools could assist to handle it, but existing tools are not capable of processing and presenting such a large amount of information sufficiently. With the advent of in-memory column-oriented databases the performance of some data-intensive applications could be significantly improved. This has resulted in a completely new user experience of those applications and enabled new use-cases. This PhD thesis investigates the applicability of in-memory column-oriented databases for supporting daily software engineering activities. The major research question addressed in this thesis is as follows: does in-memory column-oriented database technology provide the necessary performance advantages for working interactively with large amounts of fine-grained structural information about source code? To investigate this research question two scenarios have been selected that particularly suffer from low performance. The first selected scenario is source code search. Existing source code repositories contain a large amount of structural data. Interface definitions, abstract syntax trees, and call graphs are examples of such structural data. Existing tools have solved the performance problems either by reducing the amount of data because of using a coarse-grained representation, or by preparing answers to developers' questions in advance, or by reducing the scope of search. All currently existing alternatives result in the loss of developers' productivity. The second scenario is source code analytics. To complete reverse engineering tasks software engineers often are required to analyze a number of atomic facts that have been extracted from source code. Examples of such atomic facts are occurrences of certain syntactic patterns in code, software product metrics or violations of development guidelines. Each fact typically has several characteristics, such as the type of the fact, the location in code where found, and some attributes. Particularly, analysis of large software systems requires the ability to process a large amount of such facts efficiently. During industrial experiments conducted for this thesis it was evidenced that in-memory technology provides performance gains that improve developers' productivity and enable scenarios previously not possible. This thesis overlaps both software engineering and database technology. From the viewpoint of software engineering, it seeks to find a way to support developers in dealing with a large amount of structural data. From the viewpoint of database technology, source code search and analytics are domains for studying fundamental issues of storing and querying structural data.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2011 18th Working Conference on Reverse Engineering

自引率

0.00%

发文量