SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects

2009 6th IEEE International Working Conference on Mining Software Repositories. Pub Date: 2009-05-16. DOI: 10.1109/MSR.2009.5069501

Joel Ossher, S. Bajracharya, Erik J. Linstead, P. Baldi, C. Lopes


Citations: 47

Abstract

The open source movement has made vast quantities of source code freely available online, providing an extremely large dataset for empirical study and potential reuse. A major difficulty in fully exploiting this potential is that the data are currently scattered among competing source code repositories, none of which is structured for empirical analysis and cross-project comparison. As a result, software researchers and developers are left to compile their own datasets, resulting in duplicated effort and limited results. To address this challenge, we built SourcererDB, an aggregated repository of statically analyzed and cross-linked open source Java projects. SourcererDB contains local snapshots of 2,852 Java projects taken from Sourceforge, Apache and Java.net. These projects are statically analyzed to extract rich structural information, which is then stored in a relational database. References to entities in the 16,058 external jars are resolved and grouped, allowing cross-project usage information to be accessed easily. This paper describes: (a) the mechanism for resolving and grouping these cross-project references, (b) the structure of and the metamodel for the SourcererDB repository, and (c) end-user dataset access mechanisms. Our goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.
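To make the idea of cross-project usage information in a relational database concrete, the following is a minimal sketch only. The table names (`projects`, `entities`, `relations`), column names, and sample rows are hypothetical and are not taken from the actual SourcererDB metamodel, which the abstract does not specify. It assumes that references have already been resolved to entity identifiers, and shows how a single query could then count how many distinct projects use each entity defined elsewhere.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE projects (
            project_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE entities (
            entity_id  INTEGER PRIMARY KEY,
            project_id INTEGER NOT NULL REFERENCES projects(project_id),
            fqn        TEXT NOT NULL          -- fully qualified name
        );
        CREATE TABLE relations (
            source_entity INTEGER NOT NULL REFERENCES entities(entity_id),
            target_entity INTEGER NOT NULL REFERENCES entities(entity_id),
            relation_type TEXT NOT NULL
        );
        """
    )
    # Invented sample data: two applications and one shared library jar.
    conn.executemany(
        "INSERT INTO projects VALUES (?, ?)",
        [(1, "app-a"), (2, "app-b"), (3, "lib-jar")],
    )
    conn.executemany(
        "INSERT INTO entities VALUES (?, ?, ?)",
        [
            (1, 1, "a.Main.run"),
            (2, 2, "b.Tool.exec"),
            (3, 3, "org.example.Util.join"),
            (4, 3, "org.example.Util.split"),
            (5, 1, "a.Helper.go"),
        ],
    )
    conn.executemany(
        "INSERT INTO relations VALUES (?, ?, 'CALLS')",
        [(1, 3), (2, 3), (2, 4), (5, 3), (1, 5)],
    )
    return conn


def cross_project_usage(conn):
    """Count distinct projects referencing each entity owned by another project."""
    rows = conn.execute(
        """
        SELECT t.fqn, COUNT(DISTINCT s.project_id) AS using_projects
        FROM relations r
        JOIN entities s ON s.entity_id = r.source_entity
        JOIN entities t ON t.entity_id = r.target_entity
        WHERE s.project_id <> t.project_id   -- skip references within one project
        GROUP BY t.fqn
        ORDER BY using_projects DESC, t.fqn
        """
    )
    return rows.fetchall()


if __name__ == "__main__":
    for fqn, count in cross_project_usage(build_demo_db()):
        print(fqn, count)
```

In this toy data, the call from `a.Main.run` to `a.Helper.go` stays inside one project and is therefore excluded, while the two calls from `app-a` into `org.example.Util.join` count as a single using project.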
