SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects

Joel Ossher, S. Bajracharya, Erik J. Linstead, P. Baldi, C. Lopes
DOI: 10.1109/MSR.2009.5069501
URL: https://doi.org/10.1109/MSR.2009.5069501
Venue: 2009 6th IEEE International Working Conference on Mining Software Repositories
Publication date: 2009-05-16
Citations: 47

Abstract

The open source movement has made vast quantities of source code available online for free, providing an extremely large dataset for empirical study and potential reuse. A major difficulty in fully exploiting this potential is that the data are currently scattered across competing source code repositories, none of which is structured for empirical analysis and cross-project comparison. As a result, software researchers and developers are left to compile their own datasets, resulting in duplicated effort and limited results. To address this challenge, we built SourcererDB, an aggregated repository of statically analyzed and cross-linked open source Java projects. SourcererDB contains local snapshots of 2,852 Java projects taken from SourceForge, Apache and Java.net. These projects are statically analyzed to extract rich structural information, which is then stored in a relational database. References to entities in the 16,058 external jars are resolved and grouped, allowing cross-project usage information to be accessed easily. This paper describes: (a) the mechanism for resolving and grouping these cross-project references, (b) the structure of and the metamodel for the SourcererDB repository, and (c) end-user dataset access mechanisms. Our goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.
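As a rough illustration of what storing extracted structure in a relational database and querying cross-project usage might look like, the sketch below uses Python's built-in sqlite3 module. The table names, column names, and sample rows (projects, entities, relations, app-one, example-lib.jar, and so on) are all hypothetical assumptions made for this example; they are not the actual SourcererDB metamodel, which the paper itself describes.

```python
import sqlite3

# Hypothetical, simplified schema: these table and column names are
# illustrative assumptions, NOT the actual SourcererDB metamodel.
SCHEMA = """
CREATE TABLE projects (
    project_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    kind       TEXT NOT NULL CHECK (kind IN ('source', 'jar'))
);
CREATE TABLE entities (
    entity_id  INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES projects(project_id),
    fqn        TEXT NOT NULL,   -- fully qualified name
    kind       TEXT NOT NULL    -- e.g. 'class', 'method'
);
CREATE TABLE relations (
    relation_id INTEGER PRIMARY KEY,
    kind        TEXT NOT NULL,  -- e.g. 'calls', 'extends'
    source_id   INTEGER NOT NULL REFERENCES entities(entity_id),
    target_id   INTEGER NOT NULL REFERENCES entities(entity_id)
);
"""


def build_example_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO projects VALUES (?, ?, ?)", [
        (1, "app-one", "source"),
        (2, "app-two", "source"),
        (3, "example-lib.jar", "jar"),
    ])
    conn.executemany("INSERT INTO entities VALUES (?, ?, ?, ?)", [
        (10, 1, "one.Main.run", "method"),
        (20, 2, "two.Tool.start", "method"),
        (21, 2, "two.Tool.stop", "method"),
        (30, 3, "lib.Util.parse", "method"),
    ])
    conn.executemany("INSERT INTO relations VALUES (?, ?, ?, ?)", [
        (100, "calls", 10, 30),  # app-one -> jar
        (101, "calls", 20, 30),  # app-two -> jar
        (102, "calls", 21, 30),  # app-two -> jar (same project again)
        (103, "calls", 20, 21),  # within app-two, not cross-project
    ])
    return conn


def cross_project_usage(conn):
    """Count, per jar entity, how many distinct other projects reference it."""
    query = """
        SELECT target.fqn, COUNT(DISTINCT source.project_id) AS n_projects
        FROM relations AS r
        JOIN entities AS source ON source.entity_id = r.source_id
        JOIN entities AS target ON target.entity_id = r.target_id
        JOIN projects AS p ON p.project_id = target.project_id
        WHERE p.kind = 'jar' AND source.project_id <> target.project_id
        GROUP BY target.fqn
        ORDER BY n_projects DESC, target.fqn
    """
    return conn.execute(query).fetchall()
```

Calling cross_project_usage(build_example_db()) on the sample rows reports that lib.Util.parse is referenced from two distinct projects. The point of the sketch is only that once references into external jars are resolved to shared entity rows, cross-project usage reduces to an ordinary join and group-by.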