Yuxing Ma, Chris Bogart, Sadika Amreen, R. Zaretzki, A. Mockus
{"title":"代码世界:用于挖掘开源VCS数据世界的基础设施","authors":"Yuxing Ma, Chris Bogart, Sadika Amreen, R. Zaretzki, A. Mockus","doi":"10.1109/MSR.2019.00031","DOIUrl":null,"url":null,"abstract":"Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flows? To answer such questions we a) create a very large and frequently updated collection of version control data for FLOSS projects named World of Code (WoC) and b) provide basic tools for conducting research that depends on measuring interdependencies among all FLOSS projects. Our current WoC implementation is capable of being updated on a monthly basis and contains over 12B git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"143-154"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"73","resultStr":"{\"title\":\"World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data\",\"authors\":\"Yuxing Ma, Chris Bogart, Sadika Amreen, R. Zaretzki, A. Mockus\",\"doi\":\"10.1109/MSR.2019.00031\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flows? To answer such questions we a) create a very large and frequently updated collection of version control data for FLOSS projects named World of Code (WoC) and b) provide basic tools for conducting research that depends on measuring interdependencies among all FLOSS projects. Our current WoC implementation is capable of being updated on a monthly basis and contains over 12B git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.\",\"PeriodicalId\":6706,\"journal\":{\"name\":\"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)\",\"volume\":\"1 1\",\"pages\":\"143-154\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-05-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"73\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MSR.2019.00031\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSR.2019.00031","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 73
摘要
开源软件(OSS)对现代社会是必不可少的,尽管对单个(通常是中心)项目进行了大量研究,但对整个OSS生态系统的外围只有有限的了解。例如,周边数千万的项目是如何通过技术依赖、代码共享或知识流相互连接的?为了回答这些问题,我们a)为FLOSS项目创建了一个非常大且经常更新的版本控制数据集,名为代码世界(World of Code, WoC), b)提供了基本的工具,用于进行依赖于测量所有FLOSS项目之间相互依赖性的研究。我们当前的WoC实现能够每月更新一次,包含超过120亿个git对象。为了评估其研究潜力并为其使用创建场景,我们使用WoC进行了几项研究任务。特别是,我们发现它能够支持趋势评估、生态系统测量和包使用的确定。我们期望WoC能够推动对OSS开发的全球特性的调查,从而增加整个OSS生态系统的弹性。我们的基础设施有助于发现关键的技术依赖、代码流和社会网络,这些网络为确定驱动FLOSS活动和创新的关系的结构和演变提供了基础。
World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data
Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flows? To answer such questions we a) create a very large and frequently updated collection of version control data for FLOSS projects named World of Code (WoC) and b) provide basic tools for conducting research that depends on measuring interdependencies among all FLOSS projects. Our current WoC implementation is capable of being updated on a monthly basis and contains over 12B git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.