{"title":"An Empirical Study of Multiple Names and Email Addresses in OSS Version Control Repositories","authors":"Jiaxin Zhu, Jun Wei","doi":"10.1109/MSR.2019.00068","DOIUrl":null,"url":null,"abstract":"Data produced by version control systems are widely used in software research and development. Version control data users always use the name or email address field to identify the committer or author of a modification. However, developers may use multiple names and email addresses, which brings difficulties for identification of distinct developers. In this paper, we sample 450 Git repositories from GitHub to study the multiple names and email addresses of developers. We conduct a conservative estimation of its prevalence and impact on related measurements. We merge the multiple names and email addresses of a developer through a method of high precision. With the merged identities, we obtain a number of interesting findings, e.g., about 6% of the developers used multiple names or email addresses in more than 60% of the repositories, and they contributed about half of all the commits. Our impact analysis shows that the multiple names and email addresses issue cannot be ignored for the basic related measurements, e.g., the number of developers in a repository. Our results could help researchers and practitioners have a more clear understanding of multiple names and email addresses in practice to improve the accuracy of related measurements.","PeriodicalId":6706,"journal":{"name":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","volume":"1 1","pages":"409-420"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSR.2019.00068","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8
Abstract
Data produced by version control systems are widely used in software research and development. Version control data users always use the name or email address field to identify the committer or author of a modification. However, developers may use multiple names and email addresses, which brings difficulties for identification of distinct developers. In this paper, we sample 450 Git repositories from GitHub to study the multiple names and email addresses of developers. We conduct a conservative estimation of its prevalence and impact on related measurements. We merge the multiple names and email addresses of a developer through a method of high precision. With the merged identities, we obtain a number of interesting findings, e.g., about 6% of the developers used multiple names or email addresses in more than 60% of the repositories, and they contributed about half of all the commits. Our impact analysis shows that the multiple names and email addresses issue cannot be ignored for the basic related measurements, e.g., the number of developers in a repository. Our results could help researchers and practitioners have a more clear understanding of multiple names and email addresses in practice to improve the accuracy of related measurements.