Data collection and analysis of GitHub repositories and users
Fragkiskos Chatziasimidis, I. Stamelos
2015 6th International Conference on Information, Intelligence, Systems and Applications (IISA), published 2015-07-06
DOI: https://doi.org/10.1109/IISA.2015.7388026
Citation count: 17
Abstract
In this paper, we present the collection and mining of GitHub data, aiming to understand GitHub user behavior and project success factors. We collected information about approximately 100K projects and 10K GitHub users/owners of these projects via the GitHub API. Subsequently, we statistically analyzed the data, discretized feature values with the k-means algorithm, and finally applied the Apriori algorithm in Weka to discover association rules. Assuming that project success can be measured by the number of downloads, we kept only the rules whose right-hand side had a download count above a threshold of 1000 downloads. The results provide interesting insight into the GitHub ecosystem and yield seven success rules for GitHub projects.
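The pipeline described in the abstract (k-means discretization, Apriori mining, then filtering on the rule consequent) can be illustrated with a minimal pure-Python sketch. It is not the authors' Weka setup. The toy records, the feature names `stars` and `forks`, the choice of two bins, and the support and confidence thresholds are all assumptions made for illustration. For simplicity, the download count is thresholded directly at 1000 rather than discretized first.

```python
from itertools import combinations

def kmeans_1d(values, k, iters=100):
    """Plain 1-D k-means; returns one bin index per value (0 = lowest bin)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centers spread evenly over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    n = len(transactions)
    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while level:
        current = {}
        for s in level:
            sup = sum(s <= t for t in transactions) / n
            if sup >= min_support:
                current[s] = sup
        frequent.update(current)
        # Join: merge frequent itemsets that differ in exactly one item.
        joined = {a | b for a, b in combinations(list(current), 2)
                  if len(a | b) == len(a) + 1}
        # Prune: every subset one item smaller must itself be frequent.
        level = {c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, len(c) - 1))}
    return frequent

def rules_for(frequent, target, min_conf):
    """Rules 'antecedent -> target' with confidence >= min_conf."""
    rules = []
    for itemset, sup in frequent.items():
        if target in itemset and len(itemset) > 1:
            antecedent = itemset - {target}
            conf = sup / frequent[antecedent]
            if conf >= min_conf:
                rules.append((sorted(antecedent), target, sup, conf))
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0]))

# Invented toy records (NOT data from the paper).
projects = [
    {"stars": 5, "forks": 1, "downloads": 20},
    {"stars": 8, "forks": 2, "downloads": 50},
    {"stars": 12, "forks": 3, "downloads": 90},
    {"stars": 900, "forks": 150, "downloads": 5000},
    {"stars": 1100, "forks": 200, "downloads": 8000},
    {"stars": 950, "forks": 4, "downloads": 3000},
    {"stars": 10, "forks": 180, "downloads": 400},
    {"stars": 1000, "forks": 170, "downloads": 12000},
]
TARGET = "downloads>1000"  # success criterion named in the abstract

# Discretize each feature into two bins, then build one item set per project.
bins = {f: kmeans_1d([p[f] for p in projects], k=2) for f in ("stars", "forks")}
transactions = []
for i, p in enumerate(projects):
    items = {f"{f}=bin{bins[f][i]}" for f in bins}
    if p["downloads"] > 1000:
        items.add(TARGET)
    transactions.append(frozenset(items))

frequent = apriori(transactions, min_support=0.25)
rules = rules_for(frequent, TARGET, min_conf=0.8)
for antecedent, consequent, sup, conf in rules:
    print(f"{antecedent} -> {consequent} (support={sup:.3f}, confidence={conf:.2f})")
```

On real data, the number of bins per feature and the support and confidence thresholds would need tuning. A library implementation such as Weka's Apriori would replace the hand-written one here.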