Two Approaches to Survival Analysis of Open Source Python Projects
Derek Robinson, Keanelek Enns, Neha Koulecar, Manish Sihag. In 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC). DOI: https://doi.org/10.1145/3524610.3527871

A recent study applied frequentist survival analysis methods to a subset of the Software Heritage Graph and determined which attributes of an open-source software project contribute to its health. This paper serves as an exact replication of that study. In addition, Bayesian survival analysis methods were applied to the same dataset, and an additional project attribute was studied to serve as a conceptual replication. Both analyses focus on the effects of certain attributes on the survival of open-source software projects as measured by their revision activity. Methods such as the Kaplan-Meier estimator, the Cox proportional-hazards model, and the visualization of posterior survival functions were used for each of the project attributes. The results show that projects which publish major releases, have repositories on multiple hosting services, possess a large team of developers, and make frequent revisions have a higher likelihood of long-term survival. The findings were similar to those of the original study; however, a deeper look revealed quantitative inconsistencies.
Clone-based code method usage pattern mining
Zhipeng Xue, Yuanliang Zhang, Rulin Xu. In 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC). DOI: https://doi.org/10.1145/3524610.3527880

When programmers retrieve a code method and want to reuse it, they need to understand the method's usage patterns. However, it is difficult to obtain usage information for the retrieved method, since it may have only a brief comment and few available usage examples. In this paper, we propose an approach called LUPIN (cLone-based Usage Pattern mIniNg) to mine the usage patterns of such methods, which do not appear widely in the code repository. The key idea of LUPIN is that cloned code of the target method is likely to have a similar usage pattern, so more usage information for the target method can be collected from the usage examples of its clones. From the amplified usage examples, we mine the usage pattern of the target method by frequent subsequence mining, after program slicing and code normalization. Our evaluation shows that LUPIN can mine four categories of usage patterns with an average precision of 0.65.
GitQ- Towards Using Badges as Visual Cues for GitHub Projects
Akhila Sri Manasa Venigalla, Kowndinya Boyalakunta, S. Chimalakonda. In 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC). DOI: https://doi.org/10.1145/3524610.3527876

GitHub hosts millions of software repositories, enabling developers to contribute to many projects in multiple ways. Most of the information about the repositories is text-based, in the form of stars, forks, commits, and so on. However, developers willing to contribute to projects on GitHub often find it challenging to select appropriate projects to contribute to or reuse, due to the large number of repositories present on GitHub. Obtaining the required information often becomes a tedious process, as one has to carefully mine information hidden inside the repository. To alleviate these effort-intensive mining procedures, researchers have proposed npm badges to outline information relating to the build status of a project. However, these badges are static, and their usage is limited to package-dependency and build details. Adding visual cues such as badges to repositories could reduce the search space for developers. Hence, we present GitQ, which automatically augments GitHub repositories with badges representing information about source code and project maintenance. Presenting GitQ as a browser plugin for GitHub makes it easily accessible to developers using GitHub. GitQ was evaluated with 15 developers based on the UTAUT model to understand developer perception of its usefulness. We observed that 11 out of 15 developers perceived GitQ to be useful in identifying the right set of repositories using visual cues such as those generated by GitQ. The source code and tool are available for download on GitHub at https://github.com/gitq-for-github/plugin, and the demo can be found at https://youtu.be/c0yohmIat3A.
Learning to Represent Programs with Heterogeneous Graphs
Wenhan Wang, Kechi Zhang, Ge Li, Zhi Jin. In 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC). DOI: https://doi.org/10.1145/3524610.3527905

Code representation, which transforms programs into vectors carrying semantics, is essential for source code processing. Recent years have witnessed the effectiveness of incorporating structural information (i.e., graphs) into code representations. Specifically, the abstract syntax tree (AST) and the AST-augmented graph of a program contain much structural and semantic information, and most existing studies apply them for code representation. However, the graphs adopted by existing approaches are homogeneous: they discard the type information of the edges and nodes within the AST, which may hinder the representation model. In this paper, we propose to leverage the type information in the graph for code representation. Specifically, we propose the heterogeneous program graph (HPG), which makes the types of the nodes and the edges explicit. Furthermore, we employ the heterogeneous graph transformer (HGT) architecture to generate representations based on HPG, taking the type information into account during processing. With the additional types in HPG, our approach can capture complex structural information, produce accurate and fine-grained representations, and ultimately perform well on downstream tasks. Our in-depth evaluations on four classic datasets for two typical tasks (i.e., method name prediction and code classification) demonstrate that the heterogeneous types in HPG benefit the representation models. Our proposed HPG+HGT also outperforms the SOTA baselines on the subject tasks and datasets.
Semantic Similarity Metrics for Evaluating Source Code Summarization
Sakib Haque, Zachary Eberhart, Aakash Bansal, Collin McMillan. In 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC).

Source code summarization involves creating brief natural-language descriptions of source code. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, because summaries are highly valuable to programmers while writing and maintaining documentation by hand is costly. Almost all current work is based on machine learning models trained on big data: large datasets of code paired with summaries of that code are used to train, e.g., an encoder-decoder neural model. The model's output predictions are then evaluated against a set of reference summaries; the input is code the model has not seen, and each prediction is compared to a reference. The means by which a prediction is compared to a reference is essentially word overlap, calculated via a metric such as BLEU or ROUGE. The problem with using word overlap is that not all words in a sentence have the same importance, and many words have synonyms. As a result, the calculated similarity may not match the similarity perceived by human readers. In this paper, we conduct an experiment to measure the degree to which various word overlap metrics correlate with human-rated similarity of predicted and reference summaries. We evaluate alternatives based on current work in semantic similarity metrics and propose recommendations for the evaluation of source code summarization.