Improving Issue-PR Link Prediction via Knowledge-Aware Heterogeneous Graph Learning
Authors: Shuotong Bai; Huaxiao Liu; Enyan Dai; Lei Liu
Journal: IEEE Transactions on Software Engineering, vol. 50, no. 7, pp. 1901-1920
Published: 2024-06-03
DOI: 10.1109/TSE.2024.3408448
URL: https://ieeexplore.ieee.org/document/10546471/
Links between issues and pull requests (PRs) help GitHub developers tackle technical challenges, gain development inspiration, and improve repository maintenance. In real-world repositories, however, these links are still insufficiently established. To address this, existing works focus on issues and PRs themselves and use text similarity, together with additional information such as issue size, to predict issue-PR links, yet their effectiveness is unsatisfactory. The limitation is that issues and PRs are not isolated on GitHub. Rather, they are related to multiple GitHub sources, including repositories and submitters, which, through their diverse relationships, can supply latent but crucial knowledge about technical domains, developmental insights, and cross-repository technical details. To this end, we propose Auto IP Linker (AIPL), which introduces a heterogeneous graph to model multiple GitHub sources and their relationships. Further, it leverages a metapath-based technique to reveal and incorporate this latent information for a more comprehensive understanding of issues and PRs. First, we identify four types of GitHub sources related to issues and PRs (repositories, users, issues, and PRs) as well as their relationships, and model them as task-specific heterogeneous graphs. Next, we analyze the information transmitted among issues or PRs to reveal which knowledge is crucial for them. Based on this analysis, we formulate a series of metapaths and employ the metapath-based technique to incorporate various information when learning knowledge-aware embeddings of issues and PRs. Finally, we infer whether an issue and a PR can be linked based on their embeddings. We evaluate AIPL on real-world data sets collected from GitHub. The results show that, compared to the baselines, AIPL achieves average improvements of 15.94%, 15.19%, 20.52%, and 18.50% in Accuracy, Precision, Recall, and F1-score, respectively.
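To make the idea of metapaths over a heterogeneous graph more concrete, here is a minimal Python sketch. It is not AIPL's implementation: the abstract names only the four node types (repositories, users, issues, PRs), so the relation names `submits` and `contains`, the `HeteroGraph` class, and the two example metapaths are assumptions chosen for illustration. The sketch only shows how following a typed path from an issue yields candidate PRs; it does not learn embeddings.

```python
from collections import defaultdict


class HeteroGraph:
    """Toy heterogeneous graph: nodes are (type, id) pairs, edges carry a relation name."""

    def __init__(self):
        # adj[(src_type, relation, dst_type)][src_id] -> set of dst_ids
        self.adj = defaultdict(lambda: defaultdict(set))

    def add_edge(self, src, relation, dst):
        (src_type, src_id), (dst_type, dst_id) = src, dst
        self.adj[(src_type, relation, dst_type)][src_id].add(dst_id)
        # Also store the reverse edge so a metapath can be walked in either direction.
        self.adj[(dst_type, "rev_" + relation, src_type)][dst_id].add(src_id)

    def metapath_neighbors(self, start_id, metapath):
        """Return the ids reached by following each typed step of the metapath in order."""
        frontier = {start_id}
        for step in metapath:
            edges = self.adj.get(step, {})
            next_frontier = set()
            for node_id in frontier:
                next_frontier |= edges.get(node_id, set())
            frontier = next_frontier
        return frontier


# Hypothetical data: relation names and ids are illustrative assumptions.
g = HeteroGraph()
g.add_edge(("user", "alice"), "submits", ("issue", "i1"))
g.add_edge(("user", "alice"), "submits", ("pr", "p1"))
g.add_edge(("user", "bob"), "submits", ("pr", "p2"))
g.add_edge(("repo", "r1"), "contains", ("issue", "i1"))
g.add_edge(("repo", "r1"), "contains", ("pr", "p1"))
g.add_edge(("repo", "r1"), "contains", ("pr", "p2"))

# Metapath issue -> user -> PR: PRs submitted by the same user as the issue.
ISSUE_USER_PR = [("issue", "rev_submits", "user"), ("user", "submits", "pr")]
# Metapath issue -> repo -> PR: PRs in the same repository as the issue.
ISSUE_REPO_PR = [("issue", "rev_contains", "repo"), ("repo", "contains", "pr")]

same_submitter_prs = g.metapath_neighbors("i1", ISSUE_USER_PR)
same_repo_prs = g.metapath_neighbors("i1", ISSUE_REPO_PR)
print(sorted(same_submitter_prs), sorted(same_repo_prs))
```

In a metapath-based learning setup, neighbor sets like these would typically determine which nodes' features are aggregated into an issue's or PR's embedding; how AIPL defines and weights its metapaths is specified in the paper, not here.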
About the journal:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.