Qirong Bu, Xiangqiang Guo, Xia Sun, Jingjing Jiang, Xiaodi Zhao, Wang Zou, Xuxin Wang, Jianqiang Yan
{"title":"基于双峰特征融合的编程问答社区问题相关性检测","authors":"Qirong Bu, Xiangqiang Guo, Xia Sun, Jingjing Jiang, Xiaodi Zhao, Wang Zou, Xuxin Wang, Jianqiang Yan","doi":"10.1007/s10515-024-00482-5","DOIUrl":null,"url":null,"abstract":"<div><p>Programming community-based question and answering websites, represented by Stack Overflow, are popular among programmers. Users post questions and share their knowledge and experience through answering. Nonetheless, the accumulation of a large number of similar questions reduces the efficiency and quality of the community. To tackle this issue, related works utilize the complete textual information in the question posts for detecting question relatedness. But they almost all ignore the rich source code information in the posts, which also complements the semantics of the questions. In this paper, we propose a bimodal framework for relatedness detection based on the combination of text features and code features. Question pairs are encoded using a text pre-trained language model (e.g., SOBERT) and a code pre-trained language model (e.g., UniXcoder), respectively. With the powerful semantic modeling capabilities of pre-trained models, we obtain bimodal features that measure the similarity of questions from both text and code perspectives. However, directly concatenating and fusing these features may have a negative impact due to the significant differences between them. To address this, we additionally leverage the cross-attention mechanism to derive supplementary features of these bimodal features for the correct feature fusion. Cross-attention captures semantic understanding from both modalities, integrating their representations. These supplementary features measure the semantic relationship between text-guided and code-guided features, effectively bridging the semantic gap. We conducted extensive experiments on two related datasets from both the English and Chinese domains. The results show that our approach improves significantly over the baseline approaches, achieving advanced performance in the metrics of Macro-Precision, Macro-Recall and Macro-F1.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 1","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Detecting question relatedness in programming Q&A communities via bimodal feature fusion\",\"authors\":\"Qirong Bu, Xiangqiang Guo, Xia Sun, Jingjing Jiang, Xiaodi Zhao, Wang Zou, Xuxin Wang, Jianqiang Yan\",\"doi\":\"10.1007/s10515-024-00482-5\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Programming community-based question and answering websites, represented by Stack Overflow, are popular among programmers. Users post questions and share their knowledge and experience through answering. Nonetheless, the accumulation of a large number of similar questions reduces the efficiency and quality of the community. To tackle this issue, related works utilize the complete textual information in the question posts for detecting question relatedness. But they almost all ignore the rich source code information in the posts, which also complements the semantics of the questions. In this paper, we propose a bimodal framework for relatedness detection based on the combination of text features and code features. Question pairs are encoded using a text pre-trained language model (e.g., SOBERT) and a code pre-trained language model (e.g., UniXcoder), respectively. With the powerful semantic modeling capabilities of pre-trained models, we obtain bimodal features that measure the similarity of questions from both text and code perspectives. However, directly concatenating and fusing these features may have a negative impact due to the significant differences between them. To address this, we additionally leverage the cross-attention mechanism to derive supplementary features of these bimodal features for the correct feature fusion. Cross-attention captures semantic understanding from both modalities, integrating their representations. These supplementary features measure the semantic relationship between text-guided and code-guided features, effectively bridging the semantic gap. We conducted extensive experiments on two related datasets from both the English and Chinese domains. The results show that our approach improves significantly over the baseline approaches, achieving advanced performance in the metrics of Macro-Precision, Macro-Recall and Macro-F1.</p></div>\",\"PeriodicalId\":55414,\"journal\":{\"name\":\"Automated Software Engineering\",\"volume\":\"32 1\",\"pages\":\"\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-01-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Automated Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10515-024-00482-5\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Automated Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10515-024-00482-5","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Detecting question relatedness in programming Q&A communities via bimodal feature fusion
Programming community-based question and answering websites, represented by Stack Overflow, are popular among programmers. Users post questions and share their knowledge and experience through answering. Nonetheless, the accumulation of a large number of similar questions reduces the efficiency and quality of the community. To tackle this issue, related works utilize the complete textual information in the question posts for detecting question relatedness. But they almost all ignore the rich source code information in the posts, which also complements the semantics of the questions. In this paper, we propose a bimodal framework for relatedness detection based on the combination of text features and code features. Question pairs are encoded using a text pre-trained language model (e.g., SOBERT) and a code pre-trained language model (e.g., UniXcoder), respectively. With the powerful semantic modeling capabilities of pre-trained models, we obtain bimodal features that measure the similarity of questions from both text and code perspectives. However, directly concatenating and fusing these features may have a negative impact due to the significant differences between them. To address this, we additionally leverage the cross-attention mechanism to derive supplementary features of these bimodal features for the correct feature fusion. Cross-attention captures semantic understanding from both modalities, integrating their representations. These supplementary features measure the semantic relationship between text-guided and code-guided features, effectively bridging the semantic gap. We conducted extensive experiments on two related datasets from both the English and Chinese domains. The results show that our approach improves significantly over the baseline approaches, achieving advanced performance in the metrics of Macro-Precision, Macro-Recall and Macro-F1.
期刊介绍:
This journal details research, tutorial papers, survey and accounts of significant industrial experience in the foundations, techniques, tools and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences and workshops.