Detecting question relatedness in programming Q&A communities via bimodal feature fusion

Automated Software Engineering | IF 3.1 | Zone 2 (Computer Science) | JCR Q3 (COMPUTER SCIENCE, SOFTWARE ENGINEERING) | Pub Date: 2025-01-04 | DOI: 10.1007/s10515-024-00482-5

Qirong Bu, Xiangqiang Guo, Xia Sun, Jingjing Jiang, Xiaodi Zhao, Wang Zou, Xuxin Wang, Jianqiang Yan


Citations: 0

Abstract



Community question-answering websites for programmers, typified by Stack Overflow, are widely used. Users post questions and share their knowledge and experience by answering them. However, the accumulation of a large number of similar questions reduces the efficiency and quality of the community. To tackle this issue, prior work detects question relatedness using the complete textual information in question posts, but almost all of it ignores the rich source code in the posts, which also complements the semantics of the questions. In this paper, we propose a bimodal framework for relatedness detection that combines text features and code features. Question pairs are encoded with a text pre-trained language model (e.g., SOBERT) and a code pre-trained language model (e.g., UniXcoder), respectively. Drawing on the semantic modeling capabilities of these pre-trained models, we obtain bimodal features that measure the similarity of questions from both the text and the code perspective. However, directly concatenating and fusing these features may hurt performance because the two kinds of features differ significantly. To address this, we additionally use a cross-attention mechanism to derive supplementary features from the bimodal features so that they can be fused properly. Cross-attention captures semantic information from both modalities and integrates their representations. The supplementary features measure the semantic relationship between text-guided and code-guided features, bridging the semantic gap between them. We conducted extensive experiments on two relatedness datasets from the English and Chinese domains. The results show that our approach improves significantly over the baseline approaches and achieves strong performance on Macro-Precision, Macro-Recall, and Macro-F1.
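The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the random vectors stand in for token embeddings that would come from a text encoder such as SOBERT and a code encoder such as UniXcoder, the attention has no learned projections, and the helper names (`cross_attention`, `mean_pool`, `fuse`) and the choice of mean pooling and concatenation are assumptions made for illustration only.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention with queries from one modality and
    keys/values from the other. Learned projections are omitted (assumption)."""
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                         for j in range(d)])
    return attended


def mean_pool(vectors):
    """Average a sequence of token vectors into a single feature vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def fuse(text_tokens, code_tokens):
    """Concatenate the two unimodal features with two cross-attended
    (supplementary) features. Hypothetical fusion layout."""
    text_feat = mean_pool(text_tokens)
    code_feat = mean_pool(code_tokens)
    # Text queries attend over code tokens, and vice versa.
    text_guided = mean_pool(cross_attention(text_tokens, code_tokens))
    code_guided = mean_pool(cross_attention(code_tokens, text_tokens))
    return text_feat + code_feat + text_guided + code_guided


def random_embeddings(n_tokens, dim, rng):
    """Stand-in for encoder output: n_tokens vectors of size dim."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]


rng = random.Random(0)
text_tokens = random_embeddings(5, 8, rng)   # placeholder for text-encoder output
code_tokens = random_embeddings(7, 8, rng)   # placeholder for code-encoder output
fused = fuse(text_tokens, code_tokens)
print(len(fused))  # 4 * dim = 32
```

In a real system the fused vector would feed a classifier that predicts the relatedness label; the sketch stops at the feature vector because the abstract does not specify the classification head.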


Source journal

Automated Software Engineering (Engineering & Technology, Computer Science: Software Engineering)

CiteScore: 4.80

Self-citation rate: 11.80%

Articles published:

Review time: >12 weeks

About the journal: This journal publishes research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.