
Latest publications from IEEE Transactions on Software Engineering

Vulnerability Detection via Multiple-Graph-Based Code Representation
IF 6.5, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-07-12. DOI: 10.1109/TSE.2024.3427815
Fangcheng Qiu;Zhongxin Liu;Xing Hu;Xin Xia;Gang Chen;Xinyu Wang
During software development and maintenance, vulnerability detection is an essential part of software quality assurance. Although many program-analysis-based and machine-learning-based approaches have been proposed to detect vulnerabilities automatically, they rely on explicit rules or patterns defined by security experts and suffer from either high false positives or high false negatives. Recently, an increasing number of studies leverage deep learning techniques, especially Graph Neural Networks (GNNs), to detect vulnerabilities. These approaches use program analysis to represent program semantics as graphs and perform graph analysis to detect vulnerabilities. However, they suffer from two main problems: (i) existing GNN-based techniques do not effectively learn the structural and semantic features of source code for vulnerability detection, and (ii) they tend to ignore fine-grained information in source code. To tackle these problems, this paper proposes a novel vulnerability detection approach, named MGVD (Multiple-Graph-Based Vulnerability Detection), to detect vulnerable functions. To effectively learn structural and semantic features from source code, MGVD represents each function in three different forms: two statement graphs and a sequence of tokens. It then encodes these representations into a three-channel feature matrix that contains both the structural and the semantic features of the function, and adds a weight allocation layer to distribute the weights between them. To overcome the second problem, MGVD constructs each graph representation of the input function from multiple different graphs instead of a single graph. Each graph focuses on one statement in the function, and its nodes denote the related statements and their fine-grained code elements.
Finally, MGVD uses a CNN to decide whether the function is vulnerable based on this feature matrix. We conduct experiments on three vulnerability datasets with a total of 30,341 vulnerable functions and 127,931 non-vulnerable functions. The experimental results show that our method outperforms the state of the art by 9.68%–10.28% in terms of F1-score.
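As a rough illustration of the three-channel encoding described in the abstract, the sketch below builds toy embeddings for two statement graphs and a token sequence, stacks them into a three-channel matrix, and blends the structural channels against the semantic one with a fixed weight. The embedding scheme, the edge/token contents, and the weight `w_struct` are invented assumptions for illustration; this is not the authors' implementation.

```python
# Illustrative sketch only: toy stand-in for MGVD's three-channel feature
# matrix and weight allocation layer. The embedding and the fixed weight
# w_struct are invented assumptions, not the paper's trained components.

def embed(items, dim=4):
    # Deterministic toy embedding: bucket each item by its character sum.
    vec = [0.0] * dim
    for it in items:
        vec[sum(map(ord, it)) % dim] += 1.0
    return vec

def encode_channels(graph_a, graph_b, tokens, dim=4):
    """Stack the two statement-graph encodings and the token encoding
    into a 3-channel feature matrix (one row per representation)."""
    return [embed(graph_a, dim), embed(graph_b, dim), embed(tokens, dim)]

def weight_allocation(channels, w_struct=0.6):
    """Blend the averaged structural channels against the semantic one."""
    struct = [(a + b) / 2 for a, b in zip(channels[0], channels[1])]
    sem = channels[2]
    return [w_struct * s + (1 - w_struct) * t for s, t in zip(struct, sem)]

channels = encode_channels(
    ["if->call", "call->ret"],          # toy edges of statement graph 1
    ["decl->use", "use->ret"],          # toy edges of statement graph 2
    ["if", "buf", "strcpy", "return"],  # token sequence
)
fused = weight_allocation(channels)
```

In the paper the feature matrix feeds a CNN classifier; here the fused vector merely shows the weighting mechanics between structural and semantic channels.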
IEEE Transactions on Software Engineering, vol. 50, no. 8, pp. 2178-2199.
Citations: 0
Unity is Strength: Enhancing Precision in Reentrancy Vulnerability Detection of Smart Contract Analysis Tools
IF 7.4, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-07-12. DOI: 10.1109/tse.2024.3427321
Zexu Wang, Jiachi Chen, Peilin Zheng, Yu Zhang, Weizhe Zhang, Zibin Zheng
Citations: 0
BinCola: Diversity-Sensitive Contrastive Learning for Binary Code Similarity Detection
IF 6.5, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-07-08. DOI: 10.1109/TSE.2024.3411072
Shuai Jiang;Cai Fu;Shuai He;Jianqiang Lv;Lansheng Han;Hong Hu
Binary Code Similarity Detection (BCSD) is a fundamental binary analysis technique in the area of software security. Recently, advanced deep learning algorithms have been integrated into BCSD platforms to achieve superior performance on well-known benchmarks. However, real-world large programs embed more complex diversity due to different compilers, various optimization levels, multiple architectures, and even obfuscation. Existing BCSD solutions suffer from low accuracy in such complicated real-world application scenarios. In this paper, we propose BinCola, a novel Transformer-based dual diversity-sensitive contrastive learning framework that comprehensively considers the diversity of compiler options and candidate functions in real-world application scenarios and employs the attention mechanism to fuse multi-granularity function features to enhance generality and scalability. BinCola simultaneously compares multiple candidate functions across various compilation-option scenarios to learn the differences caused by distinct compiler options and different candidate functions. We evaluate BinCola's performance in a variety of ways, including binary similarity detection and real-world vulnerability search in multiple application scenarios.
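Contrastive learning over multiple candidates, as described above, can be sketched with the standard InfoNCE objective: one anchor function, one positive (the same source function under different compiler options), and several negatives. This generic loss is an assumption chosen for illustration, not necessarily BinCola's exact objective.

```python
import math

# Hedged sketch: generic InfoNCE contrastive loss with multiple candidates.
# Embeddings and the temperature tau are toy values, not BinCola's.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchor, positive, negatives, tau=0.1):
    """Lower loss when the anchor embedding is closer to the positive
    than to any of the negative candidates."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / tau) for s in sims]
    return -math.log(exps[0] / sum(exps))

# Toy embeddings: the same function compiled at -O0 and -O2 should be close.
loss_matched = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# A mismatched positive yields a much larger loss.
loss_mismatched = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing such a loss pulls compilations of the same function together in embedding space while pushing different functions apart, which is the property a similarity detector then exploits at query time.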
IEEE Transactions on Software Engineering, vol. 50, no. 10, pp. 2485-2497.
Citations: 0
Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets
IF 6.5, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-07-05. DOI: 10.1109/TSE.2024.3423712
Partha Chakraborty;Krishna Kanth Arumugam;Mahmoud Alfadel;Meiyappan Nagappan;Shane McIntosh
The impact of software vulnerabilities on everyday software systems is concerning. Although deep learning-based models have been proposed for vulnerability detection, their reliability remains a significant concern. While prior evaluations of such models report impressive recall/F1 scores of up to 99%, we find that these models underperform in practical scenarios, particularly when evaluated on entire codebases rather than only on the fixing commit. In this paper, we introduce a comprehensive dataset (Real-Vul) designed to accurately represent real-world scenarios for evaluating vulnerability detection models. We evaluate the DeepWukong, LineVul, ReVeal, and IVDetect vulnerability detection approaches and observe a surprisingly significant drop in performance, with precision declining by up to 95 percentage points and F1 scores dropping by up to 91 percentage points. A closer inspection reveals a substantial overlap in the embeddings generated by the models for vulnerable and uncertain samples (non-vulnerable, or vulnerability not reported yet), which likely explains why we observe such a large increase in the quantity and rate of false positives. Additionally, we observe fluctuations in model performance based on vulnerability characteristics (e.g., vulnerability types and severity). For example, the studied models achieve F1 scores 26 percentage points better when vulnerabilities are related to information leaks or code injection than when they are related to path resolution or predictable return values. Our results highlight the substantial performance gap that still needs to be bridged before deep learning-based vulnerability detection is ready for deployment in practical settings. Digging deeper into why models underperform in realistic settings, our investigation revealed overfitting as a key issue. We address this by introducing an augmentation technique, potentially improving performance by up to 30%.
We contribute (a) an approach to creating a dataset that future research can use to improve the practicality of model evaluation; (b) Real-Vul, a comprehensive dataset that adheres to this approach; and (c) empirical evidence that deep learning-based models struggle to perform in a real-world setting.
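The precision collapse reported above is largely a base-rate effect: scanning a whole codebase multiplies the pool of benign candidates, and even a small per-function false-positive rate then swamps precision while recall stays fixed. The sketch below makes that arithmetic concrete; the counts are invented for illustration and are not the paper's measurements.

```python
# Numeric sketch of why whole-codebase evaluation deflates precision/F1:
# the detector and its recall stay fixed, but scanning every function in a
# codebase multiplies benign candidates and hence false positives.
# All counts below are invented for illustration.

def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Fixing-commit-only evaluation: few benign candidates alongside each fix.
p_commit, r_commit, f1_commit = prf1(tp=90, fp=10, fn=10)

# Whole-codebase evaluation: same recall, ~90x more benign functions scanned
# at a small per-function false-positive rate.
p_full, r_full, f1_full = prf1(tp=90, fp=900, fn=10)
```

With these toy counts, precision falls from 0.90 to about 0.09 and F1 from 0.90 to about 0.17 even though recall is unchanged, mirroring the kind of gap the study reports between benchmark and realistic evaluation.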
IEEE Transactions on Software Engineering, vol. 50, no. 8, pp. 2163-2177.
Citations: 0
API2Vec++: Boosting API Sequence Representation for Malware Detection and Classification
IF 6.5, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-07-04. DOI: 10.1109/TSE.2024.3422990
Lei Cui;Junnan Yin;Jiancong Cui;Yuede Ji;Peng Liu;Zhiyu Hao;Xiaochun Yun
Analyzing malware based on API call sequences is an effective approach, as these sequences reflect the dynamic execution behavior of malware. Recent advancements in deep learning have facilitated the application of these techniques to mine valuable information from API call sequences. However, these methods typically operate on raw sequences and may not effectively capture crucial information, especially in the case of multi-process malware, due to the API call interleaving problem. Furthermore, they often fail to capture contextual behaviors within or across processes, which is particularly important for identifying and classifying malicious activities. Motivated by this, we present API2Vec++, a graph-based API embedding method for malware detection and classification. First, we construct a graph model to represent the raw sequence. Specifically, we design the Temporal Process Graph (TPG) to model inter-process behaviors and the Temporal API Property Graph (TAPG) to model intra-process behaviors. Compared to our previous graph model, the TAPG model exposes operations with associated behaviors within the process through node properties and thus enhances detection and classification abilities. Using these graphs, we develop a heuristic random walk algorithm to generate numerous paths that can capture fine-grained malicious familial behavior. By pre-training these paths using the BERT model, we generate embeddings of paths and APIs, which can then be used for malware detection and classification. Experiments on a real-world malware dataset demonstrate that API2Vec++ outperforms state-of-the-art embedding methods and detection/classification methods in both accuracy and robustness, particularly for multi-process malware.
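The API call interleaving problem and the per-process graph idea can be sketched as follows: an interleaved trace is split by process id, temporal edges are added between consecutive calls within each process, and walk paths over the resulting graph become the units that get embedded. The trace contents, API names, and the uniform walk policy are illustrative assumptions, not the paper's TPG/TAPG construction or its heuristic walk algorithm.

```python
import random
from collections import defaultdict

# Toy sketch: de-interleave an API trace by process id, build temporal
# per-process edges, then generate walk paths over the graph.

trace = [  # (timestamp, pid, api) -- invented example trace
    (1, "A", "CreateFile"), (2, "B", "RegOpenKey"),
    (3, "A", "WriteFile"),  (4, "B", "RegSetValue"),
    (5, "A", "CloseHandle"),
]

def build_tpg(trace):
    per_proc = defaultdict(list)
    for _, pid, api in sorted(trace):   # time order, split by process
        per_proc[pid].append(api)
    graph = defaultdict(list)
    for calls in per_proc.values():
        for a, b in zip(calls, calls[1:]):
            graph[a].append(b)          # temporal edge within one process
    return per_proc, graph

def walk(graph, start, length, rng):
    """Random walk over temporal edges; paths like these are what a
    sequence model (e.g., BERT-style pre-training) would later embed."""
    path = [start]
    while len(path) < length and graph.get(path[-1]):
        path.append(rng.choice(graph[path[-1]]))
    return path

per_proc, g = build_tpg(trace)
path = walk(g, "CreateFile", 3, random.Random(0))
```

De-interleaving first means a walk never mixes calls from unrelated processes, which is exactly the contextual signal raw-sequence models lose.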
IEEE Transactions on Software Engineering, vol. 50, no. 8, pp. 2142-2162.
Citations: 0
To Do or Not to Do: Semantics and Patterns for Do Activities in UML PSSM State Machines
IF 6.5, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-07-04. DOI: 10.1109/TSE.2024.3422845
Márton Elekes;Vince Molnár;Zoltán Micskei
State machines are used in engineering many types of software-intensive systems. UML State Machines extend simple finite state machines with powerful constructs. Among the many extensions, there is one seemingly simple and innocent language construct that fundamentally changes state machines’ reactive model of computation: doActivity behaviors. DoActivity behaviors describe behavior that is executed independently from the state machine once entered in a given state, typically modeling complex computation or communication as background tasks. However, the UML specification or textbooks are vague about how the doActivity behavior construct should be appropriately used. This lack of guidance is a severe issue as, when improperly used, doActivities can cause concurrent, non-deterministic bugs that are especially challenging to find and could ruin a seemingly correct software design. The Precise Semantics of UML State Machines (PSSM) specification introduced detailed operational semantics for state machines. To the best of our knowledge, there is no rigorous review yet of doActivity's semantics as specified in PSSM. We analyzed the semantics by collecting evidence from cross-checking the text of the specification, its semantic model and executable test cases, and the simulators supporting PSSM. We synthesized insights about subtle details and emergent behaviors relevant to tool developers and advanced modelers. We reported inconsistencies and missing clarifications in more than 20 issues to the standardization committee. Based on these insights, we studied 11 patterns for doActivities detailing the consequences of using a doActivity in a given situation and discussing countermeasures or alternative design choices. We hope that our analysis of the semantics and the patterns help vendors develop conformant simulators or verification tools and engineers design better state machine models.
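The core doActivity behavior the abstract describes, running concurrently with the state machine after state entry and being aborted on state exit, can be imitated with a background thread and a cancellation flag. This is a hedged sketch of that one semantic point only; the class and function names are invented, and real PSSM semantics cover far more (event pools, run-to-completion steps, completion events).

```python
import threading
import time

# Hedged sketch: a doActivity runs on its own thread once the state is
# entered, and receives an abort signal on state exit instead of running
# to completion. Names are illustrative, not from the PSSM specification.

class State:
    def __init__(self, do_activity):
        self._do = do_activity
        self._abort = threading.Event()
        self._thread = None
        self.log = []

    def enter(self):
        self._abort.clear()
        self._thread = threading.Thread(
            target=self._do, args=(self._abort, self.log), daemon=True)
        self._thread.start()      # doActivity runs concurrently from here

    def exit(self):
        self._abort.set()         # state exit aborts the doActivity
        self._thread.join(timeout=1)

def background_poll(abort, log):
    # doActivity body: works in ticks until aborted (a cancellation point
    # per iteration -- the cooperative pattern that avoids orphan threads).
    while not abort.wait(0.01):
        log.append("tick")

s = State(background_poll)
s.enter()
time.sleep(0.15)                  # the state machine stays responsive here
s.exit()
```

The sketch also hints at why doActivities breed concurrency bugs: any shared data the background body touches (here, `log`) is accessed from two threads, which is exactly where non-deterministic failures creep in.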
IEEE Transactions on Software Engineering, vol. 50, no. 8, pp. 2124-2141.
Citations: 0
Local and Global Explainability for Technical Debt Identification
IF 6.5, CAS Tier 1 (Computer Science), Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING. Pub Date: 2024-07-04. DOI: 10.1109/TSE.2024.3422427
Dimitrios Tsoukalas;Nikolaos Mittas;Elvira-Maria Arvanitou;Apostolos Ampatzoglou;Alexander Chatzigeorgiou;Dionysios Kehagias
In recent years, we have witnessed an important increase in research focusing on how machine learning (ML) techniques can be used for software quality assessment and improvement. However, the derived methodologies and tools lack transparency due to the black-box nature of the employed machine learning models, leading to decreased trust in their results. To address this shortcoming, in this paper we extend the state of the art and practice by building explainable AI models on top of machine learning ones: to interpret the factors (i.e., software metrics) that put a module at risk of having high technical debt (HIGH TD), to obtain thresholds for metric scores that signal poor maintainability, and, finally, to achieve local interpretation that explains the specific problems of each module, pinpointing specific opportunities for improvement during TD management. To achieve this goal, we have developed project-specific classifiers (characterizing modules as HIGH and NOT-HIGH TD) for 21 open-source projects, and we explain their rationale using SHapley Additive exPlanation (SHAP) analysis. Based on our analysis, complexity, comments ratio, cohesion, nesting of control flow statements, coupling, refactoring activity, and code churn are the most important reasons for characterizing classes as at HIGH TD risk. The analysis is complemented with global and local means of interpretation, such as metric thresholds and case-by-case reasoning, for characterizing a class as at risk of having HIGH TD.
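The attribution idea behind SHAP, crediting each feature with its average marginal contribution over all orderings, can be shown exactly on a tiny model. The risk function, the three metrics, and every coefficient below are invented for the example; the paper's classifiers are trained models explained by the SHAP library, not this toy.

```python
from itertools import permutations

# Toy exact Shapley computation over three software metrics, illustrating
# the per-feature attribution SHAP approximates on real models.

METRICS = ("complexity", "comments_ratio", "coupling")

def risk(present):
    """Invented HIGH-TD risk score given a subset of 'bad' metric readings."""
    score = 0.0
    if "complexity" in present:
        score += 0.5
    if "coupling" in present:
        score += 0.3
    if "complexity" in present and "coupling" in present:
        score += 0.1   # interaction term, split between the two by Shapley
    if "comments_ratio" in present:
        score -= 0.2   # a healthy comments ratio reduces risk
    return score

def shapley(feature):
    # Average the feature's marginal contribution over all join orders.
    total = 0.0
    perms = list(permutations(METRICS))
    for order in perms:
        before = set(order[:order.index(feature)])
        total += risk(before | {feature}) - risk(before)
    return total / len(perms)

vals = {m: shapley(m) for m in METRICS}
```

The values satisfy the efficiency property: they sum exactly to `risk(all) - risk(none)`, and a protective feature (here `comments_ratio`) receives a negative attribution, which is the global/local reading the study applies to its classifiers.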
IEEE Transactions on Software Engineering, vol. 50, no. 8, pp. 2110–2123. DOI: 10.1109/TSE.2024.3422427
Citations: 0
Characterizing the Prevalence, Distribution, and Duration of Stale Reviewer Recommendations 描述陈旧审稿人建议的普遍分布和持续时间
IF 6.5 1区 计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date : 2024-07-03 DOI: 10.1109/TSE.2024.3422369
Farshad Kazemi;Maxime Lamothe;Shane McIntosh
The appropriate assignment of reviewers is a key factor in determining the value that organizations can derive from code review. While inappropriate reviewer recommendations can hinder the benefits of the code review process, identifying these assignments is challenging. Stale reviewers, i.e., those who no longer contribute to the project, are one type of reviewer recommendation that is certainly inappropriate. Understanding and minimizing this type of recommendation can thus enhance the benefits of the code review process. While recent work demonstrates the existence of stale reviewers, to the best of our knowledge, attempts have yet to be made to characterize and mitigate them. In this paper, we study the prevalence and potential effects of stale reviewer recommendations. We then propose and assess a strategy to mitigate stale recommendations in existing code reviewer recommendation (CRR) tools. By applying five code reviewer recommendation approaches (LearnRec, RetentionRec, cHRev, Sofia, and WLRRec) to three thriving open-source systems with 5,806 contributors, we observe that, on average, 12.59% of incorrect recommendations are stale due to developer turnover; however, fewer stale recommendations are made when the recency of contributions is considered by the recommendation objective function. We also investigate which reviewers appear in stale recommendations and observe that the top reviewers account for a considerable proportion of stale recommendations. For instance, in 15.31% of cases, the top-3 reviewers account for at least half of the stale recommendations. Finally, we study how long stale reviewers linger after leaving the project, observing that contributors who left the project 7.7 years ago are still suggested to review change sets. Based on our findings, we propose separating reviewer contribution recency from the other factors used by the CRR objective function, and using it to filter out developers who have not contributed within a specified period. By evaluating this strategy with different cutoff intervals, we assess its potential impact on the recommended reviewers. The proposed filter reduces the staleness of recommendations, i.e., the Staleness Reduction Ratio (SRR) improves by 21.44%–92.39%. Yet since the strategy may increase active reviewer workload, careful project-specific exploration of the impact of the cut-off setting is crucial.
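The recency-based filtering strategy described above can be sketched as a post-processing step over a recommender's output. The helper names, the SRR formula, and the sample data here are illustrative assumptions, not the paper's implementation:

```python
from datetime import date, timedelta

def filter_stale(recommended, last_contribution, today, cutoff_days):
    """Keep only recommended reviewers whose last contribution falls
    inside the cutoff window; unknown reviewers are treated as stale."""
    horizon = today - timedelta(days=cutoff_days)
    return [r for r in recommended
            if last_contribution.get(r, date.min) >= horizon]

def staleness_reduction_ratio(before, after, departed):
    """Hypothetical SRR: fraction of recommendations of departed
    reviewers that the filter removed."""
    stale_before = sum(r in departed for r in before)
    stale_after = sum(r in departed for r in after)
    return 0.0 if stale_before == 0 else 1 - stale_after / stale_before

today = date(2024, 7, 1)
last_contribution = {
    "alice": date(2024, 6, 20),  # recently active
    "bob":   date(2016, 11, 3),  # left years ago -> stale
    "carol": date(2023, 12, 30), # inactive ~6 months, still within cutoff
}
recommended = ["alice", "bob", "carol"]
filtered = filter_stale(recommended, last_contribution, today, cutoff_days=365)
```

A tighter `cutoff_days` removes more stale recommendations but concentrates work on fewer active reviewers — the workload trade-off the abstract warns about.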
IEEE Transactions on Software Engineering, vol. 50, no. 8, pp. 2096–2109. DOI: 10.1109/TSE.2024.3422369
Citations: 0
Esale: Enhancing Code-Summary Alignment Learning for Source Code Summarization ESALE:增强源代码摘要的代码摘要对齐学习
IF 6.5 1区 计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date : 2024-07-03 DOI: 10.1109/TSE.2024.3422274
Chunrong Fang;Weisong Sun;Yuchen Chen;Xiao Chen;Zhao Wei;Quanjun Zhang;Yudu You;Bin Luo;Yang Liu;Zhenyu Chen
(Source) code summarization aims to automatically generate succinct natural language summaries for given code snippets. Such summaries play a significant role in helping developers understand and maintain code. Inspired by neural machine translation, deep learning-based code summarization techniques widely adopt an encoder-decoder framework, where the encoder transforms given code snippets into context vectors, and the decoder decodes context vectors into summaries. Recently, large-scale pre-trained models for source code (e.g., CodeBERT and UniXcoder) are equipped with encoders capable of producing general context vectors and have achieved substantial improvements on the code summarization task. However, although they are usually trained mainly on code-focused tasks and can capture general code features, they still fall short in capturing the specific features that need to be summarized. In a nutshell, they fail to learn the alignment between code snippets and summaries (code-summary alignment for short). In this paper, we propose a novel approach to improve code summarization based on summary-focused tasks. Specifically, we exploit a multi-task learning paradigm to train the encoder on three summary-focused tasks to enhance its ability to learn code-summary alignment: unidirectional language modeling (ULM), masked language modeling (MLM), and action word prediction (AWP). Unlike pre-trained models that mainly predict masked tokens in code snippets, we design ULM and MLM to predict masked words in summaries. Intuitively, predicting words based on given code snippets would help learn the code-summary alignment. In addition, existing work shows that AWP affects the prediction of the entire summary. Therefore, we further introduce the domain-specific task AWP to enhance the ability of the encoder to learn the alignment between action words and code snippets.
We evaluate the effectiveness of our approach, called Esale, by conducting extensive experiments on four datasets, including two widely used datasets JCSD and PCSD, a cross-project Java dataset CPJD, and a multilingual language dataset CodeSearchNet. Experimental results show that Esale significantly outperforms state-of-the-art baselines in all three widely used metrics, including BLEU, METEOR, and ROUGE-L. Moreover, the human evaluation proves that the summaries generated by Esale are more informative and closer to the ground-truth summaries.
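The key twist described above — masking words in the *summary* rather than tokens in the code — can be illustrated by a minimal routine that builds one MLM-style training pair (a sketch of the masking idea only; token choices, mask ratio, and names here are assumptions, not Esale's actual preprocessing):

```python
import random

MASK = "[MASK]"

def mask_summary(summary_tokens, ratio=0.15, rng=None):
    """Build one MLM-style training pair over the summary: mask a fraction
    of its tokens; the model must predict them conditioned on the paired
    code snippet, which encourages code-summary alignment."""
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(summary_tokens) * ratio))
    positions = sorted(rng.sample(range(len(summary_tokens)), n_mask))
    masked = list(summary_tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]   # remember the ground-truth word
        masked[p] = MASK
    return masked, targets

summary = "returns the index of the first matching element".split()
masked, targets = mask_summary(summary, ratio=0.3, rng=random.Random(42))
```

The ULM variant would instead mask a suffix and predict it left-to-right; AWP would single out the leading action word (here, "returns") as the prediction target.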
IEEE Transactions on Software Engineering, vol. 50, no. 8, pp. 2077–2095. DOI: 10.1109/TSE.2024.3422274
Citations: 0
Multi-Objective Software Defect Prediction via Multi-Source Uncertain Information Fusion and Multi-Task Multi-View Learning 通过多源不确定信息融合和多任务多视角学习进行多目标软件缺陷预测
IF 6.5 1区 计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING Pub Date : 2024-07-03 DOI: 10.1109/TSE.2024.3421591
Minghao Yang;Shunkun Yang;W. Eric Wong
Effective software defect prediction (SDP) is important for software quality assurance. Numerous advanced SDP methods have been proposed recently. However, how to account for task correlations and achieve multi-objective SDP accurately and efficiently remains to be explored. In this paper, we propose a novel multi-objective SDP method via multi-source uncertain information fusion and multi-task multi-view learning (MTMV) to accurately and efficiently predict the proneness, location, and type of defects. Firstly, multi-view features are extracted from multi-source static analysis results, reflecting uncertain defect location distributions and semantic information. Then, a novel MTMV model is proposed to fully fuse the uncertain defect information in multi-view features and realize effective multi-objective SDP. Specifically, convolutional GRU encoders capture the consistency and complementarity of multi-source defect information to automatically filter the noise of false and missed alarms and to reduce the location and type uncertainty of static analysis results. A global attention mechanism, combined with hard parameter sharing in MTMV, fuses features according to their global importance across all tasks for balanced learning. Then, considering latent task and feature correlations, multiple task-specific decoders jointly optimize all SDP tasks by sharing learning experience. Through extensive experiments on 14 datasets, the proposed method significantly improves prediction performance over 12 baseline methods for all SDP objectives. The average improvements are 30.7%, 31.2%, and 32.4% for defect proneness, location, and type prediction, respectively. Therefore, the proposed multi-objective SDP method can provide more sufficient and precise insights for developers to significantly improve the efficiency of software analysis and testing.
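The hard-parameter-sharing pattern behind this multi-task design — one shared encoder feeding separate decoders for proneness, location, and type — can be sketched numerically. All dimensions, weights, and head names below are hypothetical placeholders, not the paper's MTMV architecture (which uses convolutional GRU encoders and attention):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 32 modules, 10 static-analysis features,
# an 8-d shared representation, and three SDP tasks.
N_FEATURES, HIDDEN = 10, 8
W_shared = rng.normal(size=(N_FEATURES, HIDDEN))  # hard-shared encoder weights
task_heads = {
    "proneness": rng.normal(size=(HIDDEN, 2)),    # defect-prone vs. not
    "location":  rng.normal(size=(HIDDEN, 5)),    # 5 coarse location bins
    "type":      rng.normal(size=(HIDDEN, 4)),    # 4 defect types
}

def forward(X):
    """One shared representation feeds all task-specific decoders;
    a joint loss over the three heads would update W_shared together."""
    h = np.tanh(X @ W_shared)
    return {task: h @ W for task, W in task_heads.items()}

X = rng.normal(size=(32, N_FEATURES))
outputs = forward(X)
```

Because every task's gradient flows through `W_shared`, correlated tasks regularize each other — the effect the abstract attributes to sharing learning experience across decoders.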
IEEE Transactions on Software Engineering, vol. 50, no. 8, pp. 2054–2076. DOI: 10.1109/TSE.2024.3421591
Citations: 0