
Latest publications in Information and Software Technology

The top 8+2 long-term success factors in public IS procurement: A retrospective case study
IF 4.3 | CAS Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-08 | DOI: 10.1016/j.infsof.2025.107994
Sanni Marjanen, Samuli Pekkola, Tommi Mikkonen

Context:

A multibuyer procurement unit is in a vendor lock-in. It is looking to procure a new information system (IS) for one of its end-user client organizations. Both the procurement unit and its client organization want the new acquisition to succeed; hence, they wish to learn from their preceding collaboration and from the IS in use, which dates back twenty years.

Objective:

This paper presents a qualitative case study where IS procurement long-term success factors are identified and prioritized by experts from a procurement unit and its client organization. The objective of the study is to increase scholars’ understanding and promote success in future IS procurement processes.

Methods:

The Delphi method and a focus group interview workshop are used. The Delphi-like study has three phases. First, a preliminary list of potential success factors is identified from the research literature. Then, 13 public procurement experts and representative end users review, complement, and validate the list. In the second phase, the experts independently narrow down and prioritize the list in two rounds. In the third phase, a focus group interview workshop is conducted so that the two groups can reach a shared understanding, build consensus, and rank the success factors. Finally, the ranking is validated by all participants.
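The consensus-ranking step can be sketched in code. Below is a minimal, purely illustrative Borda-count aggregation of hypothetical expert priority lists; the study itself used Delphi rounds plus a focus group workshop, not this exact scheme, and the factor names are invented.

```python
# Illustrative only: aggregate independent expert priority lists into one
# consensus ranking via a Borda count (hypothetical factors and rankings).

def borda_ranking(priority_lists):
    """Each inner list ranks factors from most to least important.
    A factor earns (n - position - 1) points per list; higher total ranks first."""
    scores = {}
    for ranking in priority_lists:
        n = len(ranking)
        for position, factor in enumerate(ranking):
            scores[factor] = scores.get(factor, 0) + (n - position - 1)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))

experts = [
    ["stakeholder alignment", "process clarity", "requirements quality"],
    ["process clarity", "stakeholder alignment", "requirements quality"],
    ["stakeholder alignment", "requirements quality", "process clarity"],
]
print(borda_ranking(experts))
# "stakeholder alignment" scores 2+1+2 = 5, "process clarity" 1+2+0 = 3,
# "requirements quality" 0+0+1 = 1, so alignment ranks first.
```

A final validation round, as in the study, would then present this aggregate ranking back to all participants.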

Results:

A top 8+2 list of long-term success factors in public IS procurement was constructed. The first 8 factors relate to the procurement as a whole and its stakeholders, while the latter two factors refer to the procured IS and its requirements. We find that focus on the comprehensive procurement process and stakeholder alignment are paramount.

Conclusion:

The study provides a synthesized and prioritized list of public IS procurement factors that contribute to perceived long-term success. The findings broaden our understanding of such factors, thus guiding future procurement and research endeavors.
Towards green game software engineering: A comparative analysis of energy consumption between the widespread unity and unreal video game engines
IF 4.3 | CAS Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-03 | DOI: 10.1016/j.infsof.2025.107991
Carlos Pérez, Javier Verón, Francisca Pérez, Mª Ángeles Moraga, Coral Calero, Carlos Cetina

Context:

The total energy cost of computing activities is steadily increasing, and projections indicate that it will be one of the dominant global energy consumers in the coming decades. However, the video game sector has not yet developed the same level of environmental awareness as other computing technologies despite the estimated three billion regular video game players in the world.

Objective:

This work evaluates the energy consumption of the most widely used industry-scale video game engines: Unity and Unreal Engine.

Method:

Specifically, our work uses three scenarios representing relevant aspects of video games (Physics, Static Meshes, and Dynamic Meshes) to compare the energy consumption of the engines. The aim is to determine the influence of using each engine on energy consumption.

Results:

Our research has confirmed notable differences in energy consumption: 351% in Physics in favor of Unity, 17% in Static Meshes in favor of Unity, and 26% in Dynamic Meshes in favor of Unreal Engine.
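As a side note on how an "X% in favor of engine A" figure is typically derived from two measured energy totals, here is a hedged arithmetic sketch; the joule values below are invented stand-ins, not the paper's measurements.

```python
# Illustrative arithmetic: relative energy overhead of the costlier engine,
# expressed as a percentage "in favor of" the cheaper one. Joule inputs are
# made-up example numbers, not measurements from the study.

def percent_in_favor(e_a, e_b):
    """Given energy totals for engines A and B (joules), return which engine
    wins and by what percentage: (loser - winner) / winner * 100."""
    if e_a <= e_b:
        return "A", (e_b - e_a) / e_a * 100.0
    return "B", (e_a - e_b) / e_b * 100.0

winner, pct = percent_in_favor(100.0, 451.0)  # hypothetical joule totals
print(winner, round(pct))  # A 351 — i.e., "351% in favor of A"
```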

Conclusion:

Considering the estimated three billion regular video game players worldwide and the high computational requirements of the sector, the magnitude of potential savings is a relevant issue for the research community. This might encourage a new branch of research on energy efficient video game engines.
Fair and square? Evaluating fairness of LLM-generated synthetic datasets
IF 4.3 | CAS Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-12-02 | DOI: 10.1016/j.infsof.2025.107980
Gianmario Voria, Benedetto Scala, Leopoldo Todisco, Carlo Venditto, Giammaria Giordano, Gemma Catolino, Fabio Palomba

Context:

Machine Learning (ML) is driving advancements across various industries, including healthcare, finance, and entertainment, but it also raises significant ethical concerns, particularly regarding fairness. Biases in training data can lead to unfair outcomes, perpetuating or even amplifying existing disparities. Prior research in the Software Engineering (SE) and ML communities has developed numerous bias mitigation techniques, yet two key limitations persist: (1) most approaches intervene at later stages of development, such as after data collection or model training, rather than addressing fairness from the outset; and (2) these methods often mitigate bias without fully eliminating it, since the root issue frequently lies in the data itself.

Objective:

In this paper, we explore an alternative approach to mitigate unfairness: synthetic data generation, which involves creating artificial datasets that mimic the statistical properties of real-world data. We aim to assess how this approach can contribute to generating data that positively impacts the trade-off between performance and fairness by creating datasets that reduce the influence of real-world biases through synthetic feature generation.

Methods:

To this end, we conducted an empirical study comparing ML models trained on synthetic datasets generated by large language models to ML models trained on real-world data, evaluating performance and fairness indicators.
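For readers unfamiliar with such indicators, a minimal sketch of the kind of performance and fairness metrics this comparison involves: accuracy and statistical-parity difference on toy predictions. The data and the specific metric choice are illustrative assumptions, not the study's actual evaluation suite.

```python
# Toy illustration of one performance metric and one fairness metric of the
# kind compared in such studies (data below is invented).

def accuracy(y_true, y_pred):
    """Fraction of predictions matching the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def statistical_parity_difference(y_pred, group):
    """P(pred=1 | group 0) - P(pred=1 | group 1); 0 means parity between groups."""
    def positive_rate(g):
        members = [p for p, grp in zip(y_pred, group) if grp == g]
        return sum(members) / len(members)
    return positive_rate(0) - positive_rate(1)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]  # e.g., 0 = unprivileged, 1 = privileged

print(accuracy(y_true, y_pred))                      # 0.75
print(statistical_parity_difference(y_pred, group))  # 0.0 (equal positive rates)
```

A synthetic-data approach "positively impacts the trade-off" when it raises the fairness score (drives the parity difference toward 0) without sacrificing much accuracy.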

Results:

Our results demonstrate that models trained with synthetic data, particularly those generated using simpler prompts, can achieve competitive performance while enhancing fairness.

Conclusion:

Our work suggests that synthetic data generation may be a viable approach to addressing fairness requirements in ML systems.
ReEPM: A Reliability Estimation Framework for CNNs based on Error Probability Matrix modeling
IF 4.3 | CAS Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-11-28 | DOI: 10.1016/j.infsof.2025.107981
Jie Xiao, Aizhu Liu, Yujian Yang, Yuhao Huang, Zhezhao Yang, Jungang Lou

Context:

The deployment of Convolutional Neural Networks (CNNs) in safety-critical applications faces significant challenges from soft errors. While accurate reliability assessment is vital, existing methods typically suffer from prohibitively high computational overheads, creating a critical trade-off between precision and efficiency that severely limits their practical applicability.

Objective:

To overcome this critical precision-efficiency dilemma, we design a novel framework for CNN reliability assessment that delivers accurate and highly efficient evaluation across diverse error conditions.

Method:

We propose ReEPM (Reliability Estimation Framework based on Error Probability Matrix). ReEPM constructs an Error Probability Matrix (EPM) that precisely models bit-flip error impact on CNN weights, fundamentally enabling parallel, accurate error injection without brute-force simulation. Moreover, we integrate an adaptive iterative process driven by Kalman filtering, which intelligently converges on reliability estimates with a drastically reduced number of input samples. This combination offers superior analytical rigor and computational efficiency.
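The bit-flip errors that the EPM models can be illustrated with a small, self-contained sketch of single-bit injection into an IEEE-754 float32 weight. This shows only the error primitive, not ReEPM's probability-matrix or Kalman-filtering machinery.

```python
# Illustrative soft-error primitive: flip one bit of a float32 weight.
# Bit 31 is the sign, bits 30-23 the exponent, bits 22-0 the mantissa.
import struct

def flip_bit(weight, bit):
    """Return `weight` with bit index `bit` (0 = mantissa LSB) flipped."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", weight))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

w = 1.0  # stored as 0x3F800000
print(flip_bit(w, 31))  # -1.0: sign-bit flip negates the weight
print(flip_bit(w, 23))  # 0.5: flipping the exponent LSB halves the value
print(flip_bit(w, 30))  # inf: a high exponent bit flip destroys the value
```

The wide spread of outcomes (harmless mantissa noise versus infinities from exponent flips) is exactly why per-bit error-probability modeling beats uniform random injection.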

Result:

Experimental results show that ReEPM achieves average accuracy of 0.9017 (single-error) and 0.9984 (multiple-error), while being 69.53× and 1989.27× faster, respectively, than widely adopted Monte Carlo fault injection.

Conclusion:

ReEPM establishes a new paradigm for CNN reliability assessment by effectively overcoming the critical accuracy-overhead trade-off. It offers an accurate and rapid evaluation tool for designing resilient CNNs in next-generation safety-critical intelligent systems.
VDMPAGR: A vulnerability detection model based on pointer analysis and graph representation
IF 4.3 | CAS Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-11-27 | DOI: 10.1016/j.infsof.2025.107982
Yukun Dong, Shuai Liu, Xiaoshan Liu, Mingcheng Chen, Shuo Wang, Yinzhou Feng, Yixin Zhang

Context:

Software vulnerabilities pose a major threat to software security. Deep learning-based vulnerability detection models have demonstrated notable advantages, particularly in terms of automation and accuracy. Among these, graph representation-based vulnerability detection models have achieved a series of remarkable advancements in recent research. However, existing graph representation methods struggle to fully represent the syntactic and semantic information in source code, especially pointer relations. They also face challenges in detecting cross-function vulnerabilities and exhibit relatively low recall.

Objective:

We design VDMPAGR (A vulnerability detection model based on pointer analysis and graph representation) with two objectives: (i) to sufficiently extract the syntactic and semantic information in source code, particularly the points-to relations of pointers; (ii) to be capable of detecting cross-function vulnerabilities at the statement-level.

Method:

For detecting cross-function vulnerabilities at the statement-level, we leverage graph neural networks to learn features from the graph representation of source code. First, we conduct pointer analysis and integrate its results into System Dependency Graph (SDG) to construct a novel graph representation. Next, we construct code slices for center nodes that are potentially vulnerable and embed these slices into vector representations. Then, we employ the Dual Graph Neural Network (DGNN) proposed in this paper to extract features from the slices and the pointer relations within the slices, respectively. Finally, a Multi-Layer Perceptron (MLP) layer is used for vulnerability prediction.
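The slicing step can be illustrated with a toy sketch: starting from a potentially vulnerable center statement, collect every statement connected through forward or backward dependence edges. The graph literal below is a made-up stand-in for a real SDG, and this ignores the pointer-relation edges the paper adds.

```python
# Toy program slice over a dependency graph: gather all statements reachable
# from a "center" node along forward or backward dependence edges.
from collections import deque

def slice_around(edges, center):
    """edges: dict mapping a statement to the statements that depend on it."""
    backward = {}
    for src, dsts in edges.items():
        for dst in dsts:
            backward.setdefault(dst, []).append(src)
    seen, frontier = {center}, deque([center])
    while frontier:
        node = frontier.popleft()
        for nxt in edges.get(node, []) + backward.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return sorted(seen)

# Hypothetical statements s1..s6; s6 has no dependence path to the center s4.
sdg = {"s1": ["s2"], "s2": ["s4"], "s3": ["s4"], "s4": ["s5"], "s6": []}
print(slice_around(sdg, "s4"))  # ['s1', 's2', 's3', 's4', 's5']
```

In the actual model, such slices are embedded as vectors and fed to the dual graph neural network rather than inspected directly.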

Results:

We construct a C/C++ dataset from National Vulnerability Database (NVD) and Software Assurance Reference Dataset (SARD) for our experiments. The experimental results show that our model achieves a Recall (Rec) of 85.71% and an F1-score (F1) of 74.84%, outperforming all baseline models.

Conclusion:

The experimental results show that VDMPAGR outperforms all baseline models, demonstrating the effectiveness of our method.
Sentiment analysis for software engineering: How far can zero-shot learning (ZSL) go?
IF 4.3 | CAS Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-11-26 | DOI: 10.1016/j.infsof.2025.107971
Reem Alfayez, Manal Binkhonain

Context:

Sentiment analysis in software engineering focuses on understanding emotions expressed in software artifacts. Previous research highlighted the limitations of applying general off-the-shelf sentiment analysis tools within the software engineering domain and indicated the need for specialized tools tailored to various software engineering contexts. The development of such tools heavily relies on supervised machine learning techniques that necessitate annotated datasets. Acquiring such datasets is a substantial challenge, as it requires domain-specific expertise and significant effort.

Objective:

This study explores the potential of zero-shot learning (ZSL) to address the scarcity of annotated datasets in sentiment analysis within software engineering.

Method:

We conducted an empirical experiment to evaluate the performance of various ZSL techniques, including embedding-based, natural language inference (NLI)-based, task-aware representation of sentences (TARS)-based, and generative-based ZSL techniques. We assessed the performance of these techniques under different label setups to examine the impact of label configurations. Additionally, we compared the results of the ZSL techniques with state-of-the-art fine-tuned transformer-based models. Finally, we performed an error analysis to identify the primary causes of misclassifications.
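To make the embedding-based variant concrete, a hedged sketch: score a text against label descriptions by cosine similarity of their embeddings, with no labeled training data. Real pipelines obtain the vectors from a sentence-embedding model; the tiny hand-made vectors below are stand-ins.

```python
# Sketch of embedding-based zero-shot classification: assign the label whose
# description embedding is most similar to the text embedding. All vectors
# here are hand-crafted toys standing in for model-produced embeddings.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_classify(text_vec, label_vecs):
    """Return the label with the highest cosine similarity to the text."""
    return max(label_vecs, key=lambda label: cosine(text_vec, label_vecs[label]))

label_vecs = {
    "positive": [0.9, 0.1, 0.0],
    "negative": [0.0, 0.1, 0.9],
    "neutral":  [0.2, 0.9, 0.2],
}
text_vec = [0.8, 0.3, 0.1]  # pretend embedding of "this build works great"
print(zero_shot_classify(text_vec, label_vecs))  # positive
```

The "label setups" studied then amount to varying how the label descriptions (and hence their embeddings) are phrased.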

Results:

Our findings demonstrate that ZSL techniques, particularly those combining expert-curated labels with embedding-based or generative-based models, can achieve macro-F1 scores comparable to fine-tuned transformer-based models. The error analysis revealed that subjectivity in annotation and polar facts are the main contributors to ZSL misclassifications.

Conclusion:

This study demonstrates the potential of ZSL for sentiment analysis in software engineering. ZSL can provide a solution to the challenge of annotated dataset scarcity by reducing reliance on annotated datasets.
{"title":"Sentiment analysis for software engineering: How far can zero-shot learning (ZSL) go?","authors":"Reem Alfayez,&nbsp;Manal Binkhonain","doi":"10.1016/j.infsof.2025.107971","DOIUrl":"10.1016/j.infsof.2025.107971","url":null,"abstract":"<div><h3>Context:</h3><div>Sentiment analysis in software engineering focuses on understanding emotions expressed in software artifacts. Previous research highlighted the limitations of applying general off-the-shelf sentiment analysis tools within the software engineering domain and indicated the need for specialized tools tailored to various software engineering contexts. The development of such tools heavily relies on supervised machine learning techniques that necessitate annotated datasets. Acquiring such datasets is a substantial challenge, as it requires domain-specific expertise and significant effort.</div></div><div><h3>Objective:</h3><div>This study explores the potential of zero-shot learning (ZSL) to address the scarcity of annotated datasets in sentiment analysis within software engineering.</div></div><div><h3>Method:</h3><div>We conducted an empirical experiment to evaluate the performance of various ZSL techniques, including embedding-based, natural language inference (NLI)-based, task-aware representation of sentences (TARS)-based, and generative-based ZSL techniques. We assessed the performance of these techniques under different labels setups to examine the impact of label configurations. Additionally, we compared the results of the ZSL techniques with state-of-the-art fine-tuned transformer-based models. Finally, we performed an error analysis to identify the primary causes of misclassifications.</div></div><div><h3>Results:</h3><div>Our findings demonstrate that ZSL techniques, particularly those combining expert-curated labels with embedding-based or generative-based models, can achieve macro-F1 scores comparable to fine-tuned transformer-based models. 
The error analysis revealed that subjectivity in annotation and polar facts are the main contributors to ZSL misclassifications.</div></div><div><h3>Conclusion:</h3><div>This study demonstrates the potential of ZSL for sentiment analysis in software engineering. ZSL can provide a solution to the challenge of annotated dataset scarcity by reducing reliance on annotated dataset.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"191 ","pages":"Article 107971"},"PeriodicalIF":4.3,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145625150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
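The embedding-based ZSL idea evaluated in the study can be sketched in a few lines: classify a text by the label whose embedding lies closest to the text's embedding, so no annotated training data is needed. The vectors below are toy values invented for illustration; a real system would obtain them from a pretrained sentence encoder.

```python
import math

# Toy label embeddings (assumption: a pretrained sentence encoder would
# produce these in practice; the 3-D vectors here are purely illustrative).
label_embeddings = {
    "positive": [0.9, 0.1, 0.0],
    "negative": [0.1, 0.9, 0.0],
    "neutral":  [0.3, 0.3, 0.8],
}

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def zero_shot_classify(text_embedding, labels=label_embeddings):
    # Pick the label whose embedding is most similar to the text embedding.
    return max(labels, key=lambda lab: cosine(text_embedding, labels[lab]))

# A comment praising a patch might embed near the "positive" direction.
print(zero_shot_classify([0.8, 0.2, 0.1]))  # → positive
```

The quality of the label embeddings matters greatly here, which is consistent with the study's finding that expert-curated labels improve ZSL performance.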
Citations: 0
Quality assessment of software requirements using artificial intelligence methods: A systematic literature review
IF 4.3 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-25 DOI: 10.1016/j.infsof.2025.107979
Elise Wolf, Adam Trendowicz, Julien Siebert

Context:

The quality of requirements specifications is a critical success factor in software development. Assuring high-quality requirements, specifically in an automated way, poses a significant challenge due to their unstructured and multi-modal character. With the rise of deep learning and large language models (LLMs), new opportunities have developed to assess the quality of requirements automatically, particularly user stories in the context of agile software engineering, where short development cycles require efficient tool support.

Objective:

This study aims to systematically review and investigate the current landscape of approaches based on artificial intelligence techniques such as natural language processing and deep learning for assessing the quality of software requirements. The investigation focuses on the artificial intelligence techniques adopted, quality aspects considered, datasets used to tune and evaluate the proposed approaches, and their performance.

Method:

We conducted a systematic literature review of 26 peer-reviewed papers published between 2019 and 2025. We selected the papers after a title and abstract review of 353 papers identified through a literature database query and forward–backward snowballing.

Results:

The results reveal significant overlap among the quality aspects considered, which can be mapped onto the higher-order requirements quality model INVEST. Most studies focus on assessing requirements quality rather than improving requirements and rely heavily on synthetic and public datasets. LLMs have rapidly gained popularity since 2023, though model evaluation strategies remain inconsistent. Metrics such as accuracy, precision, recall, and F1-Score are common, while only a few studies use semantic or expert-based evaluations.
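Macro-F1, one of the metrics the reviewed studies report, is the unweighted mean of per-class F1 scores. A minimal sketch with invented three-class predictions:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        # Per-class confusion counts.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels, invented for illustration.
y_true = ["pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neu"]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.489
```

Because every class contributes equally regardless of its frequency, macro-F1 penalizes models that ignore rare classes, which is one reason it is favored over plain accuracy in imbalanced sentiment datasets.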

Conclusion:

The field is evolving toward LLM-driven, semantically rich models, yet lacks methodological standardization, reproducible datasets for evaluating the models, and integration of the approaches with real-world requirements engineering processes. Future work should address these limitations by developing benchmark datasets, standardizing evaluation metrics, and exploring hybrid systems that combine AI-based and traditional requirements quality assurance approaches.
Citations: 0
Representation learning for coincidental correctness in fault localization
IF 4.3 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-24 DOI: 10.1016/j.infsof.2025.107978
Jian Hu

Context:

Fault localization (FL) is a critical phase in the software debugging process; it employs an execution coverage matrix to identify the exact locations of faults or bugs in a program's source code. However, researchers have shown that coincidental correctness test cases (CCTC), which execute faulty statements yet produce correct output, are prevalent in test suites and can negatively affect the effectiveness of fault localization.

Objective:

To address this problem, we propose ER4FL: a representation learning based CCTC detection method for fault localization. Our method first detects the CCTCs in the coverage matrix, then relabels them, and finally uses the optimized coverage matrix for fault localization.

Method:

ER4FL leverages autoencoder-based representation learning to refine the coverage matrix, capturing its most important features in a compressed form. Based on the enhanced representation (i.e., the compact coverage matrix), ER4FL adopts a Gaussian Mixture Model (GMM) as a probabilistic model to identify and manipulate CCTC. Finally, ER4FL feeds the coverage matrix without CCTC into the FL pipeline.
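The shape of this pipeline can be sketched under loud assumptions: the paper's autoencoder is replaced here by a linear projection (top principal component), and the GMM is a hand-rolled two-component EM run on a toy coverage matrix, so this is an illustration of the compress-then-cluster idea rather than ER4FL itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy coverage matrix: rows = test cases, columns = statements (1 = executed).
# 40 ordinary passing tests plus 10 coincidentally correct (CC) tests that
# all execute the (assumed) faulty statements 5-7.
passing = (rng.random((40, 12)) < 0.3).astype(float)
cc = (rng.random((10, 12)) < 0.3).astype(float)
cc[:, 5:8] = 1.0
coverage = np.vstack([passing, cc])

# 1. Representation-learning stand-in: ER4FL uses an autoencoder; here we
#    compress each test case to its top-principal-component score instead.
centered = coverage - coverage.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
z = centered @ vt[0]
z = z / (z.std() + 1e-12)  # standardize the 1-D representation

# 2. Two-component Gaussian mixture fitted by EM on the 1-D representation.
mu = np.array([z.min(), z.max()])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])
for _ in range(25):
    # E-step: responsibility of each component for each test case.
    dens = pi * np.exp(-(z[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
    # M-step: update mixture weights, means, and (floored) variances.
    nk = np.maximum(resp.sum(axis=0), 1e-9)
    mu = (resp * z[:, None]).sum(axis=0) / nk
    var = (resp * (z[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-3
    pi = nk / len(z)

labels = resp.argmax(axis=1)
# Treat the smaller cluster as the suspected-CC group to relabel.
cc_cluster = np.argmin(np.bincount(labels, minlength=2))
print("suspected CC test cases:", np.where(labels == cc_cluster)[0])
```

In the actual method, the flagged test cases would be relabeled (or removed) before the coverage matrix enters the FL pipeline, so that suspiciousness formulas such as Ochiai no longer treat them as clean passing executions.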

Results:

Our experimental results demonstrate that ER4FL reduces the Mean First Rank (MFR) of Ochiai from 333.18 to 258.26, achieving a relative improvement of 22.49%. In addition, ER4FL decreases the number of checked statements in Convolutional Neural Network (CNN) FL from 859.20 to 579.65, corresponding to a relative reduction of 48.23%.

Conclusion:

The experimental results demonstrate that our method is statistically more effective than the six FL baselines, as well as the two CCTC detection methods.
Citations: 0
Evaluating and improving LLM-based competitive program generation
IF 4.3 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-24 DOI: 10.1016/j.infsof.2025.107977
Minnan Wei, Ziming Li, Xiang Chen, Menglin Zheng, Ziyan Qu, Cheng Yu, Siyu Chen, Xiaolin Ju

Context:

Due to the demand for strong algorithmic reasoning, complex logic implementation, and strict adherence to input/output formats and resource constraints, competitive program generation by large language models (LLMs) is considered the most challenging problem in current LLM-based code generation. However, previous studies often evaluate LLMs using simple prompts and benchmark datasets prone to data leakage. Moreover, prior work has given limited consideration to the diversity of algorithm types and difficulty levels.

Objective:

In this study, we aim to evaluate and improve LLMs in solving real-world competitive programming problems.

Methods:

We initially collect 117 problems from nine regional ICPC/CCPC contests held in 2024 and design four filtering criteria to construct a curated benchmark consisting of 80 problems. Leveraging DeepSeek-R1 as the LLM, we evaluate its competitive program generation capabilities through the online judge (OJ) platforms, guided by a carefully designed basic prompt. For incorrect submissions, we construct a fine-grained error taxonomy and then propose a targeted improvement framework by combining a multi-turn dialog-based repair phase and an information-augmented regeneration phase.
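The multi-turn dialog-based repair phase described above can be sketched as a feedback loop; `generate` and `judge` below are hypothetical stand-ins for the LLM call and the online-judge verdict, not the paper's actual interfaces.

```python
from typing import Callable, Optional, Tuple

def solve_with_repair(problem: str,
                      generate: Callable[[str], str],
                      judge: Callable[[str], Tuple[bool, str]],
                      max_turns: int = 3) -> Optional[str]:
    """Multi-turn repair loop: resubmit until accepted, feeding the judge's
    verdict back into the next prompt."""
    prompt = problem
    for _ in range(max_turns):
        code = generate(prompt)
        accepted, verdict = judge(code)
        if accepted:
            return code
        # Feed the verdict (e.g. Wrong Answer, TLE) back so the model can
        # repair its previous attempt in the next turn.
        prompt = (f"{problem}\n\nPrevious attempt:\n{code}\n"
                  f"Verdict: {verdict}\nPlease fix the program.")
    return None

# Stub LLM/judge pair that fails on the first turn and succeeds on the second.
attempts = []
def fake_generate(prompt):
    attempts.append(prompt)
    return "print(1)" if len(attempts) == 1 else "print(2)"

def fake_judge(code):
    ok = code == "print(2)"
    return ok, "Accepted" if ok else "Wrong Answer"

print(solve_with_repair("Output 2.", fake_generate, fake_judge))  # → print(2)
```

The information-augmented regeneration phase would extend this loop by enriching the prompt with external hints (such as the error category from the taxonomy) rather than only the judge's verdict.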

Results:

Experimental results show that only 5 out of 80 problems are fully accepted when using basic prompts. For the unsolved problems, we construct the error taxonomy, including general errors (such as design, boundary, condition, data type, syntax, and input/output errors) and specialized errors (such as those in mathematical problems, greedy algorithms, and graph theory). After applying our proposed improvement strategies, we substantially increased the number of correct solutions, with 46 out of 80 problems successfully accepted.

Conclusion:

Our study highlights the current limitations of LLM-based competitive program generation and outlines promising directions for improving the performance.
Citations: 0
CPMT: A collaborative metamorphic relations and test cases prioritization approach for Metamorphic Testing
IF 4.3 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-21 DOI: 10.1016/j.infsof.2025.107975
Chang-ai Sun, Shifan Liu, An Fu, Jiaming Zhang

Context:

Metamorphic Testing (MT) is a widely adopted software testing technique that addresses the oracle problem by leveraging Metamorphic Relations (MRs). Various test case prioritization (TCP) techniques have been developed to improve fault detection efficiency by scheduling the execution order of test cases. However, these techniques cannot be directly applied to MT due to its unique features, such as the execution of source and follow-up test cases, the application of MRs, and result verification that depends on the availability of corresponding outputs.

Objective:

This study aims to improve the fault detection efficiency of MT by developing a collaborative prioritization approach called CPMT that considers the scheduling of both MRs and test cases.

Methods:

We first formulate the prioritization problem in MT and then propose to schedule the execution of MRs and test cases based on three strategies, which prioritize MRs and test cases with higher fault-detection potential from different perspectives, including their contribution to specification/implementation coverage, the strictness of the output relation, and earlier detection opportunities.
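The coverage-contribution strategy, in its simplest form, is a greedy ordering of (MR, test case) pairs by how much uncovered code each would add. The pairs and coverage sets below are invented for illustration; CPMT additionally weighs output-relation strictness and early-detection opportunities.

```python
# Hypothetical (MR, test case) pairs mapped to the statements they cover.
pairs = {
    ("MR1", "t1"): {1, 2, 3},
    ("MR1", "t2"): {2, 3},
    ("MR2", "t1"): {4, 5},
    ("MR2", "t3"): {3, 4},
}

def prioritize(pairs):
    """Greedily order pairs by additional (not-yet-covered) statements."""
    remaining = dict(pairs)
    covered, order = set(), []
    while remaining:
        # Pick the pair contributing the most new coverage (ties broken
        # deterministically by sorted key order).
        best = max(sorted(remaining), key=lambda k: len(remaining[k] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order

print(prioritize(pairs))
```

Executing high-contribution pairs first tends to exercise more of the program early, which is the intuition behind ranking both MRs and test cases rather than test cases alone.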

Results:

Extensive experiments were conducted on seven subject programs to evaluate the effectiveness of CPMT. The experimental results have demonstrated that the proposed approach significantly improved the fault detection efficiency and outperformed the baseline techniques.

Conclusion:

CPMT provides a promising way to improve the fault detection efficiency of MT.
Citations: 0