Tabular requirements assist with the specification of software requirements using an "if-then" paradigm and are supported by many tools. For example, the Requirements Table block in Simulink® supports writing executable specifications that can be used as test oracles to validate an implementation. Even before an implementation is developed, automatically checking the consistency and completeness of a Requirements Table can reveal errors in the specification. Fixing such errors early, rather than in later development cycles, avoids costly rework and the additional testing effort that would otherwise be required. As of version R2022a, Simulink® supports checking completeness and consistency of Requirements Tables only when the requirements are stateless, that is, when they do not constrain behaviors over time. We overcome this limitation by considering Requirements Tables with both stateless and stateful requirements. This paper (i) formally defines the syntax and semantics of Requirements Tables and their completeness and consistency, (ii) proposes eight encodings from two categories (namely, bounded and unbounded) that support stateful requirements, and (iii) implements Theano, a solution that checks completeness and consistency using these encodings. We empirically assess the effectiveness and efficiency of our encodings in checking completeness and consistency on a benchmark of 160 Requirements Tables with a timeout of two hours. Our results show that Theano can check the completeness of all the Requirements Tables in our benchmark and can detect inconsistent Requirements Tables, but it cannot confirm their consistency within the timeout. We also assessed the usefulness of Theano in checking the consistency and completeness of 14 versions of a Requirements Table for a practical example from the automotive domain. Across these 14 versions, Theano effectively detected two inconsistent and five incomplete Requirements Tables, reporting a problem (inconsistency or incompleteness) for 50% (7 out of 14) of the versions of the Requirements Table.
{"title":"Completeness and Consistency of Tabular Requirements: An SMT-Based Verification Approach","authors":"Claudio Menghi;Eugene Balai;Darren Valovcin;Christoph Sticksel;Akshay Rajhans","doi":"10.1109/TSE.2025.3530820","DOIUrl":"10.1109/TSE.2025.3530820","url":null,"abstract":"Tabular requirements assist with the specification of software requirements using an “if-then” paradigm and are supported by many tools. For example, the Requirements Table block in Simulink<sup>®</sup> supports writing executable specifications that can be used as test oracles to validate an implementation. But even before the development of an implementation, automatic checking of consistency and completeness of a Requirements Table can reveal errors in the specification. Fixing such errors earlier than in later development cycles avoids costly rework and additional testing efforts that would be required otherwise. As of version R2022a, Simulink<sup>®</sup> supports checking completeness and consistency of Requirements Tables when the requirements are stateless, that is, do not constrain behaviors over time. We overcome this limitation by considering Requirements Tables with both stateless and stateful requirements. This paper (i) formally defines the syntax and semantics of Requirements Tables, and their completeness and consistency, (ii) proposes eight encodings from two categories (namely, bounded and unbounded) that support stateful requirements, and (iii) implements <small>Theano</small>, a solution supporting checking completeness and consistency using these encodings. We empirically assess the effectiveness and efficiency of our encodings in checking completeness and consistency by considering a benchmark of <inline-formula><tex-math>$160$</tex-math></inline-formula> Requirements Tables for a timeout of two hours. Our results show that <small>Theano</small> can check the completeness of all the Requirements Tables in our benchmark, it can detect the inconsistency of the Requirements Tables, but it can not confirm their consistency within the timeout. We also assessed the usefulness of <small>Theano</small> in checking the consistency and completeness of 14 versions of a Requirements Table for a practical example from the automotive domain. Across these 14 versions, <small>Theano</small> could effectively detect two inconsistent and five incomplete Requirements Tables reporting a problem (inconsistency or incompleteness) for <inline-formula><tex-math>$50%$</tex-math></inline-formula> (7 out of 14) versions of the Requirements Table.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"595-620"},"PeriodicalIF":6.5,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10844918","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142989089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-16 | DOI: 10.1109/TSE.2024.3524461
Yaoxian Li;Shiyi Qi;Cuiyun Gao;Yun Peng;David Lo;Michael R. Lyu;Zenglin Xu
Transformer-based models have demonstrated state-of-the-art performance in various intelligent coding tasks such as code comment generation and code completion. Previous studies show that deep learning models are sensitive to input variations, but few have systematically studied the robustness of Transformers under perturbed input code. In this work, we empirically study the effect of semantic-preserving code transformations on the performance of Transformers. Specifically, we implement 27 and 24 code transformation strategies for two popular programming languages, Java and Python, respectively. To facilitate analysis, the strategies are grouped into five categories: block transformation, insertion/deletion transformation, grammatical statement transformation, grammatical token transformation, and identifier transformation. Experiments on three popular code intelligence tasks, namely code completion, code summarization, and code search, demonstrate that insertion/deletion transformation and identifier transformation have the greatest impact on the performance of Transformers. Our results also suggest that Transformers based on abstract syntax trees (ASTs) are more robust than models based only on code sequences under most code transformations. In addition, the design of the positional encoding can affect the robustness of Transformers under code transformations. We also investigate substantial code transformations at the strategy level to expand our study and explore other factors influencing the robustness of Transformers. Furthermore, we explore applications of code transformations. Based on our findings, we distill insights about the challenges and opportunities for Transformer-based code intelligence from various perspectives.
{"title":"Understanding the Robustness of Transformer-Based Code Intelligence via Code Transformation: Challenges and Opportunities","authors":"Yaoxian Li;Shiyi Qi;Cuiyun Gao;Yun Peng;David Lo;Michael R. Lyu;Zenglin Xu","doi":"10.1109/TSE.2024.3524461","DOIUrl":"10.1109/TSE.2024.3524461","url":null,"abstract":"Transformer-based models have demonstrated state-of-the-art performance in various intelligent coding tasks such as code comment generation and code completion. Previous studies show that deep learning models are sensitive to input variations, but few have systematically studied the robustness of Transformer under perturbed input code. In this work, we empirically study the effect of semantic-preserving code transformations on the performance of Transformers. Specifically, 27 and 24 code transformation strategies are implemented for two popular programming languages, Java and Python, respectively. To facilitating analysis, the strategies are grouped into five categories: block transformation, insertion / deletion transformation, grammatical statement transformation, grammatical token transformation, and identifier transformation. Experiments on three popular code intelligence tasks, including code completion, code summarization, and code search, demonstrate that insertion / deletion transformation and identifier transformation have the greatest impact on the performance of Transformers. Our results also suggest that Transformers based on abstract syntax trees (ASTs) show more robust performance than models based only on code sequences under most code transformations. Besides, the design of positional encoding can impact the robustness of Transformers under code transformations. We also investigate substantial code transformations at the strategy level to expand our study and explore other factors influencing the robustness of Transformers. Furthermore, we explore applications of code transformations. Based on our findings, we distill insights about the challenges and opportunities for Transformer-based code intelligence from various perspectives.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"521-547"},"PeriodicalIF":6.5,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142987459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Logs record crucial information about the runtime status of a software system, which can be used for anomaly detection and fault diagnosis. However, existing techniques struggle to perform effectively when dealing with interleaved logs and entities that influence each other. Although manually specifying a grouping field for each dataset can handle the single-grouping scenario, the problems of multiple and heterogeneous grouping remain unsolved. To break through these limitations, we first design a log semantic association mining approach that converts log sequences into a Log-Entity Graph, and then propose a novel log anomaly detection model named Lograph. The semantic associations implicitly group the logs and sort out complex dependencies between entities, which have been overlooked in the existing literature. A Heterogeneous Graph Attention Network is then used to effectively capture anomalous patterns of both logs and entities, with the Log-Entity Graph serving as a data management and feature engineering module. We evaluate our model on real-world log datasets, comparing it with nine baseline models. The experimental results demonstrate that Lograph improves the accuracy of anomaly detection, especially on datasets where entity relationships are intricate and grouping strategies are not applicable.
{"title":"Anomaly Detection on Interleaved Log Data With Semantic Association Mining on Log-Entity Graph","authors":"Guojun Chu;Jingyu Wang;Qi Qi;Haifeng Sun;Zirui Zhuang;Bo He;Yuhan Jing;Lei Zhang;Jianxin Liao","doi":"10.1109/TSE.2025.3527856","DOIUrl":"10.1109/TSE.2025.3527856","url":null,"abstract":"Logs record crucial information about runtime status of software system, which can be utilized for anomaly detection and fault diagnosis. However, techniques struggle to perform effectively when dealing with interleaved logs and entities that influence each other. Although manually specifying a grouping field for each dataset can handle the single grouping scenario, the problems of multiple and heterogeneous grouping still remain unsolved. To break through these limitations, we first design a log semantic association mining approach to convert log sequences into Log-Entity Graph, and then propose a novel log anomaly detection model named Lograph. The semantic association can be utilized to implicitly group the logs and sort out complex dependencies between entities, which have been overlooked in existing literature. Also, a Heterogeneous Graph Attention Network is utilized to effectively capture anomalous patterns of both logs and entities, where Log-Entity Graph serves as a data management and feature engineering module. We evaluate our model on real-world log datasets, comparing with nine baseline models. The experimental results demonstrate that Lograph can improve the accuracy of anomaly detection, especially on the datasets where entity relationships are intricate and grouping strategies are not applicable.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"581-594"},"PeriodicalIF":6.5,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142974694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-09 | DOI: 10.1109/TSE.2024.3523713
Yuan Huang;Jinbo Huang;Xiangping Chen;Zibin Zheng
Code comments play an important role in program understanding, and a large number of automatic comment generation methods have been proposed in recent years. To generate better comments, many studies extract a variety of information (e.g., code tokens, AST traversal sequences, API call sequences) from source code as model input. In this study, we found that the bytecode compiled from the source code provides useful information for comment generation, so we propose to use bytecode information to assist comment generation. Specifically, we extract the control flow graph (CFG) from the bytecode and propose a serialization method to obtain a CFG sequence that preserves the program structure. We then discuss three methods for introducing bytecode information into different models. We collected 390,000 Java methods from the Maven repository and, after deduplication and preprocessing, created a dataset of 101,124 samples to evaluate our method. The results show that introducing the information extracted from the bytecode improves the BLEU-4 scores of 7 comment generation models.
{"title":"Towards Improving the Performance of Comment Generation Models by Using Bytecode Information","authors":"Yuan Huang;Jinbo Huang;Xiangping Chen;Zibin Zheng","doi":"10.1109/TSE.2024.3523713","DOIUrl":"10.1109/TSE.2024.3523713","url":null,"abstract":"Code comment plays an important role in program understanding, and a large number of automatic comment generation methods have been proposed in recent years. To get a better effect of generating comments, many studies try to extract a variety of information (e.g., code tokens, AST traverse sequence, APIs call sequence) from source code as model input. In this study, we found that the bytecode compiled from the source code can provide useful information for comment generation, hence we propose to use the information from bytecode to assist the comment generation. Specifically, we extract the control flow graph (CFG) from the bytecode and propose a serialization method to obtain the CFG sequence that preserves the program structure. Then, we discuss three methods for introducing bytecode information for different models. We collected 390,000 Java methods from the maven repository, and created a dataset of 101,124 samples after deduplication and preprocessing to evaluate our method. The results show that introducing the information extracted from the bytecode can improve the BLEU-4 of 7 comment generation models.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"503-520"},"PeriodicalIF":6.5,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142940444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-08 | DOI: 10.1109/tse.2025.3526730
Shari Lawrence Pfleeger, Barbara Kitchenham
{"title":"Evidence-based Software Engineering Guidelines Revisited","authors":"Shari Lawrence Pfleeger, Barbara Kitchenham","doi":"10.1109/tse.2025.3526730","DOIUrl":"https://doi.org/10.1109/tse.2025.3526730","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"8 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-07 | DOI: 10.1109/TSE.2025.3525955
Pengzhou Chen;Jingzhi Gong;Tao Chen
To ease the expensive measurements during configuration tuning, it is natural to build a surrogate model as a replacement for the system, so that configuration performance can be evaluated cheaply. Yet, a stereotype therein is that the higher the model accuracy, the better the tuning result, or vice versa. This "accuracy is all" belief drives our research community to build ever more accurate models and to criticize a tuner for the inaccuracy of the model it uses. However, this practice raises some previously unaddressed questions, e.g., are the model and its accuracy really that important for the tuning result? Do the somewhat small accuracy improvements reported in existing work (e.g., a few percent error reduction) really matter much to the tuners? What role does model accuracy play in the impact on tuning quality? To answer these questions, in this paper we conduct one of the largest-scale empirical studies to date, running 24/7 over a period of 13 months and covering 10 models, 17 tuners, and 29 systems from existing work under four commonly used metrics, leading to 13,612 investigated cases. Surprisingly, our key findings reveal that accuracy can lie: there are a considerable number of cases where higher accuracy actually leads to no improvement in the tuning outcome (up to 58% of cases under certain settings) or, even worse, degrades the tuning quality (up to 24% of cases under certain settings). We also discover that the models chosen in most proposed tuners are sub-optimal and that the percentage of accuracy change required to significantly improve tuning quality varies with the range of model accuracy. Drawing on fitness landscape analysis, we provide an in-depth discussion of the rationale behind these results, offering several lessons learned as well as insights for future opportunities. Most importantly, this work sends a clear message to the community: we should take one step back from the natural "accuracy is all" belief for model-based configuration tuning.
{"title":"Accuracy Can Lie: On the Impact of Surrogate Model in Configuration Tuning","authors":"Pengzhou Chen;Jingzhi Gong;Tao Chen","doi":"10.1109/TSE.2025.3525955","DOIUrl":"10.1109/TSE.2025.3525955","url":null,"abstract":"To ease the expensive measurements during configuration tuning, it is natural to build a surrogate model as the replacement of the system, and thereby the configuration performance can be cheaply evaluated. Yet, a stereotype therein is that the higher the model accuracy, the better the tuning result would be, or vice versa. This “accuracy is all” belief drives our research community to build more and more accurate models and criticize a tuner for the inaccuracy of the model used. However, this practice raises some previously unaddressed questions, e.g., are the model and its accuracy really that important for the tuning result? Do those somewhat small accuracy improvements reported (e.g., a few % error reduction) in existing work really matter much to the tuners? What role does model accuracy play in the impact of tuning quality? To answer those related questions, in this paper, we conduct one of the largest-scale empirical studies to date—running over the period of 13 months <inline-formula><tex-math>$24times 7$</tex-math></inline-formula>—that covers 10 models, 17 tuners, and 29 systems from the existing works while under four different commonly used metrics, leading to 13,612 cases of investigation. Surprisingly, our key findings reveal that the accuracy can lie: there are a considerable number of cases where higher accuracy actually leads to no improvement in the tuning outcomes (up to 58% cases under certain setting), or even worse, it can degrade the tuning quality (up to 24% cases under certain setting). We also discover that the chosen models in most proposed tuners are sub-optimal and that the required % of accuracy change to significantly improve tuning quality varies according to the range of model accuracy. Deriving from the fitness landscape analysis, we provide in-depth discussions of the rationale behind, offering several lessons learned as well as insights for future opportunities. Most importantly, this work poses a clear message to the community: we should take one step back from the natural “accuracy is all” belief for model-based configuration tuning.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"548-580"},"PeriodicalIF":6.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10832565","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-03 | DOI: 10.1109/tse.2024.3524947
Marc J. Rochkind
{"title":"A Retrospective on the Source Code Control System","authors":"Marc J. Rochkind","doi":"10.1109/tse.2024.3524947","DOIUrl":"https://doi.org/10.1109/tse.2024.3524947","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"14 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142924713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Complex software systems consist of multiple overlapping design structures, such as abstractions, features, crosscutting concerns, or patterns. This is similar to how a human body has multiple interacting subsystems, such as the respiratory, digestive, or circulatory systems. Unlike in the medical domain, software designers do not have an effective way to distinguish, visualize, comprehend, and analyze these interleaving design structures. As a result, developers often struggle through the maze of source code. In this paper, we present an Automated Concept Explanation (ACE) framework that automatically extracts and categorizes major concepts from source code based on the roles that files play in design structures and their topic frequencies. Based on these categorized concepts, ACE recovers four categories of high-level design models using different algorithms and generates a natural language explanation for each. To assess if and how ACE can help developers better understand design structures, we conducted an empirical study in which two groups of graduate students were assigned three design comprehension tasks in an open-source project: identifying feature-related files, identifying dependencies among features, and identifying the design patterns used. The results reveal that the students who used ACE accomplished these tasks much faster and more accurately, and they acknowledged the usefulness of the categorized concepts and structures, the multi-type high-level model visualization, and the natural language explanations.
{"title":"A Holistic Approach to Design Understanding Through Concept Explanation","authors":"Hongzhou Fang;Yuanfang Cai;Ewan Tempero;Rick Kazman;Yu-Cheng Tu;Jason Lefever;Ernst Pisch","doi":"10.1109/TSE.2024.3522973","DOIUrl":"10.1109/TSE.2024.3522973","url":null,"abstract":"Complex software systems consist of multiple overlapping design structures, such as abstractions, features, crosscutting concerns, or patterns. This is similar to how a human body has multiple interacting subsystems, such as respiratory, digestive, or circulatory. Unlike in the medical domain, software designers do not have an effective way to distinguish, visualize, comprehend, and analyze these interleaving design structures. As a result, developers often struggle through the maze of source code. In this paper, we present an <italic>Automated Concept Explanation</i> (ACE) framework that automatically extracts and categorizes major concepts from source code based on the roles that files play in design structures and their topic frequencies. Based on these categorized concepts, ACE recovers four categories of high-level design models using different algorithms and generates a natural language explanation for each. To assess if and how ACE can help developers better understand design structures, we conducted an empirical study where two groups of graduate students were assigned three design comprehension tasks: identifying feature-related files, identifying dependencies among features, and identifying design patterns used, in an open-source project. The results reveal that the students who used ACE can accomplish these tasks much faster and more accurately, and they acknowledged the usefulness of the categorized concepts and structures, multi-type high-level model visualization, and natural language explanations.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"449-465"},"PeriodicalIF":6.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142911631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | DOI: 10.1109/TSE.2024.3519464
Yuheng Huang;Jiayang Song;Zhijie Wang;Shengming Zhao;Huaming Chen;Felix Juefei-Xu;Lei Ma
The recent performance leap of Large Language Models (LLMs) opens up new opportunities across numerous industrial applications and domains. However, potential erroneous behavior (e.g., the generation of misinformation and hallucinations) has also raised severe concerns about the trustworthiness of LLMs, especially in safety-, security-, and reliability-sensitive industrial scenarios, potentially hindering real-world adoption. While uncertainty estimation has shown its potential for interpreting the prediction risks of classic machine learning (ML) models, the unique characteristics of recent LLMs (e.g., the self-attention mechanism at their core, very large model sizes, and frequent use in generative contexts) pose new challenges for analyzing their behavior. To date, little progress has been made in understanding whether and to what extent uncertainty estimation can help characterize the capability boundary of an LLM and counteract its undesired behavior, which is of great importance given the potentially wide-ranging applications of LLMs across industry domains. To bridge this gap, in this paper we initiate an early exploratory study of the risk assessment of LLMs through the lens of uncertainty. In particular, we conduct a large-scale study with twelve uncertainty estimation methods, applying eight general LLMs to four NLP tasks and seven programming-capable LLMs to two code generation tasks, to investigate to what extent uncertainty estimation techniques can help characterize the prediction risks of LLMs. Our findings confirm the potential of uncertainty estimation for revealing LLMs' uncertain/non-factual predictions. The insights derived from our study can pave the way for more advanced analysis and research on LLMs, ultimately aiming at enhancing their trustworthiness.
{"title":"Look Before You Leap: An Exploratory Study of Uncertainty Analysis for Large Language Models","authors":"Yuheng Huang;Jiayang Song;Zhijie Wang;Shengming Zhao;Huaming Chen;Felix Juefei-Xu;Lei Ma","doi":"10.1109/TSE.2024.3519464","DOIUrl":"10.1109/TSE.2024.3519464","url":null,"abstract":"The recent performance leap of Large Language Models (LLMs) opens up new opportunities across numerous industrial applications and domains. However, the potential erroneous behavior (e.g., the generation of misinformation and hallucination) has also raised severe concerns for the trustworthiness of LLMs, especially in safety-, security- and reliability-sensitive industrial scenarios, potentially hindering real-world adoptions. While uncertainty estimation has shown its potential for interpreting the prediction risks made by classic machine learning (ML) models, the unique characteristics of recent LLMs (e.g., adopting self-attention mechanism as its core, very large-scale model size, often used in generative contexts) pose new challenges for the behavior analysis of LLMs. Up to the present, little progress has been made to better understand whether and to what extent uncertainty estimation can help characterize the capability boundary of an LLM, to counteract its undesired behavior, which is considered to be of great importance with the potential wide-range applications of LLMs across industry domains. To bridge the gap, in this paper, we initiate an early exploratory study of the risk assessment of LLMs from the lens of uncertainty. In particular, we conduct a large-scale study with as many as twelve uncertainty estimation methods and eight general LLMs on four NLP tasks and seven programming-capable LLMs on two code generation tasks to investigate to what extent uncertainty estimation techniques could help characterize the prediction risks of LLMs. Our findings confirm the potential of uncertainty estimation for revealing LLMs’ uncertain/non-factual predictions. The insights derived from our study can pave the way for more advanced analysis and research on LLMs, ultimately aiming at enhancing their trustworthiness.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"413-429"},"PeriodicalIF":6.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142911630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}