Tabular requirements assist with the specification of software requirements using an "if-then" paradigm and are supported by many tools. For example, the Requirements Table block in Simulink® supports writing executable specifications that can be used as test oracles to validate an implementation. Even before an implementation is developed, automatically checking the consistency and completeness of a Requirements Table can reveal errors in the specification. Fixing such errors early, rather than in later development cycles, avoids costly rework and the additional testing effort that would otherwise be required. As of version R2022a, Simulink® supports checking completeness and consistency of Requirements Tables only when the requirements are stateless, that is, when they do not constrain behaviors over time. We overcome this limitation by considering Requirements Tables with both stateless and stateful requirements. This paper (i) formally defines the syntax and semantics of Requirements Tables and their completeness and consistency, (ii) proposes eight encodings from two categories (namely, bounded and unbounded) that support stateful requirements, and (iii) implements Theano, a solution that checks completeness and consistency using these encodings. We empirically assess the effectiveness and efficiency of our encodings in checking completeness and consistency on a benchmark of 160 Requirements Tables with a timeout of two hours. Our results show that Theano can check the completeness of all the Requirements Tables in our benchmark and can detect inconsistent Requirements Tables, but it cannot confirm their consistency within the timeout. We also assessed the usefulness of Theano in checking the consistency and completeness of 14 versions of a Requirements Table for a practical example from the automotive domain. Across these 14 versions, Theano effectively detected two inconsistent and five incomplete Requirements Tables, reporting a problem (inconsistency or incompleteness) for 50% (7 out of 14) of the versions of the Requirements Table.
{"title":"Completeness and Consistency of Tabular Requirements: An SMT-Based Verification Approach","authors":"Claudio Menghi;Eugene Balai;Darren Valovcin;Christoph Sticksel;Akshay Rajhans","doi":"10.1109/TSE.2025.3530820","DOIUrl":"10.1109/TSE.2025.3530820","url":null,"abstract":"Tabular requirements assist with the specification of software requirements using an “if-then” paradigm and are supported by many tools. For example, the Requirements Table block in Simulink<sup>®</sup> supports writing executable specifications that can be used as test oracles to validate an implementation. But even before the development of an implementation, automatic checking of consistency and completeness of a Requirements Table can reveal errors in the specification. Fixing such errors earlier than in later development cycles avoids costly rework and additional testing efforts that would be required otherwise. As of version R2022a, Simulink<sup>®</sup> supports checking completeness and consistency of Requirements Tables when the requirements are stateless, that is, do not constrain behaviors over time. We overcome this limitation by considering Requirements Tables with both stateless and stateful requirements. This paper (i) formally defines the syntax and semantics of Requirements Tables, and their completeness and consistency, (ii) proposes eight encodings from two categories (namely, bounded and unbounded) that support stateful requirements, and (iii) implements <small>Theano</small>, a solution supporting checking completeness and consistency using these encodings. We empirically assess the effectiveness and efficiency of our encodings in checking completeness and consistency by considering a benchmark of <inline-formula><tex-math>$160$</tex-math></inline-formula> Requirements Tables for a timeout of two hours. Our results show that <small>Theano</small> can check the completeness of all the Requirements Tables in our benchmark, it can detect the inconsistency of the Requirements Tables, but it can not confirm their consistency within the timeout. We also assessed the usefulness of <small>Theano</small> in checking the consistency and completeness of 14 versions of a Requirements Table for a practical example from the automotive domain. Across these 14 versions, <small>Theano</small> could effectively detect two inconsistent and five incomplete Requirements Tables reporting a problem (inconsistency or incompleteness) for <inline-formula><tex-math>$50%$</tex-math></inline-formula> (7 out of 14) versions of the Requirements Table.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"595-620"},"PeriodicalIF":6.5,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10844918","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142989089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-16 | DOI: 10.1109/TSE.2024.3524461
Yaoxian Li;Shiyi Qi;Cuiyun Gao;Yun Peng;David Lo;Michael R. Lyu;Zenglin Xu
Transformer-based models have demonstrated state-of-the-art performance in various intelligent coding tasks such as code comment generation and code completion. Previous studies show that deep learning models are sensitive to input variations, but few have systematically studied the robustness of Transformers under perturbed input code. In this work, we empirically study the effect of semantic-preserving code transformations on the performance of Transformers. Specifically, we implement 27 and 24 code transformation strategies for two popular programming languages, Java and Python, respectively. To facilitate analysis, the strategies are grouped into five categories: block transformation, insertion/deletion transformation, grammatical statement transformation, grammatical token transformation, and identifier transformation. Experiments on three popular code intelligence tasks, namely code completion, code summarization, and code search, demonstrate that insertion/deletion transformation and identifier transformation have the greatest impact on the performance of Transformers. Our results also suggest that Transformers based on abstract syntax trees (ASTs) are more robust than models based only on code sequences under most code transformations. In addition, the design of the positional encoding can affect the robustness of Transformers under code transformations. We also investigate substantial code transformations at the strategy level to expand our study and explore other factors influencing the robustness of Transformers. Furthermore, we explore applications of code transformations. Based on our findings, we distill insights about the challenges and opportunities for Transformer-based code intelligence from various perspectives.
{"title":"Understanding the Robustness of Transformer-Based Code Intelligence via Code Transformation: Challenges and Opportunities","authors":"Yaoxian Li;Shiyi Qi;Cuiyun Gao;Yun Peng;David Lo;Michael R. Lyu;Zenglin Xu","doi":"10.1109/TSE.2024.3524461","DOIUrl":"10.1109/TSE.2024.3524461","url":null,"abstract":"Transformer-based models have demonstrated state-of-the-art performance in various intelligent coding tasks such as code comment generation and code completion. Previous studies show that deep learning models are sensitive to input variations, but few have systematically studied the robustness of Transformer under perturbed input code. In this work, we empirically study the effect of semantic-preserving code transformations on the performance of Transformers. Specifically, 27 and 24 code transformation strategies are implemented for two popular programming languages, Java and Python, respectively. To facilitating analysis, the strategies are grouped into five categories: block transformation, insertion / deletion transformation, grammatical statement transformation, grammatical token transformation, and identifier transformation. Experiments on three popular code intelligence tasks, including code completion, code summarization, and code search, demonstrate that insertion / deletion transformation and identifier transformation have the greatest impact on the performance of Transformers. Our results also suggest that Transformers based on abstract syntax trees (ASTs) show more robust performance than models based only on code sequences under most code transformations. Besides, the design of positional encoding can impact the robustness of Transformers under code transformations. We also investigate substantial code transformations at the strategy level to expand our study and explore other factors influencing the robustness of Transformers. Furthermore, we explore applications of code transformations. Based on our findings, we distill insights about the challenges and opportunities for Transformer-based code intelligence from various perspectives.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"521-547"},"PeriodicalIF":6.5,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142987459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Logs record crucial information about the runtime status of a software system, which can be used for anomaly detection and fault diagnosis. However, existing techniques struggle to perform effectively when dealing with interleaved logs and entities that influence each other. Although manually specifying a grouping field for each dataset can handle the single-grouping scenario, the problems of multiple and heterogeneous grouping remain unsolved. To break through these limitations, we first design a log semantic association mining approach that converts log sequences into a Log-Entity Graph, and then propose a novel log anomaly detection model named Lograph. The semantic associations implicitly group the logs and sort out complex dependencies between entities, which have been overlooked in the existing literature. A Heterogeneous Graph Attention Network is then used to effectively capture anomalous patterns of both logs and entities, with the Log-Entity Graph serving as a data management and feature engineering module. We evaluate our model on real-world log datasets, comparing it with nine baseline models. The experimental results demonstrate that Lograph improves the accuracy of anomaly detection, especially on datasets where entity relationships are intricate and grouping strategies are not applicable.
{"title":"Anomaly Detection on Interleaved Log Data With Semantic Association Mining on Log-Entity Graph","authors":"Guojun Chu;Jingyu Wang;Qi Qi;Haifeng Sun;Zirui Zhuang;Bo He;Yuhan Jing;Lei Zhang;Jianxin Liao","doi":"10.1109/TSE.2025.3527856","DOIUrl":"10.1109/TSE.2025.3527856","url":null,"abstract":"Logs record crucial information about runtime status of software system, which can be utilized for anomaly detection and fault diagnosis. However, techniques struggle to perform effectively when dealing with interleaved logs and entities that influence each other. Although manually specifying a grouping field for each dataset can handle the single grouping scenario, the problems of multiple and heterogeneous grouping still remain unsolved. To break through these limitations, we first design a log semantic association mining approach to convert log sequences into Log-Entity Graph, and then propose a novel log anomaly detection model named Lograph. The semantic association can be utilized to implicitly group the logs and sort out complex dependencies between entities, which have been overlooked in existing literature. Also, a Heterogeneous Graph Attention Network is utilized to effectively capture anomalous patterns of both logs and entities, where Log-Entity Graph serves as a data management and feature engineering module. We evaluate our model on real-world log datasets, comparing with nine baseline models. The experimental results demonstrate that Lograph can improve the accuracy of anomaly detection, especially on the datasets where entity relationships are intricate and grouping strategies are not applicable.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"581-594"},"PeriodicalIF":6.5,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142974694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-09 | DOI: 10.1109/TSE.2024.3523713
Yuan Huang;Jinbo Huang;Xiangping Chen;Zibin Zheng
Code comments play an important role in program understanding, and a large number of automatic comment generation methods have been proposed in recent years. To generate better comments, many studies extract a variety of information (e.g., code tokens, AST traversal sequences, API call sequences) from source code as model input. In this study, we found that the bytecode compiled from the source code provides useful information for comment generation, so we propose to use bytecode information to assist comment generation. Specifically, we extract the control flow graph (CFG) from the bytecode and propose a serialization method to obtain a CFG sequence that preserves the program structure. We then discuss three methods for introducing bytecode information into different models. We collected 390,000 Java methods from the Maven repository and, after deduplication and preprocessing, created a dataset of 101,124 samples to evaluate our method. The results show that introducing the information extracted from the bytecode improves the BLEU-4 scores of 7 comment generation models.
{"title":"Towards Improving the Performance of Comment Generation Models by Using Bytecode Information","authors":"Yuan Huang;Jinbo Huang;Xiangping Chen;Zibin Zheng","doi":"10.1109/TSE.2024.3523713","DOIUrl":"10.1109/TSE.2024.3523713","url":null,"abstract":"Code comment plays an important role in program understanding, and a large number of automatic comment generation methods have been proposed in recent years. To get a better effect of generating comments, many studies try to extract a variety of information (e.g., code tokens, AST traverse sequence, APIs call sequence) from source code as model input. In this study, we found that the bytecode compiled from the source code can provide useful information for comment generation, hence we propose to use the information from bytecode to assist the comment generation. Specifically, we extract the control flow graph (CFG) from the bytecode and propose a serialization method to obtain the CFG sequence that preserves the program structure. Then, we discuss three methods for introducing bytecode information for different models. We collected 390,000 Java methods from the maven repository, and created a dataset of 101,124 samples after deduplication and preprocessing to evaluate our method. The results show that introducing the information extracted from the bytecode can improve the BLEU-4 of 7 comment generation models.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"503-520"},"PeriodicalIF":6.5,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142940444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-08 | DOI: 10.1109/tse.2025.3526730
Shari Lawrence Pfleeger, Barbara Kitchenham
{"title":"Evidence-based Software Engineering Guidelines Revisited","authors":"Shari Lawrence Pfleeger, Barbara Kitchenham","doi":"10.1109/tse.2025.3526730","DOIUrl":"https://doi.org/10.1109/tse.2025.3526730","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"8 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-07 | DOI: 10.1109/TSE.2025.3525955
Pengzhou Chen;Jingzhi Gong;Tao Chen
To ease the expensive measurements during configuration tuning, it is natural to build a surrogate model as a replacement for the system, so that configuration performance can be evaluated cheaply. Yet, a stereotype therein is that the higher the model accuracy, the better the tuning result, or vice versa. This "accuracy is all" belief drives our research community to build ever more accurate models and to criticize a tuner for the inaccuracy of the model it uses. However, this practice raises some previously unaddressed questions, e.g., are the model and its accuracy really that important for the tuning result? Do the somewhat small accuracy improvements reported in existing work (e.g., a few percent error reduction) really matter much to the tuners? What role does model accuracy play in the impact on tuning quality? To answer these questions, in this paper we conduct one of the largest-scale empirical studies to date, running 24/7 over a period of 13 months and covering 10 models, 17 tuners, and 29 systems from existing work under four commonly used metrics, leading to 13,612 investigated cases. Surprisingly, our key findings reveal that accuracy can lie: there are a considerable number of cases where higher accuracy actually leads to no improvement in the tuning outcome (up to 58% of cases under certain settings) or, even worse, degrades the tuning quality (up to 24% of cases under certain settings). We also discover that the models chosen in most proposed tuners are sub-optimal and that the percentage of accuracy change required to significantly improve tuning quality varies with the range of model accuracy. Drawing on fitness landscape analysis, we provide an in-depth discussion of the rationale behind these results, offering several lessons learned as well as insights for future opportunities. Most importantly, this work sends a clear message to the community: we should take one step back from the natural "accuracy is all" belief for model-based configuration tuning.
{"title":"Accuracy Can Lie: On the Impact of Surrogate Model in Configuration Tuning","authors":"Pengzhou Chen;Jingzhi Gong;Tao Chen","doi":"10.1109/TSE.2025.3525955","DOIUrl":"10.1109/TSE.2025.3525955","url":null,"abstract":"To ease the expensive measurements during configuration tuning, it is natural to build a surrogate model as the replacement of the system, and thereby the configuration performance can be cheaply evaluated. Yet, a stereotype therein is that the higher the model accuracy, the better the tuning result would be, or vice versa. This “accuracy is all” belief drives our research community to build more and more accurate models and criticize a tuner for the inaccuracy of the model used. However, this practice raises some previously unaddressed questions, e.g., are the model and its accuracy really that important for the tuning result? Do those somewhat small accuracy improvements reported (e.g., a few % error reduction) in existing work really matter much to the tuners? What role does model accuracy play in the impact of tuning quality? To answer those related questions, in this paper, we conduct one of the largest-scale empirical studies to date—running over the period of 13 months <inline-formula><tex-math>$24times 7$</tex-math></inline-formula>—that covers 10 models, 17 tuners, and 29 systems from the existing works while under four different commonly used metrics, leading to 13,612 cases of investigation. Surprisingly, our key findings reveal that the accuracy can lie: there are a considerable number of cases where higher accuracy actually leads to no improvement in the tuning outcomes (up to 58% cases under certain setting), or even worse, it can degrade the tuning quality (up to 24% cases under certain setting). We also discover that the chosen models in most proposed tuners are sub-optimal and that the required % of accuracy change to significantly improve tuning quality varies according to the range of model accuracy. Deriving from the fitness landscape analysis, we provide in-depth discussions of the rationale behind, offering several lessons learned as well as insights for future opportunities. Most importantly, this work poses a clear message to the community: we should take one step back from the natural “accuracy is all” belief for model-based configuration tuning.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"548-580"},"PeriodicalIF":6.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10832565","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-03 | DOI: 10.1109/tse.2024.3524947
Marc J. Rochkind
{"title":"A Retrospective on the Source Code Control System","authors":"Marc J. Rochkind","doi":"10.1109/tse.2024.3524947","DOIUrl":"https://doi.org/10.1109/tse.2024.3524947","url":null,"abstract":"","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"14 1","pages":""},"PeriodicalIF":7.4,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142924713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Complex software systems consist of multiple overlapping design structures, such as abstractions, features, crosscutting concerns, or patterns. This is similar to how a human body has multiple interacting subsystems, such as the respiratory, digestive, or circulatory systems. Unlike in the medical domain, software designers do not have an effective way to distinguish, visualize, comprehend, and analyze these interleaving design structures. As a result, developers often struggle through the maze of source code. In this paper, we present an Automated Concept Explanation (ACE) framework that automatically extracts and categorizes major concepts from source code based on the roles that files play in design structures and their topic frequencies. Based on these categorized concepts, ACE recovers four categories of high-level design models using different algorithms and generates a natural language explanation for each. To assess if and how ACE can help developers better understand design structures, we conducted an empirical study in which two groups of graduate students were assigned three design comprehension tasks in an open-source project: identifying feature-related files, identifying dependencies among features, and identifying the design patterns used. The results reveal that the students who used ACE accomplished these tasks much faster and more accurately, and they acknowledged the usefulness of the categorized concepts and structures, the multi-type high-level model visualization, and the natural language explanations.
{"title":"A Holistic Approach to Design Understanding Through Concept Explanation","authors":"Hongzhou Fang;Yuanfang Cai;Ewan Tempero;Rick Kazman;Yu-Cheng Tu;Jason Lefever;Ernst Pisch","doi":"10.1109/TSE.2024.3522973","DOIUrl":"10.1109/TSE.2024.3522973","url":null,"abstract":"Complex software systems consist of multiple overlapping design structures, such as abstractions, features, crosscutting concerns, or patterns. This is similar to how a human body has multiple interacting subsystems, such as respiratory, digestive, or circulatory. Unlike in the medical domain, software designers do not have an effective way to distinguish, visualize, comprehend, and analyze these interleaving design structures. As a result, developers often struggle through the maze of source code. In this paper, we present an <italic>Automated Concept Explanation</i> (ACE) framework that automatically extracts and categorizes major concepts from source code based on the roles that files play in design structures and their topic frequencies. Based on these categorized concepts, ACE recovers four categories of high-level design models using different algorithms and generates a natural language explanation for each. To assess if and how ACE can help developers better understand design structures, we conducted an empirical study where two groups of graduate students were assigned three design comprehension tasks: identifying feature-related files, identifying dependencies among features, and identifying design patterns used, in an open-source project. The results reveal that the students who used ACE can accomplish these tasks much faster and more accurately, and they acknowledged the usefulness of the categorized concepts and structures, multi-type high-level model visualization, and natural language explanations.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"449-465"},"PeriodicalIF":6.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142911631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | DOI: 10.1109/TSE.2024.3519464
Yuheng Huang;Jiayang Song;Zhijie Wang;Shengming Zhao;Huaming Chen;Felix Juefei-Xu;Lei Ma
The recent performance leap of Large Language Models (LLMs) opens up new opportunities across numerous industrial applications and domains. However, potential erroneous behavior (e.g., the generation of misinformation and hallucinations) has also raised severe concerns about the trustworthiness of LLMs, especially in safety-, security-, and reliability-sensitive industrial scenarios, potentially hindering real-world adoption. While uncertainty estimation has shown its potential for interpreting the prediction risks of classic machine learning (ML) models, the unique characteristics of recent LLMs (e.g., the self-attention mechanism at their core, very large model sizes, and frequent use in generative contexts) pose new challenges for analyzing their behavior. To date, little progress has been made in understanding whether and to what extent uncertainty estimation can help characterize the capability boundary of an LLM and counteract its undesired behavior, which is of great importance given the potentially wide-ranging applications of LLMs across industry domains. To bridge this gap, in this paper we initiate an early exploratory study of the risk assessment of LLMs through the lens of uncertainty. In particular, we conduct a large-scale study with twelve uncertainty estimation methods, applying eight general LLMs to four NLP tasks and seven programming-capable LLMs to two code generation tasks, to investigate to what extent uncertainty estimation techniques can help characterize the prediction risks of LLMs. Our findings confirm the potential of uncertainty estimation for revealing LLMs' uncertain/non-factual predictions. The insights derived from our study can pave the way for more advanced analysis and research on LLMs, ultimately aiming at enhancing their trustworthiness.
{"title":"Look Before You Leap: An Exploratory Study of Uncertainty Analysis for Large Language Models","authors":"Yuheng Huang;Jiayang Song;Zhijie Wang;Shengming Zhao;Huaming Chen;Felix Juefei-Xu;Lei Ma","doi":"10.1109/TSE.2024.3519464","DOIUrl":"10.1109/TSE.2024.3519464","url":null,"abstract":"The recent performance leap of Large Language Models (LLMs) opens up new opportunities across numerous industrial applications and domains. However, the potential erroneous behavior (e.g., the generation of misinformation and hallucination) has also raised severe concerns for the trustworthiness of LLMs, especially in safety-, security- and reliability-sensitive industrial scenarios, potentially hindering real-world adoptions. While uncertainty estimation has shown its potential for interpreting the prediction risks made by classic machine learning (ML) models, the unique characteristics of recent LLMs (e.g., adopting self-attention mechanism as its core, very large-scale model size, often used in generative contexts) pose new challenges for the behavior analysis of LLMs. Up to the present, little progress has been made to better understand whether and to what extent uncertainty estimation can help characterize the capability boundary of an LLM, to counteract its undesired behavior, which is considered to be of great importance with the potential wide-range applications of LLMs across industry domains. To bridge the gap, in this paper, we initiate an early exploratory study of the risk assessment of LLMs from the lens of uncertainty. In particular, we conduct a large-scale study with as many as twelve uncertainty estimation methods and eight general LLMs on four NLP tasks and seven programming-capable LLMs on two code generation tasks to investigate to what extent uncertainty estimation techniques could help characterize the prediction risks of LLMs. Our findings confirm the potential of uncertainty estimation for revealing LLMs’ uncertain/non-factual predictions. The insights derived from our study can pave the way for more advanced analysis and research on LLMs, ultimately aiming at enhancing their trustworthiness.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 2","pages":"413-429"},"PeriodicalIF":6.5,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142911630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}