ACM Transactions on Software Engineering and Methodology (TOSEM): Latest Articles, Page 4

BiRD: Race Detection in Software Binaries under Relaxed Memory Models

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2022-01-31 DOI: 10.1145/3498538

Ridhi Jain, Rahul Purandare, Subodh Sharma

Instruction reordering and interleavings in program execution under relaxed memory semantics result in non-intuitive behaviors, making it difficult to provide assurances about program correctness. Studies have shown that up to 90% of the concurrency bugs reported by state-of-the-art static analyzers are false alarms. As a result, filtering false alarms and detecting real concurrency bugs is a challenging problem. Unsurprisingly, this problem has attracted the interest of the research community over the past few decades. Nonetheless, many of the existing techniques rely on analyzing source code, rarely consider the effects introduced by compilers, and assume a sequentially consistent memory model. In a practical setting, however, developers often do not have access to the source code, and even commodity architectures such as x86 and ARM are not sequentially consistent. In this work, we present Bird, a prototype tool, to dynamically detect harmful data races in x86 binaries under relaxed memory models, TSO and PSO. Bird employs source-DPOR to explore all distinct feasible interleavings for a multithreaded application. Our evaluation of Bird on 42 publicly available benchmarks and its comparison with the state-of-the-art tools indicate Bird’s potential in effectively detecting data races in software binaries.
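
To make the difference between sequential consistency and a relaxed model concrete, the following minimal sketch enumerates every execution of the classic store-buffering litmus test. It is not Bird's implementation and does not use source-DPOR: it is a brute-force exploration over a simplified TSO model (one FIFO store buffer per thread, with store-to-load forwarding), and the names PROGRAM and outcomes are hypothetical.

```python
from typing import Set, Tuple

# Store-buffering litmus test. Each instruction is
# ("store", variable, value) or ("load", variable, register).
PROGRAM = [
    [("store", "x", 1), ("load", "y", "r0")],  # thread 0
    [("store", "y", 1), ("load", "x", "r1")],  # thread 1
]


def outcomes(tso: bool) -> Set[Tuple[int, int]]:
    """Return all reachable (r0, r1) pairs under SC (tso=False) or simplified TSO."""
    results: Set[Tuple[int, int]] = set()

    def explore(pcs, bufs, mem, regs):
        finished = all(pc == len(PROGRAM[t]) for t, pc in enumerate(pcs))
        if finished and not any(bufs):
            results.add((regs["r0"], regs["r1"]))
            return
        for t in range(len(PROGRAM)):
            # Choice 1: flush the oldest buffered store of thread t to memory.
            if bufs[t]:
                var, val = bufs[t][0]
                new_bufs = list(bufs)
                new_bufs[t] = bufs[t][1:]
                explore(pcs, tuple(new_bufs), {**mem, var: val}, regs)
            # Choice 2: execute the next instruction of thread t.
            if pcs[t] < len(PROGRAM[t]):
                op, var, arg = PROGRAM[t][pcs[t]]
                new_pcs = list(pcs)
                new_pcs[t] += 1
                if op == "store" and tso:
                    # Under TSO the store waits in the thread's own buffer.
                    new_bufs = list(bufs)
                    new_bufs[t] = bufs[t] + ((var, arg),)
                    explore(tuple(new_pcs), tuple(new_bufs), mem, regs)
                elif op == "store":
                    # Under SC the store is visible to everyone immediately.
                    explore(tuple(new_pcs), bufs, {**mem, var: arg}, regs)
                else:
                    # A load reads the newest matching entry in its own buffer,
                    # falling back to shared memory.
                    pending = [v for (b, v) in bufs[t] if b == var]
                    val = pending[-1] if pending else mem[var]
                    explore(tuple(new_pcs), bufs, mem, {**regs, arg: val})

    explore((0, 0), ((), ()), {"x": 0, "y": 0}, {})
    return results


sc = outcomes(tso=False)
tso = outcomes(tso=True)
print(sorted(sc))        # [(0, 1), (1, 0), (1, 1)]
print(sorted(tso - sc))  # [(0, 0)]
```

The outcome r0 = r1 = 0 is unreachable under sequential consistency but appears once stores can sit in a buffer while the other thread loads from memory, which is the kind of behavior a detector that assumes sequential consistency would miss.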



Citations: 1

Do Developers Really Know How to Use Git Commands? A Large-scale Study Using Stack Overflow

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2022-01-31 DOI: 10.1145/3494518

Wenhua Yang, Chong Zhang, Minxue Pan, Chang Xu, Yu Zhou, Zhiqiu Huang

Git, a cross-platform and open source distributed version control tool, provides strong support for non-linear development and is capable of handling everything from small to large projects with speed and efficiency. It has become an indispensable tool for millions of software developers and is the de facto standard of version control in software development nowadays. However, despite its widespread use, developers still frequently face difficulties when using various Git commands to manage projects and collaborate. To better help developers use Git, it is necessary to understand the issues and difficulties that they may encounter when using Git. Unfortunately, this problem has not yet been comprehensively studied. To fill this knowledge gap, in this article, we conduct a large-scale study on Stack Overflow, a popular Q&A forum for developers. We extract and analyze 80,370 relevant questions from Stack Overflow and report the increasing popularity of Git command questions. By analyzing the questions, we identify the Git commands that are frequently asked about and those that are associated with difficult questions on Stack Overflow, to help understand the difficulties developers may encounter when using Git commands. In addition, we conduct a survey to understand how developers learn Git commands in practice, showing that self-learning is the primary learning approach. These findings provide a range of actionable implications for researchers, educators, and developers.



Citations: 2

Towards Robustness of Deep Program Processing Models—Detection, Estimation, and Enhancement

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2022-01-31 DOI: 10.1145/3511887

Huangzhao Zhang, Zhiyi Fu, Ge Li, L. Ma, Zhehao Zhao, Hua’an Yang, Yizhe Sun, Yang Liu, Zhi Jin

Deep learning (DL) has recently been widely applied to diverse source code processing tasks in the software engineering (SE) community, achieving competitive performance (e.g., accuracy). However, robustness, which requires the model to produce consistent decisions given slightly perturbed code inputs, is an important quality indicator that still lacks systematic investigation. This article takes an early step and proposes a framework, CARROT, for robustness detection, measurement, and enhancement of DL models for source code processing. We first propose an optimization-based attack technique, CARROT_A, to generate valid adversarial source code examples effectively and efficiently. Based on this, we define robustness metrics and propose a robustness measurement toolkit, CARROT_M, which employs worst-case performance approximation under the allowable perturbations. We further propose to improve the robustness of DL models by adversarial training (CARROT_T) with our proposed attack techniques. Our in-depth evaluations on three source code processing tasks (i.e., functionality classification, code clone detection, defect prediction) containing more than 3 million lines of code and the classic or SOTA DL models, including GRU, LSTM, ASTNN, LSCNN, TBCNN, CodeBERT, and CDLH, demonstrate the usefulness of our techniques for ❶ effective and efficient adversarial example detection, ❷ tight robustness estimation, and ❸ effective robustness enhancement.



Citations: 18

Detecting and Augmenting Missing Key Aspects in Vulnerability Descriptions

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2022-01-31 DOI: 10.1145/3498537

Hao Guo, Sen Chen, Zhenchang Xing, Xiaohong Li, Yude Bai, Jiamou Sun

Security vulnerabilities have been continually disclosed and documented. For the effective understanding, management, and mitigation of the fast-growing number of vulnerabilities, an important practice in documenting vulnerabilities is to describe the key vulnerability aspects, such as vulnerability type, root cause, affected product, impact, attacker type, and attack vector. In this article, we first investigate 133,639 vulnerability reports in the Common Vulnerabilities and Exposures (CVE) database over the past 20 years. We find that 56%, 85%, 38%, and 28% of CVEs lack the vulnerability type, root cause, attack vector, and attacker type, respectively. By comparing the latest updated CVE reports across different databases, we observe that 1,476 missing key aspects in 1,320 CVE descriptions were augmented manually in the National Vulnerability Database (NVD), which indicates that vulnerability database maintainers try to complete the vulnerability descriptions in practice to mitigate this problem. To help complete the missing information on key vulnerability aspects and reduce human effort, we propose a neural-network-based approach called PMA to predict the missing key aspects of a vulnerability based on its known aspects. We systematically explore the design space of the neural network models and empirically identify the most effective model design for this scenario. Our ablation study reveals prominent correlations among vulnerability aspects during prediction. Trained with historical CVEs, our model achieves F1 scores of 88%, 71%, 61%, and 81% for predicting the missing vulnerability type, root cause, attacker type, and attack vector of 8,623 “future” CVEs across 3 years, respectively. Furthermore, we validate the predictive performance of key aspect augmentation of CVEs based on the manually augmented CVE data collected from NVD, which confirms the practicality of our approach. We finally highlight that PMA can reduce human effort by recommending and augmenting missing key aspects for vulnerability databases, and can facilitate other research such as severity level prediction of CVEs based on the vulnerability descriptions.
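
As a reminder of what the reported F1 figures measure, the sketch below computes F1 as the harmonic mean of precision and recall for a single predicted aspect. The counts are invented for illustration and are not taken from the article, and the abstract does not say which averaging scheme (for example, macro or weighted) the authors use across classes.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 correct predictions, 20 spurious, 20 missed.
print(round(f1_score(tp=80, fp=20, fn=20), 2))  # 0.8
```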



Citations: 10

An Empirical Study of the Impact of Hyperparameter Tuning and Model Optimization on the Performance Properties of Deep Neural Networks

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2022-01-31 DOI: 10.1145/3506695

Lizhi Liao, Heng Li, Weiyi Shang, L. Ma

Deep neural network (DNN) models typically have many hyperparameters that can be configured to achieve optimal performance on a particular dataset. Practitioners usually tune the hyperparameters of their DNN models by training a number of trial models with different configurations of the hyperparameters, to find the optimal hyperparameter configuration that maximizes the training accuracy or minimizes the training loss. As such hyperparameter tuning usually focuses on the model accuracy or the loss function, it is not clear and remains under-explored how the process impacts other performance properties of DNN models, such as inference latency and model size. On the other hand, standard DNN models are often large in size and computing-intensive, prohibiting them from being directly deployed in resource-bounded environments such as mobile devices and Internet of Things (IoT) devices. To tackle this problem, various model optimization techniques (e.g., pruning or quantization) are proposed to make DNN models smaller and less computing-intensive so that they are better suited for resource-bounded environments. However, it is neither clear how the model optimization techniques impact other performance properties of DNN models such as inference latency and battery consumption, nor how the model optimization techniques impact the effect of hyperparameter tuning (i.e., the compounding effect). Therefore, in this paper, we perform a comprehensive study on four representative and widely-adopted DNN models, i.e., CNN image classification, ResNet-50, CNN text classification, and LSTM sentiment classification, to investigate how different DNN model hyperparameters affect the standard DNN models, as well as how hyperparameter tuning combined with model optimization affects the optimized DNN models, in terms of various performance properties (e.g., inference latency or battery consumption). Our empirical results indicate that tuning specific hyperparameters has a heterogeneous impact on the performance of DNN models across different models and different performance properties. In particular, although the top tuned DNN models usually have very similar accuracy, they may have significantly different performance in terms of other aspects (e.g., inference latency). We also observe that model optimization has a confounding effect on the impact of hyperparameters on DNN model performance. For example, two sets of hyperparameters may result in standard models with similar performance but their performance may become significantly different after they are optimized and deployed on the mobile device. Our findings highlight that practitioners can benefit from paying attention to a variety of performance properties and the confounding effect of model optimization when tuning and optimizing their DNN models.



Citations: 20

Analyzing Uncertainty in Release Planning: A Method and Experiment for Fixed-Date Release Cycles

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-12-24 DOI: 10.1145/3490487

Olawole Oni, Emmanuel Letier

Release planning—deciding what features to implement in upcoming releases of a software system—is a critical activity in iterative software development. Many release planning methods exist, but most ignore the inevitable uncertainty in estimating software development effort and business value. The article’s objective is to study whether analyzing uncertainty during release planning generates better release plans than if uncertainty is ignored. To study this question, we have developed a novel release planning method under uncertainty, called BEARS, that models uncertainty using Bayesian probability distributions and recommends release plans that maximize expected net present value and expected punctuality. We then compare release plans recommended by BEARS to those recommended by methods that ignore uncertainty on 32 release planning problems. The experiment shows that BEARS recommends release plans with higher expected net present value and expected punctuality than methods that ignore uncertainty, thereby indicating the harmful effects of ignoring uncertainty during release planning. These results highlight the importance of eliciting and analyzing uncertainty in software effort and value estimations and call for increased research in these areas.
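
The trade-off between expected value and punctuality can be illustrated with a toy calculation. The sketch below is not the BEARS method: it replaces Bayesian distributions with small discrete effort distributions, omits discounting, and assumes a plan delivers its full value only if total effort fits the release capacity. All names and numbers are hypothetical.

```python
from itertools import product
from typing import List, Tuple

# A feature is (business value, [(effort, probability), ...]).
Feature = Tuple[float, List[Tuple[float, float]]]


def evaluate_plan(features: List[Feature], capacity: float) -> Tuple[float, float]:
    """Return (expected value, punctuality) of releasing all features together."""
    total_value = sum(value for value, _ in features)
    expected_value = 0.0
    punctuality = 0.0
    # Enumerate every joint effort scenario, assuming independent features.
    for scenario in product(*(dist for _, dist in features)):
        prob = 1.0
        effort = 0.0
        for e, p in scenario:
            prob *= p
            effort += e
        if effort <= capacity:  # the release ships on the fixed date
            punctuality += prob
            expected_value += prob * total_value
    return expected_value, punctuality


login = (100.0, [(3.0, 0.5), (5.0, 0.5)])
export = (50.0, [(2.0, 0.5), (4.0, 0.5)])

print(evaluate_plan([login], capacity=8.0))          # (100.0, 1.0)
print(evaluate_plan([login, export], capacity=8.0))  # (112.5, 0.75)
```

In this toy setting the larger plan has a higher expected value but is late in one scenario out of four, a distinction that a single point estimate of effort would hide.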



Citations: 1

Guided Feature Identification and Removal for Resource-constrained Firmware

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-12-24 DOI: 10.1145/3487568

Ryan Williams, Tong Ren, Lorenzo De Carli, Long Lu, Gillian Smith

IoT firmware often incorporates third-party components, such as network-oriented middleware and media encoders/decoders. These components consist of large and mature codebases, shipping with a variety of non-critical features. Feature bloat increases code size, complicates auditing/debugging, and reduces stability. This is problematic for IoT devices, which are severely resource-constrained and must remain operational in the field for years. Unfortunately, identifying and completely removing code related to unwanted features requires familiarity with the codebases of interest and cumbersome manual effort, and may introduce bugs. We address these difficulties by introducing PRAT, a system that takes as input the codebase of software of interest, identifies and maps features to code, presents this information to a human analyst, and removes all code belonging to unwanted features. PRAT solves the challenge of identifying feature-related code through a novel form of differential dynamic analysis and visualizes results as user-friendly feature graphs. Evaluation on diverse codebases shows superior code removal compared to both manual feature deactivation and state-of-the-art debloating tools, and generality across programming languages. Furthermore, a user study comparing PRAT to manual code analysis shows that it can significantly simplify the feature identification workflow.
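
The general idea behind differential dynamic analysis can be sketched as a set difference over execution traces: code that runs whenever a feature is exercised, and never otherwise, is a candidate for belonging to that feature. The example below is a minimal illustration under that assumption and is not PRAT's actual algorithm; the function name and trace contents are hypothetical.

```python
from typing import Iterable, Set


def feature_candidates(
    with_feature: Iterable[Set[str]],
    without_feature: Iterable[Set[str]],
) -> Set[str]:
    """Functions hit in every feature run and in no baseline run."""
    runs_with = [set(run) for run in with_feature]
    if not runs_with:
        return set()
    # Keep only functions common to all runs that exercise the feature.
    always_with = set.intersection(*runs_with)
    # Anything seen without the feature is shared code, not feature code.
    baseline = set().union(*without_feature)
    return always_with - baseline


# Hypothetical function-level traces of a networked firmware component.
tls_runs = [
    {"main", "net_init", "tls_handshake", "send"},
    {"main", "net_init", "tls_handshake", "recv"},
]
plain_runs = [
    {"main", "net_init", "send"},
    {"main", "net_init", "recv"},
]

print(feature_candidates(tls_runs, plain_runs))  # {'tls_handshake'}
```

A dynamic approach like this only sees code that the chosen runs happen to execute, which is one reason a human analyst reviewing the result remains useful.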

Citations: 4

How Software Refactoring Impacts Execution Time

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-12-24 DOI: 10.1145/3485136

L. Traini, Daniele Di Pompeo, Michele Tucci, B. Lin, Simone Scalabrino, G. Bavota, Michele Lanza, R. Oliveto, V. Cortellessa

Refactoring aims at improving the maintainability of source code without modifying its external behavior. Previous works proposed approaches to recommend refactoring solutions to software developers. The generation of the recommended solutions is guided by metrics acting as a proxy for maintainability (e.g., the number of code smells removed by the recommended solution). These approaches ignore the impact of the recommended refactorings on other non-functional requirements, such as performance and energy consumption. Little is known about the impact of refactoring operations on non-functional requirements other than maintainability. We aim to fill this gap by presenting the largest study to date investigating the impact of refactoring on software performance, in terms of execution time. We mined the change history of 20 systems that defined performance benchmarks in their repositories, with the goal of identifying commits in which developers implemented refactoring operations impacting code components that are exercised by the performance benchmarks. Through a quantitative and qualitative analysis, we show that refactoring operations can significantly impact the execution time. Indeed, none of the investigated refactoring types can be considered “safe” in ensuring no performance regression. Refactoring types aimed at decomposing complex code entities (e.g., Extract Class/Interface, Extract Method) have higher chances of triggering performance degradation, suggesting that they be applied with care when refactoring performance-critical code.
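To make the kind of comparison concrete, here is a minimal sketch of how one might time a piece of code before and after an Extract Method refactoring. It is an illustration only: the abstract does not list the languages of the 20 systems or describe the benchmarks, and the function names (`total_inline`, `total_extracted`, `_term`, `best_time`) are invented for this example. Whether the extracted version is slower, and by how much, depends on the language, runtime, and workload, so the sketch prints timings rather than asserting a result.

```python
import timeit


def total_inline(values):
    """Original version: the per-item computation is written inline."""
    total = 0
    for v in values:
        total += v * v + 1
    return total


def _term(v):
    """Helper produced by a hypothetical Extract Method refactoring."""
    return v * v + 1


def total_extracted(values):
    """Refactored version: same behavior, one extra call per item."""
    total = 0
    for v in values:
        total += _term(v)
    return total


def best_time(func, values, number=200, repeat=3):
    """Best-of-`repeat` wall-clock time for `number` calls of func(values)."""
    return min(timeit.repeat(lambda: func(values), number=number, repeat=repeat))


values = list(range(1000))

# A refactoring must preserve external behavior, so check that first.
assert total_inline(values) == total_extracted(values)

inline_s = best_time(total_inline, values)
extracted_s = best_time(total_extracted, values)
print(f"inline: {inline_s:.4f}s  extracted: {extracted_s:.4f}s")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; a serious study would use a dedicated benchmarking harness and statistical tests rather than a single comparison.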

Citations: 8

Why Do Developers Reject Refactorings in Open-Source Projects?

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-12-24 DOI: 10.1145/3487062

Michele Lanza, G. Bavota

Refactoring operations are behavior-preserving changes aimed at improving source code quality. While refactoring is largely considered a good practice, refactoring proposals in pull requests are often rejected after the code review. Understanding the reasons behind the rejection of refactoring contributions can shed light on how such contributions can be improved, essentially benefiting software quality. This article reports a study in which we manually coded rejection reasons inferred from 330 refactoring-related pull requests from 207 open-source Java projects. We surveyed 267 developers to assess their perceived prevalence of these identified rejection reasons, further complementing the reasons. Our study resulted in a comprehensive taxonomy consisting of 26 refactoring-related rejection reasons and 21 process-related rejection reasons. The taxonomy, accompanied by representative examples and highlighted implications, provides developers with valuable insights on how to ponder and polish their refactoring contributions, and indicates a number of directions researchers can pursue toward better refactoring recommenders.

Citations: 1

A Practical Approach for Dynamic Taint Tracking with Control-flow Relationships

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-12-24 DOI: 10.1145/3485464

Katherine Hough, Jonathan Bell

Dynamic taint tracking, a technique that traces relationships between values as a program executes, has been used to support a variety of software engineering tasks. Some taint tracking systems only consider data flows and ignore control flows. As a result, relationships between some values are not reflected by the analysis. Many applications of taint tracking either benefit from or rely on these relationships being traced, but past works have found that tracking control flows resulted in over-tainting, dramatically reducing the precision of the taint tracking system. In this article, we introduce Conflux, alternative semantics for propagating taint tags along control flows. Conflux aims to reduce over-tainting by decreasing the scope of control flows and providing a heuristic for reducing loop-related over-tainting. We created a Java implementation of Conflux and performed a case study exploring the effect of Conflux on a concrete application of taint tracking, automated debugging. In addition to this case study, we evaluated Conflux’s accuracy using a novel benchmark consisting of popular, real-world programs. We compared Conflux against existing taint propagation policies, including a state-of-the-art approach for reducing control-flow-related over-tainting, finding that Conflux had the highest F1 score on 43 out of the 48 total tests.
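The difference between data-flow-only and control-flow-aware propagation can be shown with a small sketch. This is not Conflux's semantics (the abstract does not define them, and Conflux is implemented in Java); it is a deliberately naive illustration in which every assignment inside a branch inherits the labels of the branch condition. The class and function names (`Tainted`, `ControlScope`, `to_flag`) are hypothetical.

```python
from contextlib import contextmanager


class Tainted:
    """A value paired with the set of taint labels that influenced it."""

    def __init__(self, value, labels=()):
        self.value = value
        self.labels = frozenset(labels)

    def __add__(self, other):
        # Data flow: the result carries the labels of both operands.
        other = other if isinstance(other, Tainted) else Tainted(other)
        return Tainted(self.value + other.value, self.labels | other.labels)


class ControlScope:
    """Tracks the labels of branch conditions guarding the current code."""

    def __init__(self):
        self.stack = []

    @contextmanager
    def branch(self, condition):
        # Entering a branch pushes the labels of its condition.
        self.stack.append(condition.labels)
        try:
            yield
        finally:
            self.stack.pop()

    def assign(self, value):
        # Control flow: an assignment inherits labels from every open branch.
        value = value if isinstance(value, Tainted) else Tainted(value)
        guard_labels = frozenset().union(*self.stack)
        return Tainted(value.value, value.labels | guard_labels)


def to_flag(secret, scope=None):
    """Copy one bit of `secret` through a branch instead of an assignment."""
    flag = Tainted(0)
    if secret.value == 1:
        if scope is None:
            flag = Tainted(1)  # data-flow-only tracking sees a constant
        else:
            with scope.branch(secret):
                flag = scope.assign(1)
    return flag


secret = Tainted(1, {"user_input"})
data_only = to_flag(secret)
with_control = to_flag(secret, ControlScope())

print(sorted((secret + 5).labels))
print(sorted(data_only.labels))
print(sorted(with_control.labels))
```

With data-flow-only tracking, `flag` ends up untainted even though its value reveals `secret`. The naive control-flow policy catches that relationship, but because it taints every assignment under every open branch, it is also the kind of policy that tends to over-taint in larger programs, which is the problem the abstract says Conflux targets.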

Citations: 7