ACM Transactions on Software Engineering and Methodology (TOSEM): Latest Articles, Page 10

ConE: A Concurrent Edit Detection Tool for Large-scale Software Development

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-01-16 DOI: 10.1145/3478019

C. Maddila, Nachiappan Nagappan, C. Bird, Georgios Gousios, A. van Deursen

Modern, complex software systems are being continuously extended and adjusted. The developers responsible for this may come from different teams or organizations, and may be distributed over the world. This may make it difficult to keep track of what other developers are doing, which may result in multiple developers concurrently editing the same code areas. This, in turn, may lead to hard-to-merge changes or even merge conflicts, logical bugs that are difficult to detect, duplication of work, and wasted developer productivity. To address this, we explore the extent of this problem in the pull-request-based software development model. We study half a year of changes made to six large repositories in Microsoft in which at least 1,000 pull requests are created each month. We find that files concurrently edited in different pull requests are more likely to introduce bugs. Motivated by these findings, we design, implement, and deploy a service named Concurrent Edit Detector (ConE) that proactively detects pull requests containing concurrent edits, to help mitigate the problems caused by them. ConE has been designed to scale, and to minimize false alarms while still flagging relevant concurrently edited files. Key concepts of ConE include the detection of the Extent of Overlap between pull requests, and the identification of Rarely Concurrently Edited Files. To evaluate ConE, we report on its operational deployment on 234 repositories inside Microsoft. ConE assessed 26,000 pull requests and made 775 recommendations about conflicting changes, which were rated as useful in over 70% (554) of the cases. From interviews with 48 users, we learned that they believed ConE would save time in conflict resolution and avoiding duplicate work, and that over 90% intend to keep using the service on a daily basis.
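The abstract names the Extent of Overlap between pull requests without defining it. As a rough illustration only, the sketch below assumes a hypothetical definition: the share of one pull request's files that another active pull request also edits. The function names, the 0.5 threshold, and the sample data are assumptions made for this example and are not taken from ConE.

```python
def extent_of_overlap(pr_files, other_pr_files):
    """Share of files in one pull request that another active pull request also edits."""
    pr_files, other_pr_files = set(pr_files), set(other_pr_files)
    if not pr_files:
        return 0.0
    return len(pr_files & other_pr_files) / len(pr_files)


def flag_concurrent_edits(active_prs, threshold=0.5):
    """Return (pr_a, pr_b, overlap) for ordered pairs whose overlap meets the threshold."""
    flagged = []
    for a, files_a in active_prs.items():
        for b, files_b in active_prs.items():
            if a == b:
                continue
            overlap = extent_of_overlap(files_a, files_b)
            if overlap >= threshold:
                flagged.append((a, b, overlap))
    return flagged


# Hypothetical active pull requests and the files each one touches.
active = {
    "PR-1": ["src/auth.py", "src/db.py"],
    "PR-2": ["src/auth.py", "docs/readme.md", "src/ui.py", "src/api.py"],
    "PR-3": ["tests/test_ui.py"],
}
flagged = flag_concurrent_edits(active)
```

The measure in this sketch is deliberately asymmetric: half of PR-1's files are also edited by PR-2, but only a quarter of PR-2's files are edited by PR-1, so only the author of PR-1 would be notified at this threshold.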


Citations: 3

Deep Reinforcement Learning for Black-box Testing of Android Apps

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-01-07 DOI: 10.1145/3502868

Andrea Romdhana, A. Merlo, M. Ceccato, P. Tonella

The state space of Android apps is huge, and its thorough exploration during testing remains a significant challenge. The best exploration strategy is highly dependent on the features of the app under test. Reinforcement Learning (RL) is a machine learning technique that learns the optimal strategy to solve a task by trial and error, guided by positive or negative reward, rather than explicit supervision. Deep RL is a recent extension of RL that takes advantage of the learning capabilities of neural networks. Such capabilities make Deep RL suitable for complex exploration spaces such as that of Android apps. However, state-of-the-art, publicly available tools only support basic, Tabular RL. We have developed ARES, a Deep RL approach for black-box testing of Android apps. Experimental results show that it achieves higher coverage and fault revelation than the baselines, including state-of-the-art tools, such as TimeMachine and Q-Testing. We also investigated the reasons behind such performance qualitatively, and we have identified the key features of Android apps that make Deep RL particularly effective on them to be the presence of chained and blocking activities. Moreover, we have developed FATE to fine-tune the hyperparameters of Deep RL algorithms on simulated apps, since it is computationally expensive to carry it out on real apps.
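For readers unfamiliar with the Tabular RL that the abstract contrasts with Deep RL, the following is a minimal sketch of textbook tabular Q-learning on a toy chain of states. It is not ARES or any tool named above; the environment, reward, and hyperparameter values are assumptions chosen only to keep the example small.

```python
import random


def q_learning(n_states=4, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a Q-table for a chain of states where only the last state is rewarding."""
    rng = random.Random(seed)
    actions = (0, 1)  # 0 = stay in place, 1 = move one state forward
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        state = 0
        while state != n_states - 1:
            # Epsilon-greedy choice: explore occasionally, otherwise exploit the table.
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state = state + action
            reward = 1.0 if next_state == n_states - 1 else 0.0
            best_next = max(q[(next_state, a)] for a in actions)
            # Standard Q-learning update toward reward plus discounted best next value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = [max((0, 1), key=lambda a: q_table[(s, a)]) for s in range(3)]
```

The table holds one entry per state-action pair, which is why this form of RL stops scaling once the state space grows as large as the one the abstract describes; Deep RL replaces the table with a neural network.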


Citations: 35

Enabledness-based Testing of Object Protocols

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-01-03 DOI: 10.1145/3415153

Javier Godoy, Juan P. Galeotti, D. Garbervetsky, Sebastián Uchitel

A significant proportion of classes in modern software introduce or use object protocols, prescriptions on the temporal orderings of method calls on objects. This article studies search-based test generation techniques that aim to exploit a particular abstraction of object protocols (enabledness preserving abstractions (EPAs)) to find failures. We define coverage criteria over an extension of EPAs that includes abnormal method termination and define a search-based test case generation technique aimed at achieving high coverage. Results suggest that the proposed test case generation technique with a fitness function that aims at combined structural and extended EPA coverage can provide better failure-detection capabilities not only for protocol failures but also for general failures when compared to random testing and search-based test generation for standard structural coverage.
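The idea behind an enabledness-based abstraction can be illustrated with a small sketch: concrete object states are grouped by the set of methods whose preconditions currently hold. The `BoundedStack` class and the helper names below are hypothetical, and the sketch leaves out the abnormal-termination extension the abstract mentions.

```python
class BoundedStack:
    """Toy class with a protocol: push needs free space, pop needs an element."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def push(self, x):
        if len(self.items) >= self.capacity:
            raise RuntimeError("stack is full")
        self.items.append(x)

    def pop(self):
        if not self.items:
            raise RuntimeError("stack is empty")
        return self.items.pop()


def enabled_methods(stack):
    """Abstract state: the set of methods that can currently be called."""
    enabled = set()
    if len(stack.items) < stack.capacity:
        enabled.add("push")
    if stack.items:
        enabled.add("pop")
    return frozenset(enabled)


def covered_transitions(stack, calls):
    """Run a call sequence and record (abstract state, method, abstract state) triples."""
    covered = set()
    for name, args in calls:
        before = enabled_methods(stack)
        getattr(stack, name)(*args)
        covered.add((before, name, enabled_methods(stack)))
    return covered


covered = covered_transitions(BoundedStack(2), [("push", (1,)), ("push", (2,)), ("pop", ())])
```

A coverage criterion over such an abstraction would count how many of the possible triples a test suite exercises; the three-call sequence above covers three of them.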

Citations: 2

Adversarial Specification Mining

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-01-03 DOI: 10.1145/3424307

Hong Jin Kang, D. Lo

There have been numerous studies on mining temporal specifications from execution traces. These approaches learn finite-state automata (FSA) from execution traces when running tests. To learn accurate specifications of a software system, many tests are required. Existing approaches generalize from a limited number of traces or use simple test generation strategies. Unfortunately, these strategies may not exercise uncommon usage patterns of a software system. To address this problem, we propose a new approach, adversarial specification mining, and develop a prototype, Diversity through Counter-examples (DICE). DICE has two components: DICE-Tester and DICE-Miner. After mining Linear Temporal Logic specifications from an input test suite, DICE-Tester adversarially guides test generation, searching for counterexamples to these specifications to invalidate spurious properties. These counterexamples represent gaps in the diversity of the input test suite. This process produces execution traces of usage patterns that were unrepresented in the input test suite. Next, we propose a new specification inference algorithm, DICE-Miner, to infer FSAs using the traces, guided by the temporal specifications. We find that the inferred specifications are of higher quality than those produced by existing state-of-the-art specification miners. Finally, we use the FSAs in a fuzzer for servers of stateful protocols, increasing its coverage.
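The role of counterexamples can be sketched with a single temporal property. The code below checks a response pattern ("every trigger event is eventually followed by a response event") against execution traces and returns the first trace that violates it. This is a generic illustration, not DICE's algorithm; the event names and helper functions are assumptions made for the example.

```python
def holds_response(trace, trigger, response):
    """True if every `trigger` event is eventually followed by a `response` event."""
    pending = False
    for event in trace:
        if event == trigger:
            pending = True
        elif event == response:
            pending = False
    return not pending


def find_counterexample(traces, trigger, response):
    """Return the first trace that violates the response property, or None."""
    for trace in traces:
        if not holds_response(trace, trigger, response):
            return trace
    return None


# The property holds on the original (hypothetical) test suite...
suite_traces = [["open", "read", "close"], ["open", "close", "open", "close"]]
mined_ok = find_counterexample(suite_traces, "open", "close") is None

# ...but a newly generated trace shows it was an artifact of limited test diversity.
generated_traces = [["open", "read", "close"], ["open", "read"]]
counterexample = find_counterexample(generated_traces, "open", "close")
```

A property that survives on the original suite but is refuted by a generated trace is exactly the kind of spurious specification the abstract describes invalidating.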


Citations: 6

Test Data Generation for Path Coverage of MPI Programs Using SAEO

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-01-03 DOI: 10.1145/3423132

D. Gong, Baicai Sun, Xiangjuan Yao, Tian Tian

Message-passing interface (MPI) programs, a typical kind of parallel program, have been commonly used in various applications. However, it generally takes exhaustive computation to run these programs when generating test data to test them. In this article, we propose a method of test data generation for path coverage of MPI programs using surrogate-assisted evolutionary optimization, which can efficiently generate test data with high quality. We first divide a sample set of a program into a number of clusters according to the multi-mode characteristic of the coverage problem, with each cluster training a surrogate model. Then, we estimate the fitness of each individual using one or more surrogate models when generating test data through evolving a population. Finally, a small number of representative individuals are selected to execute the program, with the purpose of obtaining their real fitness, to guide the subsequent evolution of the population. We apply the proposed method to seven benchmark MPI programs and compare it with several state-of-the-art approaches. The experimental results show that the proposed method can generate test data with reduced computation, thus improving the testing efficiency.
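The cost-saving step described above (estimate fitness cheaply, then run the real program only for a few selected individuals) can be sketched as follows. The sketch swaps in a nearest-neighbour estimate for the per-cluster surrogate models of the abstract, and the one-dimensional inputs, the fitness function, and all names are hypothetical.

```python
def surrogate_fitness(candidate, evaluated):
    """Estimate fitness as that of the nearest already-evaluated sample."""
    nearest = min(evaluated, key=lambda sample: abs(sample[0] - candidate))
    return nearest[1]


def select_for_real_evaluation(candidates, evaluated, real_fitness, k=2):
    """Rank candidates by estimated fitness; run the expensive evaluation on the top k only."""
    ranked = sorted(candidates, key=lambda c: surrogate_fitness(c, evaluated), reverse=True)
    return [(c, real_fitness(c)) for c in ranked[:k]]


real_runs = []


def expensive_fitness(x):
    """Stand-in for executing the program under test; records each real run."""
    real_runs.append(x)
    return -(x - 7) ** 2


# Samples whose real fitness is already known, as (input, fitness) pairs.
known = [(0, -49), (5, -4), (10, -9)]
results = select_for_real_evaluation([1, 4, 6, 9], known, expensive_fitness)
```

Only two of the four candidates trigger a real run here; the newly measured pairs could then be appended to `known` to improve later estimates.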


Citations: 7

History-based Model Repair Recommendations

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2021-01-03 DOI: 10.1145/3419017

Manuel Ohrndorf, Christopher Pietsch, U. Kelter, Lars Grunske, Timo Kehrer

Models in Model-driven Engineering are primary development artifacts that are heavily edited in all stages of software development and that can become temporarily inconsistent during editing. In general, there are many alternatives to resolve an inconsistency, and which one is the most suitable depends on a variety of factors. As also proposed by recent approaches to model repair, it is reasonable to leave the actual choice and approval of a repair alternative to the discretion of the developer. Model repair tools can support developers by proposing a list of the most promising repairs. Such repair recommendations will be only accepted in practice if the generated proposals are plausible and understandable, and if the set as a whole is manageable. Current approaches, which mostly focus on exhaustive search strategies, exploring all possible model repairs without considering the intention of historic changes, fail in meeting these requirements. In this article, we present a new approach to generate repair proposals that aims at inconsistencies that have been introduced by past incomplete edit steps that can be located in the version history of a model. Such an incomplete edit step is either undone or it is extended to a full execution of a consistency-preserving edit operation. The history-based analysis of inconsistencies as well as the generation of repair recommendations are fully automated, and all interactive selection steps are supported by our repair tool called REVISION. We evaluate our approach using histories of real-world models obtained from popular open-source modeling projects hosted in the Eclipse Git repository, including the evolution of the entire UML meta-model. Our experimental results confirm our hypothesis that most of the inconsistencies, namely, 93.4%, can be resolved by complementing incomplete edits. 92.6% of the generated repair proposals are relevant in the sense that their effect can be observed in the models’ histories. 94.9% of the relevant repair proposals are ranked at the topmost position.


Citations: 14

Mastering Variation in Human Studies

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2020-12-31 DOI: 10.1145/3406544

J. Siegmund, Norman Peitek, S. Apel, Norbert Siegmund

The human factor is prevalent in empirical software engineering research. However, human studies often do not use the full potential of analysis methods by combining analysis of individual tasks and participants with an analysis that aggregates results over tasks and/or participants. This may hide interesting insights of tasks and participants and may lead to false conclusions by overrating or underrating single-task or participant performance. We show that studying multiple levels of aggregation of individual tasks and participants allows researchers to have both insights from individual variations as well as generalized, reliable conclusions based on aggregated data. Our literature survey revealed that most human studies perform either a fully aggregated analysis or an analysis of individual tasks. To show that there is important, non-trivial variation when including human participants, we reanalyze 12 published empirical studies, thereby changing the conclusions or making them more nuanced. Moreover, we demonstrate the effects of different aggregation levels by answering a novel research question on published sets of fMRI data. We show that when more data are aggregated, the results become more accurate. This proposed technique can help researchers to find a sweet spot in the tradeoff between cost of a study and reliability of conclusions.
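A small worked example shows how aggregation can hide variation. The response times below are invented for illustration, not data from the studies discussed; the helper names are likewise assumptions. Every aggregated view reports the same mean, yet the individual observations differ by a factor of three.

```python
from statistics import mean

# Hypothetical response times in seconds: participant -> {task: time}.
times = {
    "P1": {"T1": 10, "T2": 30},
    "P2": {"T1": 30, "T2": 10},
}


def mean_per_task(data):
    """Aggregate over participants: one mean per task."""
    tasks = sorted({task for results in data.values() for task in results})
    return {task: mean(results[task] for results in data.values()) for task in tasks}


def mean_per_participant(data):
    """Aggregate over tasks: one mean per participant."""
    return {person: mean(results.values()) for person, results in data.items()}


def spread_per_task(data):
    """Difference between the slowest and fastest participant on each task."""
    tasks = sorted({task for results in data.values() for task in results})
    return {
        task: max(r[task] for r in data.values()) - min(r[task] for r in data.values())
        for task in tasks
    }


per_task = mean_per_task(times)
per_participant = mean_per_participant(times)
spread = spread_per_task(times)
```

Both tasks and both participants average 20 seconds, so a fully aggregated analysis would report no difference at all, while the per-task spread of 20 seconds reveals that each participant struggled with a different task.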


Citations: 2

A Practical Approach to Verification of Floating-Point C/C++ Programs with math.h/cmath Functions

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2020-12-31 DOI: 10.1145/3410875

Roberto Bagnara, M. Chiari, R. Gori, Abramo Bagnara

Verification of C/C++ programs has seen considerable progress in several areas, but not for programs that use these languages’ mathematical libraries. The reason is that all libraries in widespread use come with no guarantees about the computed results. This would seem to prevent any attempt at formal verification of programs that use them: without a specification for the functions, no conclusion can be drawn statically about the behavior of the program. We propose an alternative to surrender. We introduce a pragmatic approach that leverages the fact that most math.h/cmath functions are almost piecewise monotonic: as we discovered through exhaustive testing, they may have glitches, often of very small size and in small numbers. We develop interval refinement techniques for such functions based on a modified dichotomic search, which enable verification via symbolic execution based model checking, abstract interpretation, and test data generation. To the best of our knowledge, our refinement algorithms are the first in the literature to be able to handle non-correctly rounded function implementations, enabling verification in the presence of the most common implementations. We experimentally evaluate our approach on real-world code, showing its ability to detect or rule out anomalous behaviors.
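To make the notion of interval refinement by dichotomic search concrete, here is a minimal sketch that narrows the upper bound of an input interval given a bound on the function's result. It uses a plain bisection over floating-point values and assumes the function is nondecreasing on the interval; it does not include the modifications the abstract mentions for handling glitches, and the function name and parameters are hypothetical.

```python
import math


def refine_upper_bound(f, lo, hi, y_max):
    """Find x in [lo, hi] with f(x) <= y_max whose next candidate exceeds y_max.

    If f is nondecreasing on [lo, hi], the result is the largest such float.
    Returns None when even f(lo) exceeds y_max.
    """
    if f(lo) > y_max:
        return None  # no value in the interval satisfies the constraint
    if f(hi) <= y_max:
        return hi  # the interval cannot be narrowed
    # Invariant from here on: f(lo) <= y_max < f(hi).
    while True:
        mid = lo + (hi - lo) / 2
        if mid == lo or mid == hi:
            return lo  # lo and hi are adjacent floats; nothing left to split
        if f(mid) <= y_max:
            lo = mid
        else:
            hi = mid


# Narrow x in [0, 10] under the constraint exp(x) <= 2.
x_max = refine_upper_bound(math.exp, 0.0, 10.0, 2.0)
```

Because each step halves the interval, the search needs only a number of function evaluations proportional to the bit width of the floating-point format, rather than a scan of every representable value.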


Citations: 1

RegionTrack

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2020-12-31 DOI: 10.1145/3412377

Xiaoxue Ma, Shangru Wu, E. Pobee, Xiupei Mei, Hao Zhang, Bo Jiang, W. Chan

Atomicity is a correctness criterion to reason about isolated code regions in a multithreaded program when they are executed concurrently. However, dynamic instances of these code regions, called transactions, may fail to behave atomically, resulting in transactional atomicity violations. Existing dynamic online atomicity checkers incur either false positives or false negatives in detecting transactions experiencing transactional atomicity violations. This article proposes RegionTrack. RegionTrack tracks cross-thread dependences at the event, dynamic subregion, and transaction levels. It maintains both dynamic subregions within selected transactions and transactional happens-before relations through its novel timestamp propagation approach. We prove that RegionTrack is sound and complete in detecting both transactional atomicity violations and non-serializable traces. To the best of our knowledge, it is the first online technique that precisely captures the transitively closed set of happens-before relations over all conflicting events with respect to every running transaction for the above two kinds of issues. We have evaluated RegionTrack on 19 subjects of the DaCapo and the Java Grande Forum benchmarks. The empirical results confirm that RegionTrack precisely detected all those transactions which experienced transactional atomicity violations and identified all non-serializable traces. The overall results also show that RegionTrack incurred 1.10x and 1.08x lower memory and runtime overheads than Velodrome and 2.10x and 1.21x lower than Aerodrome, respectively. Moreover, it incurred 2.89x lower memory overhead than DoubleChecker. On average, Velodrome detected about 55% fewer violations than RegionTrack, which in turn reported about 3%–70% fewer violations than DoubleChecker.
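To make the notion of a non-serializable trace concrete, here is a minimal offline sketch of the classic conflict-graph check: transactions are nodes, an edge is added whenever an earlier event conflicts with (or precedes, in the same thread) a later event of another transaction, and a cycle means the trace is not conflict-serializable. This is a textbook technique, not RegionTrack's online timestamp propagation, and the trace encoding `(txn, thread, op, var)` is an assumption made for illustration.

```python
from collections import defaultdict

def conflict_graph(trace):
    """Build transaction ordering edges from a trace.

    trace: list of (txn, thread, op, var) in observed order; op is 'r' or 'w'.
    """
    edges = defaultdict(set)
    for i, (t1, th1, op1, v1) in enumerate(trace):
        for t2, th2, op2, v2 in trace[i + 1:]:
            if t1 == t2:
                continue
            same_thread = th1 == th2                    # program order
            conflict = v1 == v2 and "w" in (op1, op2)   # at least one write
            if same_thread or conflict:
                edges[t1].add(t2)
    return edges

def has_cycle(edges):
    """Depth-first search with three colours; a grey successor closes a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE

    def visit(node):
        color[node] = GREY
        for nxt in edges.get(node, ()):
            if color[nxt] == GREY:
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(edges))

def is_serializable(trace):
    return not has_cycle(conflict_graph(trace))

# T2's write lands between T1's read and write of x: T1 -> T2 -> T1.
interleaved = [("T1", "A", "r", "x"), ("T2", "B", "w", "x"), ("T1", "A", "w", "x")]
serial = [("T1", "A", "r", "x"), ("T1", "A", "w", "x"), ("T2", "B", "w", "x")]
print(is_serializable(interleaved), is_serializable(serial))
```

The pairwise scan is quadratic in the trace length, which is acceptable for a sketch but is exactly the kind of cost an online checker has to avoid.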



Citations: 1

Uncertainty-wise Requirements Prioritization with Search

ACM Transactions on Software Engineering and Methodology (TOSEM)

Pub Date : 2020-12-31 DOI: 10.1145/3408301

Huihui Zhang, Man Zhang, T. Yue, Sajid Ali, Yan Li

Requirements review is an effective technique to ensure the quality of requirements in practice, especially in safety-critical domains (e.g., avionics systems, automotive systems). In such contexts, a typical requirements review process often prioritizes requirements, due to limited time and monetary budget, by, for instance, reviewing requirements with higher implementation cost earlier in the review process. However, such a requirement implementation cost is typically estimated by stakeholders who often lack knowledge about (future) requirements implementation scenarios, which leads to uncertainty in cost overrun. In this article, we explicitly consider such uncertainty (quantified as cost overrun probability) when prioritizing requirements, based on the assumption that a requirement with higher importance, more dependencies on other requirements, and higher implementation cost will be reviewed with higher priority. Motivated by this, we formulate four objectives for uncertainty-wise requirements prioritization: maximizing the importance of requirements, requirements dependencies, the implementation cost of requirements, and cost overrun probability. These four objectives are integrated as part of our search-based uncertainty-wise requirements prioritization approach with tool support, named URP. We evaluated six Multi-Objective Search Algorithms (MOSAs) (i.e., NSGA-II, NSGA-III, MOCell, SPEA2, IBEA, and PAES) together with Random Search (RS) using three real-world datasets (i.e., the RALIC, Word, and ReleasePlanner datasets) and 19 synthetic optimization problems. Results show that all the selected MOSAs can solve the requirements prioritization problem with significantly better performance than RS. Among them, IBEA was over 40% better than RS in terms of permutation effectiveness for the first 10% of prioritized requirements in the prioritization sequence of all three datasets. In addition, IBEA achieved the best performance in terms of the convergence of solutions, and NSGA-III performed the best when considering both the convergence and diversity of nondominated solutions.
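The term "nondominated solutions" can be made concrete with a small sketch of Pareto dominance over four objectives that are all maximized, as in the formulation above. The objective vectors below are hypothetical numbers chosen for illustration, not data from the paper, and the code shows only the dominance filter that multi-objective algorithms build on, not URP or any of the evaluated MOSAs.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def nondominated(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]

# Hypothetical objective vectors for three candidate prioritizations:
# (importance, dependencies, implementation cost, cost overrun probability)
candidates = [
    (0.9, 0.5, 0.7, 0.2),
    (0.8, 0.4, 0.6, 0.1),  # worse than the first on every objective
    (0.6, 0.9, 0.3, 0.4),  # trades importance for dependencies
]
front = nondominated(candidates)
print(front)
```

Only the second candidate is filtered out here; the first and third remain because each is better than the other on at least one objective.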



Citations: 10