C. Maddila, Nachiappan Nagappan, C. Bird, Georgios Gousios, A. van Deursen
Modern, complex software systems are being continuously extended and adjusted. The developers responsible for this may come from different teams or organizations, and may be distributed over the world. This may make it difficult to keep track of what other developers are doing, which may result in multiple developers concurrently editing the same code areas. This, in turn, may lead to hard-to-merge changes or even merge conflicts, logical bugs that are difficult to detect, duplication of work, and wasted developer productivity. To address this, we explore the extent of this problem in the pull-request-based software development model. We study half a year of changes made to six large repositories in Microsoft in which at least 1,000 pull requests are created each month. We find that files concurrently edited in different pull requests are more likely to introduce bugs. Motivated by these findings, we design, implement, and deploy a service named Concurrent Edit Detector (ConE) that proactively detects pull requests containing concurrent edits, to help mitigate the problems caused by them. ConE has been designed to scale, and to minimize false alarms while still flagging relevant concurrently edited files. Key concepts of ConE include the detection of the Extent of Overlap between pull requests, and the identification of Rarely Concurrently Edited Files. To evaluate ConE, we report on its operational deployment on 234 repositories inside Microsoft. ConE assessed 26,000 pull requests and made 775 recommendations about conflicting changes, which were rated as useful in over 70% (554) of the cases. From interviews with 48 users, we learned that they believed ConE would save time in conflict resolution and avoiding duplicate work, and that over 90% intend to keep using the service on a daily basis.
{"title":"ConE: A Concurrent Edit Detection Tool for Large-scale Software Development","authors":"C. Maddila, Nachiappan Nagappan, C. Bird, Georgios Gousios, A. van Deursen","doi":"10.1145/3478019","DOIUrl":"https://doi.org/10.1145/3478019","journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"48 1","pages":"1 - 26"},"publicationDate":"2021-01-16","publicationTypes":"Journal Article"}
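The ConE abstract names the Extent of Overlap between pull requests but does not define it. As a rough sketch only, the Python below assumes one plausible reading: the share of a pull request's files that another active pull request also edits. The function names, the 0.5 threshold, and the sample data are hypothetical and are not taken from ConE.

```python
def extent_of_overlap(pr_files, other_files):
    """Fraction of files in pr_files that other_files also touches (assumed metric)."""
    pr_files, other_files = set(pr_files), set(other_files)
    if not pr_files:
        return 0.0
    return len(pr_files & other_files) / len(pr_files)


def find_concurrent_edits(active_prs, threshold=0.5):
    """Return (pr_a, pr_b, overlap) for every ordered pair at or above threshold."""
    hits = []
    for a, files_a in active_prs.items():
        for b, files_b in active_prs.items():
            if a == b:
                continue
            overlap = extent_of_overlap(files_a, files_b)
            if overlap >= threshold:
                hits.append((a, b, overlap))
    return hits


# Hypothetical active pull requests and the files each one edits.
active_prs = {
    "PR-1": ["src/auth.py", "src/db.py"],
    "PR-2": ["src/auth.py", "docs/readme.md", "src/ui.py", "src/api.py"],
    "PR-3": ["tests/test_ui.py"],
}
hits = find_concurrent_edits(active_prs)
```

The measure is deliberately asymmetric in this sketch: half of PR-1's files collide with PR-2, while only a quarter of PR-2's files collide with PR-1, so only the author of PR-1 would be notified at this threshold.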
The state space of Android apps is huge, and its thorough exploration during testing remains a significant challenge. The best exploration strategy is highly dependent on the features of the app under test. Reinforcement Learning (RL) is a machine learning technique that learns the optimal strategy to solve a task by trial and error, guided by positive or negative reward, rather than explicit supervision. Deep RL is a recent extension of RL that takes advantage of the learning capabilities of neural networks. Such capabilities make Deep RL suitable for complex exploration spaces such as that of Android apps. However, state-of-the-art, publicly available tools only support basic, Tabular RL. We have developed ARES, a Deep RL approach for black-box testing of Android apps. Experimental results show that it achieves higher coverage and fault revelation than the baselines, including state-of-the-art tools such as TimeMachine and Q-Testing. We also investigated the reasons behind such performance qualitatively, and we have identified the presence of chained and blocking activities as the key feature of Android apps that makes Deep RL particularly effective on them. Moreover, we have developed FATE to fine-tune the hyperparameters of Deep RL algorithms on simulated apps, since it is computationally expensive to do so on real apps.
{"title":"Deep Reinforcement Learning for Black-box Testing of Android Apps","authors":"Andrea Romdhana, A. Merlo, M. Ceccato, P. Tonella","doi":"10.1145/3502868","DOIUrl":"https://doi.org/10.1145/3502868","journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"15 1","pages":"1 - 29"},"publicationDate":"2021-01-07","publicationTypes":"Journal Article"}
Javier Godoy, Juan P. Galeotti, D. Garbervetsky, Sebastián Uchitel
A significant proportion of classes in modern software introduce or use object protocols, prescriptions on the temporal orderings of method calls on objects. This article studies search-based test generation techniques that aim to exploit a particular abstraction of object protocols (enabledness-preserving abstractions (EPAs)) to find failures. We define coverage criteria over an extension of EPAs that includes abnormal method termination and define a search-based test case generation technique aimed at achieving high coverage. Results suggest that the proposed test case generation technique with a fitness function that aims at combined structural and extended EPA coverage can provide better failure-detection capabilities not only for protocol failures but also for general failures when compared to random testing and search-based test generation for standard structural coverage.
{"title":"Enabledness-based Testing of Object Protocols","authors":"Javier Godoy, Juan P. Galeotti, D. Garbervetsky, Sebastián Uchitel","doi":"10.1145/3415153","DOIUrl":"https://doi.org/10.1145/3415153","journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"85 1","pages":"1 - 36"},"publicationDate":"2021-01-03","publicationTypes":"Journal Article"}
There have been numerous studies on mining temporal specifications from execution traces. These approaches learn finite-state automata (FSA) from execution traces when running tests. To learn accurate specifications of a software system, many tests are required. Existing approaches generalize from a limited number of traces or use simple test generation strategies. Unfortunately, these strategies may not exercise uncommon usage patterns of a software system. To address this problem, we propose a new approach, adversarial specification mining, and develop a prototype, Diversity through Counter-examples (DICE). DICE has two components: DICE-Tester and DICE-Miner. After mining Linear Temporal Logic specifications from an input test suite, DICE-Tester adversarially guides test generation, searching for counterexamples to these specifications to invalidate spurious properties. These counterexamples represent gaps in the diversity of the input test suite. This process produces execution traces of usage patterns that were unrepresented in the input test suite. Next, we propose a new specification inference algorithm, DICE-Miner, to infer FSAs using the traces, guided by the temporal specifications. We find that the inferred specifications are of higher quality than those produced by existing state-of-the-art specification miners. Finally, we use the FSAs in a fuzzer for servers of stateful protocols, increasing its coverage.
{"title":"Adversarial Specification Mining","authors":"Hong Jin Kang, D. Lo","doi":"10.1145/3424307","DOIUrl":"https://doi.org/10.1145/3424307","journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"50 1","pages":"1 - 40"},"publicationDate":"2021-01-03","publicationTypes":"Journal Article"}
Message-passing interface (MPI) programs, a typical kind of parallel program, are commonly used in various applications. However, running these programs to generate test data for them is generally computationally expensive. In this article, we propose a method of test data generation for path coverage of MPI programs using surrogate-assisted evolutionary optimization, which can efficiently generate high-quality test data. We first divide a sample set of a program into a number of clusters according to the multi-mode characteristic of the coverage problem, with each cluster training a surrogate model. Then, we estimate the fitness of each individual using one or more surrogate models when generating test data through evolving a population. Finally, a small number of representative individuals are selected to execute the program, with the purpose of obtaining their real fitness, to guide the subsequent evolution of the population. We apply the proposed method to seven benchmark MPI programs and compare it with several state-of-the-art approaches. The experimental results show that the proposed method can generate test data with reduced computation, thus improving the testing efficiency.
{"title":"Test Data Generation for Path Coverage of MPI Programs Using SAEO","authors":"D. Gong, Baicai Sun, Xiangjuan Yao, Tian Tian","doi":"10.1145/3423132","DOIUrl":"https://doi.org/10.1145/3423132","journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"105 1","pages":"1 - 37"},"publicationDate":"2021-01-03","publicationTypes":"Journal Article"}
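The SAEO abstract above describes estimating fitness with surrogate models and executing the program only for a few representative individuals. The Python below is a minimal sketch of that general idea, not the authors' method: it uses a single nearest-neighbour surrogate instead of one trained model per cluster, and `real_fitness` is a hypothetical stand-in for running an MPI program and measuring how close an input comes to the target path.

```python
import random


def real_fitness(x):
    """Hypothetical stand-in for executing the program under test; lower is better."""
    return abs(x - 42)


def surrogate_fitness(x, archive):
    """Predict fitness from the nearest individual that was really evaluated."""
    nearest = min(archive, key=lambda seen: abs(seen - x))
    return archive[nearest]


def evolve(generations=40, pop_size=20, real_evals_per_gen=3, seed=0):
    rng = random.Random(seed)
    population = [rng.randint(0, 1000) for _ in range(pop_size)]
    # Seed the archive of real evaluations with a few individuals.
    archive = {x: real_fitness(x) for x in population[:real_evals_per_gen]}
    real_evals = real_evals_per_gen
    for _ in range(generations):
        # Mutate every individual, then rank parents and children
        # using only the cheap surrogate estimate.
        children = [x + rng.randint(-25, 25) for x in population]
        ranked = sorted(population + children,
                        key=lambda x: surrogate_fitness(x, archive))
        population = ranked[:pop_size]
        # Only the most promising few are actually executed, and their
        # real fitness feeds back into the surrogate's archive.
        for x in population[:real_evals_per_gen]:
            if x not in archive:
                archive[x] = real_fitness(x)
                real_evals += 1
    best = min(archive, key=archive.get)
    return best, archive[best], real_evals


best, best_fitness, real_evals = evolve()
```

With these settings the loop performs at most 3 real evaluations per generation, compared with the 20 per generation that evaluating the whole population would need; that saving is the point of using a surrogate.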
Manuel Ohrndorf, Christopher Pietsch, U. Kelter, Lars Grunske, Timo Kehrer
Models in Model-driven Engineering are primary development artifacts that are heavily edited in all stages of software development and that can become temporarily inconsistent during editing. In general, there are many alternatives to resolve an inconsistency, and which one is the most suitable depends on a variety of factors. As also proposed by recent approaches to model repair, it is reasonable to leave the actual choice and approval of a repair alternative to the discretion of the developer. Model repair tools can support developers by proposing a list of the most promising repairs. Such repair recommendations will only be accepted in practice if the generated proposals are plausible and understandable, and if the set as a whole is manageable. Current approaches, which mostly focus on exhaustive search strategies, exploring all possible model repairs without considering the intention of historic changes, fail to meet these requirements. In this article, we present a new approach to generate repair proposals that aims at inconsistencies that have been introduced by past incomplete edit steps that can be located in the version history of a model. Such an incomplete edit step is either undone or it is extended to a full execution of a consistency-preserving edit operation. The history-based analysis of inconsistencies as well as the generation of repair recommendations are fully automated, and all interactive selection steps are supported by our repair tool called REVISION. We evaluate our approach using histories of real-world models obtained from popular open-source modeling projects hosted in the Eclipse Git repository, including the evolution of the entire UML meta-model. Our experimental results confirm our hypothesis that most of the inconsistencies, namely, 93.4%, can be resolved by complementing incomplete edits. 92.6% of the generated repair proposals are relevant in the sense that their effect can be observed in the models’ histories. 94.9% of the relevant repair proposals are ranked at the topmost position.
{"title":"History-based Model Repair Recommendations","authors":"Manuel Ohrndorf, Christopher Pietsch, U. Kelter, Lars Grunske, Timo Kehrer","doi":"10.1145/3419017","DOIUrl":"https://doi.org/10.1145/3419017","journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"17 1","pages":"1 - 46"},"publicationDate":"2021-01-03","publicationTypes":"Journal Article"}
J. Siegmund, Norman Peitek, S. Apel, Norbert Siegmund
The human factor is prevalent in empirical software engineering research. However, human studies often do not use the full potential of analysis methods by combining analysis of individual tasks and participants with an analysis that aggregates results over tasks and/or participants. This may hide interesting insights of tasks and participants and may lead to false conclusions by overrating or underrating single-task or participant performance. We show that studying multiple levels of aggregation of individual tasks and participants allows researchers to have both insights from individual variations as well as generalized, reliable conclusions based on aggregated data. Our literature survey revealed that most human studies perform either a fully aggregated analysis or an analysis of individual tasks. To show that there is important, non-trivial variation when including human participants, we reanalyze 12 published empirical studies, thereby changing the conclusions or making them more nuanced. Moreover, we demonstrate the effects of different aggregation levels by answering a novel research question on published sets of fMRI data. We show that when more data are aggregated, the results become more accurate. This proposed technique can help researchers to find a sweet spot in the tradeoff between cost of a study and reliability of conclusions.
{"title":"Mastering Variation in Human Studies","authors":"J. Siegmund, Norman Peitek, S. Apel, Norbert Siegmund","doi":"10.1145/3406544","DOIUrl":"https://doi.org/10.1145/3406544","journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"441 1","pages":"1 - 40"},"publicationDate":"2020-12-31","publicationTypes":"Journal Article"}
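The human-studies abstract above argues that analyses at several aggregation levels should be combined. As a small illustration only, the Python below computes per-participant, per-task, and fully aggregated means over made-up response times; the data and function names are hypothetical and do not come from the reanalyzed studies.

```python
from statistics import mean

# Hypothetical response times in seconds, indexed as times[participant][task].
times = {
    "P1": {"T1": 10, "T2": 30},
    "P2": {"T1": 12, "T2": 28},
    "P3": {"T1": 35, "T2": 5},
}


def per_participant(data):
    """Aggregate over tasks: one mean per participant."""
    return {p: mean(tasks.values()) for p, tasks in data.items()}


def per_task(data):
    """Aggregate over participants: one mean per task."""
    tasks = sorted({t for row in data.values() for t in row})
    return {t: mean(row[t] for row in data.values() if t in row) for t in tasks}


def overall(data):
    """Aggregate over both participants and tasks: a single mean."""
    return mean(v for row in data.values() for v in row.values())


by_participant = per_participant(times)
by_task = per_task(times)
grand_mean = overall(times)
```

In this toy data every participant averages 20 seconds and the two task means are 19 and 21 seconds, so the aggregated views suggest almost no variation. Only the individual cells reveal that P3 shows the opposite task pattern from P1 and P2.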
Roberto Bagnara, M. Chiari, R. Gori, Abramo Bagnara
Verification of C/C++ programs has seen considerable progress in several areas, but not for programs that use these languages’ mathematical libraries. The reason is that all libraries in widespread use come with no guarantees about the computed results. This would seem to prevent any attempt at formal verification of programs that use them: without a specification for the functions, no conclusion can be drawn statically about the behavior of the program. We propose an alternative to surrender. We introduce a pragmatic approach that leverages the fact that most math.h/cmath functions are almost piecewise monotonic: as we discovered through exhaustive testing, they may have glitches, often of very small size and in small numbers. We develop interval refinement techniques for such functions based on a modified dichotomic search, which enable verification via symbolic-execution-based model checking, abstract interpretation, and test data generation. To the best of our knowledge, our refinement algorithms are the first in the literature to be able to handle non-correctly rounded function implementations, enabling verification in the presence of the most common implementations. We experimentally evaluate our approach on real-world code, showing its ability to detect or rule out anomalous behaviors.
{"title":"A Practical Approach to Verification of Floating-Point C/C++ Programs with math.h/cmath Functions","authors":"Roberto Bagnara, M. Chiari, R. Gori, Abramo Bagnara","doi":"10.1145/3410875","DOIUrl":"https://doi.org/10.1145/3410875","journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"188 1","pages":"1 - 53"},"publicationDate":"2020-12-31","publicationTypes":"Journal Article"}
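The math.h/cmath abstract above mentions interval refinement based on a modified dichotomic search. The Python below sketches only the plain, unmodified idea under a strong assumption: the function is treated as non-decreasing on the input interval, so the glitches the abstract describes are ignored. The name `refine_upper` is hypothetical, and `math.exp` merely stands in for a library function.

```python
import math


def refine_upper(f, lo, hi, y_max):
    """Shrink the input interval [lo, hi] to the inputs whose result is <= y_max.

    Returns the new upper bound, or None if no input in the interval qualifies.
    Assumes f is non-decreasing on [lo, hi] (no glitches).
    """
    if f(lo) > y_max:
        return None          # even the smallest input overshoots
    if f(hi) <= y_max:
        return hi            # nothing to refine
    # Loop invariant: f(lo) <= y_max < f(hi).
    while True:
        mid = lo + (hi - lo) / 2
        if mid <= lo or mid >= hi:
            return lo        # no representable value left between lo and hi
        if f(mid) <= y_max:
            lo = mid
        else:
            hi = mid


# Which x in [0, 10] can satisfy exp(x) <= 2?
x = refine_upper(math.exp, 0.0, 10.0, 2.0)
```

The search halves the interval until no floating-point value remains between the two bounds, so the returned bound is tight to the last bit under the monotonicity assumption. Handling functions that are only almost monotonic is the part this sketch leaves out.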
Xiaoxue Ma, Shangru Wu, E. Pobee, Xiupei Mei, Hao Zhang, Bo Jiang, W. Chan
Atomicity is a correctness criterion to reason about isolated code regions in a multithreaded program when they are executed concurrently. However, dynamic instances of these code regions, called transactions, may fail to behave atomically, resulting in transactional atomicity violations. Existing dynamic online atomicity checkers incur either false positives or false negatives in detecting transactions experiencing transactional atomicity violations. This article proposes RegionTrack. RegionTrack tracks cross-thread dependences at the event, dynamic subregion, and transaction levels. It maintains both dynamic subregions within selected transactions and transactional happens-before relations through its novel timestamp propagation approach. We prove that RegionTrack is sound and complete in detecting both transactional atomicity violations and non-serializable traces. To the best of our knowledge, it is the first online technique that precisely captures the transitively closed set of happens-before relations over all conflicting events with respect to every running transaction for the above two kinds of issues. We have evaluated RegionTrack on 19 subjects of the DaCapo and the Java Grande Forum benchmarks. The empirical results confirm that RegionTrack precisely detected all those transactions which experienced transactional atomicity violations and identified all non-serializable traces. The overall results also show that RegionTrack incurred 1.10x and 1.08x lower memory and runtime overheads than Velodrome and 2.10x and 1.21x lower than Aerodrome, respectively. Moreover, it incurred 2.89x lower memory overhead than DoubleChecker. On average, Velodrome detected about 55% fewer violations than RegionTrack, which in turn reported about 3%–70% fewer violations than DoubleChecker.
{"title":"RegionTrack","authors":"Xiaoxue Ma, Shangru Wu, E. Pobee, Xiupei Mei, Hao Zhang, Bo Jiang, W. Chan","doi":"10.1145/3412377","DOIUrl":"https://doi.org/10.1145/3412377","journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"31 1","pages":"1 - 49"},"publicationDate":"2020-12-31","publicationTypes":"Journal Article"}
Huihui Zhang, Man Zhang, T. Yue, Sajid Ali, Yan Li
Requirements review is an effective technique to ensure the quality of requirements in practice, especially in safety-critical domains (e.g., avionics systems, automotive systems). In such contexts, a typical requirements review process often prioritizes requirements, due to limited time and monetary budget, by, for instance, reviewing requirements with higher implementation cost earlier in the process. However, such a requirement implementation cost is typically estimated by stakeholders who often lack knowledge about (future) requirements implementation scenarios, which leads to uncertainty in cost overrun. In this article, we explicitly consider such uncertainty (quantified as cost overrun probability) when prioritizing requirements, based on the assumption that a requirement with higher importance, a higher number of dependencies on other requirements, and higher implementation cost will be reviewed with higher priority. Motivated by this, we formulate four objectives for uncertainty-wise requirements prioritization: maximizing the importance of requirements, requirements dependencies, the implementation cost of requirements, and cost overrun probability. These four objectives are integrated as part of our search-based uncertainty-wise requirements prioritization approach with tool support, named URP. We evaluated six Multi-Objective Search Algorithms (MOSAs) (i.e., NSGA-II, NSGA-III, MOCell, SPEA2, IBEA, and PAES) together with Random Search (RS) using three real-world datasets (i.e., the RALIC, Word, and ReleasePlanner datasets) and 19 synthetic optimization problems. Results show that all the selected MOSAs can solve the requirements prioritization problem with significantly better performance than RS. Among them, IBEA was over 40% better than RS in terms of permutation effectiveness for the first 10% of prioritized requirements in the prioritization sequence of all three datasets. In addition, IBEA achieved the best performance in terms of the convergence of solutions, and NSGA-III performed best when considering both the convergence and diversity of nondominated solutions.
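The idea of nondominated solutions under four maximized objectives can be illustrated with a minimal sketch. It is not URP itself: the MOSAs above search over whole prioritization sequences, whereas this simplification applies Pareto dominance directly to individual requirements. The `Requirement` type, its field names, and all values are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Requirement(NamedTuple):
    """Hypothetical record holding the four objective values named above."""
    name: str
    importance: float
    dependencies: int
    cost: float
    overrun_probability: float


def objectives(r: Requirement) -> Tuple[float, ...]:
    """All four objectives are maximized in this sketch."""
    return (r.importance, r.dependencies, r.cost, r.overrun_probability)


def dominates(a: Requirement, b: Requirement) -> bool:
    """a dominates b if it is at least as good on every objective
    and strictly better on at least one."""
    oa, ob = objectives(a), objectives(b)
    return (all(x >= y for x, y in zip(oa, ob))
            and any(x > y for x, y in zip(oa, ob)))


def nondominated(reqs: List[Requirement]) -> List[Requirement]:
    """Keep the requirements that no other requirement dominates."""
    return [r for r in reqs if not any(dominates(o, r) for o in reqs)]


# Made-up values, for illustration only.
reqs = [
    Requirement("R1", 0.9, 3, 10.0, 0.4),
    Requirement("R2", 0.5, 1, 5.0, 0.2),  # worse than R1 on every objective
    Requirement("R3", 0.3, 5, 2.0, 0.1),  # more dependencies than R1
]
print([r.name for r in nondominated(reqs)])
```

R2 drops out because R1 is better on all four objectives, while R1 and R3 each win on at least one objective and so neither dominates the other.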
Uncertainty-wise Requirements Prioritization with Search. Huihui Zhang, Man Zhang, T. Yue, Sajid Ali, Yan Li. ACM Transactions on Software Engineering and Methodology (TOSEM), pp. 1-54. Published 2020-12-31. https://doi.org/10.1145/3408301