{"title":"行为执行比较:测试是否代表现场行为?","authors":"Qianqian Wang, Yuriy Brun, A. Orso","doi":"10.1109/ICST.2017.36","DOIUrl":null,"url":null,"abstract":"Software testing is the most widely used approach for assessing and improving software quality, but it is inherently incomplete and may not be representative of how the software is used in the field. This paper addresses the questions of to what extent tests represent how real users use software, and how to measure behavioral differences between test and field executions. We study four real-world systems, one used by endusers and three used by other (client) software, and compare test suites written by the systems' developers to field executions using four models of behavior: statement coverage, method coverage, mutation score, and a temporal-invariant-based model we developed. We find that developer-written test suites fail to accurately represent field executions: the tests, on average, miss 6.2% of the statements and 7.7% of the methods exercised in the field, the behavior exercised only in the field kills an extra 8.6% of the mutants, finally, the tests miss 52.6% of the behavioral invariants that occur in the field. In addition, augmenting the in-house test suites with automatically-generated tests by a tool targeting high code coverage only marginally improves the tests' behavioral representativeness. These differences between field and test executions—and in particular the finer-grained and more sophisticated ones that we measured using our invariantbased model—can provide insight for developers and suggest a better method for measuring test suite quality.","PeriodicalId":112258,"journal":{"name":"2017 IEEE International Conference on Software Testing, Verification and Validation (ICST)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":"{\"title\":\"Behavioral Execution Comparison: Are Tests Representative of Field Behavior?\",\"authors\":\"Qianqian Wang, Yuriy Brun, A. Orso\",\"doi\":\"10.1109/ICST.2017.36\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Software testing is the most widely used approach for assessing and improving software quality, but it is inherently incomplete and may not be representative of how the software is used in the field. This paper addresses the questions of to what extent tests represent how real users use software, and how to measure behavioral differences between test and field executions. We study four real-world systems, one used by endusers and three used by other (client) software, and compare test suites written by the systems' developers to field executions using four models of behavior: statement coverage, method coverage, mutation score, and a temporal-invariant-based model we developed. We find that developer-written test suites fail to accurately represent field executions: the tests, on average, miss 6.2% of the statements and 7.7% of the methods exercised in the field, the behavior exercised only in the field kills an extra 8.6% of the mutants, finally, the tests miss 52.6% of the behavioral invariants that occur in the field. In addition, augmenting the in-house test suites with automatically-generated tests by a tool targeting high code coverage only marginally improves the tests' behavioral representativeness. These differences between field and test executions—and in particular the finer-grained and more sophisticated ones that we measured using our invariantbased model—can provide insight for developers and suggest a better method for measuring test suite quality.\",\"PeriodicalId\":112258,\"journal\":{\"name\":\"2017 IEEE International Conference on Software Testing, Verification and Validation (ICST)\",\"volume\":\"57 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-03-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"23\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE International Conference on Software Testing, Verification and Validation (ICST)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICST.2017.36\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Conference on Software Testing, Verification and Validation (ICST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICST.2017.36","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Behavioral Execution Comparison: Are Tests Representative of Field Behavior?
Software testing is the most widely used approach for assessing and improving software quality, but it is inherently incomplete and may not be representative of how the software is used in the field. This paper addresses the questions of to what extent tests represent how real users use software, and how to measure behavioral differences between test and field executions. We study four real-world systems, one used by endusers and three used by other (client) software, and compare test suites written by the systems' developers to field executions using four models of behavior: statement coverage, method coverage, mutation score, and a temporal-invariant-based model we developed. We find that developer-written test suites fail to accurately represent field executions: the tests, on average, miss 6.2% of the statements and 7.7% of the methods exercised in the field, the behavior exercised only in the field kills an extra 8.6% of the mutants, finally, the tests miss 52.6% of the behavioral invariants that occur in the field. In addition, augmenting the in-house test suites with automatically-generated tests by a tool targeting high code coverage only marginally improves the tests' behavioral representativeness. These differences between field and test executions—and in particular the finer-grained and more sophisticated ones that we measured using our invariantbased model—can provide insight for developers and suggest a better method for measuring test suite quality.