Unit Under Test Identification Using Natural Language Processing Techniques

Open Computer Science (IF 1.1; JCR Q3, Computer Science, Theory & Methods). Pub Date: 2020-12-17. DOI: 10.1515/comp-2020-0150
Matej Madeja, J. Porubän
Open Computer Science, vol. 11, no. 1, pp. 22-32

Abstract

Identifying the unit under test (UUT) is often difficult because of test smells, such as testing multiple UUTs in one test. Because tests best reflect the current product specification, they can be used to comprehend parts of the production code and the relationships between them. Because a test and its UUT share a similar vocabulary, this paper applies five natural language processing (NLP) techniques to the source code of five popular GitHub projects. The collected results were compared with manually identified UUTs. The tf-idf model achieved the best accuracy: 22% for ranking the correct UUT first, and 57% when the manually identified UUT is allowed to appear anywhere in the top five results. These results were obtained after preprocessing the input documents by removing Java keywords and splitting words. The tf-idf model also had the shortest training time, and an index search takes under 1 s per request, so it could serve as a support tool in an Integrated Development Environment (IDE) in the future. For document preprocessing, word splitting improves accuracy the most, while removing Java keywords yields only a small improvement for the tf-idf model. Removing comments only slightly worsens the accuracy of the NLP models. Word splitting was also the fastest preprocessing step, averaging 0.3 s for all documents in a project.
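The preprocessing and ranking steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The keyword subset, the smoothed idf formula, the cosine-similarity ranking, the sample Java snippets, and all names (`preprocess`, `rank_candidates`, `hit_within`) are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a small illustrative subset of Java keywords and literals.
JAVA_KEYWORDS = {
    "public", "private", "protected", "class", "void", "new", "return",
    "static", "final", "import", "package", "int", "this", "null",
}


def preprocess(source):
    """Drop Java keywords, then split identifiers into lowercase words."""
    words = []
    for identifier in re.findall(r"[A-Za-z]+", source):
        if identifier in JAVA_KEYWORDS:
            continue
        # Word split: "parseHttpUrl" -> parse, http, url
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier):
            words.append(part.lower())
    return words


def weigh(words, idf):
    """Build a sparse tf-idf vector; words unseen in the corpus are ignored."""
    return {w: n * idf[w] for w, n in Counter(words).items() if w in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm = math.sqrt(sum(x * x for x in a.values()))
    norm *= math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(test_source, production_sources):
    """Return production class names, most similar to the test first."""
    docs = {name: preprocess(src) for name, src in production_sources.items()}
    doc_freq = Counter(w for words in docs.values() for w in set(words))
    n = len(docs)
    # Assumption: smoothed idf, one common variant among several.
    idf = {w: math.log((1 + n) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    query = weigh(preprocess(test_source), idf)
    scores = {name: cosine(query, weigh(words, idf)) for name, words in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def hit_within(ranking, expected, k):
    """True if the manually identified UUT appears in the top k results."""
    return expected in ranking[:k]


# Hypothetical production classes and test, invented for this example.
PRODUCTION = {
    "Stack": "public class Stack { private int size; public void push(Object item) {} "
             "public Object pop() { return null; } }",
    "Queue": "public class Queue { public void enqueue(Object item) {} "
             "public Object dequeue() { return null; } }",
    "UrlParser": "public class UrlParser { public String parseHost(String url) { return url; } }",
}
TEST_SOURCE = (
    "public class StackTest { @Test public void testPushThenPop() { "
    "Stack stack = new Stack(); stack.push(1); assertEquals(1, stack.pop()); } }"
)

ranking = rank_candidates(TEST_SOURCE, PRODUCTION)
print(ranking[0], hit_within(ranking, "Stack", 1), hit_within(ranking, "Stack", 5))
```

In this toy corpus the test shares the words `stack`, `push`, and `pop` only with `Stack`, so `Stack` is ranked first. Calling `hit_within` with `k=1` and `k=5` mirrors the two accuracy figures reported in the abstract: exact first-place matches versus matches within the top five.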