Peeler: Learning to Effectively Predict Flakiness without Running Tests
Yihao Qin, Shangwen Wang, Kui Liu, Bo Lin, Hongjun Wu, Li Li, Xiaoguang Mao, Tegawendé F. Bissyandé
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00031
Regression testing is a widely adopted approach to expose change-induced bugs and to verify the correctness/robustness of code in modern software development settings. Unfortunately, the occurrence of flaky tests significantly increases the cost of regression testing and ultimately reduces the productivity of developers (i.e., their ability to find and fix real problems). State-of-the-art approaches leverage dynamic test information, obtained through expensive re-execution of test cases, to effectively identify flaky tests. To address scalability constraints, some recent approaches have built on static test case features, but these fall short on effectiveness. In this paper, we introduce Peeler, a new fully static approach for predicting flaky tests that explores a representation of test cases based on data dependency relations. The predictor is trained as a neural-network-based model that simultaneously achieves scalability (it does not require any test execution), effectiveness (it exploits relevant test dependency features), and practicality (it can be applied in the wild to find new flaky tests). Experimental validation on 17,532 test cases from 21 Java projects shows that Peeler outperforms the state-of-the-art FlakeFlagger by around 20 percentage points: we catch 22% more flaky tests while yielding 51% fewer false positives. Finally, in a live study with projects in the wild, we reported 21 flakiness cases to developers, 12 of which have already been confirmed as indeed flaky.
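
As a rough illustration of what a data-dependency-based representation of a test case could look like, here is a minimal sketch (entirely hypothetical, not Peeler's implementation) that links each variable defined in a simplified Java test body to its later uses:

```python
# Hypothetical sketch: extracting coarse data-dependency triples from the
# statements of a (simplified) Java test method. This is NOT Peeler's code;
# it only illustrates linking a variable's definition site to its later uses.
import re

ASSIGN = re.compile(r"^(?:\w+(?:<[^>]*>)?\s+)?(\w+)\s*=\s*(.+);$")

def data_dependencies(statements):
    """Return (def_line, use_line, variable) triples for a list of statements."""
    defs = {}   # variable name -> line index of its most recent definition
    deps = []
    for i, stmt in enumerate(statements):
        m = ASSIGN.match(stmt.strip())
        rhs = m.group(2) if m else stmt
        # Every previously defined variable mentioned on this line is a use.
        for var, def_line in defs.items():
            if re.search(rf"\b{re.escape(var)}\b", rhs):
                deps.append((def_line, i, var))
        if m:
            defs[m.group(1)] = i  # record (re)definition
    return deps

test_body = [
    'File tmp = File.createTempFile("data", ".txt");',
    "Writer w = new FileWriter(tmp);",
    'w.write("hello");',
    "assertTrue(tmp.exists());",
]
print(data_dependencies(test_body))
# [(0, 1, 'tmp'), (1, 2, 'w'), (0, 3, 'tmp')]
```

Ordered def-use triples like these capture test structure without ever running the test, which is the kind of static signal a learned flakiness predictor could consume.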

Message from the General Co-Chairs and Program Co-Chairs
Pub Date: 2022-10-01 | DOI: 10.1109/icsme55016.2022.00005

LiFUSO: A Tool for Library Feature Unveiling based on Stack Overflow Posts
Camilo Velázquez-Rodríguez, Eleni Constantinou, Coen De Roover
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00065
Selecting a library from a vast ecosystem can be a daunting task. The libraries are not only numerous, but they also lack an enumeration of the features they offer. A feature enumeration for each library in an ecosystem would help developers select the most appropriate library for the task at hand. Within this enumeration, a library feature could take the form of a brief description together with the API references through which the feature can be reused. This paper presents LiFUSO, a tool that leverages Stack Overflow posts to compute a list of such features for a given library. Each feature corresponds to a cluster of related API references based on the similarity of the Stack Overflow posts in which they occur. Once LiFUSO has extracted such a cluster of posts, it applies natural language processing to describe the corresponding feature. We describe the engineering aspects of the tool, and illustrate its usage through a preliminary case study in which we compare the features uncovered for two competing libraries within the same domain. An executable version of the tool is available at https://github.com/softwarelanguageslab/lifuso and its demonstration video is accessible at https://youtu.be/tDE1LWa86cA.
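
As a rough sketch of the clustering idea described above (assumed, not LiFUSO's actual pipeline), API references can be grouped by the overlap of the Stack Overflow posts mentioning them:

```python
# Hypothetical sketch: grouping API references into candidate "features" by
# the similarity of the Stack Overflow posts in which they co-occur. Each
# API reference is represented by the set of post IDs mentioning it, and
# references are greedily merged when their post sets overlap enough.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_api_refs(posts_by_ref, threshold=0.3):
    """posts_by_ref: {api_reference: set(post_ids)} -> list of clusters."""
    clusters = []  # each cluster: (set of refs, union of their post ids)
    for ref, posts in posts_by_ref.items():
        for refs, cposts in clusters:
            if jaccard(posts, cposts) >= threshold:
                refs.add(ref)
                cposts |= posts
                break
        else:
            clusters.append(({ref}, set(posts)))
    return [sorted(refs) for refs, _ in clusters]

# Toy example with invented post IDs:
posts_by_ref = {
    "Files.readAllLines": {1, 2, 3},
    "BufferedReader.readLine": {2, 3, 4},
    "HttpClient.send": {7, 8},
    "HttpRequest.newBuilder": {7, 8, 9},
}
print(cluster_api_refs(posts_by_ref))
# [['BufferedReader.readLine', 'Files.readAllLines'],
#  ['HttpClient.send', 'HttpRequest.newBuilder']]
```

Each resulting cluster would then be summarized with NLP over its posts to produce the feature description.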

The Engineering Implications of Code Maintenance in Practice
N. Lee, R. Abreu, M. Yatbaz, Hang Qu, Nachiappan Nagappan
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00078
Allowing developers to move fast when evolving and maintaining low-latency, large-scale distributed systems is a challenging problem due to i) sheer system complexity and scale, ii) degrading code quality, and iii) the difficulty of performing reliable, rapid change management while the system is in production. Addressing these problems has many benefits for system developer efficiency, reliability, performance, and code maintenance. In this paper, we present a real-world case study of an architectural refactoring project in an industrial setting. The system in scope, codenamed the ItemIndexer delivery system (I2DS), is responsible for processing and delivering a large number of items rapidly to billions of users in real time. I2DS is running in production; it was refactored live over a period of 9 months and assessed through impact validation studies that show a 42% improvement in developer efficiency, an 87% improvement in reliability, a 20% increase in item scoring, a 10% increase in item matching, and 14% CPU savings.

"When the Code becomes a Crime Scene": Towards Dark Web Threat Intelligence with Software Quality Metrics
Giuseppe Cascavilla, Gemma Catolino, Felipe Ebert, D. Tamburri, Willem-Jan van den Heuvel
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00055
The increasing growth of illegal online activities in the so-called dark web (that is, the hidden collective of internet sites only accessible through specialized web browsers) has challenged law enforcement agencies in recent years, with only sparse research efforts to help. For example, research has been devoted to supporting law enforcement by employing Natural Language Processing (NLP) to detect illegal activities on the dark web and build models for their classification. However, current approaches rely strongly upon the linguistic characteristics used to train the models, e.g., language semantics, which threatens their generalizability. To overcome this limitation, we tackle the problem of predicting illegal and criminal activities on the dark web (a process known as threat intelligence) from a complementary perspective, that of dark web code maintenance and evolution, and propose a novel approach that uses software quality metrics and dark website appearance parameters instead of linguistic characteristics. We performed a preliminary empirical study on 10,367 web pages and collected more than 40 code metrics and website parameters using SonarQube. Results show an accuracy of up to 82% for predicting the three activity categories (i.e., suspicious, normal, and unknown) and 66% for detecting 26 specific illegal activities, such as drugs or weapons trafficking. We believe our results can influence current trends in detecting illegal activities on the dark web and put forward a completely novel research avenue toward dealing with this problem from a software maintenance and evolution perspective.
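
A minimal sketch of the classification setup the abstract implies, with entirely synthetic features and labels (the real study uses SonarQube code metrics and appearance parameters):

```python
# Hypothetical sketch: training a classifier to categorize dark-web pages
# from software quality metrics and appearance parameters rather than from
# linguistic features. Feature columns and labels are invented for
# illustration (e.g., code smells, duplicated lines, number of images).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((500, 5))              # 5 synthetic metric columns
y = rng.integers(0, 3, 500)           # 0=normal, 1=suspicious, 2=unknown

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```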

Deceiving Deep Neural Networks-Based Binary Code Matching with Adversarial Programs
W. Wong, Huaijin Wang, Pingchuan Ma, Shuai Wang, Mingyue Jiang, T. Chen, Qiyi Tang, Sen Nie, Shi Wu
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00019
Deep neural networks (DNNs) have achieved major success in solving challenging tasks such as social network analysis and image classification. Despite the prosperous development of DNNs, recent research has demonstrated the feasibility of exploiting DNNs with adversarial examples, in which a small distortion is added to the input data to largely mislead the predictions of DNNs. Determining the similarity of two binary code fragments is the foundation of many reverse engineering, re-engineering, and security applications. Currently, the majority of binary code matching tools are based on DNNs, whose dependability has not been thoroughly studied. In this research, we present an attack that perturbs software in executable format to deceive DNN-based binary code matching. Unlike prior attacks, which mostly change non-functional code components to generate adversarial programs, our approach proposes several semantics-preserving transformations that operate directly on the control flow graph of binary code, making it particularly effective at deceiving DNNs. To speed up the process, we design a framework that leverages gradient- or hill-climbing-based optimizations to generate adversarial examples in both white-box and black-box settings. We evaluated our attack against two popular DNN-based binary code matching tools, asm2vec and ncc, and achieve reasonably high success rates. Our attack against an industrial-strength DNN-based binary code matching service, BinaryAI, shows that the proposed attack can fool remote APIs in challenging black-box settings with a success rate of over 16.2% (on average). Furthermore, we show that the generated adversarial programs can be used to augment the robustness of two white-box models, asm2vec and ncc, reducing the attack success rates by 17.3% and 6.8% while preserving stable, if not better, standard accuracy.
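
In the black-box setting, a hill-climbing search of the kind mentioned above could look like the following sketch, where `apply` and `similarity` are hypothetical stand-ins for a semantics-preserving transformation engine and the target matching model:

```python
# Minimal hill-climbing sketch in the spirit of the black-box setting the
# paper describes. Everything here is hypothetical: `apply(binary, t)` would
# re-apply a semantics-preserving CFG transformation, and `similarity` would
# query the target binary-code-matching model.
import random

def hill_climb_attack(binary, target, transforms, apply, similarity,
                      budget=200, goal=0.5):
    """Greedily accumulate transformations that lower the matching score."""
    best, best_score = binary, similarity(binary, target)
    for _ in range(budget):
        t = random.choice(transforms)
        candidate = apply(best, t)      # semantics preserved by design
        score = similarity(candidate, target)
        if score < best_score:          # keep only improving moves
            best, best_score = candidate, score
        if best_score < goal:           # matching model now fooled
            break
    return best, best_score
```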

Integrating Software Issue Tracking and Traceability Models
Naveen Ganesh Muralidharan, Vera Pantelic, V. Bandur, R. Paige
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00053
Awareness of the importance of systems and software traceability, as well as tool support for it, has improved over the years. However, an effective solution for traceability must align and integrate with an organization’s engineering processes. Specifically, the phases of the traceability process model (traceability strategy, creation, use, and maintenance) must be aligned with the organization’s engineering processes. Previous research has discussed the benefits of integrating traceability into the configuration management process. In this paper, we propose Change Request management based on traceability data. In our approach, new Change Requests (CRs) are created from the traceability model of the corresponding project, and each created CR contains the portion of the project’s overall traceability model that is relevant to that change. A proof-of-concept issue tracking system that uses a traceability model at its core is proposed.
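
A minimal sketch of the core idea, with an invented artifact naming scheme: when a CR is opened for an artifact, attach the slice of the traceability model reachable from it:

```python
# Hypothetical sketch: extracting the portion of a traceability model that
# is relevant to a Change Request. Trace links and artifact names are
# illustrative, not the paper's actual model.
from collections import deque

def trace_slice(trace_links, changed):
    """BFS over trace links; returns the links relevant to `changed`."""
    neighbors = {}
    for src, dst in trace_links:
        neighbors.setdefault(src, []).append(dst)
        neighbors.setdefault(dst, []).append(src)  # treat links as bidirectional
    seen, queue = {changed}, deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return [(s, d) for s, d in trace_links if s in seen and d in seen]

links = [("REQ-1", "DES-3"), ("DES-3", "SRC-main.c"), ("REQ-2", "DES-4")]
print(trace_slice(links, "SRC-main.c"))
# [('REQ-1', 'DES-3'), ('DES-3', 'SRC-main.c')]
```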

An Effective Approach for Parsing Large Log Files
Issam Sedki, A. Hamou-Lhadj, O. Mohamed, M. Shehab
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00009
Because of their contribution to the overall reliability assurance process, software logs have become important data assets for the analysis of software systems. Logs are often the only data points that can shed light on how a software system behaves once deployed. Unfortunately, logs are often unstructured, hindering viable analysis of their content. Several studies aim to automatically parse large log files; the primary goal is to create templates from raw log data samples that can later be used to recognize future logs. In this paper, we propose ULP, a Unified Log Parsing tool that is highly accurate and efficient. ULP combines string matching and local frequency analysis to parse large log files efficiently. First, log events are organized into groups using a text processing method. Frequency analysis is then applied locally to instances of the same group to identify the static and dynamic content of log events. When applied to 10 log datasets of the LogPai benchmark, ULP achieves an average accuracy of 89.2%, outperforming four leading log parsing tools: Drain, Logram, SPELL, and AEL. Additionally, ULP can parse up to four million log events in less than 3 minutes. ULP is available online as open source and can readily be used by practitioners and researchers to parse large log files effectively and efficiently in support of log analysis tasks.
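
A toy illustration of local frequency analysis as the abstract describes it (not ULP's actual code): within a group of similar log events, tokens that are constant across the group become template text, while varying tokens become parameters:

```python
# Sketch of local frequency analysis over one group of similar log events:
# a token occurring in every message at a given position is treated as
# static template text; anything else is abstracted as a parameter.
from collections import Counter

def template_from_group(messages):
    token_rows = [m.split() for m in messages]
    template = []
    for position in zip(*token_rows):          # column-wise over tokens
        counts = Counter(position)
        token, freq = counts.most_common(1)[0]
        template.append(token if freq == len(messages) else "<*>")
    return " ".join(template)

group = [
    "Connection from 10.0.0.5 closed after 120 ms",
    "Connection from 10.0.0.9 closed after 87 ms",
    "Connection from 172.16.3.2 closed after 455 ms",
]
print(template_from_group(group))
# Connection from <*> closed after <*> ms
```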

Why Don’t XAI Techniques Agree? Characterizing the Disagreements Between Post-hoc Explanations of Defect Predictions
Saumendu Roy, Gabriel Laberge, Banani Roy, Foutse Khomh, Amin Nikanjam, Saikat Mondal
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00056
Machine Learning (ML) based defect prediction models can be used to improve the reliability and overall quality of software systems. However, such defect predictors might not be deployed in real applications due to their lack of transparency. Thus, several post-hoc explanation methods (e.g., LIME and SHAP) have recently gained popularity. These explanation methods can offer insight by ranking features based on their importance in black-box decisions. The explainability of ML techniques is fairly novel in the Software Engineering community, and it is still unclear whether such explainability methods genuinely help practitioners make better decisions regarding software maintenance. Recent user studies show that data scientists usually utilize multiple post-hoc explainers to understand a single model decision because of the lack of ground truth. Such a scenario causes disagreement between explainability methods and impedes drawing conclusions. Therefore, our study first investigates three disagreement metrics between the LIME and SHAP explanations of 10 defect predictors and shows that disagreements regarding the rankings of feature importance are the most frequent. Our findings lead us to propose a method for aggregating LIME and SHAP explanations that puts less emphasis on these disagreements while highlighting the aspects on which the explanations agree.
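
One common disagreement metric in this literature is top-k feature agreement; the sketch below (with invented importance scores, since the abstract does not specify the paper's exact three metrics) shows how two explanations of the same prediction can be compared:

```python
# Hypothetical sketch of one disagreement metric: top-k feature agreement
# between two post-hoc explanations of the same defect prediction.
def top_k_agreement(expl_a, expl_b, k=3):
    """Fraction of overlap between the k most important features."""
    top = lambda e: {f for f, _ in
                     sorted(e.items(), key=lambda x: -abs(x[1]))[:k]}
    return len(top(expl_a) & top(expl_b)) / k

# Invented importance scores for the same prediction:
lime_scores = {"loc": 0.41, "churn": 0.33, "cyclomatic": 0.12, "authors": 0.05}
shap_scores = {"churn": 0.38, "authors": 0.29, "loc": 0.21, "cyclomatic": 0.02}
print(top_k_agreement(lime_scores, shap_scores))  # 0.666... -> partial agreement
```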

RepoQuester: A Tool Towards Evaluating GitHub Projects
Kowndinya Boyalakuntla, M. Nagappan, S. Chimalakonda, Nuthan Munaiah
Pub Date: 2022-10-01 | DOI: 10.1109/ICSME55016.2022.00069
Given the drastic rise in the number of repositories on GitHub, it is often hard for developers to find relevant projects that meet their requirements, as analyzing source code and other artifacts is effort-intensive. In our prior work, we proposed Repo Reaper (or simply Reaper), which assesses GitHub projects based on seven metrics spanning project collaboration, quality, and maintenance. By classifying projects into ‘engineered’ and ‘non-engineered’ software projects, Reaper identified 1.4 million out of nearly 1.8 million projects as having no purpose for collaboration or software development. While Reaper can be used to assess millions of repositories based on GHTorrent, it depends on GHTorrent and is not designed to be used by developers for standalone repositories on local machines. Hence, in this paper, we propose a re-engineered and extended command-line tool named RepoQuester that aims to assist developers in evaluating GitHub projects on their local machines. RepoQuester computes metrics for projects but does not classify projects into ‘engineered’ and ‘non-engineered’ ones. However, to demonstrate the correctness of the metric scores produced by RepoQuester, we performed the project classification on Reaper’s training and validation datasets after updating them with the latest metric scores (as reported by RepoQuester). These datasets have their ground truth manually established. During the analysis, we observed that the machine learning classifiers built on the updated datasets produced an F1 score of 72%. During the evaluation, we found that RepoQuester can compute a project’s metric scores in less than 10 seconds. A demo video explaining the tool’s highlights and usage is available at https://youtu.be/Q8OdmNzUfN0, and the source code at https://github.com/Kowndinya2000/Repoquester.
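
As an illustration of the kind of metric such a tool computes locally (hypothetical, not RepoQuester's code), commit frequency can be derived from `git log` timestamps:

```python
# Hypothetical sketch of one locally computable repository metric:
# average commits per month over the repository's lifetime.
import subprocess

def commits_per_month(repo_path):
    """Average commits per month, derived from `git log` commit timestamps."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%ct"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    timestamps = sorted(int(t) for t in out)
    if len(timestamps) < 2:
        return float(len(timestamps))
    months = max((timestamps[-1] - timestamps[0]) / (30 * 24 * 3600), 1.0)
    return len(timestamps) / months

# e.g. print(commits_per_month("."))
```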