
2020 IEEE 20th International Working Conference on Source Code Analysis and Manipulation (SCAM): Latest Publications

Free the Bugs: Disclosing Blocking Violations in Reactive Programming
Felix Dobslaw, Morgan Vallin, Robin Sundström
In programming, concurrency allows threads to share processing units in an interleaved, seemingly simultaneous manner to improve resource utilization and performance. Previous research has found that concurrency faults are hard to avoid and hard to find, and often lead to undesired and unpredictable behavior. Further, with the growing availability of multi-core devices and the adoption of concurrency features in high-level languages, concurrency faults are reported to occur frequently, which is why countermeasures must be investigated to limit the harm. Reactive programming provides an abstraction that simplifies complex concurrent and asynchronous tasks through reactive language extensions such as the RxJava and Project Reactor libraries for Java. Still, blocking violations can result in concurrency faults without any Java compiler warnings. BlockHound is a tool that detects incorrect blocking by wrapping the original code and intercepting blocking calls to raise appropriate runtime errors. In this study, we seek to understand how common blocking violations are and whether a tool such as BlockHound can give insight into their root causes so they can be highlighted as pitfalls to developers. The investigated subjects are Java-based open-source projects using reactive frameworks, selected for high star ratings and large fork counts that indicate wide adoption. We activated BlockHound in the projects' test suites and analyzed the log files for common patterns, revealing blocking violations in 7 of the 29 investigated open-source projects (5024 stars and 1437 forks combined). A small number of system calls could be identified as root causes. We present countermeasures that successfully removed the uncertainty of blocking violations; the code's intended logic was retained in all validated projects, as confirmed by passing unit tests.
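As a concrete illustration of the violations the study hunts for, the sketch below (our own minimal example, assuming the reactor-core and blockhound dependencies are on the classpath; not code from the paper) installs BlockHound and then blocks on a thread that Reactor marks as non-blocking, so the blocking call fails loudly at runtime instead of silently degrading the pipeline:

```java
import java.time.Duration;
import reactor.blockhound.BlockHound;
import reactor.core.publisher.Mono;

public class BlockingViolationDemo {
    public static void main(String[] args) {
        // Instrument the JVM: from now on, blocking calls made on threads
        // that Reactor marks as non-blocking throw BlockingOperationError.
        BlockHound.install();

        Mono.delay(Duration.ofMillis(1))   // continues on a parallel (non-blocking) scheduler
            .doOnNext(tick -> {
                try {
                    Thread.sleep(100);     // a blocking call on a non-blocking thread
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            })
            .block();                      // fails with reactor.blockhound.BlockingOperationError
    }
}
```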
Citations: 1
DCT: A Scalable Multi-Objective Module Clustering Tool
Ana Paula M. Tarchetti, L. Amaral, M. Oliveira, R. Bonifácio, G. Pinto, D. Lo
Maintaining complex software systems is a time-consuming and challenging task. Practitioners must have a general understanding of the system's decomposition and of how the system's developers have implemented the software features (which probably cut across different modules). Re-engineering practices are imperative to tackle these challenges. Previous research has shown the benefits of using software module clustering (SMC) to aid developers during re-engineering tasks (e.g., revealing the architecture of a system, identifying how concerns are spread among its modules, recommending refactorings, and so on). Nonetheless, although the literature on software module clustering has evolved substantially over the last 20 years, only a few tools are publicly available, and these tools do not scale to large scenarios, in particular when optimizing multiple objectives. In this paper we present the Draco Clustering Tool (DCT), a new software module clustering tool. DCT's design decisions make multi-objective software clustering feasible, even for software systems comprising up to 1,000 modules. We report an empirical study that compares DCT with another available multi-objective tool (HD-NSGA-II), and both DCT and HD-NSGA-II with mono-objective tools (BUNCH and HD-LNS). We show that DCT solves the scalability issue when clustering medium-size projects in a multi-objective mode. In a more extreme case, DCT was able to cluster Druid (an analytics data store) 221 times faster than HD-NSGA-II.
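To make the multi-objective setting concrete, the toy sketch below (our illustration under assumed objectives, not DCT's actual fitness functions) computes two classic module-clustering objectives that such tools trade off: intra-cluster cohesion, to be maximised, and inter-cluster coupling, to be minimised:

```java
import java.util.*;

public class ClusteringObjectives {
    // deps maps each module to the modules it depends on;
    // cluster maps each module to its assigned cluster id.
    public static int[] cohesionAndCoupling(Map<String, Set<String>> deps,
                                            Map<String, Integer> cluster) {
        int cohesion = 0, coupling = 0;
        for (var e : deps.entrySet()) {
            for (String target : e.getValue()) {
                // a dependency staying inside one cluster adds cohesion,
                // one crossing cluster boundaries adds coupling
                if (Objects.equals(cluster.get(e.getKey()), cluster.get(target))) cohesion++;
                else coupling++;
            }
        }
        return new int[] {cohesion, coupling};
    }

    public static void main(String[] args) {
        Map<String, Set<String>> deps = Map.of(
            "a", Set.of("b"), "b", Set.of("a"), "c", Set.of("a"));
        Map<String, Integer> cluster = Map.of("a", 0, "b", 0, "c", 1);
        int[] obj = cohesionAndCoupling(deps, cluster);
        System.out.println("cohesion=" + obj[0] + " coupling=" + obj[1]); // cohesion=2 coupling=1
    }
}
```

A multi-objective optimizer such as NSGA-II then searches for cluster assignments that are Pareto-optimal with respect to objectives of this kind.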
Citations: 2
Automated Identification of On-hold Self-admitted Technical Debt
Rungroj Maipradit, B. Lin, Csaba Nagy, G. Bavota, Michele Lanza, Hideaki Hata, Ken-ichi Matsumoto
Modern software is developed under considerable time pressure, which implies that developers more often than not have to resort to compromises between code that is well written and code that just does the job. Over the past decades this has led to the concept of "technical debt": a short-term hack that potentially generates long-term maintenance problems. Self-admitted technical debt (SATD) is a particular form of technical debt: developers consciously perform the hack but also document it in the code by adding comments as a reminder (or as an admission of guilt). We focus on a specific type of SATD, namely "On-hold" SATD, in which developers document in their comments the need to halt an implementation task due to conditions outside their scope of work (e.g., an open issue must be closed before a function can be implemented). We present an approach, based on regular expressions and machine learning, that detects issues referenced in code comments and automatically classifies the detected instances as either "On-hold" (the issue is referenced to indicate the need to wait for its resolution before completing a task) or "cross-reference" (the issue is referenced to document the code, for example to explain the rationale behind an implementation choice). Our approach also mines the projects' issue trackers to check whether On-hold SATD instances are "superfluous" and can be removed (i.e., the referenced issue has been closed, but the SATD comment is still in the code). Our evaluation confirms that our approach can indeed identify relevant instances of On-hold SATD. We illustrate its usefulness by identifying superfluous On-hold SATD instances in open source projects, as confirmed by the original developers.
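The regular-expression half of such a detector can be pictured with the toy sketch below (the pattern and comments are our own illustration, not the paper's actual expressions); the machine-learning half would then classify each detected reference as On-hold or cross-reference:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IssueReferenceFinder {
    // Matches plain issue numbers ("#1234") and JIRA-style keys ("ABC-42"),
    // optionally preceded by words like "issue" or "bug".
    private static final Pattern ISSUE_REF =
        Pattern.compile("(?i)\\b(?:issue|bug)?\\s*#?([A-Z]+-\\d+|\\d{2,6})\\b");

    public static void main(String[] args) {
        List<String> comments = List.of(
            "// TODO: enable this once issue #1234 is fixed",    // candidate On-hold
            "// see JIRA-567 for the rationale behind this cast" // candidate cross-reference
        );
        for (String c : comments) {
            Matcher m = ISSUE_REF.matcher(c);
            while (m.find()) {
                System.out.println("reference '" + m.group(1) + "' in: " + c);
            }
        }
    }
}
```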
Citations: 16
Engineering a Converter Between Two Domain-Specific Languages for Sorting
J. Fabry, Ynès Jaradin, Aynel Gül
Part of the ecosystem of applications running on mainframe computers is the DFSORT program. It is responsible for sorting and reformatting data (among other functionalities) and is configured by specifications written in a Domain-Specific Language (DSL). When migrating such sort workloads off the mainframe, the SyncSort product is an attractive alternative. It is also configured by specifications written in a DSL, but this language is structured in a radically different way: whereas the DFSORT DSL uses an explicit fixed pipeline for processing, the SyncSort DSL does not. To allow DFSORT workloads to run on SyncSort, we have therefore built a source-to-source translator from the DFSORT DSL to the SyncSort DSL. Our language converter performs abstract interpretation of the DFSORT specification, considering the different steps of the DFSORT pipeline at translation time. This is done by building a graph of objects, and key to the construction of this graph is the reification of the records being sorted. In this paper we report on the design and implementation of the converter, describing how it treats the DFSORT pipeline. We also show how its design allowed for the straightforward implementation of unexpected changes in the requirements for the generated output.
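The following toy sketch (our own illustration; the field names and the projection stage are hypothetical, not the paper's implementation) shows what reifying records can buy a converter: by modelling a pipeline stage as a transformation over an explicit record layout, the translator can compute at translation time where every field ends up:

```java
import java.util.*;

public class RecordReification {
    record Field(String name, int offset, int length) {}

    // An INREC-style reformatting stage: keep a subset of fields and
    // repack them contiguously, producing the layout seen downstream.
    static List<Field> project(List<Field> layout, List<String> keep) {
        List<Field> out = new ArrayList<>();
        int offset = 1; // DFSORT positions are 1-based
        for (String name : keep) {
            Field f = layout.stream().filter(x -> x.name().equals(name))
                            .findFirst().orElseThrow();
            out.add(new Field(f.name(), offset, f.length()));
            offset += f.length();
        }
        return out;
    }

    public static void main(String[] args) {
        List<Field> input = List.of(new Field("id", 1, 8),
                                    new Field("date", 9, 6),
                                    new Field("amount", 15, 10));
        // After projection, 'amount' starts at position 9, not 15:
        System.out.println(project(input, List.of("id", "amount")));
    }
}
```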
Citations: 0
Out of Sight, Out of Place: Detecting and Assessing Swapped Arguments
Roger Scott, Joseph Ranieri, Lucja Kot, Vineeth Kashyap
Programmers often encode meaningful information about program semantics when naming program entities such as variables, functions, and macros. However, static analysis tools typically discount this information when they look for bugs in a program. In this work, we describe the design and implementation of a static analysis checker called SWAPD, which uses the natural-language information in programs to warn about mistakenly swapped arguments at call sites. SWAPD combines two independent detection strategies to improve the effectiveness of the overall checker. We present the results of a comprehensive evaluation of SWAPD over a large corpus of C and C++ programs totaling 417 million lines of code. In this evaluation, SWAPD found 154 manually vetted real-world cases of mistakenly swapped arguments, suggesting that such errors, while not pervasive in released code, are a real problem and a worthwhile target for static analysis.
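A toy version of one such name-based strategy (our own illustration, not SWAPD's actual algorithm) flags a call site when pairing each argument name with the other parameter's name explains the names better than the straight pairing:

```java
public class SwapCheck {
    // Crude normalized common-prefix score; real tools use richer lexical models.
    static double similarity(String a, String b) {
        a = a.toLowerCase(); b = b.toLowerCase();
        int n = Math.min(a.length(), b.length()), k = 0;
        while (k < n && a.charAt(k) == b.charAt(k)) k++;
        return (double) k / Math.max(a.length(), b.length());
    }

    static boolean looksSwapped(String param1, String param2, String arg1, String arg2) {
        double straight = similarity(param1, arg1) + similarity(param2, arg2);
        double crossed  = similarity(param1, arg2) + similarity(param2, arg1);
        return crossed > straight; // the crossed pairing explains the names better
    }

    public static void main(String[] args) {
        // copy(dest, src): the call site passes (srcBuf, destBuf) by mistake
        System.out.println(looksSwapped("dest", "src", "srcBuf", "destBuf")); // true
    }
}
```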
Citations: 6
MUTAMA: An Automated Multi-label Tagging Approach for Software Libraries on Maven
Camilo Velázquez-Rodríguez, Coen De Roover
Recent studies show that the Maven ecosystem alone already contains over 2 million library artefacts, including their source code, byte code, and documentation. To help developers cope with this information, several websites overlay configurable views on the ecosystem: for instance, views in which similar libraries are grouped into categories, or views showing all libraries tagged with tags corresponding to coarse-grained library features. The MVNRepository overlay website offers both category-based and tag-based views. Unfortunately, several libraries have not been categorised or are missing relevant tags. Some initial approaches to the automated categorisation of Maven libraries have already been proposed; however, no such approach exists for tagging libraries in a multi-label setting. This paper proposes MUTAMA, a multi-label classification approach to the Maven library tagging problem based on information extracted from the byte code of each library. We analysed 4088 randomly selected libraries from the Maven software ecosystem. MUTAMA trains and deploys five multi-label classifiers using feature vectors obtained from the class and method names of the tagged libraries. Our results indicate that classifiers based on ensemble methods achieve the best performance. Finally, we propose directions for future work in this area.
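The feature-extraction step can be pictured with the sketch below (our own illustration of a plausible pipeline stage, not MUTAMA's exact implementation): identifiers harvested from byte code are split on camel-case boundaries into a bag-of-words that a multi-label classifier can consume:

```java
import java.util.*;

public class IdentifierFeatures {
    static Map<String, Integer> bagOfWords(List<String> identifiers) {
        Map<String, Integer> bag = new TreeMap<>();
        for (String id : identifiers) {
            // split on lower-to-upper transitions: "parseJsonArray" -> parse, Json, Array
            for (String tok : id.split("(?<=[a-z0-9])(?=[A-Z])")) {
                bag.merge(tok.toLowerCase(), 1, Integer::sum);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(bagOfWords(List.of("JsonParser", "parseJsonArray", "readValue")));
        // {array=1, json=2, parse=1, parser=1, read=1, value=1}
    }
}
```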
Citations: 5
The Role of Implicit Conversions in Erroneous Function Argument Swapping in C++
Richárd Szalay, Ábel Sinkovics, Z. Porkoláb
Argument selection defects, in which the programmer has chosen the wrong argument to a function call, are a widely investigated problem. In statically typed programming languages, the compiler can detect such misuse of arguments based on the argument and parameter types. When adjacent parameters have the same type, however, or can be converted into one another, the potential error will not be diagnosed. Related research is usually confined to exact type equivalence, often ignoring potential implicit or explicit conversions. Yet in current mainstream languages such as C++, built-in conversions between numeric types and user-defined conversions may significantly increase the number of mistakes that go unnoticed. We investigated the situation for the C and C++ languages, where functions defined with multiple adjacent parameters allow arguments to be passed in the wrong order. When implicit conversions are taken into account, the number of mistake-prone function declarations increases significantly compared to strict type equivalence. We analysed the outcome and categorised the offending parameter types. The empirical results should further encourage the language and library development community to emphasise the importance of strong typing and the restriction of implicit conversions.
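Although the paper targets C++, Java's implicit widening reproduces the failure mode in miniature, as the hypothetical sketch below shows: once a conversion bridges the types of two adjacent parameters, a swapped call compiles without any diagnostic:

```java
public class WideningSwap {
    // Adjacent parameters of convertible types: int widens to long implicitly.
    static String schedule(int taskId, long delayMillis) {
        return "task " + taskId + " in " + delayMillis + " ms";
    }

    public static void main(String[] args) {
        int taskId = 7;
        int delayMillis = 500;
        // Arguments swapped at the call site: this compiles without warning,
        // because delayMillis fits the int parameter and taskId widens to long.
        System.out.println(schedule(delayMillis, taskId)); // "task 500 in 7 ms"
    }
}
```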
Citations: 4
An Investigation into the Effect of Control and Data Dependence Paths on Predicate Testability
D. Binkley, James R. Glenn, A. Alsharif, Phil McMinn
The squeeziness of a sequence of program statements captures the loss of information (loss of entropy) caused by its execution. This information loss leads to problems such as failed error propagation. Intuitively, longer, more complex statement sequences (more formally, longer paths of dependencies) bring greater squeeze. Using the cost of search-based test data generation as a measure of lost information, we investigate this intuition. Unexpectedly, we find virtually no correlation between dependence path length and information loss. Thus our study represents an (unexpected) negative result. Moreover, looking through the literature, this finding is in agreement with recent work of Masri and Podgurski. As such, our work replicates a negative result. More precisely, it provides a conceptual, generalization-and-extension replication: it is a conceptual replication in that different methods are used to address a common problem, and a generalization and extension in that we sample a different population of subjects and consider the resulting data more rigorously. Specifically, while Masri and Podgurski only informally observed the lack of a connection, we rigorously assess it using a range of statistical models.
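For readers new to the term: in the failed-error-propagation literature, squeeziness is commonly defined as the entropy lost across a computation (our hedged summary of the standard definition, not a formula quoted from this paper):

$$\mathrm{Sq}(f) = \mathcal{H}(I) - \mathcal{H}(O), \qquad \mathcal{H}(X) = -\sum_{x} p(x)\,\log_2 p(x)$$

For example, a predicate mapping a uniformly distributed 32-bit input (32 bits of entropy) to a single boolean (at most 1 bit) squeezes away at least 31 bits, so a corrupted internal state can easily fail to propagate to the observed outcome.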
Citations: 0
Techniques for Efficient Automated Elimination of False Positives
Tukaram Muske, A. Serebrenik
Static analysis tools are useful for detecting common programming errors; however, they generate a large number of false positives. Postprocessing these alarms with a model checker has been proposed to automatically eliminate the false positives among them. To scale up this automated false positives elimination (AFPE), several techniques such as program slicing are used. However, these techniques increase the time taken by AFPE, and this increased time is a major concern when applying AFPE to alarms generated on large systems. To reduce the time taken by AFPE, we propose two techniques that identify and skip redundant calls to the slicer and the model checker. The first technique is based on our observation that (a) combining application-level slicing, verification with incremental context, and context-level slicing helps to eliminate more false positives, but (b) doing so can result in redundant calls to the slicer; in this technique, we use data dependencies to compute these redundant calls. The second technique is based on our observation that (a) code partitioning is commonly used by static analysis tools to analyze very large systems, and (b) applying AFPE to alarms generated on partitioned code can result in repeated calls to both the slicer and the model checker; we use memoization to identify the repeated calls and skip them. The first technique is currently under evaluation. Our initial evaluation of the second technique indicates that it reduces AFPE time by up to 56%, with a median reduction of 12.15%.
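The memoization idea behind the second technique can be sketched as follows (our own minimal illustration with a stubbed verifier, not the paper's implementation): model-checker invocations are keyed by a canonical form of the input slice, so identical queries arising from partitioned code are answered from a cache instead of re-running the verifier:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class VerifierMemo {
    private final Map<String, Boolean> cache = new HashMap<>();
    private final Function<String, Boolean> modelChecker; // the expensive call

    VerifierMemo(Function<String, Boolean> modelChecker) {
        this.modelChecker = modelChecker;
    }

    boolean verify(String canonicalSlice) {
        // Run the model checker only on slices not seen before.
        return cache.computeIfAbsent(canonicalSlice, modelChecker);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        VerifierMemo memo = new VerifierMemo(slice -> { calls[0]++; return slice.contains("safe"); });
        memo.verify("slice-A: safe");
        memo.verify("slice-A: safe"); // repeated query: served from the cache
        System.out.println("model checker ran " + calls[0] + " time(s)"); // 1
    }
}
```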
Citations: 4
Title Page iii
{"title":"Title Page iii","authors":"","doi":"10.1109/scam51674.2020.00002","DOIUrl":"https://doi.org/10.1109/scam51674.2020.00002","url":null,"abstract":"","PeriodicalId":410351,"journal":{"name":"2020 IEEE 20th International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129138482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0