Pattern matching is an expressive way of matching data and extracting pieces of information from it. The recent inclusion of pattern matching in the Java and Python languages highlights that developers increasingly adopt such a facility for everyday development. Other mainstream programming languages also offer pattern matching capabilities as part of the language (Rust, Scala, Haskell, and OCaml), with different degrees of expressivity in what can be matched. Meanwhile, pattern matching on graphs takes a slightly different direction: it enhances the expressivity of the patterns that can be defined. Smalltalk currently offers little pattern matching capability for finding specific objects inside a large graph of objects using a declarative pattern. In Pharo, the closest existing library to classical pattern matching is the RBParseTreeSearcher, which allows specialized patterns to be expressed over a Pharo Abstract Syntax Tree in order to find some inner node. The question arises of what features a flexible pattern matching language should have. In this paper, we review the features found in different existing pattern matching languages, both in General Purpose Languages (like Java) and in declarative graph pattern matching languages. We then describe MoTion, a new pattern matching engine for Pharo Smalltalk that combines all these features. We discuss some aspects of MoTion's implementation and illustrate its use with real-world examples.
In solving binary code similarity detection, many approaches choose to operate on certain unified intermediate representations (IRs), such as Low Level Virtual Machine (LLVM) IR, to overcome the cross-architecture analysis challenge induced by the significant morphological and syntactic gaps across the diverse instruction set architectures (ISAs). However, the LLVM IRs of the same program can be affected by diverse factors, such as the acquisition source, i.e., compiled from source code or disassembled and lifted from binary code. While the impact of compilation settings on binary code has been explored, the specific differences between LLVM IRs from varied sources remain underexamined. To this end, we pioneer an in-depth empirical study to assess the discrepancies in LLVM IRs derived from different sources. Correspondingly, an extensive dataset containing nearly 98 million LLVM IR instructions distributed in 808,431 functions is curated with respect to these potential IR-influential factors. On this basis, three types of code metrics detailing the syntactic, structural, and semantic aspects of the IR samples are devised and leveraged to assess the divergence of the IRs across different origins. The findings offer insights into how and to what extent the various factors affect the IRs, providing valuable guidance for assembling a training corpus aimed at developing robust LLVM IR-oriented pre-training models, as well as facilitating relevant program analysis studies that operate on the LLVM IRs.
Fault localization aims to automatically identify the cause of an error in a program by localizing the error to a relatively small part of the program. In this paper, we present a novel technique for automated fault localization via error invariants inferred by abstract interpretation. An error invariant for a location in an error program over-approximates the reachable states at the given location that may produce the error, if the execution of the program is continued from that location. Error invariants can be used for statement-wise semantic slicing of error programs and for obtaining concise error explanations. We use an iterative refinement sequence of backward–forward static analyses by abstract interpretation to compute error invariants, which are designed to explain why an error program violates a particular assertion.
Furthermore, we present a practical application of the fault localization technique for automatic repair of programs. Given an erroneous program, we first use the fault localization to automatically identify statements relevant for the error, and then repeatedly mutate the expressions in those relevant statements until a correct program that satisfies all assertions is found. All other statements classified by the fault localization as irrelevant for the error are not mutated in the program repair process. This way, we significantly reduce the search space of mutated programs without losing any potentially correct program, and so locate a repaired program much faster than a program repair without fault localization.
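The localize-then-mutate loop described above can be sketched on a toy straight-line language. This is only an illustration under strong assumptions, not the authors' tool: a simple backward dependency slice stands in for the error invariants computed by abstract interpretation, candidate programs are checked on a few sample inputs rather than verified against assertions, and all names (`run`, `relevant_statements`, `repair`, the tuple encoding of statements) are hypothetical.

```python
import operator
from itertools import product

# Operators that a mutation may substitute into a statement.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def run(program, inputs):
    """Execute statements of the form (target, op, left, right)."""
    env = dict(inputs)
    for target, op, left, right in program:
        lhs = env[left] if isinstance(left, str) else left
        rhs = env[right] if isinstance(right, str) else right
        env[target] = OPS[op](lhs, rhs)
    return env


def relevant_statements(program, asserted_vars):
    """Stand-in for fault localization: backward slice from the asserted variables."""
    needed = set(asserted_vars)
    relevant = []
    for index in range(len(program) - 1, -1, -1):
        target, _, left, right = program[index]
        if target in needed:
            relevant.append(index)
            needed.discard(target)
            needed.update(x for x in (left, right) if isinstance(x, str))
    return sorted(relevant)


def repair(program, samples, asserted_vars, check):
    """Mutate operators in relevant statements only, until all samples pass."""
    indices = relevant_statements(program, asserted_vars)
    for ops in product(OPS, repeat=len(indices)):
        candidate = list(program)
        for index, op in zip(indices, ops):
            target, _, left, right = candidate[index]
            candidate[index] = (target, op, left, right)
        if all(check(s, run(candidate, s)) for s in samples):
            return candidate
    return None  # no repair exists within this mutation space


# Intended behaviour: p == 2 * (w + h). The first statement is faulty.
buggy = [
    ("s", "-", "w", "h"),    # bug: should add
    ("log", "*", "w", 2),    # irrelevant to the assertion on p
    ("p", "*", "s", 2),
]
samples = [{"w": 3, "h": 4}, {"w": 5, "h": 1}]
fixed = repair(buggy, samples, {"p"}, lambda i, o: o["p"] == 2 * (i["w"] + i["h"]))
```

In this example only two of the three statements are classified as relevant, so the search enumerates at most 3^2 = 9 candidates instead of 3^3 = 27, and the statement computing `log` is never mutated.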
We have developed a prototype tool for automatic fault localization and repair of C programs. We demonstrate the effectiveness of our approach in localizing errors in realistic C programs and subsequently repairing them. Moreover, we show that our approach, which combines fault localization and code mutations, is significantly faster than the previous program repair approach without fault localization.

