
Latest Publications: 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE)

Formal Verification of Masking Countermeasures for Arithmetic Programs*
Pengfei Gao
Cryptographic algorithms are widely used to protect data privacy in many aspects of daily life, from smart cards to cyber-physical systems. Unfortunately, programs implementing cryptographic algorithms may be vulnerable to practical power side-channel attacks, which can infer private data via statistical analysis of the correlation between an electronic device's power consumption and the private data. To thwart these attacks, several masking schemes have been proposed. However, programs that rely on secure masking schemes are not secure a priori. Although some techniques have been proposed for formally verifying masking countermeasures and for quantifying masking strength, they are currently limited to Boolean programs and suffer from low accuracy. In this work, we propose an approach for formally verifying masking countermeasures of arithmetic programs. Our approach is more accurate for arithmetic programs and more scalable for Boolean programs compared to existing approaches. We have implemented our methods in a verification tool, QMVERIF, which has been extensively evaluated on cryptographic benchmarks including full AES, DES, and MAC-Keccak. The experimental results demonstrate the effectiveness and efficiency of our approach, especially for compositional reasoning.
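For readers unfamiliar with the countermeasure being verified, here is a minimal sketch of first-order arithmetic masking, assuming byte-sized values as in software AES; it illustrates the general idea only, not QMVERIF's internal encoding:

```python
import secrets

MOD = 256  # byte-sized arithmetic, as in software implementations of AES

def mask(secret: int) -> tuple[int, int]:
    """Split a secret byte into two arithmetic shares.

    Each share in isolation is uniformly distributed, so its power
    profile carries no first-order information about the secret.
    """
    r = secrets.randbelow(MOD)
    return r, (secret - r) % MOD

def unmask(share0: int, share1: int) -> int:
    """Recombine the shares to recover the secret."""
    return (share0 + share1) % MOD

s = 0x2A
r, masked = mask(s)
assert unmask(r, masked) == s
```

Verification approaches like the one described check that every intermediate value the program computes stays statistically independent of the secret in this sense.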
Citations: 0
Multiple-Boundary Clustering and Prioritization to Promote Neural Network Retraining
Weijun Shen, Yanhui Li, Lin Chen, Yuanlei Han, Yuming Zhou, Baowen Xu
With the increasing application of deep learning (DL) models in many safety-critical scenarios, effective and efficient DL testing techniques are much in demand to improve the quality of DL models. One of the major challenges is the data gap between the training data used to construct the models and the testing data used to evaluate them. To bridge the gap, testers aim to collect an effective subset of inputs from the testing contexts, with limited labeling effort, for retraining DL models. To assist in this subset selection, we propose Multiple-Boundary Clustering and Prioritization (MCP), a technique that clusters test samples into the boundary areas of the DL model's multiple decision boundaries and prioritizes selecting samples evenly from all boundary areas, ensuring enough useful samples for reconstructing each boundary. To evaluate MCP, we conduct an extensive empirical study with three popular DL models and 33 simulated testing contexts. The experimental results show that, compared with state-of-the-art baseline methods, MCP performs significantly better on effectiveness, as measured by the improved quality of retrained DL models; on efficiency, MCP also has an advantage in time cost.
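As a sketch of the selection strategy: samples can be grouped by their top-two predicted classes (one group per decision boundary) and drawn round-robin across groups. The priority measure below (ratio of the second-highest to the highest softmax probability) is an assumption for illustration; the paper defines the exact formulation:

```python
from collections import defaultdict
import numpy as np

def mcp_select(probs: np.ndarray, budget: int) -> list:
    """Sketch of MCP-style selection.

    probs  : (n_samples, n_classes) softmax outputs of the DL model
    budget : number of samples to label for retraining
    """
    clusters = defaultdict(list)  # (best, second) class pair -> samples
    for i, p in enumerate(probs):
        order = np.argsort(p)
        best, second = order[-1], order[-2]
        closeness = p[second] / p[best]  # ~1.0 means near the boundary
        clusters[(best, second)].append((closeness, i))
    for samples in clusters.values():
        samples.sort(reverse=True)  # most boundary-like samples first
    selected = []
    while len(selected) < budget and any(clusters.values()):
        for key in list(clusters):  # round-robin over boundary areas
            if clusters[key] and len(selected) < budget:
                selected.append(clusters[key].pop(0)[1])
    return selected
```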
Citations: 33
Demystifying Diehard Android Apps
Hao Zhou, Haoyu Wang, Yajin Zhou, Xiapu Luo, Yutian Tang, Lei Xue, Ting Wang
Smartphone vendors use multiple methods to kill the processes of Android apps to reduce battery consumption. This motivates developers to find ways to extend the liveness of their apps, hence the name diehard apps in this paper. Although there are blogs and articles illustrating methods to achieve this purpose, there has been no systematic research on them. More importantly, little is known about the prevalence of diehard apps in the wild. In this paper, we take a first step toward systematically investigating diehard apps by answering the following research questions. First, why and how can they circumvent the resource-saving mechanisms of Android? Second, how prevalent are they in the wild? In particular, we conduct a semi-automated analysis to illustrate why existing methods to kill app processes can be evaded, and then systematically present 12 diehard methods. After that, we develop a system named DiehardDetector to detect diehard apps at a large scale. Applying DiehardDetector to more than 80k Android apps downloaded from Google Play showed that around 21% of apps adopt various diehard methods. Moreover, our system achieves high precision and recall.
Citations: 11
Just-In-Time Reactive Synthesis
S. Maoz, Ilia Shevrin
Reactive synthesis is an automated procedure to obtain a correct-by-construction reactive system from its temporal logic specification. GR(1) is an expressive assume-guarantee fragment of LTL that enables efficient synthesis and has been recently used in different contexts and application domains. In this work we present just-in-time synthesis (JITS) for GR(1), a novel means to execute synthesized reactive systems. Rather than constructing a controller at synthesis time, we compute next states during system execution, and only when they are required. We prove that JITS does not compromise the correctness of the synthesized system execution. We further show that the basic algorithm can be extended to enable several variants. We have implemented JITS in the Spectra synthesizer. Our evaluation, comparing JITS to existing tools over known benchmark specifications, shows that JITS reduces (1) total synthesis time, (2) the size of the synthesis output, and (3) the loading time for system execution, all while having little to no effect on system execution performance.
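A toy, explicit-state sketch of the execution-time idea may help; the real tool operates on symbolic GR(1) game structures inside Spectra, and the encoding below is hypothetical:

```python
def make_jit_controller(transitions, wins):
    """Return a step function that picks next states on demand.

    transitions : dict mapping (state, env_input) -> list of candidate
                  successor states (hypothetical explicit encoding)
    wins        : set of states in the winning region computed by the
                  GR(1) game solver (assumed given)
    """
    def step(state, env_input):
        # Computed only now, at execution time, for the observed input.
        for succ in transitions[(state, env_input)]:
            if succ in wins:
                return succ
        raise RuntimeError("environment violated its assumptions")
    return step

transitions = {("s0", "req"): ["s2", "s1"],
               ("s1", "idle"): ["s0"]}
wins = {"s0", "s1"}
step = make_jit_controller(transitions, wins)
assert step("s0", "req") == "s1"  # "s2" is skipped: it leaves the winning region
```

The point, as in the paper, is that no controller is materialized at synthesis time; only the successors actually demanded during execution are computed.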
Citations: 11
Predicting Code Context Models for Software Development Tasks
Zhiyuan Wan, G. Murphy, Xin Xia
Code context models consist of source code elements and their relations relevant to a development task. Prior research showed that making code context models explicit in software tools can benefit software development practices, e.g., code navigation and searching. However, little focus has been put on how to proactively form code context models. In this paper, we explore the proactive formation of code context models based on the topological patterns of code elements drawn from the interaction histories of a project. Specifically, we first learn abstract topological patterns based on the stereotype roles of code elements, rather than on specific code elements; we then leverage the learned patterns to predict the code context model for a given task by graph pattern matching. To determine the effectiveness of this approach, we applied it to the interaction histories stored for the Eclipse Mylyn open source project. We found that our approach achieves maximum F-measures of 0.67, 0.33, and 0.21 for 1-step, 2-step, and 3-step predictions, respectively. The most similar approach to ours is Suade, which supports 1-step prediction only. In comparison to this existing work, our approach predicts code context models with significantly higher F-measure (0.57 over 0.23 on average). The results demonstrate the value of integrating historical and structural approaches to form more accurate code context models.
CCS Concepts: • Software and its engineering → Development frameworks and environments; • Human-centered computing → Human computer interaction (HCI).
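As an illustrative reading of the prediction step (not the paper's exact algorithm), 1-step prediction by graph pattern matching over stereotype roles might look like this, with all names hypothetical:

```python
def predict_next(context_edges, patterns, role_of):
    """Predict role-level relations that extend the current context.

    context_edges : set of (element, element) relations in the current
                    code context model
    patterns      : learned abstract patterns, each a set of
                    (role, role) edges over stereotype roles
    role_of       : dict mapping a code element to its stereotype role
    """
    abstract = {(role_of[a], role_of[b]) for a, b in context_edges}
    predicted = set()
    for pattern in patterns:
        if abstract <= pattern:  # the context matches this pattern so far
            predicted |= pattern - abstract  # role edges expected next
    return predicted
```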
Citations: 1
Where Shall We Log? Studying and Suggesting Logging Locations in Code Blocks
Zhenhao Li, T. Chen, Weiyi Shang
Developers write logging statements to generate logs and record system execution behaviors to assist in debugging and software maintenance. However, deciding where to insert logging statements is a crucial yet challenging task. On one hand, logging too little may increase the maintenance difficulty due to missing important system execution information. On the other hand, logging too much may introduce excessive logs that mask the real problems and cause significant performance overhead. Prior studies provide recommendations on logging locations, but such recommendations are only for limited situations (e.g., exception logging) or at a coarse-grained level (e.g., method level). Thus, properly helping developers decide finer-grained logging locations for different situations remains an unsolved challenge. In this paper, we tackle the challenge by first conducting a comprehensive manual study on the characteristics of logging locations in seven open-source systems. We uncover six categories of logging locations and find that developers usually insert logging statements to record execution information in various types of code blocks. Based on the observed patterns, we then propose a deep learning framework to automatically suggest logging locations at the block level. We model the source code at the code block level using the syntactic and semantic information. We find that: 1) our models achieve an average of 80.1% balanced accuracy when suggesting logging locations in blocks; 2) our cross-system logging suggestion results reveal that there might be an implicit logging guideline across systems. Our results show that we may accurately provide finer-grained suggestions on logging locations, and such suggestions may be shared across systems.
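The reported metric, balanced accuracy, is the mean of per-class recall, so the typically rare should-log blocks are not swamped by the majority class. A minimal computation on hypothetical labels:

```python
from sklearn.metrics import balanced_accuracy_score

# Hypothetical labels for six code blocks:
# 1 = block should contain a logging statement, 0 = it should not.
y_true = [1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0]

# Mean of per-class recall: (1/2 for class 1 + 3/4 for class 0) / 2
print(balanced_accuracy_score(y_true, y_pred))  # 0.625
```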
Citations: 36
Code to Comment “Translation”: Data, Metrics, Baselining & Evaluation
David Gros, Hariharan Sezhiyan, Prem Devanbu, Zhou Yu
The relationship of comments to code, and in particular the task of generating useful comments given the code, has long been of interest. The earliest approaches were based on strong syntactic theories of comment structures and relied on textual templates. More recently, researchers have applied deep-learning methods to this task: specifically, trainable generative translation models which are known to work very well for natural language translation (e.g., from German to English). We carefully examine the underlying assumption here: that the task of generating comments sufficiently resembles the task of translating between natural languages, so that similar models and evaluation metrics can be used. We analyze several recent code-comment datasets for this task: CODENN, DEEPCOM, FUNCOM, and Docstring. We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural language translators. We found some interesting differences between the code-comment data and the WMT19 natural language data. Next, we describe and conduct studies to calibrate BLEU (which is commonly used as a measure of comment quality) using “affinity pairs” of methods from different projects, the same project, the same class, etc. Our study suggests that the current performance on some datasets might need to be improved substantially. We also argue that fairly naive information retrieval (IR) methods do well enough at this task to be considered a reasonable baseline. Finally, we make some suggestions on how our findings might be used in future research in this area.
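For readers unfamiliar with the metric being calibrated, here is a minimal BLEU computation with NLTK; the reference/candidate pair is invented, and this shows only the metric, not the paper's affinity-pair protocol:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["returns the maximum value in the list".split()]
candidate = "return the max value of a list".split()

# Smoothing is customary for short texts like comments; single-sentence
# BLEU is noisy, one reason calibration against reference pairs matters.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```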
Citations: 45
Revisiting the Relationship Between Fault Detection, Test Adequacy Criteria, and Test Set Size
Yiqun T. Chen, Rahul Gopinath, Anita Tadakamalla, Michael D. Ernst, Reid Holmes, G. Fraser, P. Ammann, René Just
The research community has long recognized a complex interrelationship between fault detection, test adequacy criteria, and test set size. However, there is substantial confusion about whether and how to experimentally control for test set size when assessing how well an adequacy criterion is correlated with fault detection and when comparing test adequacy criteria. Resolving the confusion, this paper makes the following contributions: (1) A review of contradictory analyses of the relationships between fault detection, test adequacy criteria, and test set size. Specifically, this paper addresses the supposed contradiction of prior work and explains why test set size is neither a confounding variable, as previously suggested, nor an independent variable that should be experimentally manipulated. (2) An explication and discussion of the experimental designs of prior work, together with a discussion of conceptual and statistical problems, as well as specific guidelines for future work. (3) A methodology for comparing test adequacy criteria on an equal basis, which accounts for test set size without directly manipulating it through unrealistic stratification. (4) An empirical evaluation that compares the effectiveness of coverage-based testing, mutation-based testing, and random testing. Additionally, this paper proposes probabilistic coupling, a methodology for assessing the representativeness of a set of test goals for a given fault and for approximating the fault-detection probability of adequate test sets.
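To make the estimated quantity concrete: the fault-detection probability of adequate test sets of a fixed size can be approximated by sampling, as in this naive Monte Carlo sketch (illustrative only; the paper's probabilistic-coupling estimator is different):

```python
import random

def detection_probability(tests, detecting, is_adequate, size, trials=10_000):
    """Estimate P(adequate test set of given size detects the fault).

    tests       : list of test identifiers
    detecting   : set of tests that detect the fault under study
    is_adequate : predicate deciding whether a set of tests satisfies
                  the adequacy criterion (assumed given)
    """
    hits = valid = 0
    for _ in range(trials):
        candidate = set(random.sample(tests, size))
        if not is_adequate(candidate):
            continue  # only adequate sets enter the estimate
        valid += 1
        hits += bool(candidate & detecting)
    return hits / valid if valid else float("nan")
```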
Citations: 39
Team Discussions and Dynamics During DevOps Tool Adoptions in OSS Projects
Likang Yin, V. Filkov
In Open Source Software (OSS) projects, pre-built tools dominate DevOps-oriented pipelines. In practice, a multitude of configuration management, cloud-based continuous integration, and automated deployment tools exist, often more than one for each task. Tools are adopted (and given up) by OSS projects regularly. Prior work has shown that some tool adoptions are preceded by discussions, and that tool adoptions can result in benefits to the project. But important questions remain: how do teams decide to adopt a tool? What is discussed before the adoption, and for how long? And what team characteristics are determinant of the adoption? In this paper, we employ a large-scale empirical study in order to characterize the team discussions and to discern the team-level determinants of tool adoption into OSS projects' development pipelines. Guided by theories of team and individual motivations and dynamics, we perform exploratory data analyses, do deep-dive case studies, and develop regression models to learn the determinants of adoption and discussion length, and the direction of their effect on the adoption. From the commit and comment traces of large-scale GitHub projects, our models find that prior exposure to a tool and member involvement are positively associated with tool adoption, while longer discussions and the number of newer team members associate negatively. These results can provide guidance beyond technical appropriateness for the timeliness of tool adoptions in diverse programmer teams. Our data and code are available at https://github.com/lkyin/tool_adoptions.
Citations: 8
A Machine Learning based Approach to Autogenerate Diagnostic Models for CNC machines
K. Masalimov
This article describes a system for the automatic generation of predictive diagnostic models of CNC machine tools. The system allows machine tool maintenance specialists to select and operate models based on LSTM neural networks to determine the state of elements of CNC machines. Examples of the accuracy achieved by the models in operation are given for determining the state of the cutting tool (more than 95%) and of the electric motor bearings (more than 91%).
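A minimal sketch of such a model, assuming windowed sensor readings as input; the architecture, window size, and class set below are illustrative, not the article's:

```python
import tensorflow as tf

# 128-step windows of 6 sensor channels (e.g., spindle current,
# vibration) mapped to 3 element states (hypothetical classes such as
# tool OK / tool worn / bearing fault).
WINDOW, FEATURES, N_CLASSES = 128, 6, 3

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, FEATURES)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(sensor_windows, state_labels, epochs=20)  # hypothetical data
```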
Citations: 2