The artifact accompanying the paper “Understanding Developers Well-Being and Productivity: a 2-year Longitudinal Analysis during the COVID-19 Pandemic” provides a comprehensive set of tools, data, and scripts that were used in the longitudinal study. Spanning 24 months, from April 2020 to April 2022, the study examines shifts in the well-being, productivity, social contacts, needs, and several other variables of software engineers during the COVID-19 pandemic. The artifact facilitates the reproduction of the study’s findings, offering deeper insight into the systematic changes observed in variables such as well-being, quality of social contacts, and emotional loneliness. By providing access to the evidence-generating mechanisms and the generated data, the artifact ensures transparency and reproducibility and allows researchers to use our rich dataset to test their own research questions. This Replicated Computational Results report details the contents of the artifact, its relevance to the main paper, and guidelines for its effective use.
{"title":"Understanding Developers Well-Being and Productivity: a 2-year Longitudinal Analysis during the COVID-19 Pandemic - RCR Report","authors":"Daniel Russo, Paul H. P. Hanel, Niels van Berkel","doi":"10.1145/3640338","DOIUrl":"https://doi.org/10.1145/3640338","url":null,"abstract":"<p>The artifact accompanying the paper “Understanding Developers Well-Being and Productivity: a 2-year Longitudinal Analysis during the COVID-19 Pandemic” provides a comprehensive set of tools, data, and scripts that were utilized in the longitudinal study. Spanning 24 months, from April 2020 to April 2022, the study delves into the shifts in well-being, productivity, social contacts, needs, and several other variables of software engineers during the COVID-19 pandemic. The artifact facilitates the reproduction of the study’s findings, offering a deeper insight into the systematic changes observed in various variables, such as well-being, quality of social contacts, and emotional loneliness. By providing access to the evidence-generating mechanisms and the generated data, the artifact ensures transparency, reproducibility, and allow researchers to use our rich dataset to test their own research question. This Replicated Computational Results report aims to detail the contents of the artifact, its relevance to the main paper, and guidelines for its effective utilization.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"82 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139464781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ensuring the safety of autonomous vehicles (AVs) is of utmost importance, and testing them in simulated environments is a safer option than conducting in-field operational tests. However, generating an exhaustive test suite to identify critical test scenarios is computationally expensive as the representation of each test is complex and contains various dynamic and static features, such as the AV under test, road participants (vehicles, pedestrians, and static obstacles), environmental factors (weather and light), and the road’s structural features (lanes, turns, road speed, etc.). In this paper, we present a systematic technique that uses Instance Space Analysis (ISA) to identify the significant features of test scenarios that affect their ability to reveal the unsafe behaviour of AVs. ISA identifies the features that best differentiate safety-critical scenarios from normal driving and visualises the impact of these features on test scenario outcomes (safe/unsafe) in 2D. This visualisation helps to identify untested regions of the instance space and provides an indicator of the quality of the test suite in terms of the percentage of feature space covered by testing. To test the predictive ability of the identified features, we train five Machine Learning classifiers to classify test scenarios as safe or unsafe. The high precision, recall, and F1 scores indicate that our proposed approach is effective in predicting the outcome of a test scenario without executing it and can be used for test generation, selection, and prioritisation.
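The classification step lends itself to a brief illustration. The sketch below trains a single classifier to predict scenario outcomes from pre-extracted features and reports precision, recall, and F1; the file name, feature columns, and choice of a random forest are illustrative assumptions (the paper evaluates five classifiers over ISA-selected features), not the authors’ exact setup.

```python
# A minimal sketch, assuming a pre-extracted table of test-scenario features
# with a binary "unsafe" label. "scenarios.csv" and the feature columns are
# hypothetical; the paper trains five ML classifiers on ISA-selected features.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

scenarios = pd.read_csv("scenarios.csv")
X = scenarios.drop(columns=["unsafe"])   # e.g., road curvature, n_pedestrians, weather
y = scenarios["unsafe"]                  # 1 = unsafe outcome, 0 = safe

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test), average="binary")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```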
{"title":"Identifying and Explaining Safety-critical Scenarios for Autonomous Vehicles via Key Features","authors":"Neelofar Neelofar, Aldeida Aleti","doi":"10.1145/3640335","DOIUrl":"https://doi.org/10.1145/3640335","url":null,"abstract":"<p>Ensuring the safety of autonomous vehicles (AVs) is of utmost importance, and testing them in simulated environments is a safer option than conducting in-field operational tests. However, generating an exhaustive test suite to identify critical test scenarios is computationally expensive as the representation of each test is complex and contains various dynamic and static features, such as the AV under test, road participants (vehicles, pedestrians, and static obstacles), environmental factors (weather and light), and the road’s structural features (lanes, turns, road speed, etc.). In this paper, we present a systematic technique that uses <i>Instance Space Analysis (ISA)</i> to identify the significant features of test scenarios that affect their ability to reveal the unsafe behaviour of AVs. ISA identifies the features that best differentiate safety-critical scenarios from normal driving and visualises the impact of these features on test scenario outcomes (safe/unsafe) in 2<i>D</i>. This visualisation helps to identify untested regions of the instance space and provides an indicator of the quality of the test suite in terms of the percentage of feature space covered by testing. To test the predictive ability of the identified features, we train five Machine Learning classifiers to classify test scenarios as safe or unsafe. The high precision, recall, and F1 scores indicate that our proposed approach is effective in predicting the outcome of a test scenario without executing it and can be used for test generation, selection, and prioritisation.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"45 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139464741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anshunkang Zhou, Yikun Hu, Xiangzhe Xu, Charles Zhang
Binary code similarity analysis is extremely useful since it provides rich information about an unknown binary, such as revealing its functionality and identifying reused libraries. Robust binary similarity analysis is challenging as heavy compiler optimizations can make semantically similar binaries have gigantic syntactic differences. Unfortunately, existing semantic-based methods still suffer from either incomplete coverage or low accuracy.
In this paper, we propose ARCTURUS, a new technique that can achieve high code coverage and high accuracy simultaneously by manipulating program execution under the guidance of code reachability. Our key insight is that the compiler must preserve program semantics (e.g., dependences between code fragments) during compilation; therefore, the code reachability, which implies the interdependence between code, is invariant across code transformations. Based on the above insight, our key idea is to leverage the stability of code reachability to manipulate the program execution such that deep code logic can also be covered in a consistent way. Experimental results show that ARCTURUS achieves an average precision of 87.8% with 100% block coverage, outperforming compared methods by 38.4% on average. ARCTURUS takes only 0.15 seconds to process one function on average, indicating that it is efficient for practical use.
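To make the comparison idea tangible, here is a toy sketch of the final similarity step only, under heavy assumptions: each function’s emulated behaviour is reduced to a multiset of observed output values, and similarity is a multiset Jaccard score. The trace format and the measure are illustrative stand-ins, not ARCTURUS’s reachability-guided emulation itself.

```python
# A toy sketch: compare two binary functions by the value traces collected
# while emulating them to full block coverage. The trace representation and
# the Jaccard measure are assumptions for illustration, not ARCTURUS itself.
from collections import Counter

def trace_similarity(trace_a: list[int], trace_b: list[int]) -> float:
    """Multiset Jaccard similarity between two emulated value traces."""
    ca, cb = Counter(trace_a), Counter(trace_b)
    inter = sum((ca & cb).values())   # values observed in both traces
    union = sum((ca | cb).values())
    return inter / union if union else 1.0

# The same source function compiled at -O0 vs. -O3 should keep similar traces
# despite large syntactic differences between the binaries:
print(trace_similarity([1, 2, 2, 7], [1, 2, 7, 7]))  # 0.6
```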
{"title":"ARCTURUS: Full Coverage Binary Similarity Analysis with Reachability-Guided Emulation","authors":"Anshunkang Zhou, Yikun Hu, Xiangzhe Xu, Charles Zhang","doi":"10.1145/3640337","DOIUrl":"https://doi.org/10.1145/3640337","url":null,"abstract":"<p>Binary code similarity analysis is extremely useful since it provides rich information about an unknown binary, such as revealing its functionality and identifying reused libraries. Robust binary similarity analysis is challenging as heavy compiler optimizations can make semantically similar binaries have gigantic syntactic differences. Unfortunately, existing semantic-based methods still suffer from either incomplete coverage or low accuracy. </p><p>In this paper, we propose <span>ARCTURUS</span>, a new technique that can achieve high code coverage and high accuracy simultaneously by manipulating program execution under the guidance of code reachability. Our key insight is that the compiler must preserve program semantics (e.g., dependences between code fragments) during compilation; therefore, the code reachability, which implies the interdependence between code, is invariant across code transformations. Based on the above insight, our key idea is to leverage the stability of code reachability to manipulate the program execution such that deep code logic can also be covered in a consistent way. Experimental results show that <span>ARCTURUS</span> achieves an average precision of 87.8% with 100% block coverage, outperforming compared methods by 38.4% on average. <span>ARCTURUS</span> takes only 0.15 seconds to process one function on average, indicating that it is efficient for practical use.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"20 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139464736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Widely used compilers like GCC and LLVM usually have hundreds of optimizations controlled by optimization flags, which are enabled or disabled during compilation to improve the runtime performance (e.g., shorter execution time) of the compiled program. Due to the large number of optimization flags and their combinations, it is difficult for compiler users to manually tune compiler optimization flags. In the literature, a number of autotuning techniques have been proposed, which tune optimization flags for a compiled program by comparing its actual runtime performance under different optimization flag combinations. Due to the huge search space and the heavy cost of actual runs, these techniques suffer from a widely recognized efficiency problem. To reduce the heavy runtime cost, in this paper we propose a lightweight learning approach that uses a small amount of actual runtime performance data to predict the runtime performance of a compiled program under various optimization flag combinations. Furthermore, to reduce the search space, we design a novel particle swarm algorithm that tunes compiler optimization flags with the prediction model. To evaluate the performance of the proposed approach, CompTuner, we conduct an extensive experimental study on two popular C compilers, GCC and LLVM, with two widely used benchmarks, cBench and PolyBench. The experimental results show that CompTuner significantly outperforms the six compared techniques, including the state-of-the-art technique BOCA.
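The two ingredients (a learned runtime predictor and a swarm-style flag search) can be sketched briefly. The code below is a simplified, velocity-free stand-in under stated assumptions: the flag count, swarm parameters, and measure_runtime() are hypothetical, and the search mutates particles toward the best-known flag vector rather than implementing CompTuner’s actual algorithm.

```python
# A minimal sketch, assuming: (1) a model learned from a small sample of real
# runs predicts runtime from a 0/1 flag vector, and (2) a simplified binary
# swarm search queries the model instead of doing costly real compilations.
import random
from sklearn.ensemble import RandomForestRegressor

N_FLAGS = 20

def measure_runtime(flags):               # hypothetical: compile and run once
    return sum(flags) + random.random()   # stand-in for a real measurement

# Phase 1: lightweight prediction model from a few actual measurements.
sample = [[random.randint(0, 1) for _ in range(N_FLAGS)] for _ in range(50)]
model = RandomForestRegressor(random_state=0).fit(
    sample, [measure_runtime(f) for f in sample])

# Phase 2: swarm-style search guided by predicted (not measured) runtime.
swarm = [[random.randint(0, 1) for _ in range(N_FLAGS)] for _ in range(20)]
best = min(swarm, key=lambda f: model.predict([f])[0]).copy()
for _ in range(30):
    for particle in swarm:
        for i in range(N_FLAGS):          # flip bits, biased toward the best
            if random.random() < 0.1:
                particle[i] = best[i] if random.random() < 0.7 else 1 - particle[i]
    candidate = min(swarm, key=lambda f: model.predict([f])[0])
    if model.predict([candidate])[0] < model.predict([best])[0]:
        best = candidate.copy()
print("predicted-best flag vector:", best)
```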
{"title":"Compiler Autotuning through Multiple Phase Learning","authors":"Mingxuan Zhu, Dan Hao, Junjie Chen","doi":"10.1145/3640330","DOIUrl":"https://doi.org/10.1145/3640330","url":null,"abstract":"<p>Widely used compilers like GCC and LLVM usually have hundreds of optimizations controlled by optimization flags, which are enabled or disabled during compilation to improve runtime performance (e.g., small execution time) of the compiler program. Due to the large number of optimization flags and their combination, it is difficult for compiler users to manually tune compiler optimization flags. In the literature, a number of autotuning techniques have been proposed, which tune optimization flags for a compiled program by comparing its actual runtime performance with different optimization flag combination. Due to the huge search space and heavy actual runtime cost, these techniques suffer from the widely-recognized efficiency problem. To reduce the heavy runtime cost, in this paper we propose a lightweight learning approach which uses a small number of actual runtime performance data to predict the runtime performance of a compiled program with various optimization flag combinations. Furthermore, to reduce the search space, we design a novel particle swarm algorithm which tunes compiler optimization flags with the prediction model. To evaluate the performance of the proposed approach CompTuner, we conduct an extensive experimental study on two popular C compilers GCC and LLVM with two widely used benchmarks cBench and PolyBench. The experimental results show that CompTuner significantly outperforms the six compared techniques, including the state-of-art technique BOCA.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"1 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139464740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Testing autonomous driving systems for safety and reliability is essential, yet complex. A primary challenge is identifying relevant test scenarios, especially the critical ones that may expose hazards or harm to autonomous vehicles and other road users. Although numerous approaches and tools for critical scenario identification have been proposed, industry practices for selecting and implementing these approaches, and their limitations, are not well understood. Therefore, we aim to explore practical aspects of how autonomous driving systems are tested, particularly the identification and use of critical scenarios. We interviewed 13 practitioners from 7 companies working on autonomous driving in Sweden. We used thematic modeling to analyse and synthesize the interview data. As a result, we present 9 themes of practices and 4 themes of challenges related to critical scenarios. Our analysis indicates that there is little joint effort in the industry, even though every approach has its own limitations and tools and platforms are lacking. To that end, we recommend that industry and academia combine different approaches, collaborate among different stakeholders, and continuously learn the field. The contributions of our study are an exploration and synthesis of industry practices and related challenges for critical scenario identification and testing, and a potential increase in industry relevance for future studies.
{"title":"Industry Practices for Challenging Autonomous Driving Systems with Critical Scenarios","authors":"Qunying Song, Emelie Engström, Per Runeson","doi":"10.1145/3640334","DOIUrl":"https://doi.org/10.1145/3640334","url":null,"abstract":"<p>Testing autonomous driving systems for safety and reliability is essential, yet complex. A primary challenge is identifying relevant test scenarios, especially the critical ones that may expose hazards or harm to autonomous vehicles and other road users. Although numerous approaches and tools for critical scenario identification are proposed, the industry practices for selection, implementation, and limitations of approaches, are not well understood. Therefore, we aim to explore practical aspects of how autonomous driving systems are tested, particularly the identification and use of critical scenarios. We interviewed 13 practitioners from 7 companies in autonomous driving in Sweden. We used thematic modeling to analyse and synthesize the interview data. As a result, we present 9 themes of practices and 4 themes of challenges related to critical scenarios. Our analysis indicates there is little joint effort in the industry, despite every approach has its own limitations, and tools and platforms are lacking. To that end, we recommend the industry and academia combine different approaches, collaborate among different stakeholders, and continuously learn the field. The contributions of our study are exploration and synthesis of industry practices and related challenges for critical scenario identification and testing, and potential increase of industry relevance for future studies.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"30 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139464957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning (DL) frameworks have become the cornerstone of the rapidly developing DL field. Through installation dependencies specified in the distribution metadata, numerous packages directly or transitively depend on DL frameworks, layer after layer, forming DL package supply chains (SCs), which are critical for DL frameworks to remain competitive. However, vital knowledge on how to nurture and sustain DL package SCs is still lacking. Acquiring this knowledge may help DL frameworks formulate effective measures to strengthen their SCs to remain competitive and shed light on dependency issues and practices in the DL SC for researchers and practitioners. In this paper, we explore the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs to bridge this knowledge gap. We analyze the metadata of nearly six million PyPI package distributions and construct version-sensitive SCs for two popular DL frameworks: TensorFlow and PyTorch. We find that popular packages (measured by the number of monthly downloads) in the two SCs cover 34 domains belonging to eight categories. The Applications, Infrastructure, and Sciences categories account for over 85% of popular packages in either SC, and the TensorFlow and PyTorch SCs have developed specializations in Infrastructure and Applications packages, respectively. We employ the Leiden community detection algorithm and detect 131 and 100 clusters in the two SCs. The clusters mainly exhibit four shapes: Arrow, Star, Tree, and Forest, with increasing dependency complexity. Most clusters are Arrow or Star, while Tree and Forest clusters account for most packages (TensorFlow SC: 70.7%, PyTorch SC: 92.9%). We identify three groups of reasons why packages disengage from the SC (i.e., remove the DL framework and its dependents from their installation dependencies): dependency issues, functional improvements, and ease of installation. The most common reason in the TensorFlow SC is dependency incompatibility, and in the PyTorch SC it is to simplify functionalities and reduce installation size. Our study provides rich implications for DL framework vendors, researchers, and practitioners on the maintenance and dependency management practices of PyPI DL SCs.
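The cluster-detection step can be illustrated compactly. The sketch below runs Leiden on a toy dependency edge list using the python-igraph and leidenalg packages; the edges are invented examples, and edge direction is ignored for clustering, which simplifies the paper’s version-sensitive SC construction.

```python
# A minimal sketch of the clustering step, assuming a dependency edge list has
# already been extracted from PyPI metadata. The edges below are illustrative.
import igraph as ig
import leidenalg  # pip install python-igraph leidenalg

edges = [("keras", "tensorflow"), ("tfx", "tensorflow"),
         ("torchvision", "torch"), ("pytorch-lightning", "torch")]

g = ig.Graph.TupleList(edges, directed=False)  # package <-> its dependency
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
for cluster_id, members in enumerate(partition):
    print(cluster_id, [g.vs[v]["name"] for v in members])
```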
{"title":"Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement","authors":"Kai Gao, Runzhi He, Bing Xie, Minghui Zhou","doi":"10.1145/3640336","DOIUrl":"https://doi.org/10.1145/3640336","url":null,"abstract":"<p>Deep learning (DL) frameworks have become the cornerstone of the rapidly developing DL field. Through installation dependencies specified in the distribution metadata, numerous packages directly or transitively depend on DL frameworks, layer after layer, forming DL package supply chains (SCs), which are critical for DL frameworks to remain competitive. However, vital knowledge on how to nurture and sustain DL package SCs is still lacking. Achieving this knowledge may help DL frameworks formulate effective measures to strengthen their SCs to remain competitive and shed light on dependency issues and practices in the DL SC for researchers and practitioners. In this paper, we explore the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs to bridge this knowledge gap. We analyze the metadata of nearly six million PyPI package distributions and construct version-sensitive SCs for two popular DL frameworks: TensorFlow and PyTorch. We find that popular packages (measured by the number of monthly downloads) in the two SCs cover 34 domains belonging to eight categories. <i>Applications</i>, <i>Infrastructure</i>, and <i>Sciences</i> categories account for over 85% of popular packages in either SC and TensorFlow and PyTorch SC have developed specializations on <i>Infrastructure</i> and <i>Applications</i> packages respectively. We employ the Leiden community detection algorithm and detect 131 and 100 clusters in the two SCs. The clusters mainly exhibit four shapes: Arrow, Star, Tree, and Forest with increasing dependency complexity. Most clusters are Arrow or Star, while Tree and Forest clusters account for most packages (Tensorflow SC: 70.7%, PyTorch SC: 92.9%). We identify three groups of reasons why packages disengage from the SC (i.e., remove the DL framework and its dependents from their installation dependencies): dependency issues, functional improvements, and ease of installation. The most common reason in TensorFlow SC is dependency incompatibility and in PyTorch SC is to simplify functionalities and reduce installation size. Our study provides rich implications for DL framework vendors, researchers, and practitioners on the maintenance and dependency management practices of PyPI DL SCs.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"264 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139413140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quanjun Zhang, Juan Zhai, Chunrong Fang, Jiawei Liu, Weisong Sun, Haichuan Hu, Qingyu Wang
Machine translation systems have been widely adopted in our daily life, making life easier and more convenient. Unfortunately, erroneous translations may result in severe consequences, such as financial losses. This makes it necessary to improve the accuracy and reliability of machine translation systems. However, it is challenging to test machine translation systems because of the complexity and intractability of the underlying neural models. To tackle these challenges, we propose a novel metamorphic testing approach based on syntactic tree pruning (STP) to validate machine translation systems. Our key insight is that a pruned sentence should have similar crucial semantics compared with the original sentence. Specifically, STP (1) proposes a core semantics-preserving pruning strategy based on basic sentence structures and dependency relations at the level of the syntactic tree representation; (2) generates source sentence pairs based on the metamorphic relation; and (3) reports suspicious issues whose translations break the consistency property, using a bag-of-words model. We further evaluate STP on two state-of-the-art machine translation systems (i.e., Google Translate and Bing Microsoft Translator) with 1,200 source sentences as inputs. The results show that STP accurately finds 5,073 unique erroneous translations in Google Translate and 5,100 unique erroneous translations in Bing Microsoft Translator (400% more than state-of-the-art techniques), with 64.5% and 65.4% precision, respectively. The reported erroneous translations vary in type, and more than 90% of them cannot be found by state-of-the-art techniques. There are 9,393 erroneous translations unique to STP, which is 711.9% more than state-of-the-art techniques. Moreover, STP is quite effective in detecting translation errors for the original sentences, with a recall reaching 74.0%, improving on state-of-the-art techniques by 55.1% on average.
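The consistency check at the heart of the approach can be sketched in a few lines. Below, translate() is a hypothetical stand-in for a call to Google Translate or Bing Microsoft Translator, the containment threshold is an invented parameter, and the sentence pair is assumed to have been produced by the pruning strategy; this is not STP’s exact pipeline.

```python
# A toy sketch of the metamorphic check: the pruned sentence's translation
# should be largely contained, word for word, in the original's translation.
from collections import Counter

def translate(sentence: str) -> str:
    raise NotImplementedError("stand-in for a real translation API call")

def inconsistent(original: str, pruned: str, threshold: float = 0.8) -> bool:
    """Flag a suspicious issue when bag-of-words containment drops too low."""
    t_orig = Counter(translate(original).lower().split())
    t_pruned = Counter(translate(pruned).lower().split())
    overlap = sum((t_orig & t_pruned).values())
    return overlap / max(sum(t_pruned.values()), 1) < threshold

# e.g., inconsistent("if it rains, the match is cancelled",
#                    "the match is cancelled")  # pruned conditional clause
```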
{"title":"Machine Translation Testing via Syntactic Tree Pruning","authors":"Quanjun Zhang, Juan Zhai, Chunrong Fang, Jiawei Liu, Weisong Sun, Haichuan Hu, Qingyu Wang","doi":"10.1145/3640329","DOIUrl":"https://doi.org/10.1145/3640329","url":null,"abstract":"Machine translation systems have been widely adopted in our daily life, making life easier and more convenient. Unfortunately, erroneous translations may result in severe consequences, such as financial losses. This requires to improve the accuracy and the reliability of machine translation systems. However, it is challenging to test machine translation systems because of the complexity and intractability of the underlying neural models. To tackle these challenges, we propose a novel metamorphic testing approach by syntactic tree pruning (STP) to validate machine translation systems. Our key insight is that a pruned sentence should have similar crucial semantics compared with the original sentence. Specifically, STP (1) proposes a core semantics-preserving pruning strategy by basic sentence structures and dependency relations on the level of syntactic tree representation; (2) generates source sentence pairs based on the metamorphic relation; (3) reports suspicious issues whose translations break the consistency property by a bag-of-words model. We further evaluate STP on two state-of-the-art machine translation systems (i.e., Google Translate and Bing Microsoft Translator) with 1,200 source sentences as inputs. The results show that STP accurately finds 5,073 unique erroneous translations in Google Translate and 5,100 unique erroneous translations in Bing Microsoft Translator (400% more than state-of-the-art techniques), with 64.5% and 65.4% precision, respectively. The reported erroneous translations vary in types and more than 90% of them found by state-of-the-art techniques. There are 9,393 erroneous translations unique to STP, which is 711.9% more than state-of-the-art techniques. Moreover, STP is quite effective in detecting translation errors for the original sentences with a recall reaching 74.0%, improving state-of-the-art techniques by 55.1% on average.","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"1 10","pages":""},"PeriodicalIF":4.4,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139457522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The COVID-19 pandemic has brought significant and enduring shifts in various aspects of life, including increased flexibility in work arrangements. In a longitudinal study spanning 24 months, with six measurement points from April 2020 to April 2022, we explore changes in the well-being, productivity, social contacts, and needs of software engineers during this time. Our findings indicate systematic changes in various variables. For example, well-being and quality of social contacts increased while emotional loneliness decreased as lockdown measures were relaxed. Conversely, people’s boredom and productivity remained stable. Furthermore, a preliminary investigation into the future of work at the end of the pandemic revealed a consensus among developers in favor of hybrid work arrangements. We also discovered that prior job changes and low job satisfaction were consistently linked to intentions to change jobs if current work conditions do not meet developers’ needs. This highlights the need for software organizations to adapt to various work arrangements to remain competitive employers. Building upon our findings and the existing literature, we introduce the Integrated Job Demands-Resources and Self-Determination (IJARS) Model as a comprehensive framework to explain the well-being and productivity of software engineers during the COVID-19 pandemic.
{"title":"Understanding Developers Well-Being and Productivity: a 2-year Longitudinal Analysis during the COVID-19 Pandemic","authors":"Daniel Russo, Paul H. P. Hanel, Niels van Berkel","doi":"10.1145/3638244","DOIUrl":"https://doi.org/10.1145/3638244","url":null,"abstract":"<p>The COVID-19 pandemic has brought significant and enduring shifts in various aspects of life, including increased flexibility in work arrangements. In a longitudinal study, spanning 24 months with six measurement points from April 2020 to April 2022, we explore changes in well-being, productivity, social contacts, and needs of software engineers during this time. Our findings indicate systematic changes in various variables. For example, well-being and quality of social contacts increased while emotional loneliness decreased as lockdown measures were relaxed. Conversely, people’s boredom and productivity, remained stable. Furthermore, a preliminary investigation into the future of work at the end of the pandemic revealed a consensus among developers for a preference of hybrid work arrangements. We also discovered that prior job changes and low job satisfaction were consistently linked to intentions to change jobs if current work conditions do not meet developers’ needs. This highlights the need for software organizations to adapt to various work arrangements to remain competitive employers. Building upon our findings and the existing literature, we introduce the Integrated Job Demands-Resources and Self-Determination (IJARS) Model as a comprehensive framework to explain the well-being and productivity of software engineers during the COVID-19 pandemic.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"22 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139028006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The emergence of foundation models, such as the large language model (LLM) GPT-4 and the text-to-image model DALL-E, has opened up numerous possibilities across various domains. People can now use natural language (i.e., prompts) to communicate with AI to perform tasks. While people can use foundation models through chatbots (e.g., ChatGPT), chat, regardless of the capabilities of the underlying models, is not a production tool for building reusable AI services. APIs like LangChain allow for LLM-based application development but require substantial programming knowledge, thus posing a barrier. To mitigate this, we systematically review, summarise, refine, and extend the concept of the AI chain by incorporating the best principles and practices that have been accumulated in software engineering for decades into AI chain engineering, to systematize the AI chain engineering methodology. We also develop a no-code integrated development environment, Prompt Sapper, which embodies these AI chain engineering principles and patterns naturally in the process of building AI chains, thereby improving the performance and quality of AI chains. With Prompt Sapper, AI chain engineers can compose prompt-based AI services on top of foundation models through chat-based requirement analysis and visual programming. Our user study evaluated and demonstrated the efficiency and correctness of Prompt Sapper.
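As a concept check, an “AI chain” is simply prompt-based steps composed into a reusable service. The toy sketch below wires two such steps together; llm() is a hypothetical stand-in for any foundation-model API call, and Prompt Sapper composes such chains through chat and visual programming rather than handwritten code.

```python
# A toy sketch of a two-step AI chain; llm() is a hypothetical stand-in for a
# foundation-model call (e.g., GPT-4), not Prompt Sapper's actual API.
def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real LLM API call")

def summarise(text: str) -> str:
    return llm(f"Summarise the following text in one sentence:\n{text}")

def translate_to_french(text: str) -> str:
    return llm(f"Translate the following text into French:\n{text}")

def french_summary(text: str) -> str:
    # the chain: output of one prompt-based step feeds the next
    return translate_to_french(summarise(text))
```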
{"title":"Prompt Sapper: A LLM-Empowered Production Tool for Building AI Chains","authors":"Yu Cheng, Jieshan Chen, Qing Huang, Zhenchang Xing, Xiwei Xu, Qinghua Lu","doi":"10.1145/3638247","DOIUrl":"https://doi.org/10.1145/3638247","url":null,"abstract":"<p>The emergence of foundation models, such as large language models (LLMs) GPT-4 and text-to-image models DALL-E, has opened up numerous possibilities across various domains. People can now use natural language (i.e. prompts) to communicate with AI to perform tasks. While people can use foundation models through chatbots (e.g., ChatGPT), chat, regardless of the capabilities of the underlying models, is not a production tool for building reusable AI services. APIs like LangChain allow for LLM-based application development but require substantial programming knowledge, thus posing a barrier. To mitigate this, we systematically review, summarise, refine and extend the concept of AI chain by incorporating the best principles and practices that have been accumulated in software engineering for decades into AI chain engineering, to systematize AI chain engineering methodology. We also develop a no-code integrated development environment, Prompt Sapper\u0000, which embodies these AI chain engineering principles and patterns naturally in the process of building AI chains, thereby improving the performance and quality of AI chains. With Prompt Sapper, AI chain engineers can compose prompt-based AI services on top of foundation models through chat-based requirement analysis and visual programming. Our user study evaluated and demonstrated the efficiency and correctness of Prompt Sapper.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"63 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138823904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bentley James Oakes, Michalis Famelis, Houari Sahraoui
Domain experts are increasingly employing machine learning to solve their domain-specific problems. This article presents to software engineering researchers the six key challenges that a domain expert faces in addressing their problem with a computational workflow, and the underlying executable implementation. These challenges arise out of our conceptual framework, which presents the “route” of transformations that a domain expert may choose to take while developing their solution.
To ground our conceptual framework in the state-of-the-practice, this article discusses a selection of available textual and graphical workflow systems and their support for the transformations described in our framework. Example studies from the literature in various domains are also examined to highlight the tools used by the domain experts as well as a classification of the domain-specificity and machine learning usage of their problem, workflow, and implementation.
The state-of-the-practice informs our discussion of the six key challenges, where we identify which challenges and transformations are not sufficiently addressed by available tools. We also suggest possible research directions for software engineering researchers to increase the automation of these tools and disseminate best-practice techniques between software engineering and various scientific domains.
{"title":"Building Domain-Specific Machine Learning Workflows: A Conceptual Framework for the State-of-the-Practice","authors":"Bentley James Oakes, Michalis Famelis, Houari Sahraoui","doi":"10.1145/3638243","DOIUrl":"https://doi.org/10.1145/3638243","url":null,"abstract":"<p>Domain experts are increasingly employing machine learning to solve their domain-specific problems. This article presents to software engineering researchers the six key challenges that a domain expert faces in addressing their problem with a computational workflow, and the underlying executable implementation. These challenges arise out of our conceptual framework which presents the “route” of transformations that a domain expert may choose to take while developing their solution. </p><p>To ground our conceptual framework in the state-of-the-practice, this article discusses a selection of available textual and graphical workflow systems and their support for the transformations described in our framework. Example studies from the literature in various domains are also examined to highlight the tools used by the domain experts as well as a classification of the domain-specificity and machine learning usage of their problem, workflow, and implementation. </p><p>The state-of-the-practice informs our discussion of the six key challenges, where we identify which challenges and transformations are not sufficiently addressed by available tools. We also suggest possible research directions for software engineering researchers to increase the automation of these tools and disseminate best-practice techniques between software engineering and various scientific domains.</p>","PeriodicalId":50933,"journal":{"name":"ACM Transactions on Software Engineering and Methodology","volume":"2017 1","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138824279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}