Amila Indika, Christopher Lee, Haochen Wang, Justin Lisoway, Anthony Peruma, Rick Kazman
The proliferation of mobile applications (apps) has made it crucial to ensure their accessibility for users with disabilities. However, there is a lack of research on the real-world challenges developers face in implementing mobile accessibility features. This study presents a large-scale empirical analysis of accessibility discussions on Stack Overflow to identify the trends and challenges Android and iOS developers face. We examine the growth patterns, characteristics, and common topics mobile developers discuss. Our results show several challenges, including integrating assistive technologies like screen readers, ensuring accessible UI design, supporting text-to-speech across languages, handling complex gestures, and conducting accessibility testing. We envision our findings driving improvements in developer practices, research directions, tool support, and educational resources.
{"title":"Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions","authors":"Amila Indika, Christopher Lee, Haochen Wang, Justin Lisoway, Anthony Peruma, Rick Kazman","doi":"arxiv-2409.07945","DOIUrl":"https://doi.org/arxiv-2409.07945","url":null,"abstract":"The proliferation of mobile applications (apps) has made it crucial to ensure\u0000their accessibility for users with disabilities. However, there is a lack of\u0000research on the real-world challenges developers face in implementing mobile\u0000accessibility features. This study presents a large-scale empirical analysis of\u0000accessibility discussions on Stack Overflow to identify the trends and\u0000challenges Android and iOS developers face. We examine the growth patterns,\u0000characteristics, and common topics mobile developers discuss. Our results show\u0000several challenges, including integrating assistive technologies like screen\u0000readers, ensuring accessible UI design, supporting text-to-speech across\u0000languages, handling complex gestures, and conducting accessibility testing. We\u0000envision our findings driving improvements in developer practices, research\u0000directions, tool support, and educational resources.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UI automation tests play a crucial role in ensuring the quality of mobile applications. Despite the growing popularity of machine learning techniques for generating these tests, they still face several challenges, such as mismatches of UI elements. Recent advances in Large Language Models (LLMs) have addressed these issues by leveraging their semantic understanding capabilities. However, a significant gap remains in applying these models to industrial-level app testing, particularly in terms of cost optimization and knowledge limitations. To address this, we introduce CAT, which creates cost-effective UI automation tests for industrial apps by combining machine learning and LLMs with best practices. Given a task description, CAT employs Retrieval Augmented Generation (RAG) to source examples of industrial app usage as the few-shot learning context, assisting LLMs in generating the specific sequence of actions. CAT then employs machine learning techniques, with LLMs serving as a complementary optimizer, to map the target element on the UI screen. Our evaluations on the WeChat testing dataset demonstrate CAT's performance and cost-effectiveness: it achieves 90% UI automation at a cost of $0.34, outperforming the state of the art. We have also integrated our approach into the real-world WeChat testing platform, demonstrating its usefulness in detecting 141 bugs and enhancing developers' testing processes.
{"title":"Enabling Cost-Effective UI Automation Testing with Retrieval-Based LLMs: A Case Study in WeChat","authors":"Sidong Feng, Haochuan Lu, Jianqin Jiang, Ting Xiong, Likun Huang, Yinglin Liang, Xiaoqin Li, Yuetang Deng, Aldeida Aleti","doi":"arxiv-2409.07829","DOIUrl":"https://doi.org/arxiv-2409.07829","url":null,"abstract":"UI automation tests play a crucial role in ensuring the quality of mobile\u0000applications. Despite the growing popularity of machine learning techniques to\u0000generate these tests, they still face several challenges, such as the mismatch\u0000of UI elements. The recent advances in Large Language Models (LLMs) have\u0000addressed these issues by leveraging their semantic understanding capabilities.\u0000However, a significant gap remains in applying these models to industrial-level\u0000app testing, particularly in terms of cost optimization and knowledge\u0000limitation. To address this, we introduce CAT to create cost-effective UI\u0000automation tests for industry apps by combining machine learning and LLMs with\u0000best practices. Given the task description, CAT employs Retrieval Augmented\u0000Generation (RAG) to source examples of industrial app usage as the few-shot\u0000learning context, assisting LLMs in generating the specific sequence of\u0000actions. CAT then employs machine learning techniques, with LLMs serving as a\u0000complementary optimizer, to map the target element on the UI screen. Our\u0000evaluations on the WeChat testing dataset demonstrate the CAT's performance and\u0000cost-effectiveness, achieving 90% UI automation with $0.34 cost, outperforming\u0000the state-of-the-art. We have also integrated our approach into the real-world\u0000WeChat testing platform, demonstrating its usefulness in detecting 141 bugs and\u0000enhancing the developers' testing process.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine/deep learning models have been widely adopted for predicting the configuration performance of software systems. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose a model-agnostic and sparsity-robust framework for predicting configuration performance, dubbed DaL, based on the new paradigm of dividable learning that builds a model via "divide-and-learn". To handle sample sparsity, the samples from the configuration landscape are divided into distant divisions, and for each division we build a sparse local model, e.g., a regularized Hierarchical Interaction Neural Network, to deal with feature sparsity. A newly given configuration is then assigned to the right division's model for the final prediction. Further, DaL adaptively determines the optimal number of divisions required for a system and sample size without any extra training or profiling. Experimental results from 12 real-world systems and five sets of training data reveal that, compared with state-of-the-art approaches, DaL performs no worse than the best counterpart in 44 out of 60 cases, with up to a 1.61x improvement in accuracy; requires fewer samples to reach the same or better accuracy; and incurs acceptable training overhead. In particular, the mechanism that adapts the parameter d reaches the optimal value in 76.43% of the individual runs. The results also confirm that the paradigm of dividable learning is more suitable than similar paradigms such as ensemble learning for predicting configuration performance. Practically, DaL considerably improves different global models when they are used as the underlying local models, which further strengthens its flexibility.
{"title":"Dividable Configuration Performance Learning","authors":"Jingzhi Gong, Tao Chen, Rami Bahsoon","doi":"arxiv-2409.07629","DOIUrl":"https://doi.org/arxiv-2409.07629","url":null,"abstract":"Machine/deep learning models have been widely adopted for predicting the\u0000configuration performance of software systems. However, a crucial yet\u0000unaddressed challenge is how to cater for the sparsity inherited from the\u0000configuration landscape: the influence of configuration options (features) and\u0000the distribution of data samples are highly sparse. In this paper, we propose a\u0000model-agnostic and sparsity-robust framework for predicting configuration\u0000performance, dubbed DaL, based on the new paradigm of dividable learning that\u0000builds a model via \"divide-and-learn\". To handle sample sparsity, the samples\u0000from the configuration landscape are divided into distant divisions, for each\u0000of which we build a sparse local model, e.g., regularized Hierarchical\u0000Interaction Neural Network, to deal with the feature sparsity. A newly given\u0000configuration would then be assigned to the right model of division for the\u0000final prediction. Further, DaL adaptively determines the optimal number of\u0000divisions required for a system and sample size without any extra training or\u0000profiling. Experiment results from 12 real-world systems and five sets of\u0000training data reveal that, compared with the state-of-the-art approaches, DaL\u0000performs no worse than the best counterpart on 44 out of 60 cases with up to\u00001.61x improvement on accuracy; requires fewer samples to reach the same/better\u0000accuracy; and producing acceptable training overhead. In particular, the\u0000mechanism that adapted the parameter d can reach the optimal value for 76.43%\u0000of the individual runs. The result also confirms that the paradigm of dividable\u0000learning is more suitable than other similar paradigms such as ensemble\u0000learning for predicting configuration performance. Practically, DaL\u0000considerably improves different global models when using them as the underlying\u0000local models, which further strengthens its flexibility.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Laura Pomponio, Maximiliano Cristiá, Estanislao Ruiz Sorazábal, Maximiliano García
We present the design of the microcontroller-unit software of a weeding robot, based on the Process Control architectural style and design patterns. The design consists of 133 modules resulting from applying 8 design patterns to a total of 30 problems. As a result, the design yields more reusable components and an easily modifiable and extensible program. Design documentation is also presented. Finally, the implementation (12 KLOC of C++ code) is empirically evaluated to show that the design does not lead to an inefficient implementation.
{"title":"Reusability and Modifiability in Robotics Software (Extended Version)","authors":"Laura Pomponio, Maximiliano Cristiá, Estanislao Ruiz Sorazábal, Maximiliano García","doi":"arxiv-2409.07228","DOIUrl":"https://doi.org/arxiv-2409.07228","url":null,"abstract":"We show the design of the software of the microcontroller unit of a weeding\u0000robot based on the Process Control architectural style and design patterns. The\u0000design consists of 133 modules resulting from using 8 design patterns for a\u0000total of 30 problems. As a result the design yields more reusable components\u0000and an easily modifiable and extensible program. Design documentation is also\u0000presented. Finally, the implementation (12 KLOC of C++ code) is empirically\u0000evaluated to prove that the design does not produce an inefficient\u0000implementation.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"113 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot
Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios. This illustrates the challenge of the task and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.
{"title":"SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories","authors":"Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot","doi":"arxiv-2409.07440","DOIUrl":"https://doi.org/arxiv-2409.07440","url":null,"abstract":"Given that Large Language Models (LLMs) have made significant progress in\u0000writing code, can they now be used to autonomously reproduce results from\u0000research repositories? Such a capability would be a boon to the research\u0000community, helping researchers validate, understand, and extend prior work. To\u0000advance towards this goal, we introduce SUPER, the first benchmark designed to\u0000evaluate the capability of LLMs in setting up and executing tasks from research\u0000repositories. SUPERaims to capture the realistic challenges faced by\u0000researchers working with Machine Learning (ML) and Natural Language Processing\u0000(NLP) research repositories. Our benchmark comprises three distinct problem\u0000sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems\u0000derived from the expert set that focus on specific challenges (e.g.,\u0000configuring a trainer), and 602 automatically generated problems for\u0000larger-scale development. We introduce various evaluation measures to assess\u0000both task success and progress, utilizing gold solutions when available or\u0000approximations otherwise. We show that state-of-the-art approaches struggle to\u0000solve these problems with the best model (GPT-4o) solving only 16.3% of the\u0000end-to-end set, and 46.1% of the scenarios. This illustrates the challenge of\u0000this task, and suggests that SUPER can serve as a valuable resource for the\u0000community to make and measure progress.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analyzing user reviews for sentiment towards app features can provide valuable insights into users' perceptions of app functionality and their evolving needs. Given the volume of user reviews received daily, an automated mechanism to generate feature-level sentiment summaries of user reviews is needed. Recent advances in Large Language Models (LLMs) such as ChatGPT have shown impressive performance on several new tasks without updating the model's parameters, i.e., using zero or only a few labeled examples. Despite these advancements, LLMs' capability to perform feature-specific sentiment analysis of user reviews remains unexplored. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, for extracting app features and associated sentiments under 0-shot, 1-shot, and 5-shot scenarios. Results indicate that the best-performing GPT-4 model outperforms rule-based approaches by 23.6% in F1-score with zero-shot feature extraction, and 5-shot prompting improves this by a further 6%. GPT-4 achieves a 74% F1-score for predicting positive sentiment towards correctly predicted app features, with 5-shot prompting improving it by 7%. Our study suggests that LLMs are promising for generating feature-specific sentiment summaries of user reviews.
{"title":"A Fine-grained Sentiment Analysis of App Reviews using Large Language Models: An Evaluation Study","authors":"Faiz Ali Shah, Ahmed Sabir, Rajesh Sharma","doi":"arxiv-2409.07162","DOIUrl":"https://doi.org/arxiv-2409.07162","url":null,"abstract":"Analyzing user reviews for sentiment towards app features can provide\u0000valuable insights into users' perceptions of app functionality and their\u0000evolving needs. Given the volume of user reviews received daily, an automated\u0000mechanism to generate feature-level sentiment summaries of user reviews is\u0000needed. Recent advances in Large Language Models (LLMs) such as ChatGPT have\u0000shown impressive performance on several new tasks without updating the model's\u0000parameters i.e. using zero or a few labeled examples. Despite these\u0000advancements, LLMs' capabilities to perform feature-specific sentiment analysis\u0000of user reviews remain unexplored. This study compares the performance of\u0000state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, for\u0000extracting app features and associated sentiments under 0-shot, 1-shot, and\u00005-shot scenarios. Results indicate the best-performing GPT-4 model outperforms\u0000rule-based approaches by 23.6% in f1-score with zero-shot feature extraction;\u00005-shot further improving it by 6%. GPT-4 achieves a 74% f1-score for predicting\u0000positive sentiment towards correctly predicted app features, with 5-shot\u0000enhancing it by 7%. Our study suggests that LLM models are promising for\u0000generating feature-specific sentiment summaries of user reviews.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to the substantial number of enrollments in programming courses, a key challenge is delivering personalized feedback to students. The nature of this feedback varies significantly, contingent on the subject and the chosen evaluation method. However, tailoring current Automated Assessment Tools (AATs) to integrate other program analysis tools is not straightforward. Moreover, AATs usually support only specific programming languages, providing feedback exclusively through dedicated websites based on test suites. This paper introduces GitSEED, a language-agnostic automated assessment tool designed for Programming Education and Software Engineering (SE) and backed by GitLab. The students interact with GitSEED through GitLab. Using GitSEED, students in Computer Science (CS) and SE can master the fundamentals of git while receiving personalized feedback on their programming assignments and projects. Furthermore, faculty members can easily tailor GitSEED's pipeline by integrating various code evaluation tools (e.g., memory leak detection, fault localization, program repair, etc.) to offer personalized feedback that aligns with the needs of each CS/SE course. Our experiments assess GitSEED's efficacy via comprehensive user evaluation, examining the impact of feedback mechanisms and features on student learning outcomes. Findings reveal positive correlations between GitSEED usage and student engagement.
{"title":"GitSEED: A Git-backed Automated Assessment Tool for Software Engineering and Programming Education","authors":"Pedro Orvalho, Mikoláš Janota, Vasco Manquinho","doi":"arxiv-2409.07362","DOIUrl":"https://doi.org/arxiv-2409.07362","url":null,"abstract":"Due to the substantial number of enrollments in programming courses, a key\u0000challenge is delivering personalized feedback to students. The nature of this\u0000feedback varies significantly, contingent on the subject and the chosen\u0000evaluation method. However, tailoring current Automated Assessment Tools (AATs)\u0000to integrate other program analysis tools is not straightforward. Moreover,\u0000AATs usually support only specific programming languages, providing feedback\u0000exclusively through dedicated websites based on test suites. This paper introduces GitSEED, a language-agnostic automated assessment tool\u0000designed for Programming Education and Software Engineering (SE) and backed by\u0000GitLab. The students interact with GitSEED through GitLab. Using GitSEED,\u0000students in Computer Science (CS) and SE can master the fundamentals of git\u0000while receiving personalized feedback on their programming assignments and\u0000projects. Furthermore, faculty members can easily tailor GitSEED's pipeline by\u0000integrating various code evaluation tools (e.g., memory leak detection, fault\u0000localization, program repair, etc.) to offer personalized feedback that aligns\u0000with the needs of each CS/SE course. Our experiments assess GitSEED's efficacy\u0000via comprehensive user evaluation, examining the impact of feedback mechanisms\u0000and features on student learning outcomes. Findings reveal positive\u0000correlations between GitSEED usage and student engagement.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Umm-e- Habiba, Markus Haug, Justus Bogner, Stefan Wagner
Artificial intelligence (AI) permeates all fields of life, which has resulted in new challenges in requirements engineering for artificial intelligence (RE4AI), e.g., the difficulty of specifying and validating requirements for AI or of considering new quality requirements due to emerging ethical implications. It is currently unclear whether existing RE methods are sufficient or whether new ones are needed to address these challenges. Therefore, our goal is to provide a comprehensive overview of RE4AI to researchers and practitioners: what has been achieved so far, i.e., what practices are available, and what research gaps and challenges still need to be addressed? To achieve this, we conducted a systematic mapping study combining query-string search and extensive snowballing. The extracted data was aggregated, and results were synthesized using thematic analysis. Our selection process led to the inclusion of 126 primary studies. Existing RE4AI research focuses mainly on requirements analysis and elicitation, with most practices applied in these areas. Furthermore, we identified requirements specification, explainability, and the gap between machine learning engineers and end-users as the most prevalent challenges, along with a few others. Additionally, we proposed seven potential research directions to address these challenges. Practitioners can use our results to identify and select suitable RE methods for working on their AI-based systems, while researchers can build on the identified gaps and research directions to push the field forward.
{"title":"How Mature is Requirements Engineering for AI-based Systems? A Systematic Mapping Study on Practices, Challenges, and Future Research Directions","authors":"Umm-e- Habiba, Markus Haug, Justus Bogner, Stefan Wagner","doi":"arxiv-2409.07192","DOIUrl":"https://doi.org/arxiv-2409.07192","url":null,"abstract":"Artificial intelligence (AI) permeates all fields of life, which resulted in\u0000new challenges in requirements engineering for artificial intelligence (RE4AI),\u0000e.g., the difficulty in specifying and validating requirements for AI or\u0000considering new quality requirements due to emerging ethical implications. It\u0000is currently unclear if existing RE methods are sufficient or if new ones are\u0000needed to address these challenges. Therefore, our goal is to provide a\u0000comprehensive overview of RE4AI to researchers and practitioners. What has been\u0000achieved so far, i.e., what practices are available, and what research gaps and\u0000challenges still need to be addressed? To achieve this, we conducted a\u0000systematic mapping study combining query string search and extensive\u0000snowballing. The extracted data was aggregated, and results were synthesized\u0000using thematic analysis. Our selection process led to the inclusion of 126\u0000primary studies. Existing RE4AI research focuses mainly on requirements\u0000analysis and elicitation, with most practices applied in these areas.\u0000Furthermore, we identified requirements specification, explainability, and the\u0000gap between machine learning engineers and end-users as the most prevalent\u0000challenges, along with a few others. Additionally, we proposed seven potential\u0000research directions to address these challenges. Practitioners can use our\u0000results to identify and select suitable RE methods for working on their\u0000AI-based systems, while researchers can build on the identified gaps and\u0000research directions to push the field forward.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"235 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oleksandr Kosenkov, Michael Unterkalmsteiner, Daniel Mendez, Jannik Fischbach
Context: Regulations, such as the European Accessibility Act (EAA), impact the engineering of software products and services. Managing that impact while providing meaningful inputs to development teams is one of the emerging requirements engineering (RE) challenges.
Problem: Enterprises conduct Regulatory Impact Analysis (RIA) to consider the effects of regulations on the software products they offer and to formulate requirements at an enterprise level. Despite its practical relevance, we are unaware of any studies on this large-scale regulatory RE process.
Methodology: We conducted an exploratory interview study of RIA in three large enterprises. We focused on how they conduct RIA, emphasizing cross-functional interactions and using the EAA as an example.
Results: RIA, as a regulatory RE process, is conducted to address the needs of executive management and central functions. It involves coordination between different functions and levels of the enterprise hierarchy. Enterprises use artifacts to support interpretation and communication of the results of RIA. Challenges to RIA are mainly related to executing such coordination and managing the knowledge involved.
Conclusion: RIA in large enterprises demands close coordination of multiple stakeholders and roles. Applying interpretation and compliance artifacts is one approach to support such coordination. However, there are no established practices for creating and managing such artifacts.
{"title":"Regulatory Requirements Engineering in Large Enterprises: An Interview Study on the European Accessibility Act","authors":"Oleksandr Kosenkov, Michael Unterkalmsteiner, Daniel Mendez, Jannik Fischbach","doi":"arxiv-2409.07313","DOIUrl":"https://doi.org/arxiv-2409.07313","url":null,"abstract":"Context: Regulations, such as the European Accessibility Act (EAA), impact\u0000the engineering of software products and services. Managing that impact while\u0000providing meaningful inputs to development teams is one of the emerging\u0000requirements engineering (RE) challenges. Problem: Enterprises conduct Regulatory Impact Analysis (RIA) to consider the\u0000effects of regulations on software products offered and formulate requirements\u0000at an enterprise level. Despite its practical relevance, we are unaware of any\u0000studies on this large-scale regulatory RE process. Methodology: We conducted an exploratory interview study of RIA in three\u0000large enterprises. We focused on how they conduct RIA, emphasizing\u0000cross-functional interactions, and using the EAA as an example. Results: RIA, as a regulatory RE process, is conducted to address the needs\u0000of executive management and central functions. It involves coordination between\u0000different functions and levels of enterprise hierarchy. Enterprises use\u0000artifacts to support interpretation and communication of the results of RIA.\u0000Challenges to RIA are mainly related to the execution of such coordination and\u0000managing the knowledge involved. Conclusion: RIA in large enterprises demands close coordination of multiple\u0000stakeholders and roles. Applying interpretation and compliance artifacts is one\u0000approach to support such coordination. However, there are no established\u0000practices for creating and managing such artifacts.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexey Vasilev, Anna Volodkevich, Denis Kulandin, Tatiana Bysheva, Anton Klenitskiy
Using a single tool to build and compare recommender systems significantly reduces the time to market for new models. In addition, the comparison results obtained with such tools are more consistent. This is why many tools and libraries for researchers in the field of recommender systems have recently appeared. Unfortunately, most of these frameworks are aimed primarily at researchers and require modification for use in production, due to their inability to work on large datasets or an unsuitable architecture. In this demo, we present our open-source toolkit RePlay, a framework containing an end-to-end pipeline for building recommender systems that is ready for production use. RePlay also allows you to use a suitable stack for each pipeline stage: Pandas, Polars, or Spark. This allows the library to scale computations and deploy to a cluster. Thus, RePlay allows data scientists to easily move from research mode to production mode using the same interfaces.
{"title":"RePlay: a Recommendation Framework for Experimentation and Production Use","authors":"Alexey Vasilev, Anna Volodkevich, Denis Kulandin, Tatiana Bysheva, Anton Klenitskiy","doi":"arxiv-2409.07272","DOIUrl":"https://doi.org/arxiv-2409.07272","url":null,"abstract":"Using a single tool to build and compare recommender systems significantly\u0000reduces the time to market for new models. In addition, the comparison results\u0000when using such tools look more consistent. This is why many different tools\u0000and libraries for researchers in the field of recommendations have recently\u0000appeared. Unfortunately, most of these frameworks are aimed primarily at\u0000researchers and require modification for use in production due to the inability\u0000to work on large datasets or an inappropriate architecture. In this demo, we\u0000present our open-source toolkit RePlay - a framework containing an end-to-end\u0000pipeline for building recommender systems, which is ready for production use.\u0000RePlay also allows you to use a suitable stack for the pipeline on each stage:\u0000Pandas, Polars, or Spark. This allows the library to scale computations and\u0000deploy to a cluster. Thus, RePlay allows data scientists to easily move from\u0000research mode to production mode using the same interfaces.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"92 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}