Amila Indika, Christopher Lee, Haochen Wang, Justin Lisoway, Anthony Peruma, Rick Kazman
The proliferation of mobile applications (apps) has made it crucial to ensure their accessibility for users with disabilities. However, there is a lack of research on the real-world challenges developers face in implementing mobile accessibility features. This study presents a large-scale empirical analysis of accessibility discussions on Stack Overflow to identify the trends and challenges Android and iOS developers face. We examine the growth patterns, characteristics, and common topics mobile developers discuss. Our results show several challenges, including integrating assistive technologies like screen readers, ensuring accessible UI design, supporting text-to-speech across languages, handling complex gestures, and conducting accessibility testing. We envision our findings driving improvements in developer practices, research directions, tool support, and educational resources.
{"title":"Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions","authors":"Amila Indika, Christopher Lee, Haochen Wang, Justin Lisoway, Anthony Peruma, Rick Kazman","doi":"arxiv-2409.07945","DOIUrl":"https://doi.org/arxiv-2409.07945","url":null,"abstract":"The proliferation of mobile applications (apps) has made it crucial to ensure\u0000their accessibility for users with disabilities. However, there is a lack of\u0000research on the real-world challenges developers face in implementing mobile\u0000accessibility features. This study presents a large-scale empirical analysis of\u0000accessibility discussions on Stack Overflow to identify the trends and\u0000challenges Android and iOS developers face. We examine the growth patterns,\u0000characteristics, and common topics mobile developers discuss. Our results show\u0000several challenges, including integrating assistive technologies like screen\u0000readers, ensuring accessible UI design, supporting text-to-speech across\u0000languages, handling complex gestures, and conducting accessibility testing. We\u0000envision our findings driving improvements in developer practices, research\u0000directions, tool support, and educational resources.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UI automation tests play a crucial role in ensuring the quality of mobile applications. Despite the growing popularity of machine learning techniques for generating these tests, they still face several challenges, such as mismatches of UI elements. Recent advances in Large Language Models (LLMs) have addressed these issues by leveraging their semantic understanding capabilities. However, a significant gap remains in applying these models to industrial-level app testing, particularly in terms of cost optimization and knowledge limitations. To address this, we introduce CAT, which creates cost-effective UI automation tests for industrial apps by combining machine learning and LLMs with best practices. Given a task description, CAT employs Retrieval Augmented Generation (RAG) to source examples of industrial app usage as the few-shot learning context, assisting LLMs in generating the specific sequence of actions. CAT then employs machine learning techniques, with LLMs serving as a complementary optimizer, to map the target element on the UI screen. Our evaluations on the WeChat testing dataset demonstrate CAT's performance and cost-effectiveness: it achieves 90% UI automation at a cost of $0.34, outperforming the state of the art. We have also integrated our approach into the real-world WeChat testing platform, demonstrating its usefulness in detecting 141 bugs and enhancing developers' testing processes.
{"title":"Enabling Cost-Effective UI Automation Testing with Retrieval-Based LLMs: A Case Study in WeChat","authors":"Sidong Feng, Haochuan Lu, Jianqin Jiang, Ting Xiong, Likun Huang, Yinglin Liang, Xiaoqin Li, Yuetang Deng, Aldeida Aleti","doi":"arxiv-2409.07829","DOIUrl":"https://doi.org/arxiv-2409.07829","url":null,"abstract":"UI automation tests play a crucial role in ensuring the quality of mobile\u0000applications. Despite the growing popularity of machine learning techniques to\u0000generate these tests, they still face several challenges, such as the mismatch\u0000of UI elements. The recent advances in Large Language Models (LLMs) have\u0000addressed these issues by leveraging their semantic understanding capabilities.\u0000However, a significant gap remains in applying these models to industrial-level\u0000app testing, particularly in terms of cost optimization and knowledge\u0000limitation. To address this, we introduce CAT to create cost-effective UI\u0000automation tests for industry apps by combining machine learning and LLMs with\u0000best practices. Given the task description, CAT employs Retrieval Augmented\u0000Generation (RAG) to source examples of industrial app usage as the few-shot\u0000learning context, assisting LLMs in generating the specific sequence of\u0000actions. CAT then employs machine learning techniques, with LLMs serving as a\u0000complementary optimizer, to map the target element on the UI screen. Our\u0000evaluations on the WeChat testing dataset demonstrate the CAT's performance and\u0000cost-effectiveness, achieving 90% UI automation with $0.34 cost, outperforming\u0000the state-of-the-art. We have also integrated our approach into the real-world\u0000WeChat testing platform, demonstrating its usefulness in detecting 141 bugs and\u0000enhancing the developers' testing process.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine/deep learning models have been widely adopted for predicting the configuration performance of software systems. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose a model-agnostic and sparsity-robust framework for predicting configuration performance, dubbed DaL, based on the new paradigm of dividable learning that builds a model via "divide-and-learn". To handle sample sparsity, the samples from the configuration landscape are divided into distant divisions, and for each division we build a sparse local model, e.g., a regularized Hierarchical Interaction Neural Network, to deal with feature sparsity. A newly given configuration is then assigned to the right division's model for the final prediction. Further, DaL adaptively determines the optimal number of divisions required for a system and sample size without any extra training or profiling. Experimental results from 12 real-world systems and five sets of training data reveal that, compared with state-of-the-art approaches, DaL performs no worse than the best counterpart in 44 out of 60 cases, with up to a 1.61x improvement in accuracy; requires fewer samples to reach the same or better accuracy; and incurs acceptable training overhead. In particular, the mechanism that adapts the parameter d reaches the optimal value in 76.43% of the individual runs. The results also confirm that the paradigm of dividable learning is more suitable than similar paradigms such as ensemble learning for predicting configuration performance. Practically, DaL considerably improves different global models when they are used as the underlying local models, which further strengthens its flexibility.
{"title":"Dividable Configuration Performance Learning","authors":"Jingzhi Gong, Tao Chen, Rami Bahsoon","doi":"arxiv-2409.07629","DOIUrl":"https://doi.org/arxiv-2409.07629","url":null,"abstract":"Machine/deep learning models have been widely adopted for predicting the\u0000configuration performance of software systems. However, a crucial yet\u0000unaddressed challenge is how to cater for the sparsity inherited from the\u0000configuration landscape: the influence of configuration options (features) and\u0000the distribution of data samples are highly sparse. In this paper, we propose a\u0000model-agnostic and sparsity-robust framework for predicting configuration\u0000performance, dubbed DaL, based on the new paradigm of dividable learning that\u0000builds a model via \"divide-and-learn\". To handle sample sparsity, the samples\u0000from the configuration landscape are divided into distant divisions, for each\u0000of which we build a sparse local model, e.g., regularized Hierarchical\u0000Interaction Neural Network, to deal with the feature sparsity. A newly given\u0000configuration would then be assigned to the right model of division for the\u0000final prediction. Further, DaL adaptively determines the optimal number of\u0000divisions required for a system and sample size without any extra training or\u0000profiling. Experiment results from 12 real-world systems and five sets of\u0000training data reveal that, compared with the state-of-the-art approaches, DaL\u0000performs no worse than the best counterpart on 44 out of 60 cases with up to\u00001.61x improvement on accuracy; requires fewer samples to reach the same/better\u0000accuracy; and producing acceptable training overhead. In particular, the\u0000mechanism that adapted the parameter d can reach the optimal value for 76.43%\u0000of the individual runs. The result also confirms that the paradigm of dividable\u0000learning is more suitable than other similar paradigms such as ensemble\u0000learning for predicting configuration performance. Practically, DaL\u0000considerably improves different global models when using them as the underlying\u0000local models, which further strengthens its flexibility.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Laura Pomponio, Maximiliano Cristiá, Estanislao Ruiz Sorazábal, Maximiliano García
We present the design of the microcontroller-unit software of a weeding robot, based on the Process Control architectural style and design patterns. The design consists of 133 modules resulting from applying 8 design patterns to a total of 30 problems. As a result, the design yields more reusable components and an easily modifiable and extensible program. Design documentation is also presented. Finally, the implementation (12 KLOC of C++ code) is empirically evaluated to show that the design does not lead to an inefficient implementation.
{"title":"Reusability and Modifiability in Robotics Software (Extended Version)","authors":"Laura Pomponio, Maximiliano Cristiá, Estanislao Ruiz Sorazábal, Maximiliano García","doi":"arxiv-2409.07228","DOIUrl":"https://doi.org/arxiv-2409.07228","url":null,"abstract":"We show the design of the software of the microcontroller unit of a weeding\u0000robot based on the Process Control architectural style and design patterns. The\u0000design consists of 133 modules resulting from using 8 design patterns for a\u0000total of 30 problems. As a result the design yields more reusable components\u0000and an easily modifiable and extensible program. Design documentation is also\u0000presented. Finally, the implementation (12 KLOC of C++ code) is empirically\u0000evaluated to prove that the design does not produce an inefficient\u0000implementation.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"113 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot
Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios. This illustrates the challenge of the task and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.
{"title":"SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories","authors":"Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot","doi":"arxiv-2409.07440","DOIUrl":"https://doi.org/arxiv-2409.07440","url":null,"abstract":"Given that Large Language Models (LLMs) have made significant progress in\u0000writing code, can they now be used to autonomously reproduce results from\u0000research repositories? Such a capability would be a boon to the research\u0000community, helping researchers validate, understand, and extend prior work. To\u0000advance towards this goal, we introduce SUPER, the first benchmark designed to\u0000evaluate the capability of LLMs in setting up and executing tasks from research\u0000repositories. SUPERaims to capture the realistic challenges faced by\u0000researchers working with Machine Learning (ML) and Natural Language Processing\u0000(NLP) research repositories. Our benchmark comprises three distinct problem\u0000sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems\u0000derived from the expert set that focus on specific challenges (e.g.,\u0000configuring a trainer), and 602 automatically generated problems for\u0000larger-scale development. We introduce various evaluation measures to assess\u0000both task success and progress, utilizing gold solutions when available or\u0000approximations otherwise. We show that state-of-the-art approaches struggle to\u0000solve these problems with the best model (GPT-4o) solving only 16.3% of the\u0000end-to-end set, and 46.1% of the scenarios. This illustrates the challenge of\u0000this task, and suggests that SUPER can serve as a valuable resource for the\u0000community to make and measure progress.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analyzing user reviews for sentiment towards app features can provide valuable insights into users' perceptions of app functionality and their evolving needs. Given the volume of user reviews received daily, an automated mechanism to generate feature-level sentiment summaries of user reviews is needed. Recent advances in Large Language Models (LLMs) such as ChatGPT have shown impressive performance on several new tasks without updating the model's parameters, i.e., using zero or only a few labeled examples. Despite these advancements, LLMs' capability to perform feature-specific sentiment analysis of user reviews remains unexplored. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, for extracting app features and associated sentiments under 0-shot, 1-shot, and 5-shot scenarios. Results indicate that the best-performing GPT-4 model outperforms rule-based approaches by 23.6% in F1-score with zero-shot feature extraction, and 5-shot prompting improves this by a further 6%. GPT-4 achieves a 74% F1-score for predicting positive sentiment towards correctly predicted app features, with 5-shot prompting improving it by 7%. Our study suggests that LLMs are promising for generating feature-specific sentiment summaries of user reviews.
{"title":"A Fine-grained Sentiment Analysis of App Reviews using Large Language Models: An Evaluation Study","authors":"Faiz Ali Shah, Ahmed Sabir, Rajesh Sharma","doi":"arxiv-2409.07162","DOIUrl":"https://doi.org/arxiv-2409.07162","url":null,"abstract":"Analyzing user reviews for sentiment towards app features can provide\u0000valuable insights into users' perceptions of app functionality and their\u0000evolving needs. Given the volume of user reviews received daily, an automated\u0000mechanism to generate feature-level sentiment summaries of user reviews is\u0000needed. Recent advances in Large Language Models (LLMs) such as ChatGPT have\u0000shown impressive performance on several new tasks without updating the model's\u0000parameters i.e. using zero or a few labeled examples. Despite these\u0000advancements, LLMs' capabilities to perform feature-specific sentiment analysis\u0000of user reviews remain unexplored. This study compares the performance of\u0000state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, for\u0000extracting app features and associated sentiments under 0-shot, 1-shot, and\u00005-shot scenarios. Results indicate the best-performing GPT-4 model outperforms\u0000rule-based approaches by 23.6% in f1-score with zero-shot feature extraction;\u00005-shot further improving it by 6%. GPT-4 achieves a 74% f1-score for predicting\u0000positive sentiment towards correctly predicted app features, with 5-shot\u0000enhancing it by 7%. Our study suggests that LLM models are promising for\u0000generating feature-specific sentiment summaries of user reviews.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to the substantial number of enrollments in programming courses, a key challenge is delivering personalized feedback to students. The nature of this feedback varies significantly, contingent on the subject and the chosen evaluation method. However, tailoring current Automated Assessment Tools (AATs) to integrate other program analysis tools is not straightforward. Moreover, AATs usually support only specific programming languages, providing feedback exclusively through dedicated websites based on test suites. This paper introduces GitSEED, a language-agnostic automated assessment tool designed for Programming Education and Software Engineering (SE) and backed by GitLab. The students interact with GitSEED through GitLab. Using GitSEED, students in Computer Science (CS) and SE can master the fundamentals of git while receiving personalized feedback on their programming assignments and projects. Furthermore, faculty members can easily tailor GitSEED's pipeline by integrating various code evaluation tools (e.g., memory leak detection, fault localization, program repair, etc.) to offer personalized feedback that aligns with the needs of each CS/SE course. Our experiments assess GitSEED's efficacy via comprehensive user evaluation, examining the impact of feedback mechanisms and features on student learning outcomes. Findings reveal positive correlations between GitSEED usage and student engagement.
{"title":"GitSEED: A Git-backed Automated Assessment Tool for Software Engineering and Programming Education","authors":"Pedro Orvalho, Mikoláš Janota, Vasco Manquinho","doi":"arxiv-2409.07362","DOIUrl":"https://doi.org/arxiv-2409.07362","url":null,"abstract":"Due to the substantial number of enrollments in programming courses, a key\u0000challenge is delivering personalized feedback to students. The nature of this\u0000feedback varies significantly, contingent on the subject and the chosen\u0000evaluation method. However, tailoring current Automated Assessment Tools (AATs)\u0000to integrate other program analysis tools is not straightforward. Moreover,\u0000AATs usually support only specific programming languages, providing feedback\u0000exclusively through dedicated websites based on test suites. This paper introduces GitSEED, a language-agnostic automated assessment tool\u0000designed for Programming Education and Software Engineering (SE) and backed by\u0000GitLab. The students interact with GitSEED through GitLab. Using GitSEED,\u0000students in Computer Science (CS) and SE can master the fundamentals of git\u0000while receiving personalized feedback on their programming assignments and\u0000projects. Furthermore, faculty members can easily tailor GitSEED's pipeline by\u0000integrating various code evaluation tools (e.g., memory leak detection, fault\u0000localization, program repair, etc.) to offer personalized feedback that aligns\u0000with the needs of each CS/SE course. Our experiments assess GitSEED's efficacy\u0000via comprehensive user evaluation, examining the impact of feedback mechanisms\u0000and features on student learning outcomes. Findings reveal positive\u0000correlations between GitSEED usage and student engagement.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Umm-e- Habiba, Markus Haug, Justus Bogner, Stefan Wagner
Artificial intelligence (AI) permeates all fields of life, which has resulted in new challenges in requirements engineering for artificial intelligence (RE4AI), e.g., the difficulty of specifying and validating requirements for AI or of considering new quality requirements due to emerging ethical implications. It is currently unclear whether existing RE methods are sufficient or whether new ones are needed to address these challenges. Therefore, our goal is to provide a comprehensive overview of RE4AI to researchers and practitioners: what has been achieved so far, i.e., what practices are available, and what research gaps and challenges still need to be addressed? To achieve this, we conducted a systematic mapping study combining query-string search and extensive snowballing. The extracted data was aggregated, and results were synthesized using thematic analysis. Our selection process led to the inclusion of 126 primary studies. Existing RE4AI research focuses mainly on requirements analysis and elicitation, with most practices applied in these areas. Furthermore, we identified requirements specification, explainability, and the gap between machine learning engineers and end-users as the most prevalent challenges, along with a few others. Additionally, we proposed seven potential research directions to address these challenges. Practitioners can use our results to identify and select suitable RE methods for working on their AI-based systems, while researchers can build on the identified gaps and research directions to push the field forward.
{"title":"How Mature is Requirements Engineering for AI-based Systems? A Systematic Mapping Study on Practices, Challenges, and Future Research Directions","authors":"Umm-e- Habiba, Markus Haug, Justus Bogner, Stefan Wagner","doi":"arxiv-2409.07192","DOIUrl":"https://doi.org/arxiv-2409.07192","url":null,"abstract":"Artificial intelligence (AI) permeates all fields of life, which resulted in\u0000new challenges in requirements engineering for artificial intelligence (RE4AI),\u0000e.g., the difficulty in specifying and validating requirements for AI or\u0000considering new quality requirements due to emerging ethical implications. It\u0000is currently unclear if existing RE methods are sufficient or if new ones are\u0000needed to address these challenges. Therefore, our goal is to provide a\u0000comprehensive overview of RE4AI to researchers and practitioners. What has been\u0000achieved so far, i.e., what practices are available, and what research gaps and\u0000challenges still need to be addressed? To achieve this, we conducted a\u0000systematic mapping study combining query string search and extensive\u0000snowballing. The extracted data was aggregated, and results were synthesized\u0000using thematic analysis. Our selection process led to the inclusion of 126\u0000primary studies. Existing RE4AI research focuses mainly on requirements\u0000analysis and elicitation, with most practices applied in these areas.\u0000Furthermore, we identified requirements specification, explainability, and the\u0000gap between machine learning engineers and end-users as the most prevalent\u0000challenges, along with a few others. Additionally, we proposed seven potential\u0000research directions to address these challenges. Practitioners can use our\u0000results to identify and select suitable RE methods for working on their\u0000AI-based systems, while researchers can build on the identified gaps and\u0000research directions to push the field forward.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"235 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oleksandr Kosenkov, Michael Unterkalmsteiner, Daniel Mendez, Jannik Fischbach
Context: Regulations, such as the European Accessibility Act (EAA), impact the engineering of software products and services. Managing that impact while providing meaningful inputs to development teams is one of the emerging requirements engineering (RE) challenges.
Problem: Enterprises conduct Regulatory Impact Analysis (RIA) to consider the effects of regulations on the software products they offer and to formulate requirements at an enterprise level. Despite its practical relevance, we are unaware of any studies on this large-scale regulatory RE process.
Methodology: We conducted an exploratory interview study of RIA in three large enterprises. We focused on how they conduct RIA, emphasizing cross-functional interactions and using the EAA as an example.
Results: RIA, as a regulatory RE process, is conducted to address the needs of executive management and central functions. It involves coordination between different functions and levels of the enterprise hierarchy. Enterprises use artifacts to support interpretation and communication of the results of RIA. Challenges to RIA are mainly related to executing such coordination and managing the knowledge involved.
Conclusion: RIA in large enterprises demands close coordination of multiple stakeholders and roles. Applying interpretation and compliance artifacts is one approach to support such coordination. However, there are no established practices for creating and managing such artifacts.
{"title":"Regulatory Requirements Engineering in Large Enterprises: An Interview Study on the European Accessibility Act","authors":"Oleksandr Kosenkov, Michael Unterkalmsteiner, Daniel Mendez, Jannik Fischbach","doi":"arxiv-2409.07313","DOIUrl":"https://doi.org/arxiv-2409.07313","url":null,"abstract":"Context: Regulations, such as the European Accessibility Act (EAA), impact\u0000the engineering of software products and services. Managing that impact while\u0000providing meaningful inputs to development teams is one of the emerging\u0000requirements engineering (RE) challenges. Problem: Enterprises conduct Regulatory Impact Analysis (RIA) to consider the\u0000effects of regulations on software products offered and formulate requirements\u0000at an enterprise level. Despite its practical relevance, we are unaware of any\u0000studies on this large-scale regulatory RE process. Methodology: We conducted an exploratory interview study of RIA in three\u0000large enterprises. We focused on how they conduct RIA, emphasizing\u0000cross-functional interactions, and using the EAA as an example. Results: RIA, as a regulatory RE process, is conducted to address the needs\u0000of executive management and central functions. It involves coordination between\u0000different functions and levels of enterprise hierarchy. Enterprises use\u0000artifacts to support interpretation and communication of the results of RIA.\u0000Challenges to RIA are mainly related to the execution of such coordination and\u0000managing the knowledge involved. Conclusion: RIA in large enterprises demands close coordination of multiple\u0000stakeholders and roles. Applying interpretation and compliance artifacts is one\u0000approach to support such coordination. However, there are no established\u0000practices for creating and managing such artifacts.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexey Vasilev, Anna Volodkevich, Denis Kulandin, Tatiana Bysheva, Anton Klenitskiy
Using a single tool to build and compare recommender systems significantly reduces the time to market for new models. In addition, the comparison results obtained with such tools are more consistent. This is why many tools and libraries for researchers in the field of recommender systems have recently appeared. Unfortunately, most of these frameworks are aimed primarily at researchers and require modification for use in production, due to their inability to work on large datasets or an unsuitable architecture. In this demo, we present our open-source toolkit RePlay, a framework containing an end-to-end pipeline for building recommender systems that is ready for production use. RePlay also allows you to use a suitable stack for each pipeline stage: Pandas, Polars, or Spark. This allows the library to scale computations and deploy to a cluster. Thus, RePlay allows data scientists to easily move from research mode to production mode using the same interfaces.
{"title":"RePlay: a Recommendation Framework for Experimentation and Production Use","authors":"Alexey Vasilev, Anna Volodkevich, Denis Kulandin, Tatiana Bysheva, Anton Klenitskiy","doi":"arxiv-2409.07272","DOIUrl":"https://doi.org/arxiv-2409.07272","url":null,"abstract":"Using a single tool to build and compare recommender systems significantly\u0000reduces the time to market for new models. In addition, the comparison results\u0000when using such tools look more consistent. This is why many different tools\u0000and libraries for researchers in the field of recommendations have recently\u0000appeared. Unfortunately, most of these frameworks are aimed primarily at\u0000researchers and require modification for use in production due to the inability\u0000to work on large datasets or an inappropriate architecture. In this demo, we\u0000present our open-source toolkit RePlay - a framework containing an end-to-end\u0000pipeline for building recommender systems, which is ready for production use.\u0000RePlay also allows you to use a suitable stack for the pipeline on each stage:\u0000Pandas, Polars, or Spark. This allows the library to scale computations and\u0000deploy to a cluster. Thus, RePlay allows data scientists to easily move from\u0000research mode to production mode using the same interfaces.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"92 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142222854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}