MLTEing Models: Negotiating, Evaluating, and Documenting Model and System Qualities
Katherine R. Maffey, Kyle Dotterrer, Jennifer Niemann, Iain J. Cruickshank, G. Lewis, Christian Kästner
2023 IEEE/ACM 45th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)
DOI: 10.1109/ICSE-NIER58687.2023.00012 · Published 2023-03-03
Citations: 2
Abstract
Many organizations seek to ensure that machine learning (ML) and artificial intelligence (AI) systems work as intended in production but currently do not have a cohesive methodology in place to do so. To fill this gap, we propose MLTE (Machine Learning Test and Evaluation, colloquially referred to as "melt"), a framework and implementation to evaluate ML models and systems. The framework compiles state-of-the-art evaluation techniques into an organizational process for interdisciplinary teams, including model developers, software engineers, system owners, and other stakeholders. MLTE tooling supports this process by providing a domain-specific language that teams can use to express model requirements, an infrastructure to define, generate, and collect ML evaluation metrics, and the means to communicate results.
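To make the abstract's description of the tooling concrete, the sketch below illustrates the general idea of expressing model requirements in code and checking collected evaluation metrics against them. It is a minimal, hypothetical example: the names Requirement, Spec, and validate, and the metrics used, are illustrative assumptions and are not the actual MLTE API.

```python
# Hypothetical sketch (not the actual MLTE API): a minimal requirement
# specification and metric-collection flow of the kind the abstract describes.
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Requirement:
    """A single model requirement: a named metric and a pass condition."""
    metric: str
    condition: Callable[[float], bool]
    rationale: str = ""


@dataclass
class Spec:
    """A collection of requirements negotiated by the interdisciplinary team."""
    requirements: Dict[str, Requirement] = field(default_factory=dict)

    def add(self, name: str, req: Requirement) -> None:
        self.requirements[name] = req

    def validate(self, measurements: Dict[str, float]) -> Dict[str, bool]:
        """Check each collected measurement against its requirement."""
        return {
            name: req.condition(measurements[name])
            for name, req in self.requirements.items()
            if name in measurements
        }


if __name__ == "__main__":
    # Requirements negotiated up front by developers, engineers, and system owners.
    spec = Spec()
    spec.add("accuracy", Requirement("accuracy", lambda v: v >= 0.90,
                                     "agreed minimum for deployment"))
    spec.add("latency_ms", Requirement("latency_ms", lambda v: v <= 50,
                                       "system owner's latency budget"))

    # Evaluation results collected later are checked against the spec and reported.
    measured = {"accuracy": 0.93, "latency_ms": 41.2}
    print(spec.validate(measured))  # {'accuracy': True, 'latency_ms': True}
```

The point of the sketch is the workflow, not the specific classes: requirements are recorded explicitly before evaluation, and results are validated and communicated against them rather than reported ad hoc.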