荷兰平局：构建二元分类问题的通用基线

IF 0.7 4区数学 Q3 STATISTICS & PROBABILITY Journal of Applied Probability Pub Date : 2024-09-19 DOI:10.1017/jpr.2024.52

Etienne van de Bijl, Jan Klein, Joris Pries, Sandjai Bhulai, Mark Hoogendoorn, Rob van der Mei

{"title":"荷兰平局：构建二元分类问题的通用基线","authors":"Etienne van de Bijl, Jan Klein, Joris Pries, Sandjai Bhulai, Mark Hoogendoorn, Rob van der Mei","doi":"10.1017/jpr.2024.52","DOIUrl":null,"url":null,"abstract":"Novel prediction methods should always be compared to a baseline to determine their performance. Without this frame of reference, the performance score of a model is basically meaningless. What does it mean when a model achieves an <img data-mimesubtype=\"png\" data-type=\"\" src=\"https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20240918134025706-0265:S0021900224000524:S0021900224000524_inline1.png\">$F_1$</img> of 0.8 on a test set? A proper baseline is, therefore, required to evaluate the ‘goodness’ of a performance score. Comparing results with the latest state-of-the-art model is usually insightful. However, being state-of-the-art is dynamic, as newer models are continuously developed. Contrary to an advanced model, it is also possible to use a simple dummy classifier. However, the latter model could be beaten too easily, making the comparison less valuable. Furthermore, most existing baselines are stochastic and need to be computed repeatedly to get a reliable expected performance, which could be computationally expensive. We present a universal baseline method for all binary classification models, named the Dutch Draw (DD). This approach weighs simple classifiers and determines the best classifier to use as a baseline. Theoretically, we derive the DD baseline for many commonly used evaluation measures and show that in most situations it reduces to (almost) always predicting either zero or one. Summarizing, the DD baseline is general, as it is applicable to any binary classification problem; simple, as it can be quickly determined without training or parameter tuning; and informative, as insightful conclusions can be drawn from the results. The DD baseline serves two purposes. First, it is a robust and universal baseline that enables comparisons across research papers. Second, it provides a sanity check during the prediction model’s development process. When a model does not outperform the DD baseline, it is a major warning sign.","PeriodicalId":50256,"journal":{"name":"Journal of Applied Probability","volume":"194 1","pages":""},"PeriodicalIF":0.7000,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"The dutch draw: constructing a universal baseline for binary classification problems\",\"authors\":\"Etienne van de Bijl, Jan Klein, Joris Pries, Sandjai Bhulai, Mark Hoogendoorn, Rob van der Mei\",\"doi\":\"10.1017/jpr.2024.52\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Novel prediction methods should always be compared to a baseline to determine their performance. Without this frame of reference, the performance score of a model is basically meaningless. What does it mean when a model achieves an <img data-mimesubtype=\\\"png\\\" data-type=\\\"\\\" src=\\\"https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20240918134025706-0265:S0021900224000524:S0021900224000524_inline1.png\\\">$F_1$</img> of 0.8 on a test set? A proper baseline is, therefore, required to evaluate the ‘goodness’ of a performance score. Comparing results with the latest state-of-the-art model is usually insightful. However, being state-of-the-art is dynamic, as newer models are continuously developed. Contrary to an advanced model, it is also possible to use a simple dummy classifier. However, the latter model could be beaten too easily, making the comparison less valuable. Furthermore, most existing baselines are stochastic and need to be computed repeatedly to get a reliable expected performance, which could be computationally expensive. We present a universal baseline method for all binary classification models, named the Dutch Draw (DD). This approach weighs simple classifiers and determines the best classifier to use as a baseline. Theoretically, we derive the DD baseline for many commonly used evaluation measures and show that in most situations it reduces to (almost) always predicting either zero or one. Summarizing, the DD baseline is general, as it is applicable to any binary classification problem; simple, as it can be quickly determined without training or parameter tuning; and informative, as insightful conclusions can be drawn from the results. The DD baseline serves two purposes. First, it is a robust and universal baseline that enables comparisons across research papers. Second, it provides a sanity check during the prediction model’s development process. When a model does not outperform the DD baseline, it is a major warning sign.\",\"PeriodicalId\":50256,\"journal\":{\"name\":\"Journal of Applied Probability\",\"volume\":\"194 1\",\"pages\":\"\"},\"PeriodicalIF\":0.7000,\"publicationDate\":\"2024-09-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Applied Probability\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1017/jpr.2024.52\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"STATISTICS & PROBABILITY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Applied Probability","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1017/jpr.2024.52","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}

引用次数: 0

摘要

新的预测方法应始终与基线进行比较，以确定其性能。如果没有这个参照基准，模型的性能得分基本上毫无意义。当一个模型在测试集上的 $F_1$ 达到 0.8 时，这意味着什么？因此，评估性能分数的 "好坏 "需要一个适当的基准。将结果与最新的最先进模型进行比较通常很有意义。然而，最新模型是动态的，因为不断有新的模型被开发出来。与先进模型相反，也可以使用简单的虚拟分类器。但是，后一种模型很容易被打败，从而降低了比较的价值。此外，现有的基线大多是随机的，需要反复计算才能获得可靠的预期性能，这可能会导致计算成本高昂。我们提出了一种适用于所有二元分类模型的通用基线方法，名为 "荷兰平局"（Dutch Draw，DD）。这种方法可对简单分类器进行权衡，并确定用作基线的最佳分类器。从理论上讲，我们为许多常用的评估指标推导出了 DD 基线，并表明在大多数情况下，它（几乎）总是预测 0 或 1。概括地说，DD 基线是通用的，因为它适用于任何二元分类问题；简单的，因为它无需训练或参数调整即可快速确定；信息丰富的，因为可以从结果中得出深刻的结论。DD 基线有两个作用。首先，它是一个稳健、通用的基线，可以对不同的研究论文进行比较。其次，它在预测模型的开发过程中提供了合理性检查。当一个模型的性能没有超过 DD 基线时，这是一个重要的警示信号。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

The dutch draw: constructing a universal baseline for binary classification problems

Novel prediction methods should always be compared to a baseline to determine their performance. Without this frame of reference, the performance score of a model is basically meaningless. What does it mean when a model achieves an $F_1$ of 0.8 on a test set? A proper baseline is, therefore, required to evaluate the ‘goodness’ of a performance score. Comparing results with the latest state-of-the-art model is usually insightful. However, being state-of-the-art is dynamic, as newer models are continuously developed. Contrary to an advanced model, it is also possible to use a simple dummy classifier. However, the latter model could be beaten too easily, making the comparison less valuable. Furthermore, most existing baselines are stochastic and need to be computed repeatedly to get a reliable expected performance, which could be computationally expensive. We present a universal baseline method for all binary classification models, named the Dutch Draw (DD). This approach weighs simple classifiers and determines the best classifier to use as a baseline. Theoretically, we derive the DD baseline for many commonly used evaluation measures and show that in most situations it reduces to (almost) always predicting either zero or one. Summarizing, the DD baseline is general, as it is applicable to any binary classification problem; simple, as it can be quickly determined without training or parameter tuning; and informative, as insightful conclusions can be drawn from the results. The DD baseline serves two purposes. First, it is a robust and universal baseline that enables comparisons across research papers. Second, it provides a sanity check during the prediction model’s development process. When a model does not outperform the DD baseline, it is a major warning sign.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Applied Probability 数学-统计学与概率论

CiteScore

1.50

自引率

10.00%

发文量

审稿时长

6-12 weeks

期刊介绍： Journal of Applied Probability is the oldest journal devoted to the publication of research in the field of applied probability. It is an international journal published by the Applied Probability Trust, and it serves as a companion publication to the Advances in Applied Probability. Its wide audience includes leading researchers across the entire spectrum of applied probability, including biosciences applications, operations research, telecommunications, computer science, engineering, epidemiology, financial mathematics, the physical and social sciences, and any field where stochastic modeling is used. A submission to Applied Probability represents a submission that may, at the Editor-in-Chief’s discretion, appear in either the Journal of Applied Probability or the Advances in Applied Probability. Typically, shorter papers appear in the Journal, with longer contributions appearing in the Advances.