Evaluating the Practical Utility of Confidence-score based Techniques for Unsupervised Open-world Classification

First Workshop on Insights from Negative Results in NLP Pub Date : 1900-01-01 DOI:10.18653/v1/2022.insights-1.3

Sopan Khosla, Rashmi Gangadharaiah

Citations: 4

Abstract

Open-world classification in dialog systems requires models to detect open intents while ensuring the quality of in-domain (ID) intent classification. In this work, we revisit methods that leverage distance-based statistics for unsupervised out-of-domain (OOD) detection. We show that despite their superior performance on threshold-independent metrics such as AUROC on the test set, threshold values chosen based on validation-set performance do not generalize well to the test set, resulting in substantially lower ID or OOD detection accuracy and F1 scores. Our analysis shows that this lack of generalizability can be successfully mitigated by setting aside a hold-out set from the validation data for threshold selection (sometimes achieving relative gains as high as 100%). Extensive experiments on seven benchmark datasets show that this fix puts the performance of these methods on par with, or sometimes even better than, current state-of-the-art OOD detection techniques.
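The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one plausible reading of the fix: part of the validation data is set aside and used for nothing but picking the OOD threshold. The function names (`ood_f1`, `select_threshold`, `split_holdout`), the choice of OOD F1 as the selection criterion, the convention that a higher score means "more in-domain", and the 50% split are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def ood_f1(scores, is_ood, threshold):
    """F1 for the OOD class when scores below `threshold` are flagged as OOD.

    Assumption: a higher confidence score means "more in-domain".
    """
    scores = np.asarray(scores, dtype=float)
    is_ood = np.asarray(is_ood, dtype=bool)
    pred_ood = scores < threshold
    tp = int(np.sum(pred_ood & is_ood))
    fp = int(np.sum(pred_ood & ~is_ood))
    fn = int(np.sum(~pred_ood & is_ood))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def select_threshold(scores, is_ood):
    """Return the candidate threshold with the best OOD F1 on the given split."""
    uniq = np.unique(np.asarray(scores, dtype=float))  # sorted unique scores
    # Candidates are midpoints between consecutive distinct scores.
    candidates = uniq if uniq.size == 1 else (uniq[:-1] + uniq[1:]) / 2
    f1s = [ood_f1(scores, is_ood, t) for t in candidates]
    return float(candidates[int(np.argmax(f1s))])


def split_holdout(n, holdout_frac=0.5, seed=0):
    """Randomly split indices 0..n-1 into (remaining, hold-out) index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_hold = int(round(n * holdout_frac))
    return perm[n_hold:], perm[:n_hold]
```

In this sketch, one would call `split_holdout` on the validation set, use the remaining indices for whatever validation already did (for example model selection), call `select_threshold` on the hold-out scores only, and then apply the resulting threshold unchanged to the test set.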


