Going beyond research datasets: Novel intent discovery in the industry setting

Findings (Sydney (N.S.W.) | Pub Date: 2023-05-09 | DOI: 10.48550/arXiv.2305.05474

Aleksandra Chrabrowa, Tsimur Hadeliya, D. Kajtoch, Robert Mroczkowski, Piotr Rybak

Pages: 895-911

Citations: 0

Abstract



Novel intent discovery automates the process of grouping similar messages (questions) to identify previously unknown intents. However, current research focuses on publicly available datasets, which contain only the question field and differ significantly from real-life datasets. This paper proposes methods to improve the intent discovery pipeline deployed on a large e-commerce platform. We show the benefit of pre-training language models on in-domain data, both self-supervised and with weak supervision. We also devise the best method, which we call Conv, to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks. Combined, our methods fully utilize real-life datasets and give up to a 33 pp (percentage point) performance boost over the state-of-the-art Constrained Deep Adaptive Clustering (CDAC) model applied to the question field only. By comparison, the question-only CDAC model gives at most a 13 pp boost over the naive baseline.
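To make the idea of grouping messages into intents concrete, here is a minimal Python sketch. It is not the authors' pipeline, the Conv method, or CDAC: it assumes a toy bag-of-words representation in place of a pre-trained language model and a greedy cosine-similarity grouping in place of deep clustering. The function names, the threshold of 0.5, and the four example conversations are all hypothetical. It only illustrates how adding the answer field to the clustering input can change which messages end up together.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (word -> count); a stand-in for a language model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_cluster(texts, threshold=0.5):
    """Put each text in the first cluster whose seed is similar enough,
    otherwise open a new cluster. The threshold is an assumed value."""
    seeds, labels = [], []
    for text in texts:
        vec = embed(text)
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(idx)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


# Hypothetical (question, answer) pairs: two about delivery, two about refunds.
conversations = [
    ("where is my order", "parcel is in transit see tracking"),
    ("where is my money", "refund was sent to bank account"),
    ("order has not arrived", "parcel is in transit see tracking"),
    ("money has not arrived", "refund was sent to bank account"),
]

questions_only = [q for q, _ in conversations]
with_answers = [f"{q} {a}" for q, a in conversations]

print(greedy_cluster(questions_only))  # groups by phrasing: [0, 0, 1, 1]
print(greedy_cluster(with_answers))    # groups by topic:    [0, 1, 0, 1]
```

In this toy data the question text alone groups messages by surface phrasing, while concatenating the answer groups them by the underlying topic (delivery versus refund). Simple concatenation is only one possible way to feed both fields to a model and is an assumption here, not a description of how Conv works.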

