超越研究数据集:行业环境中的新意图发现

Aleksandra Chrabrowa, Tsimur Hadeliya, D. Kajtoch, Robert Mroczkowski, Piotr Rybak
{"title":"超越研究数据集:行业环境中的新意图发现","authors":"Aleksandra Chrabrowa, Tsimur Hadeliya, D. Kajtoch, Robert Mroczkowski, Piotr Rybak","doi":"10.48550/arXiv.2305.05474","DOIUrl":null,"url":null,"abstract":"Novel intent discovery automates the process of grouping similar messages (questions) to identify previously unknown intents. However, current research focuses on publicly available datasets which have only the question field and significantly differ from real-life datasets. This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision. We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv. All our methods combined to fully utilize real-life datasets give up to 33pp performance boost over state-of-the-art Constrained Deep Adaptive Clustering (CDAC) model for question only. By comparison CDAC model for the question data only gives only up to 13pp performance boost over the naive baseline.","PeriodicalId":73025,"journal":{"name":"Findings (Sydney (N.S.W.)","volume":"1 1","pages":"895-911"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Going beyond research datasets: Novel intent discovery in the industry setting\",\"authors\":\"Aleksandra Chrabrowa, Tsimur Hadeliya, D. Kajtoch, Robert Mroczkowski, Piotr Rybak\",\"doi\":\"10.48550/arXiv.2305.05474\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Novel intent discovery automates the process of grouping similar messages (questions) to identify previously unknown intents. However, current research focuses on publicly available datasets which have only the question field and significantly differ from real-life datasets. This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision. We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv. All our methods combined to fully utilize real-life datasets give up to 33pp performance boost over state-of-the-art Constrained Deep Adaptive Clustering (CDAC) model for question only. By comparison CDAC model for the question data only gives only up to 13pp performance boost over the naive baseline.\",\"PeriodicalId\":73025,\"journal\":{\"name\":\"Findings (Sydney (N.S.W.)\",\"volume\":\"1 1\",\"pages\":\"895-911\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Findings (Sydney (N.S.W.)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2305.05474\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Findings (Sydney (N.S.W.)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2305.05474","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

新颖的意图发现自动化了对类似消息(问题)进行分组以识别先前未知意图的过程。然而,目前的研究集中在公开可用的数据集上,这些数据集只有问题领域,与现实生活中的数据集有很大不同。本文提出了改进部署在大型电子商务平台中的意图发现管道的方法。我们展示了在域内数据上预训练语言模型的好处:既有自我监督的,也有弱监督的。我们还设计了在聚类任务的微调过程中利用真实数据集的会话结构(即问答)的最佳方法,我们称之为Conv。与最先进的仅用于问题的约束深度自适应聚类(CDAC)模型相比,我们所有的方法结合起来,充分利用真实数据集中的性能提高了33pp。相比之下,问题数据的CDAC模型只比原始基线提供了高达13pp的性能提升。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Going beyond research datasets: Novel intent discovery in the industry setting
Novel intent discovery automates the process of grouping similar messages (questions) to identify previously unknown intents. However, current research focuses on publicly available datasets which have only the question field and significantly differ from real-life datasets. This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision. We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv. All our methods combined to fully utilize real-life datasets give up to 33pp performance boost over state-of-the-art Constrained Deep Adaptive Clustering (CDAC) model for question only. By comparison CDAC model for the question data only gives only up to 13pp performance boost over the naive baseline.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
审稿时长
4 weeks
期刊最新文献
Exploring Pedestrian Injury Severity by Incorporating Spatial Information in Machine Learning Darkness and Death in the U.S.: Walking Distances Across the Nation by Time of Day and Time of Year Activity Reduction as Resilience Indicator: Evidence with Filomena Data The Lifestyle and Mobility Connection of Community Supported Agriculture (CSA) Users Transit Fleet Electrification Barriers, Resolutions and Costs
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1