CL-WSTC: Continual Learning for Weakly Supervised Text Classification on the Internet

Miao Li, Jiaqi Zhu, Xin Yang, Yi Yang, Qiang Gao, Hongan Wang
{"title":"CL-WSTC: Continual Learning for Weakly Supervised Text Classification on the Internet","authors":"Miao Li, Jiaqi Zhu, Xin Yang, Yi Yang, Qiang Gao, Hongan Wang","doi":"10.1145/3543507.3583249","DOIUrl":null,"url":null,"abstract":"Continual text classification is an important research direction in Web mining. Existing works are limited to supervised approaches relying on abundant labeled data, but in the open and dynamic environment of Internet, involving constant semantic change of known topics and the appearance of unknown topics, text annotations are hard to access in time for each period. That calls for the technique of weakly supervised text classification (WSTC), which requires just seed words for each category and has succeed in static text classification tasks. However, there are still no studies of applying WSTC methods in a continual learning paradigm to actually accommodate the open and evolving Internet. In this paper, we tackle this problem for the first time and propose a framework, named Continual Learning for Weakly Supervised Text Classification (CL-WSTC), which can take any WSTC method as base model. It consists of two modules, classification decision with delay and seed word updating. In the former, the probability threshold for each category in each period is adaptively learned to determine the acceptance/rejection of texts. In the latter, with candidate words output by the base model, seed words are added and deleted via reinforcement learning with immediate rewards, according to an empirically certified unsupervised measure. Extensive experiments show that our approach has strong universality and can achieve a better trade-off between classification accuracy and decision timeliness compared to non-continual counterparts, with intuitively interpretable updating of seed words.","PeriodicalId":296351,"journal":{"name":"Proceedings of the ACM Web Conference 2023","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Web Conference 2023","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3543507.3583249","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Continual text classification is an important research direction in Web mining. Existing works are limited to supervised approaches relying on abundant labeled data, but in the open and dynamic environment of Internet, involving constant semantic change of known topics and the appearance of unknown topics, text annotations are hard to access in time for each period. That calls for the technique of weakly supervised text classification (WSTC), which requires just seed words for each category and has succeed in static text classification tasks. However, there are still no studies of applying WSTC methods in a continual learning paradigm to actually accommodate the open and evolving Internet. In this paper, we tackle this problem for the first time and propose a framework, named Continual Learning for Weakly Supervised Text Classification (CL-WSTC), which can take any WSTC method as base model. It consists of two modules, classification decision with delay and seed word updating. In the former, the probability threshold for each category in each period is adaptively learned to determine the acceptance/rejection of texts. In the latter, with candidate words output by the base model, seed words are added and deleted via reinforcement learning with immediate rewards, according to an empirically certified unsupervised measure. Extensive experiments show that our approach has strong universality and can achieve a better trade-off between classification accuracy and decision timeliness compared to non-continual counterparts, with intuitively interpretable updating of seed words.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
互联网上弱监督文本分类的持续学习
连续文本分类是Web挖掘的一个重要研究方向。现有的工作仅限于依赖于大量标注数据的监督方法,但在互联网开放、动态的环境中,涉及到已知主题的不断语义变化和未知主题的出现,文本注释很难在每个时期都能及时获取。这需要弱监督文本分类技术(WSTC),该技术只需要每个类别的种子词,并且在静态文本分类任务中取得了成功。然而,目前还没有研究将WSTC方法应用到持续学习范式中,以真正适应开放和不断发展的互联网。本文首次解决了这一问题,提出了一个基于弱监督文本分类的持续学习框架(CL-WSTC),该框架可以采用任何弱监督文本分类方法作为基本模型。它包括两个模块:带延迟的分类决策模块和种子词更新模块。在前者中,自适应学习每个时期每个类别的概率阈值,以确定文本的接受/拒绝。在后者中,根据基础模型输出的候选词,根据经验认证的无监督度量,通过带有即时奖励的强化学习来添加和删除种子词。大量的实验表明,我们的方法具有很强的通用性,与非连续的方法相比,可以在分类精度和决策及时性之间实现更好的权衡,并且种子词的更新具有直观的可解释性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
CurvDrop: A Ricci Curvature Based Approach to Prevent Graph Neural Networks from Over-Smoothing and Over-Squashing Learning to Simulate Crowd Trajectories with Graph Networks Word Sense Disambiguation by Refining Target Word Embedding Curriculum Graph Poisoning Optimizing Guided Traversal for Fast Learned Sparse Retrieval
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1