CL-WSTC: Continual Learning for Weakly Supervised Text Classification on the Internet

Proceedings of the ACM Web Conference 2023. Pub Date: 2023-04-30. DOI: 10.1145/3543507.3583249

Miao Li, Jiaqi Zhu, Xin Yang, Yi Yang, Qiang Gao, Hongan Wang



Abstract

Continual text classification is an important research direction in Web mining. Existing work is limited to supervised approaches that rely on abundant labeled data. In the open and dynamic environment of the Internet, however, known topics undergo constant semantic change and unknown topics keep appearing, so text annotations are hard to obtain in time for each period. This calls for weakly supervised text classification (WSTC), which requires only seed words for each category and has succeeded in static text classification tasks. However, no study has yet applied WSTC methods in a continual learning paradigm to accommodate the open and evolving Internet. In this paper, we tackle this problem for the first time and propose a framework named Continual Learning for Weakly Supervised Text Classification (CL-WSTC), which can take any WSTC method as its base model. It consists of two modules: classification decision with delay, and seed word updating. In the former, the probability threshold for each category in each period is adaptively learned to determine whether texts are accepted or rejected. In the latter, given candidate words output by the base model, seed words are added and deleted via reinforcement learning with immediate rewards, according to an empirically validated unsupervised measure. Extensive experiments show that our approach is broadly applicable across base models and achieves a better trade-off between classification accuracy and decision timeliness than non-continual counterparts, with intuitively interpretable updates to the seed words.
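To make the "classification decision with delay" idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the base model emits a probability per category for each text and that a per-category threshold is already available. The function name `decide_with_delay`, the category names, and the threshold values are all hypothetical; the abstract does not specify how thresholds are learned, so they are hard-coded here.

```python
from typing import Dict, List, Optional


def decide_with_delay(
    probs: List[Dict[str, float]],
    thresholds: Dict[str, float],
) -> List[Optional[str]]:
    """Accept a text's top category if its probability clears that
    category's threshold; otherwise return None to defer the decision."""
    decisions: List[Optional[str]] = []
    for p in probs:
        top = max(p, key=p.get)  # most probable category for this text
        decisions.append(top if p[top] >= thresholds[top] else None)
    return decisions


# Hypothetical base-model outputs for three texts in one period.
probs = [
    {"sports": 0.90, "politics": 0.10},
    {"sports": 0.55, "politics": 0.45},
    {"sports": 0.20, "politics": 0.80},
]
# Assumed per-category thresholds; in the paper these are learned adaptively.
thresholds = {"sports": 0.70, "politics": 0.75}

decisions = decide_with_delay(probs, thresholds)
print(decisions)  # ['sports', None, 'politics']
```

In this sketch, a `None` entry marks a text whose decision is delayed to a later period rather than forced into a category, which is one simple way to read the accuracy versus timeliness trade-off the abstract mentions.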


