Weakly supervised text classification method based on transformer
Ling Gan, Aijun Yi
DOI: 10.1117/12.2672391
International Conference on Mechatronics Engineering and Artificial Intelligence, 2023-02-28
Citations: 0
Abstract
Seed word-driven approaches dominate weakly supervised text classification (WTC). Existing seed word-driven methods update the seed words using metrics such as term frequency (TF), inverse document frequency (IDF), and their combinations, but they assign the same weight to every metric, which leads to common or poorly discriminative words being selected as seed words. In addition, most of the text classifiers used in prior work have difficulty capturing the correlations and global information within a text. To address these problems, a Transformer is first used as the text classifier: its multi-head self-attention mechanism captures long-range dependencies while computing in parallel, fully learning the global semantic information of the input text. Then an improved TF-IDF method is proposed that increases the weight of IDF so that common words that harm classification can be filtered out. Experimental results improve on the 20News and NYT datasets.
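The idea of upweighting IDF so that common words score lower can be sketched as below. This is a minimal illustration, not the paper's exact formula: the exponent `alpha` applied to IDF is an assumption introduced here purely to show how giving IDF more weight pushes down words that appear in many documents.

```python
import math
from collections import Counter

def idf_weighted_scores(docs, alpha=2.0):
    """Score candidate seed words by TF * IDF**alpha.

    With alpha > 1, IDF carries more weight than in plain TF-IDF,
    so words common across many documents (low IDF) are suppressed
    harder. NOTE: the exponent form is an illustrative assumption,
    not the formula from the paper.
    """
    n = len(docs)
    tf = Counter()   # raw term frequency over the whole corpus
    df = Counter()   # number of documents containing each word
    for doc in docs:
        tokens = doc.lower().split()
        tf.update(tokens)
        df.update(set(tokens))
    return {w: tf[w] * (math.log(n / df[w]) ** alpha) for w in tf}
```

For example, a word occurring in every document gets IDF = log(1) = 0 and is filtered out entirely, while a rare but frequent-in-one-class word keeps a high score; raising `alpha` widens that gap.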