Weakly supervised text classification method based on Transformer

Ling Gan, Aijun Yi
DOI: 10.1117/12.2672391 (https://doi.org/10.1117/12.2672391)
Published in: International Conference on Mechatronics Engineering and Artificial Intelligence
Publication date: 2023-02-28
Citations: 0

Abstract

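Seed word-driven methods are the dominant approach to weakly supervised text classification (WTC). Existing seed word-driven methods update the seed words using metrics such as term frequency (TF), inverse document frequency (IDF), and their combinations. These methods assign the same weight to all metrics, which leads to common or poorly discriminative words being selected as seed words. In addition, most of the text classifiers used in prior work have difficulty capturing correlations and global information within the text. To address these problems, a Transformer is first used as the text classifier: its multi-head self-attention mechanism captures long-range dependencies while computing in parallel, and fully learns the global semantic information of the input text. An improved TF-IDF method is then proposed that increases the weight of IDF so that common words that harm classification are filtered out. The method improves experimental results on the 20News and NYT datasets.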
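The abstract does not give the exact form of the improved TF-IDF score, so the following is only a minimal sketch of the general idea of weighting IDF more heavily when ranking candidate seed words. The scoring rule `tf * idf ** alpha`, the exponent `alpha`, the smoothed IDF, and the function names `weighted_tfidf` and `top_seed_words` are all assumptions made for illustration, not the authors' formulation.

```python
import math
from collections import Counter


def weighted_tfidf(class_docs, all_docs, alpha=2.0):
    """Score candidate seed words for one class as tf * idf ** alpha.

    alpha is a hypothetical exponent (not from the paper): alpha=1 gives
    plain TF-IDF, alpha>1 puts more weight on IDF.
    Documents are lists of tokens.
    """
    n_docs = len(all_docs)

    # Document frequency of each word over the whole corpus.
    df = Counter()
    for doc in all_docs:
        df.update(set(doc))

    # Term frequency within the documents currently assigned to the class.
    tf = Counter()
    for doc in class_docs:
        tf.update(doc)
    total = sum(tf.values())

    scores = {}
    for word, count in tf.items():
        # Smoothed IDF (an assumption): never negative, and safe for
        # words that do not occur in all_docs.
        idf = math.log((1 + n_docs) / (1 + df[word]))
        scores[word] = (count / total) * idf ** alpha
    return scores


def top_seed_words(scores, k=3):
    """Return the k highest-scoring words, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Toy corpus (illustrative data only).
sports = [["the", "game", "team", "the"], ["the", "team", "score"]]
politics = [["the", "vote", "senate"], ["the", "law", "vote", "the"]]
all_docs = sports + politics

plain = weighted_tfidf(sports, all_docs, alpha=1.0)
boosted = weighted_tfidf(sports, all_docs, alpha=2.0)
print(top_seed_words(plain), top_seed_words(boosted))
```

In this toy corpus, "the" occurs in every document and scores zero under either setting, while raising `alpha` from 1 to 2 moves the top-ranked word from "team" (frequent, but present in half of all documents) to the rarer "game". How strongly IDF should be weighted in practice would need to be tuned on the target dataset.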