Dictionary-based and machine learning classification approaches: a comparison for tonality and frame detection on Twitter data

Political Research Exchange (IF 1.8, Q2, Political Science) | Pub Date: 2022-02-01 | DOI: 10.1080/2474736X.2022.2029217
M. Reveilhac, D. Morselli
Citations: 3

Abstract

Automated text analysis methods have made it possible to classify large corpora of text by measures such as frames and tonality, and they are growing in popularity in the social, political and psychological sciences. These methods often demand a training dataset of sufficient size to generate accurate models that can be applied to unseen texts. In practice, however, there are no clear recommendations about how large the training samples should be. This issue becomes especially acute when the texts are skewed toward particular categories and when researchers cannot afford large samples of annotated texts. Drawing on the case of support for democracy, we provide a guide to help researchers navigate decisions when producing measures of tonality and frames from a small sample of annotated social media posts. We find that supervised machine learning algorithms outperform dictionaries for tonality classification tasks. However, custom dictionaries are useful complements to these algorithms when identifying latent democracy dimensions in social media messages, especially as the method of elaborating these dictionaries is guided by word embedding techniques and human validation. We therefore provide easily implementable recommendations to increase estimation accuracy under non-optimal conditions.
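
The contrast between the two families of methods named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the word lists, the four training posts, and the choice of a multinomial Naive Bayes classifier are hypothetical and are not taken from the article, which does not specify its dictionaries or algorithms in the abstract.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicon, NOT the authors' dictionary.
POSITIVE = {"support", "defend", "great", "trust"}
NEGATIVE = {"corrupt", "broken", "fail", "distrust"}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def dictionary_tonality(text):
    """Dictionary approach: count lexicon hits; ties or no hits are neutral."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


class NaiveBayes:
    """Supervised approach: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Smoothed denominator: tokens seen in this class plus vocabulary size.
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokenize(text):
                if tok in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical annotated sample standing in for a small hand-coded training set.
train_texts = [
    "we support democracy",
    "i trust our elections",
    "democracy is broken",
    "politicians are corrupt",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_texts, train_labels)
```

The dictionary function needs no annotated data but only reacts to words in its lists, whereas the classifier learns word weights from the labeled posts and can pick up cues (here, for example, "elections" or "politicians") that no lexicon entry covers. With four training posts this says nothing about real-world accuracy; it only shows the mechanics of each approach.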
Source journal: Political Research Exchange (Political Science)
CiteScore: 3.40 | Self-citation rate: 0.00% | Articles published: 25 | Review time: 39 weeks