Dictionary-based and machine learning classification approaches: a comparison for tonality and frame detection on Twitter data

Political Research Exchange (IF 1.8, JCR Q2, Political Science) | Pub Date: 2022-02-01 | DOI: 10.1080/2474736X.2022.2029217

M. Reveilhac, D. Morselli


Citations: 3



ABSTRACT Automated text analysis methods have made it possible to classify large corpora of text by measures such as frames and tonality, and they are increasingly popular in the social, political and psychological sciences. These methods often demand a training dataset of sufficient size to generate accurate models that can be applied to unseen texts. In practice, however, there are no clear recommendations about how big the training samples should be. This issue becomes especially acute when dealing with texts skewed toward categories and when researchers cannot afford large samples of annotated texts. Using the case of support for democracy, we provide a guide to help researchers navigate decisions when producing measures of tonality and frames from a small sample of annotated social media posts. We find that supervised machine learning algorithms outperform dictionaries for tonality classification tasks. However, custom dictionaries are useful complements to these algorithms when identifying latent democracy dimensions in social media messages, especially when the method of elaborating these dictionaries is guided by word embedding techniques and human validation. We therefore provide easily implementable recommendations to increase estimation accuracy under non-optimal conditions.
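The contrast between the two approaches compared in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the lexicon, the example posts, the function `dictionary_tonality` and the class `NaiveBayes` are all hypothetical, and multinomial Naive Bayes is only a stand-in, since the abstract does not name the supervised algorithms used. The sketch scores tonality once by counting lexicon hits and once with a classifier trained on a few annotated posts.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy lexicon and annotated posts, invented for illustration only.
POSITIVE = {"support", "trust", "fair", "proud"}
NEGATIVE = {"corrupt", "rigged", "broken", "distrust"}

TRAIN = [
    ("I support our democracy and trust the vote", "positive"),
    ("proud of a fair election", "positive"),
    ("the system is rigged and corrupt", "negative"),
    ("elections are a joke, totally broken", "negative"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_tonality(text):
    """Dictionary approach: positive hits minus negative hits."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen in training carry no information and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label in sorted(self.class_counts):
            total_words = sum(self.word_counts[label].values())
            logp = math.log(self.class_counts[label] / total_docs)
            for t in tokens:
                smoothed = self.word_counts[label][t] + 1
                logp += math.log(smoothed / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])
post = "this election is a joke"
print(dictionary_tonality(post))  # no lexicon word matches this post
print(model.predict(post))        # the classifier has seen "joke" in a negative post
```

On this toy input the dictionary returns a neutral label because none of its entries occur in the post, while the trained classifier picks up wording it saw in the annotated examples. With only four training posts the result says nothing about real-world accuracy; it only shows the mechanical difference between the two approaches.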


Source journal