{"title":"基于字典和机器学习的分类方法:对Twitter数据的调性和帧检测的比较","authors":"M. Reveilhac, D. Morselli","doi":"10.1080/2474736X.2022.2029217","DOIUrl":null,"url":null,"abstract":"ABSTRACT Automated text analysis methods have made it possible to classify large corpora of text by measures such as frames and tonality, with a growing popularity in social, political and psychological science. These methods often demand a training dataset of sufficient size to generate accurate models that can be applied to unseen texts. In practice, however, there are no clear recommendations about how big the training samples should be. This issue becomes especially acute when dealing with texts skewed toward categories and when researchers cannot afford large samples of annotated texts. Leveraging on the case of support for democracy, we provide a guide to help researchers navigate decisions when producing measures of tonality and frames from a small sample of annotated social media posts. We find that supervised machine learning algorithms outperform dictionaries for tonality classification tasks. However, custom dictionaries are useful complements of these algorithms when identifying latent democracy dimensions in social media messages, especially as the method of elaborating these dictionaries is guided by word embedding techniques and human validation. Therefore, we provide easily implementable recommendations to increase estimation accuracy under non-optimal condition.","PeriodicalId":20269,"journal":{"name":"Political Research Exchange","volume":" ","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2022-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Dictionary-based and machine learning classification approaches: a comparison for tonality and frame detection on Twitter data\",\"authors\":\"M. Reveilhac, D. Morselli\",\"doi\":\"10.1080/2474736X.2022.2029217\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"ABSTRACT Automated text analysis methods have made it possible to classify large corpora of text by measures such as frames and tonality, with a growing popularity in social, political and psychological science. These methods often demand a training dataset of sufficient size to generate accurate models that can be applied to unseen texts. In practice, however, there are no clear recommendations about how big the training samples should be. This issue becomes especially acute when dealing with texts skewed toward categories and when researchers cannot afford large samples of annotated texts. Leveraging on the case of support for democracy, we provide a guide to help researchers navigate decisions when producing measures of tonality and frames from a small sample of annotated social media posts. We find that supervised machine learning algorithms outperform dictionaries for tonality classification tasks. However, custom dictionaries are useful complements of these algorithms when identifying latent democracy dimensions in social media messages, especially as the method of elaborating these dictionaries is guided by word embedding techniques and human validation. 
Therefore, we provide easily implementable recommendations to increase estimation accuracy under non-optimal condition.\",\"PeriodicalId\":20269,\"journal\":{\"name\":\"Political Research Exchange\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2022-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Political Research Exchange\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1080/2474736X.2022.2029217\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"POLITICAL SCIENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Political Research Exchange","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1080/2474736X.2022.2029217","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"POLITICAL SCIENCE","Score":null,"Total":0}
Dictionary-based and machine learning classification approaches: a comparison for tonality and frame detection on Twitter data
ABSTRACT Automated text analysis methods have made it possible to classify large corpora of text by measures such as frames and tonality, and they are increasingly popular in the social, political, and psychological sciences. These methods often require a training dataset large enough to generate accurate models that can be applied to unseen texts. In practice, however, there are no clear recommendations about how large training samples should be. The issue becomes especially acute when the texts are skewed toward certain categories and when researchers cannot afford large samples of annotated texts. Using the case of support for democracy, we provide a guide to help researchers navigate these decisions when producing measures of tonality and frames from a small sample of annotated social media posts. We find that supervised machine learning algorithms outperform dictionaries for tonality classification tasks. However, custom dictionaries are useful complements to these algorithms when identifying latent democracy dimensions in social media messages, especially when the dictionaries are built with the guidance of word embedding techniques and human validation. We therefore provide easily implementable recommendations for increasing estimation accuracy under non-optimal conditions.
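The abstract contrasts three ingredients: a dictionary-based classifier, a supervised machine learning classifier trained on a small annotated sample, and an embedding-guided procedure for elaborating custom dictionaries subject to human validation. The sketch below is not the authors' pipeline; it only illustrates the general idea with toy posts, invented seed words, and assumed tools (scikit-learn's TfidfVectorizer and LogisticRegression, gensim's Word2Vec), so all names and parameters are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): dictionary-based tonality vs a small
# supervised model, plus word-embedding candidates for dictionary expansion.
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy annotated sample of posts (1 = positive tonality, 0 = negative).
posts = [
    "free and fair elections strengthen our democracy",
    "the regime rigged the vote and silenced critics",
    "citizens deserve a real voice in parliament",
    "corrupt elites ignore the will of the people",
] * 10  # repeated only so cross-validation has enough examples
labels = [1, 0, 1, 0] * 10

# 1) Dictionary-based tonality: count seed terms per post.
positive_seeds = {"free", "fair", "voice", "strengthen"}
negative_seeds = {"rigged", "corrupt", "silenced", "ignore"}

def dictionary_tonality(text: str) -> int:
    tokens = text.lower().split()
    score = sum(t in positive_seeds for t in tokens) - sum(t in negative_seeds for t in tokens)
    return int(score >= 0)

dict_accuracy = sum(dictionary_tonality(p) == y for p, y in zip(posts, labels)) / len(posts)

# 2) Supervised baseline: TF-IDF features + logistic regression, cross-validated.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
ml_accuracy = cross_val_score(clf, posts, labels, cv=5).mean()

# 3) Embedding-guided dictionary expansion: train a small word2vec model and
#    propose neighbours of seed words as candidates for human validation.
sentences = [p.lower().split() for p in posts]
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)
candidates = {seed: [w for w, _ in w2v.wv.most_similar(seed, topn=3)]
              for seed in positive_seeds if seed in w2v.wv}

print(f"dictionary accuracy: {dict_accuracy:.2f}")
print(f"supervised accuracy: {ml_accuracy:.2f}")
print("expansion candidates for human review:", candidates)
```

Note that the embedding step only proposes candidate terms; in line with the abstract, such candidates would still require human validation before entering a custom dictionary.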