Strategies for the Analysis of Large Social Media Corpora: Sampling and Keyword Extraction Methods.

Corpus Pragmatics (Language & Linguistics; IF 1.3; JCR: N/A). Pub Date: 2023-04-30. DOI: 10.1007/s41701-023-00143-0
Antonio Moreno-Ortiz, María García-Gámez
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10148754/pdf/
Citations: 2

Abstract

In the context of the COVID-19 pandemic, social media platforms such as Twitter have been of great importance for users to exchange news, ideas, and perceptions. Researchers from fields such as discourse analysis and the social sciences have turned to this content to explore public opinion and stance on this topic, gathering information through the compilation of large-scale corpora. However, the size of such corpora is both an advantage and a drawback, as simple text retrieval techniques and tools may prove impractical or altogether incapable of handling such masses of data. This study provides methodological and practical guidance on how to manage the contents of a large-scale social media corpus such as the COVID-19 corpus of Chen et al. (JMIR Public Health Surveill 6(2):e19273, 2020). We compare and evaluate, in terms of efficiency and efficacy, available methods to handle such a large corpus. First, we compare different sample sizes to assess whether similar results can be achieved despite the size difference, and we evaluate sampling methods following a specific data management approach to storing the original corpus. Second, we examine two keyword extraction methodologies commonly used to obtain a compact representation of the main subject and topics of a text: the traditional method used in corpus linguistics, which compares word frequencies against a reference corpus, and graph-based techniques developed for Natural Language Processing tasks. The methods and strategies discussed in this study enable valuable quantitative and qualitative analyses of an otherwise intractable mass of social media data.
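
The three ideas named in the abstract (drawing a sample, scoring keywords against a reference corpus, and ranking words on a graph) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, the whitespace tokenizer is a placeholder, log-likelihood is assumed here as one common keyness statistic, and the graph ranking is a simplified TextRank-style PageRank over word co-occurrences.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()


def random_sample(docs, fraction, seed=0):
    """Draw a simple random sample of documents without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = min(len(docs), max(1, round(len(docs) * fraction)))
    return rng.sample(docs, k)


def log_likelihood_keyness(study_tokens, reference_tokens):
    """Score words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    n_study, n_ref = sum(study.values()), sum(ref.values())
    scores = {}
    for word, a in study.items():
        b = ref.get(word, 0)
        # Expected counts if the word were equally frequent in both corpora.
        e1 = n_study * (a + b) / (n_study + n_ref)
        e2 = n_ref * (a + b) / (n_study + n_ref)
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        # Keep positive keywords only (overrepresented in the study corpus).
        if a / n_study > b / n_ref:
            scores[word] = g2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def graph_keywords(tokens, window=2, damping=0.85, iterations=30):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it inside the window.
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    rank = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        rank = {
            w: (1 - damping)
            + damping * sum(rank[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)
```

In use, one would sample the documents first, tokenize the sample, and then compare the two keyword rankings. The reference-corpus method needs a second corpus to compare against, whereas the graph-based method works on the study text alone, which is the practical difference between the two families of approaches.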

Source journal
Corpus Pragmatics (Arts and Humanities: Language and Linguistics)
CiteScore: 2.60
Self-citation rate: 0.00%
Articles published: 15
Journal description: Corpus Pragmatics offers a forum for theoretical and applied linguists who carry out research in the new linguistic discipline that stands at the interface between corpus linguistics and pragmatics. The journal promotes the combination of the two approaches through research on new topics in linguistics, with a particular focus on interdisciplinary studies, and aims to enlarge and implement current pragmatic theories that have hitherto not benefited from empirical corpus support. Authors are encouraged to describe the statistical analyses used in their research and to supply the data and scripts in R when possible. The objective of Corpus Pragmatics is to develop pragmatics with the aid of quantitative corpus methodology. The journal accepts original research papers, short research notes, and occasional thematic issues. The journal follows a double-blind peer review system.