Topic Extraction: BERTopic's Insight into the 117th Congress's Twitterverse

Informatics | IF 3.4 | JCR Q2 (Computer Science, Interdisciplinary Applications) | Pub Date: 2024-02-17 | DOI: 10.3390/informatics11010008

Margarida Mendonça, Álvaro Figueira


Citations: 0

Abstract



As social media (SM) becomes more prevalent, its impact on society is expected to grow accordingly. While SM has brought positive transformations, it has also amplified pre-existing issues such as misinformation, echo chambers, manipulation, and propaganda. A thorough understanding of this impact, aided by state-of-the-art analytical tools and an awareness of societal biases and complexities, enables us to anticipate and mitigate potential negative effects. One such tool is BERTopic, a novel deep-learning algorithm for topic mining that has been shown to offer significant advantages over traditional methods such as Latent Dirichlet Allocation (LDA), particularly its high modularity, which allows extensive customization at each stage of the topic modeling process. In this study, we hypothesize that BERTopic, when optimized for Twitter data, can produce more coherent and stable topic models. We began by reviewing the literature on topic-mining approaches for short-text data. Using this knowledge, we explored the potential for optimizing BERTopic and analyzed its effectiveness, focusing on Twitter data spanning the two years of the 117th US Congress. We evaluated BERTopic's performance using coherence, perplexity, diversity, and stability scores, finding significant improvements over traditional methods and over the tool's default parameters. In particular, we found that BERTopic's coherence and stability can be improved. We also identified the major topics of this Congress, which include abortion, student debt, and Judge Ketanji Brown Jackson. Finally, we describe a simple application we developed to better visualize Congress topics.
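The abstract names diversity as one of the evaluation scores but does not say how it was computed. As a rough illustration only, the sketch below uses one common definition, the share of unique words among the top-k words of all topics. The function name `topic_diversity`, the `top_k` parameter, and the example word lists are assumptions made for this sketch, not the authors' code or data.

```python
from typing import Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Share of unique words among the top_k words of every topic.

    A value of 1.0 means no word is shared between topics; values near 0
    mean the topics largely repeat the same words.
    """
    # Flatten the top_k words of each topic into a single list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("need at least one topic with at least one word")
    return len(set(top_words)) / len(top_words)


# Hypothetical top words, loosely themed on the topics named in the abstract.
# They are illustrative placeholders, not results from the paper.
example_topics = [
    ["abortion", "roe", "court", "rights"],
    ["student", "debt", "loan", "relief"],
    ["judge", "jackson", "court", "confirmation"],
]

diversity = topic_diversity(example_topics, top_k=4)
print(f"topic diversity: {diversity:.3f}")
```

In this toy input only "court" appears in two topics, so 11 of the 12 top words are unique. A score like this says nothing about whether topics are meaningful, which is why it is usually read alongside coherence and stability.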


Source journal

Informatics (Social Sciences - Communication)

CiteScore

6.60

Self-citation rate

6.50%

Articles published

Review time

6 weeks