Categorizing E-cigarette-related tweets using BERT topic modeling
D. Murthy, S. Keshari, S. Arora, Q. Yang, A. Loukas, S.J. Schwartz, M.B. Harrell, E.T. Hébert, A.V. Wilkinson
Emerging trends in drugs, addictions, and health, volume 4, Article 100160. Published 2024-10-05. DOI: 10.1016/j.etdah.2024.100160
https://www.sciencedirect.com/science/article/pii/S2667118224000199
Abstract
Background
Social media platforms are critical channels for promoting e-cigarettes, particularly among youth, making analysis of their vast and diverse content essential for public health interventions. Prevalence rates of e-cigarette use are high, and evidence suggests that social media are popular forums that promote e-cigarette use through direct and indirect marketing techniques. The volume and diversity of e-cigarette-related information on social media pose a challenge and may obscure public health prevention messaging. Traditional hand-coding methods are labor-intensive and do not scale. In contrast, unsupervised machine learning approaches, such as topic modeling, allow for efficient analysis of large datasets, uncovering patterns and trends that manual methods cannot surface at scale. The present study examined the extent to which themes and topics in tweets related to e-cigarettes can be grouped into useful homogeneous units using machine learning. A better understanding of current depictions and discussions of e-cigarette products and use on social media can inform public health counter-messaging and policy interventions.
Methods
We used topic modeling (BERTopic) to iteratively derive vape-related tweet clusters and calculate the importance of particular words to these groupings. We conducted a qualitative content analysis to study the clustered tweets. We also used automated geoparsing methods, which translate toponyms in textual data into geographic identifiers, to attempt to infer the geographic locations of e-cigarette conversations.
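The word-importance step can be illustrated with a class-based TF-IDF weighting of the kind BERTopic documents, in which all tweets in a cluster are pooled into one document and each word is scored as its in-cluster frequency times log(1 + A / f), where A is the average word count per cluster and f is the word's frequency across all clusters. The following is a minimal sketch under those assumptions, not the authors' pipeline: the function names, the whitespace tokenizer, and the toy tweets are hypothetical, and the embedding and clustering stages as well as any normalization BERTopic applies are omitted.

```python
import math
from collections import Counter


def class_tfidf(clusters):
    """Score each word per cluster with a class-based TF-IDF weighting.

    clusters: dict mapping a cluster label to a list of tweet strings.
    Returns a dict mapping label -> {word: score}.
    """
    # Pool every tweet in a cluster into one bag of words (naive tokenizer).
    tf = {
        label: Counter(word for tweet in tweets for word in tweet.lower().split())
        for label, tweets in clusters.items()
    }
    # Frequency of each word across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of words per cluster (the "A" term).
    avg_words = sum(total.values()) / len(tf)
    return {
        label: {w: c * math.log(1 + avg_words / total[w]) for w, c in counts.items()}
        for label, counts in tf.items()
    }


def top_words(scores, n=3):
    """Return the n highest-scoring words per cluster (ties broken alphabetically)."""
    return {
        label: [w for w, _ in sorted(ws.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for label, ws in scores.items()
    }


# Hypothetical toy clusters, not data from the study.
toy_clusters = {
    "flavors": ["mango disposable vape", "new mango flavor vape"],
    "quitting": ["quit vaping today", "how to quit vape"],
}
scores = class_tfidf(toy_clusters)
print(top_words(scores, n=1))
```

In this toy input, "vape" appears in both clusters and is therefore down-weighted relative to words concentrated in a single cluster, which is the behavior that makes the top-scoring words usable as cluster descriptors.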
Results
We identified more than 100,000 tweets in broad thematic categories in English and Spanish. Our correlation and inter-topic map analysis of the machine-derived topics, which examines the relationships between topics, indicated that most of the topics were distinct (correlation value < 0.5) and did not overlap with each other. We identified six topics: Flavors and Disposable Vapes, Cannabis, Vape Shops and Refillable Vapes, Vape Culture, Anti-vaping and Quitting, and Spanish Tweets and Vaping Nicotine. Further analysis of these topics using qualitative methods identified themes within each topic. For example, Category 6 (Spanish Tweets and Vaping Nicotine) included four topics focused on the health risks of vaping, personal motivations for vaping, and the regulation of vaping products. Using geoparsing, which automatically detects location information, we found that the United States had the highest number of tweets related to vaping.
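The distinctness criterion (correlation value < 0.5) can be illustrated with a pairwise check over topic vectors. The abstract does not state how the correlation was computed, so the sketch below assumes a Pearson correlation between per-topic word-weight vectors; the function names, topic labels, and vectors are hypothetical and chosen only to show one pair that would fail the criterion.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def overlapping_pairs(topic_vectors, threshold=0.5):
    """Return topic pairs whose correlation is at or above the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(topic_vectors), 2)
        if pearson(topic_vectors[a], topic_vectors[b]) >= threshold
    ]


# Hypothetical word-weight vectors over a shared four-word vocabulary.
toy_vectors = {
    "cannabis": [0.9, 0.1, 0.0, 0.2],
    "flavors": [0.1, 0.8, 0.1, 0.0],
    "disposables": [0.0, 0.9, 0.2, 0.1],
}
print(overlapping_pairs(toy_vectors))
```

Pairs returned by `overlapping_pairs` are candidates for merging or closer manual review; topics that appear in no returned pair satisfy the < 0.5 criterion under this assumed measure.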
Discussion/conclusion
Results underscore the potential of BERTopic modeling to reduce large quantities of data in order to comprehensively describe and categorize the myriad e-cigarette-related messages to which social media users are exposed. This data reduction approach can be applied to various social media platforms to describe and categorize e-cigarette posts and thereby triangulate and validate findings. Thematic content analysis of the topics identified through this technique requires human supervision and input. Our approach provides a comprehensive understanding of the evolving e-cigarette discourse, informing public health counter-messaging and policy interventions. Moreover, findings support the need for regulation, such as reducing appealing flavors, and suggest that social media can be used effectively to support public health messaging (i.e., quitting messages).