Large-Scale Demographic Inference of Social Media Users in a Low-Resource Scenario

Karim Lasri, Manuel Tonneau, Haaya Naushan, Niyati Malhotra, I. Farouq, Victor Orozco-Olvera, S. Fraiberger
International Conference on Web and Social Media, published 2023-06-02
DOI: 10.1609/icwsm.v17i1.22165 (https://doi.org/10.1609/icwsm.v17i1.22165)

Abstract

Characterizing the demographics of social media users enables a range of applications, from better targeting of policy interventions to deriving representative population estimates of social phenomena. Achieving high performance with supervised learning, however, can be challenging because labeled data is often scarce. Rule-based matching strategies, by contrast, provide well-grounded information but cover only part of the user base. It is therefore unclear which features and models are best suited to maximizing coverage over a large set of users while maintaining high performance. In this paper, we develop a cost-effective strategy for large-scale demographic inference that relies on minimal labeling effort. We combine a name-matching strategy with graph-based methods to map the demographics of 1.8 million Nigerian Twitter users. Specifically, we compare a purely graph-based propagation model, Label Propagation (LP), with Graph Convolutional Networks (GCN), a graph model that also incorporates node features based on user content. We find that both models largely outperform supervised learning approaches that rely purely on user content and lack graph information. Notably, LP achieves performance comparable to the state-of-the-art GCN while providing greater interpretability at a lower computing cost. Moreover, performance does not significantly improve when user-specific features, such as textual representations of user tweets and user geolocation, are added. Leveraging our data collection effort, we describe the demographic composition of Nigerian Twitter, finding that it is a highly non-uniform sample of the general Nigerian population.
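
To make the graph-based idea in the abstract concrete, here is a minimal sketch of label propagation with clamped seed labels. It is not the authors' implementation. The toy graph, the function name `label_propagation`, the iteration count, and the choice of a row-normalized adjacency matrix are all assumptions made for illustration. The seed labels stand in for the users that a name-matching step would label, and every other node receives a score from its neighbors.

```python
import numpy as np

def label_propagation(adj, seed_labels, n_classes, n_iter=50):
    """Propagate seed labels over an undirected graph (illustrative sketch).

    adj: (n, n) symmetric adjacency matrix.
    seed_labels: length-n int array; class index for seeded nodes, -1 otherwise.
    Returns an (n, n_classes) array of class scores.
    """
    adj = np.asarray(adj, dtype=float)
    seed_labels = np.asarray(seed_labels)
    n = adj.shape[0]

    # Row-normalize so each node averages the scores of its neighbors.
    degree = adj.sum(axis=1, keepdims=True)
    transition = np.divide(adj, degree, out=np.zeros_like(adj), where=degree > 0)

    # One-hot scores for seeded nodes, zeros for unlabeled nodes.
    seeded = seed_labels >= 0
    clamp = np.zeros((n, n_classes))
    clamp[seeded, seed_labels[seeded]] = 1.0

    scores = clamp.copy()
    for _ in range(n_iter):
        scores = transition @ scores
        scores[seeded] = clamp[seeded]  # keep the seed labels fixed
    return scores

# Hypothetical toy graph: two triangles (0-1-2 and 3-4-5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

# Only nodes 0 and 5 carry a (hypothetical) name-matched label.
seeds = np.array([0, -1, -1, -1, -1, 1])
scores = label_propagation(adj, seeds, n_classes=2)
pred = scores.argmax(axis=1)
print(pred.tolist())  # [0, 0, 0, 1, 1, 1]
```

In this sketch each unlabeled node ends up with the label that dominates its side of the graph, which is why the method needs no node features. A GCN would additionally take a per-node feature matrix as input.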