EPJ Data Science: Latest Articles

Digital traces of brain drain: developers during the Russian invasion of Ukraine.
IF 3.6; Zone 2 (Computer Science); Q1 (Mathematics, Interdisciplinary Applications); Pub Date: 2023-01-01; DOI: 10.1140/epjds/s13688-023-00389-3
Johannes Wachs

The Russian invasion of Ukraine has caused large-scale destruction, significant loss of life, and the displacement of millions of people. Besides those fleeing direct conflict in Ukraine, many individuals in Russia are also thought to have moved to third countries. In particular, the exodus of skilled human capital out of Russia, sometimes called brain drain, may have a significant effect on the course of the war and on the Russian economy in the long run. Yet quantifying brain drain, especially during crisis situations, is generally difficult. This hinders our ability to understand its drivers and to anticipate its consequences. To address this gap, I draw on and extend a large-scale dataset of the locations of highly active software developers collected in February 2021, one year before the invasion. Revisiting those developers who had been located in Russia in 2021, I confirm an ongoing exodus of developers from Russia in snapshots taken in June and November 2022. By November, 11.1% of Russian developers list a new country, compared with 2.8% of developers from comparable countries in the region that are not directly involved in the conflict. In addition, 13.2% of Russian developers have obscured their location (vs. 2.4% in the comparison set). Developers leaving Russia were significantly more active and central in the collaboration network than those who remain. This suggests that many of the most important developers have already left Russia. In some receiving countries the number of arrivals is significant: I estimate an increase in the number of local software developers of 42% in Armenia, 60% in Cyprus and 94% in Georgia.
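
To put the reported shares side by side, the short sketch below divides each Russian share by the corresponding comparison-group share. The four percentages come from the abstract; the function and variable names are illustrative, and the ratio is a simple descriptive comparison, not a statistic reported by the author.

```python
# Shares reported in the abstract (percent of developers, November 2022 snapshot).
russia = {"new_country": 11.1, "location_obscured": 13.2}
comparison = {"new_country": 2.8, "location_obscured": 2.4}


def rate_ratio(treated_pct: float, baseline_pct: float) -> float:
    """Return how many times larger the treated share is than the baseline share."""
    return treated_pct / baseline_pct


# Ratio of the Russian share to the comparison-group share for each outcome.
ratios = {key: rate_ratio(russia[key], comparison[key]) for key in russia}

for key, value in ratios.items():
    print(f"{key}: {value:.1f}x the comparison-group share")
```

Under this reading, Russian developers were roughly four times as likely to list a new country and about 5.5 times as likely to obscure their location as developers in the comparison set.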

Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00389-3.

EPJ Data Science 12(1): 14 (2023). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10184088/pdf/
Citations: 1
Exposing influence campaigns in the age of LLMs: a behavioral-based AI approach to detecting state-sponsored trolls.
IF 3.6; Zone 2 (Computer Science); Q1 (Mathematics, Interdisciplinary Applications); Pub Date: 2023-01-01; Epub Date: 2023-10-09; DOI: 10.1140/epjds/s13688-023-00423-4
Fatima Ezzeddine, Omran Ayoub, Silvia Giordano, Gianluca Nogara, Ihab Sbeity, Emilio Ferrara, Luca Luceri

The detection of state-sponsored trolls operating in influence campaigns on social media is a critical and unsolved challenge for the research community, which has significant implications beyond the online realm. To address this challenge, we propose a new AI-based solution that identifies troll accounts solely through behavioral cues associated with their sequences of sharing activity, encompassing both their actions and the feedback they receive from others. Our approach does not incorporate any textual content shared and consists of two steps: First, we leverage an LSTM-based classifier to determine whether account sequences belong to a state-sponsored troll or an organic, legitimate user. Second, we employ the classified sequences to calculate a metric named the "Troll Score", quantifying the degree to which an account exhibits troll-like behavior. To assess the effectiveness of our method, we examine its performance in the context of the 2016 Russian interference campaign during the U.S. Presidential election. Our experiments yield compelling results, demonstrating that our approach can identify account sequences with an AUC close to 99% and accurately differentiate between Russian trolls and organic users with an AUC of 91%. Notably, our behavioral-based approach holds a significant advantage in the ever-evolving landscape, where textual and linguistic properties can be easily mimicked by Large Language Models (LLMs): In contrast to existing language-based techniques, it relies on more challenging-to-replicate behavioral cues, ensuring greater resilience in identifying influence campaigns, especially given the potential increase in the usage of LLMs for generating inauthentic content. Finally, we assessed the generalizability of our solution to various entities driving different information operations and found promising results that will guide future research.
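
The second step described above, turning per-sequence classifications into an account-level score, can be sketched as follows. The abstract does not give the formula, so this assumes one plausible reading: the Troll Score is the fraction of an account's activity sequences that the classifier labels as troll-like. The labels are hypothetical stand-ins for the LSTM output, and `troll_score`, `is_troll`, and the 0.5 threshold are illustrative names and values, not taken from the paper.

```python
from typing import Sequence


def troll_score(sequence_labels: Sequence[int]) -> float:
    """Fraction of an account's sequences labelled troll-like (1) rather than organic (0).

    Assumption: this is one plausible definition, not the paper's stated formula.
    """
    if not sequence_labels:
        raise ValueError("account has no classified sequences")
    return sum(sequence_labels) / len(sequence_labels)


def is_troll(sequence_labels: Sequence[int], threshold: float = 0.5) -> bool:
    """Flag an account when its score reaches a (hypothetical) decision threshold."""
    return troll_score(sequence_labels) >= threshold


# Hypothetical classifier outputs for two accounts: 1 = troll-like, 0 = organic.
account_a = [1, 1, 0, 1]
account_b = [0, 0, 1, 0]

print(troll_score(account_a), is_troll(account_a))
print(troll_score(account_b), is_troll(account_b))
```

Scoring at the account level rather than the sequence level means a single misclassified sequence does not decide the outcome on its own; the threshold then trades false positives against false negatives.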

EPJ Data Science 12(1): 46 (2023). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10562499/pdf/
Citations: 7
Forecasting patient flows with pandemic induced concept drift using explainable machine learning.
IF 3.6; Zone 2 (Computer Science); Q1 (Mathematics, Interdisciplinary Applications); Pub Date: 2023-01-01; DOI: 10.1140/epjds/s13688-023-00387-5
Teo Susnjak, Paula Maddigan

Accurately forecasting patient arrivals at Urgent Care Clinics (UCCs) and Emergency Departments (EDs) is important for effective resourcing and patient care. However, correctly estimating patient flows is not straightforward since it depends on many drivers. The predictability of patient arrivals has recently been further complicated by the COVID-19 pandemic conditions and the resulting lockdowns. This study investigates how a suite of novel quasi-real-time variables like Google search terms, pedestrian traffic, the prevailing incidence levels of influenza, as well as the COVID-19 Alert Level indicators can both generally improve the forecasting models of patient flows and effectively adapt the models to the unfolding disruptions of pandemic conditions. This research also uniquely contributes to the body of work in this domain by employing tools from the eXplainable AI field to investigate more deeply the internal mechanics of the models than has previously been done. The Voting ensemble-based method combining machine learning and statistical techniques was the most reliable in our experiments. Our study showed that the prevailing COVID-19 Alert Level feature together with Google search terms and pedestrian traffic were effective at producing generalisable forecasts. The implications of this study are that proxy variables can effectively augment standard autoregressive features to ensure accurate forecasting of patient flows. The experiments showed that the proposed features are potentially effective model inputs for preserving forecast accuracies in the event of future pandemic outbreaks.
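
The idea behind a voting ensemble for forecasting can be illustrated with a minimal sketch: several base forecasters each predict the next value, and the ensemble averages their predictions. The three base forecasters below (last value, moving average, seasonal naive) are simple stand-ins chosen for illustration, not the machine learning and statistical models used in the study, and the arrival counts are made up.

```python
from statistics import mean
from typing import Callable, List, Sequence

# A forecaster maps a history of daily arrivals to a one-step-ahead prediction.
Forecaster = Callable[[Sequence[float]], float]


def last_value(history: Sequence[float]) -> float:
    """Predict that tomorrow equals today."""
    return history[-1]


def moving_average(history: Sequence[float], window: int = 3) -> float:
    """Predict the mean of the most recent `window` observations."""
    return mean(history[-window:])


def seasonal_naive(history: Sequence[float], period: int = 7) -> float:
    """Predict the value observed `period` steps before the forecast day."""
    return history[-period]


def voting_forecast(history: Sequence[float], forecasters: List[Forecaster]) -> float:
    """Average the one-step-ahead predictions of all base forecasters."""
    return mean([forecast(history) for forecast in forecasters])


# Hypothetical daily patient arrivals (illustrative numbers only).
daily_arrivals = [100, 120, 110, 130, 125, 90, 80, 105, 118, 112]

ensemble = [last_value, moving_average, seasonal_naive]
print(round(voting_forecast(daily_arrivals, ensemble), 2))
```

Averaging tends to dampen the errors of any single model, which is one reason ensembles can stay reliable when conditions shift; extra inputs such as search terms or alert levels would enter through the base models.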

EPJ Data Science 12(1): 11 (2023). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10119825/pdf/
Citations: 0
Identifying latent activity behaviors and lifestyles using mobility data to describe urban dynamics.
IF 3.6; Zone 2 (Computer Science); Q1 (Mathematics, Interdisciplinary Applications); Pub Date: 2023-01-01; Epub Date: 2023-05-18; DOI: 10.1140/epjds/s13688-023-00390-w
Yanni Yang, Alex Pentland, Esteban Moro

Urbanization and its problems require an in-depth and comprehensive understanding of urban dynamics, especially the complex and diversified lifestyles in modern cities. Digitally acquired data can accurately capture complex human activity, but it lacks the interpretability of demographic data. In this paper, we study a privacy-enhanced dataset of the mobility visitation patterns of 1.2 million people to 1.1 million places in 11 metro areas in the U.S. to detect the latent mobility behaviors and lifestyles in the largest American cities. Despite the considerable complexity of mobility visitations, we found that lifestyles can be automatically decomposed into only 12 latent interpretable activity behaviors on how people combine shopping, eating, working, or using their free time. Rather than describing individuals with a single lifestyle, we find that city dwellers' behavior is a mixture of those behaviors. Those detected latent activity behaviors are equally present across cities and cannot be fully explained by main demographic features. Finally, we find those latent behaviors are associated with dynamics like experienced income segregation, transportation, or healthy behaviors in cities, even after controlling for demographic features. Our results signal the importance of complementing traditional census data with activity behaviors to understand urban dynamics.

Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00390-w.

EPJ Data Science 12(1): 15 (2023). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10193357/pdf/
Citations: 5
How does Twitter account moderation work? Dynamics of account creation and suspension on Twitter during major geopolitical events.
IF 3.6; Zone 2 (Computer Science); Q1 (Mathematics, Interdisciplinary Applications); Pub Date: 2023-01-01; Epub Date: 2023-10-04; DOI: 10.1140/epjds/s13688-023-00420-7
Francesco Pierri, Luca Luceri, Emily Chen, Emilio Ferrara

Social media moderation policies are often at the center of public debate, and their implementation and enactment are sometimes surrounded by a veil of mystery. Unsurprisingly, due to limited platform transparency and data access, relatively little research has been devoted to characterizing moderation dynamics, especially in the context of controversial events and the platform activity associated with them. Here, we study the dynamics of account creation and suspension on Twitter during two global political events: Russia's invasion of Ukraine and the 2022 French Presidential election. Leveraging a large-scale dataset of 270M tweets shared by 16M users in multiple languages over several months, we identify peaks of suspicious account creation and suspension, and we characterize behaviors that more frequently lead to account suspension. We show how large numbers of accounts get suspended within days of their creation. Suspended accounts tend to mostly interact with legitimate users, as opposed to other suspicious accounts, making unwarranted and excessive use of reply and mention features, and sharing large amounts of spam and harmful content. While we are only able to speculate about the specific causes leading to a given account suspension, our findings contribute to shedding light on patterns of platform abuse and subsequent moderation during major events.

EPJ Data Science 12(1): 43 (2023). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10550859/pdf/
Citations: 14
Using word embeddings to analyse audience effects and individual differences in parenting Subreddits.
IF 3.6; Zone 2 (Computer Science); Q1 (Mathematics, Interdisciplinary Applications); Pub Date: 2023-01-01; Epub Date: 2023-09-20; DOI: 10.1140/epjds/s13688-023-00412-7
Melody Sepahpour-Fard, Michael Quayle, Maria Schuld, Taha Yasseri

This paper explores how individuals' language use in gender-specific groups ("mothers" and "fathers") compares to their interactions when referred to as "parents." Language adaptation based on the audience is well-documented, yet large-scale studies of naturally-occurring audience effects are rare. To address this, we investigate audience and gender effects in the context of parenting, where gender plays a significant role. We focus on interactions within Reddit, particularly in the parenting Subreddits r/Daddit, r/Mommit, and r/Parenting, which cater to distinct audiences. By analyzing user posts using word embeddings, we measure similarities between user-tokens and word-tokens, also considering differences between high and low self-monitors. Results reveal that in mixed-gender contexts, mothers and fathers exhibit similar behavior in discussing a wide range of topics, while fathers place more emphasis on educational and family advice. Single-gender Subreddits see more focused discussions. Mothers in r/Mommit stand out by discussing medical care, sleep, potty training, and food. In terms of individual differences, we found that, especially on r/Parenting, high self-monitors tend to conform more to the norms of the Subreddit by discussing more of the topics associated with the Subreddit.
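
Measuring similarity between a user-token and word-tokens in an embedding space can be sketched as below. The abstract does not name the similarity measure, so cosine similarity is assumed here as the usual choice for word embeddings. The three-dimensional vectors are hypothetical toy values; real embeddings would be learned from the posts and have many more dimensions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: one user-token and three word-tokens.
user_vec: List[float] = [1.0, 0.0, 1.0]
word_vecs: Dict[str, List[float]] = {
    "sleep": [1.0, 0.0, 1.0],
    "school": [0.0, 1.0, 0.0],
    "food": [1.0, 1.0, 0.0],
}

# Rank words from most to least similar to the user.
ranked = sorted(
    word_vecs,
    key=lambda word: cosine_similarity(user_vec, word_vecs[word]),
    reverse=True,
)
print(ranked)
```

Comparing such rankings for the same user across Subreddits is one way an audience effect could show up: the words closest to a user-token would shift with the community being addressed.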

EPJ Data Science 12(1): 38 (2023). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10511593/pdf/
Citations: 0
Mental health concerns precede quits: shifts in the work discourse during the Covid-19 pandemic and great resignation.
IF 3.6; Zone 2 (Computer Science); Q1 (Mathematics, Interdisciplinary Applications); Pub Date: 2023-01-01; Epub Date: 2023-10-12; DOI: 10.1140/epjds/s13688-023-00417-2
R Maria Del Rio-Chanona, Alejandro Hermida-Carrillo, Melody Sepahpour-Fard, Luning Sun, Renata Topinkova, Ljubica Nedelkoska

To study the causes of the 2021 Great Resignation, we use text analysis and investigate the changes in work- and quit-related posts between 2018 and 2021 on Reddit. We find that the Reddit discourse evolution resembles the dynamics of the U.S. quit and layoff rates. Furthermore, when the COVID-19 pandemic started, conversations related to working from home, switching jobs, work-related distress, and mental health increased, while discussions on commuting or moving for a job decreased. We distinguish between general work-related and specific quit-related discourse changes using a difference-in-differences method. Our main finding is that mental health and work-related distress topics disproportionately increased among quit-related posts since the onset of the pandemic, likely contributing to the quits of the Great Resignation. Along with better labor market conditions, some relief came from early to mid-2021, when these concerns decreased. Our study underscores the importance of having access to data from online forums, such as Reddit, to study emerging economic phenomena in real time, providing a valuable supplement to traditional labor market surveys and administrative data.
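
The logic of a difference-in-differences comparison can be shown with the textbook two-group, two-period version. Here the quit-related posts play the role of the treated group and general work-related posts the control group. The topic shares below are hypothetical numbers chosen for illustration, and the study's actual specification (for example, a regression with additional controls) may differ from this minimal form.

```python
def diff_in_diff(
    treated_pre: float,
    treated_post: float,
    control_pre: float,
    control_post: float,
) -> float:
    """Change in the treated group minus the change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical share of posts (percent) that mention mental health.
quit_pre, quit_post = 10.0, 16.0   # quit-related posts, before / after pandemic onset
work_pre, work_post = 8.0, 10.0    # general work-related posts, before / after

effect = diff_in_diff(quit_pre, quit_post, work_pre, work_post)
print(f"difference-in-differences estimate: {effect:.1f} percentage points")
```

Subtracting the control-group change removes the general rise in mental health discussion that affected all work-related posts, leaving the part specific to quit-related posts. The estimate is only meaningful under the usual assumption that both groups would have followed parallel trends without the pandemic.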

Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00417-2.

为了研究2021年大辞职的原因,我们使用文本分析,调查了2018年至2021年间Reddit上与工作和辞职相关的帖子的变化。我们发现,Reddit的话语演变类似于美国辞职率和裁员率的动态。此外,当新冠肺炎大流行开始时,与在家工作、换工作、与工作有关的痛苦和心理健康有关的对话增加了,而关于通勤或搬家工作的讨论减少了。我们使用差异中的差异方法来区分与工作相关的一般话语变化和与辞职相关的特定话语变化。我们的主要发现是,自疫情爆发以来,心理健康和与工作相关的痛苦话题在辞职相关的职位中不成比例地增加,这可能是大辞职的原因之一。随着劳动力市场状况的改善,从2021年年中开始,这些担忧有所缓解。我们的研究强调了访问Reddit等在线论坛的数据以实时研究新兴经济现象的重要性,为传统的劳动力市场调查和行政数据提供了宝贵的补充。补充信息:在线版本包含补充材料,可访问10.1140/epjds/s1368-023-00417-2。
Citations: 0
Has Covid-19 permanently changed online purchasing behavior?
IF 3 Zone 2 Computer Science Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2023-01-01 Epub Date: 2023-01-16 DOI: 10.1140/epjds/s13688-022-00375-1
Hiroyasu Inoue, Yasuyuki Todo

This study examines how the COVID-19 pandemic has affected online purchasing behavior using data from a major online shopping platform in Japan. We focus on the effect of two measures of the pandemic, i.e., the number of positive COVID-19 cases and state declarations of emergency to mitigate the pandemic. We find that both measures promoted online purchases at the beginning of the pandemic, but in later periods, their effect faded. In addition, online purchases returned to normal after states of emergency ended, and the overall time trend in online purchases excluding the effects of the two measures was stable during the first two years of the pandemic. These results suggest that the effect of the pandemic on online purchasing behavior is temporary and will not persist after the pandemic.

Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-022-00375-1.

Citations: 0
Bayesian inference of transition matrices from incomplete graph data with a topological prior.
IF 3.6 Zone 2 Computer Science Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2023-01-01 Epub Date: 2023-10-11 DOI: 10.1140/epjds/s13688-023-00416-3
Vincenzo Perri, Luka V Petrović, Ingo Scholtes

Many network analysis and graph learning techniques are based on discrete- or continuous-time models of random walks. To apply these methods, it is necessary to infer transition matrices that formalize the underlying stochastic process in an observed graph. For weighted graphs, where weighted edges capture observations of repeated interactions between nodes, it is common to estimate the entries of such transition matrices based on the (relative) weights of edges. However, in real-world settings we are often confronted with incomplete data, which turns the construction of the transition matrix based on a weighted graph into an inference problem. Moreover, we often have access to additional information that captures topological constraints of the system, i.e. which edges in a weighted graph are (theoretically) possible and which are not. Examples include transportation networks, where we may have access to a small sample of passenger trajectories as well as the physical topology of connections, or a limited set of observed social interactions with additional information on the underlying social structure. Combining these two different sources of information to reliably infer transition matrices from incomplete data on repeated interactions is an important open challenge, with severe implications for the reliability of downstream network analysis tasks. Addressing this issue, we show that including knowledge on such topological constraints can considerably improve the inference of transition matrices, especially in situations where we only have a small number of observed interactions. To this end, we derive an analytically tractable Bayesian method that uses repeated interactions and a topological prior to perform data-efficient inference of transition matrices. We compare our approach against commonly used frequentist and Bayesian approaches both in synthetic data and in five real-world datasets, and we find that our method recovers the transition probabilities with higher accuracy. Furthermore, we demonstrate that the method is robust even in cases when the knowledge of the topological constraint is partial. Lastly, we show that this higher accuracy improves the results for downstream network analysis tasks like cluster detection and node ranking, which highlights the practical relevance of our method for interdisciplinary data-driven analyses of networked systems.
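
To make the idea of a topological prior concrete, here is a minimal sketch that assumes a standard Dirichlet-multinomial model. It is not the authors' method, whose details the abstract does not give. Each row of the transition matrix receives a symmetric Dirichlet prior placed only on the edges that the topology permits, and the estimate is the posterior mean. The function name `posterior_mean_transitions`, the parameter `alpha`, and the toy data are all hypothetical.

```python
def posterior_mean_transitions(counts, allowed, alpha=1.0):
    """Posterior-mean transition matrix under a Dirichlet prior on allowed edges.

    counts[i][j]  : number of observed transitions from node i to node j
    allowed[i][j] : True if the edge i -> j is topologically possible
    alpha         : pseudo-count placed on every allowed edge (assumed symmetric)

    Edges marked as not allowed get probability zero, and any counts
    recorded on them are ignored in this sketch.
    """
    n = len(counts)
    matrix = []
    for i in range(n):
        weights = [
            (counts[i][j] + alpha) if allowed[i][j] else 0.0
            for j in range(n)
        ]
        total = sum(weights)
        # A node with no allowed outgoing edge gets an all-zero row.
        matrix.append([w / total if total > 0 else 0.0 for w in weights])
    return matrix


# Toy example: three nodes, self-loops forbidden, very few observations.
counts = [[0, 3, 0],
          [1, 0, 0],
          [0, 0, 0]]
allowed = [[False, True, True],
           [True, False, True],
           [True, True, False]]

P = posterior_mean_transitions(counts, allowed)
for row in P:
    print([round(p, 3) for p in row])
```

In this sketch, node 2 has no observed transitions, so its row falls back to a uniform distribution over its two allowed edges rather than being undefined, as it would be under a purely frequency-based estimate. This illustrates why prior knowledge of the topology matters most when few interactions have been observed.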

Citations: 0
Do poverty and wealth look the same the world over? A comparative study of 12 cities from five high-income countries using street images.
IF 3 Zone 2 Computer Science Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2023-01-01 Epub Date: 2023-06-07 DOI: 10.1140/epjds/s13688-023-00394-6
Esra Suel, Emily Muller, James E Bennett, Tony Blakely, Yvonne Doyle, John Lynch, Joreintje D Mackenbach, Ariane Middel, Anja Mizdrak, Ricky Nathvani, Michael Brauer, Majid Ezzati

Urbanization and inequalities are two of the major policy themes of our time, intersecting in large cities where social and economic inequalities are particularly pronounced. Large scale street-level images are a source of city-wide visual information and allow for comparative analyses of multiple cities. Computer vision methods based on deep learning applied to street images have been shown to successfully measure inequalities in socioeconomic and environmental features, yet existing work has been within specific geographies and has not looked at how visual environments compare across different cities and countries. In this study, we aim to apply existing methods to understand whether, and to what extent, poor and wealthy groups live in visually similar neighborhoods across cities and countries. We present novel insights on similarity of neighborhoods using street-level images and deep learning methods. We analyzed 7.2 million images from 12 cities in five high-income countries, home to more than 85 million people: Auckland (New Zealand), Sydney (Australia), Toronto and Vancouver (Canada), Atlanta, Boston, Chicago, Los Angeles, New York, San Francisco, and Washington D.C. (United States of America), and London (United Kingdom). Visual features associated with neighborhood disadvantage are more distinct and unique to each city than those associated with affluence. For example, from what is visible from street images, high density poor neighborhoods located near the city center (e.g., in London) are visually distinct from poor suburban neighborhoods characterized by lower density and lower accessibility (e.g., in Atlanta). This suggests that differences between two cities are also driven by historical factors, policies, and local geography. Our results also have implications for image-based measures of inequality in cities, especially when trained on data from cities that are visually distinct from target cities. We showed that these are more prone to errors for disadvantaged areas, especially when transferring across cities, suggesting more attention needs to be paid to improving methods for capturing heterogeneity in poor environments across cities around the world.

Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00394-6.

Citations: 0