EPJ Data Science: latest articles, page 10

LEIA: Linguistic Embeddings for the Identification of Affect.

IF 3 | Zone 2 (Computer Science) | Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science

Pub Date : 2023-01-01 Epub Date: 2023-11-16 DOI: 10.1140/epjds/s13688-023-00427-0

Segun Taofeek Aroyehun, Lukas Malik, Hannah Metzler, Nikolas Haimerl, Anna Di Natale, David Garcia

The wealth of text data generated by social media has enabled new kinds of analysis of emotions with language models. These models are often trained on small and costly datasets of text annotations produced by readers who guess the emotions expressed by others in social media posts. This affects the quality of emotion identification methods due to training data size limitations and noise in the production of labels used in model development. We present LEIA, a model for emotion identification in text that has been trained on a dataset of more than 6 million posts with self-annotated emotion labels for happiness, affection, sadness, anger, and fear. LEIA is based on a word masking method that enhances the learning of emotion words during model pre-training. LEIA achieves macro-F1 values of approximately 73 on three in-domain test datasets, outperforming other supervised and unsupervised methods in a strong benchmark that shows that LEIA generalizes across posts, users, and time periods. We further perform an out-of-domain evaluation on five different datasets of social media and other sources, showing LEIA's robust performance across media, data collection methods, and annotation schemes. Our results show that LEIA generalizes its classification of anger, happiness, and sadness beyond the domain it was trained on. LEIA can be applied in future research to provide better identification of emotions in text from the perspective of the writer.
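
As a point of reference for the macro-F1 figures quoted in the abstract, the metric is the unweighted mean of the per-class F1 scores. The sketch below is a plain-Python illustration, not the authors' evaluation code: the function name `macro_f1`, the ordering of the `EMOTIONS` list, and the 0-100 scaling (chosen to match the "approximately 73" reporting style) are assumptions made here.

```python
from collections import Counter

# The five self-annotated emotion labels named in the abstract.
EMOTIONS = ["happiness", "affection", "sadness", "anger", "fear"]


def macro_f1(y_true, y_pred, labels=EMOTIONS):
    """Unweighted mean of per-class F1 scores, on a 0-100 scale."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A class with no true or predicted instances is scored 0 here.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return 100.0 * sum(scores) / len(scores)


# Toy labels, invented purely for illustration.
toy_true = ["happiness", "sadness", "anger", "fear", "affection", "anger"]
toy_pred = ["happiness", "sadness", "anger", "anger", "affection", "sadness"]
toy_score = macro_f1(toy_true, toy_pred)
```

Because every class contributes equally, a rare class such as fear pulls the score down as much as a frequent one, which is why macro-F1 is a common choice when label frequencies are imbalanced.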



Citations: 0

The shock, the coping, the resilience: smartphone application use reveals Covid-19 lockdown effects on human behaviors.

IF 3.6 | Zone 2 (Computer Science) | Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science

Pub Date : 2023-01-01 DOI: 10.1140/epjds/s13688-023-00391-9

Xiao Fan Liu, Zhen-Zhen Wang, Xiao-Ke Xu, Ye Wu, Zhidan Zhao, Huarong Deng, Ping Wang, Naipeng Chao, Yi-Hui C Huang

Human mobility restriction policies have been widely used to contain the coronavirus disease-19 (COVID-19). However, a critical question is how these policies affect individuals' behavioral and psychological well-being during and after confinement periods. Here, we analyze China's five most stringent city-level lockdowns in 2021, treating them as natural experiments that allow for examining behavioral changes in millions of people through smartphone application use. We made three fundamental observations. First, the use of physical and economic activity-related apps experienced a steep decline, yet apps that provide daily necessities maintained normal usage. Second, apps that fulfilled lower-level human needs, such as working, socializing, information seeking, and entertainment, saw an immediate and substantial increase in screen time. Those that satisfied higher-level needs, such as education, only attracted delayed attention. Third, human behaviors demonstrated resilience as most routines resumed after the lockdowns were lifted. Nonetheless, long-term lifestyle changes were observed, as significant numbers of people chose to continue working and learning online, becoming "digital residents." This study also demonstrates the capability of smartphone screen time analytics in the study of human behaviors.

Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00391-9.



Citations: 0

Digital traces of brain drain: developers during the Russian invasion of Ukraine.

IF 3.6 | Zone 2 (Computer Science) | Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science

Pub Date : 2023-01-01 DOI: 10.1140/epjds/s13688-023-00389-3

Johannes Wachs

The Russian invasion of Ukraine has caused large-scale destruction, significant loss of life, and the displacement of millions of people. Besides those fleeing direct conflict in Ukraine, many individuals in Russia are also thought to have moved to third countries. In particular, the exodus of skilled human capital out of Russia, sometimes called brain drain, may have a significant effect on the course of the war and the Russian economy in the long run. Yet quantifying brain drain, especially during crisis situations, is generally difficult. This hinders our ability to understand its drivers and to anticipate its consequences. To address this gap, I draw on and extend a large-scale dataset of the locations of highly active software developers collected in February 2021, one year before the invasion. Revisiting those developers that had been located in Russia in 2021, I confirm an ongoing exodus of developers from Russia in snapshots taken in June and November 2022. By November, 11.1% of Russian developers list a new country, compared with 2.8% of developers from comparable countries in the region but not directly involved in the conflict. 13.2% of Russian developers have obscured their location (vs. 2.4% in the comparison set). Developers leaving Russia were significantly more active and central in the collaboration network than those who remain. This suggests that many of the most important developers have already left Russia. In some receiving countries the number of arrivals is significant: I estimate an increase in the number of local software developers of 42% in Armenia, 60% in Cyprus, and 94% in Georgia.

Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00389-3.



Citations: 1

Exposing influence campaigns in the age of LLMs: a behavioral-based AI approach to detecting state-sponsored trolls.

IF 3.6 | Zone 2 (Computer Science) | Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science

Pub Date : 2023-01-01 Epub Date: 2023-10-09 DOI: 10.1140/epjds/s13688-023-00423-4

Fatima Ezzeddine, Omran Ayoub, Silvia Giordano, Gianluca Nogara, Ihab Sbeity, Emilio Ferrara, Luca Luceri

The detection of state-sponsored trolls operating in influence campaigns on social media is a critical and unsolved challenge for the research community, which has significant implications beyond the online realm. To address this challenge, we propose a new AI-based solution that identifies troll accounts solely through behavioral cues associated with their sequences of sharing activity, encompassing both their actions and the feedback they receive from others. Our approach does not incorporate any textual content shared and consists of two steps: First, we leverage an LSTM-based classifier to determine whether account sequences belong to a state-sponsored troll or an organic, legitimate user. Second, we employ the classified sequences to calculate a metric named the "Troll Score", quantifying the degree to which an account exhibits troll-like behavior. To assess the effectiveness of our method, we examine its performance in the context of the 2016 Russian interference campaign during the U.S. Presidential election. Our experiments yield compelling results, demonstrating that our approach can identify account sequences with an AUC close to 99% and accurately differentiate between Russian trolls and organic users with an AUC of 91%. Notably, our behavioral-based approach holds a significant advantage in the ever-evolving landscape, where textual and linguistic properties can be easily mimicked by Large Language Models (LLMs): In contrast to existing language-based techniques, it relies on more challenging-to-replicate behavioral cues, ensuring greater resilience in identifying influence campaigns, especially given the potential increase in the usage of LLMs for generating inauthentic content. Finally, we assessed the generalizability of our solution to various entities driving different information operations and found promising results that will guide future research.
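
The second step described in the abstract, turning per-sequence classifications into an account-level score, can be sketched as follows. The abstract does not give the formula, so this sketch assumes the score is simply the fraction of an account's sequences that the classifier labels as troll-like; the function names `troll_score` and `flag_accounts`, the 0.5 threshold, and the toy data are all hypothetical, and the paper's actual definition may differ.

```python
def troll_score(sequence_predictions):
    """Fraction of an account's sequences flagged as troll-like.

    Assumed definition for illustration only; each element is True when
    the (separately trained) sequence classifier labels that sequence as
    belonging to a troll.
    """
    if not sequence_predictions:
        raise ValueError("account has no classified sequences")
    flagged = sum(1 for is_troll in sequence_predictions if is_troll)
    return flagged / len(sequence_predictions)


def flag_accounts(predictions_by_account, threshold=0.5):
    """Return, in sorted order, the accounts whose score meets the threshold."""
    return sorted(
        account
        for account, predictions in predictions_by_account.items()
        if troll_score(predictions) >= threshold
    )


# Toy classifier outputs, invented for illustration.
toy_predictions = {
    "account_a": [True, False, False, False],
    "account_b": [True, True, False, True],
}
flagged = flag_accounts(toy_predictions)
```

Aggregating over many sequences means a single misclassified sequence does not decide an account's label, which fits the abstract's report of separate AUC values at the sequence level and the account level.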



Citations: 7

Forecasting patient flows with pandemic induced concept drift using explainable machine learning.

IF 3.6 | Zone 2 (Computer Science) | Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science

Pub Date : 2023-01-01 DOI: 10.1140/epjds/s13688-023-00387-5

Teo Susnjak, Paula Maddigan

Accurately forecasting patient arrivals at Urgent Care Clinics (UCCs) and Emergency Departments (EDs) is important for effective resourcing and patient care. However, correctly estimating patient flows is not straightforward since it depends on many drivers. The predictability of patient arrivals has recently been further complicated by the COVID-19 pandemic conditions and the resulting lockdowns. This study investigates how a suite of novel quasi-real-time variables like Google search terms, pedestrian traffic, the prevailing incidence levels of influenza, as well as the COVID-19 Alert Level indicators can both generally improve the forecasting models of patient flows and effectively adapt the models to the unfolding disruptions of pandemic conditions. This research also uniquely contributes to the body of work in this domain by employing tools from the eXplainable AI field to investigate more deeply the internal mechanics of the models than has previously been done. The Voting ensemble-based method combining machine learning and statistical techniques was the most reliable in our experiments. Our study showed that the prevailing COVID-19 Alert Level feature together with Google search terms and pedestrian traffic were effective at producing generalisable forecasts. The implications of this study are that proxy variables can effectively augment standard autoregressive features to ensure accurate forecasting of patient flows. The experiments showed that the proposed features are potentially effective model inputs for preserving forecast accuracies in the event of future pandemic outbreaks.
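
The idea behind a voting ensemble for forecasting is to average the predictions of several different models so that no single model's errors dominate. The sketch below illustrates only that averaging step in plain Python. The two component forecasters (`seasonal_naive` and `moving_average`) are simple stand-ins chosen here for brevity, not the machine learning and statistical models used in the study, and the daily arrival counts are invented.

```python
from statistics import mean


def seasonal_naive(history, season=7):
    """Predict the value observed one season (here, one week) earlier."""
    return history[-season]


def moving_average(history, window=3):
    """Predict the mean of the most recent `window` observations."""
    return mean(history[-window:])


def voting_forecast(history, forecasters):
    """Unweighted average of the one-step-ahead forecasts of each model."""
    return mean(forecast(history) for forecast in forecasters)


# Invented daily patient-arrival counts, oldest first.
toy_arrivals = [10, 12, 11, 13, 15, 14, 9, 11, 13, 12]
toy_prediction = voting_forecast(toy_arrivals, [seasonal_naive, moving_average])
```

In a fuller version, exogenous inputs such as the search-term, pedestrian-traffic, and alert-level variables mentioned in the abstract would enter through the component models rather than through the averaging step.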



Citations: 0

Identifying latent activity behaviors and lifestyles using mobility data to describe urban dynamics.

IF 3.6 | Zone 2 (Computer Science) | Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science

Pub Date : 2023-01-01 Epub Date: 2023-05-18 DOI: 10.1140/epjds/s13688-023-00390-w

Yanni Yang, Alex Pentland, Esteban Moro

Urbanization and its problems require an in-depth and comprehensive understanding of urban dynamics, especially the complex and diversified lifestyles in modern cities. Digitally acquired data can accurately capture complex human activity, but it lacks the interpretability of demographic data. In this paper, we study a privacy-enhanced dataset of the mobility visitation patterns of 1.2 million people to 1.1 million places in 11 metro areas in the U.S. to detect the latent mobility behaviors and lifestyles in the largest American cities. Despite the considerable complexity of mobility visitations, we found that lifestyles can be automatically decomposed into only 12 latent interpretable activity behaviors on how people combine shopping, eating, working, or using their free time. Rather than describing individuals with a single lifestyle, we find that city dwellers' behavior is a mixture of those behaviors. Those detected latent activity behaviors are equally present across cities and cannot be fully explained by main demographic features. Finally, we find those latent behaviors are associated with dynamics like experienced income segregation, transportation, or healthy behaviors in cities, even after controlling for demographic features. Our results signal the importance of complementing traditional census data with activity behaviors to understand urban dynamics.

Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00390-w.



Citations: 5

How does Twitter account moderation work? Dynamics of account creation and suspension on Twitter during major geopolitical events.

IF 3.6 | Zone 2 (Computer Science) | Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science

Pub Date : 2023-01-01 Epub Date: 2023-10-04 DOI: 10.1140/epjds/s13688-023-00420-7

Francesco Pierri, Luca Luceri, Emily Chen, Emilio Ferrara

Social media moderation policies are often at the center of public debate, and their implementation and enactment are sometimes surrounded by a veil of mystery. Unsurprisingly, due to limited platform transparency and data access, relatively little research has been devoted to characterizing moderation dynamics, especially in the context of controversial events and the platform activity associated with them. Here, we study the dynamics of account creation and suspension on Twitter during two global political events: Russia's invasion of Ukraine and the 2022 French Presidential election. Leveraging a large-scale dataset of 270M tweets shared by 16M users in multiple languages over several months, we identify peaks of suspicious account creation and suspension, and we characterize behaviors that more frequently lead to account suspension. We show how large numbers of accounts get suspended within days of their creation. Suspended accounts tend to mostly interact with legitimate users, as opposed to other suspicious accounts, making unwarranted and excessive use of reply and mention features, and sharing large amounts of spam and harmful content. While we are only able to speculate about the specific causes leading to a given account suspension, our findings contribute to shedding light on patterns of platform abuse and subsequent moderation during major events.


Citations: 14

Using word embeddings to analyse audience effects and individual differences in parenting Subreddits.

IF 3.6 Zone 2 Computer Science Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science

Pub Date : 2023-01-01 Epub Date: 2023-09-20 DOI: 10.1140/epjds/s13688-023-00412-7

Melody Sepahpour-Fard, Michael Quayle, Maria Schuld, Taha Yasseri

This paper explores how individuals' language use in gender-specific groups ("mothers" and "fathers") compares to their interactions when referred to as "parents." Language adaptation based on the audience is well-documented, yet large-scale studies of naturally-occurring audience effects are rare. To address this, we investigate audience and gender effects in the context of parenting, where gender plays a significant role. We focus on interactions within Reddit, particularly in the parenting Subreddits r/Daddit, r/Mommit, and r/Parenting, which cater to distinct audiences. By analyzing user posts using word embeddings, we measure similarities between user-tokens and word-tokens, also considering differences among high and low self-monitors. Results reveal that in mixed-gender contexts, mothers and fathers exhibit similar behavior in discussing a wide range of topics, while fathers emphasize more on educational and family advice. Single-gender Subreddits see more focused discussions. Mothers in r/Mommit discuss medical care, sleep, potty training, and food, distinguishing themselves. In terms of individual differences, we found that, especially on r/Parenting, high self-monitors tend to conform more to the norms of the Subreddit by discussing more of the topics associated with the Subreddit.

Citations: 0

Mental health concerns precede quits: shifts in the work discourse during the Covid-19 pandemic and great resignation.

IF 3.6 Zone 2 Computer Science Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science

Pub Date : 2023-01-01 Epub Date: 2023-10-12 DOI: 10.1140/epjds/s13688-023-00417-2

R Maria Del Rio-Chanona, Alejandro Hermida-Carrillo, Melody Sepahpour-Fard, Luning Sun, Renata Topinkova, Ljubica Nedelkoska

To study the causes of the 2021 Great Resignation, we use text analysis and investigate the changes in work- and quit-related posts between 2018 and 2021 on Reddit. We find that the Reddit discourse evolution resembles the dynamics of the U.S. quit and layoff rates. Furthermore, when the COVID-19 pandemic started, conversations related to working from home, switching jobs, work-related distress, and mental health increased, while discussions on commuting or moving for a job decreased. We distinguish between general work-related and specific quit-related discourse changes using a difference-in-differences method. Our main finding is that mental health and work-related distress topics disproportionally increased among quit-related posts since the onset of the pandemic, likely contributing to the quits of the Great Resignation. Along with better labor market conditions, some relief came beginning-to-mid-2021 when these concerns decreased. Our study underscores the importance of having access to data from online forums, such as Reddit, to study emerging economic phenomena in real time, providing a valuable supplement to traditional labor market surveys and administrative data.

Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00417-2.

Citations: 0

Has Covid-19 permanently changed online purchasing behavior?

IF 3 Zone 2 Computer Science Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science

Pub Date : 2023-01-01 Epub Date: 2023-01-16 DOI: 10.1140/epjds/s13688-022-00375-1

Hiroyasu Inoue, Yasuyuki Todo

This study examines how the COVID-19 pandemic has affected online purchasing behavior using data from a major online shopping platform in Japan. We focus on the effect of two measures of the pandemic, i.e., the number of positive COVID-19 cases and state declarations of emergency to mitigate the pandemic. We find that both measures promoted online purchases at the beginning of the pandemic, but in later periods, their effect faded. In addition, online purchases returned to normal after states of emergency ended, and the overall time trend in online purchases excluding the effects of the two measures was stable during the first two years of the pandemic. These results suggest that the effect of the pandemic on online purchasing behavior is temporary and will not persist after the pandemic.

Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-022-00375-1.

Citations: 0