Pub Date: 2023-01-01; DOI: 10.1140/epjds/s13688-023-00389-3
Johannes Wachs
The Russian invasion of Ukraine has caused large-scale destruction, significant loss of life, and the displacement of millions of people. Besides those fleeing direct conflict in Ukraine, many individuals in Russia are also thought to have moved to third countries. In particular, the exodus of skilled human capital, sometimes called brain drain, out of Russia may have a significant effect on the course of the war and the Russian economy in the long run. Yet quantifying brain drain, especially during crisis situations, is generally difficult. This hinders our ability to understand its drivers and to anticipate its consequences. To address this gap, I draw on and extend a large-scale dataset of the locations of highly active software developers collected in February 2021, one year before the invasion. Revisiting those developers who had been located in Russia in 2021, I confirm an ongoing exodus of developers from Russia in snapshots taken in June and November 2022. By November, 11.1% of Russian developers list a new country, compared with 2.8% of developers from comparable countries in the region that are not directly involved in the conflict. In addition, 13.2% of Russian developers have obscured their location (vs. 2.4% in the comparison set). Developers leaving Russia were significantly more active and central in the collaboration network than those who remain. This suggests that many of the most important developers have already left Russia. In some receiving countries the number of arrivals is significant: I estimate an increase in the number of local software developers of 42% in Armenia, 60% in Cyprus and 94% in Georgia.
Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00389-3.
Title: Digital traces of brain drain: developers during the Russian invasion of Ukraine.
Journal: EPJ Data Science 12(1), p. 14
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10184088/pdf/
The detection of state-sponsored trolls operating in influence campaigns on social media is a critical and unsolved challenge for the research community, which has significant implications beyond the online realm. To address this challenge, we propose a new AI-based solution that identifies troll accounts solely through behavioral cues associated with their sequences of sharing activity, encompassing both their actions and the feedback they receive from others. Our approach does not incorporate any textual content shared and consists of two steps: First, we leverage an LSTM-based classifier to determine whether account sequences belong to a state-sponsored troll or an organic, legitimate user. Second, we employ the classified sequences to calculate a metric named the "Troll Score", quantifying the degree to which an account exhibits troll-like behavior. To assess the effectiveness of our method, we examine its performance in the context of the 2016 Russian interference campaign during the U.S. Presidential election. Our experiments yield compelling results, demonstrating that our approach can identify account sequences with an AUC close to 99% and accurately differentiate between Russian trolls and organic users with an AUC of 91%. Notably, our behavioral-based approach holds a significant advantage in the ever-evolving landscape, where textual and linguistic properties can be easily mimicked by Large Language Models (LLMs): In contrast to existing language-based techniques, it relies on more challenging-to-replicate behavioral cues, ensuring greater resilience in identifying influence campaigns, especially given the potential increase in the usage of LLMs for generating inauthentic content. Finally, we assessed the generalizability of our solution to various entities driving different information operations and found promising results that will guide future research.
Title: Exposing influence campaigns in the age of LLMs: a behavioral-based AI approach to detecting state-sponsored trolls.
Authors: Fatima Ezzeddine, Omran Ayoub, Silvia Giordano, Gianluca Nogara, Ihab Sbeity, Emilio Ferrara, Luca Luceri
Pub Date: 2023-01-01; DOI: 10.1140/epjds/s13688-023-00423-4
Journal: EPJ Data Science 12(1), p. 46
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10562499/pdf/
Pub Date: 2023-01-01; DOI: 10.1140/epjds/s13688-023-00387-5
Teo Susnjak, Paula Maddigan
Accurately forecasting patient arrivals at Urgent Care Clinics (UCCs) and Emergency Departments (EDs) is important for effective resourcing and patient care. However, correctly estimating patient flows is not straightforward since it depends on many drivers. The predictability of patient arrivals has recently been further complicated by the COVID-19 pandemic conditions and the resulting lockdowns. This study investigates how a suite of novel quasi-real-time variables like Google search terms, pedestrian traffic, the prevailing incidence levels of influenza, as well as the COVID-19 Alert Level indicators can both generally improve the forecasting models of patient flows and effectively adapt the models to the unfolding disruptions of pandemic conditions. This research also uniquely contributes to the body of work in this domain by employing tools from the eXplainable AI field to investigate more deeply the internal mechanics of the models than has previously been done. The Voting ensemble-based method combining machine learning and statistical techniques was the most reliable in our experiments. Our study showed that the prevailing COVID-19 Alert Level feature together with Google search terms and pedestrian traffic were effective at producing generalisable forecasts. The implications of this study are that proxy variables can effectively augment standard autoregressive features to ensure accurate forecasting of patient flows. The experiments showed that the proposed features are potentially effective model inputs for preserving forecast accuracies in the event of future pandemic outbreaks.
Title: Forecasting patient flows with pandemic induced concept drift using explainable machine learning.
Journal: EPJ Data Science 12(1), p. 11
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10119825/pdf/
Pub Date: 2023-01-01; Epub Date: 2023-05-18; DOI: 10.1140/epjds/s13688-023-00390-w
Yanni Yang, Alex Pentland, Esteban Moro
Urbanization and its problems require an in-depth and comprehensive understanding of urban dynamics, especially the complex and diversified lifestyles in modern cities. Digitally acquired data can accurately capture complex human activity, but it lacks the interpretability of demographic data. In this paper, we study a privacy-enhanced dataset of the mobility visitation patterns of 1.2 million people to 1.1 million places in 11 metro areas in the U.S. to detect the latent mobility behaviors and lifestyles in the largest American cities. Despite the considerable complexity of mobility visitations, we found that lifestyles can be automatically decomposed into only 12 latent interpretable activity behaviors on how people combine shopping, eating, working, or using their free time. Rather than describing individuals with a single lifestyle, we find that city dwellers' behavior is a mixture of those behaviors. Those detected latent activity behaviors are equally present across cities and cannot be fully explained by main demographic features. Finally, we find those latent behaviors are associated with dynamics like experienced income segregation, transportation, or healthy behaviors in cities, even after controlling for demographic features. Our results signal the importance of complementing traditional census data with activity behaviors to understand urban dynamics.
Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00390-w.
Title: Identifying latent activity behaviors and lifestyles using mobility data to describe urban dynamics.
Journal: EPJ Data Science 12(1), p. 15
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10193357/pdf/
Pub Date: 2023-01-01; Epub Date: 2023-10-04; DOI: 10.1140/epjds/s13688-023-00420-7
Francesco Pierri, Luca Luceri, Emily Chen, Emilio Ferrara
Social media moderation policies are often at the center of public debate, and their implementation and enactment are sometimes surrounded by a veil of mystery. Unsurprisingly, due to limited platform transparency and data access, relatively little research has been devoted to characterizing moderation dynamics, especially in the context of controversial events and the platform activity associated with them. Here, we study the dynamics of account creation and suspension on Twitter during two global political events: Russia's invasion of Ukraine and the 2022 French Presidential election. Leveraging a large-scale dataset of 270M tweets shared by 16M users in multiple languages over several months, we identify peaks of suspicious account creation and suspension, and we characterize behaviors that more frequently lead to account suspension. We show how large numbers of accounts get suspended within days of their creation. Suspended accounts tend to mostly interact with legitimate users, as opposed to other suspicious accounts, making unwarranted and excessive use of reply and mention features, and sharing large amounts of spam and harmful content. While we are only able to speculate about the specific causes leading to a given account suspension, our findings contribute to shedding light on patterns of platform abuse and subsequent moderation during major events.
Title: How does Twitter account moderation work? Dynamics of account creation and suspension on Twitter during major geopolitical events.
Journal: EPJ Data Science 12(1), p. 43
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10550859/pdf/
Pub Date: 2023-01-01; Epub Date: 2023-09-20; DOI: 10.1140/epjds/s13688-023-00412-7
Melody Sepahpour-Fard, Michael Quayle, Maria Schuld, Taha Yasseri
This paper explores how individuals' language use in gender-specific groups ("mothers" and "fathers") compares to their interactions when referred to as "parents." Language adaptation based on the audience is well-documented, yet large-scale studies of naturally-occurring audience effects are rare. To address this, we investigate audience and gender effects in the context of parenting, where gender plays a significant role. We focus on interactions within Reddit, particularly in the parenting Subreddits r/Daddit, r/Mommit, and r/Parenting, which cater to distinct audiences. By analyzing user posts using word embeddings, we measure similarities between user-tokens and word-tokens, also considering differences between high and low self-monitors. Results reveal that in mixed-gender contexts, mothers and fathers exhibit similar behavior in discussing a wide range of topics, while fathers place more emphasis on educational and family advice. Single-gender Subreddits see more focused discussions. Mothers in r/Mommit discuss medical care, sleep, potty training, and food, distinguishing themselves. In terms of individual differences, we found that, especially on r/Parenting, high self-monitors tend to conform more to the norms of the Subreddit by discussing more of the topics associated with the Subreddit.
Title: Using word embeddings to analyse audience effects and individual differences in parenting Subreddits.
Journal: EPJ Data Science 12(1), p. 38
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10511593/pdf/
Pub Date: 2023-01-01; Epub Date: 2023-10-12; DOI: 10.1140/epjds/s13688-023-00417-2
R Maria Del Rio-Chanona, Alejandro Hermida-Carrillo, Melody Sepahpour-Fard, Luning Sun, Renata Topinkova, Ljubica Nedelkoska
To study the causes of the 2021 Great Resignation, we use text analysis and investigate the changes in work- and quit-related posts between 2018 and 2021 on Reddit. We find that the Reddit discourse evolution resembles the dynamics of the U.S. quit and layoff rates. Furthermore, when the COVID-19 pandemic started, conversations related to working from home, switching jobs, work-related distress, and mental health increased, while discussions on commuting or moving for a job decreased. We distinguish between general work-related and specific quit-related discourse changes using a difference-in-differences method. Our main finding is that mental health and work-related distress topics disproportionately increased among quit-related posts since the onset of the pandemic, likely contributing to the quits of the Great Resignation. Along with better labor market conditions, some relief came from early to mid-2021, when these concerns decreased. Our study underscores the importance of having access to data from online forums, such as Reddit, to study emerging economic phenomena in real time, providing a valuable supplement to traditional labor market surveys and administrative data.
Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00417-2.
Title: Mental health concerns precede quits: shifts in the work discourse during the Covid-19 pandemic and great resignation.
Journal: EPJ Data Science 12(1), p. 49
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10570174/pdf/
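The difference-in-differences comparison mentioned in the abstract above can be illustrated with a minimal sketch. This is the textbook two-group, two-period estimator, not necessarily the specification the authors used. The function name `diff_in_diff` and all numbers are hypothetical toy values (shares of posts mentioning a topic), chosen only to show the arithmetic.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Textbook 2x2 difference-in-differences estimate.

    Returns the change in the treated group minus the change in the
    control group, using simple group means.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical toy data: share of posts mentioning a topic, per period.
# "Treated" stands in for quit-related posts, "control" for general
# work-related posts; none of these values come from the paper.
quit_pre, quit_post = [0.10, 0.12], [0.18, 0.20]
work_pre, work_post = [0.08, 0.10], [0.11, 0.13]

effect = diff_in_diff(quit_pre, quit_post, work_pre, work_post)
print(f"difference-in-differences estimate: {effect:.3f}")
```

In this toy setup the topic rises in both groups, and the estimate isolates the part of the rise that is specific to the quit-related posts.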
Pub Date: 2023-01-01; Epub Date: 2023-01-16; DOI: 10.1140/epjds/s13688-022-00375-1
Hiroyasu Inoue, Yasuyuki Todo
This study examines how the COVID-19 pandemic has affected online purchasing behavior using data from a major online shopping platform in Japan. We focus on the effect of two measures of the pandemic, i.e., the number of positive COVID-19 cases and state declarations of emergency to mitigate the pandemic. We find that both measures promoted online purchases at the beginning of the pandemic, but in later periods, their effect faded. In addition, online purchases returned to normal after states of emergency ended, and the overall time trend in online purchases excluding the effects of the two measures was stable during the first two years of the pandemic. These results suggest that the effect of the pandemic on online purchasing behavior is temporary and will not persist after the pandemic.
Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-022-00375-1.
Title: Has Covid-19 permanently changed online purchasing behavior?
Journal: EPJ Data Science 12(1), p. 1
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9841963/pdf/
Pub Date: 2023-01-01; Epub Date: 2023-10-11; DOI: 10.1140/epjds/s13688-023-00416-3
Vincenzo Perri, Luka V Petrović, Ingo Scholtes
Many network analysis and graph learning techniques are based on discrete- or continuous-time models of random walks. To apply these methods, it is necessary to infer transition matrices that formalize the underlying stochastic process in an observed graph. For weighted graphs, where weighted edges capture observations of repeated interactions between nodes, it is common to estimate the entries of such transition matrices based on the (relative) weights of edges. However, in real-world settings we are often confronted with incomplete data, which turns the construction of the transition matrix based on a weighted graph into an inference problem. Moreover, we often have access to additional information that captures topological constraints of the system, i.e., which edges in a weighted graph are (theoretically) possible and which are not. Examples include transportation networks, where we may have access to a small sample of passenger trajectories as well as the physical topology of connections, or a limited set of observed social interactions with additional information on the underlying social structure. Combining these two different sources of information to reliably infer transition matrices from incomplete data on repeated interactions is an important open challenge, with severe implications for the reliability of downstream network analysis tasks. Addressing this issue, we show that including knowledge on such topological constraints can considerably improve the inference of transition matrices, especially in situations where we only have a small number of observed interactions. To this end, we derive an analytically tractable Bayesian method that uses repeated interactions and a topological prior to perform data-efficient inference of transition matrices. 
We compare our approach against commonly used frequentist and Bayesian approaches both in synthetic data and in five real-world datasets, and we find that our method recovers the transition probabilities with higher accuracy. Furthermore, we demonstrate that the method is robust even in cases when the knowledge of the topological constraint is partial. Lastly, we show that this higher accuracy improves the results for downstream network analysis tasks like cluster detection and node ranking, which highlights the practical relevance of our method for interdisciplinary data-driven analyses of networked systems.
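The idea of combining observed interaction counts with a topological constraint can be illustrated with a minimal sketch. This is not the authors' method, whose details the abstract does not give. It assumes a plain row-wise Dirichlet-multinomial model in which the prior places a pseudo-count only on edges the topology allows. The function names, the `alpha` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def mle_transition_matrix(counts):
    """Frequentist baseline: row-normalize the observed edge counts."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without any observation stay all-zero (no estimate available).
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


def posterior_mean_transition_matrix(counts, allowed, alpha=1.0):
    """Posterior mean under a Dirichlet prior restricted to allowed edges.

    counts  : (n, n) array of observed interaction counts (hypothetical input)
    allowed : (n, n) boolean mask, True where the topology permits an edge
    alpha   : pseudo-count given to every allowed edge (an assumption of this sketch)
    """
    counts = np.asarray(counts, dtype=float)
    allowed = np.asarray(allowed, dtype=bool)
    if counts.ndim != 2 or counts.shape[0] != counts.shape[1] or counts.shape != allowed.shape:
        raise ValueError("counts and allowed must be square arrays of the same shape")
    if np.any(counts[~allowed] > 0):
        raise ValueError("an interaction was observed on an edge marked as impossible")
    # Forbidden edges receive no prior mass, so they keep probability zero.
    posterior = counts + alpha * allowed
    row_sums = posterior.sum(axis=1, keepdims=True)
    return np.divide(posterior, row_sums, out=np.zeros_like(posterior), where=row_sums > 0)


# Toy example: three nodes, self-loops forbidden, very few observations.
counts = [[0, 3, 0],
          [1, 0, 0],
          [0, 0, 0]]
allowed = [[False, True, True],
           [True, False, True],
           [True, True, False]]

T_mle = mle_transition_matrix(counts)
T_bayes = posterior_mean_transition_matrix(counts, allowed, alpha=1.0)
print(T_mle)
print(T_bayes)
```

In this toy case the frequentist estimate has no information for the third node, whose row stays all zeros. The constrained posterior mean spreads that node's probability over its two allowed neighbors and still assigns zero to the forbidden self-loops, which is the kind of small-sample benefit the abstract describes.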
{"title":"Bayesian inference of transition matrices from incomplete graph data with a topological prior.","authors":"Vincenzo Perri, Luka V Petrović, Ingo Scholtes","doi":"10.1140/epjds/s13688-023-00416-3","DOIUrl":"10.1140/epjds/s13688-023-00416-3","url":null,"abstract":"<p><p>Many network analysis and graph learning techniques are based on discrete- or continuous-time models of random walks. To apply these methods, it is necessary to infer transition matrices that formalize the underlying stochastic process in an observed graph. For weighted graphs, where weighted edges capture observations of repeated interactions between nodes, it is common to estimate the entries of such transition matrices based on the (relative) weights of edges. However, in real-world settings we are often confronted with incomplete data, which turns the construction of the transition matrix based on a weighted graph into an <i>inference problem</i>. Moreover, we often have access to additional information that captures topological constraints of the system, i.e., which edges in a weighted graph are (theoretically) possible and which are not. Examples include transportation networks, where we may have access to a small sample of passenger trajectories as well as the physical topology of connections, or a limited set of observed social interactions with additional information on the underlying social structure. Combining these two different sources of information to reliably infer transition matrices from incomplete data on repeated interactions is an important open challenge, with severe implications for the reliability of downstream network analysis tasks. Addressing this issue, we show that including knowledge on such topological constraints can considerably improve the inference of transition matrices, especially in situations where we only have a small number of observed interactions. 
To this end, we derive an analytically tractable Bayesian method that uses repeated interactions and a topological prior to perform data-efficient inference of transition matrices. We compare our approach against commonly used frequentist and Bayesian approaches both in synthetic data and in five real-world datasets, and we find that our method recovers the transition probabilities with higher accuracy. Furthermore, we demonstrate that the method is robust even in cases when the knowledge of the topological constraint is partial. Lastly, we show that this higher accuracy improves the results for downstream network analysis tasks like cluster detection and node ranking, which highlights the practical relevance of our method for interdisciplinary data-driven analyses of networked systems.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"12 1","pages":"48"},"PeriodicalIF":3.6,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10567898/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41233432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-01-01 Epub Date: 2023-06-07 DOI: 10.1140/epjds/s13688-023-00394-6
Esra Suel, Emily Muller, James E Bennett, Tony Blakely, Yvonne Doyle, John Lynch, Joreintje D Mackenbach, Ariane Middel, Anja Mizdrak, Ricky Nathvani, Michael Brauer, Majid Ezzati
Urbanization and inequalities are two of the major policy themes of our time, intersecting in large cities where social and economic inequalities are particularly pronounced. Large-scale street-level images are a source of city-wide visual information and allow for comparative analyses of multiple cities. Computer vision methods based on deep learning applied to street images have been shown to successfully measure inequalities in socioeconomic and environmental features, yet existing work has been within specific geographies and has not looked at how visual environments compare across different cities and countries. In this study, we aim to apply existing methods to understand whether, and to what extent, poor and wealthy groups live in visually similar neighborhoods across cities and countries. We present novel insights on similarity of neighborhoods using street-level images and deep learning methods. We analyzed 7.2 million images from 12 cities in five high-income countries, home to more than 85 million people: Auckland (New Zealand), Sydney (Australia), Toronto and Vancouver (Canada), Atlanta, Boston, Chicago, Los Angeles, New York, San Francisco, and Washington D.C. (United States of America), and London (United Kingdom). Visual features associated with neighborhood disadvantage are more distinct and unique to each city than those associated with affluence. For example, from what is visible from street images, high-density poor neighborhoods located near the city center (e.g., in London) are visually distinct from poor suburban neighborhoods characterized by lower density and lower accessibility (e.g., in Atlanta). This suggests that differences between two cities are also driven by historical factors, policies, and local geography. Our results also have implications for image-based measures of inequality in cities, especially when trained on data from cities that are visually distinct from target cities. 
We showed that these are more prone to errors for disadvantaged areas, especially when transferring across cities, suggesting more attention needs to be paid to improving methods for capturing heterogeneity in poor environments across cities around the world.
Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-023-00394-6.
{"title":"Do poverty and wealth look the same the world over? A comparative study of 12 cities from five high-income countries using street images.","authors":"Esra Suel, Emily Muller, James E Bennett, Tony Blakely, Yvonne Doyle, John Lynch, Joreintje D Mackenbach, Ariane Middel, Anja Mizdrak, Ricky Nathvani, Michael Brauer, Majid Ezzati","doi":"10.1140/epjds/s13688-023-00394-6","DOIUrl":"10.1140/epjds/s13688-023-00394-6","url":null,"abstract":"<p><p>Urbanization and inequalities are two of the major policy themes of our time, intersecting in large cities where social and economic inequalities are particularly pronounced. Large-scale street-level images are a source of city-wide visual information and allow for comparative analyses of multiple cities. Computer vision methods based on deep learning applied to street images have been shown to successfully measure inequalities in socioeconomic and environmental features, yet existing work has been within specific geographies and has not looked at how visual environments compare across different cities and countries. In this study, we aim to apply existing methods to understand whether, and to what extent, poor and wealthy groups live in visually similar neighborhoods across cities and countries. We present novel insights on similarity of neighborhoods using street-level images and deep learning methods. We analyzed 7.2 million images from 12 cities in five high-income countries, home to more than 85 million people: Auckland (New Zealand), Sydney (Australia), Toronto and Vancouver (Canada), Atlanta, Boston, Chicago, Los Angeles, New York, San Francisco, and Washington D.C. (United States of America), and London (United Kingdom). Visual features associated with neighborhood disadvantage are more distinct and unique to each city than those associated with affluence. 
For example, from what is visible from street images, high-density poor neighborhoods located near the city center (e.g., in London) are visually distinct from poor suburban neighborhoods characterized by lower density and lower accessibility (e.g., in Atlanta). This suggests that differences between two cities are also driven by historical factors, policies, and local geography. Our results also have implications for image-based measures of inequality in cities, especially when trained on data from cities that are visually distinct from target cities. We showed that these are more prone to errors for disadvantaged areas, especially when transferring across cities, suggesting more attention needs to be paid to improving methods for capturing heterogeneity in poor environments across cities around the world.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1140/epjds/s13688-023-00394-6.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"12 1","pages":"19"},"PeriodicalIF":3.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10245348/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9982453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}