Pub Date: 2025-01-01 | Epub Date: 2025-11-27 | DOI: 10.1140/epjds/s13688-025-00594-2
Nnaemeka Ohamadike, Kevin Durrheim, Mpho Primus
Does race bias manifest in South African news, and how can computational methods like word embeddings reveal it? After apartheid's end in 1994, South Africa implemented policies to address racial and economic divides and transform institutions and structures, including the news media. This study introduces a computational approach to quantify race bias in South African news using neural embeddings. We trained word2vec word embeddings on COVID-19 vaccination news articles from 76 South African news sources. These large-scale embeddings are unbiased by design but can detect and reveal hidden biases in language. We found consistent race bias in the coverage of socioeconomic phenomena, while health results were weaker, mixed and likely corpus-dependent. COVID-19 may have also amplified associations between "Black" and unhealthy terms in news coverage. Our methodology complements traditional qualitative techniques and allows for a more objective and representative way of investigating racism in South African news. Findings are validated through multiple methods, including human ratings, and have implications for South African news and this research field.
Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-025-00594-2.
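To make the embedding-based measurement above concrete, here is a minimal sketch of one common way to score association bias: compare the mean cosine similarity of a set of attribute words to two target words. The abstract does not state the exact metric the authors used, so the functions `cosine`, `mean_similarity`, and `association_bias`, as well as the toy vectors, are hypothetical illustrations rather than the paper's method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_similarity(word_vec, attribute_vecs):
    """Average cosine similarity between one word and a set of attribute words."""
    return sum(cosine(word_vec, a) for a in attribute_vecs) / len(attribute_vecs)

def association_bias(target_a, target_b, attribute_vecs):
    """Positive when the attribute set sits closer to target_a than to target_b."""
    return mean_similarity(target_a, attribute_vecs) - mean_similarity(target_b, attribute_vecs)

# Hypothetical 3-dimensional vectors; trained word2vec vectors are much larger.
toy = {
    "group_a": [1.0, 0.0, 0.0],
    "group_b": [0.0, 1.0, 0.0],
    "attr_1": [0.9, 0.1, 0.0],
    "attr_2": [0.8, 0.2, 0.1],
}
bias = association_bias(toy["group_a"], toy["group_b"], [toy["attr_1"], toy["attr_2"]])
print(f"association bias: {bias:.3f}")
```

In practice the vectors would come from the trained embedding model, and the score would be compared against a baseline or validated against human ratings, as the abstract describes.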
Title: The news in black and white: word embeddings quantify racism in South African news. Journal: EPJ Data Science 14(1): 83 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12660342/pdf/
Pub Date: 2025-01-01 | Epub Date: 2025-02-19 | DOI: 10.1140/epjds/s13688-025-00523-3
Dragos Gorduza, Stefan Zohren, Xiaowen Dong
Understanding stock market instability is a key question in financial management, as practitioners seek to forecast breakdowns in long-run asset co-movement patterns, which expose portfolios to rapid and devastating collapses in value. These disruptions are linked to changes in the structure of market-wide stock correlations, which increase the risk of high-volatility shocks. The structure of these co-movements can be described as a network in which companies are represented by nodes and edges capture correlations between their price movements. Co-movement breakdowns then manifest as abrupt changes in the topological structure of this network. Measuring the scale of this change and learning a timely indicator of breakdowns are central to understanding financial stability and to forecasting volatility. We propose to use the edge reconstruction accuracy of a graph auto-encoder as an indicator of how homogeneous the connections between assets are, which, following the literature on financial network analysis, we use as a proxy to infer market volatility. We show, through experiments on the Standard and Poor's index over the 2015-2022 period, that the reconstruction errors from our model correlate with volatility spikes and can be used to improve out-of-sample autoregressive modeling of volatility. Our results demonstrate that market instability can be predicted from changes in the homogeneity of connections in the financial network, which expands the understanding of instability in the stock market. We discuss the implications of this graph machine learning-based volatility estimation for policy aimed at ensuring financial market stability.
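The network construction and the reconstruction-error indicator described above can be sketched as follows. This is a simplified illustration under stated assumptions: edges are drawn by thresholding absolute Pearson correlation (the threshold 0.8 is arbitrary), and `edge_prob` is a hypothetical stand-in for the edge probabilities that a trained graph auto-encoder decoder would output. No auto-encoder is trained here.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length return series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_network(returns, threshold=0.8):
    """Nodes are tickers; an edge links two tickers if |correlation| >= threshold."""
    tickers = sorted(returns)
    edges = set()
    for i, a in enumerate(tickers):
        for b in tickers[i + 1:]:
            if abs(pearson(returns[a], returns[b])) >= threshold:
                edges.add((a, b))
    return tickers, edges

def reconstruction_error(tickers, edges, edge_prob):
    """Fraction of node pairs where the predicted edge (prob >= 0.5) disagrees with the observed one."""
    pairs = [(a, b) for i, a in enumerate(tickers) for b in tickers[i + 1:]]
    wrong = sum(((a, b) in edges) != (edge_prob[(a, b)] >= 0.5) for a, b in pairs)
    return wrong / len(pairs)

# Hypothetical daily returns for three made-up tickers.
returns = {
    "AAA": [0.01, -0.02, 0.03, -0.01],
    "BBB": [0.02, -0.04, 0.06, -0.02],
    "CCC": [-0.01, 0.01, 0.0, 0.02],
}
tickers, edges = correlation_network(returns)

# Hypothetical decoder output: predicted probability of an edge for each pair.
edge_prob = {("AAA", "BBB"): 0.9, ("AAA", "CCC"): 0.2, ("BBB", "CCC"): 0.7}
error = reconstruction_error(tickers, edges, edge_prob)
print(edges, round(error, 3))
```

Computed per time window, an error series of this kind is what one would then feed as an extra regressor into an autoregressive volatility model.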
Title: Understanding stock market instability via graph auto-encoders. Journal: EPJ Data Science 14(1): 13 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11839781/pdf/
Pub Date: 2025-01-01 | Epub Date: 2025-02-21 | DOI: 10.1140/epjds/s13688-025-00534-0
João A Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton
Credibility signals represent a wide range of heuristics typically used by journalists and fact-checkers to assess the veracity of online content. Automating the extraction of credibility signals presents significant challenges due to the necessity of training high-accuracy, signal-specific extractors, coupled with the lack of sufficiently large annotated datasets. This paper introduces Pastel (Prompted weAk Supervision wiTh crEdibility signaLs), a weakly supervised approach that leverages large language models (LLMs) to extract credibility signals from web content, and subsequently combines them to predict the veracity of content without relying on human supervision. We validate our approach using four article-level misinformation detection datasets, demonstrating that Pastel outperforms zero-shot veracity detection by 38.3% and achieves 86.7% of the performance of the state-of-the-art system trained with human supervision. Moreover, in cross-domain settings where training and testing datasets originate from different domains, Pastel significantly outperforms the state-of-the-art supervised model by 63%. We further study the association between credibility signals and veracity, and perform an ablation study showing the impact of each signal on model performance. Our findings reveal that 12 out of the 19 proposed signals exhibit strong associations with veracity across all datasets, while some signals show domain-specific strengths.
Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-025-00534-0.
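As a rough illustration of combining credibility signals into a veracity label without human labels, the sketch below uses an unweighted majority vote. The abstract does not specify how Pastel aggregates signals, so the vote, the function `aggregate_signals`, and the example votes are assumptions chosen for simplicity. A learned label model would weight signals by estimated reliability instead.

```python
def aggregate_signals(signal_votes):
    """Combine per-signal votes into one weak label.

    Each vote is +1 (signal suggests misinformation), -1 (signal suggests
    credible content), or 0 (signal abstains). Returns 1 for misinformation,
    0 for credible, and None when the votes tie or all abstain.
    """
    total = sum(signal_votes)
    if total > 0:
        return 1
    if total < 0:
        return 0
    return None

# Hypothetical votes, e.g. derived from LLM answers to one prompt per signal.
articles = {
    "article_1": [1, 1, -1, 0],
    "article_2": [-1, -1, 0, 0],
    "article_3": [1, -1, 0, 0],
}
labels = {name: aggregate_signals(votes) for name, votes in articles.items()}
print(labels)
```

The resulting weak labels could then be used to train a downstream classifier, with tied or empty votes dropped from the training set.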
Title: Weakly supervised veracity classification with LLM-predicted credibility signals. Journal: EPJ Data Science 14(1): 16 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11845407/pdf/
Pub Date: 2025-01-01 | Epub Date: 2025-07-10 | DOI: 10.1140/epjds/s13688-025-00563-9
Thomas Louf, José J Ramasco, David Sánchez, Márton Karsai
The socioeconomic background of people and how they use standard forms of language are not independent, as demonstrated in various sociolinguistic studies. However, the extent to which these correlations may be influenced by the mixing of people from different socioeconomic classes remains relatively unexplored from a quantitative perspective. In this work we leverage geotagged tweets and transferable computational methods to map deviations from standard English across eight UK metropolitan areas. We combine these data with high-resolution income maps to assign a proxy socioeconomic indicator to home-located users. Strikingly, we find a consistent pattern suggesting that the more different socioeconomic classes mix, the less interdependent the frequency of their departures from standard grammar and their income become. Further, we propose an agent-based model of linguistic variety adoption that sheds light on the mechanisms that produce the observations seen in the data.
Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-025-00563-9.
Title: When dialects collide: how socioeconomic mixing affects language use. Journal: EPJ Data Science 14(1): 47 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12245997/pdf/
Pub Date: 2025-01-01 | Epub Date: 2025-03-19 | DOI: 10.1140/epjds/s13688-025-00531-3
Harald Schweiger, Emilia Parada-Cabaleiro, Markus Schedl
Music playlist creation is a crucial, yet not fully explored, task in music data mining and music information retrieval. Previous studies have largely focused on investigating the diversity, popularity, and serendipity of tracks in human- or machine-generated playlists. However, the concept of playlist coherence, vaguely defined as smooth transitions between tracks, remains poorly understood and even lacks a standardized definition. In this paper, we provide a formal definition for measuring playlist coherence based on the sequential ordering of tracks, offering a more interpretable measurement than those in the existing literature and allowing for comparisons between playlists with different musical styles. The presented formal framework is applied to analyze a substantial dataset of user-generated playlists, examining how various playlist characteristics influence coherence. We identified four key attributes as potential influencing factors: playlist length, number of edits, track popularity, and collaborative playlist curation. Using correlation and causal inference models, the impact of these attributes on coherence is assessed across ten auditory features and one metadata feature. Our findings indicate that these attributes influence playlist coherence to varying extents. Longer playlists tend to exhibit higher coherence, whereas playlists dominated by popular tracks or those extensively modified by users show reduced coherence. In contrast, collaborative playlist curation yielded mixed results. The insights from this study have practical implications for enhancing recommendation tasks, such as automatic playlist generation and continuation, beyond traditional accuracy metrics. As a demonstration of these findings, we propose a simple greedy algorithm that reorganizes playlists to align coherence with observed trends.
Supplementary information: The online version contains supplementary material available at 10.1140/epjds/s13688-025-00531-3.
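The idea of an order-dependent coherence score and a greedy reordering can be sketched as below. This is not the authors' definition or algorithm: here coherence is assumed to be the negative mean Euclidean distance between consecutive tracks in a feature space, and the reordering is a plain nearest-neighbour heuristic. The function names and the one-dimensional toy features are hypothetical.

```python
import math

def sequential_coherence(tracks):
    """Negative mean distance between consecutive tracks; higher means smoother transitions."""
    if len(tracks) < 2:
        return 0.0
    steps = [math.dist(a, b) for a, b in zip(tracks, tracks[1:])]
    return -sum(steps) / len(steps)

def greedy_reorder(tracks):
    """Keep the first track, then repeatedly append the nearest remaining track."""
    ordered = [tracks[0]]
    remaining = list(tracks[1:])
    while remaining:
        nearest = min(remaining, key=lambda t: math.dist(ordered[-1], t))
        remaining.remove(nearest)
        ordered.append(nearest)
    return ordered

# Hypothetical one-dimensional feature (for example, normalized tempo) per track.
playlist = [(0.0,), (1.0,), (0.1,), (0.9,)]
reordered = greedy_reorder(playlist)
print(sequential_coherence(playlist), sequential_coherence(reordered))
```

Because the score depends only on consecutive pairs, shuffling a playlist changes its coherence while leaving order-free measures such as diversity unchanged.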
Title: The impact of playlist characteristics on coherence in user-curated music playlists. Journal: EPJ Data Science 14(1): 24 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11923031/pdf/
Pub Date: 2025-01-01 | Epub Date: 2025-08-27 | DOI: 10.1140/epjds/s13688-025-00582-6
Ghenai Amira, Nath Keshav, Satsangi Aarat
The COVID-19 pandemic significantly impacted older adults, generating widespread online discussions that revealed how this at-risk population was perceived. Understanding these portrayals is essential, as public discourse influences societal perceptions of aging and impacts policies and practices affecting older adults. Past research highlights that ageist stereotypes and attitudes frequently surface in public discussions, shaping the experiences of older individuals. The current study presents AGECovP, a comprehensive dataset featuring a diverse collection of videos from YouTube, a leading social media platform. AGECovP is designed to provide researchers with meaningful insights into how older adults were portrayed during the pandemic and how topics such as conspiracy theories, misinformation, and the anti-vaccine movement were framed in relation to aging populations. In addition, the dataset includes a set of labeled comments indicating the presence of ageist content, enabling researchers to perform ageism detection and analyze ageism in online discourse. By providing a resource for examining both overt and subtle forms of ageism, AGECovP contributes to the development of tools and methodologies for addressing bias against older adults. This dataset fosters actionable insights into societal attitudes, supporting the development of inclusive policies and interventions. Our data is available at: https://zenodo.org/records/15800324.
Title: AGECovP: identifying ageism and analyzing COVID-19 discourse on older adults in YouTube. Journal: EPJ Data Science 14(1): 65 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12390874/pdf/
Pub Date: 2025-01-01 | Epub Date: 2025-05-23 | DOI: 10.1140/epjds/s13688-025-00556-8
Liam Burke-Moore, Angus R Williams, Jonathan Bright
Engaging with online social media platforms is an important part of life as a public figure in modern society, enabling connection with broad audiences and providing a platform for spreading ideas. However, public figures are often disproportionate recipients of hate and abuse on these platforms, which degrades public discourse. While significant research exists on abuse received by groups such as politicians and journalists, little has been done to understand, systematically and at scale, how the dynamics of abuse differ across groups of public figures. To address this, we present an analysis of a novel dataset of 45.5M tweets targeted at 4602 UK public figures across 3 domains (members of parliament, footballers, journalists), labelled using fine-tuned transformer-based language models. We find that MPs receive more abuse in absolute terms, but that journalists are most likely to receive abuse after controlling for other factors. We show that abuse is unevenly distributed in all groups, with a small number of individuals receiving the majority of abuse, and that for some groups, particularly footballers, abuse is more temporally uneven, being driven by specific events. We also find that a more prominent online presence and being male are indicative of higher levels of abuse across all 3 domains.
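The claim that a small number of individuals receive most of the abuse can be quantified with a simple concentration measure. The sketch below computes the share of all abusive tweets received by the most-targeted fraction of individuals. The function `top_share` and the counts are hypothetical and are not taken from the paper's data.

```python
def top_share(counts, fraction=0.1):
    """Share of the total received by the top `fraction` of individuals (at least one person)."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

# Hypothetical abusive-tweet counts for ten public figures.
abuse_counts = [500, 20, 15, 10, 10, 5, 5, 5, 0, 0]
share = top_share(abuse_counts, fraction=0.1)
print(f"top 10% receive {share:.1%} of abuse")
```

A value near the chosen fraction would indicate an even spread, while a value near 1 indicates that abuse is concentrated on very few people.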
Title: Journalists are most likely to receive abuse: analysing online abuse of UK public figures across sport, politics, and journalism on Twitter. Journal: EPJ Data Science 14(1): 41 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12102095/pdf/
The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has transformed our lives. To combat the spread of the infection, remote work has become a widespread practice. However, this shift has led to various work-related problems, including prolonged working hours, mental health issues, and communication difficulties. One particular challenge faced by team members is the inability to accurately gauge the work engagement (WE) levels of subordinates, such as their absorption, dedication, and vigor, due to the limited number of in-person interactions that occur in remote work settings. To address this issue, online communication systems utilizing text-based chat tools such as Slack and Microsoft Teams have gained popularity as substitutes for face-to-face communication. In this paper, we propose a novel approach that uses graph neural networks (GNNs) to estimate the work engagement levels (WELs) of users on text-based chat platforms. Specifically, our method embeds users in a feature space based solely on the structural information of the communication network, without considering the contents of the conversations that take place. We conduct two studies using Slack data to evaluate our proposal. The first study reveals that the properties of communication networks play a more significant role in estimating WELs than conversation contents do. Building upon this result, the second study develops a machine learning model that estimates WELs using only the structural features of the communication network. In this network representation, each node corresponds to a human user, and edges represent communication logs; i.e., if person A talks to person B, an edge is drawn between node A and node B. Notably, our model achieves a correlation coefficient of 0.60 between the observed and predicted WEL values. 
Importantly, our proposed approach relies solely on communication network data and does not require linguistic information. This makes it particularly valuable for real-world business situations.
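The content-free network representation described above can be sketched as follows: each chat log entry contributes an edge between sender and receiver, and only structural quantities are derived from the result. The log format, the function names, and the choice of degree and strength as features are assumptions for illustration. The paper's GNN embedding is not reproduced here.

```python
from collections import defaultdict

def build_network(logs):
    """Map each undirected user pair to the number of messages exchanged."""
    weights = defaultdict(int)
    for sender, receiver in logs:
        if sender != receiver:  # ignore messages to oneself
            weights[tuple(sorted((sender, receiver)))] += 1
    return dict(weights)

def structural_features(weights):
    """Per-user degree (distinct partners) and strength (total messages)."""
    degree = defaultdict(int)
    strength = defaultdict(int)
    for (a, b), w in weights.items():
        degree[a] += 1
        degree[b] += 1
        strength[a] += w
        strength[b] += w
    return {u: {"degree": degree[u], "strength": strength[u]} for u in degree}

# Hypothetical (sender, receiver) pairs; message text is never read.
logs = [("A", "B"), ("B", "A"), ("A", "C")]
weights = build_network(logs)
features = structural_features(weights)
print(features)
```

Features of this kind, or learned node embeddings that replace them, would then serve as inputs to a regression model whose predictions are compared with survey-measured engagement scores.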
Title: Estimating work engagement from online chat tools. Authors: Hiroaki Tanaka, Wataru Yamada, Keiichi Ochiai, Shoko Wakamiya, Eiji Aramaki. Journal: EPJ Data Science. Pub Date: 2024-09-05. DOI: https://doi.org/10.1140/epjds/s13688-024-00496-9
Pub Date : 2024-09-02 DOI: 10.1140/epjds/s13688-024-00494-x
Lluc Font-Pomarol, Angelo Piga, Sergio Nasarre-Aznar, Marta Sales-Pardo, Roger Guimerà
There are examples of how unconscious bias can influence people's actions. In the judiciary, however, despite some examples, there is no general theory on whether demographic attributes such as gender, seniority or ethnicity affect case sentencing. We aim to gain insight into this issue by analyzing over 100k decisions in three different areas of law, with the goal of understanding whether judge identity or judge attributes such as gender and seniority can be inferred from decision documents. We find that stylistic features of decisions are predictive of judge identities, their gender and their seniority, a finding that is aligned with results from the analysis of written texts outside the judiciary. Surprisingly, we find that features based on the legislation cited are also predictive of judge identities and attributes. While judges' reuse of their own content can explain our ability to predict judge identities, no specific reduced set of features can explain the differences we find in the legislation cited in decisions when we group judges by gender or seniority. Our findings open the door for further research on how these differences translate into how judges apply the law and, ultimately, to promote a more transparent and fair judiciary system.
{"title":"Language and the use of law are predictive of judge gender and seniority","authors":"Lluc Font-Pomarol, Angelo Piga, Sergio Nasarre-Aznar, Marta Sales-Pardo, Roger Guimerà","doi":"10.1140/epjds/s13688-024-00494-x","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00494-x","url":null,"abstract":"<p>There are examples of how unconscious bias can influence actions of people. In the judiciary, however, despite some examples there is no general theory on whether different demographic attributes such as gender, seniority or ethnicity affect case sentencing. We aim to gain insight into this issue by analyzing over 100k decisions of three different areas of law with the goal of understanding whether judge identity or judge attributes such as gender and seniority can be inferred from decision documents. We find that stylistic features of decisions are predictive of judge identities, their gender and their seniority, a finding that is aligned with results from analysis of written texts outside the judiciary. Surprisingly, we find that features based on legislation cited are also predictive of judge identities and attributes. While own content reuse by judges can explain our ability to predict judge identities, no specific reduced set of features can explain the differences we find in the legislation cited of decisions when we group judges by gender or seniority. 
Our findings open the door for further research on how these differences translate into how judges apply the law and, ultimately, to promote a more transparent and fair judiciary system.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"13 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-14 DOI: 10.1140/epjds/s13688-024-00482-1
Zouhaier Dhifaoui
As nations progress, the impact of climate change on food prices becomes increasingly substantial. While the influence of climate change on the yields of major agricultural products is widely recognized, its specific effect on food prices remains uncertain. This study delves into the impact of the North Atlantic Oscillation (NAO) index, a well-established climate indicator, on global food prices. To accomplish this, a robust bivariate Hurst exponent (robust bHe) is applied. The study employs a sliding windows approach across various time scales to produce a color map of this coefficient, presenting a time-varying version. Furthermore, variable-lag transfer entropy with a sliding windows approach is utilized to discern causal relationships between the NAO index and international food prices. The findings reveal that significant increases in the NAO index are correlated with noteworthy upswings in various international food prices over both short and long-term periods. Additionally, variable-lag transfer entropy confirms the causal role of the NAO index in influencing international food prices.
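The sliding windows idea mentioned in this abstract can be sketched as follows. This sketch deliberately swaps in a plain Pearson correlation as the per-window statistic; it is not the robust bivariate Hurst exponent or the variable-lag transfer entropy used in the study, and the function names and toy series are assumptions for illustration. It only shows how a time-varying coefficient arises from moving a fixed-size window along two aligned series.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant window
    return cov / math.sqrt(var_x * var_y)


def sliding_window_correlation(x, y, window):
    """Correlation of x and y inside each window of the given length."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    return [
        pearson(x[i:i + window], y[i:i + window])
        for i in range(len(x) - window + 1)
    ]


# Toy series standing in for a climate index and a price index.
climate_index = [1, 2, 3, 4, 5, 6]
price_index = [2, 4, 6, 5, 4, 3]
rolling = sliding_window_correlation(climate_index, price_index, window=3)
```

Here the coefficient moves from strongly positive in the first window to strongly negative in the last, which is the kind of time variation a color map over several window sizes would display. Repeating the computation for different `window` values gives one row of such a map per time scale.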
{"title":"Connection between climatic change and international food prices: evidence from robust long-range cross-correlation and variable-lag transfer entropy with sliding windows approach","authors":"Zouhaier Dhifaoui","doi":"10.1140/epjds/s13688-024-00482-1","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00482-1","url":null,"abstract":"<p>As nations progress, the impact of climate change on food prices becomes increasingly substantial. While the influence of climate change on the yields of major agricultural products is widely recognized, its specific effect on food prices remains uncertain. This study delves into the impact of the North Atlantic Oscillation (NAO) index, a well-established climate indicator, on global food prices. To accomplish this, a robust bivariate Hurst exponent (robust bHe) is applied. The study employs a sliding windows approach across various time scales to produce a color map of this coefficient, presenting a time-varying version. Furthermore, variable-lag transfer entropy with a sliding windows approach is utilized to discern causal relationships between the NAO index and international food prices. The findings reveal that significant increases in the NAO index are correlated with noteworthy upswings in various international food prices over both short and long-term periods. 
Additionally, variable-lag transfer entropy confirms the causal role of the NAO index in influencing international food prices.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"34 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}