Pub Date: 2024-04-22; DOI: 10.1140/epjds/s13688-024-00466-1
Alex D. Singleton, Seth Spielman
In the United States, recent changes to the National Statistical System have amplified the geographic-demographic resolution trade-off. That is, when working with demographic and economic data from the American Community Survey (ACS), as one zooms in geographically one loses resolution demographically due to very large margins of error. In this paper, we present a solution to this problem in the form of an AI-based, open and reproducible geodemographic classification system for the United States using small area estimates from the ACS. We apply a partitioning clustering algorithm to a range of socio-economic, demographic, and built environment variables. Our approach utilizes an open-source software pipeline that ensures adaptability to future data updates. A key innovation is the integration of GPT4, a state-of-the-art large language model, to generate intuitive cluster descriptions and names. This represents a novel application of natural language processing in geodemographic research and showcases the potential for human-AI collaboration within the geospatial domain.
Title: Segmentation using large language models: A new typology of American neighborhoods (EPJ Data Science, https://doi.org/10.1140/epjds/s13688-024-00466-1)
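The abstract above mentions a partitioning clustering algorithm without naming it. As a minimal sketch, assuming k-means (Lloyd's algorithm) as one common member of that family, the idea can be illustrated as follows. The tract values and the two variables are hypothetical and are not taken from the paper.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    # Deterministic initialisation (an assumption): the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


# Hypothetical tracts described by two rescaled variables,
# e.g. (share of renters, share of residents under 30).
tracts = [(0.1, 0.2), (0.9, 0.8), (0.15, 0.25), (0.85, 0.9), (0.2, 0.1), (0.8, 0.85)]
centroids, labels = kmeans(tracts, k=2)
print(labels)
```

In a real geodemographic pipeline the inputs would be standardised ACS variables, and the number of clusters would be chosen by diagnostics rather than fixed in advance.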
Pub Date: 2024-04-19; DOI: 10.1140/epjds/s13688-024-00472-3
Chiara Zappalà, Sandro Sousa, Tiago Cunha, Alessandro Pluchino, Andrea Rapisarda, Roberta Sinatra
Success in sports is a complex phenomenon that has only garnered limited research attention. In particular, we lack a deep scientific understanding of success in sports like tennis and the factors that contribute to it. Here, we study the unfolding of tennis players’ careers to understand the role of early career stages and the impact of specific tournaments on players’ trajectories. We employ a comprehensive approach combining network science and analysis of the Association of Tennis Professionals (ATP) tournament data and introduce a novel method to quantify tournament prestige based on the eigenvector centrality of the co-attendance network of tournaments. Focusing on the interplay between participation in central tournaments and players’ performance, we find that the level of the tournament where players achieve their first win is associated with becoming a top player. This work sheds light on the critical role of the initial stages in the progression of players’ careers, offering valuable insights into the dynamics of success in tennis.
Title: Early career wins and tournament prestige characterize tennis players’ trajectories (EPJ Data Science, https://doi.org/10.1140/epjds/s13688-024-00472-3)
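The prestige measure described above rests on eigenvector centrality in a co-attendance network of tournaments. The sketch below illustrates that idea with power iteration, assuming an edge weight equal to the number of players who attended both tournaments. The weighting scheme, player names, and tournament names are illustrative assumptions, not the authors' data or exact construction.

```python
from itertools import combinations


def co_attendance(attendance):
    """Weighted edges: weight = number of players who attended both tournaments."""
    weights = {}
    for tournaments in attendance.values():
        for a, b in combinations(sorted(tournaments), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
    return weights


def eigenvector_centrality(weights, iterations=200, tol=1e-10):
    """Power iteration on the weighted, undirected adjacency matrix."""
    nodes = sorted({n for edge in weights for n in edge})
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {n: 0.0 for n in nodes}
        for (a, b), w in weights.items():
            new[a] += w * score[b]
            new[b] += w * score[a]
        # Normalise to unit length so the scores do not blow up.
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score


# Hypothetical attendance records (player -> tournaments entered).
attendance = {
    "player_a": {"Slam", "Masters", "Open250"},
    "player_b": {"Slam", "Masters"},
    "player_c": {"Slam", "Challenger"},
}
prestige = eigenvector_centrality(co_attendance(attendance))
ranking = sorted(prestige, key=prestige.get, reverse=True)
print(ranking)
```

In this toy network "Slam" ranks first because it shares players with every other tournament, including the well-connected "Masters" event.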
Pub Date: 2024-04-19; DOI: 10.1140/epjds/s13688-024-00467-0
Serena Tardelli, Leonardo Nizzoli, Marco Avvenuti, Stefano Cresci, Maurizio Tesconi
Organized attempts to manipulate public opinion during election run-ups have dominated online debates in the last few years. Such attempts require numerous accounts to act in coordination to exert influence. Yet, the ways in which coordinated behavior surfaces during major online political debates are still largely unclear. This study sheds light on coordinated behaviors that took place on Twitter (now X) during the 2020 US Presidential Election. Utilizing state-of-the-art network science methods, we detect and characterize the coordinated communities that participated in the debate. Our approach goes beyond previous analyses by proposing a multifaceted characterization of the coordinated communities that yields nuanced results. In particular, we uncover three main categories of coordinated users: (i) moderate groups genuinely interested in the electoral debate, (ii) conspiratorial groups that spread false information and divisive narratives, and (iii) foreign influence networks that either sought to tamper with the debate or exploited it to publicize their own agendas. We also reveal heavy use of automation by far-right foreign influence and conspiratorial communities. Conversely, left-leaning supporters were overall less coordinated and engaged primarily in harmless, factual communication. Our results also show that Twitter was effective at thwarting the activity of some coordinated groups, while it failed on other equally suspicious ones. Overall, this study advances the understanding of online human interactions and contributes new knowledge to mitigate cyber social threats.
Title: Multifaceted online coordinated behavior in the 2020 US presidential election (EPJ Data Science, https://doi.org/10.1140/epjds/s13688-024-00467-0)
Pub Date: 2024-04-16; DOI: 10.1140/epjds/s13688-024-00470-5
Mohsen Ghasemizade, Jeremiah Onaolapo
A conspiracy theory (CT) suggests covert groups or powerful individuals secretly manipulate events. Not knowing about existing conspiracy theories could make one more likely to believe them, so this work aims to compile a list of CTs shaped as a tree that is as comprehensive as possible. We began with a manually curated ‘tree’ of CTs from academic papers and Wikipedia. Next, we examined 1769 CT-related articles from four fact-checking websites, focusing on their core content, and used keyphrase extraction to label the documents. This process yielded 769 identified conspiracies, each assigned a label and a family name. The second goal of this project was to detect whether an article is a conspiracy theory, so we built a binary classifier with our labeled dataset. The classifier is built on RoBERTa, a transformer-based language model pre-trained on a large text corpus, and achieves an F1 score of 87%. This model helps to identify potential conspiracy theories in new articles. We used a combination of clustering (HDBSCAN) and a dimension reduction technique (UMAP) to assign a label from the tree to these new articles detected as conspiracy theories. We then labeled these groups accordingly to help us match them to the tree. These steps can lead us to detect new conspiracy theories and expand the tree using computational methods. We successfully generated a tree of conspiracy theories and built a pipeline to detect and categorize conspiracy theories within any text corpus. This pipeline gives us valuable insights from any database formatted as text.
Title: Developing a hierarchical model for unraveling conspiracy theories (EPJ Data Science, https://doi.org/10.1140/epjds/s13688-024-00470-5)
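The classifier above is summarised by a single F1 score. As a reminder of what that figure measures, here is a minimal sketch of the F1 computation for a binary classifier. The label vectors are made up for illustration and have no connection to the paper's dataset.

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = article flagged as a conspiracy theory, 0 = not.
truth = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]
score = f1_score(truth, predicted)
print(score)  # 3 true positives, 1 false positive, 1 false negative -> 0.75
```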
The escalation of urban traffic congestion has reached a critical extent due to rapid urbanization, capturing considerable attention within urban science and transportation research. Although preceding studies have validated the scale-free distributions in spatio-temporal congestion clusters across cities, the influence of travel demand on that distribution has yet to be explored. Using a unique traffic dataset during the COVID-19 pandemic in Shanghai 2022, we present empirical evidence that travel demand plays a pivotal role in shaping the scaling laws of traffic congestion. We uncover a noteworthy negative linear correlation between the travel demand and the traffic resilience represented by scaling exponents of congestion cluster size and recovery duration. Additionally, we reveal that travel demand broadly dominates the scale of congestion in the form of scaling laws, including the aggregated volume of congestion clusters, the number of congestion clusters, and the number of congested roads. Subsequent micro-level analysis of congestion propagation also unveils that cascade diffusion determines the demand sensitivity of congestion, while other intrinsic components, namely spontaneous generation and dissipation, are rather stable. Our findings of traffic congestion under diverse travel demand can profoundly enrich our understanding of the scale-free nature of traffic congestion and provide insights into internal mechanisms of congestion propagation.
Title: Scaling law of real traffic jams under varying travel demand (EPJ Data Science, https://doi.org/10.1140/epjds/s13688-024-00471-4)
Rui Chen, Yuming Lin, Huan Yan, Jiazhen Liu, Yu Liu, Yong Li
Pub Date: 2024-04-11; DOI: 10.1140/epjds/s13688-024-00471-4
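The traffic study above characterises congestion by scaling exponents of cluster size and recovery duration. The abstract does not say how the exponents were fitted, so the sketch below assumes the standard continuous maximum-likelihood estimator for a power law p(x) ∝ x^(-α) with x ≥ x_min. The sample is synthetic and deterministic, not real congestion data.

```python
import math


def powerlaw_exponent(values, xmin):
    """Continuous maximum-likelihood estimate of alpha for x >= xmin."""
    tail = [x for x in values if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def synthetic_sizes(alpha, xmin, n):
    """Deterministic power-law sample built from evenly spaced quantiles."""
    return [
        xmin * (1.0 - (i + 0.5) / n) ** (-1.0 / (alpha - 1.0))
        for i in range(n)
    ]


# Hypothetical congestion-cluster sizes with a known exponent of 2.5.
sizes = synthetic_sizes(alpha=2.5, xmin=1.0, n=1000)
alpha_hat = powerlaw_exponent(sizes, xmin=1.0)
print(round(alpha_hat, 2))
```

The estimate should land close to the exponent used to generate the sample. Repeating the fit on data from days with different travel demand would give the kind of exponent-versus-demand relationship the abstract describes.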
Pub Date: 2024-04-10; DOI: 10.1140/epjds/s13688-024-00464-3
Matteo Serafino, Zhenkun Zhou, José S. Andrade, Alexandre Bovet, Hernán A. Makse
The ongoing debate surrounding the impact of the Internet Research Agency’s (IRA) social media campaign during the 2016 U.S. presidential election has largely overshadowed the involvement of other actors. Our analysis brings to light a substantial group of suspended Twitter users, outnumbering the IRA user group by a factor of 60, who align with the ideologies of the IRA campaign. Our study demonstrates that this group of suspended Twitter accounts significantly influenced individuals categorized as undecided or weak supporters, potentially with the aim of swaying their opinions, as indicated by Granger causality.
Title: Suspended accounts align with the Internet Research Agency misinformation campaign to influence the 2016 US election (EPJ Data Science, https://doi.org/10.1140/epjds/s13688-024-00464-3)
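The influence claim above relies on Granger causality: one series "Granger-causes" another if its past values improve the prediction of the other beyond what that series' own past achieves. The sketch below implements a single-lag version with ordinary least squares and an F statistic. The lag order, the synthetic series, and all names are assumptions for illustration, not the authors' setup.

```python
import random


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def rss(rows, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, target))


def granger_f(cause, effect):
    """F statistic for 'cause helps predict effect' using one lag."""
    target = effect[1:]
    restricted = [[1.0, effect[t]] for t in range(len(effect) - 1)]
    unrestricted = [[1.0, effect[t], cause[t]] for t in range(len(effect) - 1)]
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = len(target) - 3  # observations minus fitted parameters
    return (rss_r - rss_u) / (rss_u / dof)


# Synthetic series: y follows x with a one-step delay plus small noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]
f_xy = granger_f(x, y)  # does x help predict y?
f_yx = granger_f(y, x)  # does y help predict x?
print(f_xy > f_yx)
```

A large F statistic in one direction and a small one in the other is the asymmetry a Granger analysis looks for. A full analysis would also select the lag order and compare the statistic against the F distribution.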
Pub Date: 2024-04-04; DOI: 10.1140/epjds/s13688-024-00469-y
Abstract
Social Media (SM) has become a popular medium for individuals to share their opinions on various topics, including politics, social issues, and daily affairs. During controversial events such as political elections, active users often proclaim their stance and try to persuade others to support them. However, disparities in participation levels can lead to misperceptions and cause analysts to misjudge the support for each side. For example, current models usually rely on content production and overlook a vast majority of civically engaged users who passively consume information. These “silent users” can significantly impact the democratic process despite being less vocal. Accounting for the stances of this silent majority is critical to improving our reliance on SM to understand and measure social phenomena. Thus, this study proposes and evaluates a new approach for silent users’ stance prediction based on collaborative filtering and Graph Convolutional Networks, which exploits multiple relationships between users and topics. Furthermore, our method allows us to describe users with different stances and online behaviors. We demonstrate its validity using real-world datasets from two related political events. Specifically, we examine user attitudes leading to the Chilean constitutional referendums in 2020 and 2022 through extensive Twitter datasets. In both datasets, our model outperforms the baselines by over 9% at the edge- and the user level. Thus, our method offers an improvement in effectively quantifying the support and creating a multidimensional understanding of social discussions on SM platforms, especially during polarizing events.
Title: Unveiling the silent majority: stance detection and characterization of passive users on social media using collaborative filtering and graph convolutional networks (EPJ Data Science, https://doi.org/10.1140/epjds/s13688-024-00469-y)
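The stance-prediction approach above builds on Graph Convolutional Networks. The sketch below shows one propagation step in the widely used form ReLU(D^(-1/2) (A + I) D^(-1/2) H W). Whether the paper uses exactly this variant is an assumption, and the tiny graph, feature values, and weight matrix are hypothetical.

```python
def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adjacency, features, weights):
    """One graph-convolution step with self-loops and symmetric normalisation."""
    n = len(adjacency)
    # Add self-loops so every node keeps part of its own signal.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalisation: D^-1/2 (A + I) D^-1/2.
    norm = [
        [a_hat[i][j] / (degree[i] * degree[j]) ** 0.5 for j in range(n)]
        for i in range(n)
    ]
    aggregated = matmul(norm, features)    # average over the neighbourhood
    projected = matmul(aggregated, weights)  # linear transform (normally learned)
    return [[max(0.0, v) for v in row] for row in projected]  # ReLU


# Two connected users: a vocal one with a known stance score and a silent one.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weights = [[1.0]]
hidden = gcn_layer(adjacency, features, weights)
print(hidden)  # both users end up with the neighbourhood average, [[2.0], [2.0]]
```

The example shows why such layers suit silent users: a node with little signal of its own inherits information from the accounts it is connected to.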
Pub Date: 2024-04-02; DOI: 10.1140/epjds/s13688-024-00468-z
Abstract
The selection of research topics by scientists can be viewed as an exploration process conducted by individuals with cognitive limitations traversing a complex cognitive landscape influenced by both individual and social factors. While existing theoretical investigations have provided valuable insights, the intricate and multifaceted nature of modern science hinders the implementation of empirical experiments. This study leverages advancements in Geographic Information System (GIS) techniques to investigate the patterns and dynamic mechanisms of topic-transition among scientists. By constructing the knowledge space across 6 large-scale disciplines, we depict the trajectories of scientists’ topic transitions within this space, measuring the flow and distance of research regions across different sub-spaces. Our findings reveal a predominantly conservative pattern of topic transition at the individual level, with scientists primarily exploring local knowledge spaces. Furthermore, simulation modeling analysis identifies research intensity, driven by the concentration of scientists within a specific region, as the key facilitator of topic transition. Conversely, the knowledge distance between fields serves as a significant barrier to exploration. Notably, despite potential opportunities for breakthrough discoveries at the intersection of subfields, empirical evidence suggests that these opportunities do not exert a strong pull on scientists, leading them to favor familiar research areas. Our study provides valuable insights into the exploration dynamics of scientific knowledge production, highlighting the influence of individual cognition, social factors, and the intrinsic structure of the knowledge landscape itself. These findings offer a framework for understanding and potentially shaping the course of scientific progress.
Title: Science as exploration in a knowledge landscape: tracing hotspots or seeking opportunity? (EPJ Data Science, https://doi.org/10.1140/epjds/s13688-024-00468-z)
Pub Date: 2024-03-26; DOI: 10.1140/epjds/s13688-024-00462-5
Abstract
Artificial Intelligence (AI) technologies have exposed more and more ethical issues while providing services to people. In most cases, it is challenging for people to recognize when an AI ethical issue occurs. The lower the public awareness, the more difficult it is to address AI ethical issues. Many previous studies have explored public reactions and opinions on AI ethical issues through questionnaires and social media platforms like Twitter. However, these approaches primarily focus on categorizing popular topics and sentiments, overlooking the public’s potential lack of knowledge underlying these issues. Few studies have revealed the holistic knowledge structure of AI ethical topics and the relations among the subtopics. As the world’s largest online encyclopedia, Wikipedia encourages people to jointly contribute and share their knowledge by adding new topics and following a well-accepted hierarchical structure. Through public viewing and editing, Wikipedia serves as a proxy for knowledge transmission. This study aims to analyze how the public comprehends the body of knowledge of AI ethics. We adopted a community detection approach to identify the hierarchical communities of AI ethical topics, and further extracted the AI ethics-related entities, which are proper nouns, organizations, and persons. The findings reveal that the primary topics in the top-level community, those most pertinent to AI ethics, predominantly revolve around knowledge-based and ethical issues. Examples include transitions from Information Theory to Internet Copyright Infringement. In summary, this study makes three contributions: (1) it presents the holistic knowledge structure of AI ethics, (2) it evaluates and improves the existing body of knowledge of AI ethics, and (3) it enhances public perception of AI ethics to mitigate the risks associated with AI technologies.
{"title":"Unveiling public perception of AI ethics: an exploration on Wikipedia data","authors":"","doi":"10.1140/epjds/s13688-024-00462-5","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00462-5","url":null,"abstract":"<h3>Abstract</h3> <p>Artificial Intelligence (AI) technologies have exposed more and more ethical issues while providing services to people. It is challenging for people to realize the occurrence of AI ethical issues in most cases. The lower the public awareness, the more difficult it is to address AI ethical issues. Many previous studies have explored public reactions and opinions on AI ethical issues through questionnaires and social media platforms like Twitter. However, these approaches primarily focus on categorizing popular topics and sentiments, overlooking the public’s potential lack of knowledge underlying these issues. Few studies revealed the holistic knowledge structure of AI ethical topics and the relations among the subtopics. As the world’s largest online encyclopedia, Wikipedia encourages people to jointly contribute and share their knowledge by adding new topics and following a well-accepted hierarchical structure. Through public viewing and editing, Wikipedia serves as a proxy for knowledge transmission. This study aims to analyze how the public comprehend the body of knowledge of AI ethics. We adopted the community detection approach to identify the hierarchical community of the AI ethical topics, and further extracted the AI ethics-related entities, which are proper nouns, organizations, and persons. The findings reveal that the primary topics at the top-level community, most pertinent to AI ethics, predominantly revolve around knowledge-based and ethical issues. Examples include transitions from Information Theory to Internet Copyright Infringement. 
In summary, this study contributes to three points, (1) to present the holistic knowledge structure of AI ethics, (2) to evaluate and improve the existing body of knowledge of AI ethics, (3) to enhance public perception of AI ethics to mitigate the risks associated with AI technologies.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"101 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140300351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-26 DOI: 10.1140/epjds/s13688-024-00461-6
Manuel Pratelli, Marinella Petrocchi, Fabio Saracco, Rocco De Nicola
For U.S. presidential elections, most states use the so-called winner-take-all system, in which the state’s presidential electors are awarded to the winning political party in the state after a popular vote phase, regardless of the actual margin of victory. Therefore, election campaigns are especially intense in states where there is no clear indication of which party will win. These states are often referred to as swing states. To measure the impact of such an election law on the campaigns, we analyze the Twitter activity surrounding the 2020 U.S. pre-election debate, with a particular focus on the spread of disinformation. We find that about 88% of the online traffic was associated with swing states. In addition, the sharing of links to unreliable news sources is significantly more prevalent in tweets associated with swing states: in this case, untrustworthy tweets are predominantly generated by automated accounts. Furthermore, we observe that the debate is mostly led by two main communities, one with a predominantly Republican affiliation and the other with accounts of different political orientations. Most of the disinformation comes from the former.
{"title":"Online disinformation in the 2020 U.S. election: swing vs. safe states","authors":"Manuel Pratelli, Marinella Petrocchi, Fabio Saracco, Rocco De Nicola","doi":"10.1140/epjds/s13688-024-00461-6","DOIUrl":"https://doi.org/10.1140/epjds/s13688-024-00461-6","url":null,"abstract":"<p>For U.S. presidential elections, most states use the so-called winner-take-all system, in which the state’s presidential electors are awarded to the winning political party in the state after a popular vote phase, regardless of the actual margin of victory. Therefore, election campaigns are especially intense in states where there is no clear direction on which party will be the winning party. These states are often referred to as <i>swing states</i>. To measure the impact of such an election law on the campaigns, we analyze the Twitter activity surrounding the 2020 US preelection debate, with a particular focus on the spread of disinformation. We find that about 88% of the online traffic was associated with swing states. In addition, the sharing of links to unreliable news sources is significantly more prevalent in tweets associated with swing states: in this case, untrustworthy tweets are predominantly generated by automated accounts. Furthermore, we observe that the debate is mostly led by two main communities, one with a predominantly Republican affiliation and the other with accounts of different political orientations. 
Most of the disinformation comes from the former.</p>","PeriodicalId":11887,"journal":{"name":"EPJ Data Science","volume":"33 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140300355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}