Commuting flow prediction is an essential task for municipal operations in the real world. Previous studies have shown that it is feasible to estimate commuting origin-destination (OD) demand within a city using multiple sources of auxiliary data. However, most existing methods do not scale to larger areas, such as a prefecture or an entire nation, owing to the increased number of geographical units that must be maintained. In addition, region representation learning is a universal approach for acquiring urban knowledge for diverse metropolitan downstream tasks. Although many researchers have developed comprehensive frameworks to describe urban units from multi-source data, they have not clarified the relationships between the selected geographical elements. Furthermore, metropolitan areas naturally exhibit ranked structures, such as cities and the districts they contain, which makes elucidating relations between cross-level urban units necessary. Therefore, we develop a heterogeneous graph-based model that generates meaningful region embeddings at multiple spatial resolutions for predicting different types of inter-level OD flows. To demonstrate the effectiveness of the proposed method, extensive experiments were conducted using real-world aggregated mobile phone datasets collected from Shizuoka Prefecture, Japan. The results indicate that our proposed model outperforms existing models under a uniform urban structure. We further provide reasonable explanations of the predicted results to enhance the credibility of the model.
{"title":"Explainable Hierarchical Urban Representation Learning for Commuting Flow Prediction","authors":"Mingfei Cai, Yanbo Pang, Yoshihide Sekimoto","doi":"arxiv-2408.14762","DOIUrl":"https://doi.org/arxiv-2408.14762","url":null,"abstract":"Commuting flow prediction is an essential task for municipal operations in\u0000the real world. Previous studies have revealed that it is feasible to estimate\u0000the commuting origin-destination (OD) demand within a city using multiple\u0000auxiliary data. However, most existing methods are not suitable to deal with a\u0000similar task at a large scale, namely within a prefecture or the whole nation,\u0000owing to the increased number of geographical units that need to be maintained.\u0000In addition, region representation learning is a universal approach for gaining\u0000urban knowledge for diverse metropolitan downstream tasks. Although many\u0000researchers have developed comprehensive frameworks to describe urban units\u0000from multi-source data, they have not clarified the relationship between the\u0000selected geographical elements. Furthermore, metropolitan areas naturally\u0000preserve ranked structures, like cities and their inclusive districts, which\u0000makes elucidating relations between cross-level urban units necessary.\u0000Therefore, we develop a heterogeneous graph-based model to generate meaningful\u0000region embeddings at multiple spatial resolutions for predicting different\u0000types of inter-level OD flows. To demonstrate the effectiveness of the proposed\u0000method, extensive experiments were conducted using real-world aggregated mobile\u0000phone datasets collected from Shizuoka Prefecture, Japan. The results indicate\u0000that our proposed model outperforms existing models in terms of a uniform urban\u0000structure. We extend the understanding of predicted results using reasonable\u0000explanations to enhance the credibility of the model.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142226904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale "pre-train and prompt learning" paradigms have demonstrated remarkable adaptability, enabling broad applications across diverse domains such as question answering, image recognition, and multimodal retrieval. This approach fully leverages the potential of large-scale pre-trained models, reducing downstream data requirements and computational costs while enhancing model applicability across various tasks. Graphs, as versatile data structures that capture relationships between entities, play pivotal roles in fields such as social network analysis, recommender systems, and biological graphs. Despite the success of pre-train and prompt learning paradigms in Natural Language Processing (NLP) and Computer Vision (CV), their application in graph domains remains nascent. In graph-structured data, not only do the node and edge features often have disparate distributions, but the topological structures also differ significantly. This diversity in graph data can lead to incompatible patterns or gaps between pre-training and fine-tuning on downstream graphs. We aim to bridge this gap by summarizing methods for alleviating these disparities. This includes exploring prompt design methodologies, comparing related techniques, assessing application scenarios and datasets, and identifying unresolved problems and challenges. This survey categorizes over 100 relevant works in this field, summarizing general design principles and the latest applications, including text-attributed graphs, molecules, proteins, and recommendation systems. Through this extensive review, we provide a foundational understanding of graph prompt learning, aiming to impact not only the graph mining community but also the broader Artificial General Intelligence (AGI) community.
{"title":"Towards Graph Prompt Learning: A Survey and Beyond","authors":"Qingqing Long, Yuchen Yan, Peiyan Zhang, Chen Fang, Wentao Cui, Zhiyuan Ning, Meng Xiao, Ning Cao, Xiao Luo, Lingjun Xu, Shiyue Jiang, Zheng Fang, Chong Chen, Xian-Sheng Hua, Yuanchun Zhou","doi":"arxiv-2408.14520","DOIUrl":"https://doi.org/arxiv-2408.14520","url":null,"abstract":"Large-scale \"pre-train and prompt learning\" paradigms have demonstrated\u0000remarkable adaptability, enabling broad applications across diverse domains\u0000such as question answering, image recognition, and multimodal retrieval. This\u0000approach fully leverages the potential of large-scale pre-trained models,\u0000reducing downstream data requirements and computational costs while enhancing\u0000model applicability across various tasks. Graphs, as versatile data structures\u0000that capture relationships between entities, play pivotal roles in fields such\u0000as social network analysis, recommender systems, and biological graphs. Despite\u0000the success of pre-train and prompt learning paradigms in Natural Language\u0000Processing (NLP) and Computer Vision (CV), their application in graph domains\u0000remains nascent. In graph-structured data, not only do the node and edge\u0000features often have disparate distributions, but the topological structures\u0000also differ significantly. This diversity in graph data can lead to\u0000incompatible patterns or gaps between pre-training and fine-tuning on\u0000downstream graphs. We aim to bridge this gap by summarizing methods for\u0000alleviating these disparities. This includes exploring prompt design\u0000methodologies, comparing related techniques, assessing application scenarios\u0000and datasets, and identifying unresolved problems and challenges. This survey\u0000categorizes over 100 relevant works in this field, summarizing general design\u0000principles and the latest applications, including text-attributed graphs,\u0000molecules, proteins, and recommendation systems. Through this extensive review,\u0000we provide a foundational understanding of graph prompt learning, aiming to\u0000impact not only the graph mining community but also the broader Artificial\u0000General Intelligence (AGI) community.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142214926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lahari Anne, The-Anh Vu-Le, Minhyuk Park, Tandy Warnow, George Chacko
Since true communities within real-world networks are rarely known, synthetic networks with planted ground truths are valuable for evaluating the performance of community detection methods. Of the synthetic network generation tools available, Stochastic Block Models (SBMs) produce networks with ground truth clusters that well approximate input parameters from real-world networks and clusterings. However, we show that SBMs can produce disconnected ground truth clusters, even when given parameters from clusterings where all clusters are connected. Here we describe the REalistic Cluster Connectivity Simulator (RECCS), a technique that modifies an SBM synthetic network to improve the fit to a given clustered real-world network with respect to edge connectivity within clusters, while maintaining the good fit with respect to other network and cluster statistics. Using real-world networks up to 13.9 million nodes in size, we show that RECCS, applied to stochastic block models, results in synthetic networks that have a better fit to cluster edge connectivity than unmodified SBMs, while providing roughly the same quality fit for other network and clustering parameters as unmodified SBMs.
{"title":"Synthetic Networks That Preserve Edge Connectivity","authors":"Lahari Anne, The-Anh Vu-Le, Minhyuk Park, Tandy Warnow, George Chacko","doi":"arxiv-2408.13647","DOIUrl":"https://doi.org/arxiv-2408.13647","url":null,"abstract":"Since true communities within real-world networks are rarely known, synthetic\u0000networks with planted ground truths are valuable for evaluating the performance\u0000of community detection methods. Of the synthetic network generation tools\u0000available, Stochastic Block Models (SBMs) produce networks with ground truth\u0000clusters that well approximate input parameters from real-world networks and\u0000clusterings. However, we show that SBMs can produce disconnected ground truth\u0000clusters, even when given parameters from clusterings where all clusters are\u0000connected. Here we describe the REalistic Cluster Connectivity Simulator\u0000(RECCS), a technique that modifies an SBM synthetic network to improve the fit\u0000to a given clustered real-world network with respect to edge connectivity\u0000within clusters, while maintaining the good fit with respect to other network\u0000and cluster statistics. Using real-world networks up to 13.9 million nodes in\u0000size, we show that RECCS, applied to stochastic block models, results in\u0000synthetic networks that have a better fit to cluster edge connectivity than\u0000unmodified SBMs, while providing roughly the same quality fit for other network\u0000and clustering parameters as unmodified SBMs.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142214962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tobias Bernecker, Ghalia Rehawi, Francesco Paolo Casale, Janine Knauer-Arloth, Annalisa Marsico
Graph generation addresses the problem of generating new graphs that have a data distribution similar to real-world graphs. While previous diffusion-based graph generation methods have shown promising results, they often struggle to scale to large graphs. In this work, we propose ARROW-Diff (AutoRegressive RandOm Walk Diffusion), a novel random walk-based diffusion approach for efficient large-scale graph generation. Our method encompasses two components in an iterative process of random walk sampling and graph pruning. We demonstrate that ARROW-Diff can scale to large graphs efficiently, surpassing other baseline methods in terms of both generation time and multiple graph statistics, reflecting the high quality of the generated graphs.
{"title":"Random Walk Diffusion for Efficient Large-Scale Graph Generation","authors":"Tobias Bernecker, Ghalia Rehawi, Francesco Paolo Casale, Janine Knauer-Arloth, Annalisa Marsico","doi":"arxiv-2408.04461","DOIUrl":"https://doi.org/arxiv-2408.04461","url":null,"abstract":"Graph generation addresses the problem of generating new graphs that have a\u0000data distribution similar to real-world graphs. While previous diffusion-based\u0000graph generation methods have shown promising results, they often struggle to\u0000scale to large graphs. In this work, we propose ARROW-Diff (AutoRegressive\u0000RandOm Walk Diffusion), a novel random walk-based diffusion approach for\u0000efficient large-scale graph generation. Our method encompasses two components\u0000in an iterative process of random walk sampling and graph pruning. We\u0000demonstrate that ARROW-Diff can scale to large graphs efficiently, surpassing\u0000other baseline methods in terms of both generation time and multiple graph\u0000statistics, reflecting the high quality of the generated graphs.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141946776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interdisciplinary collaboration is crucial for addressing complex scientific challenges. Recent advancements in large language models (LLMs) have shown significant potential in benefiting researchers across various fields. To explore the application of LLMs in scientific disciplines and their implications for interdisciplinary collaboration, we collect and analyze 50,391 papers from OpenAlex, an open-source platform for scholarly metadata. We first employ Shannon entropy to assess the diversity of collaboration in terms of authors' institutions and departments. Our results reveal that most fields have exhibited varying degrees of increased entropy following the release of ChatGPT, with Computer Science displaying a consistent increase. Other fields such as Social Science, Decision Science, Psychology, Engineering, Health Professions, and Business, Management & Accounting have shown minor to significant increases in entropy in 2024 compared to 2023. Statistical testing further indicates that the entropy in Computer Science, Decision Science, and Engineering is significantly lower than that in health-related fields like Medicine and Biochemistry, Genetics & Molecular Biology. In addition, our network analysis based on authors' affiliation information highlights the prominence of Computer Science, Medicine, and other Computer Science-related departments in LLM research. Regarding authors' institutions, our analysis reveals that entities such as Stanford University, Harvard University, University College London, and Google are key players, either dominating centrality measures or playing crucial roles in connecting research networks. Overall, this study provides valuable insights into the current landscape and evolving dynamics of collaboration networks in LLM research.
{"title":"Academic collaboration on large language model studies increases overall but varies across disciplines","authors":"Lingyao Li, Ly Dinh, Songhua Hu, Libby Hemphill","doi":"arxiv-2408.04163","DOIUrl":"https://doi.org/arxiv-2408.04163","url":null,"abstract":"Interdisciplinary collaboration is crucial for addressing complex scientific\u0000challenges. Recent advancements in large language models (LLMs) have shown\u0000significant potential in benefiting researchers across various fields. To\u0000explore the application of LLMs in scientific disciplines and their\u0000implications for interdisciplinary collaboration, we collect and analyze 50,391\u0000papers from OpenAlex, an open-source platform for scholarly metadata. We first\u0000employ Shannon entropy to assess the diversity of collaboration in terms of\u0000authors' institutions and departments. Our results reveal that most fields have\u0000exhibited varying degrees of increased entropy following the release of\u0000ChatGPT, with Computer Science displaying a consistent increase. Other fields\u0000such as Social Science, Decision Science, Psychology, Engineering, Health\u0000Professions, and Business, Management & Accounting have shown minor to\u0000significant increases in entropy in 2024 compared to 2023. Statistical testing\u0000further indicates that the entropy in Computer Science, Decision Science, and\u0000Engineering is significantly lower than that in health-related fields like\u0000Medicine and Biochemistry, Genetics & Molecular Biology. In addition, our\u0000network analysis based on authors' affiliation information highlights the\u0000prominence of Computer Science, Medicine, and other Computer Science-related\u0000departments in LLM research. Regarding authors' institutions, our analysis\u0000reveals that entities such as Stanford University, Harvard University,\u0000University College London, and Google are key players, either dominating\u0000centrality measures or playing crucial roles in connecting research networks.\u0000Overall, this study provides valuable insights into the current landscape and\u0000evolving dynamics of collaboration networks in LLM research.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141969664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuan Zhang, Laia Castro Herrero, Frank Esser, Alexandre Bovet
Selective exposure, individuals' inclination to seek out information that supports their beliefs while avoiding information that contradicts them, plays an important role in the emergence of polarization. In the political domain, selective exposure is usually measured on a left-right ideology scale, ignoring finer details. Here, we combine survey and Twitter data collected during the 2022 Brazilian Presidential Election and investigate selective exposure patterns between the survey respondents and political influencers. We analyze the followship network between survey respondents and political influencers and find a multilevel community structure that reveals a hierarchical organization more complex than a simple split between left and right. Moreover, depending on the level we consider, we find different associations between network indices of exposure patterns and 189 individual attributes of the survey respondents. For example, at finer levels, the number of influencer communities a survey respondent follows is associated with several factors, such as demographics, news consumption frequency, and incivility perception. In comparison, only their political ideology is a significant factor at coarser levels. Our work demonstrates that measuring selective exposure at a single level, such as left and right, misses important information necessary to capture this phenomenon correctly.
{"title":"More than 'Left and Right': Revealing Multilevel Online Political Selective Exposure","authors":"Yuan Zhang, Laia Castro Herrero, Frank Esser, Alexandre Bovet","doi":"arxiv-2408.03828","DOIUrl":"https://doi.org/arxiv-2408.03828","url":null,"abstract":"Selective exposure, individuals' inclination to seek out information that\u0000supports their beliefs while avoiding information that contradicts them, plays\u0000an important role in the emergence of polarization. In the political domain,\u0000selective exposure is usually measured on a left-right ideology scale, ignoring\u0000finer details. Here, we combine survey and Twitter data collected during the\u00002022 Brazilian Presidential Election and investigate selective exposure\u0000patterns between the survey respondents and political influencers. We analyze\u0000the followship network between survey respondents and political influencers and\u0000find a multilevel community structure that reveals a hierarchical organization\u0000more complex than a simple split between left and right. Moreover, depending on\u0000the level we consider, we find different associations between network indices\u0000of exposure patterns and 189 individual attributes of the survey respondents.\u0000For example, at finer levels, the number of influencer communities a survey\u0000respondent follows is associated with several factors, such as demographics,\u0000news consumption frequency, and incivility perception. In comparison, only\u0000their political ideology is a significant factor at coarser levels. Our work\u0000demonstrates that measuring selective exposure at a single level, such as left\u0000and right, misses important information necessary to capture this phenomenon\u0000correctly.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":"46 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141946777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over the past few years, many efforts have been dedicated to studying cyberbullying on social edge computing devices, most of them focusing on three roles: victims, perpetrators, and bystanders. To gain deeper insight into the formation, evolution, and intervention of cyberbullying on devices at the edge of the Internet, it is necessary to explore more fine-grained roles. This paper presents a multi-level method for role feature modeling and proposes a differential evolution-assisted K-means (DEK) method to identify diverse roles. Our work aims to provide a role identification scheme for cyberbullying scenarios in social edge computing environments, alleviating the safety issues that cyberbullying brings. Experiments on ten real-world datasets obtained from Weibo and five public datasets show that the proposed DEK outperforms existing approaches at the method level. After clustering, we obtained nine roles and analyzed the characteristics of each role and their evolution trends under different cyberbullying scenarios. The proposed approach can be deployed on devices at the edge of the Internet, yielding better real-time identification performance and adapting to the broad geographic distribution and high mobility of mobile devices.
{"title":"Role Identification based Method for Cyberbullying Analysis in Social Edge Computing","authors":"Runyu Wang, Tun Lu, Peng Zhang, Ning Gu","doi":"arxiv-2408.03502","DOIUrl":"https://doi.org/arxiv-2408.03502","url":null,"abstract":"Over the past few years, many efforts have been dedicated to studying\u0000cyberbullying in social edge computing devices, and most of them focus on three\u0000roles: victims, perpetrators, and bystanders. If we want to obtain a deep\u0000insight into the formation, evolution, and intervention of cyberbullying in\u0000devices at the edge of the Internet, it is necessary to explore more\u0000fine-grained roles. This paper presents a multi-level method for role feature\u0000modeling and proposes a differential evolution-assisted K-means (DEK) method to\u0000identify diverse roles. Our work aims to provide a role identification scheme\u0000for cyberbullying scenarios for social edge computing environments to alleviate\u0000the general safety issues that cyberbullying brings. The experiments on ten\u0000real-world datasets obtained from Weibo and five public datasets show that the\u0000proposed DEK outperforms the existing approaches on the method level. After\u0000clustering, we obtained nine roles and analyzed the characteristics of each\u0000role and their evolution trends under different cyberbullying scenarios. Our\u0000work in this paper can be placed in devices at the edge of the Internet,\u0000leading to better real-time identification performance and adapting to the\u0000broad geographic location and high mobility of mobile devices.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141946778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the emergence of social networks, online platforms dedicated to different use cases, and sensor networks, large-scale graph community detection has become a steady field of research with real-world applications. Community detection algorithms have numerous practical applications, particularly due to their scalability with data size. Nonetheless, a notable drawback of community detection algorithms is their computational intensity [Apostol2014], resulting in decreasing performance as data size increases. For this purpose, new frameworks must be developed that employ distributed systems, such as Apache Hadoop and Apache Spark, which can seamlessly handle large-scale graphs. In this paper, we propose a novel framework for community detection algorithms, i.e., K-Cliques, Louvain, and Fast Greedy, developed using Apache Spark GraphFrames. We test their performance and scalability on two real-world datasets. The experimental results prove the feasibility of developing graph mining algorithms using Apache Spark GraphFrames.
{"title":"Large-Scale Graphs Community Detection using Spark GraphFrames","authors":"Elena-Simona Apostol, Adrian-Cosmin Cojocaru, Ciprian-Octavian Truică","doi":"arxiv-2408.03966","DOIUrl":"https://doi.org/arxiv-2408.03966","url":null,"abstract":"With the emergence of social networks, online platforms dedicated to\u0000different use cases, and sensor networks, the emergence of large-scale graph\u0000community detection has become a steady field of research with real-world\u0000applications. Community detection algorithms have numerous practical\u0000applications, particularly due to their scalability with data size.\u0000Nonetheless, a notable drawback of community detection algorithms is their\u0000computational intensity~cite{Apostol2014}, resulting in decreasing performance\u0000as data size increases. For this purpose, new frameworks that employ\u0000distributed systems such as Apache Hadoop and Apache Spark which can seamlessly\u0000handle large-scale graphs must be developed. In this paper, we propose a novel\u0000framework for community detection algorithms, i.e., K-Cliques, Louvain, and\u0000Fast Greedy, developed using Apache Spark GraphFrames. We test their\u0000performance and scalability on two real-world datasets. The experimental\u0000results prove the feasibility of developing graph mining algorithms using\u0000Apache Spark GraphFrames.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141946775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Erfan Samieyan Sahneh, Gianluca Nogara, Matthew R. DeVerna, Nick Liu, Luca Luceri, Filippo Menczer, Francesco Pierri, Silvia Giordano
Bluesky is a Twitter-like decentralized social media platform that has recently grown in popularity. After an invite-only period, it opened to the public worldwide on February 6th, 2024. In this paper, we provide a longitudinal analysis of user activity in the two months around the opening, studying changes in the general characteristics of the platform due to the rapid growth of the user base. We observe a broad distribution of activity similar to more established platforms, but a higher volume of original than reshared content, and very low toxicity. After opening to the public, Bluesky experienced a large surge in new users and activity, especially in posts in English and Japanese. In particular, several accounts entered the discussion with suspicious behavior, like following many accounts and sharing content from low-credibility news outlets. Some of these have already been classified as spam or suspended, suggesting effective moderation.
{"title":"The Dawn of Decentralized Social Media: An Exploration of Bluesky's Public Opening","authors":"Erfan Samieyan Sahneh, Gianluca Nogara, Matthew R. DeVerna, Nick Liu, Luca Luceri, Filippo Menczer, Francesco Pierri, Silvia Giordano","doi":"arxiv-2408.03146","DOIUrl":"https://doi.org/arxiv-2408.03146","url":null,"abstract":"Bluesky is a Twitter-like decentralized social media platform that has\u0000recently grown in popularity. After an invite-only period, it opened to the\u0000public worldwide on February 6th, 2024. In this paper, we provide a\u0000longitudinal analysis of user activity in the two months around the opening,\u0000studying changes in the general characteristics of the platform due to the\u0000rapid growth of the user base. We observe a broad distribution of activity\u0000similar to more established platforms, but a higher volume of original than\u0000reshared content, and very low toxicity. After opening to the public, Bluesky\u0000experienced a large surge in new users and activity, especially posting English\u0000and Japanese content. In particular, several accounts entered the discussion\u0000with suspicious behavior, like following many accounts and sharing content from\u0000low-credibility news outlets. Some of these have already been classified as\u0000spam or suspended, suggesting effective moderation.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":"55 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141946779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jibing Gong, Jiquan Peng, Jin Qu, ShuYing Du, Kaiyu Wang
Detecting Twitter Bots is crucial for maintaining the integrity of online discourse, safeguarding democratic processes, and preventing the spread of malicious propaganda. However, advanced Twitter Bots today often employ sophisticated feature manipulation and account farming techniques to blend seamlessly with genuine user interactions, posing significant challenges to existing detection models. In response to these challenges, this paper proposes a novel Twitter Bot Detection framework called BotSAI. This framework enhances the consistency of multimodal user features, accurately characterizing various modalities to distinguish between real users and bots. Specifically, the architecture integrates information from users, textual content, and heterogeneous network topologies, leveraging customized encoders to obtain comprehensive user feature representations. The heterogeneous network encoder efficiently aggregates information from neighboring nodes through oversampling techniques and local relationship transformers. Subsequently, a multi-channel representation mechanism maps user representations into invariant and specific subspaces, enhancing the feature vectors. Finally, a self-attention mechanism is introduced to integrate and refine the enhanced user representations, enabling efficient information interaction. Extensive experiments demonstrate that BotSAI outperforms existing state-of-the-art methods on two major Twitter Bot Detection benchmarks, exhibiting superior performance. Additionally, systematic experiments reveal the impact of different social relationships on detection accuracy, providing novel insights for the identification of social bots.
{"title":"Enhancing Twitter Bot Detection via Multimodal Invariant Representations","authors":"Jibing Gong, Jiquan Peng, Jin Qu, ShuYing Du, Kaiyu Wang","doi":"arxiv-2408.03096","DOIUrl":"https://doi.org/arxiv-2408.03096","url":null,"abstract":"Detecting Twitter Bots is crucial for maintaining the integrity of online\u0000discourse, safeguarding democratic processes, and preventing the spread of\u0000malicious propaganda. However, advanced Twitter Bots today often employ\u0000sophisticated feature manipulation and account farming techniques to blend\u0000seamlessly with genuine user interactions, posing significant challenges to\u0000existing detection models. In response to these challenges, this paper proposes\u0000a novel Twitter Bot Detection framework called BotSAI. This framework enhances\u0000the consistency of multimodal user features, accurately characterizing various\u0000modalities to distinguish between real users and bots. Specifically, the\u0000architecture integrates information from users, textual content, and\u0000heterogeneous network topologies, leveraging customized encoders to obtain\u0000comprehensive user feature representations. The heterogeneous network encoder\u0000efficiently aggregates information from neighboring nodes through oversampling\u0000techniques and local relationship transformers. Subsequently, a multi-channel\u0000representation mechanism maps user representations into invariant and specific\u0000subspaces, enhancing the feature vectors. Finally, a self-attention mechanism\u0000is introduced to integrate and refine the enhanced user representations,\u0000enabling efficient information interaction. Extensive experiments demonstrate\u0000that BotSAI outperforms existing state-of-the-art methods on two major Twitter\u0000Bot Detection benchmarks, exhibiting superior performance. Additionally,\u0000systematic experiments reveal the impact of different social relationships on\u0000detection accuracy, providing novel insights for the identification of social\u0000bots.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141969666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}