In the broader machine learning literature, data-generation methods demonstrate promising results by augmenting sparse labels with additional informative training examples. Such methods are less studied in graphs due to the intricate dependencies among nodes in complex topology structures. This paper presents a novel node generation method that infuses a small set of high-quality synthesized nodes into the graph as additional labeled nodes to optimally expand the propagation of labeled information. By simply infusing additional nodes, the framework is orthogonal to the graph learning and downstream classification techniques, and thus is compatible with most popular graph pre-training (self-supervised learning), semi-supervised learning, and meta-learning methods. The contribution lies in designing the generated node set by solving a novel optimization problem. The optimization places the generated nodes in a manner that: (1) minimizes the classification loss to guarantee training accuracy and (2) maximizes label propagation to low-confidence nodes in the downstream task to ensure high-quality propagation. Theoretically, we show that the above dual optimization maximizes the global confidence of node classification. Our experiments demonstrate statistically significant performance improvements over 14 baselines on 10 publicly available datasets.
{"title":"Virtual Node Generation for Node Classification in Sparsely-Labeled Graphs","authors":"Hang Cui, Tarek Abdelzaher","doi":"arxiv-2409.07712","DOIUrl":"https://doi.org/arxiv-2409.07712","journal":{"name":"arXiv - CS - Social and Information Networks"},"publicationDate":"2024-09-12"}
Gangda Deng, Hongkuan Zhou, Rajgopal Kannan, Viktor Prasanna
Heterophilous graphs, where dissimilar nodes tend to connect, pose a challenge for graph neural networks (GNNs) as their superior performance typically comes from aggregating homophilous information. Increasing the GNN depth can expand the scope (i.e., receptive field), potentially finding homophily from the higher-order neighborhoods. However, uniformly expanding the scope results in subpar performance since real-world graphs often exhibit homophily disparity between nodes. An ideal solution is personalized scoping, which allows nodes to have different scope sizes. Existing methods typically add node-adaptive weights for each hop. Although expressive, they inevitably suffer from severe overfitting. To address this issue, we formalize personalized scoping as a separate scope classification problem that overcomes GNN overfitting in node classification. Specifically, we predict the optimal GNN depth for each node. Our theoretical and empirical analysis suggests that accurately predicting the depth can significantly enhance generalization. We further propose Adaptive Scope (AS), a lightweight MLP-based approach that only participates in GNN inference. AS encodes structural patterns and predicts the depth to select the best model for each node's prediction. Experimental results show that AS is highly flexible with various GNN architectures across a wide range of datasets while significantly improving accuracy.
{"title":"Learning Personalized Scoping for Graph Neural Networks under Heterophily","authors":"Gangda Deng, Hongkuan Zhou, Rajgopal Kannan, Viktor Prasanna","doi":"arxiv-2409.06998","DOIUrl":"https://doi.org/arxiv-2409.06998","journal":{"name":"arXiv - CS - Social and Information Networks"},"publicationDate":"2024-09-11"}
We propose and demonstrate the DisasterNeedFinder (DNF) framework to provide appropriate information support for the Noto Peninsula Earthquake. In the event of a large-scale disaster, it is essential to accurately capture the ever-changing information needs. However, it is difficult to obtain appropriate information from the chaotic situation on the ground. Therefore, as a data-driven approach, we aim to identify precise information needs at the site by jointly analyzing the location information of disaster victims and search information. Analyzing search history in disaster areas alone does not yield a clear estimate of information needs, because the data are noisy and the number of users is small. We therefore measure the magnitude of an information need not by the volume of searches but by the degree of abnormality in searches, which enables an appropriate understanding of the information needs of disaster victims. DNF has been continuously clarifying the information needs of disaster areas since the disaster struck, and has been recognized as a new approach to supporting disaster areas, having been featured in major Japanese media on several occasions.
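The contrast between ranking by search volume and ranking by abnormality can be sketched as follows. The abstract does not say how abnormality is measured, so the z-score against each query's own history is an assumption, and the function name `abnormality`, the query strings, and all counts are hypothetical illustrations rather than DNF's actual method or data.

```python
from statistics import mean, stdev


def abnormality(baseline, observed):
    """Z-score of the observed count against the query's own history.

    Assumption: a plain z-score stands in for the "degree of abnormality";
    the paper may use a different measure.
    """
    sigma = stdev(baseline)
    return (observed - mean(baseline)) / sigma if sigma > 0 else 0.0


# Hypothetical daily search counts (illustrative values, not from the paper).
history = {
    "weather": [100, 110, 90, 105, 95],
    "water supply": [2, 3, 1, 2, 2],
}
today = {"weather": 120, "water supply": 15}

# Ranking by raw volume favors the query that is always popular.
by_volume = max(today, key=today.get)

# Ranking by abnormality favors the query that deviates most from its norm.
by_abnormality = max(today, key=lambda q: abnormality(history[q], today[q]))

print("top by volume:", by_volume)
print("top by abnormality:", by_abnormality)
```

In this toy data the rarely searched query has far fewer searches than the popular one, yet it departs much more sharply from its usual level, which is the kind of signal that volume alone would hide.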
{"title":"DisasterNeedFinder: Understanding the Information Needs in the 2024 Noto Earthquake (Comprehensive Explanation)","authors":"Kota Tsubouchi, Shuji Yamaguchi, Keijirou Saitou, Akihisa Soemori, Masato Morita, Shigeki Asou","doi":"arxiv-2409.07102","DOIUrl":"https://doi.org/arxiv-2409.07102","journal":{"name":"arXiv - CS - Social and Information Networks"},"publicationDate":"2024-09-11"}
This study explores the conceptual development of a medical insurance catalogue voting system. The methodology is centred on creating a model where doctors would vote on treatment inclusions, aiming to demonstrate transparency and integrity. The results from Monte Carlo simulations suggest a robust consensus on the selection of medicines and treatments. Further theoretical investigations propose incorporating a patient outcome-based incentive mechanism. This conceptual approach could enhance decision-making in healthcare by aligning stakeholder interests with patient outcomes, aiming for an optimised, equitable insurance catalogue, with potential blockchain-based smart contracts to ensure transparency and integrity.
{"title":"A Novel Voting System for Medical Catalogues in National Health Insurance","authors":"Xingyuan Liang, Haibao Wen","doi":"arxiv-2409.07057","DOIUrl":"https://doi.org/arxiv-2409.07057","journal":{"name":"arXiv - CS - Social and Information Networks"},"publicationDate":"2024-09-11"}
Russian Internet trolls use fake personas to spread disinformation through multiple social media streams. Given the increased frequency of this threat across social media platforms, understanding those operations is paramount in combating their influence. Using Twitter content identified as part of the Russian influence network, we created a predictive model to map the network operations. We classify account types based on their authenticity function for a sub-sample of accounts by introducing logical categories and training a predictive model to identify similar behavior patterns across the network. Our model attains 88% prediction accuracy on the test set. Validation is done by comparing the similarities with the 3 million Russian troll tweets dataset. The result indicates a 90.7% similarity between the two datasets. Furthermore, we compare our model predictions on a Russian tweets dataset, and the results show a 90.5% correspondence between the predictions and the actual categories. The prediction and validation results suggest that our predictive model can assist with mapping the actors in such networks.
{"title":"Mapping the Russian Internet Troll Network on Twitter using a Predictive Model","authors":"Sachith Dassanayaka, Ori Swed, Dimitri Volchenkov","doi":"arxiv-2409.08305","DOIUrl":"https://doi.org/arxiv-2409.08305","journal":{"name":"arXiv - CS - Social and Information Networks"},"publicationDate":"2024-09-11"}
This study examines whether positive news about firms increases their stock prices and, moreover, whether it increases stock prices of the firms' suppliers and customers, using a large sample of publicly listed firms across the world and another of Japanese listed firms. The level of positiveness of each news article is determined by FinBERT, a natural language processing model fine-tuned specifically for financial information. Supply chains of firms across the world are identified mostly by financial statements, while those of Japanese firms are taken from large-scale firm-level surveys. We find that positive news increases the change rate of stock prices of firms mentioned in the news before its disclosure, most likely because of diffusion of information through informal channels. Positive news also raises stock prices of the firms' suppliers and customers before its disclosure, confirming propagation of market values through supply chains. In addition, we generally find a larger post-news effect on stock prices of the mentioned firms and their suppliers and customers than the pre-news effect. The positive difference between the post- and pre-news effects can be considered as the net effect of the disclosure of positive news, controlling for informal information diffusion. However, the post-news effect on suppliers and customers in Japan is smaller than the pre-news effect, a result opposite to those from firms across the world. This notable result is possibly because supply chain links of Japanese firms are stronger than global supply chains while such knowledge is restricted to selected investors.
{"title":"Market Reaction to News Flows in Supply Chain Networks","authors":"Hiroyasu Inoue, Yasuyuki Todo","doi":"arxiv-2409.06255","DOIUrl":"https://doi.org/arxiv-2409.06255","journal":{"name":"arXiv - CS - Social and Information Networks"},"publicationDate":"2024-09-10"}
The forest matrix plays a crucial role in network science, opinion dynamics, and machine learning, offering deep insights into the structure of and dynamics on networks. In this paper, we study the problem of querying entries of the forest matrix in evolving graphs, which more accurately represent the dynamic nature of real-world networks compared to static graphs. To address the unique challenges posed by evolving graphs, we first introduce two approximation algorithms, SFQ and SFQPlus, for static graphs. SFQ employs a probabilistic interpretation of the forest matrix, while SFQPlus incorporates a novel variance reduction technique and is theoretically proven to offer enhanced accuracy. Based on these two algorithms, we further devise two dynamic algorithms centered around efficiently maintaining a list of spanning converging forests. This approach ensures $O(1)$ runtime complexity for updates, including edge additions and deletions, as well as for querying matrix elements, and provides an unbiased estimation of forest matrix entries. Finally, through extensive experiments on various real-world networks, we demonstrate the efficiency and effectiveness of our algorithms. Particularly, our algorithms are scalable to massive graphs with more than forty million nodes.
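For readers unfamiliar with the object being queried, the forest matrix is commonly defined from the graph Laplacian as below. The abstract does not state the definition, so the symbols $\Omega$, $L$, $D$, $A$, and $\omega_{ij}$ are assumed notation following the standard convention, and the two-node undirected graph is only a worked illustration.

```latex
\begin{align*}
  \Omega &= (I + L)^{-1}, \qquad L = D - A, \\[4pt]
  % Worked example: two nodes joined by a single undirected edge.
  L &= \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix},
  \qquad
  I + L = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix},
  \qquad
  \det(I + L) = 3, \\[4pt]
  \Omega &= \frac{1}{3} \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}.
\end{align*}
```

The example matches the probabilistic reading of the matrix: this graph has three spanning rooted forests (no edge, or the edge with either endpoint as root), and $\omega_{ij}$ is the fraction of them in which node $i$ belongs to the tree rooted at $j$, giving $2/3$ on the diagonal and $1/3$ off it.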
{"title":"Fast Computation for the Forest Matrix of an Evolving Graph","authors":"Haoxin Sun, Xiaotian Zhou, Zhongzhi Zhang","doi":"arxiv-2409.05503","DOIUrl":"https://doi.org/arxiv-2409.05503","journal":{"name":"arXiv - CS - Social and Information Networks"},"publicationDate":"2024-09-09"}
Kemeny's constant for random walks on a graph is defined as the mean hitting time from one node to another selected randomly according to the stationary distribution. It has found numerous applications and attracted considerable research interest. However, exact computation of Kemeny's constant requires matrix inversion, which scales poorly for large networks with millions of nodes. Existing approximation algorithms either leverage properties exclusive to undirected graphs or involve inefficient simulation, leaving room for further optimization. To address these limitations for directed graphs, we propose two novel approximation algorithms for estimating Kemeny's constant on directed graphs with theoretical error guarantees. Extensive numerical experiments on real-world networks validate the superiority of our algorithms over baseline methods in terms of efficiency and accuracy.
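The definition above can be checked on a toy directed graph with the exact, linear-system approach that the abstract describes as too costly at scale. This is a minimal sketch, not one of the paper's approximation algorithms: the three-node graph and the helper names `solve`, `stationary`, `hitting_times_to`, and `kemeny` are hypothetical, and hitting times use the convention that the time from a node to itself is zero.

```python
from fractions import Fraction


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def stationary(p):
    """Stationary distribution: pi P = pi with sum(pi) = 1."""
    n = len(p)
    # Rows of (P^T - I) pi = 0, with the last row replaced by sum(pi) = 1.
    a = [[p[j][i] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    a[-1] = [Fraction(1)] * n
    return solve(a, [Fraction(0)] * (n - 1) + [Fraction(1)])


def hitting_times_to(p, target):
    """h[i] = expected steps to first reach `target` from i; h[target] = 0."""
    n = len(p)
    others = [i for i in range(n) if i != target]
    a = [[(1 if i == j else 0) - p[i][j] for j in others] for i in others]
    h = [Fraction(0)] * n
    for i, value in zip(others, solve(a, [Fraction(1)] * len(others))):
        h[i] = value
    return h


def kemeny(p, start):
    """Mean hitting time from `start` to a target drawn from pi."""
    pi = stationary(p)
    return sum(pi[j] * hitting_times_to(p, j)[start] for j in range(len(p)))


# Hypothetical strongly connected digraph: 0->1, 0->2, 1->2, 2->0.
edges = {0: [1, 2], 1: [2], 2: [0]}
P = [[Fraction(1, len(edges[i])) if j in edges[i] else Fraction(0)
      for j in range(3)] for i in range(3)]

for s in range(3):
    print("start", s, "Kemeny constant", kemeny(P, s))
```

The value comes out the same for every start node, which is the property that makes it a constant of the graph. The elimination step is cubic in the number of nodes, which is why this direct route does not scale to networks with millions of nodes.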
{"title":"Fast Computation of Kemeny's Constant for Directed Graphs","authors":"Haisong Xia, Zhongzhi Zhang","doi":"arxiv-2409.05471","DOIUrl":"https://doi.org/arxiv-2409.05471","journal":{"name":"arXiv - CS - Social and Information Networks"},"publicationDate":"2024-09-09"}
Henrique F. de Arruda, Sandro M. Reia, Shiyang Ruan, Kuldip S. Atwal, Hamdi Kavak, Taylor Anderson, Dieter Pfoser
Building type information is crucial for population estimation, traffic planning, urban planning, and emergency response applications. Although essential, such data is often not readily available. To alleviate this problem, this work creates a comprehensive dataset by providing residential/non-residential building classification covering the entire United States. We propose and utilize an unsupervised machine learning method to classify building types based on building footprints and available OpenStreetMap (OSM) information. The classification result is validated using authoritative ground truth data for select counties in the U.S. The validation shows a high precision for non-residential building classification and a high recall for residential buildings. We identified various approaches to improving the quality of the classification, such as removing sheds and garages from the dataset. Furthermore, analyzing the misclassifications revealed that they are mainly due to missing and scarce metadata in OSM. A major result of this work is a dataset that classifies 67,705,475 buildings. We hope that this data is of value to the scientific community, including urban and transportation planners.
{"title":"Extracting the U.S. building types from OpenStreetMap data","authors":"Henrique F. de Arruda, Sandro M. Reia, Shiyang Ruan, Kuldip S. Atwal, Hamdi Kavak, Taylor Anderson, Dieter Pfoser","doi":"arxiv-2409.05692","DOIUrl":"https://doi.org/arxiv-2409.05692","journal":{"name":"arXiv - CS - Social and Information Networks"},"publicationDate":"2024-09-09"}
The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by the WHO. No prior work related to social media mining has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper aims to address this research gap and makes two scientific contributions to this field. First, it presents a multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. The dataset, available at https://dx.doi.org/10.21227/7fvc-y093, contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were performed. This process included classifying each post into (i) one of the sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral, (ii) hate or not hate, and (iii) anxiety/stress detected or no anxiety/stress detected. These results are presented as separate attributes in the dataset. Second, this paper presents the results of performing sentiment analysis, hate speech analysis, and anxiety or stress analysis. The shares of posts in the sentiment classes fear, surprise, joy, sadness, anger, disgust, and neutral were 27.95%, 2.57%, 8.69%, 5.94%, 2.69%, 1.53%, and 50.64%, respectively. In terms of hate speech detection, 95.75% of the posts did not contain hate and the remaining 4.25% of the posts contained hate. Finally, 72.05% of the posts did not indicate any anxiety/stress, and the remaining 27.95% of the posts represented some form of anxiety/stress.
{"title":"Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis","authors":"Nirmalya Thakur","doi":"arxiv-2409.05292","DOIUrl":"https://doi.org/arxiv-2409.05292","journal":{"name":"arXiv - CS - Social and Information Networks"},"publicationDate":"2024-09-09"}