Tong Xia, Junjie Lin, Yong Li, Jie Feng, Pan Hui, Funing Sun, Diansheng Guo, Depeng Jin
Crowd flow prediction is an essential task benefiting a wide range of applications for the transportation system and public safety. However, it is a challenging problem due to the complex spatio-temporal dependence and the complicated impact of urban structure on the crowd flow patterns. In this article, we propose a novel framework, 3-Dimensional Graph Convolution Network (3DGCN), to predict citywide crowd flow. We first model it as a dynamic spatio-temporal graph prediction problem, where each node represents a region with time-varying flows, and each edge represents the origin–destination (OD) flow between its corresponding regions. As such, OD flows among regions are treated as a proxy for the spatial interactions among regions. To tackle the complex spatio-temporal dependence, our proposed 3DGCN can model the correlation among graph spatial and temporal neighbors simultaneously. To learn and incorporate urban structures in crowd flow prediction, we design the GCN aggregator to be learned from both crowd flow prediction and region function inference at the same time. Extensive experiments with real-world datasets in two cities demonstrate that our model outperforms state-of-the-art baselines by 9.6%∼19.5% for the next-time-interval prediction.
{"title":"3DGCN: 3-Dimensional Dynamic Graph Convolutional Network for Citywide Crowd Flow Prediction","authors":"Tong Xia, Junjie Lin, Yong Li, Jie Feng, Pan Hui, Funing Sun, Diansheng Guo, Depeng Jin","doi":"10.1145/3451394","DOIUrl":"https://doi.org/10.1145/3451394","url":null,"PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124097391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
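As a minimal sketch of the graph formulation described in the 3DGCN abstract above (regions as nodes with time-varying flows, OD flows as edges), the following Python snippet builds one graph snapshot per time interval from origin–destination trip records. The record layout `(interval, origin, destination, count)`, the function name `build_od_snapshots`, and the choice of (outflow, inflow) as node features are assumptions made for illustration; they are not taken from the paper, and no convolution step is shown.

```python
from collections import defaultdict


def build_od_snapshots(trips):
    """Group OD trip records into one graph snapshot per time interval.

    trips: iterable of (interval, origin, destination, count) tuples
           (hypothetical record layout, assumed for this sketch).
    Returns {interval: {"edges": {(o, d): flow},
                        "nodes": {region: [outflow, inflow]}}}.
    """
    snapshots = {}
    for interval, origin, dest, count in trips:
        snap = snapshots.setdefault(
            interval,
            {"edges": defaultdict(int), "nodes": defaultdict(lambda: [0, 0])},
        )
        snap["edges"][(origin, dest)] += count  # edge weight = OD flow
        snap["nodes"][origin][0] += count       # outflow of the origin region
        snap["nodes"][dest][1] += count         # inflow of the destination region
    return snapshots


# Toy trip records (made up for illustration).
trips = [
    (0, "A", "B", 5),
    (0, "B", "A", 2),
    (0, "A", "C", 1),
    (1, "C", "A", 4),
]
graph = build_od_snapshots(trips)
print(dict(graph[0]["edges"]))
print(dict(graph[0]["nodes"]))
```

Because the edge weights are recomputed for every interval, the sequence of snapshots is a dynamic graph: both the node features and the adjacency change over time.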
Abundant sequential documents, such as online archives, social media, and news feeds, are updated as streams, in which each chunk of documents carries smoothly evolving yet dependent topics. Such digital texts have attracted extensive research on dynamic topic modeling to infer hidden evolving topics and their temporal dependencies. However, most existing approaches focus on single-topic-thread evolution and ignore the fact that a current topic may be coupled with multiple relevant prior topics. In addition, these approaches face an intractable inference problem when inferring latent parameters, resulting in high computational cost and performance degradation. In this work, we assume that a current topic evolves from all prior topics with corresponding coupling weights, forming a multi-topic-thread evolution. Our method models the dependencies between evolving topics and thoroughly encodes their complex multi-couplings across time steps. To overcome the intractable inference challenge, we propose a new solution with a set of novel data augmentation techniques, which successfully decomposes the multi-couplings between evolving topics. A fully conjugate model is thus obtained to guarantee the effectiveness and efficiency of the inference technique. A novel Gibbs sampler with a backward–forward filter algorithm efficiently learns latent time-evolving parameters in closed form. In addition, the latent Indian Buffet Process compound distribution is exploited to automatically infer the overall topic number and customize the sparse topic proportions for each sequential document without bias. The proposed method is evaluated on both synthetic and real-world datasets against competitive baselines, demonstrating its superiority in terms of lower per-word perplexity, more coherent topics, and better document time prediction.
{"title":"Recurrent Coupled Topic Modeling over Sequential Documents","authors":"Jinjin Guo, Longbing Cao, Zhiguo Gong","doi":"10.1145/3451530","DOIUrl":"https://doi.org/10.1145/3451530","url":null,"PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130916851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Locality preserving projection (LPP) is a dimensionality reduction algorithm that preserves the neighborhood graph structure of data. However, conventional LPP is sensitive to outliers in the data. This article proposes a novel low-rank LPP model called LR-LPP. In this new model, the original data are decomposed into a clean intrinsic component and a noise component. The projective matrix is then learned from the clean intrinsic component, which is encoded in low-rank features. The noise component is constrained by the ℓ1-norm, which is more robust to outliers. Finally, the LR-LPP model is extended to LR-FLPP, in which the low-dimensional features are measured by the F-norm. LR-FLPP reduces the aggregated error and weakens the effect of outliers, which makes it even more robust to outliers. The experimental results on public image databases demonstrate the effectiveness of the proposed LR-LPP and LR-FLPP.
{"title":"Robust Image Representation via Low Rank Locality Preserving Projection","authors":"Shuai Yin, Yanfeng Sun, Junbin Gao, Yongli Hu, Boyue Wang, Baocai Yin","doi":"10.1145/3434768","DOIUrl":"https://doi.org/10.1145/3434768","url":null,"PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"900 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116396345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benhui Zhang, Maoguo Gong, Jianbin Huang, Xiaoke Ma
Many complex systems derived from nature and society consist of multiple types of entities and heterogeneous interactions, which can be effectively modeled as a heterogeneous information network (HIN). Structural analysis of heterogeneous networks, which leverages the rich semantic information of objects and links in these networks, is of great significance. Clustering heterogeneous networks aims to group vertices into classes, which sheds light on the structure–function relations of the underlying systems. Current algorithms perform feature extraction and clustering independently and are criticized for not fully characterizing the structure of clusters. In this study, we propose a learning model based on joint Graph Embedding and Nonnegative Matrix Factorization (aka GEjNMF), where feature extraction and clustering are learned simultaneously by exploiting the graph embedding and latent structure of networks. We formulate the objective function of GEjNMF and transform the heterogeneous network clustering problem into a constrained optimization problem, which is effectively solved by l0-norm optimization. The advantage of GEjNMF is that features are selected under the guidance of clustering, which both improves performance and reduces running time. The experimental results on three benchmark heterogeneous networks demonstrate that GEjNMF achieves the best performance with the least running time compared with state-of-the-art methods. Furthermore, the proposed algorithm is robust across heterogeneous networks from various fields. The proposed model and method provide an effective alternative for heterogeneous network clustering.
{"title":"Clustering Heterogeneous Information Network by Joint Graph Embedding and Nonnegative Matrix Factorization","authors":"Benhui Zhang, Maoguo Gong, Jianbin Huang, Xiaoke Ma","doi":"10.1145/3441449","DOIUrl":"https://doi.org/10.1145/3441449","url":null,"PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"292 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116011520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huan Zhao, Quanming Yao, Yangqiu Song, J. Kwok, Lee
Collaborative filtering (CF) has been one of the most important and popular recommendation methods; it aims at predicting users’ preferences (ratings) based on their past behaviors. Recently, various types of side information beyond the explicit ratings users give to items, such as social connections among users and metadata of items, have been introduced into CF and shown to be useful for improving recommendation performance. However, previous works process different types of information separately, thus failing to capture the correlations that might exist across them. To address this problem, in this work, we study the application of a heterogeneous information network (HIN), which offers a unifying and flexible representation of different types of side information, to enhance CF-based recommendation methods. HIN-based recommendation raises two challenging issues: how to capture similarities of complex semantics between users and items in a HIN, and how to effectively fuse these similarities to improve final recommendation performance. To address these issues, we apply metagraphs to similarity computation and solve the information fusion problem with a “matrix factorization (MF) + factorization machine (FM)” framework. For the MF part, we obtain the user-item similarity matrix from each metagraph and then apply low-rank matrix approximation to obtain latent features for both users and items. For the FM part, we apply FM with group lasso (FMG) to the features obtained from the MF part to train the recommendation model and, at the same time, identify the useful metagraphs. Besides FMG, a two-stage method, we further propose an end-to-end method, hierarchical attention fusing, to fuse metagraph-based similarities for the final recommendation. Experimental results on four large real-world datasets show that the two proposed frameworks significantly outperform existing state-of-the-art methods in terms of recommendation performance.
{"title":"Side Information Fusion for Recommender Systems over Heterogeneous Information Network","authors":"Huan Zhao, Quanming Yao, Yangqiu Song, J. Kwok, Lee","doi":"10.1145/3441446","DOIUrl":"https://doi.org/10.1145/3441446","url":null,"PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129687970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
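To make the "FM part" of the framework in the abstract above more concrete, here is a minimal sketch of the standard second-order factorization machine scoring function, computed both with the usual linear-time reformulation of the pairwise term and with the direct double loop. It is not the authors' FMG: the group-lasso regularization, the training procedure, and the metagraph-derived features are omitted, and all names and numeric values below are made up for illustration.

```python
def fm_score(x, w0, w, V):
    """Second-order factorization machine score for one feature vector.

    x: list of n feature values; w0: bias; w: n linear weights;
    V: n rows of k latent factors. Uses the identity
      sum_{i<j} <V_i, V_j> x_i x_j
        = 0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i (V_if x_i)^2].
    """
    n = len(x)
    k = len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_score_naive(x, w0, w, V):
    """Same score with the pairwise term written out directly, O(k * n^2)."""
    n = len(x)
    total = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


# Toy values (hypothetical): x stands in for concatenated user and item
# latent features produced by the MF part.
x = [1.0, 0.5, 2.0]
w0 = 0.1
w = [0.2, -0.3, 0.4]
V = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
print(fm_score(x, w0, w, V), fm_score_naive(x, w0, w, V))
```

Both functions should agree up to floating-point error; the first is the form normally used in practice because its cost grows linearly with the number of features.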
Wissam Al Jurdi, J. B. Abdo, J. Demerjian, A. Makhoul
Recommender systems have been upgraded, tested, and applied in many, often incomparable, ways. In attempts to understand user behavior in particular environments, these systems have been frequently used in domains such as e-commerce, e-learning, and tourism. Their growing necessity and popularity have given rise to numerous research paths on major issues such as data sparsity, cold start, malicious noise, and natural noise, which severely limit their performance. The data that fuel these systems are generally expected to be highly reliable. Inconsistent user information in datasets can alter the performance of recommenders, even when they run advanced personalization algorithms. The consequences can be costly, as such systems are employed in many online businesses. Successfully managing these inconsistencies results in more personalized user experiences. In this article, previous works on natural noise management in recommender datasets are thoroughly analyzed. We explore the ways in which the proposed methods measure improved performance and touch on the different natural noise management techniques and the attributes of the solutions. Additionally, we test the evaluation methods employed to assess the approaches and discuss several key gaps and other improvements the field should realize in the future. Our work considers the likelihood of a modern research branch on natural noise management and recommender assessment.
{"title":"Critique on Natural Noise in Recommender Systems","authors":"Wissam Al Jurdi, J. B. Abdo, J. Demerjian, A. Makhoul","doi":"10.1145/3447780","DOIUrl":"https://doi.org/10.1145/3447780","url":null,"PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114523252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lin Yue, Hao Shen, Sen Wang, R. Boots, Guodong Long, Weitong Chen, Xiaowei Zhao
Brain–computer interface (BCI) control technology, which uses motor imagery to perform the desired action instead of manual operation, will be widely used in smart environments. However, most of the research lacks a robust feature representation of multi-channel EEG series, resulting in low intention recognition accuracy. This article proposes EEG2Image-based Denoised-ConvNets (called EID) to enhance the feature representation for the intention recognition task. Specifically, we perform signal decomposition, slicing, and image mapping to reduce the noise from irrelevant frequency bands. After that, we construct the Denoised-ConvNets structure to learn the colorspace and spatial variations of image objects without precisely cropping new training images. By further utilizing the color and spatial transformation layers, the colorspace and colored area of image objects are enhanced and enlarged, respectively. In the multi-classification scenario, extensive experiments on publicly available EEG datasets confirm that the proposed method performs better than state-of-the-art methods.
{"title":"Exploring BCI Control in Smart Environments","authors":"Lin Yue, Hao Shen, Sen Wang, R. Boots, Guodong Long, Weitong Chen, Xiaowei Zhao","doi":"10.1145/3450449","DOIUrl":"https://doi.org/10.1145/3450449","url":null,"PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122897182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bo Liu, Xi He, Mingdong Song, Jianqiang Li, Guangzhi Qu, Jianlei Lang, Rentao Gu
Atmospheric visibility is an indicator of atmospheric transparency, and its range directly reflects the quality of the atmospheric environment. With the acceleration of industrialization and urbanization, the natural environment has suffered damage. In recent decades, the level of atmospheric visibility has shown an overall downward trend. A decrease in atmospheric visibility leads to a higher frequency of haze, which seriously affects people's normal life and has a significant negative economic impact. Mining the causal relationships of atmospheric visibility can reveal the potential relations between visibility and its influencing factors, which is very important in environmental management, air pollution control, and haze control. However, causality mining based on statistical methods and traditional machine learning techniques usually yields qualitative results, which makes it hard to measure the degree of causality accurately. This article proposes the seq2seq-LSTM Granger causality analysis method for mining the causal relationships between atmospheric visibility and its influencing factors. In the experiments, a comparison with linear regression, random forest, gradient boosting decision tree, light gradient boosting machine, and extreme gradient boosting shows that the visibility prediction accuracy of the seq2seq-LSTM model is about 10% higher than that of traditional machine learning methods. Therefore, causal relationship mining based on this method can deeply reveal the implicit relationships between visibility and its influencing factors and provide theoretical support for air pollution control.
{"title":"A Method for Mining Granger Causality Relationship on Atmospheric Visibility","authors":"Bo Liu, Xi He, Mingdong Song, Jianqiang Li, Guangzhi Qu, Jianlei Lang, Rentao Gu","doi":"10.1145/3447681","DOIUrl":"https://doi.org/10.1145/3447681","url":null,"PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":" 18","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120829695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
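The Granger idea behind the abstract above can be illustrated with a small sketch: a factor x is said to Granger-cause y if adding the past of x reduces the error of predicting y beyond what the past of y achieves alone. The snippet below deliberately swaps the seq2seq-LSTM predictor for a one-lag ordinary least-squares model so that it needs no external libraries; the function names, the log-ratio score, and the synthetic series are all assumptions made for illustration and do not reproduce the paper's method or data.

```python
import math
import random


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def rss(rows, targets):
    """Residual sum of squares of an ordinary least-squares fit."""
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    coef = solve(A, b)
    return sum((t - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, t in zip(rows, targets))


def granger_score(cause, effect):
    """log(RSS without the cause's lag / RSS with it); larger = stronger evidence."""
    targets = effect[1:]
    restricted = [[1.0, effect[t - 1]] for t in range(1, len(effect))]
    full = [[1.0, effect[t - 1], cause[t - 1]] for t in range(1, len(effect))]
    return math.log(rss(restricted, targets) / rss(full, targets))


# Synthetic series (made up): y depends on the previous value of x, not vice versa.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 500)]
score_xy = granger_score(x, y)
score_yx = granger_score(y, x)
print(score_xy, score_yx)
```

The score for x → y should come out far larger than the score for y → x, which gives a numeric degree of causality rather than a yes/no answer. With a nonlinear predictor, the same comparison of prediction errors would apply, with the two least-squares fits replaced by two trained models.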
Dipanjyoti Paul, Rahul Kumar, S. Saha, Jimson Mathew
Feature selection is the process of retaining only the relevant features by removing irrelevant or redundant ones from the large number of features used to represent data. Nowadays, many application domains, especially social media networks, generate new features continuously at different time stamps. When features arrive in such an online fashion, the selection task must itself be a continuous process. A streaming feature selection approach is therefore required, i.e., every time a new feature or a group of features arrives, the feature selection process has to be invoked. In recent years, many application domains have also generated data in which a sample may belong to more than one class, known as multi-label datasets. The multiple labels associated with an instance may have dependencies among themselves. Finding the correlation among the class labels helps to select discriminative features across multiple labels. In this article, we develop streaming feature selection methods for multi-label data in which the multiple labels are reduced to a lower-dimensional space. Similar labels are grouped together before selection to improve selection quality and make the model time-efficient. A multi-objective version of the cuckoo search-based approach is used to select the optimal feature set. Two versions of the streaming feature selection method are developed: (i) when the features arrive individually and (ii) when the features arrive in the form of a batch. Multi-label datasets from domains such as text, biology, and audio are used to test the developed methods. The proposed methods are compared with many previous feature selection methods, and the comparison establishes the superiority of using multiple objectives and label correlation in the feature selection process.
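The streaming setting described in this abstract can be illustrated with a minimal sketch. It is not the authors' cuckoo search method: a simple greedy relevance/redundancy filter based on Pearson correlation is swapped in, and the names `stream_select`, `relevance_min`, and `redundancy_max` are hypothetical. The sketch only shows how selection is re-invoked whenever a feature or a batch of features arrives.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Column = Sequence[float]


def pearson(x: Column, y: Column) -> float:
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def stream_select(
    batches: Iterable[List[Tuple[str, Column]]],
    labels: List[Column],
    relevance_min: float = 0.3,   # assumed threshold, for illustration only
    redundancy_max: float = 0.9,  # assumed threshold, for illustration only
) -> List[str]:
    """Greedy stand-in for the selection step, invoked on every arriving batch.

    A batch of size one models features arriving individually; larger
    batches model group arrival.
    """
    selected: Dict[str, Column] = {}
    for batch in batches:
        for name, column in batch:
            # Relevance: strongest correlation with any label column.
            relevance = max(abs(pearson(column, label)) for label in labels)
            if relevance < relevance_min:
                continue
            # Redundancy: skip features nearly collinear with a kept one.
            if any(abs(pearson(column, kept)) > redundancy_max
                   for kept in selected.values()):
                continue
            selected[name] = column
    return list(selected)


labels = [[0, 0, 1, 1], [0, 1, 0, 1]]          # two label columns
stream = [
    [("f1", [0, 0, 1, 1])],                     # individual arrival
    [("f2", [0, 0, 2, 2]), ("f3", [1, 1, 1, 1])],  # batch arrival
    [("f4", [0, 1, 0, 1])],
]
print(stream_select(stream, labels))
```

In this toy stream, `f2` is a rescaled copy of `f1` and `f3` is constant, so a filter of this kind would be expected to discard both; a multi-objective search such as the one in the article would instead trade off several criteria at once.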
{"title":"Multi-objective Cuckoo Search-based Streaming Feature Selection for Multi-label Dataset","authors":"Dipanjyoti Paul, Rahul Kumar, S. Saha, Jimson Mathew","doi":"10.1145/3447586","DOIUrl":"https://doi.org/10.1145/3447586","url":null,"abstract":"The feature selection method is the process of selecting only relevant features by removing irrelevant or redundant features amongst the large number of features that are used to represent data. Nowadays, many application domains especially social media networks, generate new features continuously at different time stamps. In such a scenario, when the features are arriving in an online fashion, to cope up with the continuous arrival of features, the selection task must also have to be a continuous process. Therefore, the streaming feature selection based approach has to be incorporated, i.e., every time a new feature or a group of features arrives, the feature selection process has to be invoked. Again, in recent years, there are many application domains that generate data where samples may belong to more than one classes called multi-label dataset. The multiple labels that the instances are being associated with, may have some dependencies amongst themselves. Finding the co-relation amongst the class labels helps to select the discriminative features across multiple labels. In this article, we develop streaming feature selection methods for multi-label data where the multiple labels are reduced to a lower-dimensional space. The similar labels are grouped together before performing the selection method to improve the selection quality and to make the model time efficient. The multi-objective version of the cuckoo search-based approach is used to select the optimal feature set. The proposed method develops two versions of the streaming feature selection method: ) when the features arrive individually and ) when the features arrive in the form of a batch. 
Various multi-label datasets from various domains such as text, biology, and audio have been used to test the developed streaming feature selection methods. The proposed methods are compared with many previous feature selection methods and from the comparison, the superiority of using multiple objectives and label co-relation in the feature selection process can be established.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122071537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhe Chen, Aixin Sun, Xiaokui Xiao
Community detection on network data is a fundamental task with many applications in industry. Industrial network data can be very large, with incomplete and complex attributes, and, more importantly, it keeps growing. This calls for a community detection technique that handles both attribute and topological information on large-scale networks and is also incremental. In this article, we propose inc-AGGMMR, an incremental community detection framework that effectively addresses the challenges of scalability, mixed attributes, incomplete values, and network evolution. By constructing an augmented graph, we map attributes into the network by introducing attribute centers and belongingness edges. The communities are then detected by modularity maximization. During this process, we adjust the weights of belongingness edges to balance the contributions of attribute and topological information to community detection. The weight adjustment mechanism enables incremental updates of the community membership of all vertices. We evaluate inc-AGGMMR on five benchmark datasets against eight strong baselines. We also provide a case study to incrementally detect communities on a PayPal payment network which contains users with transactions. The results demonstrate inc-AGGMMR’s effectiveness and practicability.
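The augmented-graph idea in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the inc-AGGMMR implementation: it assumes one attribute center per distinct categorical value, skips missing values, uses a single shared `belonging_weight`, and scores a given partition with standard weighted modularity rather than maximizing it. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Optional, Tuple

Edge = frozenset


def build_augmented_graph(
    edges: Iterable[Tuple[Hashable, Hashable]],
    attributes: Dict[Hashable, Dict[str, Optional[str]]],
    belonging_weight: float = 1.0,  # assumed single shared weight
) -> Dict[Edge, float]:
    """Return edge weights of the graph extended with attribute centers.

    Topological edges get weight 1.0 (no self-loops assumed). Each known
    attribute value becomes a center node ("attr", key, value) linked to
    the vertices carrying it; missing values (None) add no edge.
    """
    weights: Dict[Edge, float] = {}
    for u, v in edges:
        weights[frozenset((u, v))] = 1.0
    for node, attrs in attributes.items():
        for key, value in attrs.items():
            if value is None:
                continue
            center = ("attr", key, value)
            weights[frozenset((node, center))] = belonging_weight
    return weights


def modularity(weights: Dict[Edge, float],
               community: Dict[Hashable, int]) -> float:
    """Weighted modularity Q = sum_c [ w_in(c)/m - (deg(c)/(2m))^2 ]."""
    m = sum(weights.values())
    degree = defaultdict(float)
    internal = defaultdict(float)
    for pair, w in weights.items():
        u, v = tuple(pair)
        degree[u] += w
        degree[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    total = defaultdict(float)
    for node, d in degree.items():
        total[community[node]] += d
    return sum(internal[c] / m - (total[c] / (2 * m)) ** 2 for c in total)


edges = [("a", "b"), ("b", "c"), ("c", "d")]
attributes = {
    "a": {"color": "red"},
    "b": {"color": "red"},
    "c": {"color": "blue"},
    "d": {"color": None},  # incomplete value: no belongingness edge
}
graph = build_augmented_graph(edges, attributes)
partition = {
    "a": 0, "b": 0, ("attr", "color", "red"): 0,
    "c": 1, "d": 1, ("attr", "color", "blue"): 1,
}
print(modularity(graph, partition))
```

Raising `belonging_weight` makes shared attribute values count more toward modularity relative to topological edges, which is the kind of balance the abstract attributes to the weight adjustment mechanism.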
{"title":"Incremental Community Detection on Large Complex Attributed Network","authors":"Zhe Chen, Aixin Sun, Xiaokui Xiao","doi":"10.1145/3451216","DOIUrl":"https://doi.org/10.1145/3451216","url":null,"abstract":"Community detection on network data is a fundamental task, and has many applications in industry. Network data in industry can be very large, with incomplete and complex attributes, and more importantly, growing. This calls for a community detection technique that is able to handle both attribute and topological information on large scale networks, and also is incremental. In this article, we propose inc-AGGMMR, an incremental community detection framework that is able to effectively address the challenges that come from scalability, mixed attributes, incomplete values, and evolving of the network. Through construction of augmented graph, we map attributes into the network by introducing attribute centers and belongingness edges. The communities are then detected by modularity maximization. During this process, we adjust the weights of belongingness edges to balance the contribution between attribute and topological information to the detection of communities. The weight adjustment mechanism enables incremental updates of community membership of all vertices. We evaluate inc-AGGMMR on five benchmark datasets against eight strong baselines. We also provide a case study to incrementally detect communities on a PayPal payment network which contains users with transactions. 
The results demonstrate inc-AGGMMR’s effectiveness and practicability.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114706399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}