Scalable Mining of High-Utility Sequential Patterns With Three-Tier MapReduce Model
Chun-Wei Lin, Y. Djenouri, Gautam Srivastava, Yuanfa Li, Philip S. Yu
High-utility sequential pattern mining (HUSPM) has been an active research topic in recent decades, since it combines sequential and utility properties to reveal more information and knowledge than traditional frequent itemset mining or sequential pattern mining. Several HUSPM approaches have been presented, but most of them rely on main memory to speed up mining performance. This assumption is unrealistic and unsuitable for large-scale environments, since in real industrial settings the collected data is far too large to fit into the main memory of a single machine. In this article, we first develop a parallel and distributed three-stage MapReduce model for mining high-utility sequential patterns from large-scale databases. Two properties are then developed to guarantee the correctness and completeness of the patterns discovered by the framework. In addition, two data structures, called sidset and utility-linked list, are utilized in the framework to accelerate the computation of the required patterns. The results show that the designed model performs well on large-scale datasets in terms of runtime, memory usage, efficiency with respect to the number of distributed nodes, and scalability, compared to the serial HUSP-Span approach.
{"title":"Scalable Mining of High-Utility Sequential Patterns With Three-Tier MapReduce Model","authors":"Chun-Wei Lin, Y. Djenouri, Gautam Srivastava, Yuanfa Li, Philip S. Yu","doi":"10.1145/3487046","DOIUrl":"https://doi.org/10.1145/3487046","url":null,"abstract":"High-utility sequential pattern mining (HUSPM) is a hot research topic in recent decades since it combines both sequential and utility properties to reveal more information and knowledge rather than the traditional frequent itemset mining or sequential pattern mining. Several works of HUSPM have been presented but most of them are based on main memory to speed up mining performance. However, this assumption is not realistic and not suitable in large-scale environments since in real industry, the size of the collected data is very huge and it is impossible to fit the data into the main memory of a single machine. In this article, we first develop a parallel and distributed three-stage MapReduce model for mining high-utility sequential patterns based on large-scale databases. Two properties are then developed to hold the correctness and completeness of the discovered patterns in the developed framework. In addition, two data structures called sidset and utility-linked list are utilized in the developed framework to accelerate the computation for mining the required patterns. From the results, we can observe that the designed model has good performance in large-scale datasets in terms of runtime, memory, efficiency of the number of distributed nodes, and scalability compared to the serial HUSP-Span approach.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123661021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph Community Infomax
Heli Sun, Yang Li, Bing Lv, Wujie Yan, Liang He, Shaojie Qiao, Jianbin Huang
Graph representation learning aims at learning low-dimensional representations for nodes in graphs, and has proven very useful in several downstream tasks. In this article, we propose a new model, Graph Community Infomax (GCI), that can adversarially learn representations for nodes in attributed networks. Unlike other adversarial network embedding models, which assume that the data follow some prior distribution and generate fake examples, GCI utilizes the community information of networks, using nodes as positive (or real) examples and negative (or fake) examples at the same time. An autoencoder is applied to learn the embedding vectors for nodes and reconstruct the adjacency matrix, and a discriminator is used to maximize the mutual information between nodes and communities. Experiments on several real-world and synthetic networks show that GCI outperforms various network embedding methods on community detection tasks.
{"title":"Graph Community Infomax","authors":"Heli Sun, Yang Li, Bing Lv, Wujie Yan, Liang He, Shaojie Qiao, Jianbin Huang","doi":"10.1145/3480244","DOIUrl":"https://doi.org/10.1145/3480244","url":null,"abstract":"Graph representation learning aims at learning low-dimension representations for nodes in graphs, and has been proven very useful in several downstream tasks. In this article, we propose a new model, Graph Community Infomax (GCI), that can adversarial learn representations for nodes in attributed networks. Different from other adversarial network embedding models, which would assume that the data follow some prior distributions and generate fake examples, GCI utilizes the community information of networks, using nodes as positive(or real) examples and negative(or fake) examples at the same time. An autoencoder is applied to learn the embedding vectors for nodes and reconstruct the adjacency matrix, and a discriminator is used to maximize the mutual information between nodes and communities. Experiments on several real-world and synthetic networks have shown that GCI outperforms various network embedding methods on community detection tasks.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133628405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DACHA: A Dual Graph Convolution Based Temporal Knowledge Graph Representation Learning Method Using Historical Relation
Ling Chen, Xing Tang, Weiqiu Chen, Y. Qian, Yansheng Li, Yongjun Zhang
Temporal knowledge graph (TKG) representation learning embeds relations and entities into a continuous low-dimensional vector space by incorporating temporal information. Recent studies mainly aim at learning entity representations by modeling entity interactions from the neighbor structure of the graph. However, the interactions of relations in the neighbor structure of the graph are neglected, even though they are also significant for learning informative representations. In addition, an effective historical relation encoder for modeling multi-range temporal dependencies is still lacking. In this article, we propose DACHA, a dual graph convolution network based TKG representation learning method using historical relations. Specifically, we first construct the primal graph according to historical relations, as well as the edge graph, which regards historical relations as nodes. We then employ the dual graph convolution network to capture the interactions of both entities and historical relations from the neighbor structure of the graph. In addition, a temporal self-attentive historical relation encoder is proposed to explicitly model both local and global temporal dependencies. Extensive experiments on two event-based TKG datasets demonstrate that DACHA achieves state-of-the-art results.
{"title":"DACHA: A Dual Graph Convolution Based Temporal Knowledge Graph Representation Learning Method Using Historical Relation","authors":"Ling Chen, Xing Tang, Weiqiu Chen, Y. Qian, Yansheng Li, Yongjun Zhang","doi":"10.1145/3477051","DOIUrl":"https://doi.org/10.1145/3477051","url":null,"abstract":"Temporal knowledge graph (TKG) representation learning embeds relations and entities into a continuous low-dimensional vector space by incorporating temporal information. Latest studies mainly aim at learning entity representations by modeling entity interactions from the neighbor structure of the graph. However, the interactions of relations from the neighbor structure of the graph are neglected, which are also of significance for learning informative representations. In addition, there still lacks an effective historical relation encoder to model the multi-range temporal dependencies. In this article, we propose a dual graph convolution network based TKG representation learning method using historical relations (DACHA). Specifically, we first construct the primal graph according to historical relations, as well as the edge graph by regarding historical relations as nodes. Then, we employ the dual graph convolution network to capture the interactions of both entities and historical relations from the neighbor structure of the graph. In addition, the temporal self-attentive historical relation encoder is proposed to explicitly model both local and global temporal dependencies. Extensive experiments on two event based TKG datasets demonstrate that DACHA achieves the state-of-the-art results.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125097820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context-aware Spatial-Temporal Neural Network for Citywide Crowd Flow Prediction via Modeling Long-range Spatial Dependency
Jie Feng, Yong Li, Ziqian Lin, Can Rong, Funing Sun, Diansheng Guo, Depeng Jin
Crowd flow prediction is of great importance in a wide range of applications, from urban planning and traffic control to public safety. It aims at predicting the inflow (the traffic of crowds entering a region in a given time interval) and outflow (the traffic of crowds leaving a region for other places) of each region in the city, given the historical flow data. In this article, we propose DeepSTN+, a deep learning-based convolutional model, to predict crowd flows in the metropolis. First, DeepSTN+ employs the ConvPlus structure to model the long-range spatial dependence among crowd flows in different regions. Further, PoI distributions and time factors are combined to express the effect of location attributes and introduce prior knowledge of crowd movements. Finally, we propose a temporal attention-based fusion mechanism to stabilize the training process, which further improves performance. Extensive experimental results on four real-life datasets demonstrate the superiority of our model: DeepSTN+ reduces the error of crowd flow prediction by approximately 10%–21% compared with state-of-the-art baselines.
{"title":"Context-aware Spatial-Temporal Neural Network for Citywide Crowd Flow Prediction via Modeling Long-range Spatial Dependency","authors":"Jie Feng, Yong Li, Ziqian Lin, Can Rong, Funing Sun, Diansheng Guo, Depeng Jin","doi":"10.1145/3477577","DOIUrl":"https://doi.org/10.1145/3477577","url":null,"abstract":"Crowd flow prediction is of great importance in a wide range of applications from urban planning, traffic control to public safety. It aims at predicting the inflow (the traffic of crowds entering a region in a given time interval) and outflow (the traffic of crowds leaving a region for other places) of each region in the city with knowing the historical flow data. In this article, we propose DeepSTN+, a deep learning-based convolutional model, to predict crowd flows in the metropolis. First, DeepSTN+ employs the ConvPlus structure to model the long-range spatial dependence among crowd flows in different regions. Further, PoI distributions and time factor are combined to express the effect of location attributes to introduce prior knowledge of the crowd movements. Finally, we propose a temporal attention-based fusion mechanism to stabilize the training process, which further improves the performance. Extensive experimental results based on four real-life datasets demonstrate the superiority of our model, i.e., DeepSTN+ reduces the error of the crowd flow prediction by approximately 10%–21% compared with the state-of-the-art baselines.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129695429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Balance-Subsampled Stable Prediction Across Unknown Test Data
Kun Kuang, Hengtao Zhang, Runze Wu, Fei Wu, Y. Zhuang, Aijun Zhang
In data mining and machine learning, it is commonly assumed that training and test data share the same population distribution. However, this assumption is often violated in practice because of sample selection bias, which can induce a distribution shift from training data to test data. Such a model-agnostic distribution shift usually leads to prediction instability across unknown test data. This article proposes a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design. It isolates the clear effect of each predictor from the confounding variables. A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift, improving both the accuracy of parameter estimation and the stability of prediction across unknown test data. Numerical experiments on synthetic and real-world datasets demonstrate that our BSSP algorithm significantly outperforms baseline methods for stable prediction across unknown test data.
{"title":"Balance-Subsampled Stable Prediction Across Unknown Test Data","authors":"Kun Kuang, Hengtao Zhang, Runze Wu, Fei Wu, Y. Zhuang, Aijun Zhang","doi":"10.1145/3477052","DOIUrl":"https://doi.org/10.1145/3477052","url":null,"abstract":"In data mining and machine learning, it is commonly assumed that training and test data share the same population distribution. However, this assumption is often violated in practice because of the sample selection bias, which might induce the distribution shift from training data to test data. Such a model-agnostic distribution shift usually leads to prediction instability across unknown test data. This article proposes a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design. It isolates the clear effect of each predictor from the confounding variables. A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift, improving both the accuracy of parameter estimation and the stability of prediction across unknown test data. Numerical experiments on synthetic and real-world datasets demonstrate that our BSSP algorithm can significantly outperform the baseline methods for stable prediction across unknown test data.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130344375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online and Distributed Robust Regressions with Extremely Noisy Labels
Shuo Lei, Xuchao Zhang, Liang Zhao, Arnold P. Boedihardjo, Chang-Tien Lu
In today’s era of big data, robust least-squares regression becomes a more challenging problem when extremely corrupted labels coincide with the explosive growth of datasets. Traditional robust methods can handle the noise but face several challenges when applied to huge datasets, including (1) the computational infeasibility of handling an entire dataset at once, (2) the existence of heterogeneously distributed corruption, and (3) the difficulty of corruption estimation when the data cannot be entirely loaded. This article proposes online and distributed robust regression approaches, both of which can concurrently address all of the above challenges. Specifically, the distributed algorithm optimizes the regression coefficients of each data block via heuristic hard thresholding and combines all the estimates in a distributed robust consolidation. In addition, an online version of the distributed algorithm is proposed to incrementally update the existing estimates with new incoming data. Furthermore, a novel online robust regression method is proposed for estimation under biased-batch corruption. We also prove that our algorithms enjoy strong robustness guarantees for regression coefficient recovery, with a constant upper bound on the error of state-of-the-art batch methods. Extensive experiments on synthetic and real datasets demonstrate that our approaches are superior to existing methods in effectiveness, with competitive efficiency.
{"title":"Online and Distributed Robust Regressions with Extremely Noisy Labels","authors":"Shuo Lei, Xuchao Zhang, Liang Zhao, Arnold P. Boedihardjo, Chang-Tien Lu","doi":"10.1145/3473038","DOIUrl":"https://doi.org/10.1145/3473038","url":null,"abstract":"In today’s era of big data, robust least-squares regression becomes a more challenging problem when considering the extremely corrupted labels along with explosive growth of datasets. Traditional robust methods can handle the noise but suffer from several challenges when applied in huge dataset including (1) computational infeasibility of handling an entire dataset at once, (2) existence of heterogeneously distributed corruption, and (3) difficulty in corruption estimation when data cannot be entirely loaded. This article proposes online and distributed robust regression approaches, both of which can concurrently address all the above challenges. Specifically, the distributed algorithm optimizes the regression coefficients of each data block via heuristic hard thresholding and combines all the estimates in a distributed robust consolidation. In addition, an online version of the distributed algorithm is proposed to incrementally update the existing estimates with new incoming data. Furthermore, a novel online robust regression method is proposed to estimate under a biased-batch corruption. We also prove that our algorithms benefit from strong robustness guarantees in terms of regression coefficient recovery with a constant upper bound on the error of state-of-the-art batch methods. Extensive experiments on synthetic and real datasets demonstrate that our approaches are superior to those of existing methods in effectiveness, with competitive efficiency.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121518014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Normalizing Flow-Based Co-Embedding Model for Attributed Networks
Shangsong Liang, Zhuo Ouyang, Zaiqiao Meng
Network embedding is a technique that aims at inferring low-dimensional representations of nodes in a semantic space. In this article, we study the problem of inferring low-dimensional representations of both nodes and attributes of attributed networks in the same semantic space, such that the affinity between a node and an attribute can be effectively measured. Intuitively, this problem can be addressed by simply utilizing existing variational auto-encoder (VAE) based network embedding algorithms. However, the variational posterior distribution in previous VAE-based network embedding algorithms is often assumed and restricted to be a mean-field Gaussian or another simple distribution family, which results in poor inference of the embeddings. To alleviate this defect, we propose F-CAN, a novel VAE-based co-embedding method for attributed networks in which posterior distributions are flexible, complex, and scalable distributions constructed through normalizing flows. We evaluate the proposed model on a number of network tasks with several benchmark datasets. Experimental results demonstrate clear improvements in the quality of embeddings generated by our model over state-of-the-art attributed network embedding methods.
{"title":"A Normalizing Flow-Based Co-Embedding Model for Attributed Networks","authors":"Shangsong Liang, Zhuo Ouyang, Zaiqiao Meng","doi":"10.1145/3477049","DOIUrl":"https://doi.org/10.1145/3477049","url":null,"abstract":"Network embedding is a technique that aims at inferring the low-dimensional representations of nodes in a semantic space. In this article, we study the problem of inferring the low-dimensional representations of both nodes and attributes for attributed networks in the same semantic space such that the affinity between a node and an attribute can be effectively measured. Intuitively, this problem can be addressed by simply utilizing existing variational auto-encoder (VAE) based network embedding algorithms. However, the variational posterior distribution in previous VAE based network embedding algorithms is often assumed and restricted to be a mean-field Gaussian distribution or other simple distribution families, which results in poor inference of the embeddings. To alleviate the above defect, we propose a novel VAE-based co-embedding method for attributed network, F-CAN, where posterior distributions are flexible, complex, and scalable distributions constructed through the normalizing flow. We evaluate our proposed models on a number of network tasks with several benchmark datasets. Experimental results demonstrate that there are clear improvements in the qualities of embeddings generated by our model to the state-of-the-art attributed network embedding methods.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124075305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network Public Opinion Detection During the Coronavirus Pandemic: A Short-Text Relational Topic Model
Yuanchun Jiang, Ruicheng Liang, Ji Zhang, Jianshan Sun, Yezheng Liu, Yang Qian
Online social media provides rich and varied information reflecting the significant concerns of the public during the coronavirus pandemic. Analyzing what the public is concerned with from social media information can help policy-makers maintain the stability of the economy and of social life. In this article, we focus on detecting network public opinion during the coronavirus pandemic. We propose a novel Relational Topic Model for Short texts (RTMS) to draw opinion topics from social media data. RTMS exploits the features of texts in online social media and the opinion propagation patterns among individuals. Moreover, a dynamic version of RTMS (DRTMS) is proposed to capture the evolution of public opinion. Our experiment is conducted on a real-world dataset that includes 67,592 comments from 14,992 users. The results demonstrate that, compared with benchmark methods, the proposed RTMS and DRTMS models can detect meaningful public opinions by leveraging the features of social media data. They can also effectively capture the evolution of public concerns during different phases of the coronavirus pandemic.
{"title":"Network Public Opinion Detection During the Coronavirus Pandemic: A Short-Text Relational Topic Model","authors":"Yuanchun Jiang, Ruicheng Liang, Ji Zhang, Jianshan Sun, Yezheng Liu, Yang Qian","doi":"10.1145/3480246","DOIUrl":"https://doi.org/10.1145/3480246","url":null,"abstract":"Online social media provides rich and varied information reflecting the significant concerns of the public during the coronavirus pandemic. Analyzing what the public is concerned with from social media information can support policy-makers to maintain the stability of the social economy and life of the society. In this article, we focus on the detection of the network public opinions during the coronavirus pandemic. We propose a novel Relational Topic Model for Short texts (RTMS) to draw opinion topics from social media data. RTMS exploits the feature of texts in online social media and the opinion propagation patterns among individuals. Moreover, a dynamic version of RTMS (DRTMS) is proposed to capture the evolution of public opinions. Our experiment is conducted on a real-world dataset which includes 67,592 comments from 14,992 users. The results demonstrate that, compared with the benchmark methods, the proposed RTMS and DRTMS models can detect meaningful public opinions by leveraging the feature of social media data. It can also effectively capture the evolution of public concerns during different phases of the coronavirus pandemic.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117225882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NTP-Miner: Nonoverlapping Three-Way Sequential Pattern Mining
Youxi Wu, L. Luo, Yan Li, Lei Guo, Philippe Fournier-Viger, Xingquan Zhu, Xindong Wu
Nonoverlapping sequential pattern mining is an important type of sequential pattern mining (SPM) with gap constraints, which not only can reveal interesting patterns to users but also can effectively reduce the search space using the Apriori (anti-monotonicity) property. However, existing algorithms do not focus on attributes of interest to users, meaning that they may discover many frequent patterns that are redundant. To solve this problem, this article proposes a task called nonoverlapping three-way sequential pattern (NTP) mining, where attributes are categorized according to three levels of interest: strong, medium, and weak. NTP mining can effectively avoid mining redundant patterns, since NTPs are composed of strong and medium interest items. Moreover, NTPs can avoid serious deviations (occurrences significantly different from their pattern), since gap constraints cannot match with strong interest patterns. To mine NTPs, an effective algorithm called NTP-Miner is put forward, which applies two main steps: support (occurrence frequency) calculation and candidate pattern generation. To calculate the support of an NTP, depth-first search and backtracking strategies are adopted, which do not require creating a whole Nettree structure; many redundant nodes and parent–child relationships therefore do not need to be created, which improves time and space efficiency. To generate candidate patterns while reducing their number, NTP-Miner employs a pattern join strategy and only mines patterns of strong and medium interest. Experimental results on stock market and protein datasets show that NTP-Miner not only is more efficient than other competitive approaches but can also help users find more valuable patterns. More importantly, NTP mining achieves better performance than other competitive methods on clustering tasks. Algorithms and data are available at: https://github.com/wuc567/Pattern-Mining/tree/master/NTP-Miner.
{"title":"NTP-Miner: Nonoverlapping Three-Way Sequential Pattern Mining","authors":"Youxi Wu, L. Luo, Yan Li, Lei Guo, Philippe Fournier-Viger, Xingquan Zhu, Xindong Wu","doi":"10.1145/3480245","DOIUrl":"https://doi.org/10.1145/3480245","url":null,"abstract":"Nonoverlapping sequential pattern mining is an important type of sequential pattern mining (SPM) with gap constraints, which not only can reveal interesting patterns to users but also can effectively reduce the search space using the Apriori (anti-monotonicity) property. However, the existing algorithms do not focus on attributes of interest to users, meaning that existing methods may discover many frequent patterns that are redundant. To solve this problem, this article proposes a task called nonoverlapping three-way sequential pattern (NTP) mining, where attributes are categorized according to three levels of interest: strong, medium, and weak interest. NTP mining can effectively avoid mining redundant patterns since the NTPs are composed of strong and medium interest items. Moreover, NTPs can avoid serious deviations (the occurrence is significantly different from its pattern) since gap constraints cannot match with strong interest patterns. To mine NTPs, an effective algorithm is put forward, called NTP-Miner, which applies two main steps: support (frequency occurrence) calculation and candidate pattern generation. To calculate the support of an NTP, depth-first and backtracking strategies are adopted, which do not require creating a whole Nettree structure, meaning that many redundant nodes and parent–child relationships do not need to be created. Hence, time and space efficiency is improved. To generate candidate patterns while reducing their number, NTP-Miner employs a pattern join strategy and only mines patterns of strong and medium interest. Experimental results on stock market and protein datasets show that NTP-Miner not only is more efficient than other competitive approaches but can also help users find more valuable patterns. More importantly, NTP mining has achieved better performance than other competitive methods in clustering tasks. Algorithms and data are available at: https://github.com/wuc567/Pattern-Mining/tree/master/NTP-Miner.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130367982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deciphering Feature Effects on Decision-Making in Ordinal Regression Problems: An Explainable Ordinal Factorization Model
Mengzhuo Guo, Zhongzhi Xu, Qingpeng Zhang, Xiuwu Liao, Jiapeng Liu
Ordinal regression predicts labels that exhibit a natural ordering, which is vital to decision-making problems such as credit scoring and clinical diagnosis. In these problems, the ability to explain how individual features and their interactions affect the decisions is as critical as model performance. Unfortunately, existing ordinal regression models in the machine learning community aim at improving prediction accuracy rather than exploring explainability. To achieve high accuracy while explaining the relationships between the features and the predictions, we propose a new method for ordinal regression problems, namely the Explainable Ordinal Factorization Model (XOFM). XOFM uses piecewise linear functions to approximate the shape functions of individual features, and renders the pairwise feature interaction effects as heat-maps. The proposed XOFM captures the nonlinearity in the main effects and ensures the same flexibility for the interaction effects. Therefore, the underlying model yields comparable performance while remaining explainable, by explicitly describing the main and interaction effects. To address the potential sparsity problem caused by discretizing the whole feature scale into several sub-intervals, XOFM integrates Factorization Machines (FMs) to factorize the model parameters. Comprehensive experiments on benchmark real-world and synthetic datasets demonstrate that the proposed XOFM achieves state-of-the-art prediction performance while preserving easy-to-understand explainability.
{"title":"Deciphering Feature Effects on Decision-Making in Ordinal Regression Problems: An Explainable Ordinal Factorization Model","authors":"Mengzhuo Guo, Zhongzhi Xu, Qingpeng Zhang, Xiuwu Liao, Jiapeng Liu","doi":"10.1145/3487048","DOIUrl":"https://doi.org/10.1145/3487048","url":null,"abstract":"Ordinal regression predicts the objects’ labels that exhibit a natural ordering, which is vital to decision-making problems such as credit scoring and clinical diagnosis. In these problems, the ability to explain how the individual features and their interactions affect the decisions is as critical as model performance. Unfortunately, the existing ordinal regression models in the machine learning community aim at improving prediction accuracy rather than explore explainability. To achieve high accuracy while explaining the relationships between the features and the predictions, we propose a new method for ordinal regression problems, namely the Explainable Ordinal Factorization Model (XOFM). XOFM uses piecewise linear functions to approximate the shape functions of individual features, and renders the pairwise features interaction effects as heat-maps. The proposed XOFM captures the nonlinearity in the main effects and ensures the interaction effects’ same flexibility. Therefore, the underlying model yields comparable performance while remaining explainable by explicitly describing the main and interaction effects. To address the potential sparsity problem caused by discretizing the whole feature scale into several sub-intervals, XOFM integrates the Factorization Machines (FMs) to factorize the model parameters. Comprehensive experiments with benchmark real-world and synthetic datasets demonstrate that the proposed XOFM leads to state-of-the-art prediction performance while preserving an easy-to-understand explainability.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125039677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}