Scalable Mining of High-Utility Sequential Patterns With Three-Tier MapReduce Model
Chun-Wei Lin, Y. Djenouri, Gautam Srivastava, Yuanfa Li, Philip S. Yu
High-utility sequential pattern mining (HUSPM) has been an active research topic in recent decades, since it combines sequential and utility properties to reveal more information and knowledge than traditional frequent itemset mining or sequential pattern mining. Several HUSPM approaches have been presented, but most of them rely on main memory to speed up mining performance. This assumption is unrealistic and unsuitable for large-scale environments, since in real industrial settings the collected data is far too large to fit into the main memory of a single machine. In this article, we first develop a parallel and distributed three-stage MapReduce model for mining high-utility sequential patterns from large-scale databases. Two properties are then developed to guarantee the correctness and completeness of the patterns discovered by the framework. In addition, two data structures, called sidset and utility-linked list, are utilized in the framework to accelerate the computation of the required patterns. The results show that the designed model performs well on large-scale datasets in terms of runtime, memory usage, efficiency with respect to the number of distributed nodes, and scalability, compared to the serial HUSP-Span approach.
{"title":"Scalable Mining of High-Utility Sequential Patterns With Three-Tier MapReduce Model","authors":"Chun-Wei Lin, Y. Djenouri, Gautam Srivastava, Yuanfa Li, Philip S. Yu","doi":"10.1145/3487046","DOIUrl":"https://doi.org/10.1145/3487046","url":null,"abstract":"High-utility sequential pattern mining (HUSPM) is a hot research topic in recent decades since it combines both sequential and utility properties to reveal more information and knowledge rather than the traditional frequent itemset mining or sequential pattern mining. Several works of HUSPM have been presented but most of them are based on main memory to speed up mining performance. However, this assumption is not realistic and not suitable in large-scale environments since in real industry, the size of the collected data is very huge and it is impossible to fit the data into the main memory of a single machine. In this article, we first develop a parallel and distributed three-stage MapReduce model for mining high-utility sequential patterns based on large-scale databases. Two properties are then developed to hold the correctness and completeness of the discovered patterns in the developed framework. In addition, two data structures called sidset and utility-linked list are utilized in the developed framework to accelerate the computation for mining the required patterns. From the results, we can observe that the designed model has good performance in large-scale datasets in terms of runtime, memory, efficiency of the number of distributed nodes, and scalability compared to the serial HUSP-Span approach.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123661021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph Community Infomax
Heli Sun, Yang Li, Bing Lv, Wujie Yan, Liang He, Shaojie Qiao, Jianbin Huang
Graph representation learning aims at learning low-dimensional representations for nodes in graphs, and has proven very useful in several downstream tasks. In this article, we propose a new model, Graph Community Infomax (GCI), that can adversarially learn representations for nodes in attributed networks. Unlike other adversarial network embedding models, which assume that the data follow some prior distribution and generate fake examples, GCI utilizes the community information of networks, using nodes as positive (or real) examples and negative (or fake) examples at the same time. An autoencoder is applied to learn the embedding vectors for nodes and reconstruct the adjacency matrix, and a discriminator is used to maximize the mutual information between nodes and communities. Experiments on several real-world and synthetic networks show that GCI outperforms various network embedding methods on community detection tasks.
{"title":"Graph Community Infomax","authors":"Heli Sun, Yang Li, Bing Lv, Wujie Yan, Liang He, Shaojie Qiao, Jianbin Huang","doi":"10.1145/3480244","DOIUrl":"https://doi.org/10.1145/3480244","url":null,"abstract":"Graph representation learning aims at learning low-dimension representations for nodes in graphs, and has been proven very useful in several downstream tasks. In this article, we propose a new model, Graph Community Infomax (GCI), that can adversarial learn representations for nodes in attributed networks. Different from other adversarial network embedding models, which would assume that the data follow some prior distributions and generate fake examples, GCI utilizes the community information of networks, using nodes as positive(or real) examples and negative(or fake) examples at the same time. An autoencoder is applied to learn the embedding vectors for nodes and reconstruct the adjacency matrix, and a discriminator is used to maximize the mutual information between nodes and communities. Experiments on several real-world and synthetic networks have shown that GCI outperforms various network embedding methods on community detection tasks.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133628405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DACHA: A Dual Graph Convolution Based Temporal Knowledge Graph Representation Learning Method Using Historical Relation
Ling Chen, Xing Tang, Weiqiu Chen, Y. Qian, Yansheng Li, Yongjun Zhang
Temporal knowledge graph (TKG) representation learning embeds relations and entities into a continuous low-dimensional vector space by incorporating temporal information. Recent studies mainly aim at learning entity representations by modeling entity interactions from the neighbor structure of the graph. However, the interactions of relations in the neighbor structure of the graph are neglected, even though they are also significant for learning informative representations. In addition, an effective historical relation encoder for modeling multi-range temporal dependencies is still lacking. In this article, we propose DACHA, a dual graph convolution network based TKG representation learning method using historical relations. Specifically, we first construct the primal graph according to historical relations, as well as the edge graph, which regards historical relations as nodes. We then employ the dual graph convolution network to capture the interactions of both entities and historical relations from the neighbor structure of the graph. In addition, a temporal self-attentive historical relation encoder is proposed to explicitly model both local and global temporal dependencies. Extensive experiments on two event-based TKG datasets demonstrate that DACHA achieves state-of-the-art results.
{"title":"DACHA: A Dual Graph Convolution Based Temporal Knowledge Graph Representation Learning Method Using Historical Relation","authors":"Ling Chen, Xing Tang, Weiqiu Chen, Y. Qian, Yansheng Li, Yongjun Zhang","doi":"10.1145/3477051","DOIUrl":"https://doi.org/10.1145/3477051","url":null,"abstract":"Temporal knowledge graph (TKG) representation learning embeds relations and entities into a continuous low-dimensional vector space by incorporating temporal information. Latest studies mainly aim at learning entity representations by modeling entity interactions from the neighbor structure of the graph. However, the interactions of relations from the neighbor structure of the graph are neglected, which are also of significance for learning informative representations. In addition, there still lacks an effective historical relation encoder to model the multi-range temporal dependencies. In this article, we propose a dual graph convolution network based TKG representation learning method using historical relations (DACHA). Specifically, we first construct the primal graph according to historical relations, as well as the edge graph by regarding historical relations as nodes. Then, we employ the dual graph convolution network to capture the interactions of both entities and historical relations from the neighbor structure of the graph. In addition, the temporal self-attentive historical relation encoder is proposed to explicitly model both local and global temporal dependencies. Extensive experiments on two event based TKG datasets demonstrate that DACHA achieves the state-of-the-art results.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125097820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context-aware Spatial-Temporal Neural Network for Citywide Crowd Flow Prediction via Modeling Long-range Spatial Dependency
Jie Feng, Yong Li, Ziqian Lin, Can Rong, Funing Sun, Diansheng Guo, Depeng Jin
Crowd flow prediction is of great importance in a wide range of applications, from urban planning and traffic control to public safety. It aims at predicting the inflow (the traffic of crowds entering a region in a given time interval) and outflow (the traffic of crowds leaving a region for other places) of each region in the city, given the historical flow data. In this article, we propose DeepSTN+, a deep learning-based convolutional model, to predict crowd flows in the metropolis. First, DeepSTN+ employs the ConvPlus structure to model the long-range spatial dependence among crowd flows in different regions. Further, PoI distributions and time factors are combined to express the effect of location attributes and introduce prior knowledge of crowd movements. Finally, we propose a temporal attention-based fusion mechanism to stabilize the training process, which further improves performance. Extensive experimental results on four real-life datasets demonstrate the superiority of our model: DeepSTN+ reduces the error of crowd flow prediction by approximately 10%–21% compared with state-of-the-art baselines.
{"title":"Context-aware Spatial-Temporal Neural Network for Citywide Crowd Flow Prediction via Modeling Long-range Spatial Dependency","authors":"Jie Feng, Yong Li, Ziqian Lin, Can Rong, Funing Sun, Diansheng Guo, Depeng Jin","doi":"10.1145/3477577","DOIUrl":"https://doi.org/10.1145/3477577","url":null,"abstract":"Crowd flow prediction is of great importance in a wide range of applications from urban planning, traffic control to public safety. It aims at predicting the inflow (the traffic of crowds entering a region in a given time interval) and outflow (the traffic of crowds leaving a region for other places) of each region in the city with knowing the historical flow data. In this article, we propose DeepSTN+, a deep learning-based convolutional model, to predict crowd flows in the metropolis. First, DeepSTN+ employs the ConvPlus structure to model the long-range spatial dependence among crowd flows in different regions. Further, PoI distributions and time factor are combined to express the effect of location attributes to introduce prior knowledge of the crowd movements. Finally, we propose a temporal attention-based fusion mechanism to stabilize the training process, which further improves the performance. Extensive experimental results based on four real-life datasets demonstrate the superiority of our model, i.e., DeepSTN+ reduces the error of the crowd flow prediction by approximately 10%–21% compared with the state-of-the-art baselines.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129695429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Balance-Subsampled Stable Prediction Across Unknown Test Data
Kun Kuang, Hengtao Zhang, Runze Wu, Fei Wu, Y. Zhuang, Aijun Zhang
In data mining and machine learning, it is commonly assumed that training and test data share the same population distribution. However, this assumption is often violated in practice because of sample selection bias, which can induce a distribution shift from training data to test data. Such a model-agnostic distribution shift usually leads to prediction instability across unknown test data. This article proposes a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design. It isolates the clear effect of each predictor from the confounding variables. A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift, improving both the accuracy of parameter estimation and the stability of prediction across unknown test data. Numerical experiments on synthetic and real-world datasets demonstrate that our BSSP algorithm significantly outperforms baseline methods for stable prediction across unknown test data.
{"title":"Balance-Subsampled Stable Prediction Across Unknown Test Data","authors":"Kun Kuang, Hengtao Zhang, Runze Wu, Fei Wu, Y. Zhuang, Aijun Zhang","doi":"10.1145/3477052","DOIUrl":"https://doi.org/10.1145/3477052","url":null,"abstract":"In data mining and machine learning, it is commonly assumed that training and test data share the same population distribution. However, this assumption is often violated in practice because of the sample selection bias, which might induce the distribution shift from training data to test data. Such a model-agnostic distribution shift usually leads to prediction instability across unknown test data. This article proposes a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design. It isolates the clear effect of each predictor from the confounding variables. A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift, improving both the accuracy of parameter estimation and the stability of prediction across unknown test data. Numerical experiments on synthetic and real-world datasets demonstrate that our BSSP algorithm can significantly outperform the baseline methods for stable prediction across unknown test data.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130344375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online and Distributed Robust Regressions with Extremely Noisy Labels
Shuo Lei, Xuchao Zhang, Liang Zhao, Arnold P. Boedihardjo, Chang-Tien Lu
In today’s era of big data, robust least-squares regression becomes a more challenging problem when extremely corrupted labels coincide with the explosive growth of datasets. Traditional robust methods can handle the noise but face several challenges when applied to huge datasets, including (1) the computational infeasibility of handling an entire dataset at once, (2) the existence of heterogeneously distributed corruption, and (3) the difficulty of corruption estimation when the data cannot be entirely loaded. This article proposes online and distributed robust regression approaches, both of which can concurrently address all of the above challenges. Specifically, the distributed algorithm optimizes the regression coefficients of each data block via heuristic hard thresholding and combines all the estimates in a distributed robust consolidation. In addition, an online version of the distributed algorithm is proposed to incrementally update the existing estimates with new incoming data. Furthermore, a novel online robust regression method is proposed for estimation under biased-batch corruption. We also prove that our algorithms enjoy strong robustness guarantees for regression coefficient recovery, with a constant upper bound on the error of state-of-the-art batch methods. Extensive experiments on synthetic and real datasets demonstrate that our approaches are superior to existing methods in effectiveness, with competitive efficiency.
{"title":"Online and Distributed Robust Regressions with Extremely Noisy Labels","authors":"Shuo Lei, Xuchao Zhang, Liang Zhao, Arnold P. Boedihardjo, Chang-Tien Lu","doi":"10.1145/3473038","DOIUrl":"https://doi.org/10.1145/3473038","url":null,"abstract":"In today’s era of big data, robust least-squares regression becomes a more challenging problem when considering the extremely corrupted labels along with explosive growth of datasets. Traditional robust methods can handle the noise but suffer from several challenges when applied in huge dataset including (1) computational infeasibility of handling an entire dataset at once, (2) existence of heterogeneously distributed corruption, and (3) difficulty in corruption estimation when data cannot be entirely loaded. This article proposes online and distributed robust regression approaches, both of which can concurrently address all the above challenges. Specifically, the distributed algorithm optimizes the regression coefficients of each data block via heuristic hard thresholding and combines all the estimates in a distributed robust consolidation. In addition, an online version of the distributed algorithm is proposed to incrementally update the existing estimates with new incoming data. Furthermore, a novel online robust regression method is proposed to estimate under a biased-batch corruption. We also prove that our algorithms benefit from strong robustness guarantees in terms of regression coefficient recovery with a constant upper bound on the error of state-of-the-art batch methods. Extensive experiments on synthetic and real datasets demonstrate that our approaches are superior to those of existing methods in effectiveness, with competitive efficiency.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121518014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Normalizing Flow-Based Co-Embedding Model for Attributed Networks
Shangsong Liang, Zhuo Ouyang, Zaiqiao Meng
Network embedding is a technique that aims at inferring low-dimensional representations of nodes in a semantic space. In this article, we study the problem of inferring low-dimensional representations of both nodes and attributes of attributed networks in the same semantic space, such that the affinity between a node and an attribute can be effectively measured. Intuitively, this problem can be addressed by simply utilizing existing variational auto-encoder (VAE) based network embedding algorithms. However, the variational posterior distribution in previous VAE-based network embedding algorithms is often assumed and restricted to be a mean-field Gaussian or another simple distribution family, which results in poor inference of the embeddings. To alleviate this defect, we propose F-CAN, a novel VAE-based co-embedding method for attributed networks in which posterior distributions are flexible, complex, and scalable distributions constructed through normalizing flows. We evaluate the proposed model on a number of network tasks with several benchmark datasets. Experimental results demonstrate clear improvements in the quality of embeddings generated by our model over state-of-the-art attributed network embedding methods.
{"title":"A Normalizing Flow-Based Co-Embedding Model for Attributed Networks","authors":"Shangsong Liang, Zhuo Ouyang, Zaiqiao Meng","doi":"10.1145/3477049","DOIUrl":"https://doi.org/10.1145/3477049","url":null,"abstract":"Network embedding is a technique that aims at inferring the low-dimensional representations of nodes in a semantic space. In this article, we study the problem of inferring the low-dimensional representations of both nodes and attributes for attributed networks in the same semantic space such that the affinity between a node and an attribute can be effectively measured. Intuitively, this problem can be addressed by simply utilizing existing variational auto-encoder (VAE) based network embedding algorithms. However, the variational posterior distribution in previous VAE based network embedding algorithms is often assumed and restricted to be a mean-field Gaussian distribution or other simple distribution families, which results in poor inference of the embeddings. To alleviate the above defect, we propose a novel VAE-based co-embedding method for attributed network, F-CAN, where posterior distributions are flexible, complex, and scalable distributions constructed through the normalizing flow. We evaluate our proposed models on a number of network tasks with several benchmark datasets. Experimental results demonstrate that there are clear improvements in the qualities of embeddings generated by our model to the state-of-the-art attributed network embedding methods.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124075305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network Public Opinion Detection During the Coronavirus Pandemic: A Short-Text Relational Topic Model
Yuanchun Jiang, Ruicheng Liang, Ji Zhang, Jianshan Sun, Yezheng Liu, Yang Qian
Online social media provides rich and varied information reflecting the significant concerns of the public during the coronavirus pandemic. Analyzing what the public is concerned with from social media information can help policy-makers maintain the stability of the economy and of social life. In this article, we focus on detecting network public opinion during the coronavirus pandemic. We propose a novel Relational Topic Model for Short texts (RTMS) to draw opinion topics from social media data. RTMS exploits the features of texts in online social media and the opinion propagation patterns among individuals. Moreover, a dynamic version of RTMS (DRTMS) is proposed to capture the evolution of public opinion. Our experiment is conducted on a real-world dataset that includes 67,592 comments from 14,992 users. The results demonstrate that, compared with benchmark methods, the proposed RTMS and DRTMS models can detect meaningful public opinions by leveraging the features of social media data. They can also effectively capture the evolution of public concerns during different phases of the coronavirus pandemic.
{"title":"Network Public Opinion Detection During the Coronavirus Pandemic: A Short-Text Relational Topic Model","authors":"Yuanchun Jiang, Ruicheng Liang, Ji Zhang, Jianshan Sun, Yezheng Liu, Yang Qian","doi":"10.1145/3480246","DOIUrl":"https://doi.org/10.1145/3480246","url":null,"abstract":"Online social media provides rich and varied information reflecting the significant concerns of the public during the coronavirus pandemic. Analyzing what the public is concerned with from social media information can support policy-makers to maintain the stability of the social economy and life of the society. In this article, we focus on the detection of the network public opinions during the coronavirus pandemic. We propose a novel Relational Topic Model for Short texts (RTMS) to draw opinion topics from social media data. RTMS exploits the feature of texts in online social media and the opinion propagation patterns among individuals. Moreover, a dynamic version of RTMS (DRTMS) is proposed to capture the evolution of public opinions. Our experiment is conducted on a real-world dataset which includes 67,592 comments from 14,992 users. The results demonstrate that, compared with the benchmark methods, the proposed RTMS and DRTMS models can detect meaningful public opinions by leveraging the feature of social media data. It can also effectively capture the evolution of public concerns during different phases of the coronavirus pandemic.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117225882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NTP-Miner: Nonoverlapping Three-Way Sequential Pattern Mining
Youxi Wu, L. Luo, Yan Li, Lei Guo, Philippe Fournier-Viger, Xingquan Zhu, Xindong Wu
Nonoverlapping sequential pattern mining is an important type of sequential pattern mining (SPM) with gap constraints, which not only can reveal interesting patterns to users but also can effectively reduce the search space using the Apriori (anti-monotonicity) property. However, existing algorithms do not focus on attributes of interest to users, meaning that they may discover many frequent patterns that are redundant. To solve this problem, this article proposes a task called nonoverlapping three-way sequential pattern (NTP) mining, where attributes are categorized according to three levels of interest: strong, medium, and weak. NTP mining can effectively avoid mining redundant patterns, since NTPs are composed of strong and medium interest items. Moreover, NTPs can avoid serious deviations (occurrences significantly different from their pattern), since gap constraints cannot match with strong interest patterns. To mine NTPs, an effective algorithm called NTP-Miner is put forward, which applies two main steps: support (occurrence frequency) calculation and candidate pattern generation. To calculate the support of an NTP, depth-first search and backtracking strategies are adopted, which do not require creating a whole Nettree structure; many redundant nodes and parent–child relationships therefore do not need to be created, which improves time and space efficiency. To generate candidate patterns while reducing their number, NTP-Miner employs a pattern join strategy and only mines patterns of strong and medium interest. Experimental results on stock market and protein datasets show that NTP-Miner not only is more efficient than other competitive approaches but can also help users find more valuable patterns. More importantly, NTP mining achieves better performance than other competitive methods on clustering tasks. Algorithms and data are available at: https://github.com/wuc567/Pattern-Mining/tree/master/NTP-Miner.
{"title":"NTP-Miner: Nonoverlapping Three-Way Sequential Pattern Mining","authors":"Youxi Wu, L. Luo, Yan Li, Lei Guo, Philippe Fournier-Viger, Xingquan Zhu, Xindong Wu","doi":"10.1145/3480245","DOIUrl":"https://doi.org/10.1145/3480245","url":null,"abstract":"Nonoverlapping sequential pattern mining is an important type of sequential pattern mining (SPM) with gap constraints, which not only can reveal interesting patterns to users but also can effectively reduce the search space using the Apriori (anti-monotonicity) property. However, the existing algorithms do not focus on attributes of interest to users, meaning that existing methods may discover many frequent patterns that are redundant. To solve this problem, this article proposes a task called nonoverlapping three-way sequential pattern (NTP) mining, where attributes are categorized according to three levels of interest: strong, medium, and weak interest. NTP mining can effectively avoid mining redundant patterns since the NTPs are composed of strong and medium interest items. Moreover, NTPs can avoid serious deviations (the occurrence is significantly different from its pattern) since gap constraints cannot match with strong interest patterns. To mine NTPs, an effective algorithm is put forward, called NTP-Miner, which applies two main steps: support (frequency occurrence) calculation and candidate pattern generation. To calculate the support of an NTP, depth-first and backtracking strategies are adopted, which do not require creating a whole Nettree structure, meaning that many redundant nodes and parent–child relationships do not need to be created. Hence, time and space efficiency is improved. To generate candidate patterns while reducing their number, NTP-Miner employs a pattern join strategy and only mines patterns of strong and medium interest. Experimental results on stock market and protein datasets show that NTP-Miner not only is more efficient than other competitive approaches but can also help users find more valuable patterns. More importantly, NTP mining has achieved better performance than other competitive methods in clustering tasks. Algorithms and data are available at: https://github.com/wuc567/Pattern-Mining/tree/master/NTP-Miner.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130367982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deciphering Feature Effects on Decision-Making in Ordinal Regression Problems: An Explainable Ordinal Factorization Model
Mengzhuo Guo, Zhongzhi Xu, Qingpeng Zhang, Xiuwu Liao, Jiapeng Liu
Ordinal regression predicts labels that exhibit a natural ordering, which is vital to decision-making problems such as credit scoring and clinical diagnosis. In these problems, the ability to explain how individual features and their interactions affect the decisions is as critical as model performance. Unfortunately, existing ordinal regression models in the machine learning community aim at improving prediction accuracy rather than exploring explainability. To achieve high accuracy while explaining the relationships between the features and the predictions, we propose a new method for ordinal regression problems, namely the Explainable Ordinal Factorization Model (XOFM). XOFM uses piecewise linear functions to approximate the shape functions of individual features, and renders the pairwise feature interaction effects as heat-maps. The proposed XOFM captures the nonlinearity in the main effects and ensures the same flexibility for the interaction effects. Therefore, the underlying model yields comparable performance while remaining explainable, by explicitly describing the main and interaction effects. To address the potential sparsity problem caused by discretizing the whole feature scale into several sub-intervals, XOFM integrates Factorization Machines (FMs) to factorize the model parameters. Comprehensive experiments on benchmark real-world and synthetic datasets demonstrate that the proposed XOFM achieves state-of-the-art prediction performance while preserving easy-to-understand explainability.
{"title":"Deciphering Feature Effects on Decision-Making in Ordinal Regression Problems: An Explainable Ordinal Factorization Model","authors":"Mengzhuo Guo, Zhongzhi Xu, Qingpeng Zhang, Xiuwu Liao, Jiapeng Liu","doi":"10.1145/3487048","DOIUrl":"https://doi.org/10.1145/3487048","url":null,"abstract":"Ordinal regression predicts the objects’ labels that exhibit a natural ordering, which is vital to decision-making problems such as credit scoring and clinical diagnosis. In these problems, the ability to explain how the individual features and their interactions affect the decisions is as critical as model performance. Unfortunately, the existing ordinal regression models in the machine learning community aim at improving prediction accuracy rather than explore explainability. To achieve high accuracy while explaining the relationships between the features and the predictions, we propose a new method for ordinal regression problems, namely the Explainable Ordinal Factorization Model (XOFM). XOFM uses piecewise linear functions to approximate the shape functions of individual features, and renders the pairwise features interaction effects as heat-maps. The proposed XOFM captures the nonlinearity in the main effects and ensures the interaction effects’ same flexibility. Therefore, the underlying model yields comparable performance while remaining explainable by explicitly describing the main and interaction effects. To address the potential sparsity problem caused by discretizing the whole feature scale into several sub-intervals, XOFM integrates the Factorization Machines (FMs) to factorize the model parameters. Comprehensive experiments with benchmark real-world and synthetic datasets demonstrate that the proposed XOFM leads to state-of-the-art prediction performance while preserving an easy-to-understand explainability.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125039677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}