ACM Transactions on Knowledge Discovery from Data (TKDD): Latest Articles, Page 4

BhBF: A Bloom Filter Using Bh Sequences for Multi-set Membership Query

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-03-10 DOI: 10.1145/3502735

Shuyu Pei, Kun Xie, Xin Wang, Gaogang Xie, Kenli Li, Wei Li, Yanbiao Li, Jigang Wen

Multi-set membership query is a fundamental issue for network functions such as packet processing and state machine monitoring. Given the strict query speed and memory requirements, it would be promising if a multi-set query algorithm could be designed based on the Bloom filter (BF), a space-efficient probabilistic data structure. However, existing BF-based multi-set query schemes suffer from at least one of the following drawbacks: low query speed, low query accuracy, support for only insertion and query operations, or a limit on the set size. To address these issues, we design a novel Bh sequence-based Bloom filter (BhBF) for multi-set query, which supports four operations: insertion, query, deletion, and update. In BhBF, the set ID is encoded as a code in a Bh sequence. Exploiting the properties of Bh sequences, we can correctly decode the BF cells to obtain the set IDs even when the number of hash collisions is high, which yields high query accuracy. We also propose two strategies to further speed up queries and increase query accuracy. On the theoretical side, we analyze the false positive rate and the classification failure rate of BhBF. Our results from extensive experiments over two real datasets demonstrate that BhBF significantly advances state-of-the-art multi-set query algorithms.
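
The decoding idea in the abstract can be illustrated with a small sketch. It relies on the standard definition of a Bh sequence: a set of integers in which all sums of h elements (repetition allowed) are distinct. The cell layout assumed here, a pair of (number of colliding insertions, sum of their codes), and the function names `is_bh_sequence` and `decode_cell` are hypothetical choices for illustration, not the layout or API described in the paper.

```python
from itertools import combinations_with_replacement


def is_bh_sequence(codes, h):
    """Return True if all sums of h codes (repetition allowed) are distinct."""
    seen = set()
    for combo in combinations_with_replacement(sorted(codes), h):
        total = sum(combo)
        if total in seen:
            return False
        seen.add(total)
    return True


def decode_cell(codes, count, total):
    """Recover the multiset of `count` codes that sums to `total`.

    Assumes a cell stores (count, total); this layout is an illustrative
    assumption. Returns None if no multiset, or more than one, matches.
    """
    matches = [
        combo
        for combo in combinations_with_replacement(sorted(codes), count)
        if sum(combo) == total
    ]
    return matches[0] if len(matches) == 1 else None


# Hypothetical set-ID codes forming a B2 sequence.
SET_CODES = [1, 2, 5, 11]
print(is_bh_sequence(SET_CODES, 2))   # all pairwise sums are distinct
print(decode_cell(SET_CODES, 2, 7))   # two colliding items with codes 2 and 5
```

The brute-force search is only meant to show why uniqueness of sums makes a collided cell decodable when the number of collisions does not exceed h; a practical implementation would presumably use a precomputed lookup table instead.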


Citations: 2

Multi-label Deep Convolutional Transform Learning for Non-intrusive Load Monitoring

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-03-10 DOI: 10.1145/3502729

Shikha Singh, É. Chouzenoux, G. Chierchia, A. Majumdar

The objective of this letter is to propose a novel computational method to learn the state of an appliance (ON/OFF) given the aggregate power consumption recorded by the smart meter. We formulate a multi-label classification problem where the classes correspond to the appliances. The proposed approach is based on our recently introduced framework of convolutional transform learning. We propose a deep supervised version of it relying on an original multi-label cost. Comparisons with state-of-the-art techniques show that our proposed method improves over the benchmarks on popular non-intrusive load monitoring datasets.


Citations: 1

MBN: Towards Multi-Behavior Sequence Modeling for Next Basket Recommendation

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-03-10 DOI: 10.1145/3497748

Yanyan Shen, Baoyuan Ou, Ranzhen Li

Next basket recommendation aims at predicting the next set of items that a user would likely purchase together, which plays an important role in e-commerce platforms. Unlike conventional item recommendation, next basket recommendation focuses on capturing item correlations among baskets and learning the user’s temporal interest from the past purchasing basket sequence. In practice, most users interact with items through various kinds of behaviors. The multi-behavior data sheds light on users’ potential purchasing intentions and resolves noisy signals from accidentally purchased items. In this article, we conduct an empirical study on real datasets to exploit the characteristics of multi-behavior data and confirm its positive effects on next basket recommendation. We develop a novel Multi-Behavior Network (MBN) model that effectively captures item correlations and acquires meta-knowledge from multi-behavior basket sequences. MBN employs a meta multi-behavior sequence encoder to model the temporal dependencies of each individual behavior and extract meta-knowledge across different behaviors. Furthermore, we design a recurring-item-aware predictor in MBN to capture the high degree of repeated item occurrences, leading to better recommendation performance. We conduct extensive experiments to evaluate the performance of our proposed MBN model using real-world multi-behavior data. The results demonstrate the superior recommendation performance of MBN compared with various state-of-the-art methods.


Citations: 11

Graph-Enhanced Spatial-Temporal Network for Next POI Recommendation

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-02-24 DOI: 10.1145/3513092

Zhaobo Wang, Yanmin Zhu, Qiaomei Zhang, Haobing Liu, Chunyang Wang, Tong Liu

The task of next Point-of-Interest (POI) recommendation aims at recommending a list of POIs for a user to visit at the next timestamp based on his/her previous interactions, which is valuable for both location-based service providers and users. Recent state-of-the-art studies mainly employ recurrent neural network (RNN) based methods to model user check-in behaviors according to user’s historical check-in sequences. However, most of the existing RNN-based methods merely capture geographical influences depending on physical distance or successive relation among POIs. They are insufficient to capture the high-order complex geographical influences among POI networks, which are essential for estimating user preferences. To address this limitation, we propose a novel Graph-based Spatial Dependency modeling (GSD) module, which focuses on explicitly modeling complex geographical influences by leveraging graph embedding. GSD captures two types of geographical influences, i.e., distance-based and transition-based influences from designed POI semantic graphs. Additionally, we propose a novel Graph-enhanced Spatial-Temporal network (GSTN), which incorporates user spatial and temporal dependencies for next POI recommendation. Specifically, GSTN consists of a Long Short-Term Memory (LSTM) network for user-specific temporal dependencies modeling and GSD for user spatial dependencies learning. Finally, we evaluate the proposed model using three real-world datasets. Extensive experiments demonstrate the effectiveness of GSD in capturing various geographical influences and the improvement of GSTN over state-of-the-art methods.


Citations: 31

The Distance Function Optimization for the Near Neighbors-Based Classifiers

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-02-24 DOI: 10.1145/3434769

M. Jiřina, Said Krayem

Based on an analysis of the conditions for a good distance function, we identify four rules that should be fulfilled. We then introduce two new distance functions, one a metric and one a pseudometric. We have tested how well they suit distance-based classifiers, especially the IINC classifier. We rank distance functions according to several criteria and tests. Rankings depend not only on the criteria or the nature of the statistical test, but also on whether the test takes into account the differing difficulty of tasks or treats all tasks as equally difficult. We have found that the new distance functions are among the four or five best of the 23 distance functions considered. We tested them on 24 different tasks, using the mean, the median, the Friedman aligned test, and the Quade test. Our results show that a suitable distance function can improve the behavior of distance-based classification rules.
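
To make the role of the distance function concrete, here is a minimal sketch of a nearest-neighbor rule with a pluggable distance. It does not implement the IINC classifier or the two distance functions proposed in the article, whose definitions are not given in the abstract; it substitutes a plain 1-nearest-neighbor rule, the Euclidean metric, and a toy pseudometric (`first_coordinate_gap`, a hypothetical name) under which distinct points can be at distance zero.

```python
import math


def euclidean(a, b):
    """A metric: zero distance only between identical points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_coordinate_gap(a, b):
    """A toy pseudometric: distinct points may be at distance zero
    because only the first coordinate is compared."""
    return abs(a[0] - b[0])


def nearest_neighbor_label(train, query, distance):
    """Return the label of the training point closest to `query`.

    `train` is a list of (point, label) pairs; `distance` is any callable
    taking two points. This is a plain 1-NN rule, not IINC.
    """
    _, label = min(train, key=lambda pair: distance(pair[0], query))
    return label


train = [((0.0, 0.0), "a"), ((1.0, 5.0), "b")]
query = (0.9, 0.0)
print(nearest_neighbor_label(train, query, euclidean))
print(nearest_neighbor_label(train, query, first_coordinate_gap))
```

On this toy data the two distance functions pick different neighbors for the same query, which is the kind of sensitivity that motivates comparing distance functions across many tasks.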


Citations: 1

Asymmetric Multi-Task Learning with Local Transference

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-02-04 DOI: 10.1145/3514252

Saullo H. G. Oliveira, A. Gonçalves, F. von Zuben

In this article, we present the Group Asymmetric Multi-Task Learning (GAMTL) algorithm that automatically learns from data how tasks transfer information among themselves at the level of a subset of features. In practice, for each group of features GAMTL extracts an asymmetric relationship supported by the tasks, instead of assuming a single structure for all features. The additional flexibility promoted by local transference in GAMTL allows any two tasks to have multiple asymmetric relationships. The proposed method leverages the information present in these multiple structures to bias the training of individual tasks towards more generalizable models. The solution to GAMTL’s associated optimization problem is an alternating minimization procedure involving task parameters and multiple asymmetric relationships, thus leading to smaller convex sub-problems. GAMTL was evaluated on both synthetic and real datasets. To demonstrate GAMTL’s versatility, we generated a synthetic scenario characterized by diverse profiles of structural relationships among tasks. GAMTL was also applied to the problem of Alzheimer’s Disease (AD) progression prediction. Our experiments indicated that the proposed approach not only increased prediction performance, but also estimated scientifically grounded relationships among multiple cognitive scores, taken here as multiple regression tasks, and regions of interest in the brain, directly associated here with groups of features. We also employed stability selection analysis to investigate GAMTL’s robustness to data sampling rate and hyper-parameter configuration. GAMTL source code is available on GitHub: https://github.com/shgo/gamtl.


Citations: 2

Graph Neural News Recommendation with User Existing and Potential Interest Modeling

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-02-04 DOI: 10.1145/3511708

Zhaopeng Qiu, Yunfan Hu, Xian Wu

Personalized news recommendations can alleviate the information overload problem. To enable personalized recommendation, one critical step is to learn a comprehensive user representation to model her/his interests. Many existing works learn user representations from historically clicked news articles, which reflect users’ existing interests. However, these approaches ignore users’ potential interests and pay less attention to news that may interest the users in the future. To address this problem, we propose a novel Graph neural news Recommendation model with user Existing and Potential interest modeling, named GREP. Unlike existing works, GREP introduces three modules to jointly model users’ existing and potential interests: (1) the Existing Interest Encoding module mines the user’s historically clicked news and applies the multi-head self-attention mechanism to capture the relatedness among the news; (2) the Potential Interest Encoding module leverages a graph neural network to explore the user’s potential interests on the knowledge graph; and (3) the Bi-directional Interaction module dynamically builds a news-entity bipartite graph to further enrich the two interest representations. Finally, GREP combines the existing and potential interest representations to represent the user and leverages a prediction layer to estimate the clicking probability of the candidate news. Experiments on two real-world large-scale datasets demonstrate the state-of-the-art performance of GREP.


Citations: 10

Quality-Informed Process Mining: A Case for Standardised Data Quality Annotations

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-02-04 DOI: 10.1145/3511707

Kanika Goel, S. Leemans, Niels Martin, M. Wynn

Real-life event logs, reflecting the actual executions of complex business processes, are faced with numerous data quality issues. Extensive data sanity checks and pre-processing are usually needed before historical data can be used as input to obtain reliable data-driven insights. However, most of the existing algorithms in process mining, a field focusing on data-driven process analysis, do not take any data quality issues or the potential effects of data pre-processing into account explicitly. This can result in erroneous process mining results, leading to inaccurate, or misleading conclusions about the process under investigation. To address this gap, we propose data quality annotations for event logs, which can be used by process mining algorithms to generate quality-informed insights. Using a design science approach, requirements are formulated, which are leveraged to propose data quality annotations. Moreover, we present the “Quality-Informed visual Miner” plug-in to demonstrate the potential utility and impact of data quality annotations. Our experimental results, utilising both synthetic and real-life event logs, show how the use of data quality annotations by process mining techniques can assist in increasing the reliability of performance analysis results.


Citations: 5

Multi-relation Graph Summarization

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2021-12-24 DOI: 10.1145/3494561

Xiangyu Ke, Arijit Khan, F. Bonchi

Graph summarization is beneficial in a wide range of applications, such as visualization, interactive and exploratory analysis, approximate query processing, reducing the on-disk storage footprint, and graph processing in modern hardware. However, the bulk of the literature on graph summarization surprisingly overlooks the possibility of having edges of different types. In this article, we study the novel problem of producing summaries of multi-relation networks, i.e., graphs where multiple edges of different types may exist between any pair of nodes. Multi-relation graphs are an expressive model of real-world activities, in which a relation can be a topic in social networks, an interaction type in genetic networks, or a snapshot in temporal graphs. The first approach that we consider for multi-relation graph summarization is a two-step method based on summarizing each relation in isolation, and then aggregating the resulting summaries in some clever way to produce a final unique summary. In doing this, as a side contribution, we provide the first polynomial-time approximation algorithm based on the k-Median clustering for the classic problem of lossless single-relation graph summarization. Then, we demonstrate the shortcomings of these two-step methods, and propose holistic approaches, both approximate and heuristic algorithms, to compute a summary directly for multi-relation graphs. In particular, we prove that the approximation bound of k-Median clustering for the single relation solution can be maintained in a multi-relation graph with proper aggregation operation over adjacency matrices corresponding to its multiple relations. Experimental results and case studies (on co-authorship networks and brain networks) validate the effectiveness and efficiency of the proposed algorithms.
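
The difference between summarizing each relation in isolation and summarizing all relations jointly can be seen in a very small sketch. It is not the k-Median-based algorithm from the article; it only merges nodes whose neighborhoods are identical in every relation considered, which is the simplest lossless grouping. The function name `group_identical_nodes`, the edge-set input format, and the example relations are assumptions made for illustration.

```python
from collections import defaultdict


def group_identical_nodes(relations, num_nodes):
    """Group nodes whose neighbor sets agree in every given relation.

    `relations` maps a relation name to a set of undirected edges (u, v).
    Nodes in one group can be merged into a supernode without losing
    information about any of the given relations.
    """
    groups = defaultdict(list)
    for node in range(num_nodes):
        signature = []
        for name in sorted(relations):
            # Neighbors of `node` under this relation.
            neighbors = frozenset(
                v if u == node else u
                for u, v in relations[name]
                if node in (u, v)
            )
            signature.append(neighbors)
        groups[tuple(signature)].append(node)
    return sorted(groups.values())


# Hypothetical two-relation graph on four nodes.
relations = {
    "coauthor": {(0, 2), (1, 2)},
    "cites": {(0, 3)},
}
print(group_identical_nodes({"coauthor": relations["coauthor"]}, 4))
print(group_identical_nodes(relations, 4))
```

Nodes 0 and 1 look interchangeable when only the first relation is considered, but the second relation separates them, so a summary built from a single relation would not be valid for the multi-relation graph.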

Citations: 7

HCBST: An Efficient Hybrid Sampling Technique for Class Imbalance Problems

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date: 2021-11-15 DOI: 10.1145/3488280

R. Sowah, B. Kuditchar, Godfrey A. Mills, A. Acakpovi, Ralph A. Twum, Gifty Buah, R. Agboyi

The class imbalance problem is prevalent in many real-world domains and has become an active area of research. In binary classification, imbalanced learning refers to learning from a dataset that is heavily skewed toward the negative class. This skew causes classification algorithms to perform poorly when predicting the positive class on new examples. Data resampling, which manipulates the training data before standard classification techniques are applied, is among the most commonly used ways to deal with class imbalance. This article presents a new hybrid sampling technique that significantly improves the overall performance of classification algorithms on class-imbalanced data. The proposed method, the Hybrid Cluster-Based Undersampling Technique (HCBST), combines cluster undersampling of the majority instances with an oversampling technique, derived from Sigma Nearest Oversampling based on Convex Combination, applied to the minority instances, in order to solve the class imbalance problem with a high degree of accuracy and reliability. The algorithm was tested on 11 datasets with varying degrees of imbalance from the National Aeronautics and Space Administration Metric Data Program data repository and the University of California Irvine Machine Learning data repository. Results were compared with classification algorithms such as K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Test results revealed that, for the same datasets, HCBST performed better, with average scores of 0.73, 0.67, and 0.35 for area under the curve, geometric mean, and Matthews correlation coefficient, respectively, across all the classifiers used in this study. HCBST therefore has the potential to improve performance on class-imbalanced problems and, by extension, in the various applications that rely on solving them.
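
Two of the measures reported above, the geometric mean and the Matthews correlation coefficient, have standard definitions in terms of a binary confusion matrix. The following is a minimal sketch of those definitions only; the function names and the example counts are illustrative assumptions and are not taken from the article or its experiments.

```python
import math


def geometric_mean(tp, fn, fp, tn):
    """G-mean: square root of sensitivity (TPR) times specificity (TNR)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def matthews_corrcoef(tp, fn, fp, tn):
    """MCC in [-1, 1]; returns 0.0 when any marginal total is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical imbalanced test set: 10 positives, 90 negatives.
tp, fn, fp, tn = 8, 2, 10, 80
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"accuracy={accuracy:.3f}")
print(f"g_mean={geometric_mean(tp, fn, fp, tn):.3f}")
print(f"mcc={matthews_corrcoef(tp, fn, fp, tn):.3f}")
```

Both measures use all four cells of the confusion matrix, which is why they are common choices on skewed data, where plain accuracy can look high even if the minority class is rarely predicted correctly.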

Citations: 8