Multi-set membership query is a fundamental problem for network functions such as packet processing and state machine monitoring. Given the strict requirements on query speed and memory, a multi-set query algorithm based on the Bloom filter (BF), a space-efficient probabilistic data structure, is a promising approach. However, existing BF-based multi-set query schemes suffer from at least one of the following drawbacks: low query speed, low query accuracy, support for only insertion and query operations, or a limit on the set size. To address these issues, we design a novel Bh sequence-based Bloom filter (BhBF) for multi-set query, which supports four operations: insertion, query, deletion, and update. In BhBF, each set ID is encoded as a code in a Bh sequence. Exploiting the properties of Bh sequences, we can correctly decode the BF cells to obtain the set IDs even when the number of hash collisions is high, which yields high query accuracy. We further propose two strategies to increase query speed and query accuracy. On the theoretical side, we analyze the false positive rate and the classification failure rate of BhBF. Results from extensive experiments on two real datasets demonstrate that BhBF significantly advances state-of-the-art multi-set query algorithms.
{"title":"BhBF: A Bloom Filter Using Bh Sequences for Multi-set Membership Query","authors":"Shuyu Pei, Kun Xie, Xin Wang, Gaogang Xie, Kenli Li, Wei Li, Yanbiao Li, Jigang Wen","doi":"10.1145/3502735","DOIUrl":"https://doi.org/10.1145/3502735","url":null,"abstract":"Multi-set membership query is a fundamental issue for network functions such as packet processing and state machines monitoring. Given the rigid query speed and memory requirements, it would be promising if a multi-set query algorithm can be designed based on Bloom filter (BF), a space-efficient probabilistic data structure. However, existing efforts on multi-set query based on BF suffer from at least one of the following drawbacks: low query speed, low query accuracy, limitation in only supporting insertion and query operations, or limitation in the set size. To address the issues, we design a novel Bh sequence-based Bloom filter (BhBF) for multi-set query, which supports four operations: insertion, query, deletion, and update. In BhBF, the set ID is encoded as a code in a Bh sequence. Exploiting good properties of Bh sequences, we can correctly decode the BF cells to obtain the set IDs even when the number of hash collisions is high, which brings high query accuracy. In BhBF, we propose two strategies to further speed up the query speed and increase the query accuracy. On the theoretical side, we analyze the false positive and classification failure rate of our BhBF. 
Our results from extensive experiments over two real datasets demonstrate that BhBF significantly advances state-of-the-art multi-set query algorithms.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116270873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
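The decoding idea in the BhBF abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the code table `CODES`, the per-cell `(count, sum)` layout, the class name `ToyBhFilter`, and the salted SHA-256 hashing are all assumptions made for illustration. It relies only on the defining property of a B_h sequence (here h = 2, a Sidon set): sums of at most h codes are pairwise distinct, so a cell shared by up to h insertions can still be decoded into set IDs.

```python
import hashlib
from itertools import combinations_with_replacement

# Assumed toy B_2 (Sidon) sequence: all sums of two codes are distinct.
CODES = {"A": 1, "B": 2, "C": 5, "D": 11}
H = 2  # a cell is decodable while it holds at most H codes

# Lookup table: (number of codes, sum of codes) -> set IDs present in the cell.
DECODE = {}
for n in range(1, H + 1):
    for combo in combinations_with_replacement(sorted(CODES), n):
        DECODE[(n, sum(CODES[s] for s in combo))] = set(combo)


class ToyBhFilter:
    """Counting-style filter whose cells store a code count and a code sum."""

    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.count = [0] * m
        self.total = [0] * m

    def _cells(self, item):
        # k cell indexes derived from salted SHA-256 digests (hypothetical choice).
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def insert(self, item, set_id):
        for c in self._cells(item):
            self.count[c] += 1
            self.total[c] += CODES[set_id]

    def delete(self, item, set_id):
        for c in self._cells(item):
            self.count[c] -= 1
            self.total[c] -= CODES[set_id]

    def query(self, item):
        candidates = set(CODES)
        for c in self._cells(item):
            if self.count[c] == 0:
                return set()  # an empty cell means the item was never inserted
            decoded = DECODE.get((self.count[c], self.total[c]))
            if decoded is not None:  # cells with more than H codes are skipped
                candidates &= decoded
        return candidates


f = ToyBhFilter()
f.insert("flow-1", "A")
f.insert("flow-2", "C")
print(f.query("flow-1"))  # candidate set IDs for flow-1; always contains "A"
```

An update can be expressed as a `delete` followed by an `insert` with the new set ID. Cells holding more than `H` codes cannot be decoded in this sketch and are skipped, which is where a larger h would trade code length for collision tolerance.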
Shikha Singh, É. Chouzenoux, G. Chierchia, A. Majumdar
The objective of this letter is to propose a novel computational method to learn the state of an appliance (ON / OFF) given the aggregate power consumption recorded by the smart-meter. We formulate a multi-label classification problem where the classes correspond to the appliances. The proposed approach is based on our recently introduced framework of convolutional transform learning. We propose a deep supervised version of it relying on an original multi-label cost. Comparisons with state-of-the-art techniques show that our proposed method improves over the benchmarks on popular non-intrusive load monitoring datasets.
{"title":"Multi-label Deep Convolutional Transform Learning for Non-intrusive Load Monitoring","authors":"Shikha Singh, É. Chouzenoux, G. Chierchia, A. Majumdar","doi":"10.1145/3502729","DOIUrl":"https://doi.org/10.1145/3502729","url":null,"abstract":"The objective of this letter is to propose a novel computational method to learn the state of an appliance (ON / OFF) given the aggregate power consumption recorded by the smart-meter. We formulate a multi-label classification problem where the classes correspond to the appliances. The proposed approach is based on our recently introduced framework of convolutional transform learning. We propose a deep supervised version of it relying on an original multi-label cost. Comparisons with state-of-the-art techniques show that our proposed method improves over the benchmarks on popular non-intrusive load monitoring datasets.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"83 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124465938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Next basket recommendation aims at predicting the next set of items that a user is likely to purchase together, and it plays an important role in e-commerce platforms. Unlike conventional item recommendation, next basket recommendation focuses on capturing item correlations among baskets and learning the user’s temporal interest from the sequence of past purchased baskets. In practice, most users interact with items through various kinds of behaviors. Multi-behavior data sheds light on users’ potential purchasing intentions and helps filter out noisy signals from accidentally purchased items. In this article, we conduct an empirical study on real datasets to explore the characteristics of multi-behavior data and confirm its positive effects on next basket recommendation. We develop a novel Multi-Behavior Network (MBN) model that effectively captures item correlations and acquires meta-knowledge from multi-behavior basket sequences. MBN employs a meta multi-behavior sequence encoder to model the temporal dependencies of each individual behavior and extract meta-knowledge across different behaviors. Furthermore, we design a recurring-item-aware predictor in MBN to account for the high frequency of repeated item occurrences, leading to better recommendation performance. We conduct extensive experiments to evaluate the performance of the proposed MBN model using real-world multi-behavior data. The results demonstrate the superior recommendation performance of MBN compared with various state-of-the-art methods.
{"title":"MBN: Towards Multi-Behavior Sequence Modeling for Next Basket Recommendation","authors":"Yanyan Shen, Baoyuan Ou, Ranzhen Li","doi":"10.1145/3497748","DOIUrl":"https://doi.org/10.1145/3497748","url":null,"abstract":"Next basket recommendation aims at predicting the next set of items that a user would likely purchase together, which plays an important role in e-commerce platforms. Unlike conventional item recommendation, the next basket recommendation focuses on capturing item correlations among baskets and learning the user’s temporal interest from the past purchasing basket sequence. In practice, most users interact with items in various kinds of behaviors. The multi-behavior data sheds light on user’s potential purchasing intention and resolves noisy signals from accidentally purchased items. In this article, we conduct an empirical study on real datasets to exploit the characteristics of multi-behavior data and confirm its positive effects on next basket recommendation. We develop a novel Multi-Behavior Network (MBN) model that captures item correlations and acquires meta-knowledge from multi-behavior basket sequences effectively. MBN employs the meta multi-behavior sequence encoder to model temporal dependencies of each individual behavior and extract meta-knowledge across different behaviors. Furthermore, we design the recurring-item-aware predictor in MBN to realize the high degree of the repeated occurrences of items, leading to better recommendation performance. We conduct extensive experiments to evaluate the performance of our proposed MBN model using real-world multi-behavior data. 
The results demonstrate the superior recommendation performance of MBN compared with various state-of-the-art methods.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126231120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The task of next Point-of-Interest (POI) recommendation aims at recommending a list of POIs for a user to visit at the next timestamp based on his/her previous interactions, which is valuable for both location-based service providers and users. Recent state-of-the-art studies mainly employ recurrent neural network (RNN) based methods to model user check-in behaviors according to user’s historical check-in sequences. However, most of the existing RNN-based methods merely capture geographical influences depending on physical distance or successive relation among POIs. They are insufficient to capture the high-order complex geographical influences among POI networks, which are essential for estimating user preferences. To address this limitation, we propose a novel Graph-based Spatial Dependency modeling (GSD) module, which focuses on explicitly modeling complex geographical influences by leveraging graph embedding. GSD captures two types of geographical influences, i.e., distance-based and transition-based influences from designed POI semantic graphs. Additionally, we propose a novel Graph-enhanced Spatial-Temporal network (GSTN), which incorporates user spatial and temporal dependencies for next POI recommendation. Specifically, GSTN consists of a Long Short-Term Memory (LSTM) network for user-specific temporal dependencies modeling and GSD for user spatial dependencies learning. Finally, we evaluate the proposed model using three real-world datasets. Extensive experiments demonstrate the effectiveness of GSD in capturing various geographical influences and the improvement of GSTN over state-of-the-art methods.
{"title":"Graph-Enhanced Spatial-Temporal Network for Next POI Recommendation","authors":"Zhaobo Wang, Yanmin Zhu, Qiaomei Zhang, Haobing Liu, Chunyang Wang, Tong Liu","doi":"10.1145/3513092","DOIUrl":"https://doi.org/10.1145/3513092","url":null,"abstract":"The task of next Point-of-Interest (POI) recommendation aims at recommending a list of POIs for a user to visit at the next timestamp based on his/her previous interactions, which is valuable for both location-based service providers and users. Recent state-of-the-art studies mainly employ recurrent neural network (RNN) based methods to model user check-in behaviors according to user’s historical check-in sequences. However, most of the existing RNN-based methods merely capture geographical influences depending on physical distance or successive relation among POIs. They are insufficient to capture the high-order complex geographical influences among POI networks, which are essential for estimating user preferences. To address this limitation, we propose a novel Graph-based Spatial Dependency modeling (GSD) module, which focuses on explicitly modeling complex geographical influences by leveraging graph embedding. GSD captures two types of geographical influences, i.e., distance-based and transition-based influences from designed POI semantic graphs. Additionally, we propose a novel Graph-enhanced Spatial-Temporal network (GSTN), which incorporates user spatial and temporal dependencies for next POI recommendation. Specifically, GSTN consists of a Long Short-Term Memory (LSTM) network for user-specific temporal dependencies modeling and GSD for user spatial dependencies learning. Finally, we evaluate the proposed model using three real-world datasets. 
Extensive experiments demonstrate the effectiveness of GSD in capturing various geographical influences and the improvement of GSTN over state-of-the-art methods.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128620282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Based on the analysis of conditions for a good distance function we found four rules that should be fulfilled. Then, we introduce two new distance functions, a metric and a pseudometric one. We have tested how they fit for distance-based classifiers, especially for the IINC classifier. We rank distance functions according to several criteria and tests. Rankings depend not only on criteria or nature of the statistical test, but also whether it takes into account different difficulties of tasks or whether it considers all tasks as equally difficult. We have found that the new distance functions introduced belong among the four or five best out of 23 distance functions. We have tested them on 24 different tasks, using the mean, the median, the Friedman aligned test, and the Quade test. Our results show that a suitable distance function can improve behavior of distance-based classification rules.
{"title":"The Distance Function Optimization for the Near Neighbors-Based Classifiers","authors":"M. Jiřina, Said Krayem","doi":"10.1145/3434769","DOIUrl":"https://doi.org/10.1145/3434769","url":null,"abstract":"Based on the analysis of conditions for a good distance function we found four rules that should be fulfilled. Then, we introduce two new distance functions, a metric and a pseudometric one. We have tested how they fit for distance-based classifiers, especially for the IINC classifier. We rank distance functions according to several criteria and tests. Rankings depend not only on criteria or nature of the statistical test, but also whether it takes into account different difficulties of tasks or whether it considers all tasks as equally difficult. We have found that the new distance functions introduced belong among the four or five best out of 23 distance functions. We have tested them on 24 different tasks, using the mean, the median, the Friedman aligned test, and the Quade test. Our results show that a suitable distance function can improve behavior of distance-based classification rules.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123989560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
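To make the role of the distance function in a distance-based classifier concrete, here is a minimal sketch of a nearest-neighbors classifier with a pluggable distance. It is not the IINC classifier and does not implement the two new distance functions from the abstract; `euclidean`, `manhattan`, `knn_predict`, and the toy training data are assumptions chosen only to show where a different distance function would plug in.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, query, k=3, dist=euclidean):
    """Majority vote among the k training points closest to `query` under `dist`."""
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-class toy data: (feature vector, label).
train = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]

print(knn_predict(train, (0.5, 0.5), dist=euclidean))
print(knn_predict(train, (5.5, 5.5), dist=manhattan))
```

Because the classifier only calls `dist(a, b)`, comparing distance functions across tasks amounts to swapping that one argument and recording the resulting error rates.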
In this article, we present the Group Asymmetric Multi-Task Learning (GAMTL) algorithm, which automatically learns from data how tasks transfer information among themselves at the level of a subset of features. In practice, for each group of features GAMTL extracts an asymmetric relationship supported by the tasks, instead of assuming a single structure for all features. The additional flexibility promoted by local transference in GAMTL allows any two tasks to have multiple asymmetric relationships. The proposed method leverages the information present in these multiple structures to bias the training of individual tasks towards more generalizable models. GAMTL’s associated optimization problem is solved by an alternating minimization procedure over the task parameters and the multiple asymmetric relationships, which leads to smaller convex sub-problems. GAMTL was evaluated on both synthetic and real datasets. To demonstrate GAMTL’s versatility, we generated a synthetic scenario characterized by diverse profiles of structural relationships among tasks. GAMTL was also applied to the problem of Alzheimer’s Disease (AD) progression prediction. Our experiments indicated that the proposed approach not only increased prediction performance, but also estimated scientifically grounded relationships among multiple cognitive scores, taken here as multiple regression tasks, and regions of interest in the brain, directly associated here with groups of features. We also employed stability selection analysis to investigate GAMTL’s robustness to data sampling rate and hyper-parameter configuration. GAMTL source code is available on GitHub: https://github.com/shgo/gamtl.
{"title":"Asymmetric Multi-Task Learning with Local Transference","authors":"Saullo H. G. Oliveira, A. Gonçalves, F. von Zuben","doi":"10.1145/3514252","DOIUrl":"https://doi.org/10.1145/3514252","url":null,"abstract":"In this article, we present the Group Asymmetric Multi-Task Learning (GAMTL) algorithm that automatically learns from data how tasks transfer information among themselves at the level of a subset of features. In practice, for each group of features GAMTL extracts an asymmetric relationship supported by the tasks, instead of assuming a single structure for all features. The additional flexibility promoted by local transference in GAMTL allows any two tasks to have multiple asymmetric relationships. The proposed method leverages the information present in these multiple structures to bias the training of individual tasks towards more generalizable models. The solution to the GAMTL’s associated optimization problem is an alternating minimization procedure involving tasks parameters and multiple asymmetric relationships, thus guiding to convex smaller sub-problems. GAMTL was evaluated on both synthetic and real datasets. To evidence GAMTL versatility, we generated a synthetic scenario characterized by diverse profiles of structural relationships among tasks. GAMTL was also applied to the problem of Alzheimer’s Disease (AD) progression prediction. Our experiments indicated that the proposed approach not only increased prediction performance, but also estimated scientifically grounded relationships among multiple cognitive scores, taken here as multiple regression tasks, and regions of interest in the brain, directly associated here with groups of features. We also employed stability selection analysis to investigate GAMTL’s robustness to data sampling rate and hyper-parameter configuration. 
GAMTL source code is available on GitHub: https://github.com/shgo/gamtl.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124000502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Personalized news recommendations can alleviate the information overload problem. To enable personalized recommendation, one critical step is to learn a comprehensive user representation to model her/his interests. Many existing works learn user representations from the historical clicked news articles, which reflect their existing interests. However, these approaches ignore users’ potential interests and pay less attention to news that may interest the users in the future. To address this problem, we propose a novel Graph neural news Recommendation model with user Existing and Potential interest modeling, named GREP. Different from existing works, GREP introduces three modules to jointly model users’ existing and potential interests: (1) Existing Interest Encoding module mines user historical clicked news and applies the multi-head self-attention mechanism to capture the relatedness among the news; (2) Potential Interest Encoding module leverages the graph neural network to explore the user potential interests on the knowledge graph; and (3) Bi-directional Interaction module dynamically builds a news-entity bipartite graph to further enrich two interest representations. Finally, GREP combines the existing and potential interest representations to represent the user and leverages a prediction layer to estimate the clicking probability of the candidate news. Experiments on two real-world large-scale datasets demonstrate the state-of-the-art performance of GREP.
{"title":"Graph Neural News Recommendation with User Existing and Potential Interest Modeling","authors":"Zhaopeng Qiu, Yunfan Hu, Xian Wu","doi":"10.1145/3511708","DOIUrl":"https://doi.org/10.1145/3511708","url":null,"abstract":"Personalized news recommendations can alleviate the information overload problem. To enable personalized recommendation, one critical step is to learn a comprehensive user representation to model her/his interests. Many existing works learn user representations from the historical clicked news articles, which reflect their existing interests. However, these approaches ignore users’ potential interests and pay less attention to news that may interest the users in the future. To address this problem, we propose a novel Graph neural news Recommendation model with user Existing and Potential interest modeling, named GREP. Different from existing works, GREP introduces three modules to jointly model users’ existing and potential interests: (1) Existing Interest Encoding module mines user historical clicked news and applies the multi-head self-attention mechanism to capture the relatedness among the news; (2) Potential Interest Encoding module leverages the graph neural network to explore the user potential interests on the knowledge graph; and (3) Bi-directional Interaction module dynamically builds a news-entity bipartite graph to further enrich two interest representations. Finally, GREP combines the existing and potential interest representations to represent the user and leverages a prediction layer to estimate the clicking probability of the candidate news. 
Experiments on two real-world large-scale datasets demonstrate the state-of-the-art performance of GREP.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"113 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132364225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-life event logs, reflecting the actual executions of complex business processes, are faced with numerous data quality issues. Extensive data sanity checks and pre-processing are usually needed before historical data can be used as input to obtain reliable data-driven insights. However, most of the existing algorithms in process mining, a field focusing on data-driven process analysis, do not take any data quality issues or the potential effects of data pre-processing into account explicitly. This can result in erroneous process mining results, leading to inaccurate, or misleading conclusions about the process under investigation. To address this gap, we propose data quality annotations for event logs, which can be used by process mining algorithms to generate quality-informed insights. Using a design science approach, requirements are formulated, which are leveraged to propose data quality annotations. Moreover, we present the “Quality-Informed visual Miner” plug-in to demonstrate the potential utility and impact of data quality annotations. Our experimental results, utilising both synthetic and real-life event logs, show how the use of data quality annotations by process mining techniques can assist in increasing the reliability of performance analysis results.
{"title":"Quality-Informed Process Mining: A Case for Standardised Data Quality Annotations","authors":"Kanika Goel, S. Leemans, Niels Martin, M. Wynn","doi":"10.1145/3511707","DOIUrl":"https://doi.org/10.1145/3511707","url":null,"abstract":"Real-life event logs, reflecting the actual executions of complex business processes, are faced with numerous data quality issues. Extensive data sanity checks and pre-processing are usually needed before historical data can be used as input to obtain reliable data-driven insights. However, most of the existing algorithms in process mining, a field focusing on data-driven process analysis, do not take any data quality issues or the potential effects of data pre-processing into account explicitly. This can result in erroneous process mining results, leading to inaccurate, or misleading conclusions about the process under investigation. To address this gap, we propose data quality annotations for event logs, which can be used by process mining algorithms to generate quality-informed insights. Using a design science approach, requirements are formulated, which are leveraged to propose data quality annotations. Moreover, we present the “Quality-Informed visual Miner” plug-in to demonstrate the potential utility and impact of data quality annotations. 
Our experimental results, utilising both synthetic and real-life event logs, show how the use of data quality annotations by process mining techniques can assist in increasing the reliability of performance analysis results.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128250610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph summarization is beneficial in a wide range of applications, such as visualization, interactive and exploratory analysis, approximate query processing, reducing the on-disk storage footprint, and graph processing in modern hardware. However, the bulk of the literature on graph summarization surprisingly overlooks the possibility of having edges of different types. In this article, we study the novel problem of producing summaries of multi-relation networks, i.e., graphs where multiple edges of different types may exist between any pair of nodes. Multi-relation graphs are an expressive model of real-world activities, in which a relation can be a topic in social networks, an interaction type in genetic networks, or a snapshot in temporal graphs. The first approach that we consider for multi-relation graph summarization is a two-step method based on summarizing each relation in isolation, and then aggregating the resulting summaries in some clever way to produce a final unique summary. In doing this, as a side contribution, we provide the first polynomial-time approximation algorithm based on the k-Median clustering for the classic problem of lossless single-relation graph summarization. Then, we demonstrate the shortcomings of these two-step methods, and propose holistic approaches, both approximate and heuristic algorithms, to compute a summary directly for multi-relation graphs. In particular, we prove that the approximation bound of k-Median clustering for the single relation solution can be maintained in a multi-relation graph with proper aggregation operation over adjacency matrices corresponding to its multiple relations. Experimental results and case studies (on co-authorship networks and brain networks) validate the effectiveness and efficiency of the proposed algorithms.
{"title":"Multi-relation Graph Summarization","authors":"Xiangyu Ke, Arijit Khan, F. Bonchi","doi":"10.1145/3494561","DOIUrl":"https://doi.org/10.1145/3494561","url":null,"abstract":"Graph summarization is beneficial in a wide range of applications, such as visualization, interactive and exploratory analysis, approximate query processing, reducing the on-disk storage footprint, and graph processing in modern hardware. However, the bulk of the literature on graph summarization surprisingly overlooks the possibility of having edges of different types. In this article, we study the novel problem of producing summaries of multi-relation networks, i.e., graphs where multiple edges of different types may exist between any pair of nodes. Multi-relation graphs are an expressive model of real-world activities, in which a relation can be a topic in social networks, an interaction type in genetic networks, or a snapshot in temporal graphs. The first approach that we consider for multi-relation graph summarization is a two-step method based on summarizing each relation in isolation, and then aggregating the resulting summaries in some clever way to produce a final unique summary. In doing this, as a side contribution, we provide the first polynomial-time approximation algorithm based on the k-Median clustering for the classic problem of lossless single-relation graph summarization. Then, we demonstrate the shortcomings of these two-step methods, and propose holistic approaches, both approximate and heuristic algorithms, to compute a summary directly for multi-relation graphs. In particular, we prove that the approximation bound of k-Median clustering for the single relation solution can be maintained in a multi-relation graph with proper aggregation operation over adjacency matrices corresponding to its multiple relations. 
Experimental results and case studies (on co-authorship networks and brain networks) validate the effectiveness and efficiency of the proposed algorithms.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115741918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Sowah, B. Kuditchar, Godfrey A. Mills, A. Acakpovi, Ralph A. Twum, Gifty Buah, R. Agboyi
The class imbalance problem is prevalent in many real-world domains and has become an active area of research. In binary classification problems, imbalanced learning refers to learning from a dataset that is heavily skewed toward the negative class. This skew causes classification algorithms to perform poorly when predicting the positive class on new examples. Data resampling, which involves manipulating the training data before applying standard classification techniques, is among the most commonly used techniques to deal with the class imbalance problem. This article presents a new hybrid sampling technique that significantly improves the overall performance of classification algorithms on imbalanced data. The proposed method, called the Hybrid Cluster-Based Undersampling Technique (HCBST), combines a cluster undersampling technique, which under-samples the majority instances, with an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination, which oversamples the minority instances, to address the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested using 11 datasets with varying degrees of imbalance from the National Aeronautics and Space Administration Metric Data Program data repository and the University of California Irvine Machine Learning data repository. Results were compared with classification algorithms such as the K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Test results revealed that, for the same datasets, the HCBST performed better, with average scores of 0.73, 0.67, and 0.35 for area under the curve, geometric mean, and Matthews Correlation Coefficient, respectively, across all the classifiers used in this study. 
The HCBST has the potential to improve classification performance on class-imbalanced data, which, by extension, will benefit the various applications that rely on it for a solution.
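As a reading aid for two of the measures reported above, the following is a minimal sketch of how the geometric mean and the Matthews Correlation Coefficient are conventionally computed from a binary confusion matrix. The function name gmean_and_mcc and the example counts are hypothetical illustrations, not taken from the article, and the article's own evaluation code may differ.

```python
import math


def gmean_and_mcc(tp, fn, fp, tn):
    """Return (geometric mean, Matthews correlation coefficient).

    Inputs are the four cells of a binary confusion matrix, with the
    minority class treated as positive. Undefined ratios fall back to 0.0
    (a common convention, assumed here).
    """
    # Recall on the positive (minority) class.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Recall on the negative (majority) class.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    g_mean = math.sqrt(sensitivity * specificity)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return g_mean, mcc


# Hypothetical test set with 10 positives and 90 negatives.
g, m = gmean_and_mcc(tp=8, fn=2, fp=9, tn=81)
print(f"G-mean={g:.3f}, MCC={m:.3f}")

# A classifier that always predicts the majority class is 90% accurate
# on this set, yet both measures drop to zero.
print(gmean_and_mcc(tp=0, fn=10, fp=0, tn=90))
```

Both measures penalize a classifier that ignores the minority class, which is why they are often preferred over plain accuracy on skewed data.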
{"title":"HCBST: An Efficient Hybrid Sampling Technique for Class Imbalance Problems","authors":"R. Sowah, B. Kuditchar, Godfrey A. Mills, A. Acakpovi, Ralph A. Twum, Gifty Buah, R. Agboyi","doi":"10.1145/3488280","DOIUrl":"https://doi.org/10.1145/3488280","url":null,"abstract":"Class imbalance problem is prevalent in many real-world domains. It has become an active area of research. In binary classification problems, imbalance learning refers to learning from a dataset with a high degree of skewness to the negative class. This phenomenon causes classification algorithms to perform woefully when predicting positive classes with new examples. Data resampling, which involves manipulating the training data before applying standard classification techniques, is among the most commonly used techniques to deal with the class imbalance problem. This article presents a new hybrid sampling technique that improves the overall performance of classification algorithms for solving the class imbalance problem significantly. The proposed method called the Hybrid Cluster-Based Undersampling Technique (HCBST) uses a combination of the cluster undersampling technique to under-sample the majority instances and an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination, to oversample the minority instances to solve the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested using 11 datasets from the National Aeronautics and Space Administration Metric Data Program data repository and University of California Irvine Machine Learning data repository with varying degrees of imbalance. Results were compared with classification algorithms such as the K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. 
Tests results revealed that for the same datasets, the HCBST performed better with average performances of 0.73, 0.67, and 0.35 in terms of performance measures of area under curve, geometric mean, and Matthews Correlation Coefficient, respectively, across all the classifiers used for this study. The HCBST has the potential of improving the performance of the class imbalance problem, which by extension, will improve on the various applications that rely on the concept for a solution.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"251 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116721133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}