Data Mining and Knowledge Discovery

Negative-sample-free knowledge graph embedding
Pub Date : 2024-07-09 | DOI: 10.1007/s10618-024-01052-9
Adil Bahaj, Mounir Ghogho
Recently, knowledge graphs (KGs) have been shown to benefit many machine learning applications in multiple domains (e.g., self-driving, agriculture, bio-medicine, recommender systems). However, KGs suffer from incompleteness, which motivates the task of KG completion: inferring new (unobserved) links between existing entities based on observed links. This task is tackled with probabilistic, rule-based, or embedding-based approaches, of which the embedding-based approach has been shown to consistently outperform the others. It relies, however, on negative sampling, which supposes that every observed link is "true" and that every unobserved link is "false". Negative sampling increases the computational complexity of the learning process and introduces noise into learning. We propose NSF-KGE, a framework for KG embedding that does not require negative sampling, yet achieves performance comparable to that of negative-sampling-based approaches. NSF-KGE employs objectives from the non-contrastive self-supervised literature to learn representations that are invariant to relation transformations (e.g., translation, scaling, rotation) while avoiding representation collapse.
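The core recipe, an alignment term for observed triples plus an anti-collapse regulariser and no negatives at all, can be sketched in a few lines of numpy. The VICReg-style variance hinge below is my assumption for illustration; the paper's exact non-contrastive objective may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy KG: entity embeddings plus one relation acting as a translation.
n_entities, dim = 5, 8
E = rng.normal(size=(n_entities, dim))   # entity embeddings
r = rng.normal(size=dim)                 # one relation embedding
triples = [(0, 1), (2, 3)]               # observed (head, tail) pairs under r

def invariance_loss(E, r, triples):
    # Pull the transformed head towards the tail: no negative samples needed.
    return np.mean([np.sum((E[h] + r - E[t]) ** 2) for h, t in triples])

def variance_loss(E, eps=1e-4):
    # Hinge on the per-dimension std: penalises collapse to a constant vector.
    std = np.sqrt(E.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, 1.0 - std))

def nsf_objective(E, r, triples, lam=1.0):
    return invariance_loss(E, r, triples) + lam * variance_loss(E)

loss = nsf_objective(E, r, triples)
```

A fully collapsed embedding matrix is heavily penalised by the variance term, which is what removes the need for explicit "false" links.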
Knowledge graph embedding closed under composition
Pub Date : 2024-07-04 | DOI: 10.1007/s10618-024-01050-x
Zhuoxun Zheng, Baifan Zhou, Hui Yang, Zhipeng Tan, Zequn Sun, Chunnong Li, Arild Waaler, Evgeny Kharlamov, Ahmet Soylu
Knowledge Graph Embedding (KGE) has attracted increasing attention. Relation patterns, such as symmetry and inversion, have received considerable focus. Among them, composition patterns are particularly important, as they involve nearly all relations in KGs. However, prior KGE approaches often consider relations to be compositional only if they are well-represented in the training data. Consequently, performance can degrade, especially for under-represented composition patterns. To this end, we propose HolmE, a general form of KGE whose relation embedding space is closed under composition, i.e., the composition of any two relation embeddings remains within the embedding space. This property ensures that every relation embedding can compose with, or be composed by, other relation embeddings. It enhances HolmE's capability to model under-represented (also called long-tail) composition patterns with limited learning instances. To the best of our knowledge, our work is the first to discuss KGE with this property of being closed under composition. We provide detailed theoretical proofs and extensive experiments to demonstrate the notable advantages of HolmE in modelling composition patterns, particularly long-tail patterns. Our results also highlight HolmE's effectiveness in extrapolating to unseen relations through composition and its state-of-the-art performance on benchmark datasets.
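The closure property is easy to see in its simplest instance: planar rotations, where composing any two relation embeddings (angles) yields another valid relation embedding. This toy sketch only illustrates closure; HolmE's actual construction is more general:

```python
import numpy as np

def rot(theta):
    # A relation embedded as a 2-D rotation; this family is closed under
    # composition, since rotation angles simply add.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

r1, r2 = rot(0.3), rot(1.1)
composed = r2 @ r1   # composing two relations stays inside the rotation family
```

Because the composed matrix is again a rotation, it can itself compose with any other relation embedding, which is exactly the property the paper exploits for long-tail composition patterns.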
Towards effective urban region-of-interest demand modeling via graph representation learning
Pub Date : 2024-07-03 | DOI: 10.1007/s10618-024-01049-4
Pu Wang, Jingya Sun, Wei Chen, Lei Zhao
Identifying a region's functionalities and the demand for specific Points-of-Interest (POIs) is essential for effective urban planning. However, due to the diverse and ambiguous nature of urban regions, some significant challenges in urban POI demand analysis remain unresolved. To this end, we propose a novel framework, Variational Multi-graph Auto-encoding Fusion, in which region-of-interest (ROI) demand modeling is enhanced through graph representation learning, aiming to effectively predict ROI demand at both the POI level and the category level. Specifically, we first divide the urban area into spatially differentiated neighborhood regions, extract the corresponding multi-dimensional attributes, and then generate the Spatial-Attributed Region Graph (SARG). After that, we introduce an unsupervised multi-graph-based variational auto-encoder to map regional profiles of the SARG into a latent space, and further retrieve dynamic latent representations through probabilistic sampling and global fusion. Additionally, during training, a spatio-temporally constrained Bayesian algorithm is adopted to infer destination POIs. Finally, extensive experiments conducted on a real-world dataset demonstrate that our model significantly outperforms state-of-the-art baselines.
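A minimal sketch of the first step, building a spatial-attributed region graph from gridded neighborhood regions. The grid layout, toy attribute vectors, and distance-based linking rule are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3x3 grid of neighborhood regions with toy attribute vectors
# (e.g., land-use shares or POI counts in a real pipeline).
coords = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
attrs = rng.random((len(coords), 4))

def build_sarg(coords, attrs, radius=1.0):
    """Adjacency links regions within `radius`; nodes carry attribute vectors."""
    n = len(coords)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(coords[i] - coords[j]) <= radius:
                A[i, j] = A[j, i] = 1.0
    return A, attrs

A, X = build_sarg(coords, attrs)
```

The resulting `(A, X)` pair is the kind of input a multi-graph variational auto-encoder would then map into latent space.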
RandomNet: clustering time series using untrained deep neural networks
Pub Date : 2024-06-22 | DOI: 10.1007/s10618-024-01048-5
Xiaosheng Li, Wenjie Xi, Jessica Lin
Neural networks are widely used in machine learning and data mining. Typically, these networks need to be trained, i.e., the weights (parameters) within the network are adjusted based on the input data. In this work, we propose a novel approach, RandomNet, that employs untrained deep neural networks to cluster time series. RandomNet uses different sets of random weights to extract diverse representations of time series and then ensembles the clustering relationships derived from these representations to build the final clustering result. By extracting diverse representations, the model can effectively handle time series with different characteristics. Since all parameters are randomly generated, no training is required. We provide a theoretical analysis of the effectiveness of the method. To validate its performance, we conduct extensive experiments on all 128 datasets in the well-known UCR time series archive and perform statistical analysis of the results. These datasets vary in size and sequence length and come from diverse fields. The experimental results show that the proposed method is competitive with existing state-of-the-art methods.
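The pipeline can be caricatured in numpy: random linear layers stand in for untrained networks, and a co-association matrix over per-view clusterings forms the ensemble. The tanh projections, inline k-means, and view count are my simplifications of the idea, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20, seed=0):
    # Minimal Lloyd's k-means, enough for this sketch.
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Two groups of toy series: low-frequency vs high-frequency sine waves.
t = np.linspace(0, 2 * np.pi, 64)
series = np.array([np.sin(f * t) + 0.05 * rng.normal(size=64)
                   for f in [1, 1, 1, 8, 8, 8]])

# Untrained "networks": random linear layers + tanh give diverse views;
# averaging per-view co-assignments gives the ensemble.
n_views, co = 10, np.zeros((6, 6))
for v in range(n_views):
    W = np.random.default_rng(v).normal(size=(64, 16))
    Z = np.tanh(series @ W)          # representation from random weights
    labels = kmeans(Z, 2, seed=v)
    co += labels[:, None] == labels[None, :]
co /= n_views
final = kmeans(co, 2, seed=0)        # cluster the co-association rows
```

No weight is ever trained; only the ensemble step reconciles the random views.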
Robust explainer recommendation for time series classification
Pub Date : 2024-06-20 | DOI: 10.1007/s10618-024-01045-8
Thu Trang Nguyen, Thach Le Nguyen, Georgiana Ifrim
Time series classification deals with temporal sequences, a prevalent data type in domains such as human activity recognition, sports analytics and general sensing. In this area, interest in explainability has been growing, as explanation is key to understanding the data and the model better. Recently, a great variety of techniques (e.g., LIME, SHAP, CAM) have been proposed and adapted for time series to provide explanations in the form of saliency maps, where the importance of each data point in the time series is quantified with a numerical value. However, the saliency maps can and often do disagree, so it is unclear which one to use. This paper provides a novel framework to quantitatively evaluate and rank explanation methods for time series classification. We show how to robustly evaluate the informativeness of a given explanation method (i.e., its relevance for the classification task), and how to compare explanations side-by-side. The goal is to recommend the best explainer for a given time series classification dataset. We propose AMEE, a Model-Agnostic Explanation Evaluation framework, for recommending saliency-based explanations for time series classification. In this approach, data perturbation is added to the input time series guided by each explanation. Our results show that perturbing discriminative parts of the time series leads to significant changes in classification accuracy, which can be used to evaluate each explanation. To be robust to different types of perturbations and classifiers, we aggregate the accuracy loss across perturbations and classifiers. This approach allows us to recommend the best explainer among a set of different explainers, including random and oracle explainers. We provide a quantitative and qualitative analysis for synthetic datasets, a variety of time-series datasets, and a real-world case study with known expert ground truth.
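The evaluation principle, perturb where an explainer says the signal is and watch accuracy fall, can be sketched with a 1-NN classifier on synthetic data. The zero-out perturbation and all names here are illustrative; AMEE itself aggregates over multiple perturbation types and classifiers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: the classes differ only in the window t = 20..29.
def make_data(n):
    X = rng.normal(0.0, 0.2, size=(n, 50))
    y = rng.integers(0, 2, size=n)
    X[y == 1, 20:30] += 3.0
    return X, y

Xtr, ytr = make_data(40)
Xte, yte = make_data(40)

def one_nn_acc(Xtr, ytr, Xte, yte):
    # 1-nearest-neighbour accuracy as a stand-in classifier.
    d = ((Xte[:, None] - Xtr[None]) ** 2).sum(-1)
    return float((ytr[d.argmin(axis=1)] == yte).mean())

def acc_after_perturbation(saliency, frac=0.2):
    # Zero out the top-`frac` most salient timesteps, then re-evaluate.
    k = int(frac * Xte.shape[1])
    idx = np.argsort(saliency)[-k:]
    Xp = Xte.copy()
    Xp[:, idx] = 0.0
    return one_nn_acc(Xtr, ytr, Xp, yte)

# An informative explainer marks the true window; a random one mostly misses it.
informative = np.zeros(50)
informative[20:30] = 1.0
random_saliency = rng.random(50)
acc_informative = acc_after_perturbation(informative)
acc_random = acc_after_perturbation(random_saliency)
```

Perturbing where the informative explainer points destroys accuracy, while perturbing at random barely hurts it; ranking explainers by that accuracy loss is the core of the framework.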
Series2Vec: similarity-based self-supervised representation learning for time series classification
Pub Date : 2024-06-20 | DOI: 10.1007/s10618-024-01043-w
Navid Mohammadi Foumani, Chang Wei Tan, Geoffrey I. Webb, Hamid Rezatofighi, Mahsa Salehi
We argue that time series analysis is fundamentally different from vision and natural language processing with respect to the forms of meaningful self-supervised learning tasks that can be defined. Motivated by this insight, we introduce a novel approach called Series2Vec for self-supervised representation learning. Unlike state-of-the-art methods in time series, which rely on hand-crafted data augmentation, Series2Vec is trained to predict the similarity between two series in both the temporal and spectral domains through a self-supervised task. By leveraging the similarity prediction task, which has inherent meaning for a wide range of time series analysis tasks, Series2Vec eliminates the need for hand-crafted data augmentation. To further encourage the network to learn similar representations for similar time series, we propose a novel approach that applies order-invariant attention to each representation within the batch during training. Our evaluation of Series2Vec on nine large real-world datasets, along with the UCR/UEA archive, shows enhanced performance compared to current state-of-the-art self-supervised techniques for time series. Additionally, our extensive experiments show that Series2Vec performs comparably to fully supervised training and offers high efficiency on datasets with limited labeled data. Finally, we show that fusing Series2Vec with other representation learning models leads to enhanced performance for time series classification. Code and models are open-source at https://github.com/Navidfoumani/Series2Vec
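The self-supervised targets, pairwise similarity in the temporal and spectral domains, can be computed directly from a batch; training an encoder to predict them is then the learning task. The negative-squared-distance form below is my simplification of the paper's similarity measure:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))   # a batch of toy series

def similarity_targets(X):
    """Pairwise self-supervised targets in the temporal and spectral domains."""
    def neg_sq_dist(Z):
        return -((Z[:, None] - Z[None]) ** 2).sum(-1)
    spectral = np.abs(np.fft.rfft(X, axis=1))   # magnitude spectrum per series
    return neg_sq_dist(X), neg_sq_dist(spectral)

S_time, S_spec = similarity_targets(X)
```

Since the targets come from the data itself, no hand-crafted augmentation is needed to create training signal.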
GeoRF: a geospatial random forest
Pub Date : 2024-06-19 | DOI: 10.1007/s10618-024-01046-7
Margot Geerts, Seppe vanden Broucke, Jochen De Weerdt
The geospatial domain increasingly relies on data-driven methodologies to extract actionable insights from the growing volume of available data. Despite the effectiveness of tree-based models in capturing complex relationships between features and targets, they fall short when it comes to spatial factors. This limitation arises from their reliance on univariate, axis-parallel splits, which result in rectangular areas on a map. To address this issue and enhance both performance and interpretability, we propose a solution that introduces two novel bivariate splits: an oblique split and a Gaussian split designed specifically for geographic coordinates. Our innovation, called Geospatial Random Forest (geoRF), builds upon Geospatial Regression Trees (GeoTrees) to effectively incorporate geographic features and extract maximal spatial insight. Through an extensive benchmark, we show that geoRF outperforms traditional spatial statistical models, other spatial random forest variants, and machine learning and deep learning methods across a range of geospatial tasks. Furthermore, we contextualize our method's computational time complexity relative to baseline approaches. Our prediction maps illustrate that geoRF produces more robust and intuitive decision boundaries than conventional tree-based models. Using impurity-based feature importance measures, we validate geoRF's effectiveness in highlighting the significance of geographic coordinates, especially on datasets exhibiting pronounced spatial patterns.
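Why a bivariate oblique split helps is easy to demonstrate: on a diagonal boundary, one oblique cut achieves what many axis-parallel cuts can only approximate. A toy sum-of-squared-errors comparison (my construction, not the paper's benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression target that changes across the diagonal lat + lon = 1.
pts = rng.random((200, 2))                       # (lat, lon) in the unit square
y = np.where(pts.sum(axis=1) > 1.0, 5.0, 0.0) + rng.normal(0, 0.1, size=200)

def split_sse(mask, y):
    # Sum of squared errors after splitting the node into mask / ~mask.
    sse = 0.0
    for side in (mask, ~mask):
        if side.any():
            sse += ((y[side] - y[side].mean()) ** 2).sum()
    return sse

# Best univariate, axis-parallel split on either coordinate...
axis_best = min(split_sse(pts[:, f] <= t, y)
                for f in (0, 1) for t in np.linspace(0.05, 0.95, 19))
# ...versus a single bivariate oblique split along the true boundary.
oblique_sse = split_sse(pts @ np.ones(2) <= 1.0, y)
```

A single oblique split isolates the diagonal pattern almost perfectly, while every axis-parallel threshold leaves mixed rectangles, which is the motivation for geoRF's coordinate-aware splits.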
Modelling event sequence data by type-wise neural point process
Pub Date : 2024-06-17 | DOI: 10.1007/s10618-024-01047-6
Bingqing Liu
Event sequence data is ubiquitous in real life, where each event is typically represented as a tuple of event type and occurrence time. Recently, the neural point process (NPP), a probabilistic model that learns the distribution of the next event given the event history, has gained much attention for event sequence modelling. Existing NPP models use a single vector to encode the entire event history. However, each type of event has its own historical events of concern, which calls for a type-specific encoding of the history. To this end, we propose the Type-wise Neural Point Process (TNPP), in which each event type has a history vector that encodes the historical events of its own interest. Type-wise encoding further enables type-wise decoding, which together yield a more effective neural point process. Experimental results on six datasets show that TNPP outperforms existing models on the event type prediction task under both extrapolation and interpolation settings. Moreover, results on scalability and interpretability show that TNPP scales well to datasets with many event types and can provide high-quality event dependencies for interpretation. The code and data can be found at https://github.com/lbq8942/TNPP.
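One way to picture type-wise encoding: each event type keeps its own decayed history vector and weights every incoming event by how relevant that event's type is to it. The exponential decay and the random relevance weights are illustrative assumptions, not TNPP's actual parameterisation:

```python
import numpy as np

def typewise_histories(events, n_types, dim=4, decay=0.5, seed=0):
    """One history vector per event type, each attending to events its own way."""
    rng = np.random.default_rng(seed)
    type_emb = rng.normal(size=(n_types, dim))    # an embedding per event type
    relevance = rng.random((n_types, n_types))    # how much type k attends to type j
    H = np.zeros((n_types, dim))                  # one history vector per type
    prev_t = 0.0
    for t, j in events:                           # (time, type) pairs, time-ordered
        H *= np.exp(-decay * (t - prev_t))        # time-decay every history
        H += relevance[:, j, None] * type_emb[j]  # each type weights the new event
        prev_t = t
    return H

events = [(0.5, 0), (1.0, 1), (2.0, 0)]
H = typewise_histories(events, n_types=3)
```

Each row of `H` is then the history encoding a type-wise decoder would consume, instead of one shared vector for all types.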
The impact of variable ordering on Bayesian network structure learning
Pub Date : 2024-06-08 | DOI: 10.1007/s10618-024-01044-9
Neville K. Kitson, Anthony C. Constantinou
Causal Bayesian Networks (CBNs) provide an important tool for reasoning under uncertainty, with potential application to many complex causal systems. Structure learning algorithms that can tell us something about the causal structure of these systems are becoming increasingly important. In the literature, the validity of these algorithms is often tested for sensitivity to varying sample sizes, hyper-parameters, and occasionally objective functions, but the effect of the order in which the variables are read from data is rarely quantified. We show that many commonly used algorithms, both established and state-of-the-art, are more sensitive to variable ordering than to these other factors when learning CBNs from discrete variables. This effect is strongest in hill-climbing and its variants, where we explain how it arises, but it extends to hybrid and, to a lesser extent, constraint-based algorithms. Because the variable ordering is arbitrary, any significant effect it has on learnt graph accuracy is concerning, and it raises questions about the validity of both older and more recent results produced by these algorithms in practical applications, as well as their rankings in performance evaluations.
{"title":"The impact of variable ordering on Bayesian network structure learning","authors":"Neville K. Kitson, Anthony C. Constantinou","doi":"10.1007/s10618-024-01044-9","DOIUrl":"https://doi.org/10.1007/s10618-024-01044-9","url":null,"abstract":"<p>Causal Bayesian Networks (CBNs) provide an important tool for reasoning under uncertainty with potential application to many complex causal systems. Structure learning algorithms that can tell us something about the causal structure of these systems are becoming increasingly important. In the literature, the validity of these algorithms is often tested for sensitivity over varying sample sizes, hyper-parameters, and occasionally objective functions, but the effect of the order in which the variables are read from data is rarely quantified. We show that many commonly-used algorithms, both established and state-of-the-art, are more sensitive to variable ordering than these other factors when learning CBNs from discrete variables. This effect is strongest in hill-climbing and its variants where we explain how it arises, but extends to hybrid, and to a lesser-extent, constraint-based algorithms. 
Because the variable ordering is arbitrary, any significant effect it has on learnt graph accuracy is concerning, and raises questions about the validity of both many older and more recent results produced by these algorithms in practical applications and their rankings in performance evaluations.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"44 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141503031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
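The ordering effect described in the abstract is easy to reproduce in miniature. The sketch below is not any of the paper's benchmarked learners; it is a hypothetical first-improvement hill climber over single-edge additions, where candidate edges are scanned in the order induced by the variable list, so ties between equally scoring edges are broken by that arbitrary order:

```python
def greedy_edges(variables, score):
    """First-improvement hill climbing over single-edge additions
    (a toy, not a full BN structure learner). Candidates are
    scanned in the order induced by `variables`, so score ties are
    resolved by variable order.
    """
    edges = set()
    improved = True
    while improved:
        improved = False
        for a in variables:
            for b in variables:
                if a != b and (a, b) not in edges and (b, a) not in edges:
                    if score(edges | {(a, b)}) > score(edges):
                        edges.add((a, b))
                        improved = True
    return edges

# A score indifferent to edge direction (as many scores are within
# an equivalence class): it rewards having one edge, either way.
score = lambda es: min(len(es), 1)

g1 = greedy_edges(["X", "Y"], score)  # reads variables as X, Y
g2 = greedy_edges(["Y", "X"], score)  # same data order reversed
```

Here `g1` and `g2` contain the same undirected skeleton but differently oriented edges, purely because of the read order — the kind of arbitrary effect the paper quantifies at scale.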
Pub Date : 2024-06-04DOI: 10.1007/s10618-024-01042-x
Jinping Hu, Evert de Haan, Bernd Skiera
Uplift modeling, also referred to as heterogeneous treatment effect estimation, is a machine learning technique utilized in marketing for estimating the incremental impact of a treatment on the response of each customer. Uplift models face a fundamental challenge in causal inference because the variable of interest (i.e., the uplift itself) remains unobservable. As a result, popular uplift models (such as meta-learners and uplift trees) do not incorporate loss functions for uplifts in their algorithms. This article addresses that gap by proposing uplift models with quasi-loss functions (UpliftQL models), each of which uses one of four specially designed quasi-loss functions for uplift estimation. Using simulated data, our analysis reveals that, on average, 55% (34%) of the top five models from a set of 14 are UpliftQL models for binary (continuous) outcomes. Further empirical data analysis shows that over 60% of the top-performing models are consistently UpliftQL models.
{"title":"Uplift modeling with quasi-loss-functions","authors":"Jinping Hu, Evert de Haan, Bernd Skiera","doi":"10.1007/s10618-024-01042-x","DOIUrl":"https://doi.org/10.1007/s10618-024-01042-x","url":null,"abstract":"<p>Uplift modeling, also referred to as heterogeneous treatment effect estimation, is a machine learning technique utilized in marketing for estimating the incremental impact of treatment on the response of each customer. Uplift models face a fundamental challenge in causal inference because the variable of interest (i.e., the uplift itself) remains unobservable. As a result, popular uplift models (such as meta-learners and uplift trees) do not incorporate loss functions for uplifts in their algorithms. This article addresses that gap by proposing uplift models with quasi-loss functions (UpliftQL models), which separately use four specially designed quasi-loss functions for uplift estimation in algorithms. Using simulated data, our analysis reveals that, on average, 55% (34%) of the top five models from a set of 14 are UpliftQL models for binary (continuous) outcomes. Further empirical data analysis shows that over 60% of the top-performing models are consistently UpliftQL models.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"23 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141258154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}