Adaptive Bernstein change detector for high-dimensional data streams
Pub Date: 2024-01-09 | DOI: 10.1007/s10618-023-00999-5
Marco Heyden, Edouard Fouché, Vadim Arzamasov, Tanja Fenn, Florian Kalinke, Klemens Böhm
Change detection is of fundamental importance when analyzing data streams. Detecting changes both quickly and accurately enables monitoring and prediction systems to react, e.g., by issuing an alarm or by updating a learning algorithm. However, detecting changes is challenging when observations are high-dimensional. In high-dimensional data, change detectors should not only be able to identify when changes happen, but also in which subspace they occur. Ideally, one should also quantify how severe they are. Our approach, ABCD, has these properties. ABCD learns an encoder-decoder model and monitors its accuracy over a window of adaptive size. ABCD derives a change score based on Bernstein’s inequality to detect deviations in terms of accuracy, which indicate changes. Our experiments demonstrate that ABCD outperforms its best competitor by up to 20% in F1-score on average. It can also accurately estimate changes’ subspace, together with a severity measure that correlates with the ground truth.
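Only the abstract is available here, but the core mechanism it describes (monitoring a reconstruction-quality statistic and scoring deviations with Bernstein's inequality) can be illustrated. The sketch below is a minimal, assumption-laden illustration rather than the authors' ABCD implementation: it assumes per-instance reconstruction errors in [0, 1], scans split points of a single window, and uses one standard form of Bernstein's inequality as the change score.

```python
import numpy as np

def bernstein_tail(eps, n, var, b=1.0):
    """Bernstein bound on P(|mean - E[mean]| >= eps) for n i.i.d. variables
    bounded by b with variance var (one standard form of the inequality)."""
    return 2.0 * np.exp(-n * eps ** 2 / (2.0 * var + 2.0 * b * eps / 3.0))

def change_score(errors, min_split=30):
    """Scan all split points of the current window and return the smallest
    probability that the observed gap between the two sub-windows arises by
    chance. Lower scores indicate stronger evidence for a change."""
    errors = np.asarray(errors, dtype=float)
    best = 1.0
    for k in range(min_split, len(errors) - min_split):
        left, right = errors[:k], errors[k:]
        gap = abs(left.mean() - right.mean())
        # Bound the deviation of each sub-window mean by half the observed gap.
        p = bernstein_tail(gap / 2.0, min(len(left), len(right)), errors.var())
        best = min(best, p)
    return best

# Toy usage: reconstruction error jumps after t = 500, mimicking a change
# in the subspace an encoder-decoder model has learned to reconstruct.
rng = np.random.default_rng(0)
window = np.concatenate([rng.normal(0.1, 0.02, 500), rng.normal(0.3, 0.02, 200)])
window = np.clip(window, 0.0, 1.0)
print("change score:", change_score(window))  # near 0 -> report a change
```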
{"title":"Adaptive Bernstein change detector for high-dimensional data streams","authors":"Marco Heyden, Edouard Fouché, Vadim Arzamasov, Tanja Fenn, Florian Kalinke, Klemens Böhm","doi":"10.1007/s10618-023-00999-5","DOIUrl":"https://doi.org/10.1007/s10618-023-00999-5","url":null,"abstract":"<p>Change detection is of fundamental importance when analyzing data streams. Detecting changes both quickly and accurately enables monitoring and prediction systems to react, e.g., by issuing an alarm or by updating a learning algorithm. However, detecting changes is challenging when observations are high-dimensional. In high-dimensional data, change detectors should not only be able to identify when changes happen, but also in which subspace they occur. Ideally, one should also quantify how severe they are. Our approach, ABCD, has these properties. ABCD learns an encoder-decoder model and monitors its accuracy over a window of adaptive size. ABCD derives a change score based on Bernstein’s inequality to detect deviations in terms of accuracy, which indicate changes. Our experiments demonstrate that ABCD outperforms its best competitor by up to 20% in F1-score on average. It can also accurately estimate changes’ subspace, together with a severity measure that correlates with the ground truth.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"54 ","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139411891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When graph convolution meets double attention: online privacy disclosure detection with multi-label text classification
Pub Date: 2024-01-05 | DOI: 10.1007/s10618-023-00992-y
Zhanbo Liang, Jie Guo, Weidong Qiu, Zheng Huang, Shujun Li
With the rise of Web 2.0 platforms such as online social media, people’s private information, such as their location, occupation and even family information, is often inadvertently disclosed through online discussions. Therefore, it is important to detect such unwanted privacy disclosures to help alert the people affected and the online platform. In this paper, privacy disclosure detection is modeled as a multi-label text classification (MLTC) problem, and a new privacy disclosure detection model is proposed to construct an MLTC classifier for detecting online privacy disclosures. This classifier takes an online post as the input and outputs multiple labels, each reflecting a possible privacy disclosure. The proposed model combines three different sources of information: the input text itself, the label-to-text correlation and the label-to-label correlation. A double-attention mechanism is used to combine the first two sources of information, and a graph convolutional network is employed to extract the third source of information, which is then used to help fuse features extracted from the first two sources. Our extensive experimental results, obtained on a public dataset of privacy-disclosing posts on Twitter, demonstrate that the proposed privacy disclosure detection method significantly and consistently outperforms other state-of-the-art methods on all key performance indicators.
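As a rough illustration of how the three information sources might be wired together, the following PyTorch sketch combines token features, label-to-text attention, and one graph-convolution step over a label co-occurrence matrix. All layer sizes, the pooling choices, and the class name PrivacyMLTC are hypothetical; this is not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrivacyMLTC(nn.Module):
    """Toy multi-label classifier combining (1) the post text, (2) label-to-text
    attention, and (3) label-to-label correlations via one graph-convolution step."""

    def __init__(self, vocab_size, num_labels, dim=64, label_adj=None):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.label_emb = nn.Parameter(torch.randn(num_labels, dim) * 0.02)
        self.gcn = nn.Linear(dim, dim)            # one GCN layer: A_hat @ L @ W
        self.out = nn.Linear(2 * dim, 1)
        # Row-normalised label co-occurrence matrix (identity if none given).
        adj = torch.eye(num_labels) if label_adj is None else label_adj
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True))

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        h = self.word_emb(tokens)                 # (B, T, D) token features
        # Label-to-text attention: each label attends to the tokens.
        scores = torch.einsum("ld,btd->blt", self.label_emb, h)
        attn = F.softmax(scores, dim=-1)
        label_ctx = torch.einsum("blt,btd->bld", attn, h)        # (B, L, D)
        # Label-to-label correlation via graph convolution on label embeddings.
        label_rel = F.relu(self.gcn(self.adj @ self.label_emb))  # (L, D)
        label_rel = label_rel.unsqueeze(0).expand_as(label_ctx)
        logits = self.out(torch.cat([label_ctx, label_rel], dim=-1)).squeeze(-1)
        return logits                             # (B, L): one score per label

# Toy usage with random data.
model = PrivacyMLTC(vocab_size=1000, num_labels=5)
tokens = torch.randint(1, 1000, (8, 20))
print(model(tokens).shape)  # torch.Size([8, 5])
```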
{"title":"When graph convolution meets double attention: online privacy disclosure detection with multi-label text classification","authors":"Zhanbo Liang, Jie Guo, Weidong Qiu, Zheng Huang, Shujun Li","doi":"10.1007/s10618-023-00992-y","DOIUrl":"https://doi.org/10.1007/s10618-023-00992-y","url":null,"abstract":"<p>With the rise of Web 2.0 platforms such as online social media, people’s private information, such as their location, occupation and even family information, is often inadvertently disclosed through online discussions. Therefore, it is important to detect such unwanted privacy disclosures to help alert people affected and the online platform. In this paper, privacy disclosure detection is modeled as a multi-label text classification (MLTC) problem, and a new privacy disclosure detection model is proposed to construct an MLTC classifier for detecting online privacy disclosures. This classifier takes an online post as the input and outputs multiple labels, each reflecting a possible privacy disclosure. The proposed presentation method combines three different sources of information, the input text itself, the label-to-text correlation and the label-to-label correlation. A double-attention mechanism is used to combine the first two sources of information, and a graph convolutional network is employed to extract the third source of information that is then used to help fuse features extracted from the first two sources of information. Our extensive experimental results, obtained on a public dataset of privacy-disclosing posts on Twitter, demonstrated that our proposed privacy disclosure detection method significantly and consistently outperformed other state-of-the-art methods in terms of all key performance indicators.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"80 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139376729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CompTrails: comparing hypotheses across behavioral networks
Pub Date: 2024-01-03 | DOI: 10.1007/s10618-023-00996-8
Tobias Koopmann, Martin Becker, Florian Lemmerich, Andreas Hotho
The term Behavioral Networks describes networks that contain relational information on human behavior. This ranges from social networks that capture friendships or cooperation between individuals, to navigational networks that capture geographical or web navigation, among others. Understanding the forces driving behavior within these networks can help improve the underlying network, for example, by generating new hyperlinks on websites, or by proposing new connections and friends on social networks. Previous approaches considered different hypotheses on a single network and evaluated which hypothesis fits best. These hypotheses can represent human intuition and expert opinions or be based on previous insights. In this work, we extend these approaches to enable the comparison of a single hypothesis between multiple networks. We unveil several issues of naive approaches that potentially impact comparisons and lead to undesired results. Based on these findings, we propose a framework with five flexible components that allow addressing specific analysis goals tailored to the application scenario. We show the benefits and limits of our approach by applying it to synthetic data and several real-world datasets, including web navigation, bibliometric navigation, and geographic navigation. Our work supports practitioners and researchers aiming to understand similarities and differences in human behavior between environments.
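One common baseline for this kind of analysis is HypTrails-style Bayesian hypothesis comparison, which scores a hypothesis, expressed as a transition matrix, by using it as Dirichlet pseudo-counts against observed transition counts. The sketch below shows that naive per-network scoring, which the paper argues needs care when compared across networks; the concentration parameter k, the toy counts, and the self-loop hypothesis are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(counts, hypothesis, k=5.0):
    """Dirichlet-categorical log marginal likelihood of observed transition
    counts under a row-stochastic hypothesis matrix used as Dirichlet
    pseudo-counts scaled by concentration k (HypTrails-style scoring)."""
    alpha = k * hypothesis
    post = counts + alpha
    row_term = gammaln(alpha.sum(axis=1)) - gammaln(post.sum(axis=1))
    cell_term = (gammaln(post) - gammaln(alpha)).sum(axis=1)
    return float((row_term + cell_term).sum())

# Two toy "networks": observed transition counts between 3 states each.
net_a = np.array([[5, 1, 0], [0, 6, 2], [1, 0, 7]], dtype=float)
net_b = np.array([[2, 2, 2], [3, 2, 3], [2, 3, 2]], dtype=float)
# Hypothesis: users tend to stay where they are (smoothed self-loop behavior).
hyp = np.eye(3) + 0.1
hyp = hyp / hyp.sum(axis=1, keepdims=True)
for name, counts in [("A", net_a), ("B", net_b)]:
    print(name, round(log_evidence(counts, hyp), 2))
```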
{"title":"CompTrails: comparing hypotheses across behavioral networks","authors":"Tobias Koopmann, Martin Becker, Florian Lemmerich, Andreas Hotho","doi":"10.1007/s10618-023-00996-8","DOIUrl":"https://doi.org/10.1007/s10618-023-00996-8","url":null,"abstract":"<p>The term <i>Behavioral Networks</i> describes networks that contain relational information on human behavior. This ranges from social networks that contain friendships or cooperations between individuals, to navigational networks that contain geographical or web navigation, and many more. Understanding the forces driving behavior within these networks can be beneficial to improving the underlying network, for example, by generating new hyperlinks on websites, or by proposing new connections and friends on social networks. Previous approaches considered different hypotheses on a single network and evaluated which hypothesis fits best. These hypotheses can represent human intuition and expert opinions or be based on previous insights. In this work, we extend these approaches to enable the comparison of a single hypothesis between multiple networks. We unveil several issues of naive approaches that potentially impact comparisons and lead to undesired results. Based on these findings, we propose a framework with five flexible components that allow addressing specific analysis goals tailored to the application scenario. We show the benefits and limits of our approach by applying it to synthetic data and several real-world datasets, including web navigation, bibliometric navigation, and geographic navigation. Our work supports practitioners and researchers with the aim of understanding similarities and differences in human behavior between environments.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"28 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139095516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective signal reconstruction from multiple ranked lists via convex optimization
Pub Date: 2024-01-02 | DOI: 10.1007/s10618-023-00991-z
The ranking of objects is widely used to rate their relative quality or relevance across multiple assessments. Beyond classical rank aggregation, it is of interest to estimate the usually unobservable latent signals that inform a consensus ranking. Under the only assumption of independent assessments, which can be incomplete, we introduce indirect inference via convex optimization in combination with computationally efficient Poisson Bootstrap. Two different objective functions are suggested, one linear and the other quadratic. The mathematical formulation of the signal estimation problem is based on pairwise comparisons of all objects with respect to their rank positions. Sets of constraints represent the order relations. The transitivity property of rank scales allows us to reduce substantially the number of constraints associated with the full set of object comparisons. The key idea is to globally reduce the errors induced by the rankers until optimal latent signals can be obtained. Its main advantage is low computational costs, even when handling n ≪ p data problems. Exploratory tools can be developed based on the bootstrap signal estimates and standard errors. Simulation evidence, a comparison with the state-of-the-art rank centrality method, and two applications, one in higher education evaluation and the other in molecular cancer research, are presented.
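A minimal sketch of the linear-objective variant, using cvxpy with explicit slack variables on every pairwise order constraint. The margin of 1, the zero-sum location constraint, and the all-pairs constraint set are assumptions; the paper additionally prunes constraints via transitivity and pairs the estimator with Poisson Bootstrap, which here would simply replace the unit constraint weights with Poisson(1) draws per ranker.

```python
import numpy as np
import cvxpy as cp

def estimate_signals(rankings, n_objects, weights=None):
    """Estimate latent signals from ranked lists by softly enforcing every
    pairwise order relation with slack variables and a linear objective."""
    weights = weights or [1.0] * len(rankings)
    pairs, w = [], []
    for ranking, wt in zip(rankings, weights):
        for a in range(len(ranking)):
            for b in range(a + 1, len(ranking)):
                pairs.append((ranking[a], ranking[b]))  # object a outranks b
                w.append(wt)
    s = cp.Variable(n_objects)
    slack = cp.Variable(len(pairs), nonneg=True)
    cons = [s[i] - s[j] >= 1 - slack[k] for k, (i, j) in enumerate(pairs)]
    cons += [cp.sum(s) == 0]                            # fix the scale's location
    prob = cp.Problem(cp.Minimize(cp.sum(cp.multiply(np.array(w), slack))), cons)
    prob.solve()
    return s.value

# Three (possibly incomplete) rankings of 4 objects; bootstrap weights would
# be resampled repeatedly to obtain signal estimates and standard errors.
rankings = [[0, 1, 2, 3], [0, 2, 1], [1, 0, 3]]
print(np.round(estimate_signals(rankings, 4), 2))
```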
{"title":"Effective signal reconstruction from multiple ranked lists via convex optimization","authors":"","doi":"10.1007/s10618-023-00991-z","DOIUrl":"https://doi.org/10.1007/s10618-023-00991-z","url":null,"abstract":"<h3>Abstract</h3> <p>The ranking of objects is widely used to rate their relative quality or relevance across multiple assessments. Beyond classical rank aggregation, it is of interest to estimate the usually unobservable latent signals that inform a consensus ranking. Under the only assumption of independent assessments, which can be incomplete, we introduce indirect inference via convex optimization in combination with computationally efficient Poisson Bootstrap. Two different objective functions are suggested, one linear and the other quadratic. The mathematical formulation of the signal estimation problem is based on pairwise comparisons of all objects with respect to their rank positions. Sets of constraints represent the order relations. The transitivity property of rank scales allows us to reduce substantially the number of constraints associated with the full set of object comparisons. The key idea is to globally reduce the errors induced by the rankers until optimal latent signals can be obtained. Its main advantage is low computational costs, even when handling <span> <span>(n < < p)</span> </span> data problems. Exploratory tools can be developed based on the bootstrap signal estimates and standard errors. Simulation evidence, a comparison with the state-of-the-art rank centrality method, and two applications, one in higher education evaluation and the other in molecular cancer research, are presented.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"52 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139083082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Correction: A semi-supervised interactive algorithm for change point detection
Pub Date: 2024-01-02 | DOI: 10.1007/s10618-023-01000-z
Zhenxiang Cao, N. Seeuws, Maarten Vos, Alexander Bertrand
{"title":"Correction: A semi‑supervised interactive algorithm for change point detection","authors":"Zhenxiang Cao, N. Seeuws, Maarten Vos, Alexander Bertrand","doi":"10.1007/s10618-023-01000-z","DOIUrl":"https://doi.org/10.1007/s10618-023-01000-z","url":null,"abstract":"","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"54 20","pages":"1"},"PeriodicalIF":4.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139390140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting consumer choice from raw eye-movement data using the RETINA deep learning architecture
Pub Date: 2023-12-29 | DOI: 10.1007/s10618-023-00989-7
Moshe Unger, Michel Wedel, Alexander Tuzhilin
We propose the use of a deep learning architecture, called RETINA, to predict multi-alternative, multi-attribute consumer choice from eye movement data. RETINA directly uses the complete time series of raw eye-tracking data from both eyes as input to state-of-the-art Transformer and metric learning deep learning methods. Using the raw data as input eliminates the information loss that may result from first calculating fixations, deriving metrics from the fixation data and analysing those metrics, as has often been done in eye movement research, and allows us to apply deep learning to eye tracking data sets of the size commonly encountered in academic and applied research. Using a data set with 112 respondents who made choices among four laptops, we show that the proposed architecture outperforms other state-of-the-art machine learning methods (standard BERT, LSTM, AutoML, logistic regression) calibrated on raw data or fixation data. The analysis of partial time and partial data segments reveals the ability of RETINA to predict choice outcomes well before participants reach a decision. Specifically, we find that using a mere 5 seconds of data, the RETINA architecture achieves a predictive validation accuracy of over 0.7. We provide an assessment of which features of the eye movement data contribute to RETINA’s prediction accuracy. We make recommendations on how the proposed deep learning architecture can be used as a basis for future academic research, in particular its application to eye movements collected from front-facing video cameras.
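To make the input and output shapes concrete, here is a toy PyTorch Transformer encoder over a raw binocular gaze time series. The channel layout (x, y per eye plus pupil size per eye), layer sizes, and mean-pooling head are assumptions and do not reproduce the RETINA architecture.

```python
import torch
import torch.nn as nn

class GazeChoiceModel(nn.Module):
    """Toy Transformer encoder mapping a raw gaze time series to choice logits
    over the alternatives shown in a trial."""

    def __init__(self, n_channels=6, n_alternatives=4, dim=64, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_channels, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, n_alternatives)

    def forward(self, gaze):                   # gaze: (batch, time, channels)
        h = self.encoder(self.proj(gaze))      # contextualised time steps
        return self.head(h.mean(dim=1))        # pool over time -> choice logits

# Roughly 5 seconds of 60 Hz eye tracking for a batch of 8 trials.
model = GazeChoiceModel()
logits = model(torch.randn(8, 300, 6))
print(logits.shape)  # torch.Size([8, 4])
```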
{"title":"Predicting consumer choice from raw eye-movement data using the RETINA deep learning architecture","authors":"Moshe Unger, Michel Wedel, Alexander Tuzhilin","doi":"10.1007/s10618-023-00989-7","DOIUrl":"https://doi.org/10.1007/s10618-023-00989-7","url":null,"abstract":"<p>We propose the use of a deep learning architecture, called RETINA, to predict multi-alternative, multi-attribute consumer choice from eye movement data. RETINA directly uses the complete time series of raw eye-tracking data from both eyes as input to state-of-the art Transformer and Metric Learning Deep Learning methods. Using the raw data input eliminates the information loss that may result from first calculating fixations, deriving metrics from the fixations data and analysing those metrics, as has been often done in eye movement research, and allows us to apply Deep Learning to eye tracking data sets of the size commonly encountered in academic and applied research. Using a data set with 112 respondents who made choices among four laptops, we show that the proposed architecture outperforms other state-of-the-art machine learning methods (standard BERT, LSTM, AutoML, logistic regression) calibrated on raw data or fixation data. The analysis of partial time and partial data segments reveals the ability of RETINA to predict choice outcomes well before participants reach a decision. Specifically, we find that using a mere 5 s of data, the RETINA architecture achieves a predictive validation accuracy of over 0.7. We provide an assessment of which features of the eye movement data contribute to RETINA’s prediction accuracy. We make recommendations on how the proposed deep learning architecture can be used as a basis for future academic research, in particular its application to eye movements collected from front-facing video cameras.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"29 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2023-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139064560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Session-based recommendation by exploiting substitutable and complementary relationships from multi-behavior data
Pub Date: 2023-12-26 | DOI: 10.1007/s10618-023-00994-w
Huizi Wu, Cong Geng, Hui Fang
Session-based recommendation (SR) aims to dynamically recommend items to a user based on a sequence of the most recent user-item interactions. Most existing studies on SR adopt advanced deep learning methods. However, the majority consider only a single behavior type (e.g., click), while the few that consider multi-typed behaviors fail to take full advantage of the relationships between products (items). To address this, the paper proposes a novel approach, called Substitutable and Complementary Relationships from Multi-behavior Data (denoted as SCRM), to better exploit the relationships between products for effective recommendation. Specifically, we first construct substitutable and complementary graphs based on a user’s sequential behaviors in every session by jointly considering ‘click’ and ‘purchase’ behaviors. We then design a denoising network to remove false relationships, and further impose constraints on the two relationships via a specially designed loss function. Extensive experiments on two e-commerce datasets demonstrate the superiority of our model over state-of-the-art methods, and the effectiveness of every component in SCRM.
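As a purely illustrative example of deriving two relation graphs from multi-behavior session logs, the heuristic below treats items co-clicked in a session (and not bought together) as candidate substitutes, and click-then-purchase pairs as candidate complements. SCRM's actual graph construction, denoising network, and loss terms are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

def build_relation_graphs(sessions):
    """Build toy substitutable and complementary item graphs from sessions,
    where each session is a list of (item, behavior) events."""
    substitutable = defaultdict(float)
    complementary = defaultdict(float)
    for events in sessions:
        clicked = {i for i, b in events if b == "click"}
        bought = {i for i, b in events if b == "purchase"}
        for a, b in combinations(clicked, 2):
            if not (a in bought and b in bought):       # viewed as alternatives
                substitutable[frozenset((a, b))] += 1.0
        for a in clicked:
            for b in bought:
                if a != b:                              # viewed, then bought another
                    complementary[(a, b)] += 1.0
    return dict(substitutable), dict(complementary)

sessions = [
    [("tv_A", "click"), ("tv_B", "click"), ("tv_B", "purchase"),
     ("hdmi_cable", "click"), ("hdmi_cable", "purchase")],
    [("tv_A", "click"), ("tv_C", "click")],
]
subs, comps = build_relation_graphs(sessions)
print(subs)
print(comps)
```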
{"title":"Session-based recommendation by exploiting substitutable and complementary relationships from multi-behavior data","authors":"Huizi Wu, Cong Geng, Hui Fang","doi":"10.1007/s10618-023-00994-w","DOIUrl":"https://doi.org/10.1007/s10618-023-00994-w","url":null,"abstract":"<p>Session-based recommendation (SR) aims to dynamically recommend items to a user based on a sequence of the most recent user-item interactions. Most existing studies on SR adopt advanced deep learning methods. However, the majority only consider a special behavior type (e.g., click), while those few considering multi-typed behaviors ignore to take full advantage of the relationships between products (items). In this case, the paper proposes a novel approach, called Substitutable and Complementary Relationships from Multi-behavior Data (denoted as SCRM) to better explore the relationships between products for effective recommendation. Specifically, we firstly construct substitutable and complementary graphs based on a user’s sequential behaviors in every session by jointly considering ‘click’ and ‘purchase’ behaviors. We then design a denoising network to remove false relationships, and further consider constraints on the two relationships via a particularly designed loss function. Extensive experiments on two e-commerce datasets demonstrate the superiority of our model over state-of-the-art methods, and the effectiveness of every component in SCRM.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"37 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2023-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139057144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random walk with restart on hypergraphs: fast computation and an application to anomaly detection
Pub Date: 2023-12-21 | DOI: 10.1007/s10618-023-00995-9
Jaewan Chun, Geon Lee, Kijung Shin, Jinhong Jung
Random walk with restart (RWR) is a widely-used measure of node similarity in graphs, and it has proved useful for ranking, community detection, link prediction, anomaly detection, etc. Since RWR typically has to be computed separately for a large number of query nodes, or even for all nodes, fast computation of it is indispensable. However, for hypergraphs, the fast computation of RWR has been unexplored, despite its great potential. In this paper, we propose ARCHER, a fast computation framework for RWR on hypergraphs. Specifically, we first formally define RWR on hypergraphs, and then we propose two computation methods that compose ARCHER. Since the two methods are complementary (i.e., offering relative advantages on different hypergraphs), we also develop a method for automatic selection between them, which takes a very short time compared to the total running time. Through our extensive experiments on 18 real-world hypergraphs, we demonstrate (a) the speed and space efficiency of ARCHER, (b) the complementary nature of the two computation methods composing ARCHER, (c) the accuracy of its automatic selection method, and (d) its successful application to anomaly detection on hypergraphs.
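For intuition, here is a naive power-iteration RWR on a hypergraph using the common node-to-hyperedge-to-node walk. It shows what quantity is being computed, not how ARCHER computes it fast; the walk definition, dense matrices, and restart probability are assumptions.

```python
import numpy as np

def hypergraph_rwr(incidence, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """RWR scores for all nodes w.r.t. a seed node on a hypergraph given by a
    dense (num_nodes x num_edges) 0/1 incidence matrix, using the two-step walk
    node -> incident hyperedge -> member node."""
    H = np.asarray(incidence, dtype=float)
    n = H.shape[0]
    P_ve = H / H.sum(axis=1, keepdims=True)        # node -> hyperedge step
    P_ev = H.T / H.T.sum(axis=1, keepdims=True)    # hyperedge -> node step
    P = P_ve @ P_ev                                # one node-to-node transition
    q = np.zeros(n); q[seed] = 1.0                 # restart distribution
    r = q.copy()
    for _ in range(max_iter):
        r_new = (1 - restart) * (P.T @ r) + restart * q
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# Toy hypergraph: 5 nodes, 3 hyperedges {0,1,2}, {1,2,3}, {3,4}.
H = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
print(np.round(hypergraph_rwr(H, seed=0), 3))
```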
{"title":"Random walk with restart on hypergraphs: fast computation and an application to anomaly detection","authors":"Jaewan Chun, Geon Lee, Kijung Shin, Jinhong Jung","doi":"10.1007/s10618-023-00995-9","DOIUrl":"https://doi.org/10.1007/s10618-023-00995-9","url":null,"abstract":"<p>Random walk with restart (RWR) is a widely-used measure of node similarity in graphs, and it has proved useful for ranking, community detection, link prediction, anomaly detection, etc. Since RWR is typically required to be computed separately for a larger number of query nodes or even for all nodes, fast computation of it is indispensable. However, for hypergraphs, the fast computation of RWR has been unexplored, despite its great potential. In this paper, we propose <span>ARCHER</span>, a fast computation framework for RWR on hypergraphs. Specifically, we first formally define RWR on hypergraphs, and then we propose two computation methods that compose <span>ARCHER</span>. Since the two methods are complementary (i.e., offering relative advantages on different hypergraphs), we also develop a method for automatic selection between them, which takes a very short time compared to the total running time. Through our extensive experiments on 18 real-world hypergraphs, we demonstrate (a) the speed and space efficiency of <span>ARCHER</span>, (b) the complementary nature of the two computation methods composing <span>ARCHER</span>, (c) the accuracy of its automatic selection method, and (d) its successful application to anomaly detection on hypergraphs.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"69 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138823850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving hyper-parameter self-tuning for data streams by adapting an evolutionary approach
Pub Date: 2023-12-21 | DOI: 10.1007/s10618-023-00997-7
Antonio R. Moya, Bruno Veloso, João Gama, Sebastián Ventura
{"title":"Improving hyper-parameter self-tuning for data streams by adapting an evolutionary approach","authors":"Antonio R. Moya, Bruno Veloso, João Gama, Sebastián Ventura","doi":"10.1007/s10618-023-00997-7","DOIUrl":"https://doi.org/10.1007/s10618-023-00997-7","url":null,"abstract":"","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"52 11","pages":""},"PeriodicalIF":4.8,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138952437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OEC: an online ensemble classifier for mining data streams with noisy labels
Pub Date: 2023-12-12 | DOI: 10.1007/s10618-023-00990-0
Ling Jian, Kai Shao, Ying Liu, Jundong Li, Xijun Liang
Distilling actionable patterns from large-scale streaming data in the presence of concept drift is a challenging problem, especially when the data is polluted with noisy labels. To date, various data stream mining algorithms have been proposed and extensively used in many real-world applications. Observing that classical online learning algorithms complement one another, and with the goal of combining their advantages, we propose an Online Ensemble Classification (OEC) algorithm to integrate the predictions obtained by different base online classification algorithms. The proposed OEC method works by learning the weights of the different base classifiers dynamically through the classical Normalized Exponentiated Gradient (NEG) algorithm framework. As a result, OEC inherits the adaptability and flexibility of concept drift-tracking online classifiers, while maintaining the robustness of noise-resistant online classifiers. Theoretically, we show that OEC is a low-regret algorithm, which makes it a good candidate for learning from noisy streaming data. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed OEC method.
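A minimal sketch of the weighting idea: a Normalized Exponentiated Gradient update over base classifiers, with trivial stand-in base learners. The concrete drift-aware and noise-robust base classifiers that OEC combines, and its exact loss, are not modeled here; the learning rate and 0-1 loss are assumptions.

```python
import numpy as np

class NEGEnsemble:
    """Minimal online ensemble that re-weights base classifiers with a
    Normalized Exponentiated Gradient step after every labeled instance."""

    def __init__(self, base_learners, eta=0.5):
        self.learners = base_learners
        self.w = np.full(len(base_learners), 1.0 / len(base_learners))
        self.eta = eta

    def predict(self, x):
        votes = np.array([clf.predict(x) for clf in self.learners], dtype=float)
        return int(np.round(self.w @ votes))        # weighted vote (binary labels)

    def update(self, x, y):
        losses = np.array([float(clf.predict(x) != y) for clf in self.learners])
        self.w *= np.exp(-self.eta * losses)        # exponentiated gradient step
        self.w /= self.w.sum()                      # ...then normalize
        for clf in self.learners:                   # let base learners adapt too
            clf.partial_fit(x, y)

class ConstantLearner:
    """Stand-in base learner used only for the toy example below."""
    def __init__(self, label): self.label = label
    def predict(self, x): return self.label
    def partial_fit(self, x, y): pass

ens = NEGEnsemble([ConstantLearner(0), ConstantLearner(1)])
for x, y in [((0.1,), 1), ((0.4,), 1), ((0.9,), 1)]:
    ens.update(x, y)
print(np.round(ens.w, 3))   # weight shifts towards the learner that predicts 1
```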
{"title":"OEC: an online ensemble classifier for mining data streams with noisy labels","authors":"Ling Jian, Kai Shao, Ying Liu, Jundong Li, Xijun Liang","doi":"10.1007/s10618-023-00990-0","DOIUrl":"https://doi.org/10.1007/s10618-023-00990-0","url":null,"abstract":"<p>Distilling actionable patterns from large-scale streaming data in the presence of concept drift is a challenging problem, especially when data is polluted with noisy labels. To date, various data stream mining algorithms have been proposed and extensively used in many real-world applications. Considering the functional complementation of classical online learning algorithms and with the goal of combining their advantages, we propose an Online Ensemble Classification (OEC) algorithm to integrate the predictions obtained by different base online classification algorithms. The proposed OEC method works by learning weights of different base classifiers dynamically through the classical Normalized Exponentiated Gradient (NEG) algorithm framework. As a result, the proposed OEC inherits the adaptability and flexibility of concept drift-tracking online classifiers, while maintaining the robustness of noise-resistant online classifiers. Theoretically, we show OEC algorithm is a low regret algorithm which makes it a good candidate to learn from noisy streaming data. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed OEC method.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"177 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2023-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138628812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}