Data Mining and Knowledge Discovery: latest articles

MASS: distance profile of a query over a time series

IF 4.8 · Zone 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2024-02-05 DOI: 10.1007/s10618-024-01005-2

Sheng Zhong, Abdullah Mueen

Given a long time series, the distance profile of a query time series is the sequence of distances between the query and every possible subsequence of the long time series. MASS (Mueen’s Algorithm for Similarity Search) is an algorithm to efficiently compute the distance profile under z-normalized Euclidean distance (Mueen et al. in The fastest similarity search algorithm for time series subsequences under Euclidean distance. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html, 2017). MASS is recognized as a useful tool in many data mining works. However, complete documentation of the increasingly efficient versions of the algorithm does not exist. In this paper, we formalize the notion of a distance profile, describe four versions of the MASS algorithm, show several extensions of distance profiles under various operating conditions, describe how MASS improves the performance of existing data mining algorithms, and finally, show the utility of MASS in domains including seismology, robotics and power grids.
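
The definition above can be illustrated with a minimal sketch. It is not any of the four MASS versions from the paper: it computes the sliding dot products directly in O(nm) time, whereas MASS is described as obtaining them efficiently (the usual route is FFT-based convolution). The function names are hypothetical, and every window is assumed to be non-constant so that its standard deviation is nonzero.

```python
import math

def znorm(x):
    """Z-normalize a sequence using the population standard deviation."""
    m = len(x)
    mu = sum(x) / m
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / m)
    return [(v - mu) / sigma for v in x]

def naive_distance_profile(query, series):
    """Definition: z-normalized Euclidean distance to every subsequence."""
    m = len(query)
    q = znorm(query)
    profile = []
    for i in range(len(series) - m + 1):
        s = znorm(series[i:i + m])
        profile.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s))))
    return profile

def dot_product_distance_profile(query, series):
    """Same profile via the identity d^2 = 2m * (1 - correlation).

    The dot product QT is computed directly here; a fast implementation
    would compute all sliding dot products at once.
    """
    m = len(query)
    mu_q = sum(query) / m
    sigma_q = math.sqrt(sum(v * v for v in query) / m - mu_q ** 2)
    profile = []
    for i in range(len(series) - m + 1):
        window = series[i:i + m]
        mu_t = sum(window) / m
        sigma_t = math.sqrt(sum(v * v for v in window) / m - mu_t ** 2)
        qt = sum(a * b for a, b in zip(query, window))
        corr = (qt - m * mu_q * mu_t) / (m * sigma_q * sigma_t)
        # Clamp tiny negative values caused by floating-point rounding.
        profile.append(math.sqrt(max(0.0, 2.0 * m * (1.0 - corr))))
    return profile

series = [1, 3, 2, 5, 4, 6, 8, 7, 9, 3]
query = [2, 5, 4]  # identical to series[2:5]
profile = dot_product_distance_profile(query, series)
best_match = min(range(len(profile)), key=profile.__getitem__)
```

In this toy input the query is copied from position 2 of the series, so the smallest entry of the profile falls at index 2.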

Citations: 0

Better trees: an empirical study on hyperparameter tuning of classification decision tree induction algorithms

Pub Date : 2024-01-31 DOI: 10.1007/s10618-024-01002-5

Rafael Gomes Mantovani, Tomáš Horváth, André L. D. Rossi, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André C. P. L. F. de Carvalho

Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Because the number of possible hyperparameter configurations is large and their interactions are complex, it is common to use optimization techniques to find settings that lead to high predictive performance. However, exploring this vast configuration space efficiently and handling the trade-off between predictive and runtime performance remain challenging. Furthermore, there are cases where the default hyperparameters are already a suitable configuration. Additionally, for many reasons, including model validation and compliance with new legislation, there is increasing interest in interpretable models, such as those created by decision tree (DT) induction algorithms. This paper provides a comprehensive approach for investigating the effects of hyperparameter tuning for the two most commonly used DT induction algorithms, CART and C4.5. DT induction algorithms offer high predictive performance and interpretable classification models, though many hyperparameters need to be adjusted. Experiments were carried out with different tuning strategies to induce models and to evaluate hyperparameters’ relevance using 94 classification datasets from OpenML. The experimental results indicate that different hyperparameter profiles for the tuning of each algorithm provide statistically significant improvements in most of the datasets for CART, but in only one-third for C4.5. Although different algorithms may present different tuning scenarios, the tuning techniques generally required few evaluations to find accurate solutions. Furthermore, the best technique for all the algorithms was Irace. Finally, we found that tuning a specific small subset of hyperparameters is a good alternative for achieving optimal predictive performance.

Citations: 0

Central node identification via weighted kernel density estimation

Pub Date : 2024-01-31 DOI: 10.1007/s10618-024-01003-4

Abstract

The detection of central nodes in a network is a fundamental task in network science and graph data analysis. During the past decades, numerous centrality measures have been presented to characterize what is a central node. However, few studies address this issue from a statistical inference perspective. In this paper, we formulate the central node identification issue as a weighted kernel density estimation problem on graphs. Such a formulation provides a generic framework for recognizing central nodes. On one hand, some existing centrality evaluation metrics can be unified under this framework through the manipulation of kernel functions. On the other hand, more effective methods for node centrality assessment can be developed based on proper weighting coefficient specification. Experimental results on 20 simulated networks and 53 real networks show that our method outperforms both six prior state-of-the-art centrality measures and two recently proposed centrality evaluation methods. To the best of our knowledge, this is the first piece of work that addresses the central node identification issue via weighted kernel density estimation.
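
To make the idea of kernel density estimation on a graph concrete, here is a rough sketch under stated assumptions; it is not the paper's formulation. It assumes an unweighted, undirected graph given as an adjacency dictionary, uses shortest-path hop counts as the distance, and applies a Gaussian kernel. The function names, the bandwidth value, and the uniform default weights are all hypothetical choices for illustration.

```python
import math
from collections import deque

def bfs_distances(adj, source):
    """Hop-count distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def kde_centrality(adj, weights=None, bandwidth=1.0):
    """Score each node by a weighted sum of Gaussian kernels over graph distance.

    Nodes unreachable from v contribute nothing to v's score.
    """
    if weights is None:
        weights = {v: 1.0 for v in adj}  # assumed uniform weighting
    scores = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        scores[v] = sum(
            weights[u] * math.exp(-0.5 * (d / bandwidth) ** 2)
            for u, d in dist.items()
        )
    return scores

# Star graph: node 0 is connected to four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
scores = kde_centrality(star)
most_central = max(scores, key=scores.get)
```

Changing the kernel or the per-node weights changes which notion of centrality the score reflects, which is the kind of flexibility the abstract attributes to the framework.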

Citations: 0

Fusing structural information with knowledge enhanced text representation for knowledge graph completion

Pub Date : 2024-01-19 DOI: 10.1007/s10618-023-00998-6

Kang Tang, Shasha Li, Jintao Tang, Dong Li, Pancheng Wang, Ting Wang

Although knowledge graphs store a large number of facts in the form of triplets, they are still limited by incompleteness. Hence, Knowledge Graph Completion (KGC), defined as inferring missing entities or relations based on observed facts, has long been a fundamental issue for various knowledge-driven downstream applications. Prevailing KG embedding methods for KGC, such as TransE, rely solely on mining the structural information of existing facts and therefore fail to generalize, as they are inapplicable to unseen entities. Recently, a series of studies have employed pre-trained encoders to learn textual representations for triples, i.e., textual-encoding methods. While these generalize well to unseen entities, they are still inferior to the KG embedding based methods above. In this paper, we devise a novel textual-encoding learning framework for KGC. To enrich textual prior knowledge for more informative prediction, it features three hierarchical maskings that can utilize distant contexts of the input text so that textual prior knowledge can be elicited. In addition, to resolve predictive ambiguity caused by improper relational modeling, a relation-aware structure learning scheme is applied based on textual embeddings. Extensive experimental results on several popular datasets suggest the effectiveness of our approach even compared with recent state-of-the-art methods for this task.
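
The limitation of structure-only embedding methods can be seen in a small sketch of TransE-style scoring, where a triple (head, relation, tail) is plausible when head + relation lies close to tail in the embedding space. The two-dimensional embeddings and entity names below are invented for illustration and are not taken from the paper or any dataset.

```python
import math

# Toy embeddings, invented for illustration only.
entity_emb = {
    "Paris": [1.0, 0.0],
    "France": [1.0, 1.0],
    "Japan": [3.0, 1.0],
}
relation_emb = {"capital_of": [0.0, 1.0]}

def transe_score(head, relation, tail):
    """Negative Euclidean norm of (h + r - t); higher means more plausible."""
    h = entity_emb[head]
    r = relation_emb[relation]
    t = entity_emb[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def rank_tails(head, relation):
    """Rank all known entities as candidate tails, best first."""
    return sorted(entity_emb, key=lambda t: transe_score(head, relation, t),
                  reverse=True)

ranking = rank_tails("Paris", "capital_of")
```

Because the score depends on a lookup table keyed by entity, an entity that was never seen during training (for example a hypothetical "Berlin" here) has no embedding and cannot be scored, which is the generalization gap that textual-encoding methods aim to close.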

Citations: 0

Adaptive Bernstein change detector for high-dimensional data streams

Pub Date : 2024-01-09 DOI: 10.1007/s10618-023-00999-5

Marco Heyden, Edouard Fouché, Vadim Arzamasov, Tanja Fenn, Florian Kalinke, Klemens Böhm

Change detection is of fundamental importance when analyzing data streams. Detecting changes both quickly and accurately enables monitoring and prediction systems to react, e.g., by issuing an alarm or by updating a learning algorithm. However, detecting changes is challenging when observations are high-dimensional. In high-dimensional data, change detectors should not only be able to identify when changes happen, but also in which subspace they occur. Ideally, one should also quantify how severe they are. Our approach, ABCD, has these properties. ABCD learns an encoder-decoder model and monitors its accuracy over a window of adaptive size. ABCD derives a change score based on Bernstein’s inequality to detect deviations in terms of accuracy, which indicate changes. Our experiments demonstrate that ABCD outperforms its best competitor by up to 20% in F1-score on average. It can also accurately estimate changes’ subspace, together with a severity measure that correlates with the ground truth.
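
As a rough illustration of how Bernstein's inequality can turn a drop in accuracy into a change score, the sketch below bounds the probability that two windows of reconstruction errors differ in mean by as much as observed if no change occurred. This is not ABCD's actual score: the function names are hypothetical, the empirical variance is plugged in for the true variance, the deviation of each error from its mean is assumed to be bounded by `bound`, and the two windows are combined with a simple union bound.

```python
import math

def bernstein_tail_bound(n, variance, eps, bound=1.0):
    """Bernstein bound on P(|sample mean - true mean| >= eps) for n samples."""
    if eps <= 0:
        return 1.0
    exponent = -n * eps * eps / (2.0 * variance + 2.0 * bound * eps / 3.0)
    return min(1.0, 2.0 * math.exp(exponent))

def mean_and_variance(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def change_probability_bound(before, after, bound=1.0):
    """Upper bound on seeing this gap in means if both windows share one mean.

    If the means differ by 2*eps, at least one window mean must be eps away
    from the common true mean, so the two tail bounds are added.
    """
    mu_b, var_b = mean_and_variance(before)
    mu_a, var_a = mean_and_variance(after)
    eps = abs(mu_b - mu_a) / 2.0
    p = (bernstein_tail_bound(len(before), var_b, eps, bound)
         + bernstein_tail_bound(len(after), var_a, eps, bound))
    return min(1.0, p)

# Reconstruction errors in [0, 1]: low before the change, high after it.
errors_before = [0.10, 0.12] * 50
errors_after = [0.80, 0.82] * 50
p_change = change_probability_bound(errors_before, errors_after)
p_same = change_probability_bound(errors_before, errors_before)
```

A small bound suggests that the observed gap is unlikely without a change; a detector would compare it against a chosen significance threshold.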

Citations: 0

When graph convolution meets double attention: online privacy disclosure detection with multi-label text classification

Pub Date : 2024-01-05 DOI: 10.1007/s10618-023-00992-y

Zhanbo Liang, Jie Guo, Weidong Qiu, Zheng Huang, Shujun Li

With the rise of Web 2.0 platforms such as online social media, people’s private information, such as their location, occupation and even family information, is often inadvertently disclosed through online discussions. Therefore, it is important to detect such unwanted privacy disclosures to help alert people affected and the online platform. In this paper, privacy disclosure detection is modeled as a multi-label text classification (MLTC) problem, and a new privacy disclosure detection model is proposed to construct an MLTC classifier for detecting online privacy disclosures. This classifier takes an online post as the input and outputs multiple labels, each reflecting a possible privacy disclosure. The proposed presentation method combines three different sources of information, the input text itself, the label-to-text correlation and the label-to-label correlation. A double-attention mechanism is used to combine the first two sources of information, and a graph convolutional network is employed to extract the third source of information that is then used to help fuse features extracted from the first two sources of information. Our extensive experimental results, obtained on a public dataset of privacy-disclosing posts on Twitter, demonstrated that our proposed privacy disclosure detection method significantly and consistently outperformed other state-of-the-art methods in terms of all key performance indicators.

Citations: 0

CompTrails: comparing hypotheses across behavioral networks

Pub Date : 2024-01-03 DOI: 10.1007/s10618-023-00996-8

Tobias Koopmann, Martin Becker, Florian Lemmerich, Andreas Hotho

The term Behavioral Networks describes networks that contain relational information on human behavior. This ranges from social networks that contain friendships or cooperations between individuals, to navigational networks that contain geographical or web navigation, and many more. Understanding the forces driving behavior within these networks can be beneficial to improving the underlying network, for example, by generating new hyperlinks on websites, or by proposing new connections and friends on social networks. Previous approaches considered different hypotheses on a single network and evaluated which hypothesis fits best. These hypotheses can represent human intuition and expert opinions or be based on previous insights. In this work, we extend these approaches to enable the comparison of a single hypothesis between multiple networks. We unveil several issues of naive approaches that potentially impact comparisons and lead to undesired results. Based on these findings, we propose a framework with five flexible components that allow addressing specific analysis goals tailored to the application scenario. We show the benefits and limits of our approach by applying it to synthetic data and several real-world datasets, including web navigation, bibliometric navigation, and geographic navigation. Our work supports practitioners and researchers with the aim of understanding similarities and differences in human behavior between environments.

Citations: 0

Effective signal reconstruction from multiple ranked lists via convex optimization

Pub Date : 2024-01-02 DOI: 10.1007/s10618-023-00991-z

Abstract

The ranking of objects is widely used to rate their relative quality or relevance across multiple assessments. Beyond classical rank aggregation, it is of interest to estimate the usually unobservable latent signals that inform a consensus ranking. Under the sole assumption of independent assessments, which can be incomplete, we introduce indirect inference via convex optimization in combination with the computationally efficient Poisson bootstrap. Two different objective functions are suggested, one linear and the other quadratic. The mathematical formulation of the signal estimation problem is based on pairwise comparisons of all objects with respect to their rank positions. Sets of constraints represent the order relations. The transitivity property of rank scales allows us to substantially reduce the number of constraints associated with the full set of object comparisons. The key idea is to globally reduce the errors induced by the rankers until optimal latent signals can be obtained. Its main advantage is low computational cost, even when handling (n ≪ p) data problems. Exploratory tools can be developed based on the bootstrap signal estimates and standard errors. Simulation evidence, a comparison with the state-of-the-art rank centrality method, and two applications, one in higher education evaluation and the other in molecular cancer research, are presented.
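
The effect of transitivity on the number of order constraints can be sketched for the simplest case: a single complete ranking without ties. The helper names are hypothetical, and the handling of incomplete lists, multiple rankers, and the optimization itself is not represented here. For p ranked objects, the full set of pairwise comparisons has p(p-1)/2 constraints, while the p-1 constraints between adjacent rank positions imply all of them.

```python
from itertools import combinations

def all_pairwise_constraints(ranking):
    """Every (better, worse) pair implied by a ranking listed best to worst."""
    return list(combinations(ranking, 2))

def adjacent_constraints(ranking):
    """Only the pairs between neighbouring rank positions."""
    return list(zip(ranking, ranking[1:]))

def transitive_closure(pairs):
    """All (a, c) pairs derivable from the given pairs by transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

ranking = ["A", "B", "C", "D", "E"]  # best to worst
full = all_pairwise_constraints(ranking)
reduced = adjacent_constraints(ranking)
same_information = transitive_closure(reduced) == set(full)
```

Here the five objects yield 10 pairwise constraints but only 4 adjacent ones, and the closure check confirms that no order information is lost by the reduction.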

Citations: 0

Correction: A semi-supervised interactive algorithm for change point detection

Pub Date : 2024-01-02 DOI: 10.1007/s10618-023-01000-z

Zhenxiang Cao, N. Seeuws, Maarten Vos, Alexander Bertrand

Citations: 0

Predicting consumer choice from raw eye-movement data using the RETINA deep learning architecture

Pub Date : 2023-12-29 DOI: 10.1007/s10618-023-00989-7

Moshe Unger, Michel Wedel, Alexander Tuzhilin

We propose the use of a deep learning architecture, called RETINA, to predict multi-alternative, multi-attribute consumer choice from eye movement data. RETINA directly uses the complete time series of raw eye-tracking data from both eyes as input to state-of-the-art Transformer and Metric Learning deep learning methods. Using the raw data as input eliminates the information loss that may result from first calculating fixations, deriving metrics from the fixation data and analysing those metrics, as has often been done in eye movement research, and allows us to apply deep learning to eye-tracking data sets of the size commonly encountered in academic and applied research. Using a data set with 112 respondents who made choices among four laptops, we show that the proposed architecture outperforms other state-of-the-art machine learning methods (standard BERT, LSTM, AutoML, logistic regression) calibrated on raw data or fixation data. The analysis of partial time and partial data segments reveals the ability of RETINA to predict choice outcomes well before participants reach a decision. Specifically, we find that using a mere 5 s of data, the RETINA architecture achieves a predictive validation accuracy of over 0.7. We provide an assessment of which features of the eye movement data contribute to RETINA’s prediction accuracy. We make recommendations on how the proposed deep learning architecture can be used as a basis for future academic research, in particular its application to eye movements collected from front-facing video cameras.

Citations: 0
