Machine learning models have become increasingly popular for predicting the results of soccer matches; however, the lack of publicly available benchmark datasets has made model evaluation challenging. The 2023 Soccer Prediction Challenge required the prediction of match results, first in terms of the exact goals scored by each team and, second, in terms of the probabilities of a win, draw, and loss. The original training set of matches and features provided for the competition was augmented with additional matches played between 4 April and 13 April 2023, the period after the training set ended but before the first matches to be predicted (on which performance was evaluated). A CatBoost model was employed using pi-ratings as the features, which were initially identified as the optimal choice for calculating the win/draw/loss probabilities. Notably, deep learning models have frequently been disregarded for this task. Therefore, in this study, we aimed to assess the performance of a deep learning model and determine the optimal feature set for a gradient-boosted tree model. The model was trained on the most recent 5 years of data, and three training and validation sets were used in a hyperparameter grid search. The results on the validation sets show that our model had strong performance and stability compared to previously published models from the 2017 Soccer Prediction Challenge for win/draw/loss prediction. Our model ranked 16th in the 2023 Soccer Prediction Challenge with a ranked probability score (RPS) of 0.2195.
{"title":"Evaluating soccer match prediction models: a deep learning approach and feature optimization for gradient-boosted trees","authors":"Calvin Yeung, Rory Bunker, Rikuhei Umemoto, Keisuke Fujii","doi":"10.1007/s10994-024-06608-w","DOIUrl":"https://doi.org/10.1007/s10994-024-06608-w","url":null,"abstract":"<p>Machine learning models have become increasingly popular for predicting the results of soccer matches, however, the lack of publicly-available benchmark datasets has made model evaluation challenging. The 2023 Soccer Prediction Challenge required the prediction of match results first in terms of the exact goals scored by each team, and second, in terms of the probabilities for a win, draw, and loss. The original training set of matches and features, which was provided for the competition, was augmented with additional matches that were played between 4 April and 13 April 2023, representing the period after which the training set ended, but prior to the first matches that were to be predicted (upon which the performance was evaluated). A CatBoost model was employed using pi-ratings as the features, which were initially identified as the optimal choice for calculating the win/draw/loss probabilities. Notably, deep learning models have frequently been disregarded in this particular task. Therefore, in this study, we aimed to assess the performance of a deep learning model and determine the optimal feature set for a gradient-boosted tree model. The model was trained using the most recent 5 years of data, and three training and validation sets were used in a hyperparameter grid search. The results from the validation sets show that our model had strong performance and stability compared to previously published models from the 2017 Soccer Prediction Challenge for win/draw/loss prediction. Our model ranked 16th in the 2023 Soccer Prediction Challenge with RPS 0.2195.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"55 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142209733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-21, DOI: 10.1007/s10994-024-06613-z
Keqing Cen, Zhenghao Yang, Ze Wang, Minhong Dong
With the widespread adoption of the mobile internet, users generate vast amounts of location-based data across multiple social networking platforms. This data is valuable for applications such as personalized recommendations and targeted advertising. Accurately identifying users across different platforms enhances understanding of user behavior and preferences. To address the complexity of cross-domain user identification caused by varying check-in frequencies and differences in data precision, we propose HTEGAT, a hierarchical trajectory embedding-based graph attention network model. HTEGAT addresses these issues by combining an Encoder and a Trajectory Identification module. The Encoder module, by integrating self-attention mechanisms with an LSTM, can effectively extract location point-level features and accurately capture trajectory transition features, thereby accurately characterizing hierarchical temporal trajectories. The Trajectory Identification module introduces trajectory distance-neighbor relationships and constructs an adjacency matrix based on these relationships. By utilizing attention weight coefficients in a graph attention network to capture similarities between trajectories, this approach reduces identification complexity while addressing the issue of dataset sparsity. Experiments on two cross-domain Location-Based Social Network (LBSN) datasets demonstrate that HTEGAT achieves higher hit rates with lower time complexity. On the Foursquare-Twitter dataset, HTEGAT significantly improved hit rates, surpassing state-of-the-art methods. On the Instagram-Twitter dataset, HTEGAT consistently outperformed contemporary models, showcasing its effectiveness and superiority.
{"title":"A cross-domain user association scheme based on graph attention networks with trajectory embedding","authors":"Keqing Cen, Zhenghao Yang, Ze Wang, Minhong Dong","doi":"10.1007/s10994-024-06613-z","DOIUrl":"https://doi.org/10.1007/s10994-024-06613-z","url":null,"abstract":"<p>With the widespread adoption of mobile internet, users generate vast amounts of location-based data across multiple social networking platforms. This data is valuable for applications such as personalized recommendations and targeted advertising. Accurately identifying users across different platforms enhances understanding of user behavior and preferences. To address the complexity of cross-domain user identification caused by varying check-in frequencies and data precision differences, we propose HTEGAT, a hierarchical trajectory embedding-based graph attention network model. HTEGAT addresses these issues by combining an Encoder and a Trajectory Identification module. The Encoder module, by integrating self-attention mechanisms with LSTM, can effectively extract location point-level features and accurately capture trajectory transition features, thereby accurately characterizing hierarchical temporal trajectories. Trajectory Identification module introduces trajectory distance-neighbor relationships and constructs an adjacency matrix based on these relationships. By utilizing attention weight coefficients in a graph attention network to capture similarities between trajectories, this approach reduces identification complexity while addressing the issue of dataset sparsity. Experiments on two cross-domain Location-Based Social Network (LBSN) datasets demonstrate that HTEGAT achieves higher hit rates with lower time complexity. On the Foursquare-Twitter dataset, HTEGAT significantly improved hit rates, surpassing state-of-the-art methods. On the Instagram-Twitter dataset, HTEGAT consistently outperformed contemporary models, showcasing its effectiveness and superiority.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"1 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142226444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-20, DOI: 10.1007/s10994-024-06598-9
Yidi Bai, Hengjian Cui
Large-scale datasets inevitably contain noisy labels, which degrade the performance of deep neural networks (DNNs). Many existing methods focus on loss and regularization tricks, as well as on characterizing and modelling the differences between noisy and clean samples. However, exploiting information from the differing extents of distortion in the latent feature space is less explored and remains challenging. To address this, we analyze the characteristic distortion extents of different high-dimensional features and conclude that features vary in how strongly their correlations with the categorical variable are deformed. These disturbances not only reduce the sensitivity and contribution of latent features to classification, but also hinder the generation of decision boundaries. To mitigate these issues, we propose the class sensitivity feature extractor (CSFE) and the T-type generative classifier (TGC). Based on the weighted Mahalanobis distance between the conditional and unconditional cumulative distribution functions after a variance-stabilizing transformation, CSFE realizes high-quality feature extraction by evaluating each feature's class-wise discrimination ability and sensitivity to classification. TGC introduces a Student-t estimator into clustering analysis in the latent space, which is more robust in generating decision boundaries while maintaining equivalent efficiency. To alleviate the cost of retraining a whole DNN, we propose an ensemble model that simultaneously generates robust decision boundaries and trains the DNN with the improved CSFE, named SoftCSFE. Extensive experiments on three datasets, the RML2016.10a dataset, the UCR Time Series Classification Archive, and the real-world Clothing1M dataset, show the advantages of our methods.
{"title":"A class sensitivity feature guided T-type generative model for noisy label classification","authors":"Yidi Bai, Hengjian Cui","doi":"10.1007/s10994-024-06598-9","DOIUrl":"https://doi.org/10.1007/s10994-024-06598-9","url":null,"abstract":"<p>Large-scale datasets inevitably contain noisy labels, which induces weak performance of deep neural networks (DNNs). Many existing methods focus on loss and regularization tricks, as well as characterizing and modelling differences between noisy and clean samples. However, taking advantage of information from different extents of distortion in latent feature space, is less explored and remains challenging. To solve this problem, we analyze characteristic distortion extents of different high-dimensional features, achieving the conclusion that features vary in their degree of deformation in their correlations with respect to categorical variables. Aforementioned disturbances on features not only reduce sensitivity and contribution of latent features to classification, but also bring obstacles into generating decision boundaries. To mitigate these issues, we propose class sensitivity feature extractor (CSFE) and T-type generative classifier (TGC). Based on the weighted Mahalanobis distance between conditional and unconditional cumulative distribution function after variance-stabilizing transformation, CSFE realizes high quality feature extraction through evaluating class-wise discrimination ability and sensitivity to classification. TGC introduces student-t estimator to clustering analysis in latent space, which is more robust in generating decision boundaries while maintaining equivalent efficiency. To alleviate the cost of retraining a whole DNN, we propose an ensemble model to simultaneously generate robust decision boundaries and train the DNN with the improved CSFE named SoftCSFE. Extensive experiments on three datasets, which are the RML2016.10a dataset, UCR Time Series Classification Archive dataset and a real-world dataset Clothing1M, show advantages of our methods.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"58 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142209736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-19, DOI: 10.1007/s10994-024-06605-z
Zhilin Zhao, Longbing Cao
A standard network pretrained on in-distribution (ID) samples can make high-confidence predictions on out-of-distribution (OOD) samples, leaving the possibility of failing to distinguish ID and OOD samples in the test phase. To address this over-confidence issue, existing methods improve OOD sensitivity from the modeling perspective, i.e., by retraining the network with modified training processes or objective functions. In contrast, this paper proposes a simple but effective method, namely Weighted Non-IID Batching (WNB), which works by adjusting batch weights. WNB builds on a key observation: increasing the batch size can improve OOD detection performance. This is because a smaller batch size makes its samples more likely to be treated as non-IID with respect to the assumed ID distribution, i.e., as if associated with an OOD, which causes the network to provide high-confidence predictions for all samples from that OOD. Accordingly, WNB applies a weight function that weights each batch according to the discrepancy between its samples and the entire training ID dataset. Specifically, the weight function is derived by minimizing a generalization error bound. It ensures that larger weights are assigned to batches with smaller discrepancies and makes a trade-off between ID classification and OOD detection performance. Experimental results show that incorporating WNB into state-of-the-art OOD detection methods can further improve their performance.
{"title":"Weighting non-IID batches for out-of-distribution detection","authors":"Zhilin Zhao, Longbing Cao","doi":"10.1007/s10994-024-06605-z","DOIUrl":"https://doi.org/10.1007/s10994-024-06605-z","url":null,"abstract":"<p>A standard network pretrained on in-distribution (ID) samples could make high-confidence predictions on out-of-distribution (OOD) samples, leaving the possibility of failing to distinguish ID and OOD samples in the test phase. To address this over-confidence issue, the existing methods improve the OOD sensitivity from modeling perspectives, i.e., retraining it by modifying training processes or objective functions. In contrast, this paper proposes a simple but effective method, namely Weighted Non-IID Batching (WNB), by adjusting batch weights. WNB builds on a key observation: increasing the batch size can improve the OOD detection performance. This is because a smaller batch size may make its batch samples more likely to be treated as non-IID from the assumed ID, i.e., associated with an OOD. This causes a network to provide high-confidence predictions for all samples from the OOD. Accordingly, WNB applies a weight function to weight each batch according to the discrepancy between batch samples and the entire training ID dataset. Specifically, the weight function is derived by minimizing the generalization error bound. It ensures that the weight function assigns larger weights to batches with smaller discrepancies and makes a trade-off between ID classification and OOD detection performance. Experimental results show that incorporating WNB into state-of-the-art OOD detection methods can further improve their performance.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"267 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142209734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-07, DOI: 10.1007/s10994-024-06557-4
Xavier Renard, Thibault Laugel, Marcin Detyniecki
A multitude of classifiers can be trained on the same data to achieve similar performances at test time while having learned significantly different classification patterns. When selecting a classifier, the machine learning practitioner has no understanding of the differences between models, their limits, where they agree, and where they don't. But this choice will have concrete consequences for instances classified in the discrepancy zone, since the final decision will be based on the selected classification pattern. Besides the arbitrary nature of the result, a bad choice could have further negative consequences such as loss of opportunity or lack of fairness. This paper proposes to address this question by analyzing the prediction discrepancies in a pool of best-performing models trained on the same data. A model-agnostic algorithm, DIG, is proposed to capture and explain discrepancies locally in tabular datasets, enabling the practitioner to make the best educated decision when selecting a model by anticipating its potential undesired consequences.
{"title":"Understanding prediction discrepancies in classification","authors":"Xavier Renard, Thibault Laugel, Marcin Detyniecki","doi":"10.1007/s10994-024-06557-4","DOIUrl":"https://doi.org/10.1007/s10994-024-06557-4","url":null,"abstract":"<p>A multitude of classifiers can be trained on the same data to achieve similar performances during test time while having learned significantly different classification patterns. When selecting a classifier, the machine learning practitioner has no understanding on the differences between models, their limits, where they agree and where they don’t. But this choice will result in concrete consequences for instances to be classified in the discrepancy zone, since the final decision will be based on the selected classification pattern. Besides the arbitrary nature of the result, a bad choice could have further negative consequences such as loss of opportunity or lack of fairness. This paper proposes to address this question by analyzing the prediction discrepancies in a pool of best-performing models trained on the same data. A model-agnostic algorithm, DIG, is proposed to <i>capture and explain</i> discrepancies locally in tabular datasets, to enable the practitioner to make the best educated decision when selecting a model by anticipating its potential undesired consequences.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"13 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-07, DOI: 10.1007/s10994-024-06599-8
Eric F. Lock
Data for several applications in diverse fields can be represented as multiple matrices that are linked across rows or columns. This is particularly common in molecular biomedical research, in which multiple molecular “omics” technologies may capture different feature sets (e.g., corresponding to rows in a matrix) and/or different sample populations (corresponding to columns). This has motivated a large body of work on integrative matrix factorization approaches that identify and decompose low-dimensional signal that is shared across multiple matrices or specific to a given matrix. We propose an empirical variational Bayesian approach to this problem that has several advantages over existing techniques, including the flexibility to accommodate shared signal over any number of row or column sets (i.e., bidimensional integration), an intuitive model-based objective function that yields appropriate shrinkage for the inferred signals, and a relatively efficient estimation algorithm with no tuning parameters. A general result establishes conditions for the uniqueness of the underlying decomposition for a broad family of methods that includes the proposed approach. For scenarios with missing data, we describe an associated iterative imputation approach that is novel for the single-matrix context and a powerful approach for “blockwise” imputation (in which an entire row or column is missing) in various linked matrix contexts. Extensive simulations show that the method performs very well under different scenarios with respect to recovering underlying low-rank signal, accurately decomposing shared and specific signals, and accurately imputing missing data. The approach is applied to gene expression and miRNA data from breast cancer tissue and normal breast tissue, for which it gives an informative decomposition of variation and outperforms alternative strategies for missing data imputation.
{"title":"Empirical Bayes linked matrix decomposition","authors":"Eric F. Lock","doi":"10.1007/s10994-024-06599-8","DOIUrl":"https://doi.org/10.1007/s10994-024-06599-8","url":null,"abstract":"<p>Data for several applications in diverse fields can be represented as multiple matrices that are linked across rows or columns. This is particularly common in molecular biomedical research, in which multiple molecular “omics” technologies may capture different feature sets (e.g., corresponding to rows in a matrix) and/or different sample populations (corresponding to columns). This has motivated a large body of work on integrative matrix factorization approaches that identify and decompose low-dimensional signal that is shared across multiple matrices or specific to a given matrix. We propose an empirical variational Bayesian approach to this problem that has several advantages over existing techniques, including the flexibility to accommodate shared signal over any number of row or column sets (i.e., bidimensional integration), an intuitive model-based objective function that yields appropriate shrinkage for the inferred signals, and a relatively efficient estimation algorithm with no tuning parameters. A general result establishes conditions for the uniqueness of the underlying decomposition for a broad family of methods that includes the proposed approach. For scenarios with missing data, we describe an associated iterative imputation approach that is novel for the single-matrix context and a powerful approach for “blockwise” imputation (in which an entire row or column is missing) in various linked matrix contexts. Extensive simulations show that the method performs very well under different scenarios with respect to recovering underlying low-rank signal, accurately decomposing shared and specific signals, and accurately imputing missing data. The approach is applied to gene expression and miRNA data from breast cancer tissue and normal breast tissue, for which it gives an informative decomposition of variation and outperforms alternative strategies for missing data imputation.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"24 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-07, DOI: 10.1007/s10994-024-06601-3
Victoria Manfredi, Alicia P. Wolfe, Xiaolan Zhang, Bing Wang
Mobile wireless networks present several challenges for any learning system, due to uncertain and variable device movement, a decentralized network architecture, and constraints on network resources. In this work, we use deep reinforcement learning (DRL) to learn a scalable and generalizable forwarding strategy for such networks. We make the following contributions: (i) we use hierarchical RL to design DRL packet agents rather than device agents to capture the packet forwarding decisions that are made over time and improve training efficiency; (ii) we use relational features to ensure generalizability of the learned forwarding strategy to a wide range of network dynamics and enable offline training; and (iii) we incorporate both forwarding goals and network resource considerations into packet decision-making by designing a weighted reward function. Our results show that the forwarding strategy used by our DRL packet agent often achieves a similar delay per packet delivered as the oracle forwarding strategy and almost always outperforms all other strategies (including state-of-the-art strategies) in terms of delay, even on scenarios on which the DRL agent was not trained.
{"title":"Learning an adaptive forwarding strategy for mobile wireless networks: resource usage vs. latency","authors":"Victoria Manfredi, Alicia P. Wolfe, Xiaolan Zhang, Bing Wang","doi":"10.1007/s10994-024-06601-3","DOIUrl":"https://doi.org/10.1007/s10994-024-06601-3","url":null,"abstract":"<p>Mobile wireless networks present several challenges for any learning system, due to uncertain and variable device movement, a decentralized network architecture, and constraints on network resources. In this work, we use deep reinforcement learning (DRL) to learn a scalable and generalizable forwarding strategy for such networks. We make the following contributions: (i) we use hierarchical RL to design DRL packet agents rather than device agents to capture the packet forwarding decisions that are made over time and improve training efficiency; (ii) we use relational features to ensure generalizability of the learned forwarding strategy to a wide range of network dynamics and enable offline training; and (iii) we incorporate both forwarding goals and network resource considerations into packet decision-making by designing a weighted reward function. Our results show that the forwarding strategy used by our DRL packet agent often achieves a similar delay per packet delivered as the oracle forwarding strategy and almost always outperforms all other strategies (including state-of-the-art strategies) in terms of delay, even on scenarios on which the DRL agent was not trained.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"79 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-05, DOI: 10.1007/s10994-024-06597-w
Yunzhe Zhou, Peiru Xu, Giles Hooker
Model distillation has been a popular method for producing interpretable machine learning. It uses an interpretable “student” model to mimic the predictions made by a black-box “teacher” model. However, when the student model is sensitive to the variability of the data sets used for training, even when the teacher is kept fixed, the corresponding interpretation is not reliable. Existing strategies stabilize model distillation by checking whether a large enough sample of pseudo-data is generated to reliably reproduce student models, but methods to do so have so far been developed separately for each specific class of student model. In this paper, we develop a generic approach for stable model distillation based on the central limit theorem for the estimated fidelity of the student to the teacher. We start with a collection of candidate student models and search for candidates that reasonably agree with the teacher. Then we construct a multiple testing framework to select a sample size such that a consistent student model would be selected under different pseudo samples. We demonstrate the application of our proposed approach on three commonly used intelligible models: decision trees, falling rule lists, and symbolic regression. Finally, we conduct simulation experiments on the Mammographic Mass and Breast Cancer datasets and illustrate the testing procedure through a theoretical analysis with a Markov process. The code is publicly available at https://github.com/yunzhe-zhou/GenericDistillation.
{"title":"A generic approach for reproducible model distillation","authors":"Yunzhe Zhou, Peiru Xu, Giles Hooker","doi":"10.1007/s10994-024-06597-w","DOIUrl":"https://doi.org/10.1007/s10994-024-06597-w","url":null,"abstract":"<p>Model distillation has been a popular method for producing interpretable machine learning. It uses an interpretable “student” model to mimic the predictions made by the black box “teacher” model. However, when the student model is sensitive to the variability of the data sets used for training even when keeping the teacher fixed, the corresponded interpretation is not reliable. Existing strategies stabilize model distillation by checking whether a large enough sample of pseudo-data is generated to reliably reproduce student models, but methods to do so have so far been developed separately for each specific class of student model. In this paper, we develop a generic approach for stable model distillation based on central limit theorem for the estimated fidelity of the student to the teacher. We start with a collection of candidate student models and search for candidates that reasonably agree with the teacher. Then we construct a multiple testing framework to select a sample size such that the consistent student model would be selected under different pseudo samples. We demonstrate the application of our proposed approach on three commonly used intelligible models: decision trees, falling rule lists and symbolic regression. Finally, we conduct simulation experiments on Mammographic Mass and Breast Cancer datasets and illustrate the testing procedure throughout a theoretical analysis with Markov process. The code is publicly available at https://github.com/yunzhe-zhou/GenericDistillation.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"23 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-01, DOI: 10.1007/s10994-024-06584-1
Ekaterina Antonenko, Ander Carreño, Jesse Read
Missing values are a common problem in data science and machine learning. Removing instances with missing values is a straightforward workaround, but this can significantly hinder subsequent data analysis, particularly when features outnumber instances. There are a variety of methodologies proposed in the literature for imputing missing values. Denoising Autoencoders, for example, have been leveraged efficiently for imputation. However, neural network approaches have been relatively less effective on smaller datasets. In this work, we propose Autoreplicative Random Forests (ARF) as a multi-output learning approach, which we introduce in the context of a framework that may impute via either an iterative or procedural process. Experiments on several low- and high-dimensional datasets show that ARF is computationally efficient and exhibits better imputation performance than its competitors, including neural network approaches. In order to provide statistical analysis and mathematical background to the proposed missing value imputation framework, we also propose probabilistic ARFs, where the confidence values are provided over different imputation hypotheses, therefore maximizing the utility of such a framework in a machine-learning pipeline targeting predictive performance.
{"title":"Autoreplicative random forests with applications to missing value imputation","authors":"Ekaterina Antonenko, Ander Carreño, Jesse Read","doi":"10.1007/s10994-024-06584-1","DOIUrl":"https://doi.org/10.1007/s10994-024-06584-1","url":null,"abstract":"<p>Missing values are a common problem in data science and machine learning. Removing instances with missing values is a straightforward workaround, but this can significantly hinder subsequent data analysis, particularly when features outnumber instances. There are a variety of methodologies proposed in the literature for imputing missing values. Denoising Autoencoders, for example, have been leveraged efficiently for imputation. However, neural network approaches have been relatively less effective on smaller datasets. In this work, we propose Autoreplicative Random Forests (ARF) as a multi-output learning approach, which we introduce in the context of a framework that may impute via either an iterative or procedural process. Experiments on several low- and high-dimensional datasets show that ARF is computationally efficient and exhibits better imputation performance than its competitors, including neural network approaches. In order to provide statistical analysis and mathematical background to the proposed missing value imputation framework, we also propose probabilistic ARFs, where the confidence values are provided over different imputation hypotheses, therefore maximizing the utility of such a framework in a machine-learning pipeline targeting predictive performance.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"219 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141886519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-01, DOI: 10.1007/s10994-024-06600-4
Andrea Failla, Rémy Cazabet, Giulio Rossetti, Salvatore Citraro
Groups—such as clusters of points or communities of nodes—are fundamental when addressing various data mining tasks. In temporal data, the predominant approach for characterizing group evolution has been through the identification of “events”. However, the events usually described in the literature, e.g., shrinks/growths, splits/merges, are often arbitrarily defined, creating a gap between such theoretical/predefined types and real-data group observations. Moving beyond existing taxonomies, we think of events as “archetypes” characterized by a unique combination of quantitative dimensions that we call “facets”. Group dynamics are defined by their position within the facet space, where archetypal events occupy extremities. Thus, rather than enforcing strict event types, our approach can allow for hybrid descriptions of dynamics involving group proximity to multiple archetypes. We apply our framework to evolving groups from several face-to-face interaction datasets, showing it enables richer, more reliable characterization of group dynamics with respect to state-of-the-art methods, especially when the groups are subject to complex relationships. Our approach also offers intuitive solutions to common tasks related to dynamic group analysis, such as choosing an appropriate aggregation scale, quantifying partition stability, and evaluating event quality.
{"title":"Describing group evolution in temporal data using multi-faceted events","authors":"Andrea Failla, Rémy Cazabet, Giulio Rossetti, Salvatore Citraro","doi":"10.1007/s10994-024-06600-4","DOIUrl":"https://doi.org/10.1007/s10994-024-06600-4","url":null,"abstract":"<p>Groups—such as clusters of points or communities of nodes—are fundamental when addressing various data mining tasks. In temporal data, the predominant approach for characterizing group evolution has been through the identification of “events”. However, the events usually described in the literature, e.g., shrinks/growths, splits/merges, are often arbitrarily defined, creating a gap between such theoretical/predefined types and real-data group observations. Moving beyond existing taxonomies, we think of events as “archetypes” characterized by a unique combination of quantitative dimensions that we call “facets”. Group dynamics are defined by their position within the facet space, where archetypal events occupy extremities. Thus, rather than enforcing strict event types, our approach can allow for hybrid descriptions of dynamics involving group proximity to multiple archetypes. We apply our framework to evolving groups from several face-to-face interaction datasets, showing it enables richer, more reliable characterization of group dynamics with respect to state-of-the-art methods, especially when the groups are subject to complex relationships. Our approach also offers intuitive solutions to common tasks related to dynamic group analysis, such as choosing an appropriate aggregation scale, quantifying partition stability, and evaluating event quality.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"78 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}