
Machine Learning: Latest Articles

Weighting non-IID batches for out-of-distribution detection
IF 7.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-19 | DOI: 10.1007/s10994-024-06605-z
Zhilin Zhao, Longbing Cao

A standard network pretrained on in-distribution (ID) samples could make high-confidence predictions on out-of-distribution (OOD) samples, leaving the possibility of failing to distinguish ID and OOD samples in the test phase. To address this over-confidence issue, existing methods improve the OOD sensitivity from modeling perspectives, i.e., retraining the network by modifying training processes or objective functions. In contrast, this paper proposes a simple but effective method, namely Weighted Non-IID Batching (WNB), which adjusts batch weights. WNB builds on a key observation: increasing the batch size can improve the OOD detection performance. This is because a smaller batch size may make its batch samples more likely to be treated as non-IID with respect to the assumed ID, i.e., associated with an OOD. This causes a network to provide high-confidence predictions for all samples from the OOD. Accordingly, WNB applies a weight function to weight each batch according to the discrepancy between batch samples and the entire training ID dataset. Specifically, the weight function is derived by minimizing the generalization error bound. This ensures that the weight function assigns larger weights to batches with smaller discrepancies and makes a trade-off between ID classification and OOD detection performance. Experimental results show that incorporating WNB into state-of-the-art OOD detection methods can further improve their performance.
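
The batch-weighting idea can be illustrated with a small sketch. This is not the weight function from the paper, which is derived from a generalization error bound. As an assumption for illustration only, the discrepancy here is the Euclidean distance between a batch mean and the training-set mean, and the weights are a softmax over negative discrepancies. The names `batch_weights` and `temperature` are hypothetical.

```python
import math


def batch_weights(batches, dataset_mean, temperature=1.0):
    """Hypothetical weighting: batches closer to the dataset mean get larger weights.

    batches: list of batches, each a list of equal-length numeric tuples.
    dataset_mean: feature-wise mean of the whole training set (assumed precomputed).
    """
    dim = len(dataset_mean)
    discrepancies = []
    for batch in batches:
        # Feature-wise mean of this batch.
        batch_mean = [sum(x[d] for x in batch) / len(batch) for d in range(dim)]
        # Assumed discrepancy measure: distance between the two means.
        discrepancies.append(math.dist(batch_mean, dataset_mean))
    # Softmax over negative discrepancies, so the weights sum to one.
    scores = [math.exp(-d / temperature) for d in discrepancies]
    total = sum(scores)
    return [s / total for s in scores]
```

Under these assumptions, a batch whose mean matches the training set receives the largest weight, and `temperature` controls how sharply the weights fall off as the discrepancy grows.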

Citations: 0
Understanding prediction discrepancies in classification
IF 7.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-07 | DOI: 10.1007/s10994-024-06557-4
Xavier Renard, Thibault Laugel, Marcin Detyniecki

A multitude of classifiers can be trained on the same data to achieve similar performance at test time while having learned significantly different classification patterns. When selecting a classifier, the machine learning practitioner has no understanding of the differences between models, their limits, where they agree and where they don’t. But this choice will have concrete consequences for instances to be classified in the discrepancy zone, since the final decision will be based on the selected classification pattern. Besides the arbitrary nature of the result, a bad choice could have further negative consequences such as loss of opportunity or lack of fairness. This paper proposes to address this question by analyzing the prediction discrepancies in a pool of best-performing models trained on the same data. A model-agnostic algorithm, DIG, is proposed to capture and explain discrepancies locally in tabular datasets, to enable the practitioner to make the best educated decision when selecting a model by anticipating its potential undesired consequences.
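
A prediction discrepancy can be made concrete with a minimal sketch. This is not the DIG algorithm, which the abstract does not specify; it only flags the instances on which a pool of models disagrees. The function name `discrepancy_indices` and the toy threshold classifiers are hypothetical.

```python
def discrepancy_indices(models, instances):
    """Return the indices of instances on which the models do not all agree.

    models: list of callables mapping an instance to a class label.
    instances: list of inputs accepted by every model.
    """
    disagreements = []
    for i, x in enumerate(instances):
        # Collect the distinct labels predicted for this instance.
        predictions = {model(x) for model in models}
        if len(predictions) > 1:
            disagreements.append(i)
    return disagreements


# Two toy classifiers that agree everywhere except between 0.4 and 0.6.
model_a = lambda x: int(x > 0.4)
model_b = lambda x: int(x > 0.6)
flagged = discrepancy_indices([model_a, model_b], [0.1, 0.5, 0.9])
```

In this toy setup both models agree on most inputs, yet an instance at 0.5 receives a different label depending on which model is selected, which is the situation the abstract calls the discrepancy zone.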

Citations: 0
Empirical Bayes linked matrix decomposition
IF 7.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-07 | DOI: 10.1007/s10994-024-06599-8
Eric F. Lock

Data for several applications in diverse fields can be represented as multiple matrices that are linked across rows or columns. This is particularly common in molecular biomedical research, in which multiple molecular “omics” technologies may capture different feature sets (e.g., corresponding to rows in a matrix) and/or different sample populations (corresponding to columns). This has motivated a large body of work on integrative matrix factorization approaches that identify and decompose low-dimensional signal that is shared across multiple matrices or specific to a given matrix. We propose an empirical variational Bayesian approach to this problem that has several advantages over existing techniques, including the flexibility to accommodate shared signal over any number of row or column sets (i.e., bidimensional integration), an intuitive model-based objective function that yields appropriate shrinkage for the inferred signals, and a relatively efficient estimation algorithm with no tuning parameters. A general result establishes conditions for the uniqueness of the underlying decomposition for a broad family of methods that includes the proposed approach. For scenarios with missing data, we describe an associated iterative imputation approach that is novel for the single-matrix context and a powerful approach for “blockwise” imputation (in which an entire row or column is missing) in various linked matrix contexts. Extensive simulations show that the method performs very well under different scenarios with respect to recovering underlying low-rank signal, accurately decomposing shared and specific signals, and accurately imputing missing data. The approach is applied to gene expression and miRNA data from breast cancer tissue and normal breast tissue, for which it gives an informative decomposition of variation and outperforms alternative strategies for missing data imputation.

Citations: 0
Learning an adaptive forwarding strategy for mobile wireless networks: resource usage vs. latency
IF 7.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-07 | DOI: 10.1007/s10994-024-06601-3
Victoria Manfredi, Alicia P. Wolfe, Xiaolan Zhang, Bing Wang

Mobile wireless networks present several challenges for any learning system, due to uncertain and variable device movement, a decentralized network architecture, and constraints on network resources. In this work, we use deep reinforcement learning (DRL) to learn a scalable and generalizable forwarding strategy for such networks. We make the following contributions: (i) we use hierarchical RL to design DRL packet agents rather than device agents to capture the packet forwarding decisions that are made over time and improve training efficiency; (ii) we use relational features to ensure generalizability of the learned forwarding strategy to a wide range of network dynamics and enable offline training; and (iii) we incorporate both forwarding goals and network resource considerations into packet decision-making by designing a weighted reward function. Our results show that the forwarding strategy used by our DRL packet agent often achieves a similar delay per packet delivered as the oracle forwarding strategy and almost always outperforms all other strategies (including state-of-the-art strategies) in terms of delay, even on scenarios on which the DRL agent was not trained.
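
The abstract does not give the form of the weighted reward function, so the following is only a sketch of how a single weight could trade latency against resource usage. The function name `packet_reward`, its parameters, and the convex-combination form are all assumptions for illustration.

```python
def packet_reward(delay, resource_use, weight):
    """Hypothetical reward: penalize a weighted mix of delay and resource usage.

    weight = 1.0 considers only delay; weight = 0.0 considers only resource usage.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Higher delay or resource usage yields a lower (more negative) reward.
    return -(weight * delay + (1.0 - weight) * resource_use)
```

With this form, moving `weight` toward 1.0 would favor low-latency forwarding, while moving it toward 0.0 would favor conserving network resources.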

Citations: 0
A generic approach for reproducible model distillation
IF 7.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-05 | DOI: 10.1007/s10994-024-06597-w
Yunzhe Zhou, Peiru Xu, Giles Hooker

Model distillation has been a popular method for producing interpretable machine learning. It uses an interpretable “student” model to mimic the predictions made by the black box “teacher” model. However, when the student model is sensitive to the variability of the data sets used for training, even when keeping the teacher fixed, the corresponding interpretation is not reliable. Existing strategies stabilize model distillation by checking whether a large enough sample of pseudo-data is generated to reliably reproduce student models, but methods to do so have so far been developed separately for each specific class of student model. In this paper, we develop a generic approach for stable model distillation based on a central limit theorem for the estimated fidelity of the student to the teacher. We start with a collection of candidate student models and search for candidates that reasonably agree with the teacher. Then we construct a multiple testing framework to select a sample size such that a consistent student model would be selected under different pseudo samples. We demonstrate the application of our proposed approach on three commonly used intelligible models: decision trees, falling rule lists and symbolic regression. Finally, we conduct simulation experiments on Mammographic Mass and Breast Cancer datasets and illustrate the testing procedure through a theoretical analysis with a Markov process. The code is publicly available at https://github.com/yunzhe-zhou/GenericDistillation.
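
The role of the central limit theorem can be sketched for a single student. The snippet below is not the authors' multiple testing framework; it assumes fidelity is the fraction of pseudo-samples on which student and teacher agree, and builds a normal-approximation confidence interval for it. The function name `fidelity_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def fidelity_interval(teacher_preds, student_preds, confidence=0.95):
    """Estimate student-teacher fidelity with a normal-approximation interval.

    Fidelity is assumed here to be the agreement rate on the pseudo-sample.
    """
    n = len(teacher_preds)
    agreements = sum(int(t == s) for t, s in zip(teacher_preds, student_preds))
    fidelity = agreements / n
    # Standard error of a sample proportion; shrinks as 1 / sqrt(n).
    std_error = math.sqrt(fidelity * (1.0 - fidelity) / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return fidelity, (fidelity - z * std_error, fidelity + z * std_error)
```

Because the interval width shrinks with the square root of the pseudo-sample size, a larger pseudo-sample makes it easier to tell two candidate students apart, which is the intuition behind choosing a sample size that yields a stable selection.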

Citations: 0
Autoreplicative random forests with applications to missing value imputation
IF 7.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-01 | DOI: 10.1007/s10994-024-06584-1
Ekaterina Antonenko, Ander Carreño, Jesse Read

Missing values are a common problem in data science and machine learning. Removing instances with missing values is a straightforward workaround, but this can significantly hinder subsequent data analysis, particularly when features outnumber instances. There are a variety of methodologies proposed in the literature for imputing missing values. Denoising Autoencoders, for example, have been leveraged efficiently for imputation. However, neural network approaches have been relatively less effective on smaller datasets. In this work, we propose Autoreplicative Random Forests (ARF) as a multi-output learning approach, which we introduce in the context of a framework that may impute via either an iterative or procedural process. Experiments on several low- and high-dimensional datasets show that ARF is computationally efficient and exhibits better imputation performance than its competitors, including neural network approaches. In order to provide statistical analysis and mathematical background to the proposed missing value imputation framework, we also propose probabilistic ARFs, where the confidence values are provided over different imputation hypotheses, therefore maximizing the utility of such a framework in a machine-learning pipeline targeting predictive performance.

Citations: 0
Describing group evolution in temporal data using multi-faceted events
IF 7.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-01 | DOI: 10.1007/s10994-024-06600-4
Andrea Failla, Rémy Cazabet, Giulio Rossetti, Salvatore Citraro

Groups—such as clusters of points or communities of nodes—are fundamental when addressing various data mining tasks. In temporal data, the predominant approach for characterizing group evolution has been through the identification of “events”. However, the events usually described in the literature, e.g., shrinks/growths, splits/merges, are often arbitrarily defined, creating a gap between such theoretical/predefined types and real-data group observations. Moving beyond existing taxonomies, we think of events as “archetypes” characterized by a unique combination of quantitative dimensions that we call “facets”. Group dynamics are defined by their position within the facet space, where archetypal events occupy extremities. Thus, rather than enforcing strict event types, our approach can allow for hybrid descriptions of dynamics involving group proximity to multiple archetypes. We apply our framework to evolving groups from several face-to-face interaction datasets, showing it enables richer, more reliable characterization of group dynamics with respect to state-of-the-art methods, especially when the groups are subject to complex relationships. Our approach also offers intuitive solutions to common tasks related to dynamic group analysis, such as choosing an appropriate aggregation scale, quantifying partition stability, and evaluating event quality.

Citations: 0
Neural calibration of hidden inhomogeneous Markov chains: information decompression in life insurance
IF 7.5 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-31 | DOI: 10.1007/s10994-024-06551-w
Mark Kiermayer, Christian Weiß

Markov chains play a key role in a vast number of areas, including life insurance mathematics. Standard actuarial quantities such as the premium value can be interpreted as compressed, lossy information about the underlying Markov process. We introduce a method to reconstruct the underlying Markov chain given collective information of a portfolio of contracts. Our neural architecture characterizes the process in a highly explainable way by explicitly providing one-step transition probabilities. Further, we provide an intrinsic, economic model validation to inspect the quality of the information decompression. Lastly, our methodology is successfully tested on a realistic data set of German term life insurance contracts.
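
The sense in which a premium is compressed, lossy information can be shown with a forward calculation. The sketch below is not the paper's neural method, which works in the reverse direction. It assumes a two-state (alive, dead) inhomogeneous chain, a benefit paid at the end of the year of death, and a constant yearly discount factor; the function name `term_life_premium` is hypothetical.

```python
def term_life_premium(death_probs, benefit, discount):
    """Expected present value of a term life death benefit (single premium).

    death_probs: one-step transition probabilities alive -> dead, one per year.
    benefit: amount paid at the end of the year of death.
    discount: yearly discount factor, e.g. 1 / (1 + interest_rate).
    """
    survival = 1.0  # probability of being alive at the start of the year
    premium = 0.0
    for year, q in enumerate(death_probs, start=1):
        # Die in this year after having survived all previous years.
        premium += survival * q * benefit * discount ** year
        survival *= 1.0 - q
    return premium
```

Different transition probabilities can produce the same premium: with no discounting and a benefit of 100, the sequences `[0.1, 0.2]` and `[0.28]` both give 28. A single premium therefore does not determine the chain, which is why reconstruction needs collective information from many contracts.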

Citations: 0
Integration of multi-modal datasets to estimate human aging
IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-07-29 DOI: 10.1007/s10994-024-06588-x
Rogério Ribeiro, Athos Moraes, Marta Moreno, Pedro G. Ferreira

Aging involves complex biological processes leading to the decline of living organisms. As population lifespan increases worldwide, the importance of identifying factors underlying healthy aging has become critical. Integration of multi-modal datasets is a powerful approach for the analysis of complex biological systems, with the potential to uncover novel aging biomarkers. In this study, we leveraged publicly available epigenomic, transcriptomic and telomere length data along with histological images from the Genotype-Tissue Expression project to build tissue-specific regression models for age prediction. Using data from two tissues, lung and ovary, we aimed to compare model performance across data modalities, as well as to assess the improvement resulting from integrating multiple data types. Our results demonstrate that methylation outperformed the other data modalities, with a mean absolute error of 3.36 and 4.36 in the test sets for lung and ovary, respectively. These models achieved lower error rates when compared with established state-of-the-art tissue-agnostic methylation models, emphasizing the importance of a tissue-specific approach. Additionally, this work has shown how the application of Hierarchical Image Pyramid Transformers for feature extraction significantly enhances age modeling using histological images. Finally, we evaluated the benefits of integrating multiple data modalities into a single model. Combining methylation data with other data modalities only marginally improved performance, likely due to the limited number of available samples. Combining gene expression with histological features yielded more accurate age predictions compared with the individual performance of these data types. Given these results, this study shows how machine learning applications can be extended to multi-modal aging research. Code used is available at https://github.com/zroger49/multi_modal_age_prediction.
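The mean absolute error quoted in the abstract is the average absolute gap between predicted and true ages. A minimal Python sketch of that metric follows; the function name and the sample values are made up for illustration and are not taken from the study or its repository.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Hypothetical chronological ages and model predictions.
    ages = [50, 60, 70]
    predicted = [53, 58, 74]
    print(mean_absolute_error(ages, predicted))
```

A lower value means predictions sit closer to the true ages on average, which is the sense in which the methylation models are reported to outperform the other modalities.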

Citations: 0
Jaccard-constrained dense subgraph discovery
IF 7.5 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-07-23 DOI: 10.1007/s10994-024-06595-y
Chamalee Wickrama Arachchi, Nikolaj Tatti

Finding dense subgraphs is a core problem in graph mining with many applications in diverse domains. At the same time, many real-world networks vary over time, that is, the dataset can be represented as a sequence of graph snapshots. Hence, it is natural to consider the question of finding dense subgraphs in a temporal network that are allowed to vary over time to a certain degree. In this paper, we search for dense subgraphs that have large pairwise Jaccard similarity coefficients. More formally, given a set of graph snapshots and an input parameter \(\alpha\), we find a collection of dense subgraphs, with pairwise Jaccard index at least \(\alpha\), such that the sum of densities of the induced subgraphs is maximized. We prove that this problem is NP-hard and we present a greedy, iterative algorithm which runs in \(\mathcal{O}(nk^2 + m)\) time per single iteration, where k is the length of the graph sequence and n and m denote the number of vertices and the total number of edges, respectively. We also consider an alternative problem where subgraphs with large pairwise Jaccard indices are rewarded. We do this by incorporating the indices directly into the objective function. More formally, given a set of graph snapshots and a weight \(\lambda\), we find a collection of dense subgraphs such that the sum of densities of the induced subgraphs plus the sum of Jaccard indices, weighted by \(\lambda\), is maximized. We prove that this problem is NP-hard. To discover dense subgraphs with good objective value, we present an iterative algorithm which runs in \(\mathcal{O}(n^2k^2 + m \log n + k^3 n)\) time per single iteration, and a greedy algorithm which runs in \(\mathcal{O}(n^2k^2 + m \log n + k^3 n)\) time. We show experimentally that our algorithms are efficient, that they can find the ground truth in synthetic datasets, and that they provide good results on real-world datasets. Finally, we present two case studies that show the usefulness of our problem.
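The two problem formulations in this abstract can be illustrated with a small Python sketch that evaluates a candidate solution: one vertex subset per snapshot. It is not the authors' algorithm, only a checker for the constraint and the objective. Two things are assumptions here because the abstract does not spell them out: density is taken as the number of induced edges divided by the number of vertices, and two empty sets are given Jaccard index 1. All function names are hypothetical.

```python
from itertools import combinations


def density(edges, nodes):
    """Induced edge count divided by vertex count (assumed density definition)."""
    if not nodes:
        return 0.0
    inside = sum(1 for u, v in edges if u in nodes and v in nodes)
    return inside / len(nodes)


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def satisfies_constraint(subsets, alpha):
    """First formulation: every pair of subsets has Jaccard index >= alpha."""
    return all(jaccard(a, b) >= alpha for a, b in combinations(subsets, 2))


def weighted_objective(snapshots, subsets, lam):
    """Second formulation: sum of densities plus lam times sum of pairwise Jaccard indices."""
    dens = sum(density(e, s) for e, s in zip(snapshots, subsets))
    sim = sum(jaccard(a, b) for a, b in combinations(subsets, 2))
    return dens + lam * sim


if __name__ == "__main__":
    snapshots = [[(1, 2), (2, 3), (1, 3)], [(1, 2), (2, 3), (3, 4)]]
    subsets = [{1, 2, 3}, {1, 2, 3, 4}]
    print(satisfies_constraint(subsets, 0.7), weighted_objective(snapshots, subsets, 2.0))
```

The pairwise loop over k subsets is what makes the similarity terms scale quadratically in the number of snapshots, which matches the k^2 factors in the stated running times.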

Citations: 0