Neural calibration of hidden inhomogeneous Markov chains: information decompression in life insurance
Pub Date: 2024-07-31; DOI: 10.1007/s10994-024-06551-w
Mark Kiermayer, Christian Weiß
Markov chains play a key role in a vast number of areas, including life insurance mathematics. Standard actuarial quantities such as the premium value can be interpreted as compressed, lossy information about the underlying Markov process. We introduce a method to reconstruct the underlying Markov chain given collective information about a portfolio of contracts. Our neural architecture characterizes the process in a highly explainable way by explicitly providing one-step transition probabilities. Further, we provide an intrinsic, economic model validation to inspect the quality of the information decompression. Lastly, our methodology is successfully tested on a realistic data set of German term life insurance contracts.
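The following is a minimal, hedged sketch of the kind of network described above: a small model that maps contract features and a time index to one-step transition probabilities of an inhomogeneous Markov chain. The state space, layer sizes, and feature names are illustrative assumptions, not the authors' architecture.

```python
# Illustrative sketch (not the paper's model): a network whose output is an explicit
# one-step transition matrix per contract and time step, as in the abstract above.
import torch
import torch.nn as nn

N_STATES = 3  # hypothetical state space: active / lapsed / dead

class TransitionNet(nn.Module):
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, N_STATES * N_STATES),
        )

    def forward(self, features: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Concatenating (normalised) time with the contract features makes the chain inhomogeneous.
        x = torch.cat([features, t.unsqueeze(-1)], dim=-1)
        logits = self.body(x).view(-1, N_STATES, N_STATES)
        # Softmax over the target state turns each row into a valid probability distribution.
        return torch.softmax(logits, dim=-1)

# Usage: transition matrices for a batch of contracts at policy year t = 5.
net = TransitionNet(n_features=4)
feats = torch.randn(8, 4)          # e.g. age, sum insured, premium, duration (assumed features)
P = net(feats, torch.full((8,), 5.0))
print(P.shape, P.sum(dim=-1))      # (8, 3, 3); each row sums to 1
```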
{"title":"Neural calibration of hidden inhomogeneous Markov chains: information decompression in life insurance","authors":"Mark Kiermayer, Christian Weiß","doi":"10.1007/s10994-024-06551-w","DOIUrl":"https://doi.org/10.1007/s10994-024-06551-w","url":null,"abstract":"<p>Markov chains play a key role in a vast number of areas, including life insurance mathematics. Standard actuarial quantities as the premium value can be interpreted as compressed, lossy information about the underlying Markov process. We introduce a method to reconstruct the underlying Markov chain given collective information of a portfolio of contracts. Our neural architecture characterizes the process in a highly explainable way by explicitly providing one-step transition probabilities. Further, we provide an intrinsic, economic model validation to inspect the quality of the information decompression. Lastly, our methodology is successfully tested for a realistic data set of German term life insurance contracts.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integration of multi-modal datasets to estimate human aging
Pub Date: 2024-07-29; DOI: 10.1007/s10994-024-06588-x
Rogério Ribeiro, Athos Moraes, Marta Moreno, Pedro G. Ferreira
Aging involves complex biological processes leading to the decline of living organisms. As population lifespan increases worldwide, identifying the factors underlying healthy aging has become critical. Integration of multi-modal datasets is a powerful approach for the analysis of complex biological systems, with the potential to uncover novel aging biomarkers. In this study, we leveraged publicly available epigenomic, transcriptomic and telomere length data along with histological images from the Genotype-Tissue Expression project to build tissue-specific regression models for age prediction. Using data from two tissues, lung and ovary, we aimed to compare model performance across data modalities, as well as to assess the improvement resulting from integrating multiple data types. Our results demonstrate that methylation outperformed the other data modalities, with mean absolute errors of 3.36 and 4.36 on the test sets for lung and ovary, respectively. These models achieved lower error rates than established state-of-the-art tissue-agnostic methylation models, emphasizing the importance of a tissue-specific approach. Additionally, this work shows how the application of Hierarchical Image Pyramid Transformers for feature extraction significantly enhances age modeling from histological images. Finally, we evaluated the benefits of integrating multiple data modalities into a single model. Combining methylation data with other data modalities only marginally improved performance, likely due to the limited number of available samples. Combining gene expression with histological features yielded more accurate age predictions than either data type alone. Given these results, this study shows how machine learning applications can be extended to multi-modal aging research. Code is available at https://github.com/zroger49/multi_modal_age_prediction.
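As a hedged illustration of the tissue-specific methylation models evaluated above, the sketch below fits an elastic-net regression of age on methylation features and reports the test-set mean absolute error. The synthetic beta-value matrix stands in for the GTEx data, which is not bundled here; all sizes are assumptions.

```python
# Minimal sketch: age regression on methylation beta values, evaluated by MAE.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_cpgs = 200, 500                        # assumed sizes for illustration
X = rng.uniform(0, 1, size=(n_samples, n_cpgs))      # methylation beta values in [0, 1]
age = 20 + 60 * X[:, :10].mean(axis=1) + rng.normal(0, 3, n_samples)  # toy age signal

X_tr, X_te, y_tr, y_te = train_test_split(X, age, test_size=0.25, random_state=0)
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X_tr, y_tr)
print("test MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```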
{"title":"Integration of multi-modal datasets to estimate human aging","authors":"Rogério Ribeiro, Athos Moraes, Marta Moreno, Pedro G. Ferreira","doi":"10.1007/s10994-024-06588-x","DOIUrl":"https://doi.org/10.1007/s10994-024-06588-x","url":null,"abstract":"<p>Aging involves complex biological processes leading to the decline of living organisms. As population lifespan increases worldwide, the importance of identifying factors underlying healthy aging has become critical. Integration of multi-modal datasets is a powerful approach for the analysis of complex biological systems, with the potential to uncover novel aging biomarkers. In this study, we leveraged publicly available epigenomic, transcriptomic and telomere length data along with histological images from the Genotype-Tissue Expression project to build tissue-specific regression models for age prediction. Using data from two tissues, lung and ovary, we aimed to compare model performance across data modalities, as well as to assess the improvement resulting from integrating multiple data types. Our results demostrate that methylation outperformed the other data modalities, with a mean absolute error of 3.36 and 4.36 in the test sets for lung and ovary, respectively. These models achieved lower error rates when compared with established state-of-the-art tissue-agnostic methylation models, emphasizing the importance of a tissue-specific approach. Additionally, this work has shown how the application of Hierarchical Image Pyramid Transformers for feature extraction significantly enhances age modeling using histological images. Finally, we evaluated the benefits of integrating multiple data modalities into a single model. Combining methylation data with other data modalities only marginally improved performance likely due to the limited number of available samples. Combining gene expression with histological features yielded more accurate age predictions compared with the individual performance of these data types. Given these results, this study shows how machine learning applications can be extended to/in multi-modal aging research. Code used is available at https://github.com/zroger49/multi_modal_age_prediction.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jaccard-constrained dense subgraph discovery
Pub Date: 2024-07-23; DOI: 10.1007/s10994-024-06595-y
Chamalee Wickrama Arachchi, Nikolaj Tatti
Finding dense subgraphs is a core problem in graph mining with many applications in diverse domains. At the same time, many real-world networks vary over time, that is, the dataset can be represented as a sequence of graph snapshots. Hence, it is natural to consider the question of finding dense subgraphs in a temporal network that are allowed to vary over time to a certain degree. In this paper, we search for dense subgraphs that have large pairwise Jaccard similarity coefficients. More formally, given a set of graph snapshots and an input parameter $\alpha$, we find a collection of dense subgraphs, with pairwise Jaccard index at least $\alpha$, such that the sum of densities of the induced subgraphs is maximized. We prove that this problem is NP-hard and we present a greedy, iterative algorithm which runs in $\mathcal{O}(nk^2 + m)$ time per single iteration, where $k$ is the length of the graph sequence and $n$ and $m$ denote the number of vertices and the total number of edges, respectively. We also consider an alternative problem where subgraphs with large pairwise Jaccard indices are rewarded. We do this by incorporating the indices directly into the objective function. More formally, given a set of graph snapshots and a weight $\lambda$, we find a collection of dense subgraphs such that the sum of densities of the induced subgraphs plus the sum of Jaccard indices, weighted by $\lambda$, is maximized. We prove that this problem is NP-hard. To discover dense subgraphs with good objective value, we present an iterative algorithm which runs in $\mathcal{O}(n^2k^2 + m \log n + k^3 n)$ time per single iteration, and a greedy algorithm which runs in $\mathcal{O}(n^2k^2 + m \log n + k^3 n)$ time. We show experimentally that our algorithms are efficient, that they can recover the ground truth in synthetic datasets, and that they provide good results on real-world datasets. Finally, we present two case studies that show the usefulness of our problem.
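To make the two problem variants concrete, here is a minimal sketch (not the paper's algorithm) of the quantities involved: per-snapshot induced-subgraph density, pairwise Jaccard indices, and the $\lambda$-weighted objective. The snapshots and vertex subsets are toy examples.

```python
# Sketch of the objective pieces for Jaccard-constrained / Jaccard-weighted dense subgraphs.
import itertools
import networkx as nx

def density(G: nx.Graph, S: set) -> float:
    """Density |E(S)| / |S| of the subgraph induced by vertex set S."""
    return G.subgraph(S).number_of_edges() / len(S) if S else 0.0

def jaccard(A: set, B: set) -> float:
    return len(A & B) / len(A | B) if A | B else 1.0

def objective(snapshots, subsets, lam: float) -> float:
    """Sum of induced densities plus lambda times the sum of pairwise Jaccard indices."""
    dens = sum(density(G, S) for G, S in zip(snapshots, subsets))
    jac = sum(jaccard(A, B) for A, B in itertools.combinations(subsets, 2))
    return dens + lam * jac

# Toy example with k = 2 snapshots.
G1 = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])
G2 = nx.Graph([(1, 2), (2, 3), (2, 4), (3, 4)])
S1, S2 = {1, 2, 3}, {2, 3, 4}
print("pairwise Jaccard:", jaccard(S1, S2))               # 0.5
print("objective (lambda=1):", objective([G1, G2], [S1, S2], lam=1.0))
```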
{"title":"Jaccard-constrained dense subgraph discovery","authors":"Chamalee Wickrama Arachchi, Nikolaj Tatti","doi":"10.1007/s10994-024-06595-y","DOIUrl":"https://doi.org/10.1007/s10994-024-06595-y","url":null,"abstract":"<p>Finding dense subgraphs is a core problem in graph mining with many applications in diverse domains. At the same time many real-world networks vary over time, that is, the dataset can be represented as a sequence of graph snapshots. Hence, it is natural to consider the question of finding dense subgraphs in a temporal network that are allowed to vary over time to a certain degree. In this paper, we search for dense subgraphs that have large pairwise Jaccard similarity coefficients. More formally, given a set of graph snapshots and input parameter <span>(alpha)</span>, we find a collection of dense subgraphs, with pairwise Jaccard index at least <span>(alpha)</span>, such that the sum of densities of the induced subgraphs is maximized. We prove that this problem is <b>NP</b>-hard and we present a greedy, iterative algorithm which runs in <span>({mathcal {O}} mathopen {} left( nk^2 + mright))</span> time per single iteration, where <i>k</i> is the length of the graph sequence and <i>n</i> and <i>m</i> denote number of vertices and total number of edges respectively. We also consider an alternative problem where subgraphs with large pairwise Jaccard indices are rewarded. We do this by incorporating the indices directly into the objective function. More formally, given a set of graph snapshots and a weight <span>(lambda)</span>, we find a collection of dense subgraphs such that the sum of densities of the induced subgraphs plus the sum of Jaccard indices, weighted by <span>(lambda)</span>, is maximized. We prove that this problem is <b>NP</b>-hard. To discover dense subgraphs with good objective value, we present an iterative algorithm which runs in <span>({mathcal {O}} mathopen {}left( n^2k^2 + m log n + k^3 nright))</span> time per single iteration, and a greedy algorithm which runs in <span>({mathcal {O}} mathopen {}left( n^2k^2 + m log n + k^3 nright))</span> time. We show experimentally that our algorithms are efficient, they can find ground truth in synthetic datasets and provide good results from real-world datasets. Finally, we present two case studies that show the usefulness of our problem.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distribution-free conformal joint prediction regions for neural marked temporal point processes
Pub Date: 2024-07-23; DOI: 10.1007/s10994-024-06594-z
Victor Dheur, Tanguy Bosser, Rafael Izbicki, Souhaib Ben Taieb
Sequences of labeled events observed at irregular intervals in continuous time are ubiquitous across various fields. Temporal Point Processes (TPPs) provide a mathematical framework for modeling these sequences, enabling inferences such as predicting the arrival time of future events and their associated label, called the mark. However, due to model misspecification or lack of training data, these probabilistic models may provide a poor approximation of the true, unknown underlying process, and prediction regions extracted from them may be unreliable estimates of the underlying uncertainty. This paper develops more reliable methods for uncertainty quantification in neural TPP models via the framework of conformal prediction. A primary objective is to generate a distribution-free joint prediction region for an event’s arrival time and mark, with a finite-sample marginal coverage guarantee. A key challenge is to handle both a strictly positive, continuous response and a categorical response, without distributional assumptions. We first consider a simple but overly conservative approach that combines individual prediction regions for the event’s arrival time and mark. Then, we introduce a more effective method based on bivariate highest density regions derived from the joint predictive density of arrival times and marks. By leveraging the dependencies between these two variables, this method excludes unlikely combinations of the two, resulting in sharper prediction regions while still attaining the pre-specified coverage level. We also explore the generation of individual univariate prediction regions for events’ arrival times and marks through conformal regression and classification techniques. Moreover, we evaluate the stronger notion of conditional coverage. Finally, through extensive experimentation on both simulated and real-world datasets, we assess the validity and efficiency of these methods.
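Below is a hedged sketch of the simple, conservative baseline mentioned above: split-conformal prediction regions for the inter-arrival time and the mark, combined with a Bonferroni-style split of the miscoverage level (the split and the nonconformity scores are my assumptions for illustration, and the model outputs are placeholders rather than a neural TPP).

```python
# Sketch: combine a conformal time interval and a conformal mark set at level alpha/2 each,
# so the joint region has marginal coverage at least 1 - alpha.
import numpy as np

def conformal_quantile(scores: np.ndarray, alpha: float) -> float:
    """Finite-sample-corrected split-conformal quantile of calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, n) - 1]

rng = np.random.default_rng(1)
alpha = 0.2

# Calibration nonconformity scores (assumed): |log tau - log tau_hat| for arrival times,
# and 1 - p_model(true mark) for marks.
time_scores = np.abs(rng.normal(0.0, 0.5, size=500))
mark_scores = rng.uniform(0.0, 1.0, size=500)

q_time = conformal_quantile(time_scores, alpha / 2)
q_mark = conformal_quantile(mark_scores, alpha / 2)

# For a new event with predicted log inter-arrival time and predicted mark probabilities:
log_tau_hat = 0.3
mark_probs = {"A": 0.6, "B": 0.3, "C": 0.1}
time_interval = (np.exp(log_tau_hat - q_time), np.exp(log_tau_hat + q_time))
mark_set = {m for m, p in mark_probs.items() if 1 - p <= q_mark}
print("time interval:", time_interval, "mark set:", mark_set)
```

The paper's sharper regions come from replacing this product of univariate regions with bivariate highest density regions of the joint predictive density; the sketch only shows the conservative combination.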
{"title":"Distribution-free conformal joint prediction regions for neural marked temporal point processes","authors":"Victor Dheur, Tanguy Bosser, Rafael Izbicki, Souhaib Ben Taieb","doi":"10.1007/s10994-024-06594-z","DOIUrl":"https://doi.org/10.1007/s10994-024-06594-z","url":null,"abstract":"<p>Sequences of labeled events observed at irregular intervals in continuous time are ubiquitous across various fields. Temporal Point Processes (TPPs) provide a mathematical framework for modeling these sequences, enabling inferences such as predicting the arrival time of future events and their associated label, called mark. However, due to model misspecification or lack of training data, these probabilistic models may provide a poor approximation of the true, unknown underlying process, with prediction regions extracted from them being unreliable estimates of the underlying uncertainty. This paper develops more reliable methods for uncertainty quantification in neural TPP models via the framework of conformal prediction. A primary objective is to generate a distribution-free joint prediction region for an event’s arrival time and mark, with a finite-sample marginal coverage guarantee. A key challenge is to handle both a strictly positive, continuous response and a categorical response, without distributional assumptions. We first consider a simple but overly conservative approach that combines individual prediction regions for the event’s arrival time and mark. Then, we introduce a more effective method based on bivariate highest density regions derived from the joint predictive density of arrival times and marks. By leveraging the dependencies between these two variables, this method excludes unlikely combinations of the two, resulting in sharper prediction regions while still attaining the pre-specified coverage level. We also explore the generation of individual univariate prediction regions for events’ arrival times and marks through conformal regression and classification techniques. Moreover, we evaluate the stronger notion of conditional coverage. Finally, through extensive experimentation on both simulated and real-world datasets, we assess the validity and efficiency of these methods.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extrapolation is not the same as interpolation
Pub Date: 2024-07-23; DOI: 10.1007/s10994-024-06591-2
Yuxuan Wang, Ross D. King
We propose a new machine learning formulation designed specifically for extrapolation. The textbook way to apply machine learning to drug design is to learn a univariate function that, given a drug (structure) as input, outputs a real number (the activity): $f(\text{drug}) \rightarrow \text{activity}$. However, experience in real-world drug design suggests that this formulation of the drug design problem is not quite correct. Specifically, what one is really interested in is extrapolation: predicting the activity of new drugs with higher activity than any existing ones. Our new formulation for extrapolation is based on learning a bivariate function that predicts the difference in activities of two drugs, $F(\text{drug}_1, \text{drug}_2) \rightarrow \text{difference in activity}$, followed by the use of ranking algorithms. This formulation is general and agnostic, suitable for finding samples with target values beyond the target value range of the training set. We applied the formulation with support vector machines, random forests, and gradient boosting machines. We compared the formulation with standard regression on thousands of drug design datasets, gene expression datasets and material property datasets. The test-set extrapolation metric was the identification of examples with greater values than any in the training set, and of top-performing examples (within the top 10% of the whole dataset). On this metric our pairwise formulation vastly outperformed standard regression. Its proposed variations also showed consistent outperformance. Its application to the stock selection problem further confirmed the advantage of this pairwise formulation.
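The following is an illustrative sketch of one reading of the pairwise formulation, using a random forest (one of the base learners mentioned): train on concatenated feature pairs with the activity difference as target, then rank unseen candidates by their predicted improvement over the best training compound. The data, pairing scheme, and ranking rule are assumptions, not the paper's exact protocol.

```python
# Pairwise extrapolation sketch: F(drug1, drug2) -> difference in activity, then ranking.
import numpy as np
from itertools import permutations
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_train, n_feat = 60, 8
X = rng.normal(size=(n_train, n_feat))
y = X[:, 0] * 2.0 + rng.normal(0, 0.1, n_train)          # toy "activity"

# Build the pairwise training set: features of (i, j) -> y_i - y_j.
pairs = list(permutations(range(n_train), 2))
Xp = np.array([np.concatenate([X[i], X[j]]) for i, j in pairs])
yp = np.array([y[i] - y[j] for i, j in pairs])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xp, yp)

# Rank unseen candidates by predicted improvement over the best training compound.
best = X[np.argmax(y)]
candidates = rng.normal(size=(10, n_feat)) + np.array([2.0] + [0.0] * (n_feat - 1))
improvement = model.predict(np.hstack([candidates, np.tile(best, (10, 1))]))
ranking = np.argsort(-improvement)
print("predicted improvement of top candidate:", improvement[ranking[0]])
```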
{"title":"Extrapolation is not the same as interpolation","authors":"Yuxuan Wang, Ross D. King","doi":"10.1007/s10994-024-06591-2","DOIUrl":"https://doi.org/10.1007/s10994-024-06591-2","url":null,"abstract":"<p>We propose a new machine learning formulation designed specifically for extrapolation. The textbook way to apply machine learning to drug design is to learn a univariate function that when a drug (structure) is input, the function outputs a real number (the activity): <i>f</i>(drug) <span>(rightarrow)</span> activity. However, experience in real-world drug design suggests that this formulation of the drug design problem is not quite correct. Specifically, what one is really interested in is extrapolation: predicting the activity of new drugs with higher activity than any existing ones. Our new formulation for extrapolation is based on learning a bivariate function that predicts the difference in activities of two drugs <i>F</i>(drug1, drug2) <span>(rightarrow)</span> difference in activity, followed by the use of ranking algorithms. This formulation is general and agnostic, suitable for finding samples with target values beyond the target value range of the training set. We applied the formulation to work with support vector machines , random forests , and Gradient Boosting Machines . We compared the formulation with standard regression on thousands of drug design datasets, gene expression datasets and material property datasets. The test set extrapolation metric was the identification of examples with greater values than the training set, and top-performing examples (within the top 10% of the whole dataset). On this metric our pairwise formulation vastly outperformed standard regression. Its proposed variations also showed a consistent outperformance. Its application in the stock selection problem further confirmed the advantage of this pairwise formulation.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards efficient AutoML: a pipeline synthesis approach leveraging pre-trained transformers for multimodal data
Pub Date: 2024-07-19; DOI: 10.1007/s10994-024-06568-1
Ambarish Moharil, Joaquin Vanschoren, Prabhant Singh, Damian Tamburri
This paper introduces an Automated Machine Learning (AutoML) framework specifically designed to efficiently synthesize end-to-end multimodal machine learning pipelines. Traditional reliance on the computationally demanding Neural Architecture Search is minimized through the strategic integration of pre-trained transformer models. This innovative approach enables the effective unification of diverse data modalities into high-dimensional embeddings, streamlining the pipeline development process. We leverage an advanced Bayesian Optimization strategy, informed by meta-learning, to facilitate the warm-starting of the pipeline synthesis, thereby enhancing computational efficiency. Our methodology demonstrates its potential to create advanced and custom multimodal pipelines within limited computational resources. Extensive testing across 23 varied multimodal datasets indicates the promise and utility of our framework in diverse scenarios. The results contribute to the ongoing efforts in the AutoML field, suggesting new possibilities for efficiently handling complex multimodal data. This research represents a step towards developing more efficient and versatile tools in multimodal machine learning pipeline development, acknowledging the collaborative and ever-evolving nature of this field.
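A hedged sketch of two ingredients described above: (i) unifying modalities as fixed embeddings from frozen pre-trained encoders (stubbed here with random projections), and (ii) warm-starting Bayesian optimisation of the downstream pipeline from configurations a meta-learner would suggest. The search space, warm-start points, and stub encoders are illustrative assumptions, not the paper's actual setup.

```python
# Sketch: concatenate per-modality embeddings, then run warm-started Bayesian optimisation
# (scikit-optimize's gp_minimize with x0) over a downstream hyperparameter.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real

rng = np.random.default_rng(0)

def embed_text(rows):    # placeholder for a frozen pre-trained language model
    return rng.normal(size=(len(rows), 64))

def embed_image(rows):   # placeholder for a frozen pre-trained vision transformer
    return rng.normal(size=(len(rows), 64))

n = 300
X = np.hstack([embed_text(range(n)), embed_image(range(n))])   # unified embedding
y = (X[:, 0] + X[:, 64] > 0).astype(int)

def objective(params):
    (log_C,) = params
    clf = LogisticRegression(C=10 ** log_C, max_iter=500)
    return -cross_val_score(clf, X, y, cv=3).mean()            # minimise negative accuracy

warm_starts = [[-1.0], [0.0], [1.0]]   # hypothetical meta-learned configurations
res = gp_minimize(objective, [Real(-3, 3, name="log_C")],
                  x0=warm_starts, n_initial_points=5, n_calls=15, random_state=0)
print("best log10(C):", res.x[0], "best CV accuracy:", -res.fun)
```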
{"title":"Towards efficient AutoML: a pipeline synthesis approach leveraging pre-trained transformers for multimodal data","authors":"Ambarish Moharil, Joaquin Vanschoren, Prabhant Singh, Damian Tamburri","doi":"10.1007/s10994-024-06568-1","DOIUrl":"https://doi.org/10.1007/s10994-024-06568-1","url":null,"abstract":"<p>This paper introduces an Automated Machine Learning (AutoML) framework specifically designed to efficiently synthesize end-to-end multimodal machine learning pipelines. Traditional reliance on the computationally demanding Neural Architecture Search is minimized through the strategic integration of pre-trained transformer models. This innovative approach enables the effective unification of diverse data modalities into high-dimensional embeddings, streamlining the pipeline development process. We leverage an advanced Bayesian Optimization strategy, informed by meta-learning, to facilitate the warm-starting of the pipeline synthesis, thereby enhancing computational efficiency. Our methodology demonstrates its potential to create advanced and custom multimodal pipelines within limited computational resources. Extensive testing across 23 varied multimodal datasets indicates the promise and utility of our framework in diverse scenarios. The results contribute to the ongoing efforts in the AutoML field, suggesting new possibilities for efficiently handling complex multimodal data. This research represents a step towards developing more efficient and versatile tools in multimodal machine learning pipeline development, acknowledging the collaborative and ever-evolving nature of this field.\u0000</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141746054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ICM ensemble with novel betting functions for concept drift
Pub Date: 2024-07-17; DOI: 10.1007/s10994-024-06593-0
Charalambos Eliades, Harris Papadopoulos
This study builds upon our previous work by introducing a refined Inductive Conformal Martingale (ICM) approach for addressing Concept Drift. Specifically, we enhance our previously proposed CAUTIOUS betting function to incorporate multiple density estimators for improving detection ability. We also combine this betting function with two base estimators that have not been previously utilized within the ICM framework: the Interpolated Histogram and Nearest Neighbor Density Estimators. We assess these extensions using both a single ICM and an ensemble of ICMs. For the latter, we conduct a comprehensive experimental investigation into the influence of the ensemble size on prediction accuracy and the number of available predictions. Our experimental results on four benchmark datasets demonstrate that the proposed approach surpasses our previous methodology in terms of performance while matching or in many cases exceeding that of three contemporary state-of-the-art techniques.
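For orientation, here is a minimal sketch of a conformal-martingale drift detector of the general kind discussed above. It uses a simple power betting function rather than the paper's CAUTIOUS function, a nearest-neighbour nonconformity score, and illustrative data and thresholds.

```python
# Inductive conformal martingale sketch: p-values from a fixed calibration set,
# martingale updated with a power betting function, alarm when it exceeds a threshold.
import numpy as np

rng = np.random.default_rng(0)

def p_value(score: float, calib_scores: np.ndarray) -> float:
    """Smoothed conformal p-value of a new nonconformity score."""
    n = len(calib_scores)
    greater = np.sum(calib_scores > score)
    equal = np.sum(calib_scores == score)
    return (greater + rng.uniform() * (equal + 1)) / (n + 1)

def nn_score(x: float, train: np.ndarray) -> float:
    """Nonconformity = distance to the nearest training example (1-D for simplicity)."""
    return np.min(np.abs(train - x))

train = rng.normal(0, 1, 200)                 # proper training set of the inductive CM
calib = np.array([nn_score(x, train) for x in rng.normal(0, 1, 200)])

eps = 0.5                                     # power betting function f(p) = eps * p**(eps - 1)
log_M, alarm_at = 0.0, None
stream = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])  # drift at t = 100
for t, x in enumerate(stream):
    p = p_value(nn_score(x, train), calib)
    log_M += np.log(eps * p ** (eps - 1))     # martingale update
    if alarm_at is None and log_M > np.log(100):   # alarm threshold of 100 (assumed)
        alarm_at = t
print("drift alarm raised at t =", alarm_at)
```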
{"title":"ICM ensemble with novel betting functions for concept drift","authors":"Charalambos Eliades, Harris Papadopoulos","doi":"10.1007/s10994-024-06593-0","DOIUrl":"https://doi.org/10.1007/s10994-024-06593-0","url":null,"abstract":"<p>This study builds upon our previous work by introducing a refined Inductive Conformal Martingale (ICM) approach for addressing Concept Drift. Specifically, we enhance our previously proposed CAUTIOUS betting function to incorporate multiple density estimators for improving detection ability. We also combine this betting function with two base estimators that have not been previously utilized within the ICM framework: the Interpolated Histogram and Nearest Neighbor Density Estimators. We assess these extensions using both a single ICM and an ensemble of ICMs. For the latter, we conduct a comprehensive experimental investigation into the influence of the ensemble size on prediction accuracy and the number of available predictions. Our experimental results on four benchmark datasets demonstrate that the proposed approach surpasses our previous methodology in terms of performance while matching or in many cases exceeding that of three contemporary state-of-the-art techniques.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141740287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Variable selection for both outcomes and predictors: sparse multivariate principal covariates regression
Pub Date: 2024-07-17; DOI: 10.1007/s10994-024-06520-3
Soogeun Park, Eva Ceulemans, Katrijn Van Deun
Datasets comprising large sets of both predictor and outcome variables are becoming more widely used in research. In addition to the well-known problems of model complexity and predictor variable selection, predictive modelling with such large data also presents a relatively novel and under-studied challenge: outcome variable selection. Certain outcome variables in the data may not be adequately predicted by the given sets of predictors. In this paper, we propose the method of Sparse Multivariate Principal Covariates Regression, which addresses these issues altogether by expanding the Principal Covariates Regression model to incorporate sparsity penalties on both the predictor and the outcome variables. Our method is one of the first to perform variable selection for both predictors and outcomes simultaneously. Moreover, by relying on summary variables that explain the variance in both predictor and outcome variables, the method offers a sparse and succinct model representation of the data. In a simulation study, the method outperformed methods with similar aims, such as sparse Partial Least Squares, in predicting the outcome variables and recovering the population parameters. Lastly, we applied the method to an empirical dataset to illustrate its use in practice.
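To fix ideas, the display below is a hedged sketch of a sparse PCovR-type objective consistent with the description above; the exact penalty placement, weighting, and normalisation used in the paper may differ.

```latex
% Sketch of a sparse Principal Covariates Regression objective with penalties on both sides.
\min_{W,\, P_X,\, P_Y}\;
  \alpha\,\lVert X - X W P_X^{\top} \rVert_F^2
  \;+\; (1-\alpha)\,\lVert Y - X W P_Y^{\top} \rVert_F^2
  \;+\; \lambda_1 \lVert W \rVert_1
  \;+\; \lambda_2 \lVert P_Y \rVert_1,
\qquad T = XW,
```

where $T$ collects the summary variables (component scores), $W$ the predictor weights and $P_Y$ the outcome loadings; rows of $W$ shrunk to zero drop predictors, and rows of $P_Y$ shrunk to zero drop outcomes, giving simultaneous variable selection on both sides.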
{"title":"Variable selection for both outcomes and predictors: sparse multivariate principal covariates regression","authors":"Soogeun Park, Eva Ceulemans, Katrijn Van Deun","doi":"10.1007/s10994-024-06520-3","DOIUrl":"https://doi.org/10.1007/s10994-024-06520-3","url":null,"abstract":"<p>Datasets comprised of large sets of both predictor and outcome variables are becoming more widely used in research. In addition to the well-known problems of model complexity and predictor variable selection, predictive modelling with such large data also presents a relatively novel and under-studied challenge of outcome variable selection. Certain outcome variables in the data may not be adequately predicted by the given sets of predictors. In this paper, we propose the method of Sparse Multivariate Principal Covariates Regression that addresses these issues altogether by expanding the Principal Covariates Regression model to incorporate sparsity penalties on both of predictor and outcome variables. Our method is one of the first methods that perform variable selection for both predictors and outcomes simultaneously. Moreover, by relying on summary variables that explain the variance in both predictor and outcome variables, the method offers a sparse and succinct model representation of the data. In a simulation study, the method performed better than methods with similar aims such as sparse Partial Least Squares at prediction of the outcome variables and recovery of the population parameters. Lastly, we administered the method on an empirical dataset to illustrate its application in practice.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141740289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Methodology and evaluation in sports analytics: challenges, approaches, and lessons learned
Pub Date: 2024-07-17; DOI: 10.1007/s10994-024-06585-0
Jesse Davis, Lotte Bransen, Laurens Devos, Arne Jaspers, Wannes Meert, Pieter Robberechts, Jan Van Haaren, Maaike Van Roy
There has been an explosion of data collected about sports. Because such data is extremely rich and complex, machine learning is increasingly being used to extract actionable insights from it. Typically, machine learning is used to build models and indicators that capture the skills, capabilities, and tendencies of athletes and teams. Such indicators and models are in turn used to inform decision-making at professional clubs. Designing these indicators requires paying careful attention to a number of subtle issues from a methodological and evaluation perspective. In this paper, we highlight these challenges in sports and discuss a variety of approaches for handling them. Methodologically, we highlight that dependencies affect how to perform data partitioning for evaluation as well as the need to consider contextual factors. From an evaluation perspective, we draw a distinction between evaluating the developed indicators themselves versus the underlying models that power them. We argue that both aspects must be considered, but that they require different approaches. We hope that this article helps bridge the gap between traditional sports expertise and modern data analytics by providing a structured framework with practical examples.
{"title":"Methodology and evaluation in sports analytics: challenges, approaches, and lessons learned","authors":"Jesse Davis, Lotte Bransen, Laurens Devos, Arne Jaspers, Wannes Meert, Pieter Robberechts, Jan Van Haaren, Maaike Van Roy","doi":"10.1007/s10994-024-06585-0","DOIUrl":"https://doi.org/10.1007/s10994-024-06585-0","url":null,"abstract":"<p>There has been an explosion of data collected about sports. Because such data is extremely rich and complex, machine learning is increasingly being used to extract actionable insights from it. Typically, machine learning is used to build models and indicators that capture the skills, capabilities, and tendencies of athletes and teams. Such indicators and models are in turn used to inform decision-making at professional clubs. Designing these indicators requires paying careful attention to a number of subtle issues from a methodological and evaluation perspective. In this paper, we highlight these challenges in sports and discuss a variety of approaches for handling them. Methodologically, we highlight that dependencies affect how to perform data partitioning for evaluation as well as the need to consider contextual factors. From an evaluation perspective, we draw a distinction between evaluating the developed indicators themselves versus the underlying models that power them. We argue that both aspects must be considered, but that they require different approaches. We hope that this article helps bridge the gap between traditional sports expertise and modern data analytics by providing a structured framework with practical examples.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141746053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatial entropy as an inductive bias for vision transformers
Pub Date: 2024-07-17; DOI: 10.1007/s10994-024-06570-7
Elia Peruzzo, Enver Sangineto, Yahui Liu, Marco De Nadai, Wei Bi, Bruno Lepri, Nicu Sebe
Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT architecture helps reduce the number of samples necessary for training. However, the architecture modifications lead to a loss of generality of the Transformer backbone, partially contradicting the push towards uniform architectures shared, e.g., by the Computer Vision and Natural Language Processing communities. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to a few connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Using extensive experiments, we show that the proposed regularization leads to equivalent or better results than other VT proposals which include a local bias by changing the basic Transformer architecture, and it can drastically boost the final accuracy of VTs when training on small and medium-sized training sets. The code is available at https://github.com/helia95/SAR.
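The following is a hedged sketch of an object-based "spatial entropy" of an attention map, in the spirit of the regularizer described above (the paper's exact formulation may differ): binarise the map, find connected regions, and compute the entropy of the attention mass per region, so that attention concentrated in a few regions gives a low value.

```python
# Sketch: spatial entropy of a 2-D attention map over a ViT patch grid.
import numpy as np
from scipy.ndimage import label

def spatial_entropy(attn: np.ndarray) -> float:
    """attn: 2-D non-negative attention map over the patch grid."""
    a = attn / attn.sum()
    regions, n = label(a > a.mean())          # connected regions of above-average attention
    if n == 0:
        return 0.0
    mass = np.array([a[regions == k].sum() for k in range(1, n + 1)])
    mass = mass / mass.sum()
    return float(-(mass * np.log(mass)).sum())

# A map concentrated in one blob has lower spatial entropy than a scattered one.
blob = np.zeros((14, 14)); blob[3:7, 3:7] = 1.0
scattered = np.random.default_rng(0).uniform(size=(14, 14))
print("blob:", spatial_entropy(blob), "scattered:", spatial_entropy(scattered))
```

In training, a term like this (computed from the class-token attention and made differentiable, e.g., via a soft assignment) would be added to the supervised loss as the auxiliary regularization signal; the hard thresholding above is only for illustration.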
{"title":"Spatial entropy as an inductive bias for vision transformers","authors":"Elia Peruzzo, Enver Sangineto, Yahui Liu, Marco De Nadai, Wei Bi, Bruno Lepri, Nicu Sebe","doi":"10.1007/s10994-024-06570-7","DOIUrl":"https://doi.org/10.1007/s10994-024-06570-7","url":null,"abstract":"<p>Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT <i>architecture</i> helps reducing the number of samples necessary for training. However, the architecture modifications lead to a loss of generality of the Transformer backbone, partially contradicting the push towards the development of uniform architectures, shared, e.g., by both the Computer Vision and the Natural Language Processing areas. In this work, we propose a different and complementary direction, in which a local bias is introduced using <i>an auxiliary self-supervised task</i>, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we <i>explicitly</i> encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to few connected regions, and we propose a spatial formulation of the information entropy to quantify this <i>object-based inductive bias</i>. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Using extensive experiments, we show that the proposed regularization leads to equivalent or better results than other VT proposals which include a local bias by changing the basic Transformer architecture, and it can drastically boost the VT final accuracy when using small-medium training sets. The code is available at https://github.com/helia95/SAR.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141740288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}