In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset, regardless of its specific content, focusing on both textual and numerical data. Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence. At the heart of our approach lies a rigorous demand for both explainability and interpretability, ensuring that the rationale behind the identification and correction of data anomalies is transparent and understandable. To achieve this, we adopt a hybrid approach that integrates statistical methods with machine learning algorithms. Indeed, by leveraging statistical techniques alongside machine learning, we strike a balance between accuracy and explainability, enabling users to trust and comprehend the assessment process. Acknowledging the challenges associated with automating the data quality assessment process, particularly in terms of time efficiency and accuracy, we adopt a pragmatic strategy, employing resource-intensive algorithms only when necessary, while favoring simpler, more efficient solutions whenever possible. Through a practical analysis conducted on a publicly provided dataset, we illustrate the challenges that arise when trying to enhance data quality while preserving explainability. We demonstrate the effectiveness of our approach in detecting and rectifying missing values, duplicates, and typographical errors, as well as the challenges that remain to be addressed to achieve similar accuracy on statistical outliers and logic errors under the constraints set in our work.
{"title":"Towards Explainable Automated Data Quality Enhancement without Domain Knowledge","authors":"Djibril Sarr","doi":"arxiv-2409.10139","DOIUrl":"https://doi.org/arxiv-2409.10139","url":null,"abstract":"In the era of big data, ensuring the quality of datasets has become\u0000increasingly crucial across various domains. We propose a comprehensive\u0000framework designed to automatically assess and rectify data quality issues in\u0000any given dataset, regardless of its specific content, focusing on both textual\u0000and numerical data. Our primary objective is to address three fundamental types\u0000of defects: absence, redundancy, and incoherence. At the heart of our approach\u0000lies a rigorous demand for both explainability and interpretability, ensuring\u0000that the rationale behind the identification and correction of data anomalies\u0000is transparent and understandable. To achieve this, we adopt a hybrid approach\u0000that integrates statistical methods with machine learning algorithms. Indeed,\u0000by leveraging statistical techniques alongside machine learning, we strike a\u0000balance between accuracy and explainability, enabling users to trust and\u0000comprehend the assessment process. Acknowledging the challenges associated with\u0000automating the data quality assessment process, particularly in terms of time\u0000efficiency and accuracy, we adopt a pragmatic strategy, employing\u0000resource-intensive algorithms only when necessary, while favoring simpler, more\u0000efficient solutions whenever possible. Through a practical analysis conducted\u0000on a publicly provided dataset, we illustrate the challenges that arise when\u0000trying to enhance data quality while keeping explainability. We demonstrate the\u0000effectiveness of our approach in detecting and rectifying missing values,\u0000duplicates and typographical errors as well as the challenges remaining to be\u0000addressed to achieve similar accuracy on statistical outliers and logic errors\u0000under the constraints set in our work.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zi-Ming Wang, Nan Xue, Ling Lei, Rebecka Jörnsten, Gui-Song Xia
This paper studies the problem of distribution matching (DM), which is a fundamental machine learning problem seeking to robustly align two probability distributions. Our approach is established on a relaxed formulation, called partial distribution matching (PDM), which seeks to match a fraction of the distributions instead of matching them completely. We theoretically derive the Kantorovich-Rubinstein duality for the partial Wasserstein-1 (PW) discrepancy, and develop a partial Wasserstein adversarial network (PWAN) that efficiently approximates the PW discrepancy based on this dual form. Partial matching can then be achieved by optimizing the network using gradient descent. Two practical tasks, point set registration and partial domain adaptation, are investigated, where the goals are to partially match distributions in 3D space and high-dimensional feature space, respectively. The experimental results confirm that the proposed PWAN effectively produces highly robust matching results, performing better or on par with the state-of-the-art methods.
{"title":"Partial Distribution Matching via Partial Wasserstein Adversarial Networks","authors":"Zi-Ming Wang, Nan Xue, Ling Lei, Rebecka Jörnsten, Gui-Song Xia","doi":"arxiv-2409.10499","DOIUrl":"https://doi.org/arxiv-2409.10499","url":null,"abstract":"This paper studies the problem of distribution matching (DM), which is a\u0000fundamental machine learning problem seeking to robustly align two probability\u0000distributions. Our approach is established on a relaxed formulation, called\u0000partial distribution matching (PDM), which seeks to match a fraction of the\u0000distributions instead of matching them completely. We theoretically derive the\u0000Kantorovich-Rubinstein duality for the partial Wasserstain-1 (PW) discrepancy,\u0000and develop a partial Wasserstein adversarial network (PWAN) that efficiently\u0000approximates the PW discrepancy based on this dual form. Partial matching can\u0000then be achieved by optimizing the network using gradient descent. Two\u0000practical tasks, point set registration and partial domain adaptation are\u0000investigated, where the goals are to partially match distributions in 3D space\u0000and high-dimensional feature space respectively. The experiment results confirm\u0000that the proposed PWAN effectively produces highly robust matching results,\u0000performing better or on par with the state-of-the-art methods.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we provide tight lower bounds for the oracle complexity of minimizing high-order Hölder smooth and uniformly convex functions. Specifically, for a function whose $p^{th}$-order derivatives are Hölder continuous with degree $\nu$ and parameter $H$, and that is uniformly convex with degree $q$ and parameter $\sigma$, we focus on two asymmetric cases: (1) $q > p + \nu$, and (2) $q < p + \nu$. Given up to $p^{th}$-order oracle access, we establish worst-case oracle complexities of $\Omega\left( \left( \frac{H}{\sigma}\right)^{\frac{2}{3(p+\nu)-2}}\left( \frac{\sigma}{\epsilon}\right)^{\frac{2(q-p-\nu)}{q(3(p+\nu)-2)}}\right)$ with a truncated-Gaussian smoothed hard function in the first case and $\Omega\left(\left(\frac{H}{\sigma}\right)^{\frac{2}{3(p+\nu)-2}} + \log^2\left(\frac{\sigma^{p+\nu}}{H^q}\right)^{\frac{1}{p+\nu-q}}\right)$ in the second case, for reaching an $\epsilon$-approximate solution in terms of the optimality gap. Our analysis generalizes previous lower bounds for functions under first- and second-order smoothness as well as those for uniformly convex functions, and furthermore our results match the corresponding upper bounds in the general setting.
{"title":"Tight Lower Bounds under Asymmetric High-Order Hölder Smoothness and Uniform Convexity","authors":"Site Bai, Brian Bullins","doi":"arxiv-2409.10773","DOIUrl":"https://doi.org/arxiv-2409.10773","url":null,"abstract":"In this paper, we provide tight lower bounds for the oracle complexity of\u0000minimizing high-order H\"older smooth and uniformly convex functions.\u0000Specifically, for a function whose $p^{th}$-order derivatives are H\"older\u0000continuous with degree $nu$ and parameter $H$, and that is uniformly convex\u0000with degree $q$ and parameter $sigma$, we focus on two asymmetric cases: (1)\u0000$q > p + nu$, and (2) $q < p+nu$. Given up to $p^{th}$-order oracle access,\u0000we establish worst-case oracle complexities of $Omegaleft( left(\u0000frac{H}{sigma}right)^frac{2}{3(p+nu)-2}left(\u0000frac{sigma}{epsilon}right)^frac{2(q-p-nu)}{q(3(p+nu)-2)}right)$ with a\u0000truncated-Gaussian smoothed hard function in the first case and\u0000$Omegaleft(left(frac{H}{sigma}right)^frac{2}{3(p+nu)-2}+\u0000log^2left(frac{sigma^{p+nu}}{H^q}right)^frac{1}{p+nu-q}right)$ in the\u0000second case, for reaching an $epsilon$-approximate solution in terms of the\u0000optimality gap. Our analysis generalizes previous lower bounds for functions\u0000under first- and second-order smoothness as well as those for uniformly convex\u0000functions, and furthermore our results match the corresponding upper bounds in\u0000the general setting.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"89 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zheng Zhao, Ziwei Luo, Jens Sjölund, Thomas B. Schön
Generative diffusions are a powerful class of Monte Carlo samplers that leverage bridging Markov processes to approximate complex, high-dimensional distributions, such as those found in image processing and language models. Despite their success in these domains, an important open challenge remains: extending these techniques to sample from conditional distributions, as required in, for example, Bayesian inverse problems. In this paper, we present a comprehensive review of existing computational approaches to conditional sampling within generative diffusion models. Specifically, we highlight key methodologies that either utilise the joint distribution, or rely on (pre-trained) marginal distributions with explicit likelihoods, to construct conditional generative samplers.
{"title":"Conditional sampling within generative diffusion models","authors":"Zheng Zhao, Ziwei Luo, Jens Sjölund, Thomas B. Schön","doi":"arxiv-2409.09650","DOIUrl":"https://doi.org/arxiv-2409.09650","url":null,"abstract":"Generative diffusions are a powerful class of Monte Carlo samplers that\u0000leverage bridging Markov processes to approximate complex, high-dimensional\u0000distributions, such as those found in image processing and language models.\u0000Despite their success in these domains, an important open challenge remains:\u0000extending these techniques to sample from conditional distributions, as\u0000required in, for example, Bayesian inverse problems. In this paper, we present\u0000a comprehensive review of existing computational approaches to conditional\u0000sampling within generative diffusion models. Specifically, we highlight key\u0000methodologies that either utilise the joint distribution, or rely on\u0000(pre-trained) marginal distributions with explicit likelihoods, to construct\u0000conditional generative samplers.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Obtaining compositional mappings is important for the model to generalize well compositionally. To better understand when and how to encourage the model to learn such mappings, we study their uniqueness through different perspectives. Specifically, we first show that the compositional mappings are the simplest bijections through the lens of coding length (i.e., an upper bound of their Kolmogorov complexity). This property explains why models having such mappings can generalize well. We further show that the simplicity bias is usually an intrinsic property of neural network training via gradient descent. That partially explains why some models spontaneously generalize well when they are trained appropriately.
{"title":"Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics","authors":"Yi Ren, Danica J. Sutherland","doi":"arxiv-2409.09626","DOIUrl":"https://doi.org/arxiv-2409.09626","url":null,"abstract":"Obtaining compositional mappings is important for the model to generalize\u0000well compositionally. To better understand when and how to encourage the model\u0000to learn such mappings, we study their uniqueness through different\u0000perspectives. Specifically, we first show that the compositional mappings are\u0000the simplest bijections through the lens of coding length (i.e., an upper bound\u0000of their Kolmogorov complexity). This property explains why models having such\u0000mappings can generalize well. We further show that the simplicity bias is\u0000usually an intrinsic property of neural network training via gradient descent.\u0000That partially explains why some models spontaneously generalize well when they\u0000are trained appropriately.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and questions. As a result, the standard data science workflow in medicine has been fundamentally altered; the foundation model lifecycle (FMLC) now includes distinct upstream and downstream processes, in which computational resources, model and data access, and decision-making power are distributed among multiple stakeholders. At their core, FMs are fundamentally statistical models, and this new workflow challenges the principles of Veridical Data Science (VDS), hindering the rigorous statistical analysis expected in transparent and scientifically reproducible data science practices. We critically examine the medical FMLC in light of the core principles of VDS: predictability, computability, and stability (PCS), and explain how it deviates from the standard data science workflow. Finally, we propose recommendations for a reimagined medical FMLC that expands and refines the PCS principles for VDS, including consideration of the computational and accessibility constraints inherent to FMs.
{"title":"Veridical Data Science for Medical Foundation Models","authors":"Ahmed Alaa, Bin Yu","doi":"arxiv-2409.10580","DOIUrl":"https://doi.org/arxiv-2409.10580","url":null,"abstract":"The advent of foundation models (FMs) such as large language models (LLMs)\u0000has led to a cultural shift in data science, both in medicine and beyond. This\u0000shift involves moving away from specialized predictive models trained for\u0000specific, well-defined domain questions to generalist FMs pre-trained on vast\u0000amounts of unstructured data, which can then be adapted to various clinical\u0000tasks and questions. As a result, the standard data science workflow in\u0000medicine has been fundamentally altered; the foundation model lifecycle (FMLC)\u0000now includes distinct upstream and downstream processes, in which computational\u0000resources, model and data access, and decision-making power are distributed\u0000among multiple stakeholders. At their core, FMs are fundamentally statistical\u0000models, and this new workflow challenges the principles of Veridical Data\u0000Science (VDS), hindering the rigorous statistical analysis expected in\u0000transparent and scientifically reproducible data science practices. We\u0000critically examine the medical FMLC in light of the core principles of VDS:\u0000predictability, computability, and stability (PCS), and explain how it deviates\u0000from the standard data science workflow. Finally, we propose recommendations\u0000for a reimagined medical FMLC that expands and refines the PCS principles for\u0000VDS including considering the computational and accessibility constraints\u0000inherent to FMs.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RuiKang OuYang, Bo Qiang, José Miguel Hernández-Lobato
Developing an efficient sampler capable of generating independent and identically distributed (IID) samples from a Boltzmann distribution is a crucial challenge in scientific research, e.g., molecular dynamics. In this work, we intend to learn neural samplers given energy functions instead of data sampled from the Boltzmann distribution. By learning the energies of the noised data, we propose a diffusion-based sampler, ENERGY-BASED DENOISING ENERGY MATCHING (EnDEM), which theoretically has lower variance and more complexity compared to related works. Furthermore, a novel bootstrapping technique is applied to EnDEM, yielding BEnDEM, to balance between bias and variance. We evaluate EnDEM and BEnDEM on a 2-dimensional 40-mode Gaussian Mixture Model (GMM) and a 4-particle double-well potential (DW-4). The experimental results demonstrate that BEnDEM can achieve state-of-the-art performance while being more robust.
{"title":"BEnDEM:A Boltzmann Sampler Based on Bootstrapped Denoising Energy Matching","authors":"RuiKang OuYang, Bo Qiang, José Miguel Hernández-Lobato","doi":"arxiv-2409.09787","DOIUrl":"https://doi.org/arxiv-2409.09787","url":null,"abstract":"Developing an efficient sampler capable of generating independent and\u0000identically distributed (IID) samples from a Boltzmann distribution is a\u0000crucial challenge in scientific research, e.g. molecular dynamics. In this\u0000work, we intend to learn neural samplers given energy functions instead of data\u0000sampled from the Boltzmann distribution. By learning the energies of the noised\u0000data, we propose a diffusion-based sampler, ENERGY-BASED DENOISING ENERGY\u0000MATCHING, which theoretically has lower variance and more complexity compared\u0000to related works. Furthermore, a novel bootstrapping technique is applied to\u0000EnDEM to balance between bias and variance. We evaluate EnDEM and BEnDEM on a\u00002-dimensional 40 Gaussian Mixture Model (GMM) and a 4-particle double-welling\u0000potential (DW-4). The experimental results demonstrate that BEnDEM can achieve\u0000state-of-the-art performance while being more robust.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clayton Harper, Luke Wood, Peter Gerstoft, Eric C. Larson
We address three key challenges in learning continuous kernel representations: computational efficiency, parameter efficiency, and spectral bias. Continuous kernels have shown significant potential, but their practical adoption is often limited by high computational and memory demands. Additionally, these methods are prone to spectral bias, which impedes their ability to capture high-frequency details. To overcome these limitations, we propose a novel approach that leverages sparse learning in the Fourier domain. Our method enables the efficient scaling of continuous kernels, drastically reduces computational and memory requirements, and mitigates spectral bias by exploiting the Gibbs phenomenon.
{"title":"Scaling Continuous Kernels with Sparse Fourier Domain Learning","authors":"Clayton Harper, Luke Wood, Peter Gerstoft, Eric C. Larson","doi":"arxiv-2409.09875","DOIUrl":"https://doi.org/arxiv-2409.09875","url":null,"abstract":"We address three key challenges in learning continuous kernel\u0000representations: computational efficiency, parameter efficiency, and spectral\u0000bias. Continuous kernels have shown significant potential, but their practical\u0000adoption is often limited by high computational and memory demands.\u0000Additionally, these methods are prone to spectral bias, which impedes their\u0000ability to capture high-frequency details. To overcome these limitations, we\u0000propose a novel approach that leverages sparse learning in the Fourier domain.\u0000Our method enables the efficient scaling of continuous kernels, drastically\u0000reduces computational and memory requirements, and mitigates spectral bias by\u0000exploiting the Gibbs phenomenon.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time series are ubiquitous and occur naturally in a variety of applications -- from data recorded by sensors in manufacturing processes, through financial data streams, to climate data. Different tasks arise, such as regression, classification or segmentation of the time series. However, to reliably solve these challenges, it is important to filter out abnormal observations that deviate from the usual behavior of the time series. While many anomaly detection methods exist for independent data and stationary time series, these methods are not applicable to non-stationary time series. To allow for non-stationarity in the data, while simultaneously detecting anomalies, we propose OML-AD, a novel approach for anomaly detection (AD) based on online machine learning (OML). We provide an implementation of OML-AD within the Python library River and show that it outperforms state-of-the-art baseline methods in terms of accuracy and computational efficiency.
{"title":"OML-AD: Online Machine Learning for Anomaly Detection in Time Series Data","authors":"Sebastian Wette, Florian Heinrichs","doi":"arxiv-2409.09742","DOIUrl":"https://doi.org/arxiv-2409.09742","url":null,"abstract":"Time series are ubiquitous and occur naturally in a variety of applications\u0000-- from data recorded by sensors in manufacturing processes, over financial\u0000data streams to climate data. Different tasks arise, such as regression,\u0000classification or segmentation of the time series. However, to reliably solve\u0000these challenges, it is important to filter out abnormal observations that\u0000deviate from the usual behavior of the time series. While many anomaly\u0000detection methods exist for independent data and stationary time series, these\u0000methods are not applicable to non-stationary time series. To allow for\u0000non-stationarity in the data, while simultaneously detecting anomalies, we\u0000propose OML-AD, a novel approach for anomaly detection (AD) based on online\u0000machine learning (OML). We provide an implementation of OML-AD within the\u0000Python library River and show that it outperforms state-of-the-art baseline\u0000methods in terms of accuracy and computational efficiency.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a novel approach to select the best model of the data. Based on the exclusive properties of the nested models, we find the most parsimonious model containing the risk minimizer predictor. We prove the existence of probably approximately correct (PAC) bounds on the difference of the minimum empirical risk of two successive nested models, called the successive empirical excess risk (SEER). Based on these bounds, we propose a model order selection method called nested empirical risk (NER). By sorting the models intelligently with the sorted NER (S-NER) method, the minimum risk decreases. We construct a test that predicts whether expanding the model decreases the minimum risk or not. With a high probability, the NER and S-NER choose the true model order and the most parsimonious model containing the risk minimizer predictor, respectively. We use S-NER model selection in linear regression and show that the S-NER method, without any prior information, can outperform the accuracy of feature sorting algorithms like orthogonal matching pursuit (OMP) that are aided by prior knowledge of the true model order. Also, on the UCR datasets, the NER method dramatically reduces the complexity of classification, with a negligible loss of accuracy.
{"title":"Model Selection Through Model Sorting","authors":"Mohammad Ali Hajiani, Babak Seyfe","doi":"arxiv-2409.09674","DOIUrl":"https://doi.org/arxiv-2409.09674","url":null,"abstract":"We propose a novel approach to select the best model of the data. Based on\u0000the exclusive properties of the nested models, we find the most parsimonious\u0000model containing the risk minimizer predictor. We prove the existence of\u0000probable approximately correct (PAC) bounds on the difference of the minimum\u0000empirical risk of two successive nested models, called successive empirical\u0000excess risk (SEER). Based on these bounds, we propose a model order selection\u0000method called nested empirical risk (NER). By the sorted NER (S-NER) method to\u0000sort the models intelligently, the minimum risk decreases. We construct a test\u0000that predicts whether expanding the model decreases the minimum risk or not.\u0000With a high probability, the NER and S-NER choose the true model order and the\u0000most parsimonious model containing the risk minimizer predictor, respectively.\u0000We use S-NER model selection in the linear regression and show that, the S-NER\u0000method without any prior information can outperform the accuracy of feature\u0000sorting algorithms like orthogonal matching pursuit (OMP) that aided with prior\u0000knowledge of the true model order. Also, in the UCR data set, the NER method\u0000reduces the complexity of the classification of UCR datasets dramatically, with\u0000a negligible loss of accuracy.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}