In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset, regardless of its specific content, focusing on both textual and numerical data. Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence. At the heart of our approach lies a rigorous demand for both explainability and interpretability, ensuring that the rationale behind the identification and correction of data anomalies is transparent and understandable. To achieve this, we adopt a hybrid approach that integrates statistical methods with machine learning algorithms. Indeed, by leveraging statistical techniques alongside machine learning, we strike a balance between accuracy and explainability, enabling users to trust and comprehend the assessment process. Acknowledging the challenges associated with automating the data quality assessment process, particularly in terms of time efficiency and accuracy, we adopt a pragmatic strategy, employing resource-intensive algorithms only when necessary, while favoring simpler, more efficient solutions whenever possible. Through a practical analysis conducted on a publicly provided dataset, we illustrate the challenges that arise when trying to enhance data quality while preserving explainability. We demonstrate the effectiveness of our approach in detecting and rectifying missing values, duplicates, and typographical errors, as well as the challenges that remain to be addressed to achieve similar accuracy on statistical outliers and logic errors under the constraints set in our work.
{"title":"Towards Explainable Automated Data Quality Enhancement without Domain Knowledge","authors":"Djibril Sarr","doi":"arxiv-2409.10139","DOIUrl":"https://doi.org/arxiv-2409.10139","url":null,"abstract":"In the era of big data, ensuring the quality of datasets has become\u0000increasingly crucial across various domains. We propose a comprehensive\u0000framework designed to automatically assess and rectify data quality issues in\u0000any given dataset, regardless of its specific content, focusing on both textual\u0000and numerical data. Our primary objective is to address three fundamental types\u0000of defects: absence, redundancy, and incoherence. At the heart of our approach\u0000lies a rigorous demand for both explainability and interpretability, ensuring\u0000that the rationale behind the identification and correction of data anomalies\u0000is transparent and understandable. To achieve this, we adopt a hybrid approach\u0000that integrates statistical methods with machine learning algorithms. Indeed,\u0000by leveraging statistical techniques alongside machine learning, we strike a\u0000balance between accuracy and explainability, enabling users to trust and\u0000comprehend the assessment process. Acknowledging the challenges associated with\u0000automating the data quality assessment process, particularly in terms of time\u0000efficiency and accuracy, we adopt a pragmatic strategy, employing\u0000resource-intensive algorithms only when necessary, while favoring simpler, more\u0000efficient solutions whenever possible. Through a practical analysis conducted\u0000on a publicly provided dataset, we illustrate the challenges that arise when\u0000trying to enhance data quality while keeping explainability. We demonstrate the\u0000effectiveness of our approach in detecting and rectifying missing values,\u0000duplicates and typographical errors as well as the challenges remaining to be\u0000addressed to achieve similar accuracy on statistical outliers and logic errors\u0000under the constraints set in our work.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zi-Ming Wang, Nan Xue, Ling Lei, Rebecka Jörnsten, Gui-Song Xia
This paper studies the problem of distribution matching (DM), which is a fundamental machine learning problem seeking to robustly align two probability distributions. Our approach is established on a relaxed formulation, called partial distribution matching (PDM), which seeks to match a fraction of the distributions instead of matching them completely. We theoretically derive the Kantorovich-Rubinstein duality for the partial Wasserstein-1 (PW) discrepancy, and develop a partial Wasserstein adversarial network (PWAN) that efficiently approximates the PW discrepancy based on this dual form. Partial matching can then be achieved by optimizing the network using gradient descent. Two practical tasks, point set registration and partial domain adaptation, are investigated, where the goals are to partially match distributions in 3D space and high-dimensional feature space, respectively. The experimental results confirm that the proposed PWAN effectively produces highly robust matching results, performing better or on par with the state-of-the-art methods.
{"title":"Partial Distribution Matching via Partial Wasserstein Adversarial Networks","authors":"Zi-Ming Wang, Nan Xue, Ling Lei, Rebecka Jörnsten, Gui-Song Xia","doi":"arxiv-2409.10499","DOIUrl":"https://doi.org/arxiv-2409.10499","url":null,"abstract":"This paper studies the problem of distribution matching (DM), which is a\u0000fundamental machine learning problem seeking to robustly align two probability\u0000distributions. Our approach is established on a relaxed formulation, called\u0000partial distribution matching (PDM), which seeks to match a fraction of the\u0000distributions instead of matching them completely. We theoretically derive the\u0000Kantorovich-Rubinstein duality for the partial Wasserstain-1 (PW) discrepancy,\u0000and develop a partial Wasserstein adversarial network (PWAN) that efficiently\u0000approximates the PW discrepancy based on this dual form. Partial matching can\u0000then be achieved by optimizing the network using gradient descent. Two\u0000practical tasks, point set registration and partial domain adaptation are\u0000investigated, where the goals are to partially match distributions in 3D space\u0000and high-dimensional feature space respectively. The experiment results confirm\u0000that the proposed PWAN effectively produces highly robust matching results,\u0000performing better or on par with the state-of-the-art methods.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we provide tight lower bounds for the oracle complexity of minimizing high-order Hölder smooth and uniformly convex functions. Specifically, for a function whose $p^{th}$-order derivatives are Hölder continuous with degree $\nu$ and parameter $H$, and that is uniformly convex with degree $q$ and parameter $\sigma$, we focus on two asymmetric cases: (1) $q > p + \nu$, and (2) $q < p + \nu$. Given up to $p^{th}$-order oracle access, we establish worst-case oracle complexities of $\Omega\left( \left( \frac{H}{\sigma}\right)^{\frac{2}{3(p+\nu)-2}}\left( \frac{\sigma}{\epsilon}\right)^{\frac{2(q-p-\nu)}{q(3(p+\nu)-2)}}\right)$ with a truncated-Gaussian smoothed hard function in the first case and $\Omega\left(\left(\frac{H}{\sigma}\right)^{\frac{2}{3(p+\nu)-2}} + \log^2\left(\frac{\sigma^{p+\nu}}{H^q}\right)^{\frac{1}{p+\nu-q}}\right)$ in the second case, for reaching an $\epsilon$-approximate solution in terms of the optimality gap. Our analysis generalizes previous lower bounds for functions under first- and second-order smoothness as well as those for uniformly convex functions, and furthermore our results match the corresponding upper bounds in the general setting.
{"title":"Tight Lower Bounds under Asymmetric High-Order Hölder Smoothness and Uniform Convexity","authors":"Site Bai, Brian Bullins","doi":"arxiv-2409.10773","DOIUrl":"https://doi.org/arxiv-2409.10773","url":null,"abstract":"In this paper, we provide tight lower bounds for the oracle complexity of\u0000minimizing high-order H\"older smooth and uniformly convex functions.\u0000Specifically, for a function whose $p^{th}$-order derivatives are H\"older\u0000continuous with degree $nu$ and parameter $H$, and that is uniformly convex\u0000with degree $q$ and parameter $sigma$, we focus on two asymmetric cases: (1)\u0000$q > p + nu$, and (2) $q < p+nu$. Given up to $p^{th}$-order oracle access,\u0000we establish worst-case oracle complexities of $Omegaleft( left(\u0000frac{H}{sigma}right)^frac{2}{3(p+nu)-2}left(\u0000frac{sigma}{epsilon}right)^frac{2(q-p-nu)}{q(3(p+nu)-2)}right)$ with a\u0000truncated-Gaussian smoothed hard function in the first case and\u0000$Omegaleft(left(frac{H}{sigma}right)^frac{2}{3(p+nu)-2}+\u0000log^2left(frac{sigma^{p+nu}}{H^q}right)^frac{1}{p+nu-q}right)$ in the\u0000second case, for reaching an $epsilon$-approximate solution in terms of the\u0000optimality gap. Our analysis generalizes previous lower bounds for functions\u0000under first- and second-order smoothness as well as those for uniformly convex\u0000functions, and furthermore our results match the corresponding upper bounds in\u0000the general setting.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"89 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zheng Zhao, Ziwei Luo, Jens Sjölund, Thomas B. Schön
Generative diffusions are a powerful class of Monte Carlo samplers that leverage bridging Markov processes to approximate complex, high-dimensional distributions, such as those found in image processing and language models. Despite their success in these domains, an important open challenge remains: extending these techniques to sample from conditional distributions, as required in, for example, Bayesian inverse problems. In this paper, we present a comprehensive review of existing computational approaches to conditional sampling within generative diffusion models. Specifically, we highlight key methodologies that either utilise the joint distribution, or rely on (pre-trained) marginal distributions with explicit likelihoods, to construct conditional generative samplers.
{"title":"Conditional sampling within generative diffusion models","authors":"Zheng Zhao, Ziwei Luo, Jens Sjölund, Thomas B. Schön","doi":"arxiv-2409.09650","DOIUrl":"https://doi.org/arxiv-2409.09650","url":null,"abstract":"Generative diffusions are a powerful class of Monte Carlo samplers that\u0000leverage bridging Markov processes to approximate complex, high-dimensional\u0000distributions, such as those found in image processing and language models.\u0000Despite their success in these domains, an important open challenge remains:\u0000extending these techniques to sample from conditional distributions, as\u0000required in, for example, Bayesian inverse problems. In this paper, we present\u0000a comprehensive review of existing computational approaches to conditional\u0000sampling within generative diffusion models. Specifically, we highlight key\u0000methodologies that either utilise the joint distribution, or rely on\u0000(pre-trained) marginal distributions with explicit likelihoods, to construct\u0000conditional generative samplers.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Obtaining compositional mappings is important for the model to generalize well compositionally. To better understand when and how to encourage the model to learn such mappings, we study their uniqueness through different perspectives. Specifically, we first show that the compositional mappings are the simplest bijections through the lens of coding length (i.e., an upper bound of their Kolmogorov complexity). This property explains why models having such mappings can generalize well. We further show that the simplicity bias is usually an intrinsic property of neural network training via gradient descent. That partially explains why some models spontaneously generalize well when they are trained appropriately.
{"title":"Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics","authors":"Yi Ren, Danica J. Sutherland","doi":"arxiv-2409.09626","DOIUrl":"https://doi.org/arxiv-2409.09626","url":null,"abstract":"Obtaining compositional mappings is important for the model to generalize\u0000well compositionally. To better understand when and how to encourage the model\u0000to learn such mappings, we study their uniqueness through different\u0000perspectives. Specifically, we first show that the compositional mappings are\u0000the simplest bijections through the lens of coding length (i.e., an upper bound\u0000of their Kolmogorov complexity). This property explains why models having such\u0000mappings can generalize well. We further show that the simplicity bias is\u0000usually an intrinsic property of neural network training via gradient descent.\u0000That partially explains why some models spontaneously generalize well when they\u0000are trained appropriately.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and questions. As a result, the standard data science workflow in medicine has been fundamentally altered; the foundation model lifecycle (FMLC) now includes distinct upstream and downstream processes, in which computational resources, model and data access, and decision-making power are distributed among multiple stakeholders. At their core, FMs are fundamentally statistical models, and this new workflow challenges the principles of Veridical Data Science (VDS), hindering the rigorous statistical analysis expected in transparent and scientifically reproducible data science practices. We critically examine the medical FMLC in light of the core principles of VDS: predictability, computability, and stability (PCS), and explain how it deviates from the standard data science workflow. Finally, we propose recommendations for a reimagined medical FMLC that expands and refines the PCS principles for VDS, including consideration of the computational and accessibility constraints inherent to FMs.
{"title":"Veridical Data Science for Medical Foundation Models","authors":"Ahmed Alaa, Bin Yu","doi":"arxiv-2409.10580","DOIUrl":"https://doi.org/arxiv-2409.10580","url":null,"abstract":"The advent of foundation models (FMs) such as large language models (LLMs)\u0000has led to a cultural shift in data science, both in medicine and beyond. This\u0000shift involves moving away from specialized predictive models trained for\u0000specific, well-defined domain questions to generalist FMs pre-trained on vast\u0000amounts of unstructured data, which can then be adapted to various clinical\u0000tasks and questions. As a result, the standard data science workflow in\u0000medicine has been fundamentally altered; the foundation model lifecycle (FMLC)\u0000now includes distinct upstream and downstream processes, in which computational\u0000resources, model and data access, and decision-making power are distributed\u0000among multiple stakeholders. At their core, FMs are fundamentally statistical\u0000models, and this new workflow challenges the principles of Veridical Data\u0000Science (VDS), hindering the rigorous statistical analysis expected in\u0000transparent and scientifically reproducible data science practices. We\u0000critically examine the medical FMLC in light of the core principles of VDS:\u0000predictability, computability, and stability (PCS), and explain how it deviates\u0000from the standard data science workflow. Finally, we propose recommendations\u0000for a reimagined medical FMLC that expands and refines the PCS principles for\u0000VDS including considering the computational and accessibility constraints\u0000inherent to FMs.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RuiKang OuYang, Bo Qiang, José Miguel Hernández-Lobato
Developing an efficient sampler capable of generating independent and identically distributed (IID) samples from a Boltzmann distribution is a crucial challenge in scientific research, e.g., molecular dynamics. In this work, we intend to learn neural samplers given energy functions instead of data sampled from the Boltzmann distribution. By learning the energies of the noised data, we propose a diffusion-based sampler, ENERGY-BASED DENOISING ENERGY MATCHING (EnDEM), which theoretically has lower variance and more complexity compared to related works. Furthermore, a novel bootstrapping technique is applied to EnDEM, yielding BEnDEM, to balance between bias and variance. We evaluate EnDEM and BEnDEM on a 2-dimensional 40-mode Gaussian Mixture Model (GMM) and a 4-particle double-well potential (DW-4). The experimental results demonstrate that BEnDEM can achieve state-of-the-art performance while being more robust.
{"title":"BEnDEM:A Boltzmann Sampler Based on Bootstrapped Denoising Energy Matching","authors":"RuiKang OuYang, Bo Qiang, José Miguel Hernández-Lobato","doi":"arxiv-2409.09787","DOIUrl":"https://doi.org/arxiv-2409.09787","url":null,"abstract":"Developing an efficient sampler capable of generating independent and\u0000identically distributed (IID) samples from a Boltzmann distribution is a\u0000crucial challenge in scientific research, e.g. molecular dynamics. In this\u0000work, we intend to learn neural samplers given energy functions instead of data\u0000sampled from the Boltzmann distribution. By learning the energies of the noised\u0000data, we propose a diffusion-based sampler, ENERGY-BASED DENOISING ENERGY\u0000MATCHING, which theoretically has lower variance and more complexity compared\u0000to related works. Furthermore, a novel bootstrapping technique is applied to\u0000EnDEM to balance between bias and variance. We evaluate EnDEM and BEnDEM on a\u00002-dimensional 40 Gaussian Mixture Model (GMM) and a 4-particle double-welling\u0000potential (DW-4). The experimental results demonstrate that BEnDEM can achieve\u0000state-of-the-art performance while being more robust.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clayton Harper, Luke Wood, Peter Gerstoft, Eric C. Larson
We address three key challenges in learning continuous kernel representations: computational efficiency, parameter efficiency, and spectral bias. Continuous kernels have shown significant potential, but their practical adoption is often limited by high computational and memory demands. Additionally, these methods are prone to spectral bias, which impedes their ability to capture high-frequency details. To overcome these limitations, we propose a novel approach that leverages sparse learning in the Fourier domain. Our method enables the efficient scaling of continuous kernels, drastically reduces computational and memory requirements, and mitigates spectral bias by exploiting the Gibbs phenomenon.
{"title":"Scaling Continuous Kernels with Sparse Fourier Domain Learning","authors":"Clayton Harper, Luke Wood, Peter Gerstoft, Eric C. Larson","doi":"arxiv-2409.09875","DOIUrl":"https://doi.org/arxiv-2409.09875","url":null,"abstract":"We address three key challenges in learning continuous kernel\u0000representations: computational efficiency, parameter efficiency, and spectral\u0000bias. Continuous kernels have shown significant potential, but their practical\u0000adoption is often limited by high computational and memory demands.\u0000Additionally, these methods are prone to spectral bias, which impedes their\u0000ability to capture high-frequency details. To overcome these limitations, we\u0000propose a novel approach that leverages sparse learning in the Fourier domain.\u0000Our method enables the efficient scaling of continuous kernels, drastically\u0000reduces computational and memory requirements, and mitigates spectral bias by\u0000exploiting the Gibbs phenomenon.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time series are ubiquitous and occur naturally in a variety of applications -- from data recorded by sensors in manufacturing processes, through financial data streams, to climate data. Different tasks arise, such as regression, classification or segmentation of the time series. However, to reliably solve these challenges, it is important to filter out abnormal observations that deviate from the usual behavior of the time series. While many anomaly detection methods exist for independent data and stationary time series, these methods are not applicable to non-stationary time series. To allow for non-stationarity in the data, while simultaneously detecting anomalies, we propose OML-AD, a novel approach for anomaly detection (AD) based on online machine learning (OML). We provide an implementation of OML-AD within the Python library River and show that it outperforms state-of-the-art baseline methods in terms of accuracy and computational efficiency.
{"title":"OML-AD: Online Machine Learning for Anomaly Detection in Time Series Data","authors":"Sebastian Wette, Florian Heinrichs","doi":"arxiv-2409.09742","DOIUrl":"https://doi.org/arxiv-2409.09742","url":null,"abstract":"Time series are ubiquitous and occur naturally in a variety of applications\u0000-- from data recorded by sensors in manufacturing processes, over financial\u0000data streams to climate data. Different tasks arise, such as regression,\u0000classification or segmentation of the time series. However, to reliably solve\u0000these challenges, it is important to filter out abnormal observations that\u0000deviate from the usual behavior of the time series. While many anomaly\u0000detection methods exist for independent data and stationary time series, these\u0000methods are not applicable to non-stationary time series. To allow for\u0000non-stationarity in the data, while simultaneously detecting anomalies, we\u0000propose OML-AD, a novel approach for anomaly detection (AD) based on online\u0000machine learning (OML). We provide an implementation of OML-AD within the\u0000Python library River and show that it outperforms state-of-the-art baseline\u0000methods in terms of accuracy and computational efficiency.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a novel approach to select the best model of the data. Based on the exclusive properties of the nested models, we find the most parsimonious model containing the risk minimizer predictor. We prove the existence of probably approximately correct (PAC) bounds on the difference of the minimum empirical risk of two successive nested models, called the successive empirical excess risk (SEER). Based on these bounds, we propose a model order selection method called nested empirical risk (NER). By sorting the models intelligently with the sorted NER (S-NER) method, the minimum risk decreases. We construct a test that predicts whether expanding the model decreases the minimum risk or not. With a high probability, the NER and S-NER choose the true model order and the most parsimonious model containing the risk minimizer predictor, respectively. We use S-NER model selection in linear regression and show that the S-NER method, without any prior information, can outperform the accuracy of feature sorting algorithms like orthogonal matching pursuit (OMP) that are aided by prior knowledge of the true model order. Also, on the UCR datasets, the NER method dramatically reduces the complexity of classification, with a negligible loss of accuracy.
{"title":"Model Selection Through Model Sorting","authors":"Mohammad Ali Hajiani, Babak Seyfe","doi":"arxiv-2409.09674","DOIUrl":"https://doi.org/arxiv-2409.09674","url":null,"abstract":"We propose a novel approach to select the best model of the data. Based on\u0000the exclusive properties of the nested models, we find the most parsimonious\u0000model containing the risk minimizer predictor. We prove the existence of\u0000probable approximately correct (PAC) bounds on the difference of the minimum\u0000empirical risk of two successive nested models, called successive empirical\u0000excess risk (SEER). Based on these bounds, we propose a model order selection\u0000method called nested empirical risk (NER). By the sorted NER (S-NER) method to\u0000sort the models intelligently, the minimum risk decreases. We construct a test\u0000that predicts whether expanding the model decreases the minimum risk or not.\u0000With a high probability, the NER and S-NER choose the true model order and the\u0000most parsimonious model containing the risk minimizer predictor, respectively.\u0000We use S-NER model selection in the linear regression and show that, the S-NER\u0000method without any prior information can outperform the accuracy of feature\u0000sorting algorithms like orthogonal matching pursuit (OMP) that aided with prior\u0000knowledge of the true model order. Also, in the UCR data set, the NER method\u0000reduces the complexity of the classification of UCR datasets dramatically, with\u0000a negligible loss of accuracy.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}