The objective of this research is to provide a framework with which the data science community can understand, define, and develop data science as a field of inquiry. The framework is based on the classical reference framework (axiology, ontology, epistemology, methodology) used for 200 years to define knowledge discovery paradigms and disciplines in the humanities, sciences, algorithms, and now data science. I augmented it for automated problem-solving with three further components (methods, technology, community). The resulting data science reference framework is used to define the data science knowledge discovery paradigm in terms of the philosophy of data science, addressed in previous papers, and the data science problem-solving paradigm (i.e., the data science method) and the data science problem-solving workflow, both addressed in this paper. The framework is a much-called-for unifying framework for data science, as it contains the components required to define data science. To offer insights for better understanding data science, this paper uses the framework to define the emerging, often enigmatic, data science problem-solving paradigm and workflow, and to compare them with their well-understood scientific counterparts, the scientific problem-solving paradigm and workflow.
{"title":"A framework for understanding data science","authors":"Michael L Brodie","doi":"arxiv-2403.00776","DOIUrl":"https://doi.org/arxiv-2403.00776","journal":{"name":"arXiv - STAT - Other Statistics"},"publicationDate":"2024-02-14","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140034111"}
Since its introduction in 2011, the partial information decomposition (PID) has triggered an explosion of interest in the field of multivariate information theory and the study of emergent, higher-order ("synergistic") interactions in complex systems. Despite its power, however, the PID has a number of limitations that restrict its general applicability: it scales poorly with system size, and the standard approach to decomposition hinges on a definition of "redundancy", leaving synergy only vaguely defined as "that information not redundant." Other heuristic measures, such as the O-information, have been introduced, although these measures typically provide only a summary statistic of redundancy/synergy dominance, rather than direct insight into the synergy itself. To address this issue, we present an alternative decomposition that is synergy-first, scales much more gracefully than the PID, and has a straightforward interpretation. Our approach defines synergy as that information in a set that would be lost following the minimally invasive perturbation on any single element. By generalizing this idea to sets of elements, we construct a totally ordered "backbone" of partial synergy atoms that sweeps across system scales. Our approach starts with entropy, but can be generalized to the Kullback-Leibler divergence, and by extension, to the total correlation and the single-target mutual information. Finally, we show that this approach can be used to decompose higher-order interactions beyond just information theory: we demonstrate this by showing how synergistic combinations of pairwise edges in a complex network support signal communicability and global integration. We conclude by discussing how this perspective on synergistic structure (information-based or otherwise) can deepen our understanding of part-whole relationships in complex systems.
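The synergy definition in the abstract above can be made concrete with a small sketch. It assumes, as a simplification that the abstract does not confirm, that the "minimally invasive perturbation" of an element means deleting it, so the quantity computed is the smallest drop in joint entropy over all single-element deletions. The function names and the three toy systems are hypothetical, and every listed joint state is treated as equally likely.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of the samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def drop(state, i):
    """Return the joint state (a tuple) with element i removed."""
    return state[:i] + state[i + 1:]


def min_single_element_loss(states):
    """Smallest joint-entropy loss over all single-element deletions.

    ASSUMPTION: "perturbation" is read as deletion of one element; the
    paper's formal definition may differ from this simplified reading.
    """
    h_whole = entropy(states)
    width = len(states[0])
    return min(
        h_whole - entropy([drop(s, i) for s in states]) for i in range(width)
    )


# Hypothetical toy systems; each listed state is equally likely.
independent = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
xor_triple = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
copies = [(a, a, a) for a in (0, 1)]

for name, system in [("independent", independent),
                     ("xor", xor_triple),
                     ("copies", copies)]:
    print(name, min_single_element_loss(system))
```

Under this entropy-based reading, three independent bits lose one bit whichever element is dropped, while the XOR triple and the fully redundant copies lose nothing, because the remaining two elements still determine the whole state.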
{"title":"A scalable, synergy-first backbone decomposition of higher-order structures in complex systems","authors":"Thomas F. Varley","doi":"arxiv-2402.08135","DOIUrl":"https://doi.org/arxiv-2402.08135","journal":{"name":"arXiv - STAT - Other Statistics"},"publicationDate":"2024-02-13","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"139764541"}
This paper explores an innovative approach to teaching data wrangling skills to students through hands-on activities before transitioning to coding. Data wrangling, a critical aspect of data analysis, involves cleaning, transforming, and restructuring data. We introduce the use of a physical tool, mathlink cubes, to facilitate a tangible understanding of data sets. This approach helps students grasp the concepts of data wrangling before implementing them in coding languages such as R. We detail a classroom activity that includes hands-on tasks paralleling common data wrangling processes such as filtering, selecting, and mutating, followed by their coding equivalents using R's `dplyr` package.
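The three wrangling steps named in the abstract above (filtering, selecting, and mutating) can be sketched without any library. The paper itself uses R's `dplyr`, whose verbs are `filter()`, `select()`, and `mutate()`; the Python stand-in below is only an illustration, and the cube data, column names, and helper functions are all hypothetical.

```python
# Hypothetical toy data: each dict is one row (one cube), keys are columns.
cubes = [
    {"color": "red", "length": 2, "width": 1},
    {"color": "blue", "length": 3, "width": 1},
    {"color": "red", "length": 5, "width": 2},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def select_columns(rows, columns):
    """Keep only the named columns in every row."""
    return [{c: row[c] for c in columns} for row in rows]


def mutate(rows, name, fn):
    """Add a new column computed from each row, leaving the input untouched."""
    return [{**row, name: fn(row)} for row in rows]


red = filter_rows(cubes, lambda r: r["color"] == "red")
slim = select_columns(red, ["color", "length"])
result = mutate(slim, "double_length", lambda r: r["length"] * 2)
print(result)
```

Each helper returns a new list rather than modifying its input, which mirrors the way the physical activity leaves the original pile of cubes available for the next task.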
{"title":"Using Mathlink Cubes to Introduce Data Wrangling with Examples in R","authors":"Lucy D'Agostino McGowan","doi":"arxiv-2402.07029","DOIUrl":"https://doi.org/arxiv-2402.07029","journal":{"name":"arXiv - STAT - Other Statistics"},"publicationDate":"2024-02-10","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"139764450"}
Many people are interested in ChatGPT since it has become a prominent AI-generated content (AIGC) model that provides high-quality responses in various contexts, such as software development and maintenance. Despite its immense potential, misuse of ChatGPT might cause significant issues, particularly in public safety and education. The majority of researchers choose to publish their work on arXiv. The effectiveness and originality of future work depend on the ability to detect AI-generated components in such contributions. To address this need, this study analyzes a method that can detect purposely manufactured content that academic organizations post on arXiv. For this study, a dataset was created using physics, mathematics, and computer science articles. Using the newly built dataset, the next step is to put Originality.ai through its paces. The statistical analysis shows that Originality.ai is very accurate, with an accuracy rate of 98%.
{"title":"Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool","authors":"Arslan Akram","doi":"arxiv-2403.13812","DOIUrl":"https://doi.org/arxiv-2403.13812","journal":{"name":"arXiv - STAT - Other Statistics"},"publicationDate":"2024-02-09","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"140205779"}
Malaria is a leading cause of death globally, especially in sub-Saharan African countries, claiming over 400,000 lives each year, which underscores the critical need for continued efforts to combat this preventable and treatable disease. The objective of this study is to provide statistical guidance on the optimal preventive and control measures against malaria. Data were collected from reliable sources, such as the World Health Organization, UNICEF, Our World in Data, and STATcompiler. Data were categorized according to the factors and sub-factors related to deaths caused by malaria. These factors and sub-factors were determined based on root cause analysis and data sources. Using JMP 16 Pro software, both simple and multiple linear regression were conducted to analyze the data. The analyses aimed to establish a linear relationship between the dependent variable (malaria deaths in the overall population) and independent variables, such as life expectancy, malaria prevalence in children, net usage, indoor residual spraying usage, literate population, and population with inadequate sanitation in each selected sample country. The statistical analysis revealed that the use of insecticide-treated nets (ITNs) by children and other individuals significantly decreased the death count: 1,000 individuals sleeping under ITNs could reduce the death count by eight. Based on the statistical analysis, this study suggests more rigorous research on the usage of ITNs.
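To show how a figure such as "eight fewer deaths per 1,000 ITN users" is read off a fitted line, here is a minimal ordinary least squares sketch. The four data points are made up and were chosen to lie exactly on a line with that slope; they are not the study's data, and the function name and variable names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# SYNTHETIC illustration only: people sleeping under ITNs vs. malaria deaths.
itn_users = [0, 10_000, 20_000, 30_000]
deaths = [500, 420, 340, 260]

intercept, slope = fit_line(itn_users, deaths)
# The slope is the change in deaths per additional ITN user;
# multiplying by 1,000 gives the change per 1,000 users.
print(f"change in deaths per 1,000 ITN users: {slope * 1000:.1f}")
```

In a multiple regression the coefficient has the same per-unit reading, with the other predictors held fixed.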
{"title":"Malaria incidence and prevalence: An ecological analysis through Six Sigma approach","authors":"Md. Al-Amin, Kesava Chandran Vijaya Bhaskar, Walaa Enab, Reza Kamali Miab, Jennifer Slavin, Nigar Sultana","doi":"arxiv-2402.02233","DOIUrl":"https://doi.org/arxiv-2402.02233","journal":{"name":"arXiv - STAT - Other Statistics"},"publicationDate":"2024-02-03","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"139767019"}
Chixiang Chen, Michelle Shardell, Jaime Lynn Speiser, Karen Bandeen-Roche, Heather Allore, Thomas G Travison, Michael Griswold, Terrence E. Murphy
Background: Introduced in 2010, the sub-discipline of gerontologic biostatistics (GBS) was conceptualized to address the specific challenges in analyzing data from research studies involving older adults. However, the evolving technological landscape has catalyzed data science and statistical advancements since the original GBS publication, greatly expanding the scope of gerontologic research. There is a need to describe how these advancements enhance the analysis of multi-modal data and complex phenotypes that are hallmarks of gerontologic research. Methods: This paper introduces GBS 2.0, an updated and expanded set of analytical methods reflective of the practice of gerontologic biostatistics in contemporary and future research. Results: GBS 2.0 topics and relevant software resources include cutting-edge methods in experimental design; analytical techniques that include adaptations of machine learning, the quantification of deep phenotypic measurements, and high-dimensional -omics analysis; the integration of information from multiple studies; and strategies to foster reproducibility, replicability, and open science. Discussion: The methodological topics presented here seek to update and expand GBS. By facilitating the synthesis of biostatistics and data science in gerontology, we aim to foster the next generation of gerontologic researchers.
{"title":"Gerontologic Biostatistics 2.0: Developments over 10+ years in the age of data science","authors":"Chixiang Chen, Michelle Shardell, Jaime Lynn Speiser, Karen Bandeen-Roche, Heather Allore, Thomas G Travison, Michael Griswold, Terrence E. Murphy","doi":"arxiv-2402.01112","DOIUrl":"https://doi.org/arxiv-2402.01112","journal":{"name":"arXiv - STAT - Other Statistics"},"publicationDate":"2024-02-02","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"139690246"}
This review article focuses on regularised estimation procedures applicable to geostatistical and spatial econometric models. These methods are particularly relevant in the case of big geospatial data for dimensionality reduction or model selection. To structure the review, we initially consider the most general case of multivariate spatiotemporal processes (i.e., $g > 1$ dimensions of the spatial domain, a one-dimensional temporal domain, and $q \geq 1$ random variables). Then, the idea of regularised/penalised estimation procedures and different choices of shrinkage targets are discussed. Finally, guided by the elements of a mixed-effects model, which allows for a variety of spatiotemporal models, we show different regularisation procedures and how they can be used for the analysis of geo-referenced data, e.g. for selection of relevant regressors, dimensionality reduction of the covariance matrices, detection of conditionally independent locations, or the estimation of a full spatial interaction matrix.
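The idea of penalised estimation with a shrinkage target can be illustrated on a single coefficient. The sketch below assumes the simplest textbook setting (one unpenalised estimate `z`, shrinkage target zero), which is far narrower than the spatiotemporal models the review covers; the function names are hypothetical. The lasso rule minimises `0.5 * (b - z)**2 + lam * abs(b)` and the ridge rule minimises `0.5 * (b - z)**2 + 0.5 * lam * b**2`.

```python
def soft_threshold(z, lam):
    """Lasso update: minimiser of 0.5*(b - z)**2 + lam*|b| over b.

    Moves z toward zero by lam and sets it exactly to zero when |z| <= lam,
    which is what lets an L1 penalty drop irrelevant regressors.
    """
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Ridge update: minimiser of 0.5*(b - z)**2 + 0.5*lam*b**2 over b.

    Scales z toward zero but never makes it exactly zero.
    """
    return z / (1.0 + lam)


# Hypothetical unpenalised estimates for three regressors.
estimates = [3.0, -0.5, -3.0]
print([soft_threshold(z, 1.0) for z in estimates])
print([ridge_shrink(z, 1.0) for z in estimates])
```

The contrast between the two outputs is the reason L1-type penalties are used for selection while L2-type penalties are used for stabilisation: only the former returns exact zeros.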
{"title":"A review of regularised estimation methods and cross-validation in spatiotemporal statistics","authors":"Philipp Otto, Alessandro Fassò, Paolo Maranzano","doi":"arxiv-2402.00183","DOIUrl":"https://doi.org/arxiv-2402.00183","journal":{"name":"arXiv - STAT - Other Statistics"},"publicationDate":"2024-01-31","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"139668263"}
Jing Lin, Per Olof Hedekvist, Nina Mylly, Math Bollen, Jingchun Shen, Jiawei Xiong, Christofer Silfvenius
Traditional lighting source reliability evaluations, often covering just half of a lamp's volume, can misrepresent real-world performance. To overcome these limitations, adopting advanced asset management strategies for a more holistic evaluation is crucial. This paper investigates human-centric and integrative lighting asset management in Swedish public libraries. Through field observations, interviews, and gap analysis, the study highlights a disparity between current lighting conditions and stakeholder expectations, with issues like eye strain suggesting significant improvement potential. We propose a shift towards more dynamic lighting asset management and reliability evaluations, emphasizing continuous enhancement and comprehensive training in human-centric and integrative lighting principles.
{"title":"Human-Centric and Integrative Lighting Asset Management in Public Libraries: Qualitative Insights and Challenges from a Swedish Field Study","authors":"Jing Lin, Per Olof Hedekvist, Nina Mylly, Math Bollen, Jingchun Shen, Jiawei Xiong, Christofer Silfvenius","doi":"arxiv-2401.11000","DOIUrl":"https://doi.org/arxiv-2401.11000","journal":{"name":"arXiv - STAT - Other Statistics"},"publicationDate":"2024-01-19","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"139556295"}
Johan Medrano, Abderrahmane Kheddar, Annick Lesne, Sofiane Ramdani
When nonlinear measures are estimated from finite-length sampled temporal signals, a radius parameter must be carefully selected to avoid a poor estimation. These measures are generally derived from the correlation integral, which quantifies the probability of finding neighbors, i.e., pairs of points spaced by less than the radius parameter. While each nonlinear measure comes with several specific empirical rules to select a radius value, we provide a systematic selection method. We show that the optimal radius for nonlinear measures can be approximated by the optimal bandwidth of a Kernel Density Estimator (KDE) related to the correlation sum. The KDE framework provides non-parametric tools to approximate a density function from finite samples (e.g. histograms) and optimal methods to select a smoothing parameter, the bandwidth (e.g. bin width in histograms). We use results from KDE to derive a closed-form expression for the optimal radius. The latter is used to compute the correlation dimension and to construct recurrence plots yielding an estimate of Kolmogorov-Sinai entropy. We assess our method through numerical experiments on signals generated by nonlinear systems and experimental electroencephalographic time series.
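The two ingredients named in the abstract above, the correlation sum and a bandwidth-style radius, can be sketched as follows. This is not the closed-form radius derived in the paper: as a stand-in it uses the normal reference rule for kernel density bandwidths, `1.06 * std * n**(-1/5)`, and it treats the signal as a list of scalars with no delay embedding. Both choices, along with the function names and the sample values, are assumptions made for illustration.

```python
from math import sqrt


def correlation_sum(points, radius):
    """Fraction of distinct pairs (i < j) closer together than radius."""
    n = len(points)
    close = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(points[i] - points[j]) < radius
    )
    return close / (n * (n - 1) / 2)


def normal_reference_bandwidth(points):
    """Normal reference rule for a KDE bandwidth: 1.06 * std * n**(-1/5).

    ASSUMPTION: used here as a stand-in radius; the paper derives its own
    closed-form expression, which is not reproduced in this sketch.
    """
    n = len(points)
    mean = sum(points) / n
    std = sqrt(sum((p - mean) ** 2 for p in points) / (n - 1))
    return 1.06 * std * n ** (-1 / 5)


# Hypothetical scalar samples.
signal = [0.0, 1.0, 2.0, 4.0]
radius = normal_reference_bandwidth(signal)
print(radius, correlation_sum(signal, radius))
```

A radius that is too small leaves almost no neighboring pairs and a radius that is too large counts nearly all of them, which is why a data-driven rule that shrinks with the sample size is attractive.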
{"title":"Radius selection using kernel density estimation for the computation of nonlinear measures","authors":"Johan Medrano, Abderrahmane Kheddar, Annick Lesne, Sofiane Ramdani","doi":"arxiv-2401.03891","DOIUrl":"https://doi.org/arxiv-2401.03891","journal":{"name":"arXiv - STAT - Other Statistics"},"publicationDate":"2024-01-08","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"139412931"}
This paper studies the quotient geometry of bounded or fixed-rank correlation matrices. The set of bounded-rank correlation matrices is in bijection with a quotient set of a spherical product manifold by an orthogonal group. We show that it admits an orbit space structure and its stratification is determined by the rank of the matrices. Also, the principal stratum has a compatible Riemannian quotient manifold structure. We develop efficient Riemannian optimization algorithms for computing the distance and the weighted Fréchet mean in the orbit space. We prove that any minimizing geodesic in the orbit space has constant rank on the interior of the segment. Moreover, we examine geometric properties of the quotient manifold, including horizontal and vertical spaces, Riemannian metric, injectivity radius, exponential and logarithmic maps, gradient and Hessian.
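The bijection described in the abstract above can be checked numerically in the smallest interesting case. The sketch assumes the usual factorisation: stack unit vectors as the rows of a matrix `Y` (a point on a product of spheres), form the Gram matrix of those rows, and observe that it has unit diagonal and is unchanged when every row is multiplied by the same orthogonal matrix. The dimensions (three rows in the plane, so rank at most 2), the rotation angle, and the helper names are all hypothetical choices.

```python
from math import cos, sin, sqrt


def normalize(v):
    """Scale a vector to unit Euclidean length (a point on a sphere)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gram(rows):
    """Matrix of inner products between all pairs of rows."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def rotate(rows, angle):
    """Right-multiply every 2-D row by one rotation matrix, an element of O(2)."""
    c, s = cos(angle), sin(angle)
    return [[x * c - y * s, x * s + y * c] for x, y in rows]


# Three unit vectors in the plane: a point on a product of three circles.
Y = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 2.0])]
C = gram(Y)                    # 3x3 correlation matrix of rank at most 2
C_rot = gram(rotate(Y, 0.7))   # same matrix from a rotated representative

for row in C:
    print([round(x, 4) for x in row])
```

Because every rotated copy of `Y` yields the same matrix, the correlation matrix corresponds to a whole orbit of the orthogonal group rather than to one particular `Y`, which is the quotient structure the paper studies.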
{"title":"Quotient geometry of bounded or fixed rank correlation matrices","authors":"Hengchao Chen","doi":"arxiv-2401.03126","DOIUrl":"https://doi.org/arxiv-2401.03126","journal":{"name":"arXiv - STAT - Other Statistics"},"publicationDate":"2024-01-06","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"139412935"}