Elena Moltchanova, Miguel Moyers-González, Geertrui Van de Voorde, José Felipe Voloch, Philipp Wacker
In this paper, we consider how probability theory can be used to determine the survival strategy in two challenges from "Squid Game" and "Squid Game: The Challenge": Hopscotch and Warships. We show that Hopscotch can be tackled easily with knowledge of the binomial distribution, which is taught in introductory statistics courses, while Warships is a much more complex problem that can be tackled at different levels.
Title: "How to survive the Squid Games using probability theory". Authors: Elena Moltchanova, Miguel Moyers-González, Geertrui Van de Voorde, José Felipe Voloch, Philipp Wacker. Journal: arXiv - STAT - Other Statistics. Published: 2024-09-09. DOI: https://doi.org/arxiv-2409.05263
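As an illustration of the binomial reasoning mentioned in the abstract, here is a minimal sketch. The game model is an assumption, not taken from the paper: a bridge of 18 steps, each with one safe and one unsafe panel, a lead player who guesses each unknown step with probability 0.5 of being wrong, and followers who remember every revealed panel perfectly. Under that model the total number of wrong guesses is Binomial(18, 0.5), and the player in position k survives when at most k - 1 wrong guesses occur. The function names are hypothetical.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def survival_probability(position, steps=18, p_wrong=0.5):
    """Probability that the player in the given position (1 = first) survives.

    Assumed model: each wrong guess eliminates the current lead player and
    reveals the safe panel, so the player in `position` survives exactly
    when the total number of wrong guesses is at most position - 1.
    """
    return binomial_cdf(position - 1, steps, p_wrong)


if __name__ == "__main__":
    for k in (1, 5, 10, 15):
        print(k, survival_probability(k))
```

Under these assumptions the first player survives only by guessing all 18 steps correctly, while later positions benefit from the panels revealed by earlier players.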
This study introduces a novel approach to forecasting by Tobit Exponential Smoothing with time aggregation constraints. This model, a particular case of the Tobit Innovations State Space system, handles censored observed time series effectively, such as sales data, with known and potentially variable censoring levels over time. The paper provides a comprehensive analysis of the model structure, including its representation in system equations and the optimal recursive estimation of states. It also explores the benefits of time aggregation in state space systems, particularly for inventory management and demand forecasting. Through a series of case studies, the paper demonstrates the effectiveness of the model across various scenarios, including hourly and daily censoring levels. The results highlight the model's ability to produce accurate forecasts and confidence bands comparable to those from uncensored models, even under severe censoring conditions. The study further discusses the implications for inventory policy, emphasizing the importance of avoiding spiral-down effects in demand estimation. The paper concludes by showcasing the superiority of the proposed model over standard methods, particularly in reducing lost sales and excess stock, thereby optimizing inventory costs. This research contributes to the field of forecasting by offering a robust model that effectively addresses the challenges of censored data and time aggregation.
Title: "Censored Data Forecasting: Applying Tobit Exponential Smoothing with Time Aggregation". Authors: Diego J. Pedregal, Juan R. Trapero. Journal: arXiv - STAT - Other Statistics. Published: 2024-09-09. DOI: https://doi.org/arxiv-2409.05412
Bianca-Elena Mihăilă, Marian-Gabriel Hâncean, Matjaž Perc, Jürgen Lerner, Iulian Oană, Marius Geantă, José Luis Molina, Cosmina Cioroboiu
While research on adolescent smoking is extensive, little attention has been given to smoking behaviors among rural middle-aged and older adults. This study examines the role of personal networks and sociodemographic factors in predicting smoking status in a rural Romanian community. Using a link-tracing sampling method, we gathered data from 76 participants out of 83 in Leresti, Arges County. Face-to-face interviews collected sociodemographic data and network information, including smoking status and relational dynamics. We applied multilevel logistic regression models to predict smoking behaviors (current smokers, former smokers, and non-smokers) based on individual characteristics and network influences. Results indicate that social networks significantly influence smoking behaviors. For current smokers, having a smoking family member greatly increased the odds of smoking (OR = 2.51, 95% CI: 1.62, 3.91, p < 0.001). Similarly, non-smoking family members increased the likelihood of being a non-smoker (OR = 1.64, 95% CI: 1.04, 2.61, p < 0.05). Women were less likely to smoke, highlighting sex differences in behavior. These findings emphasize the critical role of social networks in shaping smoking habits, advocating for targeted interventions in rural areas.
Title: "Cross-sectional personal network analysis of adult smoking in rural areas". Authors: Bianca-Elena Mihăilă, Marian-Gabriel Hâncean, Matjaž Perc, Jürgen Lerner, Iulian Oană, Marius Geantă, José Luis Molina, Cosmina Cioroboiu. Journal: arXiv - STAT - Other Statistics. Published: 2024-08-27. DOI: https://doi.org/arxiv-2408.14832
Alina Dubovskaya, Caroline B. Pena, David J. P. O'Sullivan
The dynamics of information diffusion in complex networks is widely studied in an attempt to understand how individuals communicate and how information travels and reaches individuals through interactions. However, complex networks often present community structure, and tools to analyse information diffusion on networks with communities are needed. In this paper, we develop theoretical tools using multi-type branching processes to model and analyse simple contagion information spread on a broad class of networks with community structure. We show how, by using limited information about the network -- the degree distribution within and between communities -- we can calculate standard statistical characteristics of the dynamics of information diffusion, such as the extinction probability, hazard function, and cascade size distribution. These properties can be estimated not only for the entire network but also for each community separately. Furthermore, we estimate the probability of information spreading from one community to another where it is not currently spreading. We demonstrate the accuracy of our framework by applying it to two specific examples: the Stochastic Block Model and a log-normal network with community structure. We show how the initial seeding location affects the observed cascade size distribution on a heavy-tailed network and that our framework accurately captures this effect.
Title: "Modeling information spread across networks with communities using a multitype branching process framework". Authors: Alina Dubovskaya, Caroline B. Pena, David J. P. O'Sullivan. Journal: arXiv - STAT - Other Statistics. Published: 2024-08-08. DOI: https://doi.org/arxiv-2408.04456
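To make the notion of extinction probability concrete, the following is a minimal sketch for a much simpler setting than the paper's: a single-type Galton-Watson branching process with Poisson offspring. The multi-type, community-structured framework of the paper is not reproduced here, and the Poisson offspring law and function names are assumptions made for illustration. The extinction probability is the smallest fixed point of the offspring probability generating function, found here by iterating from 0.

```python
import math


def poisson_pgf(s, mean):
    """Probability generating function of a Poisson(mean) offspring law."""
    return math.exp(mean * (s - 1.0))


def extinction_probability(mean, tol=1e-12, max_iter=10_000):
    """Smallest fixed point q = G(q) of the offspring PGF.

    Iterating q <- G(q) from q = 0 converges monotonically to the
    extinction probability of a cascade started from a single seed.
    """
    q = 0.0
    for _ in range(max_iter):
        q_next = poisson_pgf(q, mean)
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


if __name__ == "__main__":
    for mean in (0.5, 1.5, 2.0):
        print(mean, extinction_probability(mean))
```

When the mean number of offspring is below 1 the cascade dies out with probability 1; above 1 there is a positive chance that it keeps spreading.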
Jose Antonio Roldan-Nofuentes, Saad Bouh Sidaty-Regad
The weighted kappa coefficient of a binary diagnostic test is a measure of the beyond-chance agreement between the diagnostic test and the gold standard, and depends on the sensitivity and specificity of the diagnostic test, on the disease prevalence and on the relative importance between the false positives and the false negatives. This article studies the comparison of the weighted kappa coefficients of two binary diagnostic tests subject to a paired design through confidence intervals. Three asymptotic confidence intervals are studied for the difference between the parameters and five other intervals for the ratio. Simulation experiments were carried out to study the coverage probabilities and the average lengths of the intervals, giving some general rules for application. A method is also proposed to calculate the sample size necessary to compare the two weighted kappa coefficients through a confidence interval. A program in R has been written to solve the problem studied and it is available as supplementary material. The results were applied to a real example of the diagnosis of malaria.
Title: "Asymptotic confidence intervals for the difference and the ratio of the weighted kappa coefficients of two diagnostic tests subject to a paired design". Authors: Jose Antonio Roldan-Nofuentes, Saad Bouh Sidaty-Regad. Journal: arXiv - STAT - Other Statistics. Published: 2024-07-31. DOI: https://doi.org/arxiv-2407.21387
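The dependence of the weighted kappa coefficient on sensitivity, specificity, prevalence and a weighting index can be sketched as follows. The formula used is an assumption here, since the abstract does not state it: it is the commonly cited expression kappa(c) = p q (Se + Sp - 1) / [p (1 - Q) c + q Q (1 - c)], with q = 1 - p and Q = p Se + q (1 - Sp) the probability of a positive test result. The function name and parameter names are hypothetical, and the confidence intervals studied in the paper are not implemented.

```python
def weighted_kappa(se, sp, prevalence, c):
    """Weighted kappa coefficient of a binary diagnostic test (assumed formula).

    se, sp      sensitivity and specificity of the test
    prevalence  disease prevalence p, strictly between 0 and 1
    c           weighting index in [0, 1]
    """
    p = prevalence
    q = 1.0 - p
    # Probability that the test result is positive
    big_q = p * se + q * (1.0 - sp)
    youden = se + sp - 1.0
    denominator = p * (1.0 - big_q) * c + q * big_q * (1.0 - c)
    return p * q * youden / denominator


if __name__ == "__main__":
    for c in (0.0, 0.5, 1.0):
        print(c, weighted_kappa(0.9, 0.8, 0.1, c))
```

A test with Se + Sp = 1 carries no information beyond chance, and under this formula its weighted kappa is 0 for every value of c.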
Jose Antonio Roldan-Nofuentes, Saad Bouh Sidaty-Regad
Positive and negative likelihood ratios are parameters which are used to assess and compare the effectiveness of binary diagnostic tests. Both parameters only depend on the sensitivity and specificity of the diagnostic test and are equivalent to a relative risk. This article studies the comparison of the likelihood ratios of two binary diagnostic tests subject to a paired design through confidence intervals. Six approximate confidence intervals are presented for the ratio of the likelihood ratios, and simulation experiments are carried out to study the coverage probabilities and the average lengths of the intervals considered, and some general rules of application are proposed. A method is also proposed to determine the sample size necessary to estimate the ratio between the likelihood ratios with a determined precision. The results were applied to the diagnosis of coronary artery disease.
Title: "Comparison of the likelihood ratios of two diagnostic tests subject to a paired design: confidence intervals and sample size". Authors: Jose Antonio Roldan-Nofuentes, Saad Bouh Sidaty-Regad. Journal: arXiv - STAT - Other Statistics. Published: 2024-07-31. DOI: https://doi.org/arxiv-2407.21382
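For readers unfamiliar with likelihood ratios, the sketch below computes point estimates from a 2x2 table using the standard definitions LR+ = Se / (1 - Sp) and LR- = (1 - Se) / Sp. It only illustrates the parameters being compared; the six confidence intervals and the sample size method of the paper are not implemented, and the function names and example counts are hypothetical.

```python
def likelihood_ratios(tp, fn, fp, tn):
    """Positive and negative likelihood ratios from 2x2 counts.

    tp, fn  diseased subjects with positive / negative test results
    fp, tn  non-diseased subjects with positive / negative test results
    """
    se = tp / (tp + fn)  # sensitivity
    sp = tn / (tn + fp)  # specificity
    lr_pos = se / (1.0 - sp)
    lr_neg = (1.0 - se) / sp
    return lr_pos, lr_neg


def ratio_of_positive_lrs(counts_test1, counts_test2):
    """Point estimate of LR+ of test 1 divided by LR+ of test 2."""
    lr_pos_1, _ = likelihood_ratios(*counts_test1)
    lr_pos_2, _ = likelihood_ratios(*counts_test2)
    return lr_pos_1 / lr_pos_2


if __name__ == "__main__":
    print(likelihood_ratios(90, 10, 20, 80))
    print(ratio_of_positive_lrs((90, 10, 20, 80), (80, 20, 10, 90)))
```

A ratio above 1 suggests that the first test has the larger positive likelihood ratio, but in a paired design the two estimates are correlated, which is why dedicated interval methods such as those in the paper are needed.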
Predictive values are measures of the clinical accuracy of a binary diagnostic test, and depend on the sensitivity and the specificity of the diagnostic test and on the disease prevalence among the population being studied. This article studies hypothesis tests to simultaneously compare the predictive values of two binary diagnostic tests in the presence of missing data. The hypothesis tests were solved applying two computational methods: the expectation maximization and the supplemented expectation maximization algorithms, and multiple imputation. Simulation experiments were carried out to study the sizes and the powers of the hypothesis tests, giving some general rules of application. Two R programmes were written to apply each method, and they are available as supplementary material for the manuscript. The results were applied to the diagnosis of Alzheimer's disease.
Title: "Computational methods to simultaneously compare the predictive values of two diagnostic tests with missing data: EM-SEM algorithms and multiple imputation". Authors: Jose Antonio Roldan-Nofuentes. Journal: arXiv - STAT - Other Statistics. Published: 2024-07-30. DOI: https://doi.org/arxiv-2407.21190
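The way predictive values depend on sensitivity, specificity and prevalence follows from Bayes' theorem, and can be sketched as below. This illustrates only the parameters themselves; the hypothesis tests, the EM and SEM algorithms and the multiple imputation procedure of the paper are not implemented, and the function name is hypothetical.

```python
def predictive_values(se, sp, prevalence):
    """Positive and negative predictive values via Bayes' theorem.

    PPV = P(disease | positive test), NPV = P(no disease | negative test).
    """
    p = prevalence
    ppv = se * p / (se * p + (1.0 - sp) * (1.0 - p))
    npv = sp * (1.0 - p) / (sp * (1.0 - p) + (1.0 - se) * p)
    return ppv, npv


if __name__ == "__main__":
    # Same test accuracy, two different prevalences
    for prevalence in (0.1, 0.5):
        print(prevalence, predictive_values(0.9, 0.8, prevalence))
```

Holding sensitivity and specificity fixed, a higher prevalence raises the positive predictive value and lowers the negative one, which is why predictive values must be interpreted for the population being studied.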
Foreigners and "estrangeirados", an expression meaning "people who went to a foreign country ["estrangeiro"] to further their education", had a leading role in the development of Mathematical Statistics in Portugal. As far as Statistics is concerned, the "estrangeirados" of the nineteenth century were mainly liberal intellectuals exiled for political reasons. From 1930 onwards, the research funding authority sent university professors abroad and hired foreign researchers to stay in Portuguese institutions, and some of them were instrumental in importing new concepts and methods of inferential statistics. After 1970, there was a huge program to send young researchers abroad for doctoral studies. At the same time, many new universities and polytechnic institutes were created in Portugal. After that, alongside the foreigners who chose to pursue a research career in those institutions and the "estrangeirados" who had returned and created doctoral programs, others who had not had the opportunity to study abroad began to play a decisive role in the development of Statistics in Portugal. The publication of handbooks on Probability and Statistics, of theses and core papers in Portuguese scientific journals, and of works for the layman reveals how Statistics progressed from a descriptive discipline to a mathematical one used for inference in all fields of knowledge, from the natural sciences to the methodology of scientific research.
Title: "How Books Tell a History of Statistics in Portugal: Works of Foreigners, Estrangeirados, and Others". Authors: Dinis Pestana, Rui Santos. Journal: arXiv - STAT - Other Statistics. Published: 2024-07-28. DOI: https://doi.org/arxiv-2407.19433
We undertake extensive analysis of English Premier League data over the period 2009/10 to 2017/18 to identify and rank key factors affecting the economic and footballing performances of the teams. Alternative end-of-season league tables are generated by re-ranking the teams based on five different descriptors - total expenditure, total funds spent on players, total funds spent on foreign players, the ratio of foreign to British players and the overall profit. The unequal distribution of resources and expenditure between the clubs is analyzed through Lorenz curves. A comparative analysis of the differences between the alternative tables and the conventional end-of-season league table establishes the most likely factors to influence the performances of the teams that we also rank using Principal Component Analysis. We find that the top teams in the league are also those that tend to have the highest expenditure overall, for all players, including foreign players; they also have the highest ratios of foreign to British players. Our statistical and machine learning study also indicates that successful performance on the field may not guarantee healthy profits at the end of the season.
Title: "The Impact of Foreign Players in the English Premier League: A Mathematical Analys". Authors: Amit K Chattopadhyay, A. Abdul, Sudhir Jain. Journal: arXiv - STAT - Other Statistics. Published: 2024-07-27. DOI: https://doi.org/arxiv-2407.19285
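The Lorenz curves mentioned in the abstract can be illustrated with a short sketch. It builds the curve from a list of club expenditures and summarises it with the Gini coefficient. The Gini summary is an addition for illustration and is not claimed to be part of the paper's analysis, and the figures in the usage example are made up, not Premier League data.

```python
def lorenz_curve(values):
    """Return Lorenz curve points (population share, cumulative value share).

    Values are sorted in increasing order, so the curve lies on or below
    the line of perfect equality y = x.
    """
    xs = sorted(values)
    total = sum(xs)
    n = len(xs)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / n, running / total))
    return points


def gini(values):
    """Gini coefficient: 1 minus twice the area under the Lorenz curve."""
    pts = lorenz_curve(values)
    area = 0.0
    # Trapezoidal rule over consecutive Lorenz points
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return 1.0 - 2.0 * area


if __name__ == "__main__":
    # Hypothetical expenditures for six clubs, in arbitrary units
    spending = [40, 55, 60, 120, 250, 300]
    print(lorenz_curve(spending))
    print(gini(spending))
```

A Gini coefficient of 0 corresponds to every club spending the same amount, and values closer to 1 indicate that spending is concentrated in a few clubs.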
T. Lucas Makinen, Tom Charnock, Natalia Porqueres, Axel Lapel, Alan Heavens, Benjamin D. Wandelt
In inference problems, we often have domain knowledge which allows us to define summary statistics that capture most of the information content in a dataset. In this paper, we present a hybrid approach, where such physics-based summaries are augmented by a set of compressed neural summary statistics that are optimised to extract the extra information that is not captured by the predefined summaries. The resulting statistics are very powerful inputs to simulation-based or implicit inference of model parameters. We apply this generalisation of Information Maximising Neural Networks (IMNNs) to parameter constraints from tomographic weak gravitational lensing convergence maps to find summary statistics that are explicitly optimised to complement angular power spectrum estimates. We study several dark matter simulation resolutions in low- and high-noise regimes. We show that i) the information-update formalism extracts at least $3\times$ and up to $8\times$ as much information as the angular power spectrum in all noise regimes, ii) the network summaries are highly complementary to existing 2-point summaries, and iii) our formalism allows for networks with smaller, physically-informed architectures to match much larger regression networks with far fewer simulations needed to obtain asymptotically optimal inference.
Title: "Hybrid summary statistics: neural weak lensing inference beyond the power spectrum". Authors: T. Lucas Makinen, Tom Charnock, Natalia Porqueres, Axel Lapel, Alan Heavens, Benjamin D. Wandelt. Journal: arXiv - STAT - Other Statistics. Published: 2024-07-26. DOI: https://doi.org/arxiv-2407.18909