If professional teams could accurately predict the order of their league’s draft, they would have a competitive advantage when using or trading their draft picks. Many experts and enthusiasts publish forecasts of the order in which players are drafted into professional sports leagues, known as mock drafts. Using a novel dataset of mock drafts for the National Basketball Association (NBA), we explore mock drafts’ ability to forecast the actual draft. We analyze authors’ mock draft accuracy over time and ask how we can reasonably aggregate information from multiple authors. For both tasks, mock drafts are usually analyzed as ranked lists, and in this paper, we propose ways to improve on these methods. We propose that rank-biased distance is the appropriate error metric for measuring the accuracy of mock drafts as ranked lists. To best combine information from multiple mock drafts into a single consensus mock draft, we also propose a combination method based on the ideas of ranked-choice voting. We show that this method provides improved forecasts over the standard Borda count combination method used for most similar analyses in sports, and that either combination method provides a more accurate forecast across seasons than any single author.
{"title":"Improving the aggregation and evaluation of NBA mock drafts","authors":"Jared D. Fisher, Colin Montague","doi":"10.1515/jqas-2023-0100","DOIUrl":"https://doi.org/10.1515/jqas-2023-0100","url":null,"abstract":"If professional teams can accurately predict the order of their league’s draft, they would have a competitive advantage when using or trading their draft picks. Many experts and enthusiasts publish forecasts of the order players are drafted into professional sports leagues, known as mock drafts. Using a novel dataset of mock drafts for the National Basketball Association (NBA), we explore mock drafts’ ability to forecast the actual draft. We analyze authors’ mock draft accuracy over time and ask how we can reasonably aggregate information from multiple authors. For both tasks, mock drafts are usually analyzed as ranked lists, and in this paper, we propose ways to improve on these methods. We propose that rank-biased distance is the appropriate error metric for measuring accuracy of mock drafts as ranked lists. To best combine information from multiple mock drafts into a single consensus mock draft, we also propose a combination method based on the ideas of ranked-choice voting. 
We show that this method provides improved forecasts over the standard Borda count combination method used for most similar analyses in sports, and that either combination method provides a more accurate forecast across seasons than any single author.","PeriodicalId":16925,"journal":{"name":"Journal of Quantitative Analysis in Sports","volume":null,"pages":null},"PeriodicalIF":0.8,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
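The two baseline ingredients named in the mock-draft abstract above, Borda count aggregation and a rank-biased distance, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the point scheme (n points for first place in a list of length n), the persistence parameter `p`, and the normalisation of the truncated sum by `1 - p**depth` are all choices made here for the example, and the function names are hypothetical. The ranked-choice method the paper proposes is not reproduced.

```python
from collections import defaultdict


def borda_consensus(mocks):
    """Combine ranked lists with a Borda count.

    Assumption: the player at 0-based position i in a list of length n
    earns n - i points; a player missing from a list earns 0 from it.
    """
    scores = defaultdict(int)
    for mock in mocks:
        n = len(mock)
        for i, player in enumerate(mock):
            scores[player] += n - i
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda player: (-scores[player], player))


def rank_biased_distance(a, b, p=0.9):
    """Truncated, normalised rank-biased distance between two ranked lists.

    Computes 1 - RBO, where RBO = (1 - p) * sum_d p**(d - 1) * A_d and
    A_d is the fraction of items shared by the two top-d prefixes.
    Assumption: the sum stops at the shorter list's length and is divided
    by 1 - p**depth so that identical lists have distance 0.
    """
    depth = min(len(a), len(b))
    seen_a, seen_b = set(), set()
    rbo = 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        agreement = len(seen_a & seen_b) / d
        rbo += (1 - p) * p ** (d - 1) * agreement
    return 1.0 - rbo / (1 - p ** depth)
```

Because the weights `p**(d - 1)` decay with depth, a disagreement near the top of the draft costs more than the same disagreement near the bottom, which is the property that makes this family of metrics attractive for ranked lists.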
In the last decade, the offensive and defensive philosophies employed by teams in the National Basketball Association (NBA) have changed substantially. As a result, most players can no longer be classified into only one of the five traditional positions (PG, SG, SF, PF, C) and instead spend a percentage of their playing time at multiple positions, making positional data compositional. Further, given the desirability of versatile players, an argument can be made that traditional positions themselves are archaic. Using data from the 2016–17, 2017–18, and 2018–19 seasons, I explore how Bayesian hierarchical models can be used to estimate team defensive strength in three ways: first, by considering only players classified by their majority traditional position; second, by using compositional traditional positional data; and third, by using compositional data from modern positions (archetypes) defined by fuzzy k-means clustering. I find that the fuzzy k-means approach leads to a modest improvement in both the root mean squared error and median 95% posterior predictive interval width for the test data, and, more importantly, identifies 11 modern archetypes that, when combined, are correlated with team win total and adjusted team defensive rating. The modern archetype compositions can be used by stakeholders to better understand team defensive strength.
{"title":"A basketball paradox: exploring NBA team defensive efficiency in a positionless game","authors":"Charles South","doi":"10.1515/jqas-2024-0010","DOIUrl":"https://doi.org/10.1515/jqas-2024-0010","url":null,"abstract":"In the last decade, the offensive and defensive philosophies employed by teams in the National Basketball Association (NBA) have changed substantially. As a result, most players can no longer be classified into only one of the five traditional positions (PG, SG, SF, PF, C) and instead spend a percentage of their playing time at multiple positions, making positional data compositional. Further, given the desirability for versatile players, an argument can be made that traditional positions themselves are archaic. Using data from the 2016–17, 2017–18, and 2018–19 seasons, I explore how Bayesian hierarchical models can be used to estimate team defensive strength in three ways. First, only considering players classified by their majority traditional position. Second, by using compositional traditional positional data. Third, using compositional data from modern positions (archetypes) defined by fuzzy k-means clustering. I find that the fuzzy k-means approach leads to a modest improvement in both the root mean squared error and median 95 % posterior predictive interval width for the test data, and, more importantly, identifies 11 modern archetypes that, when combined, are correlated with team win total and adjusted team defensive rating. 
The modern archetype compositions can be used by stakeholders to better understand team defensive strength.","PeriodicalId":16925,"journal":{"name":"Journal of Quantitative Analysis in Sports","volume":null,"pages":null},"PeriodicalIF":0.8,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
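The positionless-basketball abstract above relies on fuzzy k-means to give each player a composition over archetypes rather than a single label. The sketch below shows only the membership step of the standard fuzzy c-means update for one point and fixed centroids. The fuzzifier `m = 2.0`, the Euclidean distance, and the function name are assumptions made for illustration; the paper's features, number of iterations, and tuning are not reproduced here.

```python
import math


def fuzzy_memberships(point, centroids, m=2.0):
    """Membership of one point in each cluster (standard fuzzy c-means).

    u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)), where d_j is the
    Euclidean distance from the point to centroid j. The memberships
    are non-negative and sum to 1, so they form a composition.
    """
    dists = [math.dist(point, c) for c in centroids]
    # A point lying exactly on one or more centroids belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
        for d_j in dists
    ]
```

For example, a point at distance 1 from one centroid and 2 from another gets memberships 0.8 and 0.2 with `m = 2.0`, so the closer archetype dominates without the farther one being discarded.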
Count data play a crucial role in sports analytics, providing valuable insights into various aspects of the game. Models that accurately capture the characteristics of count data are essential for making reliable inferences. In this paper, we propose the use of the Conway–Maxwell–Poisson (CMP) model for analyzing count data in sports. The CMP model offers flexibility in modeling data with different levels of dispersion. Here we consider a bivariate CMP model that models the potential correlation between home and away scores by incorporating a random effect specification. We illustrate the advantages of the CMP model through simulations. We then analyze data from baseball and soccer games before, during, and after the COVID-19 pandemic. The performance of our proposed CMP model matches or outperforms standard Poisson and Negative Binomial models, providing a good fit and an accurate estimation of the observed effects in count data with any level of dispersion. The results highlight the robustness and flexibility of the CMP model in analyzing count data in sports, making it a suitable default choice for modeling a diverse range of count data types in sports, where the data dispersion may vary.
{"title":"Bayesian bivariate Conway–Maxwell–Poisson regression model for correlated count data in sports","authors":"Mauro Florez, Michele Guindani, Marina Vannucci","doi":"10.1515/jqas-2024-0072","DOIUrl":"https://doi.org/10.1515/jqas-2024-0072","url":null,"abstract":"Count data play a crucial role in sports analytics, providing valuable insights into various aspects of the game. Models that accurately capture the characteristics of count data are essential for making reliable inferences. In this paper, we propose the use of the Conway–Maxwell–Poisson (CMP) model for analyzing count data in sports. The CMP model offers flexibility in modeling data with different levels of dispersion. Here we consider a bivariate CMP model that models the potential correlation between home and away scores by incorporating a random effect specification. We illustrate the advantages of the CMP model through simulations. We then analyze data from baseball and soccer games before, during, and after the COVID-19 pandemic. The performance of our proposed CMP model matches or outperforms standard Poisson and Negative Binomial models, providing a good fit and an accurate estimation of the observed effects in count data with any level of dispersion. 
The results highlight the robustness and flexibility of the CMP model in analyzing count data in sports, making it a suitable default choice for modeling a diverse range of count data types in sports, where the data dispersion may vary.","PeriodicalId":16925,"journal":{"name":"Journal of Quantitative Analysis in Sports","volume":null,"pages":null},"PeriodicalIF":1.1,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141919111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
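The flexibility in dispersion mentioned in the CMP abstract above comes from a single extra parameter. A minimal sketch of the univariate CMP probability mass function follows, with P(Y = y) proportional to lam**y / (y!)**nu. The truncation of the normalising sum at `max_terms` and the function name are assumptions for illustration; the bivariate random-effect regression described in the abstract is not implemented here.

```python
import math


def cmp_pmf(y, lam, nu, max_terms=200):
    """Conway-Maxwell-Poisson pmf: lam**y / (y!)**nu / Z(lam, nu).

    nu = 1 recovers the Poisson distribution, nu < 1 gives
    over-dispersion and nu > 1 gives under-dispersion.
    Assumption: Z is approximated by its first `max_terms` terms,
    which is adequate for moderate lam.
    """
    log_lam = math.log(lam)
    # Work in log space to avoid overflow in lam**j and (j!)**nu.
    log_terms = [j * log_lam - nu * math.lgamma(j + 1) for j in range(max_terms)]
    peak = max(log_terms)
    log_z = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return math.exp(y * log_lam - nu * math.lgamma(y + 1) - log_z)
```

Setting `nu=1.0` should reproduce Poisson probabilities, which gives a quick sanity check before fitting anything more elaborate.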
Vincent Renner, Konstantin Görgen, Alexander Woll, Hagen Wäsche, Melanie Schienle
Identifying success factors in football is of sporting and economic interest. However, research in this field for national teams and their competitions is rare despite the popularity of teams and events. Therefore, we analyze data for the UEFA EURO 2020 and, for comparison purposes, the previous tournament in 2016. To mitigate the challenges of perceived multicollinearity and a small sample size, and to identify the relevant variables, we apply the ‘LASSO Cross-fitted Stability-Selection’ algorithm. This approach involves iterative splitting of data, with variables chosen via a ‘least absolute shrinkage and selection operator’ (LASSO) model (Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 58: 267–288) on one half of the observations, while coefficients are estimated on the other half. Subsequently, we inspect the frequency of selection and stability of coefficient estimation for each variable over the repeated samples to identify factors as relevant. In this way, we are able to differentiate generally valid success factors, such as the market value ratio, from on-field variables whose importance is tournament-dependent, such as tackles attempted. As the latter are connected to a team’s tactics, we conclude that their observed relevance is correlated with the results of the associated playing style in the specific tournament. We also show the changing effect of these playing styles on success across tournaments.
{"title":"Success factors in national team football: an analysis of the UEFA EURO 2020","authors":"Vincent Renner, Konstantin Görgen, Alexander Woll, Hagen Wäsche, Melanie Schienle","doi":"10.1515/jqas-2023-0026","DOIUrl":"https://doi.org/10.1515/jqas-2023-0026","url":null,"abstract":"Identifying success factors in football is of sporting and economic interest. However, research in this field for national teams and their competitions is rare despite the popularity of teams and events. Therefore, we analyze data for the UEFA EURO 2020 and, for comparison purposes, the previous tournament in 2016. To mitigate the challenges of perceived multicollinearity and a small sample size, and to identify the relevant variables, we apply the ‘LASSO Cross-fitted Stability-Selection’ algorithm. This approach involves iterative splitting of data, with variables chosen via a ‘least absolute shrinkage and selection operator’ (LASSO) model (Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 58: 267–288) on one half of the observations, while coefficients are estimated on the other half. Subsequently, we inspect the frequency of selection and stability of coefficient estimation for each variable over the repeated samples to identify factors as relevant. By that, we are able to differentiate generally valid success factors such as the market value ratio from on-field variables whose importance is tournament-dependent, e.g. the tackles attempted. As the latter is connected to a team’s tactics, we conclude that their observed relevance is correlated to the results of the linked playing style in the specific tournaments. 
We also show the changing effect of these playing-styles on success across tournaments.","PeriodicalId":16925,"journal":{"name":"Journal of Quantitative Analysis in Sports","volume":null,"pages":null},"PeriodicalIF":0.8,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141737664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We perform an exploratory data analysis on a data-set for the top 16 professional darts players from the 2019 season. We use this data-set to fit player skill models which can then be used in dynamic zero-sum games (ZSGs) that model real-world matches between players. We propose an empirical Bayesian approach based on the Dirichlet-Multinomial (DM) model that overcomes limitations in the data. Specifically, we introduce two DM-based skill models, where the first model borrows strength from other darts players and the second model borrows strength from other regions of the dartboard. We find these DM-based models outperform simpler benchmark models with respect to Brier and Spherical scores, both of which are proper scoring rules. We also show in ZSG settings that the difference between DM-based skill models and the simpler benchmark models is practically significant. Finally, we use our DM-based model to analyze specific situations that arose in real-world darts matches during the 2019 season.
{"title":"An empirical Bayes approach for estimating skill models for professional darts players","authors":"Martin B. Haugh, Chun Wang","doi":"10.1515/jqas-2023-0084","DOIUrl":"https://doi.org/10.1515/jqas-2023-0084","url":null,"abstract":"We perform an exploratory data analysis on a data-set for the top 16 professional darts players from the 2019 season. We use this data-set to fit player skill models which can then be used in dynamic zero-sum games (ZSGs) that model real-world matches between players. We propose an empirical Bayesian approach based on the Dirichlet-Multinomial (DM) model that overcomes limitations in the data. Specifically we introduce two DM-based skill models where the first model borrows strength from other darts players and the second model borrows strength from other regions of the dartboard. We find these DM-based models outperform simpler benchmark models with respect to Brier and Spherical scores, both of which are proper scoring rules. We also show in ZSGs settings that the difference between DM-based skill models and the simpler benchmark models is practically significant. Finally, we use our DM-based model to analyze specific situations that arose in real-world darts matches during the 2019 season.","PeriodicalId":16925,"journal":{"name":"Journal of Quantitative Analysis in Sports","volume":null,"pages":null},"PeriodicalIF":0.8,"publicationDate":"2024-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141610981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
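The "borrowing strength" idea in the darts abstract above can be illustrated with the conjugate Dirichlet-Multinomial update and the multi-category Brier score. This is a sketch under assumptions: the prior pseudo-counts `alpha` are simply passed in, whereas an empirical Bayes procedure would estimate them from pooled data (other players or other board regions), and that estimation step is not shown. The function names are hypothetical.

```python
def dm_posterior_mean(counts, alpha):
    """Posterior mean outcome probabilities under a Dirichlet prior.

    With prior pseudo-counts alpha_k and observed counts n_k, the
    posterior mean is (alpha_k + n_k) / (sum(alpha) + sum(n)).
    Sparse observations are pulled toward the prior, which is how
    the estimate borrows strength from the pooled data behind alpha.
    """
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]


def brier_score(probs, outcome_index):
    """Multi-category Brier score for one observed outcome (lower is better)."""
    return sum(
        (p - (1.0 if k == outcome_index else 0.0)) ** 2
        for k, p in enumerate(probs)
    )
```

With counts `[8, 2, 0]` and prior `[2, 2, 1]`, the third outcome gets probability 1/15 rather than 0, so a region the player has never been observed to hit is still treated as possible.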
The existence of, and explanations for, the home advantage – the benefit a sports team receives when playing at home – have been studied across sports. The majority of research on this topic is limited to individual leagues over short time frames, which hinders extrapolation and a deeper understanding of possible causes. Using nearly two decades of data from the National Football League (NFL), the National Collegiate Athletic Association (NCAA), and high schools from across the United States, we provide a uniform approach to understanding the home advantage in American football. Our findings suggest home advantage is declining in the NFL and the highest levels of collegiate football, but not in amateur football. This increases the possibility that characteristics of the NCAA and NFL, such as travel improvements and instant replay, have helped level the playing field.
{"title":"A comprehensive survey of the home advantage in American football","authors":"Luke Benz, Thompson Bliss, Michael Lopez","doi":"10.1515/jqas-2024-0016","DOIUrl":"https://doi.org/10.1515/jqas-2024-0016","url":null,"abstract":"The existence and justification to the home advantage – the benefit a sports team receives when playing at home – has been studied across sport. The majority of research on this topic is limited to individual leagues in short time frames, which hinders extrapolation and a deeper understanding of possible causes. Using nearly two decades of data from the National Football League (NFL), the National Collegiate Athletic Association (NCAA), and high schools from across the United States, we provide a uniform approach to understanding the home advantage in American football. Our findings suggest home advantage is declining in the NFL and the highest levels of collegiate football, but not in amateur football. This increases the possibility that characteristics of the NCAA and NFL, such as travel improvements and instant replay, have helped level the playing field.","PeriodicalId":16925,"journal":{"name":"Journal of Quantitative Analysis in Sports","volume":null,"pages":null},"PeriodicalIF":0.8,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141576616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We leverage Large Language Models (LLMs) to extract information from scouting report texts and improve predictions of National Hockey League (NHL) draft outcomes. In parallel, we derive statistical features based on a player’s on-ice performance leading up to the draft. These two datasets are then combined using ensemble machine learning models. We find that both on-ice statistics and scouting reports have predictive value; however, combining them leads to the strongest results.
{"title":"Improving NHL draft outcome predictions using scouting reports","authors":"Hubert Luo","doi":"10.1515/jqas-2024-0047","DOIUrl":"https://doi.org/10.1515/jqas-2024-0047","url":null,"abstract":"We leverage Large Language Models (LLMs) to extract information from scouting report texts and improve predictions of National Hockey League (NHL) draft outcomes. In parallel, we derive statistical features based on a player’s on-ice performance leading up to the draft. These two datasets are then combined using ensemble machine learning models. We find that both on-ice statistics and scouting reports have predictive value, however combining them leads to the strongest results.","PeriodicalId":16925,"journal":{"name":"Journal of Quantitative Analysis in Sports","volume":null,"pages":null},"PeriodicalIF":0.8,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper attempts to identify football players who have a similar style to a player of interest. Playing style is not adequately quantified with traditional statistics, and therefore style statistics are created using tracking data. Tracking data allow us to monitor players throughout a match, and therefore include both “on-the-ball” and “off-the-ball” observations. Having developed style features, tractable discrepancy measures are introduced that are based on Kullback–Leibler divergence in the context of multivariate normal distributions. Examples are provided where a pool of players from the Chinese Super League are identified as having a playing style that is similar to players of interest.
{"title":"Comparison of individual playing styles in football","authors":"Tianyu Guan, Sumit Sarkar, Tim B. Swartz","doi":"10.1515/jqas-2024-0041","DOIUrl":"https://doi.org/10.1515/jqas-2024-0041","url":null,"abstract":"This paper attempts to identify football players who have a similar style to a player of interest. Playing style is not adequately quantified with traditional statistics, and therefore style statistics are created using tracking data. Tracking data allow us to monitor players throughout a match, and therefore include both “on-the-ball” and “off-the-ball” observations. Having developed style features, tractable discrepancy measures are introduced that are based on Kullback–Leibler divergence in the context of multivariate normal distributions. Examples are provided where a pool of players from the Chinese Super League are identified as having a playing style that is similar to players of interest.","PeriodicalId":16925,"journal":{"name":"Journal of Quantitative Analysis in Sports","volume":null,"pages":null},"PeriodicalIF":0.8,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141098293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
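The playing-style abstract above describes its discrepancy measures as tractable because the Kullback–Leibler divergence has a closed form for multivariate normals. As a reminder of that standard result, suppose (as notation assumed here, not taken from the paper) that the k-dimensional style features of players i and j are modelled as N(μ_i, Σ_i) and N(μ_j, Σ_j):

```latex
\mathrm{KL}\!\left(\mathcal{N}(\mu_i,\Sigma_i)\,\middle\|\,\mathcal{N}(\mu_j,\Sigma_j)\right)
= \frac{1}{2}\left[
    \operatorname{tr}\!\left(\Sigma_j^{-1}\Sigma_i\right)
    + (\mu_j-\mu_i)^{\top}\Sigma_j^{-1}(\mu_j-\mu_i)
    - k
    + \ln\frac{\det\Sigma_j}{\det\Sigma_i}
\right]
```

The divergence is not symmetric in i and j, so a similarity ranking built on it depends on which player is treated as the reference. Averaging the two directions is one common way to symmetrise it; whether the paper does so is not stated in the abstract.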
Multi-competitor races often feature complicated within-race strategies that are difficult to capture when training models on race-outcome-level data. Models which do not account for race-level strategy may suffer from confounded inferences and predictions. We develop a generative model for multi-competitor races which explicitly models race-level effects like drafting and separates strategy from competitor ability. The model allows one to simulate full races from any real or created starting position, opening new avenues for attributing value to within-race actions and performing counterfactual analyses. This methodology is sufficiently general to apply to any track-based multi-competitor race where tracking data are available and competitor movement is well described by simultaneous forward and lateral movements. We apply this methodology to one-mile horse races using frame-level tracking data provided by the New York Racing Association (NYRA) and the New York Thoroughbred Horsemen’s Association (NYTHA) for the Big Data Derby 2022 Kaggle Competition. We demonstrate how this model can yield new inferences, such as the estimation of horse-specific speed profiles, and give examples of posterior predictive counterfactual simulations that answer questions of interest, such as how starting lane affects race outcomes.
{"title":"A generative approach to frame-level multi-competitor races","authors":"Tyrel Stokes, Gurashish Bagga, Kimberly Kroetch, Brendan Kumagai, Liam Welsh","doi":"10.1515/jqas-2023-0091","DOIUrl":"https://doi.org/10.1515/jqas-2023-0091","url":null,"abstract":"Multi-competitor races often feature complicated within-race strategies that are difficult to capture when training data on race outcome level data. Models which do not account for race-level strategy may suffer from confounded inferences and predictions. We develop a generative model for multi-competitor races which explicitly models race-level effects like drafting and separates strategy from competitor ability. The model allows one to simulate full races from any real or created starting position opening new avenues for attributing value to within-race actions and performing counter-factual analyses. This methodology is sufficiently general to apply to any track based multi-competitor races where both tracking data is available and competitor movement is well described by simultaneous forward and lateral movements. We apply this methodology to one-mile horse races using frame-level tracking data provided by the New York Racing Association (NYRA) and the New York Thoroughbred Horsemen’s Association (NYTHA) for the Big Data Derby 2022 Kaggle Competition. 
We demonstrate how this model can yield new inferences, such as the estimation of horse-specific speed profiles and examples of posterior predictive counterfactual simulations to answer questions of interest such as starting lane impacts on race outcomes.","PeriodicalId":16925,"journal":{"name":"Journal of Quantitative Analysis in Sports","volume":null,"pages":null},"PeriodicalIF":0.8,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141145809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}