Predicting Win-Loss Outcomes in MLB Regular Season Games: A Comparative Study Using Data Mining Methods

International Journal of Computer Science in Sport (JCR Q2, Computer Science), vol. 15, no. 1, pp. 91-112. Pub Date: 2016-12-01. DOI: 10.1515/IJCSS-2016-0007

Soto Valero


Citations: 1

Abstract




Baseball is a statistics-rich sport, and predicting the winner of a particular Major League Baseball (MLB) game is an interesting and challenging task. So far there is no definitive formula for determining which factors lead a team to victory, but analyzing many years of historical records can reveal trends. Recent studies have concentrated on using and generating new statistics, called sabermetrics, to rank teams and players according to their perceived strengths and then applying these rankings to forecast specific games. In this paper, we use sabermetric statistics to assess the predictive capabilities of four data mining methods (classification and regression based) for predicting outcomes (win or loss) in MLB regular season games. Our modeling approach uses only past data when making a prediction, drawing on ten years of publicly available data. We create a dataset of cumulative sabermetric statistics for each MLB team over this period, constructed so that data contamination is not possible. The inherent difficulty of this prediction task is confirmed using two geometry- or topology-based measures of data complexity. Results reveal that the classification scheme forecasts game outcomes better than the regression scheme and that, of the four data mining methods used, SVMs produce the best results, with a mean prediction accuracy of nearly 60% for each team. The model is evaluated using stratified 10-fold cross-validation.
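Two ideas in the abstract lend themselves to a small illustration: features built only from past games, and stratified k-fold splitting. The sketch below is not the author's implementation. It uses only the Python standard library, the function names `prior_win_rates` and `stratified_folds` are hypothetical, and a plain running win rate stands in for the sabermetric statistics, which the abstract does not enumerate.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def prior_win_rates(results: Sequence[Tuple[str, int]]) -> List[float]:
    """For each game (team, won) in chronological order, return the team's
    win rate over its strictly earlier games (0.5 when it has none)."""
    wins: Dict[str, int] = defaultdict(int)
    played: Dict[str, int] = defaultdict(int)
    features: List[float] = []
    for team, won in results:
        # The feature is computed BEFORE the current outcome is recorded,
        # so a game's own label never leaks into its feature.
        features.append(wins[team] / played[team] if played[team] else 0.5)
        wins[team] += won
        played[team] += 1
    return features


def stratified_folds(labels: Sequence[int], k: int) -> List[List[int]]:
    """Split sample indices into k folds with near-equal class proportions."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        # Deal each class round-robin so every fold gets a similar share.
        for position, index in enumerate(by_class[label]):
            folds[position % k].append(index)
    return folds


if __name__ == "__main__":
    # Hypothetical toy season: (team, 1 if the team won else 0).
    season = [("NYY", 1), ("NYY", 0), ("BOS", 1), ("NYY", 1)]
    print(prior_win_rates(season))
    print(stratified_folds([1, 0, 1, 0, 1, 0, 1, 0], k=2))
```

With k=10, `stratified_folds` gives the kind of split that stratified 10-fold cross-validation relies on: each fold serves once as the test set while the classifier is trained on the other nine. This version deals indices in input order without shuffling, which keeps the example deterministic.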
