Tiago Pinho da Silva, A. R. Parmezan, Gustavo E. A. P. A. Batista
A Graph-Based Spatial Cross-Validation Approach for Assessing Models Learned with Selected Features to Understand Election Results
2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 909-915
Published: 2021-12-01
DOI: https://doi.org/10.1109/ICMLA52953.2021.00150
Citations: 1
Abstract
Elections are complex activities fundamental to any democracy. The contextualized analysis of election data allows us to understand electoral behavior and the factors that influence it. Multidisciplinary studies have prioritized the predictive modeling of electoral features from thousands of explanatory features, considering the geographic and spatial aspects inherent to the data. A model built for such a purpose must be rigorously evaluated to understand its prediction error on future test cases. Although cross-validation is a widely used procedure for this task, it leads to optimistic results because the resampling does not ensure spatial independence between test and training data. On the other hand, alternatives that deal with spatial dependence may fall into a pessimistic scenario by assuming total spatial independence between the test and training sets regardless of the size of the test set, increasing the probability of overfitting. This paper addresses these issues by proposing a graph-based spatial cross-validation approach to assess models learned with selected features from spatially contextualized electoral datasets. Our approach takes advantage of the spatial graph structure provided by lattice-type spatial objects to define a local training set for each test fold. We generate the local training sets by removing both spatially close data that are highly correlated with the test fold and distant data that are irrelevant and may interfere with error estimates. Experiments involving the second round of the 2018 Brazilian presidential election demonstrate that our approach contributes to the fair evaluation of models by enabling more realistic and local modeling.
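The idea of a local training set can be illustrated with a minimal Python sketch. It is not the procedure from the paper: the abstract does not state how "spatially close" and "distant" are decided, so the sketch assumes both are measured in hops on the adjacency graph of the spatial units. The function names and the thresholds `buffer_hops` and `max_hops` are hypothetical.

```python
from collections import deque


def hop_distances(adjacency, sources):
    """Multi-source BFS: number of hops from each reachable node to the nearest source."""
    dist = {node: 0 for node in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def local_training_set(adjacency, test_fold, buffer_hops, max_hops):
    """Keep units that are more than buffer_hops but at most max_hops from the test fold.

    Assumption: hop counts stand in for whatever correlation and relevance
    criteria the actual method uses. Units within buffer_hops are dropped as
    too close; units beyond max_hops (or unreachable) are dropped as too distant.
    """
    dist = hop_distances(adjacency, test_fold)
    return sorted(n for n, d in dist.items() if buffer_hops < d <= max_hops)


# Toy lattice: seven units on a line, 0-1-2-3-4-5-6.
adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}

train = local_training_set(adjacency, test_fold=[0], buffer_hops=1, max_hops=4)
print(train)  # [2, 3, 4]
```

With unit 0 as the test fold, unit 1 is discarded as an immediate neighbor, units 5 and 6 are discarded as too far away, and units 2 to 4 form the local training set. In a real setting the adjacency would come from the shared borders of the spatial units (for example, municipalities).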