Missing values handling for machine learning portfolios
Andrew Y. Chen, Jack McCoy
Journal of Financial Economics, Volume 155, Article 103815. Published 2024-03-08.
DOI: 10.1016/j.jfineco.2024.103815
URL: https://www.sciencedirect.com/science/article/pii/S0304405X24000382
Citation count: 0
Abstract
We characterize the structure and origins of missingness for 159 cross-sectional return predictors and study missing value handling for portfolios constructed using machine learning. Simply imputing with cross-sectional means performs well compared to rigorous expectation-maximization methods. This stems from three facts about predictor data: (1) missingness occurs in large blocks organized by time, (2) cross-sectional correlations are small, and (3) missingness tends to occur in blocks organized by the underlying data source. As a result, observed data provide little information about missing data. Sophisticated imputations introduce estimation noise that can lead to underperformance if machine learning is not carefully applied.
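As a rough illustration of what imputing with cross-sectional means can look like, the sketch below fills each missing predictor value with the average of that predictor across all stocks observed on the same date. It is a minimal example and not the authors' code: the nested-dictionary layout, the function name `impute_cross_sectional_mean`, and the predictor names in the usage note are assumptions made for illustration only.

```python
import math


def _is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_cross_sectional_mean(panel):
    """Fill missing predictor values with the same-date cross-sectional mean.

    `panel` is a hypothetical layout: date -> stock id -> predictor -> value.
    Returns a new nested dict; the input is not modified.
    """
    imputed = {}
    for date, stocks in panel.items():
        # Gather the observed values of each predictor across stocks on this date.
        observed = {}
        for predictors in stocks.values():
            for name, value in predictors.items():
                values = observed.setdefault(name, [])
                if not _is_missing(value):
                    values.append(value)

        # A predictor missing for every stock on a date has no mean to use,
        # so it stays None here (this is the block-by-time case).
        means = {
            name: (sum(values) / len(values) if values else None)
            for name, values in observed.items()
        }

        imputed[date] = {
            stock: {
                name: (means[name] if _is_missing(value) else value)
                for name, value in predictors.items()
            }
            for stock, predictors in stocks.items()
        }
    return imputed
```

For example, given a single date where stock `A` lacks a hypothetical `mom` value and stocks `B` and `C` report 0.5 and 1.5, the function fills `A` with 1.0. If predictors are standardized to mean zero within each date beforehand, this procedure amounts to filling missing values with zero.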
About the journal:
The Journal of Financial Economics provides a specialized forum for the publication of research in the area of financial economics and the theory of the firm, placing primary emphasis on the highest quality analytical, empirical, and clinical contributions in the following major areas: capital markets, financial institutions, corporate finance, corporate governance, and the economics of organizations.