{"title":"DimBoost:提升梯度提升决策树到更高的维度","authors":"Jiawei Jiang, B. Cui, Ce Zhang, Fangcheng Fu","doi":"10.1145/3183713.3196892","DOIUrl":null,"url":null,"abstract":"Gradient boosting decision tree (GBDT) is one of the most popular machine learning models widely used in both academia and industry. Although GBDT has been widely supported by existing systems such as XGBoost, LightGBM, and MLlib, one system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets of the dimension up to 330K, we observed suboptimal performance for all these aforementioned systems. In this paper, we ask \"Can we build a scalable GBDT training system whose performance scales better with respect to dimensionality of the data?\" The first contribution of this paper is a careful investigation of existing systems by developing a performance model with respect to the dimensionality of the data. We find that the collective communication operations in many existing systems only implement the algorithm designed for small messages. By just fixing this problem, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations to further optimize the performance of collective communications. These optimizations include a task scheduler, a two-phase split finding method, and low-precision gradient histograms. Our third contribution is a sparsity-aware algorithm to build gradient histograms and a novel index structure to build histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"20 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":"{\"title\":\"DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions\",\"authors\":\"Jiawei Jiang, B. Cui, Ce Zhang, Fangcheng Fu\",\"doi\":\"10.1145/3183713.3196892\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Gradient boosting decision tree (GBDT) is one of the most popular machine learning models widely used in both academia and industry. Although GBDT has been widely supported by existing systems such as XGBoost, LightGBM, and MLlib, one system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets of the dimension up to 330K, we observed suboptimal performance for all these aforementioned systems. In this paper, we ask \\\"Can we build a scalable GBDT training system whose performance scales better with respect to dimensionality of the data?\\\" The first contribution of this paper is a careful investigation of existing systems by developing a performance model with respect to the dimensionality of the data. We find that the collective communication operations in many existing systems only implement the algorithm designed for small messages. By just fixing this problem, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations to further optimize the performance of collective communications. These optimizations include a task scheduler, a two-phase split finding method, and low-precision gradient histograms. 
Our third contribution is a sparsity-aware algorithm to build gradient histograms and a novel index structure to build histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.\",\"PeriodicalId\":20430,\"journal\":{\"name\":\"Proceedings of the 2018 International Conference on Management of Data\",\"volume\":\"20 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-05-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"34\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 International Conference on Management of Data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3183713.3196892\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3183713.3196892","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions
Gradient boosting decision tree (GBDT) is one of the most popular machine learning models, widely used in both academia and industry. Although GBDT is supported by existing systems such as XGBoost, LightGBM, and MLlib, a system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets with dimensionality up to 330K, we observed suboptimal performance for all of these systems. In this paper, we ask: "Can we build a scalable GBDT training system whose performance scales better with respect to the dimensionality of the data?" The first contribution of this paper is a careful investigation of existing systems, conducted by developing a performance model with respect to the dimensionality of the data. We find that the collective communication operations in many existing systems only implement the algorithm designed for small messages. By fixing this problem alone, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations that further improve the performance of collective communications, including a task scheduler, a two-phase split finding method, and low-precision gradient histograms. Our third contribution is a sparsity-aware algorithm for building gradient histograms, together with a novel index structure for building histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.
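The sparsity-aware idea mentioned above can be pictured with a short sketch: instead of scanning every (instance, feature) cell, histogram construction visits only the non-zero entries of a sparse feature matrix, so the cost scales with the number of non-zeros rather than with instances times dimensions. The Python sketch below is a minimal illustration under assumed data layouts (a CSR matrix and pre-binned feature values); it is not DimBoost's actual implementation, and its index structure and parallel construction scheme are described in the paper itself.

```python
# Illustrative sketch of sparsity-aware gradient histogram construction for GBDT.
# Assumptions (not from the paper): features are stored in a SciPy CSR matrix,
# and each non-zero value has a pre-computed quantile bin id aligned with X.data.
import numpy as np
from scipy.sparse import csr_matrix

def build_sparse_histograms(X: csr_matrix, grad, hess, bin_ids, num_bins):
    """Accumulate per-feature first/second-order gradient histograms,
    touching only the non-zero entries of X."""
    num_features = X.shape[1]
    hist_grad = np.zeros((num_features, num_bins))
    hist_hess = np.zeros((num_features, num_bins))

    indptr, indices = X.indptr, X.indices
    for i in range(X.shape[0]):                    # one pass over instances
        g, h = grad[i], hess[i]
        for k in range(indptr[i], indptr[i + 1]):  # non-zeros of instance i only
            f = indices[k]                         # feature id
            b = bin_ids[k]                         # histogram bin of this value
            hist_grad[f, b] += g
            hist_hess[f, b] += h
    return hist_grad, hist_hess
```

In this style of construction, the work per tree node is proportional to the number of non-zero feature values at that node, which is what makes histogram building tractable when the dimensionality reaches the hundreds of thousands while most entries are zero.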