Price Prediction Using Web Scraping and Machine Learning Algorithms in the Used Car Market

Seda Yilmaz, Ihsan Hakan Selvi
Sakarya University Journal of Computer and Information Sciences. Published 2023-08-28. DOI: 10.35377/saucis...1309103

Abstract

The growth of technology increases data traffic and data volume day by day, which makes collecting and interpreting data increasingly important. This study aims to analyze car sales data collected with web scraping techniques using machine learning algorithms and to build a price estimation model. The data were collected using Selenium and BeautifulSoup and prepared for analysis through various preprocessing steps. Lasso regression and principal component analysis (PCA) were used for feature selection and dimensionality reduction, and the GridSearchCV method was used for hyperparameter tuning. Random Forest, K-Nearest Neighbor, Gradient Boost, AdaBoost, Support Vector, and XGBoost regression algorithms were used in the analysis. The results were evaluated using Mean Square Error (MSE), Root Mean Square Error (RMSE), and the Coefficient of Determination (R²). For data set 1, the best model was XGBoost Regression, with an R² of 0.973, an MSE of 0.026, and an RMSE of 0.161. For data set 2, the best model was K-Nearest Neighbor Regression, with an R² of 0.978, an MSE of 0.021, and an RMSE of 0.145.
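
The three evaluation metrics named in the abstract can be sketched in a few lines of plain Python. This is only an illustration of the standard definitions, not the authors' code: the function name `regression_metrics` and the sample values are hypothetical, and the abstract does not say how the targets were scaled.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Residual sum of squares and its mean (MSE).
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    rmse = math.sqrt(mse)
    # Total sum of squares around the mean of the true values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2


# Hypothetical true and predicted values, for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")

# RMSE is the square root of MSE, so the reported pairs can be cross-checked.
reported = {"data set 1": (0.026, 0.161), "data set 2": (0.021, 0.145)}
consistent = {
    name: round(math.sqrt(m), 3) == r for name, (m, r) in reported.items()
}
print(consistent)
```

Because RMSE is defined as the square root of MSE, the last few lines check that each reported RMSE matches the square root of the reported MSE to three decimal places, which holds for both data sets.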