Describe the house and I will tell you the price: House price prediction with textual description data

IF 1.9 | Zone 3, Computer Science | Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Natural Language Engineering | Pub Date: 2023-07-18 | DOI: 10.1017/s1351324923000360

Han Zhang, Yansong Li, Paula Branco

Abstract

House price prediction is an important problem whose solution could benefit both home buyers and sellers. Traditional house price prediction models use numerical attributes, such as the number of rooms, but disregard the house description text. Recent developments in text processing suggest that such descriptions can be valuable attributes, which motivated us to use them. This paper focuses on the house asking (advertised) price and studies the impact of using house description texts to predict the final house price. To this end, we collected a large and diverse set of attributes from house postings, including the advertised price. We then compare performance across three scenarios: using only the house description, only the numeric attributes, or both. We process the description text with three word embedding techniques (TF-IDF, Word2Vec, and BERT) and train four regression algorithms on textual data only, non-textual data only, or both. Our results show that using the description data alone, with Word2Vec and a deep learning model, achieves good performance. However, the best overall performance is obtained when using both textual and non-textual features. The deep learning model using only description data achieves an $R^2$ of 0.7904 on the test data, which indicates that the house description text alone is a strong predictor of the house price. In terms of RMSE on the test data, however, the best model was gradient boosting using both numeric and description data. Overall, combining textual and non-textual features improves the learned model and outperforms using either feature type alone. We also provide a freely available application for house price prediction that relies solely on a house text description and uses our final model, built with Word2Vec and deep learning, to predict the house price.
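The three-scenario comparison described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' pipeline: the listings, room counts, and prices below are invented toy data, TF-IDF is computed by hand in one common smoothed variant, and a 1-nearest-neighbour regressor stands in for the four regression algorithms, which the abstract does not fully name. All function names (`fit_idf`, `tfidf_transform`, `predict_1nn`, `rmse`, `r2`) are hypothetical. The sketch only shows how the text-only, numeric-only, and combined feature sets could be built and how RMSE and $R^2$ are computed on held-out data.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn smoothed inverse document frequencies from the training texts."""
    n_docs = len(docs)
    doc_freq = Counter(tok for d in docs for tok in set(d.lower().split()))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in sorted(doc_freq.items())}


def tfidf_transform(docs, idf):
    """Map each text to a TF-IDF vector over the training vocabulary."""
    rows = []
    for d in docs:
        toks = d.lower().split()
        counts = Counter(toks)
        # Relative term frequency times IDF; unseen test words are ignored.
        rows.append([counts[tok] / max(len(toks), 1) * w for tok, w in idf.items()])
    return rows


def predict_1nn(train_X, train_y, test_X):
    """Stand-in regressor (assumption): copy the target of the closest training row."""
    preds = []
    for x in test_X:
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_X]
        preds.append(train_y[dists.index(min(dists))])
    return preds


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented toy listings (prices in thousands); not data from the paper.
train_desc = [
    "bright two bedroom condo near transit",
    "spacious four bedroom house with large garden",
    "cozy studio apartment downtown",
    "luxury five bedroom house with pool and garden",
]
train_rooms = [[2.0], [4.0], [1.0], [5.0]]
train_price = [400.0, 750.0, 250.0, 1200.0]

test_desc = ["four bedroom house with garden", "small studio near transit"]
test_rooms = [[4.0], [1.0]]
test_price = [800.0, 270.0]

# Fit the text representation on training data only, then transform both splits.
idf = fit_idf(train_desc)
train_text = tfidf_transform(train_desc, idf)
test_text = tfidf_transform(test_desc, idf)

# The three scenarios: description only, numeric only, or both concatenated.
scenarios = {
    "description only": (train_text, test_text),
    "numeric only": (train_rooms, test_rooms),
    "both": ([t + n for t, n in zip(train_text, train_rooms)],
             [t + n for t, n in zip(test_text, test_rooms)]),
}

results = {}
for name, (X_train, X_test) in scenarios.items():
    preds = predict_1nn(X_train, train_price, X_test)
    results[name] = (rmse(test_price, preds), r2(test_price, preds))
    print(f"{name}: RMSE={results[name][0]:.1f}, R^2={results[name][1]:.3f}")
```

In a realistic setting the numeric columns would typically be scaled before concatenation so that neither feature group dominates the distance computation; that step is omitted here for brevity.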

Source journal

Natural Language Engineering (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 5.90

Self-citation rate: 12.00%

Articles published: not listed

Review time: >12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics (from text analysis, machine translation, information retrieval, and speech analysis and generation to integrated systems and multimodal interfaces), it also publishes special issues on specific areas and technologies within these topics, an industry watch column, and book reviews.