ATPM-REAP: A Simple and Efficient Address Tracking and Parsing for Vietnamese Real Estate Advertisement Posts

2022 14th International Conference on Knowledge and Systems Engineering (KSE). Pub Date: 2022-10-19. DOI: 10.1109/KSE56063.2022.9953770

Binh T. Nguyen, Tung Tran Nguyen Doan, S. T. Huynh, An Tran-Hoai Le, An Trong Nguyen, K. Tran, N. Ho, Trung T. Nguyen, Dang T. Huynh



Abstract

Real estate is a large and essential sector in many countries. Information extracted from real estate advertisement posts can help in understanding market conditions and uncovering further insights, particularly for the Vietnamese market. Among the attributes that describe a property, the address (location) is required. However, addresses in Vietnam are written in many different ways, so detecting the text that represents the address in a real estate advertisement post is both essential and challenging. This paper investigates address detection and parsing for the Vietnamese language. First, we create a dataset of real estate advertisements annotated with 16 attributes (entities) per property, assigning the correct label to each entity during data annotation. Then, we propose a practical approach that locates possible addresses within an advertisement post and parses the localized address text into four levels of address information: City/Province, District/Town, Ward, and Street. The experimental results indicate that the $\mathrm{PhoBERT}_{\mathrm{base}}$ model achieves the best performance, with an F1-score of 0.8195. Finally, we compare our proposed method with other approaches; it achieves the highest accuracy at every level: City/Province (0.952), District/Town (0.9482), Ward (0.9225), and Street (0.8994), with a combined accuracy of 0.8367 for correctly detecting all four levels.
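The per-level and combined accuracy figures reported in the abstract can be illustrated with a short sketch. This is not the authors' evaluation code: it assumes that a level counts as correct when the predicted string exactly matches the gold string, and that the combined accuracy counts a post as correct only when all four levels match. The function name `level_accuracies`, the field names, and the sample addresses are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical field names for the four address levels named in the abstract.
LEVELS = ("city_province", "district_town", "ward", "street")

Address = Dict[str, Optional[str]]


def level_accuracies(predicted: List[Address], gold: List[Address]) -> Dict[str, float]:
    """Return exact-match accuracy per level plus an all-levels 'combined' score."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    total = len(gold)
    scores: Dict[str, float] = {}

    # Accuracy for each level on its own.
    for level in LEVELS:
        correct = sum(p.get(level) == g.get(level) for p, g in zip(predicted, gold))
        scores[level] = correct / total

    # Combined accuracy: a post counts only if every level matches.
    all_correct = sum(
        all(p.get(level) == g.get(level) for level in LEVELS)
        for p, g in zip(predicted, gold)
    )
    scores["combined"] = all_correct / total
    return scores


# Made-up sample data, for illustration only.
gold_examples = [
    {"city_province": "Hồ Chí Minh", "district_town": "Quận 1",
     "ward": "Bến Nghé", "street": "Lê Lợi"},
    {"city_province": "Hà Nội", "district_town": "Cầu Giấy",
     "ward": "Dịch Vọng", "street": "Xuân Thủy"},
]
predicted_examples = [
    {"city_province": "Hồ Chí Minh", "district_town": "Quận 1",
     "ward": "Bến Nghé", "street": "Lê Lợi"},
    {"city_province": "Hà Nội", "district_town": "Cầu Giấy",
     "ward": "Dịch Vọng", "street": None},  # street missed by the parser
]

if __name__ == "__main__":
    for name, value in level_accuracies(predicted_examples, gold_examples).items():
        print(f"{name}: {value:.4f}")
```

Because the combined score requires every level to be right at once, it can never exceed the lowest per-level accuracy, which is consistent with the reported 0.8367 falling below the Street accuracy of 0.8994.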

