ATPM-REAP: A Simple and Efficient Address Tracking and Parsing for Vietnamese Real Estate Advertisement Posts

Binh T. Nguyen, Tung Tran Nguyen Doan, S. T. Huynh, An Tran-Hoai Le, An Trong Nguyen, K. Tran, N. Ho, Trung T. Nguyen, Dang T. Huynh
{"title":"ATPM-REAP: A Simple and Efficient Address Tracking and Parsing for Vietnamese Real Estate Advertisement Posts","authors":"Binh T. Nguyen, Tung Tran Nguyen Doan, S. T. Huynh, An Tran-Hoai Le, An Trong Nguyen, K. Tran, N. Ho, Trung T. Nguyen, Dang T. Huynh","doi":"10.1109/KSE56063.2022.9953770","DOIUrl":null,"url":null,"abstract":"Real estate is an enormous and essential field in many countries. Taking advantage of helpful information from real estate advertisement posts can help better understand the market condition and explore other vital insights, especially for the Vietnamese market. It is worth noting that in the representative information of real estate, the address or the location is required information. However, there are different ways to write down the address information in Vietnam. For this reason, detecting the relevant text representing the address information from real estate advertisement posts becomes an essential and challenging task. This paper investigates the address detecting and parsing task for the Vietnamese language. First, we create a dataset of real estate advertisements having 16 different attributes (entities) of each real estate and assign the correct label for each entity detected during the data annotation process. Then, we propose a practical approach for detecting locations of possible addresses inside one specific real estate advertisement post and then extract the localized address text into four different levels of the address information: City/Province, District/Town, Ward, and Street. The experiment results indicate that the ${\\mathrm {PhoBERT}}_{bas\\mathrm{e}}$ model achieves the best performance with an F1-score of 0.8195. Finally, we compare our proposed method with other approaches and achieve the highest accuracy results for all levels as follows: City/Province (0.952), District/Town (0.9482), Ward (0.9225), Street (0.8994), and the combined accuracy of correctly detecting all four levels is 0.8367.","PeriodicalId":330865,"journal":{"name":"2022 14th International Conference on Knowledge and Systems Engineering (KSE)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 14th International Conference on Knowledge and Systems Engineering (KSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/KSE56063.2022.9953770","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Real estate is an enormous and essential field in many countries. Taking advantage of helpful information from real estate advertisement posts can help better understand the market condition and explore other vital insights, especially for the Vietnamese market. It is worth noting that in the representative information of real estate, the address or the location is required information. However, there are different ways to write down the address information in Vietnam. For this reason, detecting the relevant text representing the address information from real estate advertisement posts becomes an essential and challenging task. This paper investigates the address detecting and parsing task for the Vietnamese language. First, we create a dataset of real estate advertisements having 16 different attributes (entities) of each real estate and assign the correct label for each entity detected during the data annotation process. Then, we propose a practical approach for detecting locations of possible addresses inside one specific real estate advertisement post and then extract the localized address text into four different levels of the address information: City/Province, District/Town, Ward, and Street. The experiment results indicate that the ${\mathrm {PhoBERT}}_{bas\mathrm{e}}$ model achieves the best performance with an F1-score of 0.8195. Finally, we compare our proposed method with other approaches and achieve the highest accuracy results for all levels as follows: City/Province (0.952), District/Town (0.9482), Ward (0.9225), Street (0.8994), and the combined accuracy of correctly detecting all four levels is 0.8367.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
ATPM-REAP:越南房地产广告帖子的简单有效地址跟踪和解析
房地产在许多国家都是一个巨大而重要的领域。利用房地产广告帖子中的有用信息可以帮助您更好地了解市场状况并探索其他重要见解,特别是对于越南市场。值得注意的是,在房地产的代表信息中,地址或位置是必需的信息。然而,在越南有不同的方式来写下地址信息。因此,从房地产广告帖子中检测代表地址信息的相关文本就成为一项必要而富有挑战性的任务。本文研究了越南语的地址检测和解析任务。首先,我们创建了一个房地产广告数据集,每个房地产有16个不同的属性(实体),并为数据注释过程中检测到的每个实体分配正确的标签。然后,我们提出了一种实用的方法来检测特定房地产广告帖子中可能的地址位置,然后将本地化的地址文本提取为四个不同级别的地址信息:市/省、区/镇、区和街道。实验结果表明,${\mathrm {PhoBERT}}_{bas\mathrm{e}}$模型性能最佳,f1得分为0.8195。最后,我们将所提方法与其他方法进行比较,得到了在所有层次上准确率最高的结果:市/省(0.952)、区/镇(0.9482)、区(0.9225)、街(0.8994),正确检测四个层次的总准确率为0.8367。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
DWEN: A novel method for accurate estimation of cell type compositions from bulk data samples Polygenic risk scores adaptation for Height in a Vietnamese population Sentiment Classification for Beauty-fashion Reviews An Automated Stub Method for Unit Testing C/C++ Projects Knowledge-based Problem Solving and Reasoning methods
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1