Understanding Cultural Similarities of Archaeological Sites from Excavation Reports Using Natural Language Processing Technique

IF 0.7 Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Journal of Advanced Computational Intelligence and Intelligent Informatics Pub Date : 2023-05-20 DOI:10.20965/jaciii.2023.p0394
Fumihiro Sakahira, Yuji Yamaguchi, Takao Terano
Pages: 394-403
Citations: 1

Abstract

In this study, we applied natural language processing (NLP) techniques to the texts of excavation reports on buried cultural properties, calculating the degree of similarity between reports in order to identify archaeological sites that closely resemble one another. Specifically, we validated whether the similarity of sentence embeddings in the excavation reports of these sites is consistent with the existing classification. Four archaeological sites classified in existing archaeological research papers were used. For validation, 128 excavation reports from the four sites were used; sentence embeddings were obtained using Doc2Vec. We obtained the following results: 1) When applying NLP to excavation reports to determine the similarities of archaeological sites, merging the texts for each site into a single document and then processing it was preferable to processing each volume of the excavation report separately. 2) The similarity based on Doc2Vec sentence embeddings of excavation reports was more consistent with the classification of the characteristics of archaeological sites than the similarity based on term frequency–inverse document frequency (TF-IDF). 3) When targeting a specific period, the sentence embedding computed exclusively from the text of the relevant period is consistent with the classification of the characteristics of the archaeological site derived from the artifacts and structural remains of that period. 4) When a specific period is targeted, the period-exclusive sentence embeddings obtained through the additive compositionality of sentence embeddings can be used to classify the characteristics of archaeological sites based on the artifacts and structural remains of that period. Consequently, the similarities of texts based on NLP can reflect the similarities of archaeological sites. This holds true even for excavation reports that include spelling inconsistencies, optical character reader (OCR) misrecognition, and garbled words.
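The comparison pipeline described in the abstract can be illustrated with a minimal sketch. It covers only two of the steps: merging all report volumes of a site into one document, and scoring site pairs by cosine similarity over TF-IDF vectors (the baseline the abstract compares against Doc2Vec). Everything concrete here is an assumption for illustration and is not taken from the paper: the toy English texts, the site names, the helper names `merge_by_site`, `tfidf_vectors`, and `cosine`, the smoothed idf formula, and the whitespace tokenizer, which is only a stand-in for whatever tokenization real reports would need.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (site, volume text) pairs standing in for report volumes.
reports = [
    ("site_a", "pit dwelling pottery shards"),
    ("site_a", "pottery kiln remains"),
    ("site_b", "pit dwelling stone tools"),
    ("site_c", "castle moat roof tiles"),
]


def merge_by_site(pairs):
    """Concatenate the tokens of every volume belonging to the same site."""
    merged = defaultdict(list)
    for site, text in pairs:
        # Whitespace splitting is a placeholder tokenizer (assumption).
        merged[site].extend(text.lower().split())
    return dict(merged)


def tfidf_vectors(docs):
    """Return {site: {term: weight}} using relative tf and a smoothed idf."""
    n_docs = len(docs)
    # Document frequency: in how many merged documents each term appears.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for site, tokens in docs.items():
        term_freq = Counter(tokens)
        vectors[site] = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = merge_by_site(reports)          # one merged document per site
vecs = tfidf_vectors(docs)
sim_ab = cosine(vecs["site_a"], vecs["site_b"])  # sites sharing vocabulary
sim_ac = cosine(vecs["site_a"], vecs["site_c"])  # sites with no shared terms
```

In this toy data `site_a` and `site_b` share terms, so their score is higher than that of `site_a` and `site_c`, which share none. To follow the abstract's main approach, the TF-IDF vectors would be replaced by dense document vectors from a trained Doc2Vec model, with the same cosine comparison applied to the resulting embeddings.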
Source journal
CiteScore: 1.50
Self-citation rate: 14.30%
Articles published: 89
About the journal: JACIII focuses on advanced computational intelligence and intelligent informatics. The topics include, but are not limited to: fuzzy logic, fuzzy control, neural networks, GA and evolutionary computation, hybrid systems, adaptation and learning systems, distributed intelligent systems, network systems, multimedia, human interface, biologically inspired evolutionary systems, artificial life, chaos, complex systems, fractals, robotics, medical applications, pattern recognition, virtual reality, wavelet analysis, scientific applications, industrial applications, and artistic applications.