An unsupervised automatic organization method for Professor Shirakawa’s hand-notated documents of oracle bone inscriptions

IF 2.5 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE International Journal on Document Analysis and Recognition Pub Date : 2024-03-05 DOI:10.1007/s10032-024-00463-0

Xuebin Yue, Ziming Wang, Ryuto Ishibashi, Hayata Kaneko, Lin Meng

{"title":"An unsupervised automatic organization method for Professor Shirakawa’s hand-notated documents of oracle bone inscriptions","authors":"Xuebin Yue, Ziming Wang, Ryuto Ishibashi, Hayata Kaneko, Lin Meng","doi":"10.1007/s10032-024-00463-0","DOIUrl":null,"url":null,"abstract":"<p>As one of the most influential Chinese cultural researchers in the second half of the twentieth-century, Professor Shirakawa is active in the research field of ancient Chinese characters. He has left behind many valuable research documents, especially his hand-notated oracle bone inscriptions (OBIs) documents. OBIs are one of the world’s oldest characters and were used in the Shang Dynasty about 3600 years ago for divination and recording events. The organization of OBIs is not only helpful in better understanding Prof. Shirakawa’s research and further study of OBIs in general and their importance in ancient Chinese history. This paper proposes an unsupervised automatic organization method to organize Prof. Shirakawa’s OBIs and construct a handwritten OBIs data set for neural network learning. First, a suite of noise reduction is proposed to remove strangely shaped noise to reduce the data loss of OBIs. Secondly, a novel segmentation method based on the supervised classification of OBIs regions is proposed to reduce adverse effects between characters for more accurate OBIs segmentation. Thirdly, a unique unsupervised clustering method is proposed to classify the segmented characters. Finally, all the same characters in the hand-notated OBIs documents are organized together. The evaluation results show that noise reduction has been proposed to remove noises with an accuracy of 97.85%, which contains number information and closed-loop-like edges in the dataset. In addition, the accuracy of supervised classification of OBIs regions based on our model achieves 85.50%, which is higher than eight state-of-the-art deep learning models, and a particular preprocessing method we proposed improves the classification accuracy by nearly 11.50%. The accuracy of OBIs clustering based on supervised classification achieves 74.91%. These results demonstrate the effectiveness of our proposed unsupervised automatic organization of Prof. Shirakawa’s hand-notated OBIs documents. The code and datasets are available at http://www.ihpc.se.ritsumei.ac.jp/obidataset.html.</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"101 1","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal on Document Analysis and Recognition","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10032-024-00463-0","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

As one of the most influential Chinese cultural researchers in the second half of the twentieth-century, Professor Shirakawa is active in the research field of ancient Chinese characters. He has left behind many valuable research documents, especially his hand-notated oracle bone inscriptions (OBIs) documents. OBIs are one of the world’s oldest characters and were used in the Shang Dynasty about 3600 years ago for divination and recording events. The organization of OBIs is not only helpful in better understanding Prof. Shirakawa’s research and further study of OBIs in general and their importance in ancient Chinese history. This paper proposes an unsupervised automatic organization method to organize Prof. Shirakawa’s OBIs and construct a handwritten OBIs data set for neural network learning. First, a suite of noise reduction is proposed to remove strangely shaped noise to reduce the data loss of OBIs. Secondly, a novel segmentation method based on the supervised classification of OBIs regions is proposed to reduce adverse effects between characters for more accurate OBIs segmentation. Thirdly, a unique unsupervised clustering method is proposed to classify the segmented characters. Finally, all the same characters in the hand-notated OBIs documents are organized together. The evaluation results show that noise reduction has been proposed to remove noises with an accuracy of 97.85%, which contains number information and closed-loop-like edges in the dataset. In addition, the accuracy of supervised classification of OBIs regions based on our model achieves 85.50%, which is higher than eight state-of-the-art deep learning models, and a particular preprocessing method we proposed improves the classification accuracy by nearly 11.50%. The accuracy of OBIs clustering based on supervised classification achieves 74.91%. These results demonstrate the effectiveness of our proposed unsupervised automatic organization of Prof. Shirakawa’s hand-notated OBIs documents. The code and datasets are available at http://www.ihpc.se.ritsumei.ac.jp/obidataset.html.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

白川教授手记甲骨文文献的无监督自动整理方法

作为二十世纪下半叶最具影响力的中国文化研究者之一，白川教授活跃在中国古代文字研究领域。他留下了许多珍贵的研究文献，尤其是他手注的甲骨文文献。甲骨文是世界上最古老的文字之一，约在 3600 年前的商代用于占卜和记录事件。整理甲骨文不仅有助于更好地理解白川教授的研究，还有助于进一步研究甲骨文及其在中国古代历史中的重要性。本文提出了一种无监督自动组织方法来组织白川教授的口述历史，并构建了一个用于神经网络学习的手写口述历史数据集。首先，本文提出了一套降噪方法来去除奇形怪状的噪声，以减少 OBIs 的数据损失。其次，提出了一种基于 OBIs 区域监督分类的新型分割方法，以减少字符之间的不利影响，从而实现更准确的 OBIs 分割。第三，提出一种独特的无监督聚类方法，对分割后的字符进行分类。最后，将手注 OBIs 文档中所有相同的字符组织在一起。评估结果表明，提出的降噪方法去除噪音的准确率达到了 97.85%，其中包含了数据集中的数字信息和类似闭环的边缘。此外，基于我们的模型对OBIs区域进行监督分类的准确率达到了85.50%，高于8个最先进的深度学习模型，而我们提出的一种特殊预处理方法将分类准确率提高了近11.50%。基于监督分类的 OBIs 聚类准确率达到了 74.91%。这些结果证明了我们提出的对白川教授手注 OBIs 文档进行无监督自动组织的有效性。代码和数据集可在 http://www.ihpc.se.ritsumei.ac.jp/obidataset.html 上获取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

International Journal on Document Analysis and Recognition 工程技术-计算机：人工智能

CiteScore

6.20

自引率

4.30%

发文量

审稿时长

7.5 months

期刊介绍： The large number of existing documents and the production of a multitude of new ones every year raise important issues in efficient handling, retrieval and storage of these documents and the information which they contain. This has led to the emergence of new research domains dealing with the recognition by computers of the constituent elements of documents - including characters, symbols, text, lines, graphics, images, handwriting, signatures, etc. In addition, these new domains deal with automatic analyses of the overall physical and logical structures of documents, with the ultimate objective of a high-level understanding of their semantic content. We have also seen renewed interest in optical character recognition (OCR) and handwriting recognition during the last decade. Document analysis and recognition are obviously the next stage. Automatic, intelligent processing of documents is at the intersections of many fields of research, especially of computer vision, image analysis, pattern recognition and artificial intelligence, as well as studies on reading, handwriting and linguistics. Although quality document related publications continue to appear in journals dedicated to these domains, the community will benefit from having this journal as a focal point for archival literature dedicated to document analysis and recognition.