Kai Wu, Zugang Chen, Xinqian Wu, Guoqing Li, Jing Li, Shaohua Wang, Haodong Wang, Hang Feng
{"title":"基于分层时态记忆模型从文献中提取地球科学数据集名称","authors":"Kai Wu, Zugang Chen, Xinqian Wu, Guoqing Li, Jing Li, Shaohua Wang, Haodong Wang, Hang Feng","doi":"10.3390/ijgi13070260","DOIUrl":null,"url":null,"abstract":"Extracting geoscientific dataset names from the literature is crucial for building a literature–data association network, which can help readers access the data quickly through the Internet. However, the existing named-entity extraction methods have low accuracy in extracting geoscientific dataset names from unstructured text because geoscientific dataset names are a complex combination of multiple elements, such as geospatial coverage, temporal coverage, scale or resolution, theme content, and version. This paper proposes a new method based on the hierarchical temporal memory (HTM) model, a brain-inspired neural network with superior performance in high-level cognitive tasks, to accurately extract geoscientific dataset names from unstructured text. First, a word-encoding method based on the Unicode values of characters for the HTM model was proposed. Then, over 12,000 dataset names were collected from geoscience data-sharing websites and encoded into binary vectors to train the HTM model. We conceived a new classifier scheme for the HTM model that decodes the predictive vector for the encoder of the next word so that the similarity of the encoders of the predictive next word and the real next word can be computed. If the similarity is greater than a specified threshold, the real next word can be regarded as part of the name, and a successive word set forms the full geoscientific dataset name. We used the trained HTM model to extract geoscientific dataset names from 100 papers. Our method achieved an F1-score of 0.727, outperforming the GPT-4- and Claude-3-based few-shot learning (FSL) method, with F1-scores of 0.698 and 0.72, respectively.","PeriodicalId":48738,"journal":{"name":"ISPRS International Journal of Geo-Information","volume":"160 1","pages":""},"PeriodicalIF":2.8000,"publicationDate":"2024-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Extracting Geoscientific Dataset Names from the Literature Based on the Hierarchical Temporal Memory Model\",\"authors\":\"Kai Wu, Zugang Chen, Xinqian Wu, Guoqing Li, Jing Li, Shaohua Wang, Haodong Wang, Hang Feng\",\"doi\":\"10.3390/ijgi13070260\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Extracting geoscientific dataset names from the literature is crucial for building a literature–data association network, which can help readers access the data quickly through the Internet. However, the existing named-entity extraction methods have low accuracy in extracting geoscientific dataset names from unstructured text because geoscientific dataset names are a complex combination of multiple elements, such as geospatial coverage, temporal coverage, scale or resolution, theme content, and version. This paper proposes a new method based on the hierarchical temporal memory (HTM) model, a brain-inspired neural network with superior performance in high-level cognitive tasks, to accurately extract geoscientific dataset names from unstructured text. First, a word-encoding method based on the Unicode values of characters for the HTM model was proposed. Then, over 12,000 dataset names were collected from geoscience data-sharing websites and encoded into binary vectors to train the HTM model. We conceived a new classifier scheme for the HTM model that decodes the predictive vector for the encoder of the next word so that the similarity of the encoders of the predictive next word and the real next word can be computed. If the similarity is greater than a specified threshold, the real next word can be regarded as part of the name, and a successive word set forms the full geoscientific dataset name. We used the trained HTM model to extract geoscientific dataset names from 100 papers. Our method achieved an F1-score of 0.727, outperforming the GPT-4- and Claude-3-based few-shot learning (FSL) method, with F1-scores of 0.698 and 0.72, respectively.\",\"PeriodicalId\":48738,\"journal\":{\"name\":\"ISPRS International Journal of Geo-Information\",\"volume\":\"160 1\",\"pages\":\"\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-07-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ISPRS International Journal of Geo-Information\",\"FirstCategoryId\":\"89\",\"ListUrlMain\":\"https://doi.org/10.3390/ijgi13070260\",\"RegionNum\":3,\"RegionCategory\":\"地球科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ISPRS International Journal of Geo-Information","FirstCategoryId":"89","ListUrlMain":"https://doi.org/10.3390/ijgi13070260","RegionNum":3,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Extracting Geoscientific Dataset Names from the Literature Based on the Hierarchical Temporal Memory Model
Extracting geoscientific dataset names from the literature is crucial for building a literature–data association network, which can help readers access the data quickly through the Internet. However, the existing named-entity extraction methods have low accuracy in extracting geoscientific dataset names from unstructured text because geoscientific dataset names are a complex combination of multiple elements, such as geospatial coverage, temporal coverage, scale or resolution, theme content, and version. This paper proposes a new method based on the hierarchical temporal memory (HTM) model, a brain-inspired neural network with superior performance in high-level cognitive tasks, to accurately extract geoscientific dataset names from unstructured text. First, a word-encoding method based on the Unicode values of characters for the HTM model was proposed. Then, over 12,000 dataset names were collected from geoscience data-sharing websites and encoded into binary vectors to train the HTM model. We conceived a new classifier scheme for the HTM model that decodes the predictive vector for the encoder of the next word so that the similarity of the encoders of the predictive next word and the real next word can be computed. If the similarity is greater than a specified threshold, the real next word can be regarded as part of the name, and a successive word set forms the full geoscientific dataset name. We used the trained HTM model to extract geoscientific dataset names from 100 papers. Our method achieved an F1-score of 0.727, outperforming the GPT-4- and Claude-3-based few-shot learning (FSL) method, with F1-scores of 0.698 and 0.72, respectively.
期刊介绍:
ISPRS International Journal of Geo-Information (ISSN 2220-9964) provides an advanced forum for the science and technology of geographic information. ISPRS International Journal of Geo-Information publishes regular research papers, reviews and communications. Our aim is to encourage scientists to publish their experimental and theoretical results in as much detail as possible. There is no restriction on the length of the papers. The full experimental details must be provided so that the results can be reproduced.
The 2018 IJGI Outstanding Reviewer Award has been launched! This award acknowledge those who have generously dedicated their time to review manuscripts submitted to IJGI. See full details at http://www.mdpi.com/journal/ijgi/awards.