A Numerical Fact Extraction Method for Chinese Text
Pengyu Zhang
2022 IEEE 7th International Conference on Smart Cloud (SmartCloud), published 2022-10-01
DOI: 10.1109/SmartCloud55982.2022.00021 (https://doi.org/10.1109/SmartCloud55982.2022.00021)
Abstract
Understanding text that contains numerical values matters in many application areas, and extracting numerically relevant information from unstructured data is an active research topic. The work in this paper has three parts. The first is a specification for Chinese numerical fact extraction and annotation: annotation is centered on numerical values, and the other important information is located relative to them, with the entities and attributes that the values measure as the main targets. The second is the design of the extraction methods. Two deep-learning methods with different task formulations serve as extraction models: NER Combine and Quantity MRC. NER Combine extracts fields through sequence labeling and then links values to the other information with a combination algorithm based on the distance between fields. Quantity MRC uses machine reading comprehension: it poses the numerical information as a question and finds the matching span among the other information. The supervised, deep-learning design aims to find the desired targets more accurately than an unsupervised algorithm, to avoid enumerating a large number of rules to handle minor cases, and to draw on the prior knowledge and strong representational power of a pre-trained language model to make the extraction results more robust and usable. The third is experimental verification, which shows the advantages and disadvantages of the two extraction methods in different contexts.
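The two task formulations named in the abstract can be illustrated with a small sketch. The abstract does not give the tag set, the distance measure, or the question template, so everything below is an assumption made for illustration: the labels `QUANTITY`, `ENTITY`, and `ATTRIBUTE`, the character-gap distance in `gap`, and the English template in `build_quantity_question` are hypothetical and are not the paper's actual design. The spans are written by hand where a trained sequence labeling model would normally produce them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    label: str  # hypothetical tag set: "QUANTITY", "ENTITY", "ATTRIBUTE"
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str


def gap(a: Span, b: Span) -> int:
    """Characters between two spans (0 if they touch or overlap)."""
    return max(a.start - b.end, b.start - a.end, 0)


def nearest(target: Span, candidates: List[Span]) -> Optional[Span]:
    """Return the candidate closest to target, or None if there is none."""
    return min(candidates, key=lambda c: gap(target, c), default=None)


def combine_by_distance(
    spans: List[Span],
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Attach the nearest entity and attribute to each quantity.

    Returns (quantity, entity, attribute) triples in input order.
    This is an assumed combination rule, not the one from the paper.
    """
    entities = [s for s in spans if s.label == "ENTITY"]
    attributes = [s for s in spans if s.label == "ATTRIBUTE"]
    facts = []
    for q in (s for s in spans if s.label == "QUANTITY"):
        e = nearest(q, entities)
        a = nearest(q, attributes)
        facts.append((q.text, e.text if e else None, a.text if a else None))
    return facts


def build_quantity_question(quantity: str, target: str) -> str:
    """Pose a quantity as an MRC-style query (hypothetical template)."""
    return f"Which {target} is measured by the quantity '{quantity}'?"


if __name__ == "__main__":
    text = "小明身高180厘米，体重75公斤"
    # Hand-written spans standing in for sequence labeling output.
    spans = [
        Span("ENTITY", 0, 2, "小明"),
        Span("ATTRIBUTE", 2, 4, "身高"),
        Span("QUANTITY", 4, 9, "180厘米"),
        Span("ATTRIBUTE", 10, 12, "体重"),
        Span("QUANTITY", 12, 16, "75公斤"),
    ]
    for fact in combine_by_distance(spans):
        print(fact)
    print(build_quantity_question("180厘米", "attribute"))
```

A pure nearest-span rule like this one is sensitive to word order and punctuation: in a sentence that lists several entities, a quantity can sit closer to the next clause's entity than to its own. That is one plausible reason for the two methods to behave differently across contexts, though the abstract does not say so.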