Meng Hu, Zhixu Li, Yongxin Shen, An Liu, Guanfeng Liu, Kai Zheng, Lei Zhao
Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, published 2017-11-06. DOI: 10.1145/3132847.3132962 (https://doi.org/10.1145/3132847.3132962)
CNN-IETS: A CNN-based Probabilistic Approach for Information Extraction by Text Segmentation
Information Extraction by Text Segmentation (IETS) aims at segmenting text inputs to extract the implicit data values they contain. State-of-the-art IETS approaches rely mainly on machine learning techniques, either supervised or unsupervised. However, while supervised approaches require a large amount of labelled training data, the performance of unsupervised ones can be unstable across data sets. To overcome these weaknesses, this paper introduces CNN-IETS, a novel unsupervised probabilistic approach that takes advantage of pre-existing data and a Convolutional Neural Network (CNN)-based probabilistic classification model. Using the CNN model eases the burden of selecting high-quality features for associating text segments with attributes of a given domain, while the pre-existing data, serving as a domain knowledge base, provides training data with a comprehensive list of features for building the CNN model. Given an input text, we first perform an initial segmentation (according to the occurrences of its words in the knowledge base) to generate text segments, which the CNN then classifies with probabilities. Then, based on the probabilistic CNN classification results, we find the most probable labelling of the whole input text. As a complement, a bidirectional sequencing model learned on demand from test data is finally deployed to further adjust some problematic labelled segments. Our experimental study, conducted on several real data collections, shows that CNN-IETS improves the extraction quality of state-of-the-art approaches by more than 10%.
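The first two stages described in the abstract (knowledge-base-driven initial segmentation, then choosing the most probable labelling from per-segment probabilities) can be illustrated with a small sketch. This is not the authors' implementation. The knowledge base `KB`, the function names, the token-overlap scorer `toy_probs` (a stand-in for the CNN classifier), and the constraint that each attribute labels at most one segment are all assumptions made for illustration. The bidirectional sequencing adjustment is not covered.

```python
from itertools import permutations
from math import prod

# Hypothetical toy knowledge base: attribute -> known values (not from the paper).
KB = {
    "name": ["john smith", "mary jones"],
    "street": ["main street", "oak avenue"],
    "city": ["new york", "boston"],
}

def contains(phrase, seq):
    """True if seq occurs as a contiguous run inside phrase (both are tuples)."""
    n = len(seq)
    return any(phrase[i:i + n] == seq for i in range(len(phrase) - n + 1))

def initial_segments(tokens, kb):
    """Greedily grow a segment while its tokens co-occur in some KB value."""
    phrases = [tuple(v.split()) for values in kb.values() for v in values]
    segments, current = [], []
    for tok in tokens:
        candidate = tuple(current + [tok])
        if current and not any(contains(p, candidate) for p in phrases):
            # The extended run is unknown to the KB: close the current segment.
            segments.append(current)
            current = [tok]
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return segments

def toy_probs(segment, kb, smoothing=0.1):
    """Stand-in for the CNN classifier (assumption): smoothed token-overlap
    scores, normalised so they sum to 1 over the attributes."""
    scores = {}
    for attr, values in kb.items():
        vocab = {w for v in values for w in v.split()}
        scores[attr] = smoothing + sum(w in vocab for w in segment)
    total = sum(scores.values())
    return {attr: s / total for attr, s in scores.items()}

def best_labelling(segments, kb):
    """Exhaustively pick the assignment with the highest joint probability,
    assuming (for this sketch only) each attribute labels at most one segment."""
    if len(segments) > len(kb):
        raise ValueError("more segments than attributes in this toy setting")
    probs = [toy_probs(seg, kb) for seg in segments]
    best = max(
        permutations(kb, len(segments)),
        key=lambda attrs: prod(p[a] for p, a in zip(probs, attrs)),
    )
    return [(" ".join(seg), attr) for seg, attr in zip(segments, best)]

segments = initial_segments("mary jones oak avenue boston".split(), KB)
labelling = best_labelling(segments, KB)
print(labelling)
```

The exhaustive search over permutations is only practical for a handful of attributes. It is used here to keep the idea of a jointly most probable labelling visible, rather than labelling each segment independently.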