Identifying named entities on a University intranet
M. Althobaiti, Udo Kruschwitz, Massimo Poesio
2012 4th Computer Science and Electronic Engineering Conference (CEEC), 2012-12-06
DOI: 10.1109/CEEC.2012.6375385 (https://doi.org/10.1109/CEEC.2012.6375385)
Citations: 4
Abstract
Named entities (NEs) are textual references made via proper names, such as the names of people, companies, and places. Their importance has been observed in intranet search engines, including those of university web sites. In this paper, we build a mechanism dedicated to recognizing three types of named entity that are frequently referenced in the University of Essex domain: person names, course codes, and room numbers. While person names are a common type of named entity, course codes and room numbers are specific to the University domain. We developed a technique to train three different classifiers on electronic corpora comprising 16,629 examples in total, which were collected from the University domain and annotated manually. The resulting models were then incorporated into an NER system built to use the pre-trained classifiers to detect these NEs, mark them, and cross-reference them to related documents. The proposed method performed well on a test corpus, with average precision reaching nearly 0.97. Recall varied and was lower overall than precision, averaging 0.82. Moreover, for name recognition in the University domain, our system outperformed two other systems: the OpenNLP name finder and the ANNIE system.
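To make the reported metrics concrete, the following is a minimal sketch of how precision and recall can be computed for an entity recognizer. It assumes exact-match scoring over (start, end, type) spans, which the abstract does not specify. The function name `precision_recall`, the type labels, and the toy annotations are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# An entity span: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


def precision_recall(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall over entity spans (assumed criterion)."""
    true_positives = len(gold & predicted)
    # Precision: fraction of predicted spans that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold spans that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations for one document (not data from the paper).
gold = {(0, 12, "PERSON"), (20, 25, "COURSE"), (40, 47, "ROOM"), (60, 70, "PERSON")}
predicted = {(0, 12, "PERSON"), (20, 25, "COURSE"), (40, 47, "ROOM")}

p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.75
```

In this toy case every predicted span is correct but one gold span is missed, so precision exceeds recall, which is the same direction as the averages reported above.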