Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts
Helmut Schmid
Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, 2019-05-08
DOI: 10.1145/3322905.3322915 (https://doi.org/10.1145/3322905.3322915)
Citations: 19
Abstract
Part-of-speech tagging, morphological tagging, and lemmatization of historical texts pose special challenges because of high spelling variability and the lack of large, high-quality training corpora. Researchers therefore often first map the words to their modern spelling and then annotate them with tools trained on modern corpora. We show in this paper that high-quality part-of-speech tagging and lemmatization of historical texts are possible while operating directly on the historical spelling. We use a part-of-speech tagger based on bidirectional long short-term memory networks (LSTMs) [11] with character-based word representations, and we lemmatize using an encoder-decoder system with attention. We achieve state-of-the-art results for modern German morphological tagging on the Tiger corpus and also on two historical corpora that have been used in previous work.
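
To make the attention step of an encoder-decoder lemmatizer more concrete, the following is a minimal sketch of a single attention computation over per-character encoder states. It is not the paper's implementation: the abstract does not say which attention variant is used, so plain dot-product scoring is assumed here, and the function names (`softmax`, `attend`), the toy vectors, and the example word are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(decoder_state, encoder_states):
    """One attention step (assumed variant: dot-product scoring).

    Returns (weights, context), where weights[i] is how strongly the
    decoder attends to input character i, and context is the weighted
    average of the encoder states.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical toy input: one made-up 2-d encoder state per character.
word = "vnd"
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
for ch, w in zip(word, weights):
    print(f"{ch}: {w:.3f}")
```

In a real lemmatizer the encoder states would come from a recurrent network run over the characters of the historical word form, and the context vector would feed the decoder as it emits the lemma one character at a time. Because the weights are computed per input character, no prior mapping to modern spelling is needed.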