DE-Lemma: A Maximum-Entropy Based Lemmatizer for German Medical Text
Martin Wiesner
Studies in Health Technology and Informatics, vol. 307, pp. 189-195 (journal article, published 2023-09-12)
DOI: 10.3233/SHTI230712 (https://doi.org/10.3233/SHTI230712)
Abstract
When processing written German, it is helpful to use the base form (lemma) of possibly inflected words such as verbs, nouns, or named entities. However, for German text from the (bio)medical domain, e.g., discharge letters or entries stored in electronic medical or health records (EMR, EHR), finding the correct lemma is difficult because, for instance, medical language has roots in Latin or Greek. In such cases, stemming techniques may yield inaccurate results. This study demonstrates a machine learning approach for training Apache OpenNLP-based lemmatizer models from publicly available German treebanks. The resulting four "DE-Lemma" models were evaluated against a sample of (bio)medical nouns randomly selected from real-world discharge letters. The most promising DE-Lemma model achieved an accuracy of 88.0% (F1 = 0.936).
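To illustrate why suffix stripping can miss the lemma of German medical nouns with Latin or Greek roots, here is a minimal Python sketch. It is not the study's method and does not use Apache OpenNLP: the word list, the suffix set, and the function names `naive_stem` and `accuracy` are hypothetical and chosen only for illustration.

```python
# Hypothetical gold pairs (inflected form -> lemma); NOT taken from the study's data.
GOLD = {
    "Thromben": "Thrombus",
    "Viren": "Virus",
    "Karzinome": "Karzinom",
    "Diagnosen": "Diagnose",
    "Befunde": "Befund",
}

# Naive German inflection endings, checked in this order (an assumption, not a real stemmer).
SUFFIXES = ("en", "e", "n", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def accuracy(predict, gold: dict) -> float:
    """Share of words whose predicted base form equals the gold lemma."""
    hits = sum(1 for word, lemma in gold.items() if predict(word) == lemma)
    return hits / len(gold)


# Words for which the stripped form differs from the gold lemma.
mismatches = {w: naive_stem(w) for w, lemma in GOLD.items() if naive_stem(w) != lemma}
stem_accuracy = accuracy(naive_stem, GOLD)

print(mismatches)     # {'Thromben': 'Thromb', 'Viren': 'Vir', 'Diagnosen': 'Diagnos'}
print(stem_accuracy)  # 0.4
```

On this toy list, the stripped forms "Thromb" and "Vir" are not the lemmas "Thrombus" and "Virus", because the Latin singular ending cannot be recovered by removing a suffix. A trained lemmatizer is meant to close this kind of gap.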
About the journal:
This book series was started in 1990 to promote research conducted under the auspices of the EC programmes Advanced Informatics in Medicine (AIM) and Biomedical and Health Research (BHR) bioengineering branch. A driving aspect of international health informatics is that telecommunication technology, rehabilitative technology, intelligent home technology, and many other components are converging to form one integrated world of information and communication media.