Comparing Topic Modeling and Named Entity Recognition Techniques for the Semantic Indexing of a Landscape Architecture Textbook
K. Dawar, Ashwanth J. Samuel, Raf Alvarado
2019 Systems and Information Engineering Design Symposium (SIEDS), published 2019-04-01
DOI: 10.1109/SIEDS.2019.8735642 (https://doi.org/10.1109/SIEDS.2019.8735642)
Citations: 3
Abstract
Manually annotating text is often tedious and error-prone. Landscape history has a strong need for digitization: no scalable relational database of refined texts exists, which limits the pedagogical reach of this rich field. The data for this study is a comprehensive 544-page textbook, “Landscape Design: A History of Landscape Architecture,” by Elizabeth Rogers. The Landscape Studies Initiative and the Data Science Institute at the University of Virginia have partnered to build a SQL-aided Flask application that assists in the deep annotation of scholarly texts. Our goal was to use machine learning techniques, specifically named entity recognition (NER) models and topic models (TM), not only to streamline the annotation process but also to provide a fresh perspective on the text through a new index. In this paper, we examine the training system, design, and architecture of several NER models, including Python's spaCy, Stanford's Named Entity Recognizer, and IBM Bluemix's Natural Language Understanding tool, and compare their accuracies. We also explore topic modeling with different tools and techniques, such as the Python library Gensim and the MALLET toolkit, to compare the relevance of the resulting models to our dataset. These techniques can have a substantial impact on the humanities, but that impact is severely limited by the availability, size, and domain of the training dataset. Entity recognition and topic modeling are, as a result, far from solved tasks: we address some of the fundamental challenges that prevent these systems from being robust and accurate.
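
The abstract does not say how accuracy was measured when comparing NER tools. One common way to compare them, shown here only as a minimal sketch, is exact-match precision, recall, and F1 against hand-annotated spans. The `Entity` type, the `score_entities` helper, and the sample annotations below are all hypothetical illustrations, not the authors' evaluation code or data.

```python
from __future__ import annotations

from typing import NamedTuple


class Entity(NamedTuple):
    """One annotated mention (hypothetical structure for illustration)."""
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    label: str   # entity type, e.g. "PERSON" or "GPE"


def score_entities(gold: set[Entity], predicted: set[Entity]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over entity spans.

    A prediction counts as correct only if its offsets and label
    both match a gold annotation exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up annotations for a single sentence; not data from the study.
gold = {Entity(0, 9, "PERSON"), Entity(22, 32, "GPE"), Entity(40, 44, "DATE")}

# tool_a misses the DATE mention; tool_b mislabels one span and adds a spurious one.
tool_a = {Entity(0, 9, "PERSON"), Entity(22, 32, "GPE")}
tool_b = {
    Entity(0, 9, "PERSON"),
    Entity(22, 32, "ORG"),
    Entity(40, 44, "DATE"),
    Entity(50, 55, "GPE"),
}

for name, predicted in {"tool_a": tool_a, "tool_b": tool_b}.items():
    print(name, score_entities(gold, predicted))
```

Exact matching is strict: a span with the right boundaries but the wrong label scores as both a false positive and a false negative. For a domain-specific text such as a landscape history textbook, a relaxed variant that credits partial overlaps may be more informative, but that choice is left open here.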