Pub Date : 1900-01-01DOI: 10.18653/v1/2021.latechclfl-1.15
Pranaydeep Singh, Gorik Rutten, Els Lefever
This paper presents a pilot study to automatic linguistic preprocessing of Ancient and Byzantine Greek, and morphological analysis more specifically. To this end, a novel subword-based BERT language model was trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Consequently, the obtained BERT embeddings were incorporated to train a fine-grained Part-of-Speech tagger for Ancient and Byzantine Greek. In addition, a corpus of Greek Epigrams was manually annotated and the resulting gold standard was used to evaluate the performance of the morphological analyser on Byzantine Greek. The experimental results show very good perplexity scores (4.9) for the BERT language model and state-of-the-art performance for the fine-grained Part-of-Speech tagger for in-domain data (treebanks containing a mixture of Classical and Medieval Greek), as well as for the newly created Byzantine Greek gold standard data set. The language models and associated code are made available for use at https://github.com/pranaydeeps/Ancient-Greek-BERT
{"title":"A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek","authors":"Pranaydeep Singh, Gorik Rutten, Els Lefever","doi":"10.18653/v1/2021.latechclfl-1.15","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.15","url":null,"abstract":"This paper presents a pilot study to automatic linguistic preprocessing of Ancient and Byzantine Greek, and morphological analysis more specifically. To this end, a novel subword-based BERT language model was trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Consequently, the obtained BERT embeddings were incorporated to train a fine-grained Part-of-Speech tagger for Ancient and Byzantine Greek. In addition, a corpus of Greek Epigrams was manually annotated and the resulting gold standard was used to evaluate the performance of the morphological analyser on Byzantine Greek. The experimental results show very good perplexity scores (4.9) for the BERT language model and state-of-the-art performance for the fine-grained Part-of-Speech tagger for in-domain data (treebanks containing a mixture of Classical and Medieval Greek), as well as for the newly created Byzantine Greek gold standard data set. The language models and associated code are made available for use at https://github.com/pranaydeeps/Ancient-Greek-BERT","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115420367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01DOI: 10.18653/v1/2021.latechclfl-1.9
Almas Abdibayev, Yohei Igarashi, A. Riddell, D. Rockmore
In this paper we take up the problem of “limerick detection” and describe a system to identify five-line poems as limericks or not. This turns out to be a surprisingly difficult challenge with many subtleties. More precisely, we produce an algorithm which focuses on the structural aspects of the limerick – rhyme scheme and rhythm (i.e., stress patterns) – and when tested on a a culled data set of 98,454 publicly available limericks, our “limerick filter” accepts 67% as limericks. The primary failure of our filter is on the detection of “non-standard” rhymes, which we highlight as an outstanding challenge in computational poetics. Our accent detection algorithm proves to be very robust. Our main contributions are (1) a novel rhyme detection algorithm that works on English words including rare proper nouns and made-up words (and thus, words not in the widely used CMUDict database); (2) a novel rhythm-identifying heuristic that is robust to language noise at moderate levels and comparable in accuracy to state-of-the-art scansion algorithms. As a third significant contribution (3) we make publicly available a large corpus of limericks that includes tags of “limerick” or “not-limerick” as determined by our identification software, thereby providing a benchmark for the community. The poetic tasks that we have identified as challenges for machines suggest that the limerick is a useful “model organism” for the study of machine capabilities in poetry and more broadly literature and language. We include a list of open challenges as well. Generally, we anticipate that this work will provide useful material and benchmarks for future explorations in the field.
{"title":"Automating the Detection of Poetic Features: The Limerick as Model Organism","authors":"Almas Abdibayev, Yohei Igarashi, A. Riddell, D. Rockmore","doi":"10.18653/v1/2021.latechclfl-1.9","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.9","url":null,"abstract":"In this paper we take up the problem of “limerick detection” and describe a system to identify five-line poems as limericks or not. This turns out to be a surprisingly difficult challenge with many subtleties. More precisely, we produce an algorithm which focuses on the structural aspects of the limerick – rhyme scheme and rhythm (i.e., stress patterns) – and when tested on a a culled data set of 98,454 publicly available limericks, our “limerick filter” accepts 67% as limericks. The primary failure of our filter is on the detection of “non-standard” rhymes, which we highlight as an outstanding challenge in computational poetics. Our accent detection algorithm proves to be very robust. Our main contributions are (1) a novel rhyme detection algorithm that works on English words including rare proper nouns and made-up words (and thus, words not in the widely used CMUDict database); (2) a novel rhythm-identifying heuristic that is robust to language noise at moderate levels and comparable in accuracy to state-of-the-art scansion algorithms. As a third significant contribution (3) we make publicly available a large corpus of limericks that includes tags of “limerick” or “not-limerick” as determined by our identification software, thereby providing a benchmark for the community. The poetic tasks that we have identified as challenges for machines suggest that the limerick is a useful “model organism” for the study of machine capabilities in poetry and more broadly literature and language. We include a list of open challenges as well. Generally, we anticipate that this work will provide useful material and benchmarks for future explorations in the field.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126429651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01DOI: 10.18653/v1/2021.latechclfl-1.12
M. Kunilovskaya, Ekaterina Lapshinova-Koltunski, R. Mitkov
The paper reports the results of a translationese study of literary texts based on translated and non-translated Russian. We aim to find out if translations deviate from non-translated literary texts, and if the established differences can be attributed to typological relations between source and target languages. We expect that literary translations from typologically distant languages should exhibit more translationese, and the fingerprints of individual source languages (and their families) are traceable in translations. We explore linguistic properties that distinguish non-translated Russian literature from translations into Russian. Our results show that non-translated fiction is different from translations to the degree that these two language varieties can be automatically classified. As expected, language typology is reflected in translations of literary texts. We identified features that point to linguistic specificity of Russian non-translated literature and to shining-through effects. Some of translationese features cut across all language pairs, while others are characteristic of literary translations from languages belonging to specific language families.
{"title":"Translationese in Russian Literary Texts","authors":"M. Kunilovskaya, Ekaterina Lapshinova-Koltunski, R. Mitkov","doi":"10.18653/v1/2021.latechclfl-1.12","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.12","url":null,"abstract":"The paper reports the results of a translationese study of literary texts based on translated and non-translated Russian. We aim to find out if translations deviate from non-translated literary texts, and if the established differences can be attributed to typological relations between source and target languages. We expect that literary translations from typologically distant languages should exhibit more translationese, and the fingerprints of individual source languages (and their families) are traceable in translations. We explore linguistic properties that distinguish non-translated Russian literature from translations into Russian. Our results show that non-translated fiction is different from translations to the degree that these two language varieties can be automatically classified. As expected, language typology is reflected in translations of literary texts. We identified features that point to linguistic specificity of Russian non-translated literature and to shining-through effects. Some of translationese features cut across all language pairs, while others are characteristic of literary translations from languages belonging to specific language families.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124495299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01DOI: 10.18653/v1/2021.latechclfl-1.6
David Schmidt, Albin Zehe, Janne Lorenzen, Lisa Sergel, Sebastian Düker, Markus Krug, F. Puppe
This paper presents a data set of German fairy tales, manually annotated with character networks which were obtained with high inter rater agreement. The release of this corpus provides an opportunity of training and comparing different algorithms for the extraction of character networks, which so far was barely possible due to heterogeneous interests of previous researchers. We demonstrate the usefulness of our data set by providing baseline experiments for the automatic extraction of character networks, applying a rule-based pipeline as well as a neural approach, and find the neural approach outperforming the rule-approach in most evaluation settings.
{"title":"The FairyNet Corpus - Character Networks for German Fairy Tales","authors":"David Schmidt, Albin Zehe, Janne Lorenzen, Lisa Sergel, Sebastian Düker, Markus Krug, F. Puppe","doi":"10.18653/v1/2021.latechclfl-1.6","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.6","url":null,"abstract":"This paper presents a data set of German fairy tales, manually annotated with character networks which were obtained with high inter rater agreement. The release of this corpus provides an opportunity of training and comparing different algorithms for the extraction of character networks, which so far was barely possible due to heterogeneous interests of previous researchers. We demonstrate the usefulness of our data set by providing baseline experiments for the automatic extraction of character networks, applying a rule-based pipeline as well as a neural approach, and find the neural approach outperforming the rule-approach in most evaluation settings.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124557515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01DOI: 10.18653/v1/2021.latechclfl-1.14
Yuri Bizzoni, Stefania Degaetano-Ortlieb, K. Menzel, E. Teich
Tracing the influence of individuals or groups in social networks is an increasingly popular task in sociolinguistic studies. While methods to determine someone’s influence in shortterm contexts (e.g., social media, on-line political debates) are widespread, influence in longterm contexts is less investigated and may be harder to capture. We study the diffusion of scientific terms in an English diachronic scientific corpus, applying Hawkes Processes to capture the role of individual scientists as “influencers” or “influencees” in the diffusion of new concepts. Our findings on two major scientific discoveries in chemistry and astronomy of the 18th century reveal that modelling both the introduction and diffusion of scientific terms in a historical corpus as Hawkes Processes allows detecting patterns of influence between authors on a long-term scale.
{"title":"The diffusion of scientific terms – tracing individuals’ influence in the history of science for English","authors":"Yuri Bizzoni, Stefania Degaetano-Ortlieb, K. Menzel, E. Teich","doi":"10.18653/v1/2021.latechclfl-1.14","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.14","url":null,"abstract":"Tracing the influence of individuals or groups in social networks is an increasingly popular task in sociolinguistic studies. While methods to determine someone’s influence in shortterm contexts (e.g., social media, on-line political debates) are widespread, influence in longterm contexts is less investigated and may be harder to capture. We study the diffusion of scientific terms in an English diachronic scientific corpus, applying Hawkes Processes to capture the role of individual scientists as “influencers” or “influencees” in the diffusion of new concepts. Our findings on two major scientific discoveries in chemistry and astronomy of the 18th century reveal that modelling both the introduction and diffusion of scientific terms in a historical corpus as Hawkes Processes allows detecting patterns of influence between authors on a long-term scale.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"785 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123285744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01DOI: 10.18653/v1/2021.latechclfl-1.16
Thomas Schleider, Raphael Troncy
The knowledge of the European silk textile production is a typical case for which the information collected is heterogeneous, spread across many museums and sparse since rarely complete. Knowledge Graphs for this cultural heritage domain, when being developed with appropriate ontologies and vocabularies, enable to integrate and reconcile this diverse information. However, many of these original museum records still have some metadata gaps. In this paper, we present a zero-shot learning approach that leverages the ConceptNet common sense knowledge graph to predict categorical metadata informing about the silk objects production. We compared the performance of our approach with traditional supervised deep learning-based methods that do require training data. We demonstrate promising and competitive performance for similar datasets and circumstances and the ability to predict sometimes more fine-grained information. Our results can be reproduced using the code and datasets published at https://github.com/silknow/ZSL-KG-silk.
{"title":"Zero-Shot Information Extraction to Enhance a Knowledge Graph Describing Silk Textiles","authors":"Thomas Schleider, Raphael Troncy","doi":"10.18653/v1/2021.latechclfl-1.16","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.16","url":null,"abstract":"The knowledge of the European silk textile production is a typical case for which the information collected is heterogeneous, spread across many museums and sparse since rarely complete. Knowledge Graphs for this cultural heritage domain, when being developed with appropriate ontologies and vocabularies, enable to integrate and reconcile this diverse information. However, many of these original museum records still have some metadata gaps. In this paper, we present a zero-shot learning approach that leverages the ConceptNet common sense knowledge graph to predict categorical metadata informing about the silk objects production. We compared the performance of our approach with traditional supervised deep learning-based methods that do require training data. We demonstrate promising and competitive performance for similar datasets and circumstances and the ability to predict sometimes more fine-grained information. Our results can be reproduced using the code and datasets published at https://github.com/silknow/ZSL-KG-silk.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"30 13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125813844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01DOI: 10.18653/v1/2021.latechclfl-1.13
Zuoyu Tian, Dylan Jarrett, Juan M. Escalona Torres, Patrícia Amaral
High quality distributional models can capture lexical and semantic relations between words. Hence, researchers design various intrinsic tasks to test whether such relations are captured. However, most of the intrinsic tasks are designed for modern languages, and there is a lack of evaluation methods for distributional models of historical corpora. In this paper, we conducted BAHP: a benchmark of assessing word embeddings in Historical Portuguese, which contains four types of tests: analogy, similarity, outlier detection, and coherence. We examined word2vec models generated from two historical Portuguese corpora in these four test sets. The results demonstrate that our test sets are capable of measuring the quality of vector space models and can provide a holistic view of the model’s ability to capture syntactic and semantic information. Furthermore, the methodology for the creation of our test sets can be easily extended to other historical languages.
{"title":"BAHP: Benchmark of Assessing Word Embeddings in Historical Portuguese","authors":"Zuoyu Tian, Dylan Jarrett, Juan M. Escalona Torres, Patrícia Amaral","doi":"10.18653/v1/2021.latechclfl-1.13","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.13","url":null,"abstract":"High quality distributional models can capture lexical and semantic relations between words. Hence, researchers design various intrinsic tasks to test whether such relations are captured. However, most of the intrinsic tasks are designed for modern languages, and there is a lack of evaluation methods for distributional models of historical corpora. In this paper, we conducted BAHP: a benchmark of assessing word embeddings in Historical Portuguese, which contains four types of tests: analogy, similarity, outlier detection, and coherence. We examined word2vec models generated from two historical Portuguese corpora in these four test sets. The results demonstrate that our test sets are capable of measuring the quality of vector space models and can provide a holistic view of the model’s ability to capture syntactic and semantic information. Furthermore, the methodology for the creation of our test sets can be easily extended to other historical languages.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122973155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01DOI: 10.18653/v1/2021.latechclfl-1.8
Thomas Schmidt, Katrin Dennerlein, Christian Wolff
We present results of a project on emotion classification on historical German plays of Enlightenment, Storm and Stress, and German Classicism. We have developed a hierarchical annotation scheme consisting of 13 sub-emotions like suffering, love and joy that sum up to 6 main and 2 polarity classes (positive/negative). We have conducted textual annotations on 11 German plays and have acquired over 13,000 emotion annotations by two annotators per play. We have evaluated multiple traditional machine learning approaches as well as transformer-based models pretrained on historical and contemporary language for a single-label text sequence emotion classification for the different emotion categories. The evaluation is carried out on three different instances of the corpus: (1) taking all annotations, (2) filtering overlapping annotations by annotators, (3) applying a heuristic for speech-based analysis. Best results are achieved on the filtered corpus with the best models being large transformer-based models pretrained on contemporary German language. For the polarity classification accuracies of up to 90% are achieved. The accuracies become lower for settings with a higher number of classes, achieving 66% for 13 sub-emotions. Further pretraining of a historical model with a corpus of dramatic texts led to no improvements.
{"title":"Emotion Classification in German Plays with Transformer-based Language Models Pretrained on Historical and Contemporary Language","authors":"Thomas Schmidt, Katrin Dennerlein, Christian Wolff","doi":"10.18653/v1/2021.latechclfl-1.8","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.8","url":null,"abstract":"We present results of a project on emotion classification on historical German plays of Enlightenment, Storm and Stress, and German Classicism. We have developed a hierarchical annotation scheme consisting of 13 sub-emotions like suffering, love and joy that sum up to 6 main and 2 polarity classes (positive/negative). We have conducted textual annotations on 11 German plays and have acquired over 13,000 emotion annotations by two annotators per play. We have evaluated multiple traditional machine learning approaches as well as transformer-based models pretrained on historical and contemporary language for a single-label text sequence emotion classification for the different emotion categories. The evaluation is carried out on three different instances of the corpus: (1) taking all annotations, (2) filtering overlapping annotations by annotators, (3) applying a heuristic for speech-based analysis. Best results are achieved on the filtered corpus with the best models being large transformer-based models pretrained on contemporary German language. For the polarity classification accuracies of up to 90% are achieved. The accuracies become lower for settings with a higher number of classes, achieving 66% for 13 sub-emotions. Further pretraining of a historical model with a corpus of dramatic texts led to no improvements.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125093348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01DOI: 10.18653/v1/2021.latechclfl-1.5
Danielly Sorato, Diana Zavala-Rojas
The dawn of the digital age led to increasing demands for digital research resources, which shall be quickly processed and handled by computers. Due to the amount of data created by this digitization process, the design of tools that enable the analysis and management of data and metadata has become a relevant topic. In this context, the Multilingual Corpus of Survey Questionnaires (MCSQ) contributes to the creation and distribution of data for the Social Sciences and Humanities (SSH) following FAIR (Findable, Accessible, Interoperable and Reusable) principles, and provides functionalities for end-users that are not acquainted with programming through an easy-to-use interface. By simply applying the desired filters in the graphic interface, users can build linguistic resources for the survey research and translation areas, such as translation memories, thus facilitating data access and usage.
{"title":"The Multilingual Corpus of Survey Questionnaires Query Interface","authors":"Danielly Sorato, Diana Zavala-Rojas","doi":"10.18653/v1/2021.latechclfl-1.5","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.5","url":null,"abstract":"The dawn of the digital age led to increasing demands for digital research resources, which shall be quickly processed and handled by computers. Due to the amount of data created by this digitization process, the design of tools that enable the analysis and management of data and metadata has become a relevant topic. In this context, the Multilingual Corpus of Survey Questionnaires (MCSQ) contributes to the creation and distribution of data for the Social Sciences and Humanities (SSH) following FAIR (Findable, Accessible, Interoperable and Reusable) principles, and provides functionalities for end-users that are not acquainted with programming through an easy-to-use interface. By simply applying the desired filters in the graphic interface, users can build linguistic resources for the survey research and translation areas, such as translation memories, thus facilitating data access and usage.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122182231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01DOI: 10.18653/v1/2021.latechclfl-1.21
Andreas van Cranenburgh, E. Ketzan
This paper applies stylometry to quantify the literariness of 73 novels and novellas by American author Stephen King, chosen as an extraordinary case of a writer who has been dubbed both “high” and “low” in literariness in critical reception. We operationalize literariness using a measure of stylistic distance (Cosine Delta) based on the 1000 most frequent words in two bespoke comparison corpora used as proxies for literariness: one of popular genre fiction, another of National Book Award-winning authors. We report that a supervised model is highly effective in distinguishing the two categories, with 94.6% macro average in a binary classification. We define two subsets of texts by King—“high” and “low” literariness works as suggested by critics and ourselves—and find that a predictive model does identify King’s Dark Tower series and novels such as Dolores Claiborne as among his most “literary” texts, consistent with critical reception, which has also ascribed postmodern qualities to the Dark Tower novels. Our results demonstrate the efficacy of Cosine Delta-based stylometry in quantifying the literariness of texts, while also highlighting the methodological challenges of literariness, especially in the case of Stephen King. The code and data to reproduce our results are available at https://github.com/andreasvc/kinglit
{"title":"Stylometric Literariness Classification: the Case of Stephen King","authors":"Andreas van Cranenburgh, E. Ketzan","doi":"10.18653/v1/2021.latechclfl-1.21","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.21","url":null,"abstract":"This paper applies stylometry to quantify the literariness of 73 novels and novellas by American author Stephen King, chosen as an extraordinary case of a writer who has been dubbed both “high” and “low” in literariness in critical reception. We operationalize literariness using a measure of stylistic distance (Cosine Delta) based on the 1000 most frequent words in two bespoke comparison corpora used as proxies for literariness: one of popular genre fiction, another of National Book Award-winning authors. We report that a supervised model is highly effective in distinguishing the two categories, with 94.6% macro average in a binary classification. We define two subsets of texts by King—“high” and “low” literariness works as suggested by critics and ourselves—and find that a predictive model does identify King’s Dark Tower series and novels such as Dolores Claiborne as among his most “literary” texts, consistent with critical reception, which has also ascribed postmodern qualities to the Dark Tower novels. Our results demonstrate the efficacy of Cosine Delta-based stylometry in quantifying the literariness of texts, while also highlighting the methodological challenges of literariness, especially in the case of Stephen King. The code and data to reproduce our results are available at https://github.com/andreasvc/kinglit","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131564622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}