
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature: Latest Publications

A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek
Pranaydeep Singh, Gorik Rutten, Els Lefever
This paper presents a pilot study on the automatic linguistic preprocessing of Ancient and Byzantine Greek, and morphological analysis more specifically. To this end, a novel subword-based BERT language model was trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Subsequently, the obtained BERT embeddings were incorporated to train a fine-grained Part-of-Speech tagger for Ancient and Byzantine Greek. In addition, a corpus of Greek Epigrams was manually annotated and the resulting gold standard was used to evaluate the performance of the morphological analyser on Byzantine Greek. The experimental results show a very good perplexity score (4.9) for the BERT language model and state-of-the-art performance for the fine-grained Part-of-Speech tagger on in-domain data (treebanks containing a mixture of Classical and Medieval Greek), as well as on the newly created Byzantine Greek gold standard data set. The language models and associated code are made available for use at https://github.com/pranaydeeps/Ancient-Greek-BERT
{"title":"A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek","authors":"Pranaydeep Singh, Gorik Rutten, Els Lefever","doi":"10.18653/v1/2021.latechclfl-1.15","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.15","url":null,"abstract":"This paper presents a pilot study to automatic linguistic preprocessing of Ancient and Byzantine Greek, and morphological analysis more specifically. To this end, a novel subword-based BERT language model was trained on the basis of a varied corpus of Modern, Ancient and Post-classical Greek texts. Consequently, the obtained BERT embeddings were incorporated to train a fine-grained Part-of-Speech tagger for Ancient and Byzantine Greek. In addition, a corpus of Greek Epigrams was manually annotated and the resulting gold standard was used to evaluate the performance of the morphological analyser on Byzantine Greek. The experimental results show very good perplexity scores (4.9) for the BERT language model and state-of-the-art performance for the fine-grained Part-of-Speech tagger for in-domain data (treebanks containing a mixture of Classical and Medieval Greek), as well as for the newly created Byzantine Greek gold standard data set. The language models and associated code are made available for use at https://github.com/pranaydeeps/Ancient-Greek-BERT","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115420367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
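As a quick illustration of how the released language model could be used, the sketch below runs masked-token prediction with the Hugging Face transformers pipeline. It is not the authors' evaluation code and assumes the model from the repository above is also published on the Hugging Face hub under the identifier pranaydeeps/Ancient-Greek-BERT; if the identifier differs, adjust accordingly.

```python
# Minimal sketch (not the authors' code): masked-token prediction with the
# Ancient Greek BERT model, assuming it is mirrored on the Hugging Face hub
# under "pranaydeeps/Ancient-Greek-BERT" (see the GitHub repository above).
from transformers import pipeline

fill = pipeline("fill-mask", model="pranaydeeps/Ancient-Greek-BERT")

# Predict the masked word in a short Greek sentence.
masked = f"τοῦτο δὲ τὸ {fill.tokenizer.mask_token} ἐστὶν ἀληθές."
for candidate in fill(masked):
    print(candidate["token_str"], round(candidate["score"], 3))
```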
Automating the Detection of Poetic Features: The Limerick as Model Organism
Almas Abdibayev, Yohei Igarashi, A. Riddell, D. Rockmore
In this paper we take up the problem of “limerick detection” and describe a system to identify five-line poems as limericks or not. This turns out to be a surprisingly difficult challenge with many subtleties. More precisely, we produce an algorithm which focuses on the structural aspects of the limerick – rhyme scheme and rhythm (i.e., stress patterns) – and when tested on a culled data set of 98,454 publicly available limericks, our “limerick filter” accepts 67% as limericks. The primary failure of our filter is on the detection of “non-standard” rhymes, which we highlight as an outstanding challenge in computational poetics. Our accent detection algorithm proves to be very robust. Our main contributions are (1) a novel rhyme detection algorithm that works on English words including rare proper nouns and made-up words (and thus, words not in the widely used CMUDict database); (2) a novel rhythm-identifying heuristic that is robust to language noise at moderate levels and comparable in accuracy to state-of-the-art scansion algorithms. As a third significant contribution, (3) we make publicly available a large corpus of limericks that includes tags of “limerick” or “not-limerick” as determined by our identification software, thereby providing a benchmark for the community. The poetic tasks that we have identified as challenges for machines suggest that the limerick is a useful “model organism” for the study of machine capabilities in poetry and, more broadly, literature and language. We include a list of open challenges as well. Generally, we anticipate that this work will provide useful material and benchmarks for future explorations in the field.
{"title":"Automating the Detection of Poetic Features: The Limerick as Model Organism","authors":"Almas Abdibayev, Yohei Igarashi, A. Riddell, D. Rockmore","doi":"10.18653/v1/2021.latechclfl-1.9","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.9","url":null,"abstract":"In this paper we take up the problem of “limerick detection” and describe a system to identify five-line poems as limericks or not. This turns out to be a surprisingly difficult challenge with many subtleties. More precisely, we produce an algorithm which focuses on the structural aspects of the limerick – rhyme scheme and rhythm (i.e., stress patterns) – and when tested on a a culled data set of 98,454 publicly available limericks, our “limerick filter” accepts 67% as limericks. The primary failure of our filter is on the detection of “non-standard” rhymes, which we highlight as an outstanding challenge in computational poetics. Our accent detection algorithm proves to be very robust. Our main contributions are (1) a novel rhyme detection algorithm that works on English words including rare proper nouns and made-up words (and thus, words not in the widely used CMUDict database); (2) a novel rhythm-identifying heuristic that is robust to language noise at moderate levels and comparable in accuracy to state-of-the-art scansion algorithms. As a third significant contribution (3) we make publicly available a large corpus of limericks that includes tags of “limerick” or “not-limerick” as determined by our identification software, thereby providing a benchmark for the community. The poetic tasks that we have identified as challenges for machines suggest that the limerick is a useful “model organism” for the study of machine capabilities in poetry and more broadly literature and language. We include a list of open challenges as well. Generally, we anticipate that this work will provide useful material and benchmarks for future explorations in the field.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126429651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
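To make the rhyme-scheme component concrete, here is a CMUdict-only sketch of AABBA checking using the pronouncing package. Unlike the paper's rhyme algorithm, it cannot handle out-of-dictionary words such as rare proper nouns or made-up words, and the helper names are ours, not the authors'.

```python
# CMUdict-only sketch of AABBA rhyme checking; the paper's own algorithm goes
# further and covers words missing from CMUdict. Requires "pronouncing".
import pronouncing

def rhyme_part(word):
    phones = pronouncing.phones_for_word(word.lower().strip(".,!?;:"))
    return pronouncing.rhyming_part(phones[0]) if phones else None

def looks_like_limerick(lines):
    if len(lines) != 5:
        return False
    ends = [rhyme_part(line.split()[-1]) for line in lines]
    if any(e is None for e in ends):
        return False  # out-of-dictionary final word: undecidable with this sketch
    a1, a2, b1, b2, a3 = ends
    return a1 == a2 == a3 and b1 == b2 and a1 != b1

poem = [
    "There was an old man with a beard",
    "Who said it is just as I feared",
    "Two owls and a hen",
    "Four larks and a wren",
    "Have all built their nests in my beard",
]
print(looks_like_limerick(poem))  # True for this classic AABBA example
```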
Translationese in Russian Literary Texts
M. Kunilovskaya, Ekaterina Lapshinova-Koltunski, R. Mitkov
The paper reports the results of a translationese study of literary texts based on translated and non-translated Russian. We aim to find out if translations deviate from non-translated literary texts, and if the established differences can be attributed to typological relations between source and target languages. We expect that literary translations from typologically distant languages should exhibit more translationese, and the fingerprints of individual source languages (and their families) are traceable in translations. We explore linguistic properties that distinguish non-translated Russian literature from translations into Russian. Our results show that non-translated fiction is different from translations to the degree that these two language varieties can be automatically classified. As expected, language typology is reflected in translations of literary texts. We identified features that point to linguistic specificity of Russian non-translated literature and to shining-through effects. Some of translationese features cut across all language pairs, while others are characteristic of literary translations from languages belonging to specific language families.
{"title":"Translationese in Russian Literary Texts","authors":"M. Kunilovskaya, Ekaterina Lapshinova-Koltunski, R. Mitkov","doi":"10.18653/v1/2021.latechclfl-1.12","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.12","url":null,"abstract":"The paper reports the results of a translationese study of literary texts based on translated and non-translated Russian. We aim to find out if translations deviate from non-translated literary texts, and if the established differences can be attributed to typological relations between source and target languages. We expect that literary translations from typologically distant languages should exhibit more translationese, and the fingerprints of individual source languages (and their families) are traceable in translations. We explore linguistic properties that distinguish non-translated Russian literature from translations into Russian. Our results show that non-translated fiction is different from translations to the degree that these two language varieties can be automatically classified. As expected, language typology is reflected in translations of literary texts. We identified features that point to linguistic specificity of Russian non-translated literature and to shining-through effects. Some of translationese features cut across all language pairs, while others are characteristic of literary translations from languages belonging to specific language families.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124495299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
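The general setup, binary classification of translated versus non-translated text, can be sketched with scikit-learn as below. The character n-gram features and toy passages are placeholders and do not reproduce the authors' linguistic feature set, data, or results.

```python
# Hedged sketch of translated-vs-original classification; toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "пример непереводного русского текста",  # placeholder passages
    "пример переводного русского текста",
] * 5
labels = [0, 1] * 5  # 0 = non-translated, 1 = translated

# Character n-grams stand in for the morphosyntactic features used in the paper.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["новый отрывок для классификации"]))
```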
The FairyNet Corpus - Character Networks for German Fairy Tales
David Schmidt, Albin Zehe, Janne Lorenzen, Lisa Sergel, Sebastian Düker, Markus Krug, F. Puppe
This paper presents a data set of German fairy tales, manually annotated with character networks which were obtained with high inter-rater agreement. The release of this corpus provides an opportunity to train and compare different algorithms for the extraction of character networks, which so far was barely possible due to the heterogeneous interests of previous researchers. We demonstrate the usefulness of our data set by providing baseline experiments for the automatic extraction of character networks, applying a rule-based pipeline as well as a neural approach, and find the neural approach outperforming the rule-based approach in most evaluation settings.
{"title":"The FairyNet Corpus - Character Networks for German Fairy Tales","authors":"David Schmidt, Albin Zehe, Janne Lorenzen, Lisa Sergel, Sebastian Düker, Markus Krug, F. Puppe","doi":"10.18653/v1/2021.latechclfl-1.6","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.6","url":null,"abstract":"This paper presents a data set of German fairy tales, manually annotated with character networks which were obtained with high inter rater agreement. The release of this corpus provides an opportunity of training and comparing different algorithms for the extraction of character networks, which so far was barely possible due to heterogeneous interests of previous researchers. We demonstrate the usefulness of our data set by providing baseline experiments for the automatic extraction of character networks, applying a rule-based pipeline as well as a neural approach, and find the neural approach outperforming the rule-approach in most evaluation settings.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124557515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
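A character network of the kind annotated here can be assembled from per-segment character mentions as a weighted co-occurrence graph. The sketch below uses networkx with hypothetical mention lists standing in for the output of the rule-based or neural extraction pipelines compared in the paper.

```python
# Build a weighted character co-occurrence network from per-segment mentions.
from itertools import combinations
import networkx as nx

# Characters mentioned together in each sentence/segment (hypothetical data).
mentions = [
    ["Hänsel", "Gretel"],
    ["Gretel", "Hexe"],
    ["Hänsel", "Gretel", "Hexe"],
]

G = nx.Graph()
for segment in mentions:
    for a, b in combinations(sorted(set(segment)), 2):
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)

print(G.edges(data=True))
```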
The diffusion of scientific terms – tracing individuals’ influence in the history of science for English
Yuri Bizzoni, Stefania Degaetano-Ortlieb, K. Menzel, E. Teich
Tracing the influence of individuals or groups in social networks is an increasingly popular task in sociolinguistic studies. While methods to determine someone’s influence in short-term contexts (e.g., social media, on-line political debates) are widespread, influence in long-term contexts is less investigated and may be harder to capture. We study the diffusion of scientific terms in an English diachronic scientific corpus, applying Hawkes Processes to capture the role of individual scientists as “influencers” or “influencees” in the diffusion of new concepts. Our findings on two major scientific discoveries in chemistry and astronomy of the 18th century reveal that modelling both the introduction and diffusion of scientific terms in a historical corpus as Hawkes Processes allows detecting patterns of influence between authors on a long-term scale.
{"title":"The diffusion of scientific terms – tracing individuals’ influence in the history of science for English","authors":"Yuri Bizzoni, Stefania Degaetano-Ortlieb, K. Menzel, E. Teich","doi":"10.18653/v1/2021.latechclfl-1.14","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.14","url":null,"abstract":"Tracing the influence of individuals or groups in social networks is an increasingly popular task in sociolinguistic studies. While methods to determine someone’s influence in shortterm contexts (e.g., social media, on-line political debates) are widespread, influence in longterm contexts is less investigated and may be harder to capture. We study the diffusion of scientific terms in an English diachronic scientific corpus, applying Hawkes Processes to capture the role of individual scientists as “influencers” or “influencees” in the diffusion of new concepts. Our findings on two major scientific discoveries in chemistry and astronomy of the 18th century reveal that modelling both the introduction and diffusion of scientific terms in a historical corpus as Hawkes Processes allows detecting patterns of influence between authors on a long-term scale.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"785 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123285744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
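For readers unfamiliar with Hawkes Processes, the sketch below evaluates the standard exponential-kernel conditional intensity, lambda(t) = mu + sum over t_i < t of alpha * exp(-beta * (t - t_i)), on toy event times. The parameter values and events are illustrative only and unrelated to the models fitted in the paper.

```python
# Toy Hawkes conditional intensity with an exponential kernel.
import numpy as np

def hawkes_intensity(t, event_times, mu=0.2, alpha=0.8, beta=1.5):
    # Baseline rate mu plus the decaying excitation from all past events.
    past = np.asarray([ti for ti in event_times if ti < t])
    return mu + alpha * np.exp(-beta * (t - past)).sum()

# Hypothetical (rescaled) times at which a new scientific term was used.
events = [0.0, 0.4, 0.5, 2.0]
for t in (0.1, 0.6, 1.0, 3.0):
    print(t, round(hawkes_intensity(t, events), 3))
```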
Zero-Shot Information Extraction to Enhance a Knowledge Graph Describing Silk Textiles
Thomas Schleider, Raphael Troncy
The knowledge of European silk textile production is a typical case for which the information collected is heterogeneous, spread across many museums and sparse, since rarely complete. Knowledge Graphs for this cultural heritage domain, when developed with appropriate ontologies and vocabularies, make it possible to integrate and reconcile this diverse information. However, many of these original museum records still have some metadata gaps. In this paper, we present a zero-shot learning approach that leverages the ConceptNet common sense knowledge graph to predict categorical metadata informing about the production of the silk objects. We compared the performance of our approach with traditional supervised deep learning-based methods that do require training data. We demonstrate promising and competitive performance for similar datasets and circumstances and the ability to predict sometimes more fine-grained information. Our results can be reproduced using the code and datasets published at https://github.com/silknow/ZSL-KG-silk.
{"title":"Zero-Shot Information Extraction to Enhance a Knowledge Graph Describing Silk Textiles","authors":"Thomas Schleider, Raphael Troncy","doi":"10.18653/v1/2021.latechclfl-1.16","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.16","url":null,"abstract":"The knowledge of the European silk textile production is a typical case for which the information collected is heterogeneous, spread across many museums and sparse since rarely complete. Knowledge Graphs for this cultural heritage domain, when being developed with appropriate ontologies and vocabularies, enable to integrate and reconcile this diverse information. However, many of these original museum records still have some metadata gaps. In this paper, we present a zero-shot learning approach that leverages the ConceptNet common sense knowledge graph to predict categorical metadata informing about the silk objects production. We compared the performance of our approach with traditional supervised deep learning-based methods that do require training data. We demonstrate promising and competitive performance for similar datasets and circumstances and the ability to predict sometimes more fine-grained information. Our results can be reproduced using the code and datasets published at https://github.com/silknow/ZSL-KG-silk.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"30 13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125813844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
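The zero-shot idea can be caricatured as scoring each candidate metadata label by the similarity between an embedding of the record text and an embedding of the label itself. The sketch below uses random placeholder vectors where the paper relies on ConceptNet-based graph embeddings (ZSL-KG), so it only illustrates the scoring step, not the authors' model.

```python
# Rough sketch of embedding-similarity zero-shot labelling with placeholder
# vectors; in the paper the representations come from ConceptNet / ZSL-KG.
import numpy as np

rng = np.random.default_rng(0)
vocab_vectors = {w: rng.normal(size=50) for w in
                 ["silk", "damask", "loom", "weaving", "dyeing", "spain", "lyon"]}

def embed(text):
    vecs = [vocab_vectors[w] for w in text.lower().split() if w in vocab_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b) or 1.0
    return float(a @ b / denom)

record = "damask woven on a loom"          # hypothetical museum record text
labels = ["weaving", "dyeing"]             # hypothetical candidate categories
print(max(labels, key=lambda lab: cosine(embed(record), embed(lab))))
```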
BAHP: Benchmark of Assessing Word Embeddings in Historical Portuguese
Zuoyu Tian, Dylan Jarrett, Juan M. Escalona Torres, Patrícia Amaral
High quality distributional models can capture lexical and semantic relations between words. Hence, researchers design various intrinsic tasks to test whether such relations are captured. However, most of the intrinsic tasks are designed for modern languages, and there is a lack of evaluation methods for distributional models of historical corpora. In this paper, we conducted BAHP: a benchmark of assessing word embeddings in Historical Portuguese, which contains four types of tests: analogy, similarity, outlier detection, and coherence. We examined word2vec models generated from two historical Portuguese corpora in these four test sets. The results demonstrate that our test sets are capable of measuring the quality of vector space models and can provide a holistic view of the model’s ability to capture syntactic and semantic information. Furthermore, the methodology for the creation of our test sets can be easily extended to other historical languages.
{"title":"BAHP: Benchmark of Assessing Word Embeddings in Historical Portuguese","authors":"Zuoyu Tian, Dylan Jarrett, Juan M. Escalona Torres, Patrícia Amaral","doi":"10.18653/v1/2021.latechclfl-1.13","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.13","url":null,"abstract":"High quality distributional models can capture lexical and semantic relations between words. Hence, researchers design various intrinsic tasks to test whether such relations are captured. However, most of the intrinsic tasks are designed for modern languages, and there is a lack of evaluation methods for distributional models of historical corpora. In this paper, we conducted BAHP: a benchmark of assessing word embeddings in Historical Portuguese, which contains four types of tests: analogy, similarity, outlier detection, and coherence. We examined word2vec models generated from two historical Portuguese corpora in these four test sets. The results demonstrate that our test sets are capable of measuring the quality of vector space models and can provide a holistic view of the model’s ability to capture syntactic and semantic information. Furthermore, the methodology for the creation of our test sets can be easily extended to other historical languages.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122973155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
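Analogy and similarity items of the kind included in such a benchmark can be scored against a gensim word2vec model as sketched below. The toy corpus and test items are invented for illustration and are not part of BAHP or the historical Portuguese corpora.

```python
# Sketch: scoring toy analogy and similarity items with gensim word2vec.
from gensim.models import Word2Vec

sentences = [["o", "rei", "mandou", "fazer"],
             ["a", "rainha", "mandou", "fazer"]] * 50   # toy corpus
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=20, seed=1)

# Analogy item: rei - o + a should land near rainha (toy example).
print(model.wv.most_similar(positive=["rei", "a"], negative=["o"], topn=3))
# Similarity item: cosine similarity between two words.
print(model.wv.similarity("rei", "rainha"))
```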
Emotion Classification in German Plays with Transformer-based Language Models Pretrained on Historical and Contemporary Language
Thomas Schmidt, Katrin Dennerlein, Christian Wolff
We present results of a project on emotion classification in historical German plays of the Enlightenment, Storm and Stress, and German Classicism. We have developed a hierarchical annotation scheme consisting of 13 sub-emotions like suffering, love and joy that sum up to 6 main classes and 2 polarity classes (positive/negative). We have conducted textual annotations on 11 German plays and have acquired over 13,000 emotion annotations by two annotators per play. We have evaluated multiple traditional machine learning approaches as well as transformer-based models pretrained on historical and contemporary language for single-label text sequence emotion classification across the different emotion categories. The evaluation is carried out on three different instances of the corpus: (1) taking all annotations, (2) filtering overlapping annotations by annotators, (3) applying a heuristic for speech-based analysis. The best results are achieved on the filtered corpus, with the best models being large transformer-based models pretrained on contemporary German. For the polarity classification, accuracies of up to 90% are achieved. The accuracies become lower for settings with a higher number of classes, reaching 66% for the 13 sub-emotions. Further pretraining of a historical model with a corpus of dramatic texts led to no improvements.
{"title":"Emotion Classification in German Plays with Transformer-based Language Models Pretrained on Historical and Contemporary Language","authors":"Thomas Schmidt, Katrin Dennerlein, Christian Wolff","doi":"10.18653/v1/2021.latechclfl-1.8","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.8","url":null,"abstract":"We present results of a project on emotion classification on historical German plays of Enlightenment, Storm and Stress, and German Classicism. We have developed a hierarchical annotation scheme consisting of 13 sub-emotions like suffering, love and joy that sum up to 6 main and 2 polarity classes (positive/negative). We have conducted textual annotations on 11 German plays and have acquired over 13,000 emotion annotations by two annotators per play. We have evaluated multiple traditional machine learning approaches as well as transformer-based models pretrained on historical and contemporary language for a single-label text sequence emotion classification for the different emotion categories. The evaluation is carried out on three different instances of the corpus: (1) taking all annotations, (2) filtering overlapping annotations by annotators, (3) applying a heuristic for speech-based analysis. Best results are achieved on the filtered corpus with the best models being large transformer-based models pretrained on contemporary German language. For the polarity classification accuracies of up to 90% are achieved. The accuracies become lower for settings with a higher number of classes, achieving 66% for 13 sub-emotions. Further pretraining of a historical model with a corpus of dramatic texts led to no improvements.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125093348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
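The single-label sequence classification setup can be sketched with the Hugging Face Trainer as below. The model name (bert-base-german-cased), label set, and example lines are stand-ins for illustration, not the authors' historical-language checkpoints or their annotated drama corpus.

```python
# Hedged sketch: fine-tuning a German BERT for 6-way emotion classification.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["Leid", "Liebe", "Freude", "Angst", "Zorn", "Verzweiflung"]  # illustrative
data = Dataset.from_dict({
    "text": ["O welche Freude!", "Mein Herz ist voller Leid."],  # toy lines
    "label": [2, 0],
})

tok = AutoTokenizer.from_pretrained("bert-base-german-cased")
data = data.map(lambda x: tok(x["text"], truncation=True,
                              padding="max_length", max_length=64),
                batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-german-cased", num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="emotion-model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```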
The Multilingual Corpus of Survey Questionnaires Query Interface
Danielly Sorato, Diana Zavala-Rojas
The dawn of the digital age led to increasing demands for digital research resources, which shall be quickly processed and handled by computers. Due to the amount of data created by this digitization process, the design of tools that enable the analysis and management of data and metadata has become a relevant topic. In this context, the Multilingual Corpus of Survey Questionnaires (MCSQ) contributes to the creation and distribution of data for the Social Sciences and Humanities (SSH) following FAIR (Findable, Accessible, Interoperable and Reusable) principles, and provides functionalities for end-users that are not acquainted with programming through an easy-to-use interface. By simply applying the desired filters in the graphic interface, users can build linguistic resources for the survey research and translation areas, such as translation memories, thus facilitating data access and usage.
{"title":"The Multilingual Corpus of Survey Questionnaires Query Interface","authors":"Danielly Sorato, Diana Zavala-Rojas","doi":"10.18653/v1/2021.latechclfl-1.5","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.5","url":null,"abstract":"The dawn of the digital age led to increasing demands for digital research resources, which shall be quickly processed and handled by computers. Due to the amount of data created by this digitization process, the design of tools that enable the analysis and management of data and metadata has become a relevant topic. In this context, the Multilingual Corpus of Survey Questionnaires (MCSQ) contributes to the creation and distribution of data for the Social Sciences and Humanities (SSH) following FAIR (Findable, Accessible, Interoperable and Reusable) principles, and provides functionalities for end-users that are not acquainted with programming through an easy-to-use interface. By simply applying the desired filters in the graphic interface, users can build linguistic resources for the survey research and translation areas, such as translation memories, thus facilitating data access and usage.","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122182231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
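As a rough illustration of the kind of filtering the interface exposes, the pandas sketch below selects aligned questionnaire items for one survey and language pair and joins them into a simple translation memory. The column names and values are assumptions for the sketch, not the actual MCSQ schema.

```python
# Hypothetical sketch: filter aligned questionnaire items into a translation memory.
import pandas as pd

corpus = pd.DataFrame({
    "survey":   ["ESS", "ESS", "EVS"],
    "language": ["ENG", "SPA", "ENG"],
    "item_id":  ["B1", "B1", "C2"],
    "text":     ["How satisfied are you with life?",
                 "¿Cómo de satisfecho está con su vida?",
                 "placeholder item"],
})

eng = corpus[(corpus.language == "ENG") & (corpus.survey == "ESS")]
spa = corpus[(corpus.language == "SPA") & (corpus.survey == "ESS")]
tm = eng.merge(spa, on=["survey", "item_id"], suffixes=("_eng", "_spa"))
print(tm[["item_id", "text_eng", "text_spa"]])
```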
Stylometric Literariness Classification: the Case of Stephen King
Andreas van Cranenburgh, E. Ketzan
This paper applies stylometry to quantify the literariness of 73 novels and novellas by American author Stephen King, chosen as an extraordinary case of a writer who has been dubbed both “high” and “low” in literariness in critical reception. We operationalize literariness using a measure of stylistic distance (Cosine Delta) based on the 1000 most frequent words in two bespoke comparison corpora used as proxies for literariness: one of popular genre fiction, another of National Book Award-winning authors. We report that a supervised model is highly effective in distinguishing the two categories, with 94.6% macro average in a binary classification. We define two subsets of texts by King—“high” and “low” literariness works as suggested by critics and ourselves—and find that a predictive model does identify King’s Dark Tower series and novels such as Dolores Claiborne as among his most “literary” texts, consistent with critical reception, which has also ascribed postmodern qualities to the Dark Tower novels. Our results demonstrate the efficacy of Cosine Delta-based stylometry in quantifying the literariness of texts, while also highlighting the methodological challenges of literariness, especially in the case of Stephen King. The code and data to reproduce our results are available at https://github.com/andreasvc/kinglit
{"title":"Stylometric Literariness Classification: the Case of Stephen King","authors":"Andreas van Cranenburgh, E. Ketzan","doi":"10.18653/v1/2021.latechclfl-1.21","DOIUrl":"https://doi.org/10.18653/v1/2021.latechclfl-1.21","url":null,"abstract":"This paper applies stylometry to quantify the literariness of 73 novels and novellas by American author Stephen King, chosen as an extraordinary case of a writer who has been dubbed both “high” and “low” in literariness in critical reception. We operationalize literariness using a measure of stylistic distance (Cosine Delta) based on the 1000 most frequent words in two bespoke comparison corpora used as proxies for literariness: one of popular genre fiction, another of National Book Award-winning authors. We report that a supervised model is highly effective in distinguishing the two categories, with 94.6% macro average in a binary classification. We define two subsets of texts by King—“high” and “low” literariness works as suggested by critics and ourselves—and find that a predictive model does identify King’s Dark Tower series and novels such as Dolores Claiborne as among his most “literary” texts, consistent with critical reception, which has also ascribed postmodern qualities to the Dark Tower novels. Our results demonstrate the efficacy of Cosine Delta-based stylometry in quantifying the literariness of texts, while also highlighting the methodological challenges of literariness, especially in the case of Stephen King. The code and data to reproduce our results are available at https://github.com/andreasvc/kinglit","PeriodicalId":441300,"journal":{"name":"Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131564622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
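The Cosine Delta measure itself is easy to state: z-score the relative frequencies of the n most frequent words across the corpus, then take cosine distances between the standardized text profiles. A compact numpy sketch with toy frequencies (the paper uses profiles over the 1000 most frequent words) is given below.

```python
# Compact sketch of Cosine Delta over toy word-frequency profiles.
import numpy as np

def cosine_delta(freq_matrix):
    """freq_matrix: texts x words, relative frequencies of the top-n words."""
    z = (freq_matrix - freq_matrix.mean(axis=0)) / freq_matrix.std(axis=0)
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T  # pairwise cosine distances between texts

freqs = np.array([
    [0.051, 0.032, 0.021, 0.011],   # text 1 (toy values)
    [0.049, 0.035, 0.019, 0.013],   # text 2
    [0.060, 0.020, 0.030, 0.005],   # text 3
])
print(np.round(cosine_delta(freqs), 3))
```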