Latest Publications: Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Unsupervised Adverbial Identification in Modern Chinese Literature
Wenxiu Xie, J. Lee, Fangqiong Zhan, Xiao Han, Chi-Yin Chow
In many languages, adverbials can be derived from words of various parts of speech. In Chinese, the derivation may be marked either with the standard adverbial marker DI, or with the non-standard marker DE. Since DE also serves double duty as the attributive marker, accurate identification of adverbials requires disambiguation of its syntactic role. As parsers are trained predominantly on texts using the standard adverbial marker DI, they often fail to recognize adverbials suffixed with the non-standard DE. This paper addresses this problem with an unsupervised, rule-based approach for adverbial identification that utilizes dependency tree patterns. Experimental results show that this approach outperforms a masked language model baseline. We apply this approach to analyze standard and non-standard adverbial marker usage in modern Chinese literature.
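As a rough illustration of how a dependency-tree rule can disambiguate the marker, the sketch below treats a DE-marked phrase as adverbial when it attaches to a verbal head. The tree encoding, POS labels, and the single rule are illustrative assumptions, not the authors' actual pattern set.

```python
# Sketch: decide whether a DE-marked phrase is adverbial by inspecting the
# dependency tree. Standard adverbials use DI (地); the non-standard marker
# DE (的) is ambiguous with the attributive marker. The rule below is a
# deliberately simplified assumption: DE is adverbial if its head is a verb.

def is_adverbial_de(tokens, de_index):
    """Return True if the DE at de_index attaches to a verbal head."""
    head = tokens[de_index]["head"]   # index of the governing token
    return tokens[head]["pos"] == "VERB"

# Toy parse of "他 慢慢 的 走" ("he slowly DE walks"); heads are hand-built.
tokens = [
    {"form": "他",   "pos": "PRON", "head": 3},
    {"form": "慢慢", "pos": "ADV",  "head": 3},
    {"form": "的",   "pos": "PART", "head": 3},  # non-standard adverbial DE
    {"form": "走",   "pos": "VERB", "head": 3},  # root (points to itself)
]

print(is_adverbial_de(tokens, 2))  # attaches to the verb 走 → adverbial
```

A real system would derive such trees from a parser and apply a larger rule inventory, as the abstract describes.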
Citations: 2
The Early Modern Dutch Mediascape. Detecting Media Mentions in Chronicles Using Word Embeddings and CRF
A. Lassche, R. Morante
While the production of information in the European early modern period is a well-researched topic, the question of how people engaged with the information explosion that occurred in early modern Europe is still underexposed. This paper presents annotations and experiments aimed at exploring whether we can automatically extract media-related information (source, perception, and receiver) from a corpus of early modern Dutch chronicles, in order to gain insight into the mediascape of early modern middle-class people from a historical perspective. In a number of classification experiments with Conditional Random Fields, three categories of features are tested: (i) raw and binary word embedding features, (ii) lexicon features, and (iii) character features. Overall, the classifier that uses raw embeddings performs slightly better. However, given that the best F-scores are around 0.60, we conclude that the machine learning approach needs to be combined with a close-reading approach for the results to be useful in answering historical research questions.
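The three feature families could be encoded as per-token feature dicts of the kind consumed by common CRF toolkits; the lexicon entries, embedding values, and feature names below are invented for illustration.

```python
# Sketch of the three feature families (embedding, lexicon, character) as
# per-token feature dicts, the input format of typical CRF toolkits.
# MEDIA_LEXICON and EMBEDDINGS are invented stand-ins.

MEDIA_LEXICON = {"tijding", "brief", "courant"}        # assumed media terms
EMBEDDINGS = {"courant": [0.12, -0.40, 0.33]}          # stand-in word vectors

def token_features(token):
    feats = {
        "lower": token.lower(),
        "in_media_lexicon": token.lower() in MEDIA_LEXICON,  # lexicon feature
        "prefix2": token[:2],                                # character features
        "suffix2": token[-2:],
        "is_capitalized": token[:1].isupper(),
    }
    # raw embedding features: one real-valued feature per vector dimension
    for i, value in enumerate(EMBEDDINGS.get(token.lower(), [])):
        feats[f"emb_{i}"] = value
    return feats

feats = token_features("Courant")
print(feats["in_media_lexicon"], feats["emb_0"])  # True 0.12
```

The "binary" embedding variant mentioned in the abstract would discretize the `emb_i` values rather than pass them through as reals.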
Citations: 0
Period Classification in Chinese Historical Texts
Zuoyu Tian, Sandra Kübler
In this study, we examine language change in Chinese Biji using a classification task: classifying Ancient Chinese texts by time period. Specifically, we focus on a unique genre in classical Chinese literature: Biji (literally “notebook” or “brush notes”), i.e., collections of anecdotes, quotations, and anything else their authors considered noteworthy. Biji span hundreds of years across many dynasties and conserve informal language in written form. For these reasons, they are regarded as a good resource for investigating language change in Chinese (Fang, 2010). In this paper, we create a new dataset of 108 Biji across four dynasties. Based on the dataset, we first introduce a time period classification task for Chinese. Then we investigate different feature representation methods for classification. The results show that models using contextualized embeddings perform best. An analysis of the top features chosen by the word n-gram model (after bleaching proper nouns) confirms that these features are informative and correspond to observations and assumptions made by historical linguists.
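A minimal sketch of n-gram extraction with proper-noun bleaching, assuming a hand-supplied proper-noun set and a placeholder token (both illustrative choices, not the paper's exact setup):

```python
# Sketch: word-bigram features after "bleaching" proper nouns, so the period
# classifier cannot simply memorize dynasty-specific names. The placeholder
# token and the hand-given proper-noun set are illustrative assumptions.

from collections import Counter

def bleach(tokens, proper_nouns, placeholder="<PN>"):
    return [placeholder if t in proper_nouns else t for t in tokens]

def ngram_counts(tokens, n=2):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ["苏轼", "尝", "言", "之"]              # toy classical-Chinese segment
bleached = bleach(tokens, proper_nouns={"苏轼"})
print(bleached)        # ['<PN>', '尝', '言', '之']
print(ngram_counts(bleached))
```

The resulting counts would feed a standard bag-of-n-grams classifier; the contextualized-embedding models the paper finds best replace this feature step entirely.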
Citations: 2
Batavia asked for advice. Pretrained language models for Named Entity Recognition in historical texts.
S. Arnoult, L. Petram, P. Vossen
Pretrained language models like BERT have advanced the state of the art for many NLP tasks. For resource-rich languages, one can choose between a number of language-specific models, while multilingual models are also worth considering. These models are well known for their crosslingual performance, but have also shown competitive in-language performance on some tasks. We consider monolingual and multilingual models from the perspective of historical texts, and in particular for texts enriched with editorial notes: how do language models deal with the historical and editorial content in these texts? We present a new Named Entity Recognition dataset for Dutch based on 17th- and 18th-century United East India Company (VOC) reports extended with modern editorial notes. Our experiments with multilingual and Dutch pretrained language models confirm the crosslingual abilities of multilingual models while showing that all language models can leverage mixed-variant data. In particular, language models successfully incorporate notes for the prediction of entities in historical texts. We also find that multilingual models outperform monolingual models on our data, but that this superiority is linked to the task at hand: multilingual models lose their advantage when confronted with more semantic tasks.
Citations: 0
Quantifying Contextual Aspects of Inter-annotator Agreement in Intertextuality Research
Enrique Manjavacas Arevalo, Laurence Mellerin, M. Kestemont
We report on an inter-annotator agreement experiment involving instances of text reuse, focusing on the well-known case of biblical intertextuality in medieval literature. We target the application use case of literary scholars whose aim is to document instances of biblical references in the ‘apparatus fontium’ of a prospective digital edition. We develop a Bayesian implementation of Cohen’s kappa for multiple annotators that allows us to assess the influence of various contextual effects on inter-annotator agreement, producing both more robust estimates of the agreement indices and insights into the annotation process that leads to them. As a result, we are able to produce a novel and nuanced estimation of inter-annotator agreement in the context of intertextuality, exploring the challenges that arise from manually annotating a dataset of biblical references in the writings of Bernard of Clairvaux. Among other things, our method reveals that the obtained agreement depends heavily on the biblical source book of the proposed reference, as well as on the underlying algorithm used to retrieve the candidate match.
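The paper develops a Bayesian, multi-annotator extension of Cohen's kappa; for orientation, the classical two-annotator kappa it generalizes can be computed from scratch as follows (toy labels, not the study's data):

```python
# Classical two-annotator Cohen's kappa, computed from scratch on toy labels
# ("ref" = biblical reference, "none" = no reference). The paper's Bayesian,
# multi-annotator model generalizes this quantity.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: both annotators independently pick the same label
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["ref", "ref", "none", "ref", "none", "none"]
b = ["ref", "none", "none", "ref", "none", "ref"]
print(round(cohens_kappa(a, b), 3))  # 0.333
```

Kappa corrects raw agreement (4/6 here) for the agreement expected by chance (1/2 here), which is why it is preferred over simple accuracy for annotation studies.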
Citations: 0
FrameNet-like Annotation of Olfactory Information in Texts
Sara Tonelli, S. Menini
Although olfactory references play a crucial role in our cultural memory, only a few works in NLP have tried to capture them from a computational perspective. Currently, the main challenge is not so much the development of technological components for olfactory information extraction, given recent advances in semantic processing and natural language understanding, but rather the lack of a theoretical framework to capture this information from a linguistic point of view, as a preliminary step towards the development of automated systems. Therefore, in this work we present annotation guidelines, developed with the help of history scholars and domain experts, aimed at capturing all the relevant elements involved in olfactory situations or events described in texts. These guidelines have been inspired by FrameNet annotation, but underwent some adaptations, which are detailed in this paper. Furthermore, we present a case study concerning the annotation of olfactory situations in English historical travel writings describing trips to Italy. An analysis of the most frequent role fillers shows that olfactory descriptions pertain to some typical domains such as religion, food, nature, the ancient past, and poor sanitation, all supporting the creation of a stereotypical imagery related to Italy. On the other hand, positive feelings triggered by smells are prevalent, and contribute to framing travels to Italy as an exciting experience involving all senses.
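A FrameNet-style olfactory annotation could be stored as a simple record; the frame-element names below (smell source, perceiver, quality, location) are plausible placeholders, since the actual element inventory is defined in the paper's guidelines and not reproduced here.

```python
# A frame instance for an olfactory event, stored as a simple record. The
# element names (smell_source, perceiver, quality, location) are plausible
# placeholders; the real inventory is defined in the annotation guidelines.

from dataclasses import dataclass
from typing import Optional

@dataclass
class OlfactoryFrame:
    lexical_unit: str                     # the smell word evoking the frame
    smell_source: Optional[str] = None
    perceiver: Optional[str] = None
    quality: Optional[str] = None
    location: Optional[str] = None

ann = OlfactoryFrame(
    lexical_unit="stench",
    smell_source="the canals",
    quality="foul",
    location="Venice",
)
print(ann.lexical_unit, ann.quality)  # stench foul
```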
Citations: 6
Data-Driven Detection of General Chiasmi Using Lexical and Semantic Features
Felix Schneider, Björn Barz, Phillip Brandes, Sophie Marshall, Joachim Denzler
Automatic detection of stylistic devices is an important tool for literary studies, e.g., for stylometric analysis or argument mining. A particularly striking device is the rhetorical figure called chiasmus, which involves the inversion of semantically or syntactically related words. Existing works focus on a special case of chiasmi that involve identical words in an A B B A pattern, so-called antimetaboles. In contrast, we propose an approach targeting the more general and challenging case A B B’ A’, where the words A, A’ and B, B’ constituting the chiasmus do not need to be identical but just related in meaning. To this end, we generalize the established candidate phrase mining strategy from antimetaboles to general chiasmi and propose novel features based on word embeddings and lemmata for capturing both semantic and syntactic information. These features serve as input for a logistic regression classifier, which learns to distinguish between rhetorical chiasmi and coincidental chiastic word orders without special meaning. We evaluate our approach on two datasets consisting of classical German dramas, four texts with annotated chiasmi and 500 unannotated texts. Compared to previous methods for chiasmus detection, our novel features improve the average precision from 17% to 28% and the precision among the top 100 results from 13% to 35%.
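The generalized A B B' A' candidate search can be sketched as a windowed scan in which two words match when they are identical or semantically related; the toy synonym table below stands in for the embedding- and lemma-based relatedness used in the paper.

```python
# Sketch of the generalized A B B' A' candidate search: words match when they
# are identical (the antimetabole case) or related in meaning. The SYNONYMS
# table is a toy stand-in for embedding/lemma relatedness.

SYNONYMS = {("begin", "start"), ("end", "finish")}

def related(u, v):
    return u == v or (u, v) in SYNONYMS or (v, u) in SYNONYMS

def chiasmus_candidates(tokens, max_gap=5):
    """Yield (i, j, k, l) with i < j < k < l where tokens[i]~tokens[l] and
    tokens[j]~tokens[k], i.e. a crosswise A B B' A' configuration."""
    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, min(i + max_gap, n)):
            for k in range(j + 1, min(j + max_gap, n)):
                for l in range(k + 1, min(k + max_gap, n)):
                    if related(tokens[i], tokens[l]) and related(tokens[j], tokens[k]):
                        yield (i, j, k, l)

tokens = "begin the end and finish the start".split()
print(list(chiasmus_candidates(tokens)))
```

In the paper's pipeline, candidates mined this way are then scored by a logistic regression classifier to separate rhetorical chiasmi from coincidental crosswise word orders.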
Citations: 1
End-to-end style-conditioned poetry generation: What does it take to learn from examples alone?
Jörg Wöckener, T. Haider, Tristan Miller, The-Khang Nguyen, Thanh Tung Linh Nguyen, Minh Vu Pham, Jonas Belouadi, Steffen Eger
In this work, we design an end-to-end model for poetry generation based on conditioned recurrent neural network (RNN) language models whose goal is to learn stylistic features (poem length, sentiment, alliteration, and rhyming) from examples alone. We show this model successfully learns the ‘meaning’ of length and sentiment, as we can control it to generate longer or shorter as well as more positive or more negative poems. However, the model does not grasp sound phenomena like alliteration and rhyming, but instead exploits low-level statistical cues. Possible reasons include the size of the training data, the relatively low frequency and difficulty of these sublexical phenomena as well as model biases. We show that more recent GPT-2 models also have problems learning sublexical phenomena such as rhyming from examples alone.
Citations: 6
A Mixed-Methods Analysis of Western and Hong Kong–based Reporting on the 2019–2020 Protests
Arya D. McCarthy, James Scharf, G. Dore
We apply statistical techniques from natural language processing to Western and Hong Kong–based English language newspaper articles that discuss the 2019–2020 Hong Kong protests of the Anti-Extradition Law Amendment Bill Movement. Topic modeling detects central themes of the reporting and shows the differing agendas toward one country, two systems. Embedding-based usage shift (at the word level) and sentiment analysis (at the document level) both support that Hong Kong–based reporting is more negative and more emotionally charged. A two-way test shows that while July 1, 2019 is a turning point for media portrayal, the differences between western- and Hong Kong–based reporting did not magnify when the protests began; rather, they already existed. Taken together, these findings clarify how the portrayal of activism in Hong Kong evolved throughout the Movement.
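A document-level sentiment comparison of the kind described can be approximated with a simple lexicon count; the word lists and length-normalized score below are illustrative stand-ins, not the paper's actual sentiment pipeline.

```python
# Lexicon-count approximation of document-level sentiment; the word lists
# and the length-normalized score are illustrative, not the paper's method.

POSITIVE = {"peaceful", "support", "hope"}
NEGATIVE = {"violence", "teargas", "clash", "arrest"}

def doc_sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / max(len(words), 1)   # normalize by document length

print(doc_sentiment("Police clash with protesters as teargas is fired"))
print(doc_sentiment("Marchers hope for peaceful support"))
```

Comparing the distribution of such scores between the two article pools is one way to operationalize the claim that one side's reporting is "more negative".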
Citations: 5