Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020: latest papers

AriEmozione: Identifying Emotions in Opera Verses
Pub Date : 1900-01-01 DOI: 10.4000/books.aaccademia.8528
Francesco Fernicola, Shibingfeng Zhang, F. Garcea, P. Bonora, Alberto Barrón-Cedeño
We present a new task: the identification of the emotions transmitted in Italian opera arias at the verse level. The task matters both for organizing the vast repertoire of Italian opera arias and for enabling further analyses by musicologists and the lay public. We shape the task as a multi-class supervised problem, considering six emotions: love, joy, admiration, anger, sadness, and fear. To address it, we manually annotated an opera corpus with 2.5k verses (which we release to the research community) and experimented with different classification models and representations. Our best-performing models reach macro-averaged F1 measures of ∼0.45, always using character 3-gram representations. Such performance reflects the difficulty of the task at hand, partially caused by the size and nature of the corpus, which consists of relatively short verses written in 18th-century Italian.
Citations: 3
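The AriEmozione abstract above mentions character 3-gram representations and a macro-averaged F1 score. The sketch below shows, in plain Python, what those two notions mean. It is only an illustration: the verse, the gold labels, and the predictions are made up, and the function names `char_ngrams` and `macro_f1` are hypothetical, not taken from the authors' code.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return a bag (Counter) of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all classes seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up verse and labels, for illustration only.
grams = char_ngrams("amor mio")
gold = ["love", "joy", "anger", "love"]
pred = ["love", "love", "anger", "joy"]

print(sorted(grams))
print(macro_f1(gold, pred))
```

Because every class contributes equally to the macro average, a rare emotion that is always misclassified pulls the score down as much as a frequent one would.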
How Good are Humans at Native Language Identification? A Case Study on Italian L2 writings
Pub Date : 1900-01-01 DOI: 10.4000/books.aaccademia.8475
Elisa Di Nuovo, C. Bosco, E. Corino
In this paper we present a pilot study on human performance in the Native Language Identification task. We performed two tests aimed at establishing a human baseline for the task, in which test takers had to identify the writers’ L1 relying only on scripts written in Italian by native speakers of English, French, German and Spanish. We then conducted an error analysis considering the language background of both test takers and text writers.
Citations: 0
Grounded and Ungrounded Referring Expressions in Human Dialogues: Language Mirrors Different Grounding Conditions
Pub Date : 1900-01-01 DOI: 10.4000/books.aaccademia.8600
Eleonora Gualdoni, R. Bernardi, R. Fernández, Sandro Pezzelle
We study how language use differs between dialogue partners in a visually grounded reference task when a referent is mutually identifiable by both interlocutors vs. when it is only available to one of them. In the latter case, the addressee needs to disconfirm a proposed description, a skill largely neglected by both the theoretical and the computational linguistics communities. We consider a number of linguistic features that we expect to vary across conditions. We then analyze their effectiveness in distinguishing between the two conditions by means of statistical tests and a feature-based classifier. Overall, we show that language mirrors different grounding conditions, paving the way for deeper future investigation of referential disconfirmation.
Citations: 0
Suoidne-varra-bleahkka-mála-bihkka-senet-dielku 'hay-blood-ink-paint-tar-mustard-stain' - Should compounds be lexicalized in NLP?
Pub Date : 1900-01-01 DOI: 10.4000/books.aaccademia.8979
Linda Wiechetek, Chiara Argese, Flammie A. Pirinen, Trond Trosterud
Lexicalizing compounds, in addition to treating them dynamically, is key to obtaining idiomatic translations and to detecting errors in the compounds themselves. We present and evaluate an e-dictionary (NDS) and a grammar checker (GramDivvun) for North Sámi. We achieve a coverage of 98% for lookups in NDS and of 96% for compound error detection in GramDivvun.
Citations: 0
Simple Data Augmentation for Multilingual NLU in Task Oriented Dialogue Systems
Pub Date : 1900-01-01 DOI: 10.4000/books.aaccademia.8648
Samuel Louvan, B. Magnini
Data augmentation has shown potential in alleviating data scarcity for Natural Language Understanding (e.g. slot filling and intent classification) in task-oriented dialogue systems. As prior work has mostly experimented with English datasets, we focus on five different languages, and consider a setting where limited data are available. We investigate the effectiveness of non-gradient-based augmentation methods, involving simple text span substitutions and syntactic manipulations. Our experiments show that (i) augmentation is effective in all cases, particularly for slot filling; and (ii) it is beneficial for a joint intent-slot model based on multilingual BERT, both in limited-data settings and when the full training data is used.
Citations: 5
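The data augmentation abstract above refers to "simple text span substitutions" for slot filling. One plausible reading, sketched below under stated assumptions, is to replace a labelled slot span with another value of the same slot type. The BIO label scheme, the helper names `extract_spans` and `substitute_span`, the example utterance, and the value dictionary are all hypothetical; the abstract does not specify the authors' actual procedure.

```python
import random


def extract_spans(labels):
    """Return (start, end, slot_type) for each BIO-labelled span; end is exclusive."""
    spans, start = [], None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        if start is not None and not label.startswith("I-"):
            spans.append((start, i, labels[start][2:]))
            start = None
        if label.startswith("B-"):
            start = i
    return spans


def substitute_span(tokens, labels, slot_values, rng):
    """Swap one slot span for another (non-empty) value of the same slot type."""
    candidates = [s for s in extract_spans(labels) if slot_values.get(s[2])]
    if not candidates:
        return tokens, labels  # nothing to substitute
    start, end, slot = rng.choice(candidates)
    new_value = rng.choice(slot_values[slot]).split()
    new_labels = ["B-" + slot] + ["I-" + slot] * (len(new_value) - 1)
    return (tokens[:start] + new_value + tokens[end:],
            labels[:start] + new_labels + labels[end:])


# Made-up utterance and slot dictionary, for illustration only.
tokens = "book a flight to new york".split()
labels = ["O", "O", "O", "O", "B-city", "I-city"]
slot_values = {"city": ["rome", "san francisco"]}

new_tokens, new_labels = substitute_span(tokens, labels, slot_values, random.Random(0))
print(new_tokens)
print(new_labels)
```

The intent label of the utterance is left untouched, since swapping one city for another does not change what the user is asking for; only the slot labels have to be regenerated to match the new span length.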
Multifunctional ISO Standard Dialogue Act Tagging in Italian
Pub Date : 1900-01-01 DOI: 10.4000/books.aaccademia.8860
G. Roccabruna, Alessandra Cervone, G. Riccardi
The task of Dialogue Act (DA) tagging, a crucial component in many conversational agents, is often addressed assuming a single DA per speaker turn in the conversation. However, speakers’ turns are often multifunctional, that is, they can contain more than one DA (e.g. “I’m Alex. Have we met before?” contains a ‘statement’ followed by a ‘question’). This work focuses on multifunctional DA tagging in Italian. First, we present iLISTEN2ISO, a novel resource with multifunctional DA annotation in Italian, created by annotating the iLISTEN corpus with the ISO standard. We provide an analysis of the corpus showing the importance of multifunctionality for DA tagging. Additionally, we train DA taggers for Italian on iLISTEN (achieving state-of-the-art results) and iLISTEN2ISO. Our findings indicate the importance of using a multifunctional approach for DA tagging.
Citations: 5
Building a Treebank in Universal Dependencies for Italian Sign Language
Pub Date : 1900-01-01 DOI: 10.4000/books.aaccademia.8320
Gaia Caligiore, C. Bosco, A. Mazzei
Italian Sign Language (LIS) is the natural language used by the Italian Deaf community. This paper discusses the application of the Universal Dependencies (UD) format to the syntactic annotation of a LIS corpus. This investigation aims in particular at contributing to sign language research by addressing the challenges that the visual-manual modality of LIS creates generally in linguistic annotation and specifically in segmentation and syntactic analysis. We addressed two case studies from the storytelling domain, which were first segmented on the ELAN platform and then syntactically annotated in the CoNLL-U format.
Citations: 1
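The treebank abstract above mentions syntactic annotation in the CoNLL-U format used by Universal Dependencies. As a rough illustration, CoNLL-U stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `#` comment lines and a blank line between sentences. The reader below is a minimal sketch; the function name `parse_conllu` and the two-sign sample are invented for illustration and do not come from the LIS corpus.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):    # sentence-level comment / metadata
            continue
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError("expected 10 tab-separated columns: %r" % line)
            current.append(dict(zip(FIELDS, columns)))
    if current:                       # input may lack a trailing blank line
        sentences.append(current)
    return sentences


# Invented two-sign example (glosses in capitals), for illustration only.
SAMPLE = "\n".join([
    "# text = CANE CORRERE",
    "\t".join(["1", "CANE", "cane", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "CORRERE", "correre", "VERB", "_", "_", "0", "root", "_", "_"]),
    "",
])

sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["id"], token["form"], token["head"], token["deprel"])
```

All values are kept as strings here, because the ID column may also hold ranges or decimal IDs in full CoNLL-U files.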
A Diachronic Italian Corpus based on "L'Unità"
Pub Date : 1900-01-01 DOI: 10.4000/books.aaccademia.8245
Pierpaolo Basile, A. Caputo, Tommaso Caselli, Pierluigi Cassotti, Rossella Varvara
In this paper, we describe the creation of a diachronic corpus for Italian by exploiting the digital archive of the newspaper “L’Unità”. We automatically clean and annotate the corpus with PoS tags, lemmas, named entities and syntactic dependencies. Moreover, we compute frequency-based time series for tokens, lemmas and entities. We report corpus statistics that take the temporal dimension into account and describe some examples of usage of the time series.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 Motivation and Background
Diachronic linguistics is one of the two major temporal dimensions of language study proposed by de Saussure in his Cours de linguistique générale and has a long tradition in Linguistics. Recently, the increasing availability of diachronic corpora as well as the development of new NLP techniques for representing word meanings has boosted the application of computational models to investigate historical language data (Hamilton et al., 2016; Tahmasebi et al., 2018; Tang, 2018). This culminated in SemEval-2020 Unsupervised Lexical Semantic Change Detection (Schlechtweg et al., 2020), the first attempt to systematically evaluate automatic methods for language change detection.
Italian is a Romance language which has undergone many changes in its history. Its official adoption as a national language occurred only after the Unification of Italy (1861), having previously been a literary language. Diachronic corpora of Italian are currently available and accessible to the public (e.g., DiaCORIS and MIDIA). Unfortunately, restricted access to and distribution of these resources limit their utilisation, which prevents the application of more recent NLP methods to the diachronic dimension. To obviate this limit, we collect and make freely available (https://github.com/swapUniba/unita/) a new corpus based on the newspaper “L’Unità”.
Founded by Antonio Gramsci on February 12th, 1924, “L’Unità” was the official newspaper of the Italian Communist Party (henceforth PCI, the acronym of Partito Comunista Italiano). The newspaper had a troubled history: with the dissolution of PCI in 1991, it continued to live as the official newspaper of the new Democratic Party of the Left (PDS/DS) until July 31st, 2014. After that date, it ceased publication until June 30th, 2015, and it was definitively closed on June 3rd, 2017. Since 2017, the historical archive of “L’Unità” has again been made visible and available on the Web (https://archivio.unita.news/). One of the main issues of this resource is the lack of information about who owns the rights to the original archive. To our knowledge, the online version of the archive was legally obtained by downloading the original archive before the closure of the newspaper. The current archive, available online, contains neither the local editions of the newspaper nor the photographic archive.
The main contribution of this work lies in the resource itself and its accessibility to the research community at large. The corpus is distributed in two formats: raw text and pre-processed. The validity of the corpus for the automatic study of language change is currently part of the DIACR-Ita task (https://diacr-ita.github.io/DIACR-Ita/) at EVALITA 2020. However, we illustrate some further potential applications of the corpus.
Italian Diachronic Corpora
Various diachronic corpora of Italian are currently available to the public. DiaCORIS (http://corpora.dslo.unibo.it/DiaCORIS/; Onelli et al., 2006) comprises written Italian texts produced between 1861 and 1945, for a total of 100 million words, while MIDIA (www.corpusmidia.unito.it; Gaeta et al., 2013) contains written Italian documents from the beginning of the 13th century to the first half of the 20th century, for a total of 7.5 million words and more than 800 texts belonging to different genres. The Corpus OVI dell'Italiano antico (http://gattoweb.ovi.cnr.it) contains 1,948 texts from the 12th to the 14th century, for a total of 536,000 words. The LIZ database (https://www.zanichelli.it/ricerca/prodotti/liz-4-0-letteratura-italiana-zanichelli) contains 1,000 literary texts from the 13th to the 20th century. Finally, the corpus of Alcide De Gasperi's public documents (Tonelli et al., 2019) consists of 1,762 documents (newspaper articles, propaganda documents, official letters and parliamentary speeches, for a total of 3 million tokens) written by the Italian politician Alcide De Gasperi and published between 1901 and 1954.
These existing resources differ from one another and from the present corpus in several respects. The first is the time span of the texts. The OVI corpus considers texts from the early stages of Italian, covering three centuries. The MIDIA corpus and the LIZ database cover seven centuries, from the 13th to the first half of the 20th century. DiaCORIS, the De Gasperi corpus and the L'Unità corpus contain texts from shorter and more recent periods. The time span considered in the L'Unità corpus is nevertheless interesting for the study of Italian because of the profound changes that took place in that period: in the second half of the 20th century, Italian became more widely spread and used across all social classes.
Second, the corpora differ in the genres they represent. DiaCORIS and MIDIA were designed as representative and balanced samples of written Italian (covering, among other genres, academic prose, fiction, press and legal texts). The OVI corpus and the LIZ database comprise only literary texts. The De Gasperi corpus is representative of the political texts of a single author. The L'Unità corpus represents only newspaper language, but this restriction may benefit the study of diachronic lexical change: observed semantic shifts cannot be attributed to different genres being attested in different periods, and can instead be interpreted as genuine semantic change.
Finally, even though most of these corpora can be queried online (the LIZ database excepted), only the De Gasperi corpus can be freely downloaded. This restriction affects the usability of these resources for the NLP community. With the L'Unità corpus, we aim to release a new diachronic resource that is freely available and can be used for both theoretical and computational studies of language change.
Citations: 5
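The L'Unità abstract above mentions frequency-based time series for tokens, lemmas and entities. The sketch below shows one simple way such a series could be built: the relative frequency of a target token per year. It assumes whitespace tokenization and year-stamped documents; the function name `frequency_time_series` and the three toy documents are hypothetical and are not taken from the released corpus or its tooling.

```python
from collections import Counter, defaultdict


def frequency_time_series(documents, target):
    """Map each year to the relative frequency of `target` among that year's tokens.

    `documents` is an iterable of (year, text) pairs; tokens are lowercased
    whitespace-separated strings (a simplifying assumption).
    """
    counts = defaultdict(Counter)
    for year, text in documents:
        counts[year].update(text.lower().split())
    series = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        series[year] = counts[year][target] / total if total else 0.0
    return series


# Toy documents, for illustration only.
documents = [
    (1950, "il partito e il popolo"),
    (1950, "il lavoro"),
    (1990, "il partito nuovo"),
]

series = frequency_time_series(documents, "partito")
for year, value in series.items():
    print(year, round(value, 3))
```

Normalizing by the yearly token total matters because the amount of text per year is rarely constant in a newspaper archive, so raw counts would mostly track archive size.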
Analysis of Lexical Semantic Changes in Corpora with the Diachronic Engine
Pub Date : 1900-01-01 DOI: 10.4000/books.aaccademia.8343
Pierluigi Cassotti, Pierpaolo Basile, M. Degemmis, G. Semeraro
With the growing availability of digitized diachronic corpora, the need for tools capable of taking into account the diachronic component of corpora becomes ever more pressing. Recent work on diachronic embeddings shows that computational approaches to the diachronic analysis of language are promising, but they are not user-friendly for people without a technical background. This paper presents the Diachronic Engine, a system for the diachronic analysis of the lexical features of corpora. Diachronic Engine computes word frequency, concordances and collocations taking into account the temporal dimension. It can also compute temporal word embeddings and time series that can be exploited for lexical semantic change detection.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 Motivation and Background
Synchronic corpora are widely used in linguistics to derive, through statistical approaches, a set of abstract rules that govern the particular language under analysis. The same methodology can be adopted for analyzing the evolution of word meanings over time in the case of diachronic corpora. However, this process can be very time-consuming. Usually, linguists rely on software tools that make it easy to explore and clean the corpus while highlighting the most relevant linguistic features. Sketch Engine (https://www.sketchengine.eu/; Kilgarriff et al., 2004; Kilgarriff et al., 2014) is the leading tool in the corpus analysis field. Among several interesting features, Sketch Engine includes trends (Kilgarriff et al., 2015), which allow for diachronic analysis based on the frequency distribution of words. Trends rely merely on frequency features, ignoring word usage information. Moreover, the Sketch Engine interface does not provide temporal information about concordances and collocations. NoSketchEngine (https://nlp.fi.muni.cz/trac/noske) is an open-source version of Sketch Engine. It requires technical expertise for the setup and, unlike Sketch Engine, it does not support word sketches, terminology, thesaurus, n-grams, trends and corpus building. An interesting system is DiaCollo (https://www.clarin.eu/showcase/; Jurish, 2015), a software tool for the discovery, comparison, and interactive visualization of target word combinations. Combinations can be requested for a particular time period, or for a direct comparison between different time periods. However, DiaCollo focuses exclusively on the extraction and visualization of collocations from diachronic corpora.
In recent work on computational diachronic linguistics, techniques based on word embeddings produce promising results. In SemEval-2020 Task 1 (Schlechtweg et al., 2020), for instance, type embeddings reach high performance on both subtasks. However, these techniques are not included in any of the aforementioned linguistic tools. In order to bridge this gap, we try to build a tool that includes approaches for the analysis of diachronic embeddings. The result of our work is the Diachronic Engine (DE), an engine for the management of diachronic corpora that provides tools for detecting lexical semantic change from a frequentist perspective. DE includes tools for extracting diachronic collocations and concordances for different time periods, and for computing semantic change time series by exploiting word frequencies and the similarity of word embeddings over time. The rest of the paper is organized as follows:
Citations: 0
Distributional Semantics: Yesterday, Today, and Tomorrow
Pub Date : 1900-01-01 DOI: 10.4000/books.aaccademia.9030
Alessandro Lenci
Distributional semantics is undoubtedly the mainstream approach to meaning representation in computational linguistics today. It has also become an important paradigm of semantic analysis in cognitive science, and even linguists have started looking at it with growing interest. The popularity of distributional semantics has boomed in the era of Deep Learning, when “word embeddings” have become the basic ingredient to “cook” any NLP task. The era of BERT & co. has brought new types of contextualized representations that have often generated hasty claims of incredible breakthroughs in the natural language understanding capability of deep learning models. Unfortunately, these claims are not always supported by the improved semantic abilities of the latest generation of embeddings. Models like BERT are still rooted in the principles of distributional learning, but at the same time their goal is more ambitious than generating corpus-based representations of meaning. On the one hand, the embeddings they produce encode much more than lexical meaning, but on the other hand we are still largely uncertain about what semantic properties of natural language they actually capture. Distributional semantics has surely benefited from the successes of deep learning, but this might even jeopardize the very essence of distributional models of meaning, by making their goals and foundations unclear.
Citations: 0
Journal
Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020