A Diachronic Italian Corpus based on "L'Unità"

Pierpaolo Basile, A. Caputo, Tommaso Caselli, Pierluigi Cassotti, Rossella Varvara
Published in: Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020
DOI: https://doi.org/10.4000/books.aaccademia.8245
Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Citations: 5

Abstract

In this paper, we describe the creation of a diachronic corpus for Italian by exploiting the digital archive of the newspaper “L’Unità”. We automatically clean and annotate the corpus with PoS tags, lemmas, named entities and syntactic dependencies. Moreover, we compute frequency-based time series for tokens, lemmas and entities. We show some interesting corpus statistics taking into account the temporal dimension and describe some examples of usage of time series.

1 Motivation and Background

Diachronic linguistics is one of the two major temporal dimensions of language study proposed by de Saussure in his Cours de linguistique générale and has a long tradition in Linguistics. Recently, the increasing availability of diachronic corpora, together with the development of new NLP techniques for representing word meanings, has boosted the application of computational models to historical language data (Hamilton et al., 2016; Tahmasebi et al., 2018; Tang, 2018). This culminated in the SemEval-2020 task on Unsupervised Lexical Semantic Change Detection (Schlechtweg et al., 2020), the first attempt to systematically evaluate automatic methods for language change detection.

Italian is a Romance language that has undergone many changes in its history. Its official adoption as a national language occurred only after the Unification of Italy (1861); before that, it was a literary language. Diachronic corpora of Italian are currently available and accessible to the public (e.g., DiaCORIS and MIDIA). Unfortunately, restrictions on access to and distribution of these resources limit their use. This prevents the application of more recent NLP methods to the diachronic dimension. To overcome this limit, we collect and make freely available (https://github.com/swapUniba/unita/) a new corpus based on the newspaper “L’Unità”.

Founded by Antonio Gramsci on February 12, 1924, “L’Unità” was the official newspaper of the Italian Communist Party (Partito Comunista Italiano, henceforth PCI). The newspaper had a troubled history: after the dissolution of the PCI in 1991, it continued as the official newspaper of the new Democratic Party of the Left (PDS/DS) until July 31, 2014. After that date, it ceased publication until June 30, 2015, and it closed definitively on June 3, 2017. Since 2017, the historical archive of “L’Unità” has again been visible and available on the Web (https://archivio.unita.news/). One of the main issues with this resource is the lack of information about who owns the rights to the original archive. To our knowledge, the online version of the archive was legally obtained by downloading the original archive before the closure of the newspaper. The current online archive does not contain the local editions of the newspaper or the photographic archive.

The main contribution of this work lies in the resource itself and its accessibility to the research community at large. The corpus is distributed in two formats: raw text and pre-processed. The validity of the corpus for the automatic study of language change is currently being tested as part of the DIACR-Ita task (https://diacr-ita.github.io/DIACR-Ita/) at EVALITA 2020. In addition, we illustrate some further potential applications of the corpus.

2 Italian diachronic corpora

Various Italian diachronic corpora are currently available and accessible to the public. DiaCORIS (http://corpora.dslo.unibo.it/DiaCORIS/; Onelli et al., 2006) comprises written Italian texts produced between 1861 and 1945, for a total of 100 million words, while MIDIA (www.corpusmidia.unito.it; Gaeta et al., 2013) covers written documents in Italian from the beginning of the XIII century to the first half of the XX century, for a total of 7.5 million words across 800 texts belonging to different genres. The Corpus OVI dell’Italiano antico (http://gattoweb.ovi.cnr.it) consists of 1,948 texts from the XII to the XIV centuries, for a total of 536,000 words. The LIZ database (https://www.zanichelli.it/ricerca/prodotti/liz-4-0-letteratura-italiana-zanichelli) comprises 1,000 literary texts from the XIII to the XX century. Lastly, the corpus of Alcide De Gasperi’s public documents (Tonelli et al., 2019) includes 1,762 documents (newspaper articles, propaganda documents, official letters and parliamentary speeches, for a total of 3,000,000 tokens) written by the Italian politician Alcide De Gasperi and published between 1901 and 1954.

These existing resources differ from each other and from the present corpus in several ways. First, they differ in the time span their texts come from. The OVI corpus considers texts from the early stages of the Italian language, with a time span of three centuries. The MIDIA corpus and the LIZ database cover seven centuries, from the XIII to the first half of the XX century. DiaCORIS, the De Gasperi corpus and the L’Unità corpus contain texts from a shorter and more recent period. However, the time span covered by the L’Unità corpus is interesting for the study of the Italian language because of the deep changes that occurred in that period. Indeed, the second half of the XX century saw a wider spread and use of Italian among all social classes.

Second, these corpora differ in the genres they represent. The DiaCORIS and MIDIA corpora were designed as representative and balanced samples of written Italian (covering, among other genres, academic prose, fiction, press and legal texts). The OVI corpus and the LIZ database contain only literary texts. The De Gasperi corpus is representative of political text from a single author. The L’Unità corpus is representative only of press language, but this restriction may be an advantage in the study of diachronic lexical change. Indeed, observed semantic changes cannot be attributed to attestations from different genres in different periods, and can therefore be interpreted as true semantic shifts.

Lastly, even though most of the corpora can be queried online (with the exception of the LIZ database), only the De Gasperi corpus can be freely downloaded. This restriction limits the usability of these resources for the NLP community. With the L’Unità corpus we aim to release a new diachronic resource that is freely available and that can be used in the theoretical and computational study of language change.
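The abstract mentions frequency-based time series for tokens, lemmas and entities. The text above does not say how these series are computed, so the following is only a minimal sketch under stated assumptions: documents are taken to be (year, list of tokens) pairs, frequencies are normalized by the total token count of each year, and the function name `frequency_time_series` and the toy data are hypothetical, not part of the released corpus or its tooling.

```python
from collections import Counter, defaultdict


def frequency_time_series(docs, normalize=True):
    """Build per-term time series from (year, tokens) pairs.

    Hypothetical helper: returns {term: {year: value}}, where value is the
    relative frequency of the term in that year (or the raw count when
    normalize is False).
    """
    counts_by_year = defaultdict(Counter)
    for year, tokens in docs:
        counts_by_year[year].update(tokens)

    series = defaultdict(dict)
    for year in sorted(counts_by_year):
        total = sum(counts_by_year[year].values())
        for term, n in counts_by_year[year].items():
            series[term][year] = n / total if normalize else n
    return dict(series)


# Toy input invented for illustration; not taken from the corpus.
toy_docs = [
    (1948, ["partito", "lavoro", "partito"]),
    (1948, ["lavoro"]),
    (1989, ["partito", "europa", "europa", "europa"]),
]

series = frequency_time_series(toy_docs)
raw_series = frequency_time_series(toy_docs, normalize=False)
```

The same function would apply to lemmas or named entities by passing those instead of surface tokens. Normalizing per year is one possible design choice; it keeps years with very different amounts of text comparable.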