Analysis of Lexical Semantic Changes in Corpora with the Diachronic Engine
Pierluigi Cassotti, Pierpaolo Basile, M. Degemmis, G. Semeraro
Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020
DOI: 10.4000/books.aaccademia.8343 (https://doi.org/10.4000/books.aaccademia.8343)
English. With the growing availability of digitized diachronic corpora, the need for tools that take the diachronic component of corpora into account becomes ever more pressing. Recent work on diachronic embeddings shows that computational approaches to the diachronic analysis of language are promising, but they are not user-friendly for people without a technical background. This paper presents the Diachronic Engine, a system for the diachronic analysis of the lexical features of corpora. Diachronic Engine computes word frequencies, concordances and collocations while taking the temporal dimension into account. It can also compute temporal word embeddings and time series that can be exploited for lexical semantic change detection.

1 Motivation and Background

Synchronic corpora are widely used in linguistics to derive, through statistical approaches, a set of abstract rules that govern the language under analysis. The same methodology can be applied to diachronic corpora to analyze how word meanings evolve over time. However, this process can be very time-consuming. Linguists usually rely on software tools that make it easy to explore and clean the corpus while highlighting the most relevant linguistic features.

Sketch Engine¹ (Kilgarriff et al., 2004; Kilgarriff et al., 2014) is the leading tool in the corpus analysis field. Among several interesting features, Sketch Engine includes trends (Kilgarriff et al., 2015), which allow for diachronic analysis based on the frequency distribution of words. Trends rely only on frequency features and ignore word usage information. Moreover, the Sketch Engine interface does not provide temporal information about concordances and collocations.

NoSketchEngine² is an open-source version of Sketch Engine. It requires technical expertise to set up and, unlike Sketch Engine, it does not support word sketches, terminology, thesaurus, n-grams, trends or corpus building.

Another interesting system is DiaCollo³ (Jurish and der Wissenschaften, 2015), a software tool for the discovery, comparison, and interactive visualization of target word combinations. Combinations can be requested for a particular time period, or for a direct comparison between different time periods. However, DiaCollo focuses exclusively on the extraction and visualization of collocations from diachronic corpora.

In recent work on computational diachronic linguistics, techniques based on word embeddings produce promising results. In SemEval-2020 Task 1 (Schlechtweg et al., 2020), for instance, type embeddings reach high performance on both subtasks. However, these techniques are not included in any of the aforementioned linguistic tools. To bridge this gap, we built a tool that includes approaches for the analysis of diachronic embeddings. The result of our work is Diachronic Engine (DE), an engine for the management of diachronic corpora that provides tools for detecting lexical semantic change from a frequentist perspective. DE includes tools for extracting diachronic collocations and concordances in different time periods, as well as for computing semantic change time series by exploiting both word frequencies and word embedding similarity over time.

The rest of the paper is organized as follows:

¹ https://www.sketchengine.eu/
² https://nlp.fi.muni.cz/trac/noske
³ https://www.clarin.eu/showcase/

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
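As an illustration of the two kinds of time series mentioned in the introduction (word frequency and embedding similarity over time), the following is a minimal sketch and not the Diachronic Engine implementation. The function names, the toy corpus and the toy vectors are hypothetical, and the sketch assumes that the word vectors of different periods already live in a shared, aligned space.

```python
import math
from collections import Counter


def relative_frequency(tokens_by_period, word):
    """Relative frequency of `word` in each time period."""
    series = {}
    for period, tokens in tokens_by_period.items():
        counts = Counter(tokens)
        series[period] = counts[word] / len(tokens) if tokens else 0.0
    return series


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_series(vectors_by_period):
    """Cosine similarity of a word's vectors in consecutive periods.

    Assumes the vectors are comparable across periods (aligned spaces).
    A drop in similarity hints at a possible change in usage.
    """
    periods = sorted(vectors_by_period)
    return {
        (prev, curr): cosine(vectors_by_period[prev], vectors_by_period[curr])
        for prev, curr in zip(periods, periods[1:])
    }


# Toy data, invented purely for illustration.
corpus = {
    1900: "the mouse ran under the table".split(),
    2000: "click the mouse and move the mouse pointer".split(),
}
vectors = {1900: [1.0, 0.0], 1950: [1.0, 0.0], 2000: [0.0, 1.0]}

freq = relative_frequency(corpus, "mouse")
sims = similarity_series(vectors)
print(freq)
print(sims)
```

In this toy setup the similarity between 1950 and 2000 drops while the frequency rises, which is the kind of combined signal a linguist might then inspect through concordances and collocations for the same periods.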