Pub Date: 2026-01-01 | Epub Date: 2026-03-14 | DOI: 10.1007/s10579-025-09876-7 | Language Resources and Evaluation 60(2): 33
Towards a resource for multilingual lexicons: an MT assisted and human-in-the-loop multilingual parallel corpus with multi-word expression annotation
Lifeng Han, Najet Hadj Mohamed, Malak Rassem, Gareth J F Jones, Alan F Smeaton, Goran Nenadic
In this work, we introduce the construction of a machine translation (MT) assisted and human-in-the-loop multilingual parallel corpus with annotations of multi-word expressions (MWEs), named AlphaMWE. The MWEs include verbal MWEs (vMWEs) as defined in the PARSEME shared task, which have a verb as the head of the studied terms. The annotated vMWEs are also manually aligned bilingually and multilingually. The languages covered include Arabic, Chinese, English, German, Italian, and Polish; the Arabic corpus includes both the standard variety and dialectal variants from Egypt and Tunisia. Our original English corpus is taken from the PARSEME shared task of 2018. We performed machine translation of this source corpus, followed by human post-editing and annotation of target MWEs. Strict quality control was applied to limit errors: each MT output sentence first received manual post-editing and annotation, followed by a second manual quality check. One of our findings during corpus preparation is that accurate translation of MWEs presents challenges to MT systems, as reflected by the outcomes of the human-in-the-loop metric HOPE. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT systems for comparison, namely Microsoft Bing Translator, GoogleMT, Baidu Fanyi, and DeepL MT. Thanks to the noise removal, translation post-editing, and MWE annotation by human professionals, we believe the AlphaMWE dataset will be an asset for both monolingual and cross-lingual research, such as multi-word term lexicography, MT, and information extraction.
{"title":"Towards a resource for multilingual lexicons: an MT assisted and human-in-the-loop multilingual parallel corpus with multi-word expression annotation.","authors":"Lifeng Han, Najet Hadj Mohamed, Malak Rassem, Gareth J F Jones, Alan F Smeaton, Goran Nenadic","doi":"10.1007/s10579-025-09876-7","DOIUrl":"https://doi.org/10.1007/s10579-025-09876-7","url":null,"abstract":"<p><p>In this work, we introduce the construction of a machine translation (MT) assisted and human-in-the-loop multilingual parallel corpus with annotations of multi-word expressions (MWEs), named AlphaMWE. The MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include Arabic, Chinese, English, German, Italian, and Polish, of which, the Arabic corpus includes both standard and dialectal variations from Egypt and Tunisia. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post-editing and annotation of target MWEs. Strict quality control was applied for error limitation, i.e., each MT output sentence received first manual post-editing and annotation plus a second manual quality rechecking. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems, as reflected by the outcomes of human-in-the-loop metric HOPE. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT systems for comparison, namely Microsoft Bing Translator, GoogleMT, Baidu Fanyi, and DeepL MT. Because of the noise removal, translation post-editing, and MWE annotation by human professionals, we believe the AlphaMWE data set will be an asset for both monolingual and cross-lingual research, such as multi-word term lexicography, MT, and information extraction.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"60 2","pages":"33"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12989027/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147469850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01 | Epub Date: 2026-02-28 | DOI: 10.1007/s10579-025-09887-4 | Language Resources and Evaluation 60(2): 27
OjibweMorph: an approachable finite-state transducer for Ojibwe (and beyond)
Christopher Hammerly, Nora Livesay, Antti Arppe, Anna Stacey, Miikka Silfverberg
This paper describes the design, evaluation, and application of OjibweMorph, a finite-state transducer (FST) for generating and analyzing words in the Central Algonquian language Ojibwe. We created a language-general modular system for building FSTs from human- and machine-readable spreadsheets, in which sets of inflectional and derivational morphology can be defined, combined with a lexical database, and automatically compiled into an FST. We show how this system is applied to generate and analyze the complex nominal and verbal morphology of Ojibwe, with an eye towards how our framework and toolkit can be used to create FSTs for other morphologically complex languages. We evaluate the Ojibwe version of the system by checking the model's performance against a set of inflectional forms and example sentences from the Ojibwe People's Dictionary, and describe the application of the FST to create a linguistically analyzed corpus, an automatic verb conjugation tool for education, a spell-checker, and intelligent dictionary search.
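The spreadsheet-to-FST idea can be pictured with a small sketch. The Python below is a minimal, dependency-free illustration of the core pipeline (morphological rows compiled into generation and analysis mappings); the CSV columns, tags, and example forms are invented for illustration, and a real pipeline like the one described would compile such rows into an actual FST with a toolkit such as foma or HFST rather than into dictionaries.

```python
import csv, io

# Hypothetical spreadsheet rows pairing an analysis template with a surface
# template. Column names and tag strings are illustrative only.
SPREADSHEET = """lemma,tags,prefix,suffix
waabam,VTA+Ind+1SgSubj+2SgObj,gi,in
waabam,VTA+Ind+2SgSubj+1SgObj,gi,i
"""

def compile_rules(csv_text):
    """Compile spreadsheet rows into generation/analysis tables.
    A real system would emit FST transitions here instead of dict entries."""
    gen, ana = {}, {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        analysis = f"{row['lemma']}+{row['tags']}"
        surface = f"{row['prefix']}{row['lemma']}{row['suffix']}"
        gen[analysis] = surface                         # generation: analysis -> surface
        ana.setdefault(surface, []).append(analysis)    # analysis: surface -> analyses
    return gen, ana

gen, ana = compile_rules(SPREADSHEET)
print(gen["waabam+VTA+Ind+1SgSubj+2SgObj"])  # -> giwaabamin
print(ana["giwaabamin"])                     # -> ['waabam+VTA+Ind+1SgSubj+2SgObj']
```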
{"title":"OjibweMorph: an approachable finite-state transducer for Ojibwe (and beyond).","authors":"Christopher Hammerly, Nora Livesay, Antti Arppe, Anna Stacey, Miikka Silfverberg","doi":"10.1007/s10579-025-09887-4","DOIUrl":"https://doi.org/10.1007/s10579-025-09887-4","url":null,"abstract":"<p><p>This paper describes the design, evaluation, and application of <i>OjibweMorph</i>, a finite-state transducer (FST) for generating and analyzing words in the Central Algonquian language Ojibwe. We created a language-general modular system for creating FSTs from human- and machine-readable spreadsheets, where sets of inflectional and derivational morphology can be defined, combined with a lexical database, and automatically compiled into an FST. We show how this system is applied to generate and analyze the complex nominal and verbal morphology in Ojibwe, with an eye towards how our framework and toolkit can be used to create FSTs for other morphologically complex languages. We evaluate the Ojibwe version of the system by checking the model's performance against a set of inflectional forms and example sentences from the Ojibwe People's Dictionary, and describe the application of the FST to create a linguistically analyzed corpus, an automatic verb conjugation tool for education, a spell-checker, and intelligent dictionary search.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"60 2","pages":"27"},"PeriodicalIF":1.8,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12950098/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147345764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | Epub Date: 2025-02-19 | DOI: 10.1007/s10579-025-09813-8 | Language Resources and Evaluation 59(3): 2415-2426
The narratives of war (NoW) corpus of written testimonies of the Russia-Ukraine war
Serhii Zasiekin, Larysa Zasiekina, Emilie Altman, Mariia Hryntus, Victor Kuperman
Documentation and analysis of psychological states experienced by witnesses and survivors of catastrophic events is a critical concern of psychological research. This paper introduces a new corpus of written testimonies collected from nearly 1500 Ukrainian civilians between May 2022 and January 2024, during Russia's invasion of Ukraine. The texts are available in the original Ukrainian and in English translation. The Narratives of War (NoW) corpus additionally contains demographic and geographic data on respondents, as well as their scores in tests of PTSD symptoms and moral injury. The paper provides a detailed introduction to the method of data collection and the corpus structure. It also reports a quantitative frequency-based "keyness" analysis that identifies words particularly representative of the NoW corpus, compared to a reference corpus of Ukrainian texts that predates the war with Russia. These keywords shed light on the psychological state of witnesses of war. With its materials collected during the ongoing war, the corpus contributes to the body of knowledge for studies of the psychological impact of war and trauma on civilian populations.
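Frequency-based keyness of this kind is typically computed with a log-likelihood ratio comparing a word's frequency in the target and reference corpora. The sketch below illustrates the standard Dunning G2 computation, not the NoW authors' own code; the two tiny word lists are toy stand-ins for the corpora.

```python
import math
from collections import Counter

def g2_keyness(target_counts, ref_counts):
    """Dunning's log-likelihood (G2) keyness: higher scores mark words
    over-represented in the target corpus relative to the reference."""
    n_t, n_r = sum(target_counts.values()), sum(ref_counts.values())
    scores = {}
    for word, a in target_counts.items():
        b = ref_counts.get(word, 0)
        # expected frequencies under the null (word equally likely in both corpora)
        e_t = n_t * (a + b) / (n_t + n_r)
        e_r = n_r * (a + b) / (n_t + n_r)
        scores[word] = 2 * (a * math.log(a / e_t)
                            + (b * math.log(b / e_r) if b else 0))
    return scores

target = Counter("shelling siren basement fear fear siren".split())
reference = Counter("weather garden siren holiday fear".split())
for w, s in sorted(g2_keyness(target, reference).items(), key=lambda kv: -kv[1]):
    print(f"{w}\t{s:.2f}")
```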
{"title":"The narratives of war (NoW) corpus of written testimonies of the Russia-Ukraine war.","authors":"Serhii Zasiekin, Larysa Zasiekina, Emilie Altman, Mariia Hryntus, Victor Kuperman","doi":"10.1007/s10579-025-09813-8","DOIUrl":"10.1007/s10579-025-09813-8","url":null,"abstract":"<p><p>Documentation and analysis of psychological states experienced by witnesses and survivors of catastrophic events is a critical concern of psychological research. This paper introduces the new corpus of written testimonies collected from nearly 1500 Ukrainian civilians from May 2022-January 2024, during Russia's invasion of Ukraine. The texts are available in the original Ukrainian and the English translation. The Narratives of War (NoW) corpus additionally contains demographic and geographic data on respondents, as well as their scores in tests of PTSD symptoms and moral injury. The paper provides a detailed introduction into the method of data collection and corpus structure. It also reports a quantitative frequency-based \"keyness\" analysis that identifies words particularly representative of the NoW corpus, as compared to the reference corpus of Ukrainian texts that predates the war with Russia. These key words shed light on the psychological state of witnesses of war. With its materials collected during the ongoing war, the corpus contributes to the body of knowledge for studies of the psychological impact of war and trauma on civilian populations.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"59 3","pages":"2415-2426"},"PeriodicalIF":1.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12296800/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144734916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-01 | Epub Date: 2024-10-09 | DOI: 10.1007/s10579-024-09776-2 | Language Resources and Evaluation 59(2): 1705-1718
VeLeSpa: An inflected verbal lexicon of Peninsular Spanish and a quantitative analysis of paradigmatic predictability
Borja Herce
This paper presents VeLeSpa, a verbal lexicon of Peninsular Spanish, which contains the full paradigms (all 63 cells) of 6553 verbs in phonological form, along with their corresponding frequencies. The process and the decisions involved in building the resource are presented. In addition, based on the 3000+ most frequent verbs, a quantitative analysis of morphological predictability in Spanish verbal inflection is conducted. The results and their drivers are discussed, as well as observed differences from other Romance languages and from Latin.
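Paradigmatic predictability of this kind is commonly quantified as conditional entropy: how uncertain the form of one paradigm cell is once the form of another cell is known. The sketch below illustrates that standard computation on toy data; it is not the paper's code, and the example endings and verb classes are invented.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(B | A) over (exponent_a, exponent_b) pairs, one per lexeme:
    0 bits means cell B is fully predictable from cell A."""
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n                  # joint probability of the (a, b) pattern
        p_b_given_a = c / marg_a[a]   # how predictive cell A's exponent is
        h -= p_ab * math.log2(p_b_given_a)
    return h

# Toy paradigms: (1sg.prs ending, 3sg.prs ending) for a few invented verbs
pairs = [("o", "a"), ("o", "a"), ("o", "e"), ("go", "e"), ("go", "e")]
print(f"H(3SG | 1SG) = {conditional_entropy(pairs):.3f} bits")  # ~0.551
```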
{"title":"VeLeSpa: An inflected verbal lexicon of Peninsular Spanish and a quantitative analysis of paradigmatic predictability.","authors":"Borja Herce","doi":"10.1007/s10579-024-09776-2","DOIUrl":"10.1007/s10579-024-09776-2","url":null,"abstract":"<p><p>This paper presents VeLeSpa, a verbal lexicon of Peninsular Spanish, which contains the full paradigms (all 63 cells) in phonological form of 6553 verbs, along with their corresponding frequencies. In this paper, the process and decisions involved in the building of the resource are presented. In addition, based on the most frequent 3000 + verbs, a quantitative analysis is conducted of morphological predictability in Spanish verbal inflection. The results and their drivers are discussed, as well as observed differences with other Romance languages and Latin.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"59 2","pages":"1705-1718"},"PeriodicalIF":1.7,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12086111/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144112555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-11 | DOI: 10.1007/s10579-024-09764-6
Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect
Mouad Jbel, Mourad Jabrane, Imad Hafidi, Abdulmutallib Metrane
Sentiment analysis, the automated process of determining emotions or opinions expressed in text, has seen extensive exploration in the field of natural language processing. However, one aspect that has remained underrepresented is sentiment analysis of the Moroccan dialect, which boasts a unique linguistic landscape and the coexistence of multiple scripts. Previous works in sentiment analysis primarily targeted dialects employing Arabic script. While these efforts provided valuable insights, they may not fully capture the complexity of Moroccan web content, which features a blend of Arabic and Latin script. As a result, our study emphasizes the importance of extending sentiment analysis to encompass the entire spectrum of Moroccan linguistic diversity. Central to our research is the creation of the largest public dataset for Moroccan dialect sentiment analysis, one that incorporates Moroccan dialect written not only in Arabic script but also in Latin characters. By assembling a diverse range of textual data, we constructed a dataset of 19,991 manually labeled texts in Moroccan dialect, together with publicly available lists of Moroccan-dialect stop words, a new contribution to Moroccan Arabic resources. In our exploration of sentiment analysis, we undertook a comprehensive study encompassing various machine-learning models to assess their compatibility with our dataset. While our investigation revealed that the highest accuracy, 98.42%, was attained with the DarijaBert-mix transfer-learning model, we also examined deep learning models: a CNN model reached a commendable accuracy of 92%. Furthermore, to affirm the reliability of our dataset, we tested the CNN model on smaller publicly available Moroccan-dialect datasets, with promising results that support our findings.
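The CNN route mentioned above follows the standard text-CNN recipe: embed tokens, run parallel 1-D convolutions of several widths, max-pool over time, and classify. The sketch below is a generic illustration of that recipe, not the authors' model; vocabulary size, filter widths, and class count are placeholder values.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal Kim-style text CNN for sentiment classification."""
    def __init__(self, vocab_size=30000, emb_dim=128, n_filters=100,
                 widths=(3, 4, 5), n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, w) for w in widths)
        self.fc = nn.Linear(n_filters * len(widths), n_classes)

    def forward(self, token_ids):                   # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)     # (batch, emb_dim, seq_len)
        # one max-pooled feature vector per filter width
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))    # (batch, n_classes)

model = TextCNN()
batch = torch.randint(1, 30000, (8, 40))  # 8 dummy sentences, 40 token ids each
print(model(batch).shape)                 # torch.Size([8, 2])
```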
{"title":"Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect","authors":"Mouad Jbel, Mourad Jabrane, Imad Hafidi, Abdulmutallib Metrane","doi":"10.1007/s10579-024-09764-6","DOIUrl":"https://doi.org/10.1007/s10579-024-09764-6","url":null,"abstract":"<p>Sentiment analysis, the automated process of determining emotions or opinions expressed in text, has seen extensive exploration in the field of natural language processing. However, one aspect that has remained underrepresented is the sentiment analysis of the Moroccan dialect, which boasts a unique linguistic landscape and the coexistence of multiple scripts. Previous works in sentiment analysis primarily targeted dialects employing Arabic script. While these efforts provided valuable insights, they may not fully capture the complexity of Moroccan web content, which features a blend of Arabic and Latin script. As a result, our study emphasizes the importance of extending sentiment analysis to encompass the entire spectrum of Moroccan linguistic diversity. Central to our research is the creation of the largest public dataset for Moroccan dialect sentiment analysis that incorporates not only Moroccan dialect written in Arabic script but also in Latin characters. By assembling a diverse range of textual data, we were able to construct a dataset with a range of 19,991 manually labeled texts in Moroccan dialect and also publicly available lists of stop words in Moroccan dialect as a new contribution to Moroccan Arabic resources. In our exploration of sentiment analysis, we undertook a comprehensive study encompassing various machine-learning models to assess their compatibility with our dataset. While our investigation revealed that the highest accuracy of 98.42% was attained through the utilization of the DarijaBert-mix transfer-learning model, we also delved into deep learning models. Notably, our experimentation yielded a commendable accuracy rate of 92% when employing a CNN model. Furthermore, in an effort to affirm the reliability of our dataset, we tested the CNN model using smaller publicly available datasets of Moroccan dialect, with results that proved to be promising and supportive of our findings.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"6 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-09 | DOI: 10.1007/s10579-024-09769-1
Studying word meaning evolution through incremental semantic shift detection
Francesco Periti, Sergio Picascia, Stefano Montanelli, Alfio Ferrara, Nina Tahmasebi
The study of semantic shift, that is, of how words change meaning as a consequence of social practices, events and political circumstances, is relevant in Natural Language Processing, Linguistics, and the Social Sciences. The increasing availability of large diachronic corpora and advances in computational semantics have accelerated the development of computational approaches to detecting such shifts. In this paper, we introduce a novel approach to tracing the evolution of word meaning over time. Our analysis focuses on gradual changes in word semantics and relies on an incremental approach to semantic shift detection (SSD) called What is Done is Done (WiDiD). WiDiD leverages scalable and evolutionary clustering of contextualised word embeddings to detect semantic shift and capture temporal transactions in word meanings. Existing approaches to SSD (a) significantly simplify the semantic shift problem to cover change between two (or a few) time points, and (b) treat the existing corpora as static. We instead treat SSD as an organic process in which word meanings evolve across tens or even hundreds of time periods as the corpus is progressively made available. This results in an extremely demanding task that entails a multitude of intricate decisions. We demonstrate the applicability of this incremental approach on a diachronic corpus of Italian parliamentary speeches spanning eighteen distinct time periods. We also evaluate its performance on seven popular labelled benchmarks for SSD across multiple languages. Empirical results show that our approach is comparable to state-of-the-art approaches, while outperforming the state of the art for certain languages.
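The incremental flavour of SSD can be pictured as clustering each new period's contextualised embeddings together with a compact memory of earlier periods (for example, cluster centroids), so that sense clusters evolve as the corpus grows. The sketch below illustrates that general idea with k-means as a stand-in clusterer; it is not the WiDiD algorithm itself, and the embedding inputs are random placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def incremental_ssd(periods, k=3, seed=0):
    """Cluster each period's usage embeddings together with the centroids
    carried over from earlier periods, tracking sense-cluster drift."""
    memory = np.empty((0, periods[0].shape[1]))   # centroids summarising the past
    history = []
    for t, usages in enumerate(periods):
        data = np.vstack([memory, usages])
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        memory = km.cluster_centers_              # compact summary of periods <= t
        # cluster sizes over the current period's usages only (appended last)
        sizes = np.bincount(km.labels_[-len(usages):], minlength=k)
        history.append(sizes / sizes.sum())
        print(f"period {t}: sense distribution {np.round(history[-1], 2)}")
    return history

rng = np.random.default_rng(0)
periods = [rng.normal(loc=m, size=(50, 8)) for m in (0.0, 0.2, 1.0)]
incremental_ssd(periods)
```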
{"title":"Studying word meaning evolution through incremental semantic shift detection","authors":"Francesco Periti, Sergio Picascia, Stefano Montanelli, Alfio Ferrara, Nina Tahmasebi","doi":"10.1007/s10579-024-09769-1","DOIUrl":"https://doi.org/10.1007/s10579-024-09769-1","url":null,"abstract":"<p>The study of <i>semantic shift</i>, that is, of how words change meaning as a consequence of social practices, events and political circumstances, is relevant in Natural Language Processing, Linguistics, and Social Sciences. The increasing availability of large diachronic corpora and advance in computational semantics have accelerated the development of computational approaches to detecting such shift. In this paper, we introduce a novel approach to tracing the evolution of word meaning over time. Our analysis focuses on gradual changes in word semantics and relies on an incremental approach to semantic shift detection (SSD) called <i>What is Done is Done</i> (WiDiD). WiDiD leverages scalable and evolutionary clustering of contextualised word embeddings to detect semantic shift and capture temporal <i>transactions</i> in word meanings. Existing approaches to SSD: (a) significantly simplify the semantic shift problem to cover change between two (or a few) time points, and (b) consider the existing corpora as static. We instead treat SSD as an organic process in which word meanings evolve across tens or even hundreds of time periods as the corpus is progressively made available. This results in an extremely demanding task that entails a multitude of intricate decisions. We demonstrate the applicability of this incremental approach on a diachronic corpus of Italian parliamentary speeches spanning eighteen distinct time periods. We also evaluate its performance on seven popular labelled benchmarks for SSD across multiple languages. Empirical results show that our results are comparable to state-of-the-art approaches, while outperforming the state-of-the-art for certain languages.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"26 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-28 | DOI: 10.1007/s10579-024-09763-7
PARSEME-AR: Arabic reference corpus for multiword expressions using PARSEME annotation guidelines
Najet Hadj Mohamed, Cherifa Ben Khelil, Agata Savary, Iskander Keskes, Jean Yves Antoine, Lamia Belguith Hadrich
In this paper we present PARSEME-AR, the first openly available Arabic corpus manually annotated for verbal multiword expressions (VMWEs). The annotation follows the guidelines put forward by PARSEME, a multilingual project covering more than 26 languages. The corpus contains 4749 VMWEs in about 7500 sentences taken from the Prague Arabic Dependency Treebank. The results notably show a high degree of discontinuity in Arabic VMWEs in comparison to other languages in the PARSEME suite. We also propose analyses of interesting and challenging phenomena encountered during the annotation process. Moreover, we offer the first benchmark for the VMWE identification task in Arabic, obtained by training two state-of-the-art systems on our Arabic data.
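Discontinuity of the kind reported here is straightforward to measure on PARSEME-style data: a VMWE is discontinuous if the token positions it covers are not consecutive. The sketch below illustrates the check using the per-token VMWE column conventions of the PARSEME .cupt format ('*' for no VMWE, '1:VID' opening VMWE 1 with its category, '1' continuing it); the toy sentence is invented and the parsing is simplified.

```python
from collections import defaultdict

def vmwe_spans(mwe_column):
    """Map each VMWE id to the 1-based token positions it covers,
    reading a list of .cupt-style PARSEME:MWE column values."""
    spans = defaultdict(list)
    for pos, cell in enumerate(mwe_column, start=1):
        if cell == "*":
            continue
        for part in cell.split(";"):  # a token may belong to several VMWEs
            spans[int(part.split(":")[0])].append(pos)
    return spans

def discontinuous(positions):
    # a gap exists if the span is wider than the number of member tokens
    return max(positions) - min(positions) + 1 > len(positions)

# Toy sentence: tokens 2 and 5 form VMWE 1 with a gap (discontinuous)
column = ["*", "1:VID", "*", "*", "1", "*"]
for vid, pos in vmwe_spans(column).items():
    print(vid, pos, "discontinuous" if discontinuous(pos) else "contiguous")
```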
{"title":"PARSEME-AR: Arabic reference corpus for multiword expressions using PARSEME annotation guidelines","authors":"Najet Hadj Mohamed, Cherifa Ben Khelil, Agata Savary, Iskander Keskes, Jean Yves Antoine, Lamia Belguith Hadrich","doi":"10.1007/s10579-024-09763-7","DOIUrl":"https://doi.org/10.1007/s10579-024-09763-7","url":null,"abstract":"<p>In this paper we present PARSEME-AR, the first openly available Arabic corpus manually annotated for Verbal Multiword Expressions (VMWEs). The annotation process is carried out based on guidelines put forward by PARSEME, a multilingual project for more than 26 languages. The corpus contains 4749 VMWEs in about 7500 sentences taken from the Prague Arabic Dependency Treebank. The results notably show a high degree of discontinuity in Arabic VMWEs in comparison to other languages in the PARSEME suite. We also propose analyses of interesting and challenging phenomena encountered during the annotation process. Moreover, we offer the first benchmark for the VMWE identification task in Arabic, by training two state-of-the-art systems, on our Arabic data.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"146 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-28 | DOI: 10.1007/s10579-024-09724-0
Normalized dataset for Sanskrit word segmentation and morphological parsing
Sriram Krishnan, Amba Kulkarni, Gérard Huet
Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models, despite relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for the low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts, but it also has limitations at several levels of sentence analysis, such as chunks, segments, stems, and morphological analyses. To overcome these limitations, we look at alternatives such as the Sanskrit Heritage Segmenter (SH) and the Saṃsādhanī tools, which provide information complementing the DCS data. This work focuses on enriching the DCS dataset by incorporating analyses from SH, thereby creating a dataset that is rich in lexical and morphological information. Furthermore, this work also discusses the impact of such datasets on the performance of existing segmenters, specifically the Sanskrit Heritage Segmenter.
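Enrichment of this kind amounts to aligning the two resources sentence by sentence and merging fields that one resource has and the other lacks. The sketch below is a schematic illustration of such a merge; the record layout and field names are invented for illustration and are not the actual DCS or SH schemas.

```python
# Schematic merge of two analyses of the same sentence, keyed by sentence id.
# Field names (segments, stems, compound_splits) are illustrative only.
dcs = {101: {"segments": ["rāmo", "gacchati"], "stems": ["rāma", "gam"]}}
sh  = {101: {"segments": ["rāmaḥ", "gacchati"], "compound_splits": [[], []]}}

def enrich(dcs_rec, sh_rec):
    """Keep the DCS lexical/morphological fields, add fields only SH
    provides, and record disagreements (e.g., sandhi-normalised segments)."""
    merged = dict(dcs_rec)
    for field, value in sh_rec.items():
        if field not in merged:
            merged[field] = value  # SH-only information
        elif merged[field] != value:
            merged.setdefault("conflicts", {})[field] = (merged[field], value)
    return merged

for sid in dcs.keys() & sh.keys():
    print(sid, enrich(dcs[sid], sh[sid]))
```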
{"title":"Normalized dataset for Sanskrit word segmentation and morphological parsing","authors":"Sriram Krishnan, Amba Kulkarni, Gérard Huet","doi":"10.1007/s10579-024-09724-0","DOIUrl":"https://doi.org/10.1007/s10579-024-09724-0","url":null,"abstract":"<p>Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models despite working with relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts but also comes with its limitations at different levels of a sentence like chunk, segment, stem and morphological analysis. To overcome these limitations, we look at alternatives such as Sanskrit Heritage Segmenter (SH) and <i>Saṃsādhanī</i> tools, that provide information complementing DCS’ data. This work focuses on enriching the DCS dataset by incorporating analyses from SH, thereby creating a dataset that is rich in lexical and morphological information. Furthermore, this work also discusses the impact of such datasets on the performances of existing segmenters, specifically the Sanskrit Heritage Segmenter.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"14 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-21 | DOI: 10.1007/s10579-024-09752-w
Conversion of the Spanish WordNet databases into a Prolog-readable format
Pascual Julián-Iranzo, Germán Rigau, Fernando Sáenz-Pérez, Pablo Velasco-Crespo
WordNet is a lexical database for English that is supplied in a variety of formats, including one compatible with the Prolog programming language. Given the success and usefulness of WordNet, wordnets have been developed for other languages, including Spanish. The Spanish WordNet, like others, does not provide a version compatible with Prolog. This work aims to fill that gap by translating the Multilingual Central Repository (MCR) version of the Spanish WordNet into a Prolog-compatible format. This translation yields a set of Spanish lexical databases that allow access to WordNet information using declarative techniques and the deductive capabilities of the Prolog language. It also facilitates the development of other programs to analyze the obtained information. Remarkably, we have adapted the technique of differential testing, used in software testing, to verify the correctness of this conversion. In addition, to ensure the consistency of the generated Prolog databases, as well as of the databases from which we started, a complete series of integrity-constraint tests has been carried out. In this way we have discovered some inconsistency problems in the MCR databases that are reflected in the generated Prolog databases; these have been reported to the owners of those databases.
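Differential testing here means extracting the same information through two independent paths and diffing the results: for instance, the (synset, word) pairs read directly from the MCR source files versus the pairs parsed back out of the generated Prolog facts. The sketch below illustrates that idea in Python; the fact layout (a WordNet-style s/6 predicate) and the file contents are simplified placeholders, not the paper's actual formats.

```python
import re

# Source-side view: pairs read from an MCR-style tab-separated file (toy data)
mcr_rows = "spa-30-00001740-n\tentidad\nspa-30-00001740-n\tente\n"
source_pairs = {tuple(line.split("\t")) for line in mcr_rows.splitlines()}

# Target-side view: the same pairs parsed back from generated Prolog facts
prolog_facts = """
s('spa-30-00001740-n', 1, 'entidad', n, 1, 0).
s('spa-30-00001740-n', 2, 'ente', n, 1, 0).
"""
fact = re.compile(r"s\('([^']+)',\s*\d+,\s*'([^']+)'")
target_pairs = {m.groups() for m in fact.finditer(prolog_facts)}

# Differential check: both extraction paths must agree exactly
missing = source_pairs - target_pairs    # lost during conversion
spurious = target_pairs - source_pairs   # introduced by conversion
print("missing:", missing, "spurious:", spurious)
assert not missing and not spurious, "conversion disagrees with the source"
```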
{"title":"Conversion of the Spanish WordNet databases into a Prolog-readable format","authors":"Pascual Julián-Iranzo, Germán Rigau, Fernando Sáenz-Pérez, Pablo Velasco-Crespo","doi":"10.1007/s10579-024-09752-w","DOIUrl":"https://doi.org/10.1007/s10579-024-09752-w","url":null,"abstract":"<p>WordNet is a lexical database for English that is supplied in a variety of formats, including one compatible with the <span>Prolog</span> programming language. Given the success and usefulness of WordNet, wordnets of other languages have been developed, including Spanish. The Spanish WordNet, like others, does not provide a version compatible with <span>Prolog</span>. This work aims to fill this gap by translating the Multilingual Central Repository (MCR) version of the Spanish WordNet into a <span>Prolog</span>-compatible format. Thanks to this translation, a set of Spanish lexical databases are obtained, which allows access to WordNet information using declarative techniques and the deductive capabilities of the <span>Prolog</span> language. Also, this work facilitates the development of other programs to analyze the obtained information. Remarkably, we have adapted the technique of differential testing, used in software testing, to verify the correctness of this conversion. In addition, to ensure the consistency of the generated <span>Prolog</span> databases, as well as the databases from which we started, a complete series of integrity constraint tests have been carried out. In this way we have discovered some inconsistency problems in the MCR databases that have a reflection in the generated <span>Prolog</span> databases and have been reported to the owners of those databases.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"5 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-18 | DOI: 10.1007/s10579-024-09750-y
Annotation and evaluation of a dialectal Arabic sentiment corpus against benchmark datasets using transformers
Ibtissam Touahri, Azzeddine Mazroui
Sentiment analysis is a task in natural language processing that aims to identify the overall polarity of reviews for subsequent analysis. This study uses the Arabic speech-act and sentiment analysis dataset, the Arabic sentiment tweets dataset, and the SemEval benchmark datasets, along with the Moroccan sentiment analysis corpus, which focuses on the Moroccan dialect. Furthermore, a modern standard and dialectal Arabic corpus has been created and annotated by language type: Modern Standard Arabic, Moroccan Arabic dialect, and mixed language. Additionally, annotation has been performed at the sentiment level, categorizing sentiments as positive, negative, or mixed. The sizes of the datasets range from 2000 to 21,000 reviews. The essential dialectal characteristics for enhancing a sentiment classification system are outlined. The proposed approach deploys several supervised models, including occurrence vectors, a recurrent neural network with long short-term memory (LSTM), and the pre-trained transformer model AraBERT (Arabic bidirectional encoder representations from transformers), complemented by generative adversarial networks (GANs). The novelty of the approach lies in manually constructing and annotating a dialectal sentiment corpus, carefully studying its main characteristics, and then feeding these into the classical supervised models. Moreover, GANs that widen the gap between the studied classes are used to enhance the AraBERT results. The classification test results are promising and enable comparison with other systems. The proposed system has been evaluated against the state-of-the-art Mazajak and CAMeL Tools systems, designed for most Arabic dialects, on the datasets mentioned above. A significant improvement of 30 points in F^NN has been observed. These results affirm the versatility of the proposed system, demonstrating its effectiveness across multi-dialectal and multi-domain datasets, both balanced and unbalanced.
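Fine-tuning a pre-trained transformer such as AraBERT for this kind of three-class sentiment task follows the standard sequence-classification recipe. The sketch below is a minimal illustration of one training step with the Hugging Face transformers library, not the authors' pipeline; the model id is assumed to be the public AraBERT checkpoint, and the two reviews and labels are toy examples.

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "aubmindlab/bert-base-arabertv02"  # assumed Hugging Face id for AraBERT

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)

texts = ["خدمة ممتازة", "تجربة سيئة"]          # toy reviews
labels = torch.tensor([0, 1])                  # 0=positive, 1=negative, 2=mixed
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
out = model(**batch, labels=labels)            # cross-entropy loss computed internally
out.loss.backward()
optimizer.step()
print(float(out.loss))
```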
{"title":"Annotation and evaluation of a dialectal Arabic sentiment corpus against benchmark datasets using transformers","authors":"Ibtissam Touahri, Azzeddine Mazroui","doi":"10.1007/s10579-024-09750-y","DOIUrl":"https://doi.org/10.1007/s10579-024-09750-y","url":null,"abstract":"<p>Sentiment analysis is a task in natural language processing aiming to identify the overall polarity of reviews for subsequent analysis. This study used the Arabic speech-act and sentiment analysis, Arabic sentiment tweets dataset, and SemEval benchmark datasets, along with the Moroccan sentiment analysis corpus, which focuses on the Moroccan dialect. Furthermore, the modern standard and dialectal Arabic corpus dataset has been created and annotated based on the three language types: modern standard Arabic, Moroccan Arabic Dialect, and Mixed Language. Additionally, the annotation has been performed at the sentiment level, categorizing sentiments as positive, negative, or mixed. The sizes of the datasets range from 2000 to 21,000 reviews. The essential dialectal characteristics to enhance a sentiment classification system have been outlined. The proposed approach has involved deploying several models employing the supervised approach, including occurrence vectors, Recurrent Neural Network-Long Short Term Memory, and the pre-trained transformer model Arabic bidirectional encoder representations from transformers (AraBERT), complemented by the integration of Generative Adversarial Networks (GANs). The uniqueness of the proposed approach lies in constructing and annotating manually a dialectal sentiment corpus and studying carefully its main characteristics, which are used then to feed the classical supervised model. Moreover, GANs that widen the gap between the studied classes have been used to enhance the obtained results with AraBERT. The classification test results have been promising, enabling a comparison with other systems. The proposed system has been evaluated against Mazajak and CAMelTools state-of-the-art systems, designed for most Arabic dialects, using the mentioned datasets. A significant improvement of 30 points in F<sup>NN</sup> has been observed. These results have affirmed the versatility of the proposed system, demonstrating its effectiveness across multi-dialectal, multi-domain datasets, as well as balanced and unbalanced ones.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"1 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142199538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}