NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese
Sidney Evaldo Leal, Magali Sanches Duran, Carolina Evaristo Scarton, Nathan Siegle Hartmann, Sandra Maria Aluísio
Pub Date: 2023-10-17. DOI: 10.1007/s10579-023-09693-w
The objective of this paper is to present and make publicly available NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, and cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). The metrics are relevant for descriptive analyses and for building computational models, and can be used to extract information from various linguistic levels of written and spoken language. The metrics were developed over the last 13 years, starting at the end of 2007, within the scope of the PorSimples project. After PorSimples ended, new metrics were added to the initial 48 metrics of the Coh-Metrix-Port tool. Coh-Metrix-Port adapted to BP some metrics of the Coh-Metrix tool, which computes metrics related to the cohesion and coherence of texts in English. Given the large number of metrics, we present them following an organisation similar to that of Coh-Metrix v3.0, to facilitate comparisons between metrics in Portuguese and English in future studies using both tools. We illustrate the potential of NILC-Metrix through three applications: (i) a descriptive analysis of the differences between children's film subtitles and texts written for Elementary School I (1st to 5th grades) and Elementary School II (6th to 9th grades, an age group corresponding to the transition between childhood and adolescence); (ii) a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; and (iii) a complexity prediction model for school grades, using transcripts of children's stories narrated by teenagers. For each application, we evaluate which groups of metrics are most discriminative, showing their contribution to each task.
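To make the notion of "metrics extracted from various linguistic levels" concrete, here is a minimal sketch of two shallow descriptive metrics of the kind such systems compute (average sentence length and type-token ratio). This is an illustration only, not NILC-Metrix's actual implementation, and the tokenisation is deliberately naive.

```python
def words_per_sentence(sentences):
    """Average sentence length in words (naive whitespace tokenisation)."""
    counts = [len(s.split()) for s in sentences]
    return sum(counts) / len(counts)

def type_token_ratio(sentences):
    """Lexical diversity: distinct word forms over total tokens."""
    tokens = [w.lower() for s in sentences for w in s.split()]
    return len(set(tokens)) / len(tokens)

# Toy pre-tokenised BP sentences for illustration.
text = ["O menino leu o livro .", "Ele gostou muito da historia ."]
print(words_per_sentence(text))            # 6.0
print(round(type_token_ratio(text), 2))    # 0.83
```

Real readability metrics of this family are computed per text and then fed to descriptive analyses or predictive models, as in the three applications above.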
A semi-supervised method to generate a Persian dataset for suggestion classification
Leila Safari, Zanyar Mohammady
Pub Date: 2023-09-29. DOI: 10.1007/s10579-023-09688-7
Suggestion mining has become a popular subject in natural language processing (NLP), with applications in areas such as service and product improvement. The purpose of this study is to provide an automated machine learning (ML) based approach to extracting suggestions from Persian text. First, a novel two-step semi-supervised method is proposed to generate a Persian dataset called ParsSugg, which is then used for the automatic classification of users' suggestions. The first step is manual labeling of data based on a proposed guideline, followed by a data augmentation phase. In the second step, more data were labeled using the pre-trained Persian Bidirectional Encoder Representations from Transformers model (ParsBERT) as a classifier, trained on the data from the previous step. The performance of various ML models, including Support Vector Machine (SVM), Random Forest (RF), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and ParsBERT, was examined on the generated dataset. F-scores of 97.27 for ParsBERT and about 94.5 for the SVM and CNN classifiers were obtained for the suggestion class, a promising result for the first research on suggestion classification in Persian texts. The proposed guideline can also be used for other NLP tasks, and the generated dataset can be used in other suggestion classification tasks.
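The second step above is a pseudo-labeling loop: a seed classifier labels unlabeled texts and only confident predictions join the dataset. The sketch below shows that pattern with a toy keyword scorer standing in for ParsBERT; the cue words, threshold, and English examples are all illustrative assumptions, not the paper's setup.

```python
def toy_score(text, suggestion_cues=("should", "please", "add")):
    """Stand-in confidence that `text` is a suggestion, in [0, 1]."""
    hits = sum(cue in text.lower() for cue in suggestion_cues)
    return min(1.0, hits / 2)

def pseudo_label(unlabeled, threshold=0.5):
    """Keep only examples the seed model labels with enough confidence."""
    labeled = []
    for text in unlabeled:
        score = toy_score(text)
        if score >= threshold:
            labeled.append((text, "suggestion"))
        elif score == 0.0:
            labeled.append((text, "non-suggestion"))
        # texts with low but nonzero confidence stay unlabeled
    return labeled

batch = ["Please add dark mode", "The app crashed", "It should maybe work"]
print(pseudo_label(batch))
```

In the real setting the seed model is itself retrained on the growing labeled set, so labeling quality compounds across iterations.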
NEREL: a Russian information extraction dataset with rich annotation for nested entities, relations, and wikidata entity links
Natalia Loukachevitch, Ekaterina Artemova, Tatiana Batura, Pavel Braslavski, Vladimir Ivanov, Suresh Manandhar, Alexander Pugachev, Igor Rozhkov, Artem Shelmanov, Elena Tutubalina, Alexey Yandutov
Pub Date: 2023-09-21. DOI: 10.1007/s10579-023-09674-z
This paper describes NEREL, a Russian news dataset suited for three tasks: nested named entity recognition, relation extraction, and entity linking. Compared to flat entities, nested named entities provide richer and more complete annotation while also increasing the coverage of relation annotation and entity linking. Relations between nested named entities may cross entity boundaries to connect shorter entities nested within longer ones, which makes such relations harder to detect. NEREL is currently the largest Russian dataset annotated with entities and relations: it comprises 29 named entity types and 49 relation types. At the time of writing, the dataset contains 56,000 named entities and 39,000 relations annotated in 933 person-oriented news articles. NEREL is annotated with relations at three levels: (1) within nested named entities, (2) within sentences, and (3) across sentence boundaries. We provide a benchmark evaluation of current state-of-the-art methods on all three tasks. The dataset is freely available at https://github.com/nerel-ds/NEREL.
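Nested annotation means one character span can sit entirely inside another (a city inside an organisation name, for example). The sketch below shows one way such spans might be represented and how containment can be checked; the span format is illustrative, not NEREL's actual schema.

```python
def nested_entities(entities):
    """Return (outer, inner) text pairs where inner lies inside outer."""
    pairs = []
    for outer in entities:
        for inner in entities:
            if inner is outer:
                continue
            if outer["start"] <= inner["start"] and inner["end"] <= outer["end"]:
                pairs.append((outer["text"], inner["text"]))
    return pairs

# Character-offset spans over a toy sentence fragment.
text = "Moscow State University"
ents = [
    {"start": 0, "end": 23, "type": "ORGANIZATION", "text": "Moscow State University"},
    {"start": 0, "end": 6, "type": "CITY", "text": "Moscow"},
]
print(nested_entities(ents))  # [('Moscow State University', 'Moscow')]
```

A relation anchored on the inner span (say, the city's country) then crosses the outer entity's boundary, which is exactly what makes such relations harder to detect.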
A survey and study impact of tweet sentiment analysis via transfer learning in low resource scenarios
Manoel Veríssimo dos Santos Neto, Nádia Félix F. da Silva, Anderson da Silva Soares
Pub Date: 2023-09-14. DOI: 10.1007/s10579-023-09687-8
Sentiment analysis (SA) is a study area focused on obtaining contextual polarity from text. Deep learning currently obtains outstanding results in this task, but training these algorithms requires a great deal of annotated data, which is expensive and difficult to obtain. In low-resource scenarios this problem is even more significant, because little data is available. Transfer learning (TL) can mitigate this problem, since it makes it possible to train some architectures with less data. Language models are one way of applying TL in natural language processing (NLP), and they have achieved competitive results. Nevertheless, some models require many hours of training on substantial computational resources, which not all people and organizations can afford. In this paper, we explore the models BERT (Bidirectional Encoder Representations from Transformers), MultiFiT (Efficient Multilingual Language Model Fine-tuning), ALBERT (A Lite BERT for Self-supervised Learning of Language Representations), and RoBERTa (A Robustly Optimized BERT Pretraining Approach). In all of our experiments, these models obtain better results than CNN (convolutional neural network) and LSTM (Long Short-Term Memory) models. For the MultiFiT and RoBERTa models, we propose a pretrained language model (PTLM) built from Twitter data; using this approach, we obtained results competitive with models trained on formal-language datasets. The main goal is to show the impact of TL and language models, comparing their results with other techniques and reporting the computational costs of these approaches.
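The transfer-learning pattern discussed above amounts to reusing a fixed pretrained representation and training only a small classifier head on the few labeled examples available. The sketch below illustrates that split with a toy two-feature "encoder" standing in for a frozen BERT-style model and a perceptron-style head; every cue word and example here is an assumption for illustration, not the paper's method.

```python
def toy_features(text):
    """Stand-in for a frozen pretrained encoder: two cheap features."""
    toks = text.lower().split()
    pos = sum(t in {"good", "great", "love"} for t in toks)
    neg = sum(t in {"bad", "awful", "hate"} for t in toks)
    return (pos, neg)

def train_head(examples, epochs=20, lr=0.5):
    """Perceptron-style head over frozen features; labels are +1/-1."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for text, y in examples:
            x = toy_features(text)
            pred = 1 if w[0] * x[0] + w[1] * x[1] >= 0 else -1
            if pred != y:  # mistake-driven update, encoder stays fixed
                w[0] += lr * y * x[0]
                w[1] += lr * y * x[1]
    return w

def predict(w, text):
    x = toy_features(text)
    return 1 if w[0] * x[0] + w[1] * x[1] >= 0 else -1

train = [("I love this", 1), ("awful service", -1),
         ("great app", 1), ("bad update", -1)]
w = train_head(train)
print(predict(w, "love the new great feature"))  # 1
```

Because only the tiny head is trained, very few labeled tweets are needed; the heavy cost moves to pretraining the encoder, which is the trade-off the paper quantifies.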
An eye-tracking-with-EEG coregistration corpus of narrative sentences
S. Frank, Anna Aumeistere
Pub Date: 2023-08-29. DOI: 10.1007/s10579-023-09684-x
Data augmentation strategies to improve text classification: a use case in smart cities
Luciana Bencke, V. Moreira
Pub Date: 2023-08-23. DOI: 10.1007/s10579-023-09685-w
The development of a labelled te reo Māori–English bilingual database for language technology
Jesin James, Isabella Shields, Vithya Yogarajan, Peter J. Keegan, Catherine I. Watson, Peter-Lucas Jones, Keoni Mahelona
Pub Date: 2023-08-20. DOI: 10.1007/s10579-023-09680-1
Te reo Māori (referred to as Māori), New Zealand's indigenous language, is under-resourced in language technology. Māori speakers are bilingual, and Māori is code-switched with English; unfortunately, minimal resources are available for Māori language technology, including language detection and code-switch detection for the Māori–English pair. Both English and Māori use Roman-derived orthography, which makes rule-based systems for detecting language and code-switching restrictive, so most Māori language detection is done manually by language experts. This research builds a Māori–English bilingual database of 66,016,807 words with word-level language annotation, using the New Zealand Parliament Hansard debate reports. The language labels are assigned automatically using language-specific rules together with expert manual annotation: words with the same spelling but different meanings exist in both Māori and English, and these cannot be categorised by word-level language rules alone, making manual annotation necessary. An analysis of various aspects of the database, such as metadata, year-wise statistics, frequently occurring words, sentence length, and n-grams, is also reported. The database developed here is a valuable tool for future language and speech technology development for Aotearoa New Zealand, and the methodology followed to label it can also be applied to other low-resourced language pairs.
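A word-level orthographic rule of the kind described above can be sketched as follows: flag a word as possibly Māori when it uses only letters of the Māori alphabet (the vowels a, e, i, o, u with or without macrons, the consonants h, k, m, n, p, r, t, w, and the digraphs ng and wh) arranged in open syllables. This is a rough illustrative filter, not the paper's actual rule set; as the abstract notes, many spellings are valid in both languages, so expert checking remains necessary.

```python
import re

# One (consonant?)(vowel+) syllable, repeated; consonants may be the
# digraphs ng/wh or a single Maori consonant letter.
MAORI_WORD = re.compile(
    r"^(?:(?:ng|wh|[hkmnprtw])?[aeiouāēīōū]+)+$", re.IGNORECASE
)

def possibly_maori(word):
    """True if the word's spelling is consistent with Maori orthography."""
    return bool(MAORI_WORD.match(word))

for w in ["whānau", "kia", "ora", "street", "he"]:
    print(w, possibly_maori(w))
```

Note the ambiguity the rule cannot resolve: "he" passes the filter yet is also an English word, which is exactly the case that forced manual annotation in the database.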
Comparative performance of ensemble machine learning for Arabic cyberbullying and offensive language detection
M. Khairy, Tarek M. Mahmoud, Ahmed Omar, Tarek Abd El-Hafeez
Pub Date: 2023-08-13. DOI: 10.1007/s10579-023-09683-y
RUN-AS: a novel approach to annotate news reliability for disinformation detection
Alba Bonet-Jover, Robiert Sepúlveda-Torres, E. Saquete, P. Martínez-Barco, Mario Nieto-Pérez
Pub Date: 2023-08-06. DOI: 10.1007/s10579-023-09678-9
The limitations of irony detection in Dutch social media
Aaron Maladry, Els Lefever, Cynthia Van Hee, Veronique Hoste
Pub Date: 2023-07-23. DOI: 10.1007/s10579-023-09656-1