Spoken Spanish PoS tagging: gold standard dataset
Johnatan E. Bonilla
Pub Date: 2024-07-02 | DOI: 10.1007/s10579-024-09751-x
The development of a benchmark for part-of-speech (PoS) tagging of spoken dialectal European Spanish is presented, which will serve as the foundation for a future treebank. The benchmark is constructed using transcriptions of the Corpus Oral y Sonoro del Español Rural (COSER; "Audible corpus of spoken rural Spanish") and follows the Universal Dependencies project guidelines. We describe the methodology used to create a gold standard, which serves both to evaluate different state-of-the-art PoS taggers (spaCy, Stanza NLP, and UDPipe) originally trained on written data and to fine-tune and evaluate a model for spoken Spanish. It is shown that the accuracy of these taggers drops from 0.98-0.99 to 0.94-0.95 when tested on spoken data. Of the three taggers, spaCy's trf (transformer) and Stanza NLP models performed best. Finally, the spaCy trf model is fine-tuned on our gold standard, resulting in an accuracy of 0.98 for coarse-grained tags (UPOS) and 0.97 for fine-grained tags (FEATS). Our benchmark will enable the development of more accurate PoS taggers for spoken Spanish and facilitate the construction of a treebank for European Spanish varieties.
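As a concrete illustration of the evaluation the abstract describes, the sketch below computes token-level UPOS accuracy between a gold-standard tag sequence and a tagger's predictions. The tag sequences are invented toy data, not COSER material, and the helper name is our own.

```python
def upos_accuracy(gold, pred):
    """Token-level accuracy between two aligned UPOS tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy aligned example: taggers trained on written text often mislabel
# spoken-language discourse markers, so one of six tags is wrong here.
gold = ["INTJ", "ADV", "PRON", "VERB", "ADP", "NOUN"]
pred = ["CCONJ", "ADV", "PRON", "VERB", "ADP", "NOUN"]
print(upos_accuracy(gold, pred))  # 5/6 ≈ 0.833
```

In practice the predicted sequence would come from running spaCy, Stanza, or UDPipe over the same tokenised utterances as the gold standard.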
The Najdi Arabic Corpus: a new corpus for an underrepresented Arabic dialect
Rukayah Alhedayani
Pub Date: 2024-07-02 | DOI: 10.1007/s10579-024-09749-5
This paper presents the Najdi Arabic Corpus (NAC), a new corpus for a dialect of Arabic spoken in the central region of Saudi Arabia, and the first publicly available corpus for this dialect. The audio clips gathered for compiling the NAC are of three types: (1) 15-minute recordings of interviews with people telling stories about their lives, (2) recordings of varying lengths taken from YouTube, and (3) very short recordings, between 2 and 7 minutes long, taken from other social media outlets such as WhatsApp and Snapchat. The corpus totals 275,134 part-of-speech-tagged tokens gathered from different regions of Najd.
Which words are important?: an empirical study of Assamese sentiment analysis
Ringki Das, Thoudam Doren Singh
Pub Date: 2024-06-19 | DOI: 10.1007/s10579-024-09756-6
Sentiment analysis is an important research domain in text analytics and natural language processing. Over the last few decades, it has become a fascinating and salient area for researchers seeking to understand human sentiment. According to the 2011 census, the Assamese language is spoken by 15 million people. Despite being a scheduled language of the Indian Constitution, it is still a resource-constrained language, and although it is an official language with its own script, little work on sentiment analysis has been reported for Assamese. In a linguistically diverse country like India, it is essential to provide systems that help people understand sentiment in their native languages; without state-of-the-art NLP systems for regional languages, India's multilingual society cannot fully leverage the benefits of AI. The Assamese language has become popular owing to its wide range of applications, and the number of Assamese users on social media and other platforms grows daily. Automatic sentiment analysis systems can be valuable to individuals, governments, political parties, and other organizations, and can help stop negativity from spreading regardless of language divides. This paper presents a study of textual sentiment analysis in the Assamese news domain using different lexical features with machine learning and deep learning techniques. In the experiments, baseline models are developed and compared against models with lexical features. The proposed model with AAV lexical features, based on an XGBoost classifier, achieves the highest accuracy of 86.76% with the TF-IDF approach. We observe that combining lexical features with a machine learning classifier significantly helps sentiment prediction in a small-dataset scenario compared with individual lexical features.
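A minimal sketch of the kind of pipeline evaluated above, assuming scikit-learn is available: TF-IDF features feeding a gradient-boosting classifier. scikit-learn's GradientBoostingClassifier stands in for XGBoost here, and the English toy sentences stand in for the Assamese news data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in data (1 = positive, 0 = negative); the paper's experiments
# use Assamese news sentences with AAV lexical features and XGBoost.
texts = ["great win for the state team", "tragic accident on the highway",
         "festival brings joy to the city", "floods destroy crops and homes"] * 5
labels = [1, 0, 1, 0] * 5

# TF-IDF over unigrams and bigrams, then gradient boosting.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      GradientBoostingClassifier(random_state=0))
model.fit(texts, labels)
print(model.predict(["joy at the festival"]))
```

On a real task the data would be split into train and test sets; fitting and scoring on the same toy sentences here only demonstrates the plumbing.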
A comparative evaluation for question answering over Greek texts by using machine translation and BERT
Michalis Mountantonakis, Loukas Mertzanis, Michalis Bastakis, Yannis Tzitzikas
Pub Date: 2024-06-19 | DOI: 10.1007/s10579-024-09745-9
Although there are numerous effective BERT models for question answering (QA) over plain English texts, the same is not true for other languages, such as Greek. Since training a new BERT model for a given language can be time-consuming, we present a generic methodology for multilingual QA that combines, at runtime, existing machine translation (MT) models with BERT QA models pretrained in English, and we perform a comparative evaluation for the Greek language. In particular, we propose a pipeline that (a) exploits widely used MT libraries for translating a question and a context from a source language into English, (b) extracts the answer from the translated English context through popular BERT models (pretrained on English corpora), (c) translates the answer back into the source language, and (d) evaluates the answer through semantic similarity metrics based on sentence embeddings, such as Bi-Encoder and BERTScore. For evaluating our system, we use 21 models, and we have created a test set with 20 texts and 200 questions, for which we manually labelled 4200 answers. These resources can be reused for several tasks, including QA and sentence similarity. Moreover, we use the existing multilingual test set XQuAD, with 240 texts and 1190 questions in Greek. We focus on both effectiveness and efficiency, through manually and machine-labelled results. The evaluation shows that the proposed approach can be an efficient and effective alternative to multilingual BERT: although the multilingual BERT QA model achieves the highest scores in both human and automatic evaluation, all the models combining MT with BERT QA models are faster, and some achieve quite similar scores.
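Steps (a)-(c) of the pipeline can be sketched as a function that receives the MT and QA components as parameters. The stubs below are trivial stand-ins (a one-entry fake dictionary and a last-word "QA model") so the skeleton runs without any MT library or BERT model installed; all names are ours, not the authors' code.

```python
from typing import Callable

def qa_via_translation(question: str, context: str, src: str,
                       translate: Callable[[str, str, str], str],
                       answer_in_english: Callable[[str, str], str]) -> str:
    """(a) translate question and context to English, (b) run an
    English-only QA model, (c) translate the answer back to `src`."""
    q_en = translate(question, src, "en")
    c_en = translate(context, src, "en")
    a_en = answer_in_english(q_en, c_en)
    return translate(a_en, "en", src)

# Stub components so the skeleton is runnable end to end.
fake_lexicon = {"Ποιος": "Who", "Who": "Ποιος"}  # hypothetical toy "MT"
def translate(text, src, tgt):            # stand-in for an MT library call
    return fake_lexicon.get(text, text)
def answer_in_english(question, context):  # stand-in for a BERT QA model
    return context.split()[-1]

print(qa_via_translation("Ποιος", "the answer is Prague", "el",
                         translate, answer_in_english))  # -> Prague
```

Step (d), scoring the returned answer against a reference with Bi-Encoder or BERTScore similarity, would wrap this function in an evaluation loop.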
Umigon-lexicon: rule-based model for interpretable sentiment analysis and factuality categorization
Clément Levallois
Pub Date: 2024-06-17 | DOI: 10.1007/s10579-024-09742-y
We introduce umigon-lexicon, a novel resource comprising English lexicons and associated conditions designed specifically to evaluate the sentiment conveyed by an author's subjective perspective. We conduct a comprehensive comparison with existing lexicons and evaluate umigon-lexicon's efficacy in sentiment analysis and factuality classification tasks. This evaluation is performed across eight datasets and against six models. The results demonstrate umigon-lexicon's competitive performance, underscoring the enduring value of lexicon-based solutions in sentiment analysis and factuality categorization. Furthermore, umigon-lexicon stands out for its intrinsic interpretability and the ability to make its operations fully transparent to end users, offering significant advantages over existing models.
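A minimal sketch of how a lexicon with attached conditions stays fully transparent to end users: every entry carries an explicit condition that can be reported alongside the decision. The entries and the condition format below are hypothetical illustrations, not the actual umigon-lexicon resource.

```python
# Hypothetical lexicon entries: a sentiment plus a condition on context.
LEXICON = {
    "love":  {"sentiment": "positive", "unless_preceded_by": {"don't", "not"}},
    "awful": {"sentiment": "negative", "unless_preceded_by": set()},
}

def classify(sentence: str) -> str:
    """Return the sentiment of the first lexicon hit, applying conditions."""
    tokens = sentence.lower().split()
    for i, tok in enumerate(tokens):
        entry = LEXICON.get(tok)
        if entry is None:
            continue
        prev = tokens[i - 1] if i > 0 else ""
        if prev in entry["unless_preceded_by"]:
            # condition fires: negation flips the entry's polarity
            return "negative" if entry["sentiment"] == "positive" else "positive"
        return entry["sentiment"]
    return "neutral"

print(classify("I love this"))        # positive
print(classify("I don't love this"))  # negative
```

Because every decision traces back to a named entry and a named condition, the full reasoning chain can be surfaced to the user, which is the interpretability advantage the abstract highlights.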
Training and evaluation of vector models for Galician
Marcos Garcia
Pub Date: 2024-06-04 | DOI: 10.1007/s10579-024-09740-0
This paper presents a large and systematic assessment of distributional models for Galician. To this end, we first trained and evaluated static word embeddings (e.g., word2vec, GloVe) and then compared their performance with that of contextualised representations generated by current neural language models. First, we compiled and processed a large corpus for Galician and created four datasets for word analogies and concept categorisation based on standard resources for other languages. Using this corpus, we trained 760 static vector space models which vary in their input representations (e.g., adjacency-based versus dependency-based approaches), learning algorithms, size of the surrounding contexts, and number of vector dimensions. These models have been evaluated both intrinsically, using the newly created datasets, and extrinsically, on POS-tagging, dependency parsing, and named entity recognition. The results provide new insights into the performance of different vector models in Galician and into the impact of several training parameters on each task. In general, fastText embeddings are the static representations with the best performance in the intrinsic evaluations and in named entity recognition, while syntax-based embeddings achieve the highest results in POS-tagging and dependency parsing, indicating that there is no significant correlation between performance on the intrinsic and extrinsic tasks. Finally, we compared the performance of static vector representations with that of BERT-based word embeddings, whose fine-tuning obtains the best performance on named entity recognition. This comparison provides a comprehensive state-of-the-art overview of current models in Galician and releases new transformer-based models for NER. All the resources used in this research are freely available to the community, and the best models have been incorporated into SemantiGal, an online tool for exploring vector representations for Galician.
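Word-analogy evaluations of the kind described above are commonly scored with the 3CosAdd method: for "a is to b as c is to ?", return the vocabulary word closest to b - a + c. A minimal sketch with hand-made vectors (real evaluations use the trained embeddings), using the Galician pairs home/homes ("man/men") and muller/mulleres ("woman/women"):

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """3CosAdd: the word (excluding a, b, c) most similar to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    candidates = {w: v for w, v in vectors.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cos(candidates[w], target))

# Tiny hand-made vectors where the second dimension encodes plurality.
V = {
    "home":     np.array([1.0, 0.0, 1.0]),
    "homes":    np.array([1.0, 1.0, 1.0]),
    "muller":   np.array([0.0, 0.0, 1.0]),
    "mulleres": np.array([0.0, 1.0, 1.0]),
    "libro":    np.array([1.0, 0.0, 0.0]),
}
print(solve_analogy("home", "homes", "muller", V))  # mulleres
```

Accuracy over an analogy dataset is then the fraction of items for which the top-ranked candidate matches the expected word.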
Slovenian parliamentary corpus siParl
Katja Meden, Tomaž Erjavec, Andrej Pančur
Pub Date: 2024-06-02 | DOI: 10.1007/s10579-024-09746-8
Parliamentary debates are an essential part of democratic discourse and provide insights into various socio-demographic and linguistic phenomena; parliamentary corpora, which contain transcripts of parliamentary debates along with extensive metadata, are therefore an important resource for parliamentary discourse analysis and other research areas. This paper presents the Slovenian parliamentary corpus siParl, whose latest version contains transcripts of plenary sessions and other legislative bodies of the Assembly of the Republic of Slovenia from 1990 to 2022, comprising more than 1 million speeches and 210 million words. We outline the development history of the corpus and other initiatives it has influenced (such as the Parla-CLARIN encoding and the ParlaMint corpora of European parliaments), present the corpus creation process from initial data collection to the structural development and encoding of the corpus, and, given the growing influence of the ParlaMint corpora, compare siParl with the Slovenian ParlaMint-SI corpus. Finally, we discuss updates planned for the next version as well as the long-term development and enrichment of siParl.
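Speeches in Parla-CLARIN-encoded corpora are TEI XML utterance (`<u>`) elements with a `who` attribute identifying the speaker. A minimal sketch of extracting speaker and text with the standard library, using an invented two-speech fragment rather than real siParl data:

```python
import xml.etree.ElementTree as ET

# Invented Parla-CLARIN-style fragment (TEI namespace, <u> utterances).
tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA"><seg>Spoštovani zbor, začenjam sejo.</seg></u>
    <u who="#SpeakerB"><seg>Hvala za besedo.</seg></u>
  </div></body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(tei)
# Collect (speaker, text) pairs from every utterance element.
speeches = [(u.get("who"),
             " ".join(seg.text for seg in u.findall("tei:seg", NS)))
            for u in root.iter("{http://www.tei-c.org/ns/1.0}u")]
for who, text in speeches:
    print(who, text)
```

Real siParl files carry far richer markup (session metadata, speaker registries, notes), but this is the basic traversal pattern for pulling speeches out of the TEI structure.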
A sentiment corpus for the cryptocurrency financial domain: the CryptoLin corpus
Manoel Fernando Alonso Gadi, Miguel Ángel Sicilia
Pub Date: 2024-05-25 | DOI: 10.1007/s10579-024-09743-x
The objective of this paper is to describe Cryptocurrency Linguo (CryptoLin), a novel corpus containing 2683 cryptocurrency-related news articles spanning more than three years. CryptoLin was human-annotated with discrete values representing negative, neutral, and positive news, respectively. Eighty-three people participated in the annotation process; each news title was randomly assigned and blindly annotated by three human annotators, one from each cohort, followed by a consensus mechanism using simple voting. The annotators were intentionally drawn from three cohorts of students with a very diverse set of nationalities and educational backgrounds to minimize bias as much as possible. When one annotator was in total disagreement with the other two (e.g., one negative vs. two positive, or one positive vs. two negative), we treated this as a minority report and defaulted the label to neutral. Fleiss's Kappa, Krippendorff's Alpha, and Gwet's AC1 inter-rater reliability coefficients demonstrate CryptoLin's acceptable quality of inter-annotator agreement. The dataset also includes the text span with the three manual label annotations for further auditing of the annotation mechanism. To further assess the quality of the labeling and the usefulness of the CryptoLin dataset, we apply four pretrained sentiment analysis models: Vader, TextBlob, Flair, and FinBERT. Vader and FinBERT demonstrate reasonable performance on CryptoLin, indicating that the data was not annotated randomly and is therefore useful for further research. FinBERT (negative) presents the best performance, indicating an advantage of being trained on financial news. Both the CryptoLin dataset and the Jupyter Notebook with the analysis are available at the project's GitHub for reproducibility (Gadi and Ángel Sicilia, 2022). Overall, CryptoLin aims to complement current knowledge by providing a novel, publicly available cryptocurrency sentiment corpus and by fostering research on cryptocurrency sentiment analysis and potential applications in behavioral science. This can be useful for businesses and policymakers who want to understand how cryptocurrencies are being used and how they might be regulated. Finally, the rules for selecting and assigning annotators make CryptoLin unique and interesting for new research on annotator selection, assignment, and bias.
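The consensus mechanism described above (simple voting over three annotators, with a positive-vs-negative 2-1 split defaulting to neutral under the minority-report rule) can be sketched as:

```python
from collections import Counter

def consensus(a, b, c):
    """Consensus label for three annotations from {positive, neutral, negative}."""
    votes = Counter([a, b, c])
    label, n = votes.most_common(1)[0]
    if n == 3:
        return label                     # unanimous
    if n == 2:
        (minority,) = [x for x in (a, b, c) if x != label]
        opposite = {"positive": "negative", "negative": "positive"}
        if opposite.get(label) == minority:
            return "neutral"             # minority-report rule: total disagreement
        return label                     # e.g., two positive vs. one neutral
    return "neutral"                     # three different labels

print(consensus("positive", "positive", "negative"))  # neutral
print(consensus("positive", "positive", "neutral"))   # positive
```

The three-way-split case is not spelled out in the abstract; defaulting it to neutral is our assumption for completeness.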
Automatic construction of direction-aware sentiment lexicon using direction-dependent words
Jihye Park, Hye Jin Lee, Sungzoon Cho
Pub Date: 2024-05-25 | DOI: 10.1007/s10579-024-09737-9
Explainability, the degree to which an interested stakeholder can understand the key factors that led to a data-driven model's decision, is regarded as essential in the financial domain. Accordingly, lexicons that achieve reasonable performance while providing clear explanations to users have been among the most popular resources in sentiment-based financial forecasting. Since deep learning-based techniques offer no clear basis for interpreting their results, lexicons have consistently attracted the community's attention as a crucial tool in studies that demand explanations of the sentiment estimation process. One challenge in constructing a financial sentiment lexicon is the domain-specific phenomenon that the sentiment orientation of a word can change depending on an accompanying directional expression. For instance, the word "cost" typically conveys negative sentiment; however, when it is juxtaposed with "decrease" to form the phrase "cost decrease," the associated sentiment is positive. Several studies have manually built lexicons containing directional expressions, but manual inspection inevitably requires intensive human labor and time. In this study, we propose to automatically construct a sentiment lexicon composed of direction-dependent words, which expresses each term as a pair consisting of a directional word and a direction-dependent word.
Title: Automatic construction of direction-aware sentiment lexicon using direction-dependent words
Pub Date: 2024-05-15, DOI: 10.1007/s10579-024-09734-y
Ida Szubert, Omri Abend, Nathan Schneider, Samuel Gibbon, Louis Mahon, Sharon Goldwater, Mark Steedman
Corpora of child speech and child-directed speech (CDS) have enabled major contributions to the study of child language acquisition, yet semantic annotation for such corpora is still scarce and lacks a uniform standard. Semantic annotation of CDS is particularly important for understanding the nature of the input children receive and developing computational models of child language acquisition. For example, under the assumption that children are able to infer meaning representations for (at least some of) the utterances they hear, the acquisition task is to learn a grammar that can map novel adult utterances onto their corresponding meaning representations, in the face of noise and distraction by other contextually possible meanings. To study this problem and to develop computational models of it, we need corpora that provide both adult utterances and their meaning representations, ideally using annotation that is consistent across a range of languages in order to facilitate cross-linguistic comparative studies. This paper proposes a methodology for constructing such corpora of CDS paired with sentential logical forms, and uses this method to create two such corpora, in English and Hebrew. The approach enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. Specifically, the approach involves two steps. First, we annotate the corpora using the Universal Dependencies (UD) scheme for syntactic annotation, which has been developed to apply consistently to a wide variety of domains and typologically diverse languages. Next, we further annotate these data by applying an automatic method for transducing sentential logical forms (LFs) from UD structures. 
The UD and LF representations have complementary strengths: UD structures are language-neutral and support consistent and reliable annotation by multiple annotators, whereas LFs are neutral as to their syntactic derivation and transparently encode semantic relations. Using this approach, we provide syntactic and semantic annotation for two corpora from CHILDES: Brown's Adam corpus (English; we annotate approximately 80% of its child-directed utterances) and all child-directed utterances from Berman's Hagar corpus (Hebrew). We verify the quality of the UD annotation using an inter-annotator agreement study, and manually evaluate the transduced meaning representations. We then demonstrate the utility of the compiled corpora through (1) a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena in the CDS, and (2) applying an existing computational model of language acquisition to the two corpora and briefly comparing the results across languages.
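The UD-to-LF transduction step can be pictured with a toy sketch (our illustration, greatly simplified relative to the paper's method): given UD-style (head, relation, dependent) triples, core arguments are grouped under their predicate to yield a flat predicate-argument logical form. The sentence and helper names are hypothetical.

```python
# Toy transduction from UD-style dependency triples to a flat logical form,
# here for the utterance "Adam eats cookies".
triples = [
    ("eats", "nsubj", "Adam"),    # subject of the predicate
    ("eats", "obj", "cookies"),   # direct object of the predicate
]

CORE_RELATIONS = ("nsubj", "obj", "iobj")

def to_logical_form(triples):
    """Group core arguments under each predicate head, in triple order."""
    args = {}
    for head, rel, dep in triples:
        if rel in CORE_RELATIONS:
            args.setdefault(head, []).append(dep)
    return " & ".join(f"{pred}({', '.join(a)})" for pred, a in args.items())

print(to_logical_form(triples))  # eats(Adam, cookies)
```

A real transducer must additionally handle quantification, modification, non-core relations, and scope, which is why the paper's method operates over full UD trees rather than isolated triples.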
Title: Cross-linguistically consistent semantic and syntactic annotation of child-directed speech