Spoken Spanish PoS tagging: gold standard dataset
Johnatan E. Bonilla
Pub Date: 2024-07-02 | DOI: 10.1007/s10579-024-09751-x
The development of a benchmark for part-of-speech (PoS) tagging of spoken dialectal European Spanish is presented, which will serve as the foundation for a future treebank. The benchmark is constructed using transcriptions of the Corpus Oral y Sonoro del Español Rural (COSER; "Audible corpus of spoken rural Spanish") and follows the Universal Dependencies project guidelines. We describe the methodology used to create a gold standard, which serves both to evaluate different state-of-the-art PoS taggers (spaCy, Stanza NLP, and UDPipe) originally trained on written data and to fine-tune and evaluate a model for spoken Spanish. It is shown that the accuracy of these taggers drops from 0.98-0.99 to 0.94-0.95 when tested on spoken data. Of the three taggers, spaCy's trf (transformer) and Stanza NLP models performed best. Finally, the spaCy trf model is fine-tuned on our gold standard, resulting in an accuracy of 0.98 for coarse-grained tags (UPOS) and 0.97 for fine-grained tags (FEATS). Our benchmark will enable the development of more accurate PoS taggers for spoken Spanish and facilitate the construction of a treebank for European Spanish varieties.
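As a concrete illustration of the evaluation the abstract describes, the sketch below computes token-level UPOS accuracy between a gold-standard tag sequence and a tagger's predictions. The tag sequences are invented toy data, not COSER material, and the helper name is our own.

```python
def upos_accuracy(gold, pred):
    """Token-level accuracy between two aligned UPOS tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy aligned example: taggers trained on written text often mislabel
# spoken-language discourse markers, so one of six tags is wrong here.
gold = ["INTJ", "ADV", "PRON", "VERB", "ADP", "NOUN"]
pred = ["CCONJ", "ADV", "PRON", "VERB", "ADP", "NOUN"]
print(upos_accuracy(gold, pred))  # 5/6 ≈ 0.833
```

In practice the predicted sequence would come from running spaCy, Stanza, or UDPipe over the same tokenised utterances as the gold standard.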
The Najdi Arabic Corpus: a new corpus for an underrepresented Arabic dialect
Rukayah Alhedayani
Pub Date: 2024-07-02 | DOI: 10.1007/s10579-024-09749-5
This paper presents the Najdi Arabic Corpus (NAC), a new corpus for a dialect of Arabic spoken in the central region of Saudi Arabia, and the first publicly available corpus for this dialect. The audio clips gathered for compiling the NAC are of three types: (1) 15-minute recordings of interviews with people telling stories about their lives, (2) recordings of varying lengths taken from YouTube, and (3) very short recordings, between 2 and 7 minutes long, taken from other social media outlets such as WhatsApp and Snapchat. The corpus totals 275,134 part-of-speech-tagged tokens gathered from different regions of Najd.
Which words are important?: an empirical study of Assamese sentiment analysis
Ringki Das, Thoudam Doren Singh
Pub Date: 2024-06-19 | DOI: 10.1007/s10579-024-09756-6
Sentiment analysis is an important research domain in text analytics and natural language processing. Over the last few decades, it has become a fascinating and salient area for researchers seeking to understand human sentiment. According to the 2011 census, the Assamese language is spoken by 15 million people. Despite being a scheduled language of the Indian Constitution, it is still a resource-constrained language, and although it is an official language with its own script, little work on sentiment analysis has been reported for Assamese. In a linguistically diverse country like India, it is essential to provide systems that help people understand sentiment in their native languages; without state-of-the-art NLP systems for regional languages, India's multilingual society cannot fully leverage the benefits of AI. The Assamese language has become popular owing to its wide range of applications, and the number of Assamese users on social media and other platforms grows daily. Automatic sentiment analysis systems can be valuable to individuals, governments, political parties, and other organizations, and can help stop negativity from spreading regardless of language divides. This paper presents a study of textual sentiment analysis in the Assamese news domain using different lexical features with machine learning and deep learning techniques. In the experiments, baseline models are developed and compared against models with lexical features. The proposed model with AAV lexical features, based on an XGBoost classifier, achieves the highest accuracy of 86.76% with the TF-IDF approach. We observe that combining lexical features with a machine learning classifier significantly helps sentiment prediction in a small-dataset scenario compared with individual lexical features.
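A minimal sketch of the kind of pipeline evaluated above, assuming scikit-learn is available: TF-IDF features feeding a gradient-boosting classifier. scikit-learn's GradientBoostingClassifier stands in for XGBoost here, and the English toy sentences stand in for the Assamese news data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in data (1 = positive, 0 = negative); the paper's experiments
# use Assamese news sentences with AAV lexical features and XGBoost.
texts = ["great win for the state team", "tragic accident on the highway",
         "festival brings joy to the city", "floods destroy crops and homes"] * 5
labels = [1, 0, 1, 0] * 5

# TF-IDF over unigrams and bigrams, then gradient boosting.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      GradientBoostingClassifier(random_state=0))
model.fit(texts, labels)
print(model.predict(["joy at the festival"]))
```

On a real task the data would be split into train and test sets; fitting and scoring on the same toy sentences here only demonstrates the plumbing.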
A comparative evaluation for question answering over Greek texts by using machine translation and BERT
Michalis Mountantonakis, Loukas Mertzanis, Michalis Bastakis, Yannis Tzitzikas
Pub Date: 2024-06-19 | DOI: 10.1007/s10579-024-09745-9
Although there are numerous effective BERT models for question answering (QA) over plain English texts, the same is not true for other languages, such as Greek. Since training a new BERT model for a given language can be time-consuming, we present a generic methodology for multilingual QA that combines, at runtime, existing machine translation (MT) models with BERT QA models pretrained in English, and we perform a comparative evaluation for the Greek language. In particular, we propose a pipeline that (a) exploits widely used MT libraries for translating a question and a context from a source language into English, (b) extracts the answer from the translated English context through popular BERT models (pretrained on English corpora), (c) translates the answer back into the source language, and (d) evaluates the answer through semantic similarity metrics based on sentence embeddings, such as Bi-Encoder and BERTScore. For evaluating our system, we use 21 models, and we have created a test set with 20 texts and 200 questions, for which we manually labelled 4200 answers. These resources can be reused for several tasks, including QA and sentence similarity. Moreover, we use the existing multilingual test set XQuAD, with 240 texts and 1190 questions in Greek. We focus on both effectiveness and efficiency, through manually and machine-labelled results. The evaluation shows that the proposed approach can be an efficient and effective alternative to multilingual BERT: although the multilingual BERT QA model achieves the highest scores in both human and automatic evaluation, all the models combining MT with BERT QA models are faster, and some achieve quite similar scores.
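Steps (a)-(c) of the pipeline can be sketched as a function that receives the MT and QA components as parameters. The stubs below are trivial stand-ins (a one-entry fake dictionary and a last-word "QA model") so the skeleton runs without any MT library or BERT model installed; all names are ours, not the authors' code.

```python
from typing import Callable

def qa_via_translation(question: str, context: str, src: str,
                       translate: Callable[[str, str, str], str],
                       answer_in_english: Callable[[str, str], str]) -> str:
    """(a) translate question and context to English, (b) run an
    English-only QA model, (c) translate the answer back to `src`."""
    q_en = translate(question, src, "en")
    c_en = translate(context, src, "en")
    a_en = answer_in_english(q_en, c_en)
    return translate(a_en, "en", src)

# Stub components so the skeleton is runnable end to end.
fake_lexicon = {"Ποιος": "Who", "Who": "Ποιος"}  # hypothetical toy "MT"
def translate(text, src, tgt):            # stand-in for an MT library call
    return fake_lexicon.get(text, text)
def answer_in_english(question, context):  # stand-in for a BERT QA model
    return context.split()[-1]

print(qa_via_translation("Ποιος", "the answer is Prague", "el",
                         translate, answer_in_english))  # -> Prague
```

Step (d), scoring the returned answer against a reference with Bi-Encoder or BERTScore similarity, would wrap this function in an evaluation loop.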
Umigon-lexicon: rule-based model for interpretable sentiment analysis and factuality categorization
Clément Levallois
Pub Date: 2024-06-17 | DOI: 10.1007/s10579-024-09742-y
We introduce umigon-lexicon, a novel resource comprising English lexicons and associated conditions designed specifically to evaluate the sentiment conveyed by an author's subjective perspective. We conduct a comprehensive comparison with existing lexicons and evaluate umigon-lexicon's efficacy in sentiment analysis and factuality classification tasks. This evaluation is performed across eight datasets and against six models. The results demonstrate umigon-lexicon's competitive performance, underscoring the enduring value of lexicon-based solutions in sentiment analysis and factuality categorization. Furthermore, umigon-lexicon stands out for its intrinsic interpretability and the ability to make its operations fully transparent to end users, offering significant advantages over existing models.
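A minimal sketch of how a lexicon with attached conditions stays fully transparent to end users: every entry carries an explicit condition that can be reported alongside the decision. The entries and the condition format below are hypothetical illustrations, not the actual umigon-lexicon resource.

```python
# Hypothetical lexicon entries: a sentiment plus a condition on context.
LEXICON = {
    "love":  {"sentiment": "positive", "unless_preceded_by": {"don't", "not"}},
    "awful": {"sentiment": "negative", "unless_preceded_by": set()},
}

def classify(sentence: str) -> str:
    """Return the sentiment of the first lexicon hit, applying conditions."""
    tokens = sentence.lower().split()
    for i, tok in enumerate(tokens):
        entry = LEXICON.get(tok)
        if entry is None:
            continue
        prev = tokens[i - 1] if i > 0 else ""
        if prev in entry["unless_preceded_by"]:
            # condition fires: negation flips the entry's polarity
            return "negative" if entry["sentiment"] == "positive" else "positive"
        return entry["sentiment"]
    return "neutral"

print(classify("I love this"))        # positive
print(classify("I don't love this"))  # negative
```

Because every decision traces back to a named entry and a named condition, the full reasoning chain can be surfaced to the user, which is the interpretability advantage the abstract highlights.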
Training and evaluation of vector models for Galician
Marcos Garcia
Pub Date: 2024-06-04 | DOI: 10.1007/s10579-024-09740-0
This paper presents a large and systematic assessment of distributional models for Galician. To this end, we first trained and evaluated static word embeddings (e.g., word2vec, GloVe) and then compared their performance with that of contextualised representations generated by current neural language models. First, we compiled and processed a large corpus for Galician and created four datasets for word analogies and concept categorisation based on standard resources for other languages. Using this corpus, we trained 760 static vector space models which vary in their input representations (e.g., adjacency-based versus dependency-based approaches), learning algorithms, size of the surrounding contexts, and number of vector dimensions. These models have been evaluated both intrinsically, using the newly created datasets, and extrinsically, on POS-tagging, dependency parsing, and named entity recognition. The results provide new insights into the performance of different vector models in Galician and into the impact of several training parameters on each task. In general, fastText embeddings are the static representations with the best performance in the intrinsic evaluations and in named entity recognition, while syntax-based embeddings achieve the highest results in POS-tagging and dependency parsing, indicating that there is no significant correlation between performance on the intrinsic and extrinsic tasks. Finally, we compared the performance of static vector representations with that of BERT-based word embeddings, whose fine-tuning obtains the best performance on named entity recognition. This comparison provides a comprehensive state-of-the-art overview of current models in Galician and releases new transformer-based models for NER. All the resources used in this research are freely available to the community, and the best models have been incorporated into SemantiGal, an online tool for exploring vector representations for Galician.
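Word-analogy evaluations of the kind described above are commonly scored with the 3CosAdd method: for "a is to b as c is to ?", return the vocabulary word closest to b - a + c. A minimal sketch with hand-made vectors (real evaluations use the trained embeddings), using the Galician pairs home/homes ("man/men") and muller/mulleres ("woman/women"):

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """3CosAdd: the word (excluding a, b, c) most similar to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    candidates = {w: v for w, v in vectors.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cos(candidates[w], target))

# Tiny hand-made vectors where the second dimension encodes plurality.
V = {
    "home":     np.array([1.0, 0.0, 1.0]),
    "homes":    np.array([1.0, 1.0, 1.0]),
    "muller":   np.array([0.0, 0.0, 1.0]),
    "mulleres": np.array([0.0, 1.0, 1.0]),
    "libro":    np.array([1.0, 0.0, 0.0]),
}
print(solve_analogy("home", "homes", "muller", V))  # mulleres
```

Accuracy over an analogy dataset is then the fraction of items for which the top-ranked candidate matches the expected word.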
Slovenian parliamentary corpus siParl
Katja Meden, Tomaž Erjavec, Andrej Pančur
Pub Date: 2024-06-02 | DOI: 10.1007/s10579-024-09746-8
Parliamentary debates are an essential part of democratic discourse and provide insights into various socio-demographic and linguistic phenomena; parliamentary corpora, which contain transcripts of parliamentary debates along with extensive metadata, are therefore an important resource for parliamentary discourse analysis and other research areas. This paper presents the Slovenian parliamentary corpus siParl, whose latest version contains transcripts of plenary sessions and other legislative bodies of the Assembly of the Republic of Slovenia from 1990 to 2022, comprising more than 1 million speeches and 210 million words. We outline the development history of the corpus and other initiatives it has influenced (such as the Parla-CLARIN encoding and the ParlaMint corpora of European parliaments), present the corpus creation process from initial data collection to the structural development and encoding of the corpus, and, given the growing influence of the ParlaMint corpora, compare siParl with the Slovenian ParlaMint-SI corpus. Finally, we discuss updates planned for the next version as well as the long-term development and enrichment of siParl.
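Speeches in Parla-CLARIN-encoded corpora are TEI XML utterance (`<u>`) elements with a `who` attribute identifying the speaker. A minimal sketch of extracting speaker and text with the standard library, using an invented two-speech fragment rather than real siParl data:

```python
import xml.etree.ElementTree as ET

# Invented Parla-CLARIN-style fragment (TEI namespace, <u> utterances).
tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA"><seg>Spoštovani zbor, začenjam sejo.</seg></u>
    <u who="#SpeakerB"><seg>Hvala za besedo.</seg></u>
  </div></body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(tei)
# Collect (speaker, text) pairs from every utterance element.
speeches = [(u.get("who"),
             " ".join(seg.text for seg in u.findall("tei:seg", NS)))
            for u in root.iter("{http://www.tei-c.org/ns/1.0}u")]
for who, text in speeches:
    print(who, text)
```

Real siParl files carry far richer markup (session metadata, speaker registries, notes), but this is the basic traversal pattern for pulling speeches out of the TEI structure.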
A sentiment corpus for the cryptocurrency financial domain: the CryptoLin corpus
Manoel Fernando Alonso Gadi, Miguel Ángel Sicilia
Pub Date: 2024-05-25 | DOI: 10.1007/s10579-024-09743-x
The objective of this paper is to describe Cryptocurrency Linguo (CryptoLin), a novel corpus containing 2683 cryptocurrency-related news articles spanning more than three years. CryptoLin was human-annotated with discrete values representing negative, neutral, and positive news, respectively. Eighty-three people participated in the annotation process; each news title was randomly assigned and blindly annotated by three human annotators, one from each cohort, followed by a consensus mechanism using simple voting. The annotators were intentionally drawn from three cohorts of students with a very diverse set of nationalities and educational backgrounds to minimize bias as much as possible. When one annotator was in total disagreement with the other two (e.g., one negative vs. two positive, or one positive vs. two negative), we treated this as a minority report and defaulted the label to neutral. Fleiss's Kappa, Krippendorff's Alpha, and Gwet's AC1 inter-rater reliability coefficients demonstrate CryptoLin's acceptable quality of inter-annotator agreement. The dataset also includes the text span with the three manual label annotations for further auditing of the annotation mechanism. To further assess the quality of the labeling and the usefulness of the CryptoLin dataset, we apply four pretrained sentiment analysis models: Vader, TextBlob, Flair, and FinBERT. Vader and FinBERT demonstrate reasonable performance on CryptoLin, indicating that the data was not annotated randomly and is therefore useful for further research. FinBERT (negative) presents the best performance, indicating an advantage of being trained on financial news. Both the CryptoLin dataset and the Jupyter Notebook with the analysis are available at the project's GitHub for reproducibility (Gadi and Ángel Sicilia, 2022). Overall, CryptoLin aims to complement current knowledge by providing a novel, publicly available cryptocurrency sentiment corpus and by fostering research on cryptocurrency sentiment analysis and potential applications in behavioral science. This can be useful for businesses and policymakers who want to understand how cryptocurrencies are being used and how they might be regulated. Finally, the rules for selecting and assigning annotators make CryptoLin unique and interesting for new research on annotator selection, assignment, and bias.
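The consensus mechanism described above (simple voting over three annotators, with a positive-vs-negative 2-1 split defaulting to neutral under the minority-report rule) can be sketched as:

```python
from collections import Counter

def consensus(a, b, c):
    """Consensus label for three annotations from {positive, neutral, negative}."""
    votes = Counter([a, b, c])
    label, n = votes.most_common(1)[0]
    if n == 3:
        return label                     # unanimous
    if n == 2:
        (minority,) = [x for x in (a, b, c) if x != label]
        opposite = {"positive": "negative", "negative": "positive"}
        if opposite.get(label) == minority:
            return "neutral"             # minority-report rule: total disagreement
        return label                     # e.g., two positive vs. one neutral
    return "neutral"                     # three different labels

print(consensus("positive", "positive", "negative"))  # neutral
print(consensus("positive", "positive", "neutral"))   # positive
```

The three-way-split case is not spelled out in the abstract; defaulting it to neutral is our assumption for completeness.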
Automatic construction of direction-aware sentiment lexicon using direction-dependent words
Jihye Park, Hye Jin Lee, Sungzoon Cho
Pub Date: 2024-05-25 | DOI: 10.1007/s10579-024-09737-9
Explainability, the degree to which an interested stakeholder can understand the key factors that led to a data-driven model's decision, is regarded as essential in the financial domain. Accordingly, lexicons that achieve reasonable performance while providing clear explanations to users have been among the most popular resources in sentiment-based financial forecasting. Since deep learning-based techniques offer no clear basis for interpreting their results, lexicons have consistently attracted the community's attention as a crucial tool in studies that demand explanations of the sentiment estimation process. One challenge in constructing a financial sentiment lexicon is the domain-specific phenomenon that the sentiment orientation of a word can change depending on an accompanying directional expression. For instance, the word "cost" typically conveys negative sentiment; however, when it is juxtaposed with "decrease" to form the phrase "cost decrease," the associated sentiment is positive. Several studies have manually built lexicons containing directional expressions, but manual inspection inevitably requires intensive human labor and time. In this study, we propose to automatically construct a sentiment lexicon composed of direction-dependent words, which expresses each term as a pair consisting of a directional word and a direction-dependent word.
Title: Automatic construction of direction-aware sentiment lexicon using direction-dependent words
Pub Date: 2024-05-15, DOI: 10.1007/s10579-024-09734-y
Ida Szubert, Omri Abend, Nathan Schneider, Samuel Gibbon, Louis Mahon, Sharon Goldwater, Mark Steedman
Corpora of child speech and child-directed speech (CDS) have enabled major contributions to the study of child language acquisition, yet semantic annotation for such corpora is still scarce and lacks a uniform standard. Semantic annotation of CDS is particularly important for understanding the nature of the input children receive and developing computational models of child language acquisition. For example, under the assumption that children are able to infer meaning representations for (at least some of) the utterances they hear, the acquisition task is to learn a grammar that can map novel adult utterances onto their corresponding meaning representations, in the face of noise and distraction by other contextually possible meanings. To study this problem and to develop computational models of it, we need corpora that provide both adult utterances and their meaning representations, ideally using annotation that is consistent across a range of languages in order to facilitate cross-linguistic comparative studies. This paper proposes a methodology for constructing such corpora of CDS paired with sentential logical forms, and uses this method to create two such corpora, in English and Hebrew. The approach enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. Specifically, the approach involves two steps. First, we annotate the corpora using the Universal Dependencies (UD) scheme for syntactic annotation, which has been developed to apply consistently to a wide variety of domains and typologically diverse languages. Next, we further annotate these data by applying an automatic method for transducing sentential logical forms (LFs) from UD structures. 
The UD and LF representations have complementary strengths: UD structures are language-neutral and support consistent and reliable annotation by multiple annotators, whereas LFs are neutral as to their syntactic derivation and transparently encode semantic relations. Using this approach, we provide syntactic and semantic annotation for two corpora from CHILDES: Brown's Adam corpus (English; we annotate approximately 80% of its child-directed utterances) and all child-directed utterances from Berman's Hagar corpus (Hebrew). We verify the quality of the UD annotation using an inter-annotator agreement study, and manually evaluate the transduced meaning representations. We then demonstrate the utility of the compiled corpora through (1) a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena in the CDS, and (2) applying an existing computational model of language acquisition to the two corpora and briefly comparing the results across languages.
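The UD-to-LF transduction step can be pictured with a toy sketch (our illustration, greatly simplified relative to the paper's method): given UD-style (head, relation, dependent) triples, core arguments are grouped under their predicate to yield a flat predicate-argument logical form. The sentence and helper names are hypothetical.

```python
# Toy transduction from UD-style dependency triples to a flat logical form,
# here for the utterance "Adam eats cookies".
triples = [
    ("eats", "nsubj", "Adam"),    # subject of the predicate
    ("eats", "obj", "cookies"),   # direct object of the predicate
]

CORE_RELATIONS = ("nsubj", "obj", "iobj")

def to_logical_form(triples):
    """Group core arguments under each predicate head, in triple order."""
    args = {}
    for head, rel, dep in triples:
        if rel in CORE_RELATIONS:
            args.setdefault(head, []).append(dep)
    return " & ".join(f"{pred}({', '.join(a)})" for pred, a in args.items())

print(to_logical_form(triples))  # eats(Adam, cookies)
```

A real transducer must additionally handle quantification, modification, non-core relations, and scope, which is why the paper's method operates over full UD trees rather than isolated triples.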
Title: Cross-linguistically consistent semantic and syntactic annotation of child-directed speech