Pub Date : 2024-01-13 | DOI: 10.1007/s10579-023-09708-6
Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning
Kiran Babu Nelatoori, Hima Bindu Kommanti
Detecting toxic comments and the rationale for the offensiveness of a social media post supports the moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model, built through transfer learning, for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification create a strong multi-task learning objective. A task collaboration module is designed to leverage the bi-directional attention between the classification and span prediction tasks. The combined loss function of the model is constructed from the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, no such dataset exists for Hinglish code-mixed text as of today. Hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. The F1 scores of the proposed CA-MTL model with the HingRoBERTa encoder are significantly higher than those of the baseline models on both tasks. Caution: This paper may contain words disturbing to some readers.
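The abstract states that the model's loss combines the individual losses of the classification and span tasks but does not give the formulation here. Below is a minimal, generic sketch of such a combination in PyTorch; the weighting factor `alpha` and both criteria are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch only: the weighting factor `alpha` and both criteria below
# are assumptions for exposition, not the authors' implementation.
import torch.nn as nn

cls_criterion = nn.CrossEntropyLoss()   # toxic vs. non-toxic comment classification
span_criterion = nn.CrossEntropyLoss()  # token-level rationale/span labels

def combined_loss(cls_logits, cls_labels, span_logits, span_labels, alpha=0.5):
    """Weighted sum of the classification loss and the span-prediction loss."""
    l_cls = cls_criterion(cls_logits, cls_labels)
    # flatten (batch, seq_len, num_tags) to (batch * seq_len, num_tags) for the token loss
    l_span = span_criterion(span_logits.view(-1, span_logits.size(-1)), span_labels.view(-1))
    return alpha * l_cls + (1.0 - alpha) * l_span
```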
{"title":"Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning","authors":"Kiran Babu Nelatoori, Hima Bindu Kommanti","doi":"10.1007/s10579-023-09708-6","DOIUrl":"https://doi.org/10.1007/s10579-023-09708-6","url":null,"abstract":"<p>Detecting toxic comments and rationale for the offensiveness of a social media post promotes moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model through transfer learning for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification create a strong multi-task learning objective. A task collaboration module is designed to leverage the bi-directional attention between the classification and span prediction tasks. The combined loss function of the model is constructed using the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, one for Hinglish code-mixed text does not exist as of today. Hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. The F1 scores of the proposed CA-MTL model with HingRoBERTa encoder for both tasks are significantly higher than the baseline models. <i>Caution:</i> This paper may contain words disturbing to some readers.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"27 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139460881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-01-06 | DOI: 10.1007/s10579-023-09706-8
Multi-layered semantic annotation and the formalisation of annotation schemas for the investigation of modality in a Latin corpus
Abstract
This paper stems from the project A World of Possibilities. Modal pathways over an extra-long period of time: the diachrony of modality in the Latin language (WoPoss), which takes a corpus-based approach to the study of modality in the history of the Latin language. Linguistic annotation, and in particular the semantic annotation of modality, is a keystone of the project. Besides the difficulties intrinsic to any annotation task dealing with semantics, our annotation scheme involves multiple interconnected layers of annotation, adding complexity to the task. Considering the intricacies of our fine-grained semantic annotation, we needed to develop well-documented schemas, both to control the consistency of the annotation and to enable efficient reuse of our annotated corpus. This paper presents the different elements involved in the annotation task, and how the description of, and the relations between, the different linguistic components were formalised and documented, combining schema languages with XML documentation.
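As a rough illustration of the kind of schema-driven consistency checking the paper describes, the sketch below validates a hypothetical annotation file against a hypothetical XSD schema with lxml; the actual WoPoss schemas combine several schema languages and are not reproduced here.

```python
# Generic sketch of schema-driven consistency checking for XML annotation files.
# Both file names are hypothetical placeholders, not WoPoss resources.
from lxml import etree

schema = etree.XMLSchema(etree.parse("modality-annotation-schema.xsd"))
doc = etree.parse("annotated_passage.xml")

if not schema.validate(doc):
    for error in schema.error_log:
        print(f"line {error.line}: {error.message}")
```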
{"title":"Multi-layered semantic annotation and the formalisation of annotation schemas for the investigation of modality in a Latin corpus","authors":"","doi":"10.1007/s10579-023-09706-8","DOIUrl":"https://doi.org/10.1007/s10579-023-09706-8","url":null,"abstract":"<h3>Abstract</h3> <p>This paper stems from the project <em>A World of Possibilities. Modal pathways over an extra-long period of time: the diachrony of modality in the Latin language</em> (WoPoss) which involves a corpus-based approach to the study of modality in the history of the Latin language. Linguistic annotation and, in particular, the semantic annotation of modality is a keystone of the project. Besides the difficulties intrinsic to any annotation task dealing with semantics, our annotation scheme involves multiple layers of annotation that are interconnected, adding complexity to the task. Considering the intricacies of our fine-grained semantic annotation, we needed to develop well-documented schemas in order to control the consistency of the annotation, but also to enable an efficient reuse of our annotated corpus. This paper presents the different elements involved in the annotation task, and how the description and the relations between the different linguistic components were formalised and documented, combining schema languages with XML documentation.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"24 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139375818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-01-03 | DOI: 10.1007/s10579-023-09702-y
AC-IQuAD: Automatically Constructed Indonesian Question Answering Dataset by Leveraging Wikidata
Kerenza Doxolodeo, Adila Alfa Krisnadhi
Constructing a question-answering dataset can be prohibitively expensive, making it difficult for researchers to build one for an under-resourced language such as Indonesian. We create a novel Indonesian question-answering dataset that is produced automatically end-to-end. The process uses a context-free grammar, the Indonesian Wikipedia corpus, and the concept of a proxy model. The dataset consists of 134 thousand simple questions and 60 thousand complex questions. It achieves competitive grammatical and model accuracy compared to the translated dataset, but suffers from some issues due to resource constraints.
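For readers unfamiliar with grammar-driven question generation, the toy sketch below shows the general idea using NLTK; the grammar is invented for illustration and is unrelated to the actual AC-IQuAD grammar or its proxy model.

```python
# Toy illustration of grammar-driven question generation with NLTK.
# The grammar below is invented for exposition, not the AC-IQuAD grammar.
from nltk import CFG
from nltk.parse.generate import generate

grammar = CFG.fromstring("""
  Q  -> WH 'is' NP '?'
  WH -> 'What' | 'Where'
  NP -> 'the' 'capital' 'of' N
  N  -> 'Indonesia' | 'Java'
""")

for tokens in generate(grammar, n=4):
    print(" ".join(tokens))
```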
{"title":"AC-IQuAD: Automatically Constructed Indonesian Question Answering Dataset by Leveraging Wikidata","authors":"Kerenza Doxolodeo, Adila Alfa Krisnadhi","doi":"10.1007/s10579-023-09702-y","DOIUrl":"https://doi.org/10.1007/s10579-023-09702-y","url":null,"abstract":"<p>Constructing a question-answering dataset can be prohibitively expensive, making it difficult for researchers to make one for an under-resourced language, such as Indonesian. We create a novel Indonesian Question Answering dataset that is produced automatically end-to-end. The process uses Context Free Grammar, the Wikipedia Indonesian Corpus, and the concept of the proxy model. The dataset consists of 134 thousand simple questions and 60 thousand complex questions. It achieved competitive grammatical and model accuracy compared to the translated dataset but suffers from some issues due to resource constraints.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"21 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139093771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-01-02 | DOI: 10.1007/s10579-023-09716-6
KurdiSent: a corpus for Kurdish sentiment analysis
Soran Badawi, Arefeh Kazemi, Vali Rezaie
Language is essential for communication and the expression of feelings and sentiments. As technology advances, language has become increasingly ubiquitous in our lives. One of the most critical research areas in natural language processing (NLP) is sentiment analysis, which aims to identify and extract opinions and attitudes from text. Sentiment analysis is particularly useful for understanding public opinion on products, services, and topics of interest. While sentiment analysis systems are well-developed for English, this is not the case for other languages, such as Kurdish, because less-resourced languages have fewer NLP resources, including annotated datasets. To bridge this gap, this paper introduces KurdiSent, the first manually annotated dataset for Kurdish sentiment analysis. KurdiSent consists of over 12,000 instances labeled as positive, negative, or neutral. The corpus covers the Sorani dialect of Kurdish, the most widely spoken dialect. To assess the quality of KurdiSent, machine learning and deep learning classifiers were trained on the dataset. The experimental results indicated that XLM-R outperformed all other classifiers, with an accuracy of 85%, compared to 81% for the best machine learning classifier. KurdiSent is a valuable resource for the NLP community, as it will enable researchers to develop and improve sentiment analysis systems for Kurdish. The corpus will facilitate a better understanding of public opinion in Kurdish-speaking communities.
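As an illustration of how XLM-R can be applied to three-class sentiment classification, the sketch below loads the base checkpoint with a fresh classification head; the checkpoint name, label order and example sentence are placeholders, and in practice the model would first be fine-tuned on the KurdiSent training split before its predictions mean anything.

```python
# Minimal sketch of XLM-R as a three-class (positive/negative/neutral) sentiment classifier.
# Checkpoint, label order and example sentence are placeholders, not the paper's setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=3)

labels = ["negative", "neutral", "positive"]                        # assumed label order
inputs = tokenizer("ئەم فیلمە زۆر جوان بوو", return_tensors="pt")   # Sorani example sentence
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[int(logits.argmax(dim=-1))])
```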
{"title":"KurdiSent: a corpus for kurdish sentiment analysis","authors":"Soran Badawi, Arefeh Kazemi, Vali Rezaie","doi":"10.1007/s10579-023-09716-6","DOIUrl":"https://doi.org/10.1007/s10579-023-09716-6","url":null,"abstract":"<p>Language is essential for communication and the expression of feelings and sentiments. As technology advances, language has become increasingly ubiquitous in our lives. One of the most critical research areas in natural language processing (NLP) is sentiment analysis, which aims to identify and extract opinions and attitudes from text. Sentiment analysis is particularly useful for understanding public opinion on products, services, and topics of interest. While sentiment analysis systems are well-developed for English, this differs for other languages, such as Kurdish. This is because less-resourced languages have fewer NLP resources, including annotated datasets. To bridge this gap, this paper introduces KurdiSent, the first manually annotated dataset for Kurdish sentiment analysis. KurdiSent consists of over 12,000 instances labeled as positive, negative, or neutral. The corpus covers the Sorani dialect of Kurdish, the most widely spoken dialect. To ensure the quality of KurdiSent, the dataset was trained on machine learning and deep learning classifiers. The experimental results indicated that XLM-R outperformed all machine learning and deep learning classifiers, with an accuracy of 85%, compared to 81% for the best machine learning classifier. KurdiSent is a valuable resource for the NLP community, as it will enable researchers to develop and improve sentiment analysis systems for Kurdish. The corpus will facilitate a better understanding of public opinion in Kurdish-speaking communities.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"21 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139083544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-12-26 | DOI: 10.1007/s10579-023-09699-4
Syntactic annotation for Portuguese corpora: standards, parsers, and search interfaces
Pablo Faria, Charlotte Galves, Catarina Magro
In the last two decades, four Portuguese syntactically annotated corpora were built along the lines initially defined for the Penn Parsed Historical Corpora (Santorini, 2016). They cover the old, middle, classical and modern periods of European Portuguese, as well as nineteenth- and twentieth-century Brazilian Portuguese, and include different textual genres and oral discourse excerpts. Together they provide a fundamental resource for the study of variation and change in Portuguese. In recent years, an effort was made to unify the annotation scheme applied to those corpora as far as possible, so that searches done on one corpus could be done in exactly the same manner on the others. This effort resulted in the Portuguese Syntactic Annotation Manual (Magro & Galves, 2019). In this paper, we present the syntactic annotation for the Portuguese Corpora. We describe the functioning of ParsPort, a rule-based parser that makes use of the revision mode of the CorpusSearch query language (Randall, 2005–2015). We argue that ParsPort is more efficient for our annotation efforts than the probabilistic parser developed by Bikel (2004), which was previously used for the syntactic annotation of the Portuguese Corpora. Finally, we mention recent advances towards more user-friendly tools for syntactic searches.
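The corpora follow the Penn-style bracketed format, which can also be queried outside CorpusSearch. The sketch below is a generic Python illustration with an invented toy parse; it is not ParsPort and not a CorpusSearch query.

```python
# Not CorpusSearch or ParsPort: a generic sketch of loading a Penn-style bracketed
# parse (the format these corpora use) and searching it for node labels.
from nltk import Tree

parse = "(IP-MAT (NP-SBJ (PRO ele)) (VB-P fala) (NP-ACC (N portugues)))"  # invented toy example
tree = Tree.fromstring(parse)

# find all noun-phrase nodes, whatever their function tag
for subtree in tree.subtrees(lambda t: t.label().startswith("NP")):
    print(subtree.label(), " ".join(subtree.leaves()))
```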
{"title":"Syntactic annotation for Portuguese corpora: standards, parsers, and search interfaces","authors":"Pablo Faria, Charlotte Galves, Catarina Magro","doi":"10.1007/s10579-023-09699-4","DOIUrl":"https://doi.org/10.1007/s10579-023-09699-4","url":null,"abstract":"<p>In the last two decades, four Portuguese syntactically annotated corpora were built along the lines initially defined for the <i>Penn Parsed Historical Corpora</i> (Santorini, 2016). They cover the old, the middle, the classical and the modern periods of European Portuguese, as well as the nineteenth and twentieth century Brazilian Portuguese, and include different textual genres and oral discourse excerpts. Together they provide a fundamental resource for the study of variation and change in Portuguese. In the last years, an effort was made to maximally unify the annotation scheme applied to those corpora, in such a way that the searches done on one corpus could be done in exactly the same manner on the others. This effort resulted in the Portuguese Syntactic Annotation Manual (Magro & Galves, 2019). In this paper, we present the syntactic annotation for the Portuguese Corpora. We describe the functioning of ParsPort, a rule-based parser which makes use of the revision mode of the query language Corpus Search (Randall, 2005–2015). We argue that ParsPort is more efficient to our annotation efforts than the probabilistic parser developed by Bikel (2004), previously used for the syntactic annotation of the Portuguese Corpora. Finally we mention recent advances towards more user-friendly tools for syntactic searches.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"5 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139056467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-12-13 | DOI: 10.1007/s10579-023-09703-x
Linguistic annotation of Byzantine book epigrams
Colin Swaelens, Ilse De Vos, Els Lefever
In this paper, we explore the feasibility of developing a part-of-speech tagger for non-normalised Byzantine Greek epigrams. To this end, we compared three different transformer-based models with embedding representations, which were then fine-tuned on a fine-grained part-of-speech tagging task. To train the language models, we compiled two data sets: the first consisting of Ancient and Byzantine Greek texts, the second of Ancient, Byzantine and Modern Greek. This allowed us to ascertain whether Modern Greek contributes to the modelling of Byzantine Greek. For the supervised task of part-of-speech tagging, we collected a training set of existing, annotated (Ancient) Greek texts. For evaluation, a gold standard containing 10,000 tokens of unedited Byzantine Greek poems was manually annotated and validated through an inter-annotator agreement study. The experimental results look very promising, with the BERT model trained on all Greek data achieving the best performance for fine-grained part-of-speech tagging.
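Fine-grained part-of-speech tagging with a BERT-style encoder is typically cast as token classification. The sketch below illustrates that setup with a multilingual checkpoint and an invented toy tagset; it is not one of the models or the tag inventory used in the paper.

```python
# Illustrative only: POS tagging cast as token classification. The checkpoint and
# tiny tagset are placeholders, not the Greek-trained encoders from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "bert-base-multilingual-cased"        # stand-in for the Greek-trained encoders
tagset = ["NOUN", "VERB", "ADJ", "PART", "PUNCT"]  # toy tag inventory

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(tagset))

inputs = tokenizer("τὸν δὲ λόγον τοῦτον", return_tensors="pt")
with torch.no_grad():
    predicted = model(**inputs).logits.argmax(dim=-1)[0]
for token, tag_id in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()), predicted):
    print(token, tagset[int(tag_id)])
```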
{"title":"Linguistic annotation of Byzantine book epigrams","authors":"Colin Swaelens, Ilse De Vos, Els Lefever","doi":"10.1007/s10579-023-09703-x","DOIUrl":"https://doi.org/10.1007/s10579-023-09703-x","url":null,"abstract":"<p>In this paper, we explore the feasibility of developing a part-of-speech tagger for not-normalised, Byzantine Greek epigrams. Hence, we compared three different transformer-based models with embedding representations, which are then fine-tuned on a fine-grained part-of-speech tagging task. To train the language models, we compiled two data sets: the first consisting of Ancient and Byzantine Greek texts, the second of Ancient, Byzantine and Modern Greek. This allowed us to ascertain whether Modern Greek contributes to the modelling of Byzantine Greek. For the supervised task of part-of-speech tagging, we collected a training set of existing, annotated (Ancient) Greek texts. For evaluation, a gold standard containing 10,000 tokens of unedited Byzantine Greek poems was manually annotated and validated through an inter-annotator agreement study. The experimental results look very promising, with the BERT model trained on all Greek data achieving the best performance for fine-grained part-of-speech tagging.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"15 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138632327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-12-13 | DOI: 10.1007/s10579-023-09704-w
Democratizing neural machine translation with OPUS-MT
Jörg Tiedemann, Mikko Aulamo, Daria Bakshandaeva, Michele Boggia, Stig-Arne Grönroos, Tommi Nieminen, Alessandro Raganato, Yves Scherrer, Raúl Vázquez, Sami Virpioja
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our ongoing mission of increasing language coverage and translation quality, and also describe work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.
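OPUS-MT checkpoints are published on the Hugging Face hub and can be run through the standard translation pipeline; the language pair below is chosen arbitrarily from the released models.

```python
# Example of running a released OPUS-MT checkpoint via the Hugging Face pipeline API.
# The English->Finnish pair is just one of the many published models.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fi")
result = translator("Open machine translation models lower the entry barrier for everyone.")
print(result[0]["translation_text"])
```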
{"title":"Democratizing neural machine translation with OPUS-MT","authors":"Jörg Tiedemann, Mikko Aulamo, Daria Bakshandaeva, Michele Boggia, Stig-Arne Grönroos, Tommi Nieminen, Alessandro Raganato, Yves Scherrer, Raúl Vázquez, Sami Virpioja","doi":"10.1007/s10579-023-09704-w","DOIUrl":"https://doi.org/10.1007/s10579-023-09704-w","url":null,"abstract":"<p>This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our ongoing mission of increasing language coverage and translation quality, and also describe work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"32 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138631759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-12-09 | DOI: 10.1007/s10579-023-09705-9
When MIPVU goes to no man’s land: a new language resource for hybrid, morpheme-based metaphor identification in Hungarian
Gábor Simon, Tímea Bajzát, Júlia Ballagó, Zsuzsanna Havasi, Emese K. Molnár, Eszter Szlávich
The aim of this article is to present a new language resource for metaphor analysis in corpora: (i) a MIPVU-inspired, morpheme-based process for identifying metaphor in Hungarian, and (ii) a refined, innovative version of metaphor identification that extends the scope of the process to multi-word expressions. The elaboration of language-specific protocols for metaphor identification has become one of the central endeavors in contemporary cross-linguistic research on metaphor, but there is a gap in the field regarding languages with rich morphology, especially in the case of Hungarian. To fill this gap, we developed a hybrid, morpheme-based version of the original method, which can handle morphologically complex metaphorical expressions. Additional innovations of our protocol are the measurement and tagging of idiomaticity in metaphors based on collocation analysis, and the identification of semantic relationships between the components of metaphorical expressions. The present paper discusses both the theoretical motivation and the practical details of the adapted method for metaphor identification. In conclusion, the presented protocol can provide new answers to questions of metaphor identification in languages with rich morphology and shed new light on the internal semantic organization of linguistic metaphors.
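As a generic illustration of collocation-based idiomaticity scoring (not the authors' protocol), the sketch below ranks bigrams of a toy English sample by pointwise mutual information with NLTK, a common rough proxy for how strongly the parts of an expression cling together.

```python
# Generic collocation sketch, not the paper's protocol: PMI-ranked bigrams
# over a toy English sample as a rough proxy for idiomaticity.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = "he kicked the bucket and then he kicked the ball over the fence".split()
finder = BigramCollocationFinder.from_words(tokens)
for bigram, score in finder.score_ngrams(BigramAssocMeasures.pmi)[:5]:
    print(bigram, round(score, 2))
```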
{"title":"When MIPVU goes to no man’s land: a new language resource for hybrid, morpheme-based metaphor identification in Hungarian","authors":"Gábor Simon, Tímea Bajzát, Júlia Ballagó, Zsuzsanna Havasi, Emese K. Molnár, Eszter Szlávich","doi":"10.1007/s10579-023-09705-9","DOIUrl":"https://doi.org/10.1007/s10579-023-09705-9","url":null,"abstract":"<p>The aim of the article is to present a new language resource for metaphor analysis in corpora that is (i) a MIPVU-inspired, morpheme-based process for identifying metaphor in Hungarian and (ii) the refinement and innovative version of metaphor identification extending the scope of the process to multi-word expressions. The elaboration of language-specific protocols in metaphor identification has become one of the central endeavors in contemporary cross-linguistic research on metaphor, but there is a gap in the field regarding languages with rich morphology, especially in the case of Hungarian. To fill this gap, we developed a hybrid, morpheme-based version of the original method, which can handle morphologically complex metaphorical expressions. Additional innovations of our protocol are the measurement and tagging of idiomaticity in metaphors based on collocation analysis and the identification of semantic relationships between the components of metaphorical expressions. The present paper discusses both the theoretical motivation and the practical details of the adapted method for metaphor identification. As a conclusion, the presented protocol can provide new answers to the questions of metaphor identification in languages with rich morphology and shed new light on the internal semantic organization of linguistic metaphors.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"9 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138561352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-12-08 | DOI: 10.1007/s10579-023-09700-0
EmoTwiCS: a corpus for modelling emotion trajectories in Dutch customer service dialogues on Twitter
Sofie Labat, Thomas Demeester, Véronique Hoste
Due to the rise of user-generated content, social media is increasingly adopted as a channel to deliver customer service. Given the public character of online platforms, the automatic detection of emotions forms an important application in monitoring customer satisfaction and preventing negative word-of-mouth. This paper introduces EmoTwiCS, a corpus of 9489 Dutch customer service dialogues on Twitter that are annotated for emotion trajectories. In our business-oriented corpus, we view emotions as dynamic attributes of the customer that can change at each utterance of the conversation. The term ‘emotion trajectory’ refers therefore not only to the fine-grained emotions experienced by customers (annotated with 28 labels and valence-arousal-dominance scores), but also to the event happening prior to the conversation and the responses made by the human operator (both annotated with 8 categories). Inter-annotator agreement (IAA) scores on the resulting dataset are substantial and comparable with related research, underscoring its high quality. Given the interplay between the different layers of annotated information, we perform several in-depth analyses to investigate (i) static emotions in isolated tweets, (ii) dynamic emotions and their shifts in trajectory, and (iii) the role of causes and response strategies in emotion trajectories. We conclude by listing the advantages and limitations of our dataset, after which we give some suggestions on the different types of predictive modelling tasks and open research questions to which EmoTwiCS can be applied. The dataset is made publicly available at https://lt3.ugent.be/resources/emotwics.
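To make the layered annotation scheme concrete, the sketch below defines a hypothetical Python container mirroring the layers described above; the field names and example values are illustrative and do not reflect the corpus's actual release format.

```python
# Hypothetical container mirroring the annotation layers described in the abstract.
# Field names and example values are illustrative, not the corpus's release format.
from dataclasses import dataclass, field

@dataclass
class UtteranceAnnotation:
    text: str
    emotions: list[str]    # fine-grained labels from the 28-label inventory
    valence: float         # valence-arousal-dominance scores
    arousal: float
    dominance: float

@dataclass
class Dialogue:
    event_category: str     # event happening prior to the conversation
    response_strategy: str  # category of the operator's response
    trajectory: list[UtteranceAnnotation] = field(default_factory=list)

dialogue = Dialogue(event_category="delivery problem", response_strategy="apology")
dialogue.trajectory.append(UtteranceAnnotation(
    "Mijn pakket is nog steeds niet geleverd!", ["annoyance"],
    valence=0.2, arousal=0.7, dominance=0.4))
```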
{"title":"EmoTwiCS: a corpus for modelling emotion trajectories in Dutch customer service dialogues on Twitter","authors":"Sofie Labat, Thomas Demeester, Véronique Hoste","doi":"10.1007/s10579-023-09700-0","DOIUrl":"https://doi.org/10.1007/s10579-023-09700-0","url":null,"abstract":"<p>Due to the rise of user-generated content, social media is increasingly adopted as a channel to deliver customer service. Given the public character of online platforms, the automatic detection of emotions forms an important application in monitoring customer satisfaction and preventing negative word-of-mouth. This paper introduces EmoTwiCS, a corpus of 9489 Dutch customer service dialogues on Twitter that are annotated for emotion trajectories. In our business-oriented corpus, we view emotions as dynamic attributes of the customer that can change at each utterance of the conversation. The term ‘emotion trajectory’ refers therefore not only to the fine-grained emotions experienced by customers (annotated with 28 labels and valence-arousal-dominance scores), but also to the event happening prior to the conversation and the responses made by the human operator (both annotated with 8 categories). Inter-annotator agreement (IAA) scores on the resulting dataset are substantial and comparable with related research, underscoring its high quality. Given the interplay between the different layers of annotated information, we perform several in-depth analyses to investigate (i) static emotions in isolated tweets, (ii) dynamic emotions and their shifts in trajectory, and (iii) the role of causes and response strategies in emotion trajectories. We conclude by listing the advantages and limitations of our dataset, after which we give some suggestions on the different types of predictive modelling tasks and open research questions to which EmoTwiCS can be applied. The dataset is made publicly available at https://lt3.ugent.be/resources/emotwics.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"10 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138561256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-12-02 | DOI: 10.1007/s10579-023-09697-6
Resources building for sentiment analysis of content disseminated by Tunisian medias in social networks
Emna Fsih, Rahma Boujelbane, Lamia Hadrich Belguith
Nowadays, social networks play a fundamental role in promoting and diffusing television and radio programs to different categories of audiences. Political parties, influential groups and political activists have therefore rapidly seized these new communication media to spread their ideas and express their sentiments on critical issues. In this context, Twitter, Facebook and YouTube have become very popular tools for sharing videos and communicating with users who interact with each other to discuss problems, propose solutions and give viewpoints. This interaction on social media sites yields a huge amount of unstructured and noisy text, hence the need for automated analysis techniques to classify the sentiments conveyed in users' comments. In this work, we focus on opinions written in a less-resourced Arabic variety, the Tunisian dialect (TD), and present a process for building a sentiment analysis model for comments on Tunisian television broadcasts published in social media. These comments are written with widely varying spellings because TD has no orthographic standard. To this end, we designed crucial resources, namely a sentiment lexicon and an annotated corpus, which we used to investigate machine-learning and deep-learning models in order to identify the best sentiment analysis model for the Tunisian dialect.