The Reading Everyday Emotion Database (REED): a set of audio-visual recordings of emotions in music and language
Pub Date: 2023-11-20 | DOI: 10.1007/s10579-023-09698-5
Jia Hoong Ong, Florence Yik Nam Leung, Fang Liu
Most audio-visual (AV) emotion databases consist of clips that do not reflect real-life emotion processing (e.g., professional actors recorded in bright, studio-like environments), contain only spoken clips, and include no sung clips that express complex emotions. Here, we introduce a new AV database, the Reading Everyday Emotion Database (REED), which directly addresses those gaps. We recorded the faces of everyday adults with a diverse range of acting experience expressing 13 emotions—neutral, the six basic emotions (angry, disgusted, fearful, happy, sad, surprised), and six complex emotions (embarrassed, hopeful, jealous, proud, sarcastic, stressed)—in two auditory domains (spoken and sung) using everyday recording devices (e.g., laptops and mobile phones). The recordings were validated by an independent group of raters. We found that intensity ratings of the recordings were positively associated with recognition accuracy, and that the basic emotions, as well as the Neutral and Sarcastic emotions, were recognised more accurately than the other complex emotions. Emotion recognition accuracy also differed by utterance. Exploratory analysis revealed that recordings of participants with drama experience were better recognised than those of participants without. Overall, this database will benefit those who need AV clips with natural variations in both emotion expression and recording environment.
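As a rough illustration of the validation analysis summarised above, the sketch below computes per-clip recognition accuracy and its association with mean intensity ratings. The table layout, column names, and 1-7 rating scale are assumptions made for the example, not the structure of the REED release.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical rater data: clip, intended emotion, the rater's chosen
# label, and an intensity judgement (a 1-7 scale is assumed here).
ratings = pd.DataFrame({
    "clip_id":   ["c1", "c1", "c2", "c2", "c3", "c3"],
    "intended":  ["happy", "happy", "sarcastic", "sarcastic", "jealous", "jealous"],
    "response":  ["happy", "happy", "sarcastic", "happy", "sad", "jealous"],
    "intensity": [6, 5, 4, 3, 2, 3],
})

# A response counts as correct when it matches the intended emotion.
ratings["correct"] = ratings["intended"] == ratings["response"]

# One recognition accuracy and one mean intensity value per clip.
per_clip = ratings.groupby("clip_id").agg(
    accuracy=("correct", "mean"),
    mean_intensity=("intensity", "mean"),
)

# Test for the positive intensity-accuracy association reported above.
rho, p = spearmanr(per_clip["mean_intensity"], per_clip["accuracy"])
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```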
{"title":"The Reading Everyday Emotion Database (REED): a set of audio-visual recordings of emotions in music and language","authors":"Jia Hoong Ong, Florence Yik Nam Leung, Fang Liu","doi":"10.1007/s10579-023-09698-5","DOIUrl":"https://doi.org/10.1007/s10579-023-09698-5","url":null,"abstract":"<p>Most audio-visual (AV) emotion databases consist of clips that do not reflect real-life emotion processing (e.g., professional actors in bright studio-like environment), contain only spoken clips, and none have sung clips that express complex emotions. Here, we introduce a new AV database, the Reading Everyday Emotion Database (REED), which directly addresses those gaps. We recorded the faces of everyday adults with a diverse range of acting experience expressing 13 emotions—neutral, the six basic emotions (angry, disgusted, fearful, happy, sad, surprised), and six complex emotions (embarrassed, hopeful, jealous, proud, sarcastic, stressed)—in two auditory domains (spoken and sung) using everyday recording devices (e.g., laptops, mobile phones, etc.). The recordings were validated by an independent group of raters. We found that: intensity ratings of the recordings were positively associated with recognition accuracy; and the basic emotions, as well as the Neutral and Sarcastic emotions, were recognised more accurately than the other complex emotions. Emotion recognition accuracy also differed by utterance. Exploratory analysis revealed that recordings of those with drama experience were better recognised than those without. Overall, this database will benefit those who need AV clips with natural variations in both emotion expressions and recording environment.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"6 6","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138524444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A multilingual, multimodal dataset of aggression and bias: the ComMA dataset
Pub Date: 2023-11-16 | DOI: 10.1007/s10579-023-09696-7
Ritesh Kumar, Shyam Ratan, Siddharth Singh, Enakshi Nandi, Laishram Niranjana Devi, Akash Bhagat, Yogesh Dawer, Bornini Lahiri, Akanksha Bansal
In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the “context” in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also by the “type” of discursive role that the comment performs with respect to the previous comment(s). The dataset has been developed as part of the ComMA Project and consists of a total of 57,363 annotated comments, 1142 annotated memes, and around 70 hours of annotated audio (extracted from videos) in four languages—Meitei, Bangla, Hindi, and Indian English. This data has been collected from various social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media websites, a large number of these comments are multilingual, and many are code-mixed with English. This paper gives a detailed description of the tagset developed during the course of this project and elaborates on the process of developing and using a multi-label, fine-grained tagset for marking comments with aggression and bias of various kinds, including gender bias, religious intolerance (called communal bias in the tagset), class/caste bias, and ethnic/racial bias. We define and discuss the tags used for marking the different discursive roles performed by the comments, such as attack and defend. We also present a statistical analysis of the dataset as well as the results of our baseline experiments for developing an automatic aggression identification system using the dataset. Based on the results of the baseline experiments, we also argue that our dataset provides diverse and ‘hard’ instances, which makes it a good dataset for training and testing new techniques for aggressive and abusive language classification.
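To make the hierarchical, multi-label scheme concrete, here is a hedged sketch of how a single ComMA-style comment annotation could be represented in code. All field names and label values are illustrative assumptions, not the project's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class CommentAnnotation:
    comment_id: str
    language: str                       # e.g. "Meitei", "Bangla", "Hindi", "Indian English"
    thread_id: str                      # the conversational thread providing context
    parent_id: str | None               # the previous comment this one responds to, if any
    aggression_level: str               # top level of the hierarchy, e.g. "overt" / "covert" / "none"
    bias_labels: set[str] = field(default_factory=set)  # multi-label: "gender", "communal", "caste", "ethnic"
    discursive_role: str | None = None  # role w.r.t. the parent, e.g. "attack" / "defend"

# One illustrative record (all values invented):
ann = CommentAnnotation(
    comment_id="yt_00123",
    language="Hindi",
    thread_id="yt_video_42",
    parent_id="yt_00120",
    aggression_level="overt",
    bias_labels={"gender", "communal"},
    discursive_role="attack",
)
print(ann)
```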
{"title":"A multilingual, multimodal dataset of aggression and bias: the ComMA dataset","authors":"Ritesh Kumar, Shyam Ratan, Siddharth Singh, Enakshi Nandi, Laishram Niranjana Devi, Akash Bhagat, Yogesh Dawer, Bornini Lahiri, Akanksha Bansal","doi":"10.1007/s10579-023-09696-7","DOIUrl":"https://doi.org/10.1007/s10579-023-09696-7","url":null,"abstract":"<p>In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the “context\" in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the “type” of discursive role that the comment is performing with respect to the previous comment(s). The dataset has been developed as part of the ComMA Project and consists of a total of 57,363 annotated comments, 1142 annotated memes, and around 70 h of annotated audio (extracted from videos) in four languages—Meitei, Bangla, Hindi, and Indian English. This data has been collected from various social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media websites, a large number of these comments are multilingual, and many are code-mixed with English. This paper gives a detailed description of the tagset developed during the course of this project and elaborates on the process of developing and using a multi-label, fine-grained tagset for marking comments with aggression and bias of various kinds, which includes gender bias, religious intolerance (called communal bias in the tagset), class/caste bias, and ethnic/racial bias. We define and discuss the tags that have been used for marking different discursive roles being performed through the comments, such as attack, defend, and so on. We also present a statistical analysis of the dataset as well as the results of our baseline experiments for developing an automatic aggression identification system using the dataset developed. Based on the results of the baseline experiments, we also argue that our dataset provides diverse and ‘hard’ sets of instances which makes it a good dataset for training and testing new techniques for aggressive and abusive language classification.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"77 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138524445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic genre identification: a survey
Pub Date: 2023-11-16 | DOI: 10.1007/s10579-023-09695-8
Taja Kuzman, Nikola Ljubešić
Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text categories defined by the author’s purpose, the common function of the text, and the text’s conventional form. Obtaining genre information has been shown to be beneficial for a wide range of disciplines, including linguistics, corpus linguistics, computational linguistics, natural language processing, information retrieval and information security. Consequently, in the past 20 years, numerous researchers have collected genre datasets with the aim of developing an efficient genre classifier. However, their approaches to the definition of genre schemata, data collection and manual annotation vary substantially, resulting in significantly different datasets. As most AGI experiments are dataset-dependent, a sufficient understanding of the differences between the available genre datasets is of great importance for researchers venturing into this area. In this paper, we present a detailed overview of the different approaches to each step of the AGI task, from the definition of the genre concept and the genre schema, to the dataset collection and annotation methods, and, finally, to machine learning strategies. Special focus is dedicated to the description of the most relevant genre schemata and datasets, and details on the availability of all of the datasets are provided. In addition, the paper presents recent advances in machine learning approaches to automatic genre identification, and concludes by proposing directions towards developing a stable multilingual genre classifier.
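For readers new to the task, the snippet below frames AGI as ordinary supervised text classification. The toy texts and genre labels are invented for illustration and are far simpler than the schemata the survey compares.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: genre is defined by function and form, not topic.
texts = [
    "Add two cups of flour and bake for 40 minutes.",
    "The council voted yesterday to approve the new budget.",
    "Click 'Settings' and enable two-factor authentication.",
    "Once upon a time, a fox lived at the edge of the wood.",
]
genres = ["instruction", "news", "instruction", "fiction"]

# A simple bag-of-ngrams classifier; the approaches surveyed range from
# models like this one up to fine-tuned Transformers.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, genres)
print(clf.predict(["Stir the sauce until it thickens."]))
```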
{"title":"Automatic genre identification: a survey","authors":"Taja Kuzman, Nikola Ljubešić","doi":"10.1007/s10579-023-09695-8","DOIUrl":"https://doi.org/10.1007/s10579-023-09695-8","url":null,"abstract":"<p>Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text categories defined by the author’s purpose, common function of the text, and the text’s conventional form. Obtaining genre information has been shown to be beneficial for a wide range of disciplines, including linguistics, corpus linguistics, computational linguistics, natural language processing, information retrieval and information security. Consequently, in the past 20 years, numerous researchers have collected genre datasets with the aim to develop an efficient genre classifier. However, their approaches to the definition of genre schemata, data collection and manual annotation vary substantially, resulting in significantly different datasets. As most AGI experiments are dataset-dependent, a sufficient understanding of the differences between the available genre datasets is of great importance for the researchers venturing into this area. In this paper, we present a detailed overview of different approaches to each of the steps of the AGI task, from the definition of the genre concept and the genre schema, to the dataset collection and annotation methods, and, finally, to machine learning strategies. Special focus is dedicated to the description of the most relevant genre schemata and datasets, and details on the availability of all of the datasets are provided. In addition, the paper presents the recent advances in machine learning approaches to automatic genre identification, and concludes with proposing the directions towards developing a stable multilingual genre classifier.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"22 3","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138524449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Brazilian Portuguese corpora for teaching and translation: the CoMET project
Pub Date: 2023-11-16 | DOI: 10.1007/s10579-023-09690-z
Stella E. O. Tagnin
This paper starts with an overview of corpora available for Brazilian Portuguese and then focuses on the CoMET Project developed at the University of São Paulo. CoMET consists of three corpora: a comparable Portuguese-English technical corpus (CorTec), a Portuguese-English parallel (translation) corpus (CorTrad) and a multilingual learner corpus (CoMAprend), all available for online queries with specific tools. CorTec offers over fifty corpora in a variety of domains, from Health Sciences to the Olympic Games. CorTrad is divided into three parts: Popular Science, Technical-Scientific and Literary. Each of CoMET’s corpora is presented in detail, and examples are provided.
{"title":"Brazilian Portuguese corpora for teaching and translation: the CoMET project","authors":"Stella E. O. Tagnin","doi":"10.1007/s10579-023-09690-z","DOIUrl":"https://doi.org/10.1007/s10579-023-09690-z","url":null,"abstract":"<p>This paper starts with an overview of corpora available for Brazilian Portuguese to subsequently focus mainly on the CoMET Project developed at the University of São Paulo. CoMET consists of three corpora: a comparable Portuguese-English technical corpus (CorTec), a Portuguese-English parallel (translation) corpus (CorTrad) and a multilingual learner corpus, (CoMAprend), all available for online queries with specific tools. CorTec offers over fifty corpora in a variety of domains, from Health Sciences to Olympic Games. CorTrad is divided into three parts: Popular Science, Technical-Scientific and Literary. Each one of CoMET’s corpora is presented in detail. Examples are also provided.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"8 4","pages":""},"PeriodicalIF":2.7,"publicationDate":"2023-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138524437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Correction: The DELAD initiative for sharing language resources on speech disorders
Pub Date: 2023-11-06 | DOI: 10.1007/s10579-023-09701-z
Alice Lee, Nicola Bessell, Henk van den Heuvel, Katarzyna Klessa, Satu Saalasti
{"title":"Correction: The DELAD initiative for sharing language resources on speech disorders","authors":"Alice Lee, Nicola Bessell, Henk van den Heuvel, Katarzyna Klessa, Satu Saalasti","doi":"10.1007/s10579-023-09701-z","DOIUrl":"https://doi.org/10.1007/s10579-023-09701-z","url":null,"abstract":"","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"757 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135636775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI
Pub Date: 2023-11-04 | DOI: 10.1007/s10579-023-09691-y
Ishan Tarunesh, Somak Aditya, Monojit Choudhury
Natural Language Inference (NLI) is considered a representative task for testing natural language understanding (NLU). In this work, we propose an extensible framework to collectively yet categorically test the diverse logical reasoning capabilities required for NLI (and, by extension, NLU). Motivated by behavioral testing, we create a semi-synthetic large test bench (363 templates, 363k examples) and an associated framework that offers the following utilities: (1) individually test and analyze reasoning capabilities along 17 reasoning dimensions (including pragmatic reasoning); (2) design experiments to study cross-capability information content (leave one out or bring one in); and (3) control for artifacts and biases, which the semi-synthetic nature of the data makes possible. We extend a publicly available framework of automated test case instantiation from free-form natural language templates (CheckList) and a well-defined taxonomy of capabilities to cover a wide range of increasingly harder test cases while varying the complexity of natural language. Through our analysis of state-of-the-art NLI systems, we observe that our benchmark is indeed hard (and non-trivial even with training on additional resources). Some capabilities stand out as harder than others. Further, fine-grained analysis and fine-tuning experiments reveal more insights about these capabilities and the models, supporting and extending previous observations and thus showing the utility of the proposed test bench.
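The template-to-example mechanism borrowed from CheckList can be sketched in a few lines: a free-form template plus per-slot lexicons expands into many NLI test cases sharing one expected label. The template, lexicons, and record layout below are invented for illustration, not drawn from the LoNLI release.

```python
from itertools import product

# A toy NLI template with an expected gold label for every instantiation.
template = {
    "premise":    "{name} bought a {object} before the {event}.",
    "hypothesis": "{name} owned a {object} after the {event}.",
    "label":      "entailment",
}
lexicons = {
    "name":   ["Alice", "Ravi"],
    "object": ["car", "violin"],
    "event":  ["concert", "move"],
}

def instantiate(template, lexicons):
    """Expand a template over the cross-product of its slot lexicons."""
    keys = list(lexicons)
    for values in product(*(lexicons[k] for k in keys)):
        slots = dict(zip(keys, values))
        yield {
            "premise": template["premise"].format(**slots),
            "hypothesis": template["hypothesis"].format(**slots),
            "label": template["label"],
        }

# Eight test cases from one template (2 names x 2 objects x 2 events).
for case in instantiate(template, lexicons):
    print(case["premise"], "->", case["hypothesis"], "|", case["label"])
```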
{"title":"LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI","authors":"Ishan Tarunesh, Somak Aditya, Monojit Choudhury","doi":"10.1007/s10579-023-09691-y","DOIUrl":"https://doi.org/10.1007/s10579-023-09691-y","url":null,"abstract":"Natural Language Inference (NLI) is considered a representative task to test natural language understanding (NLU). In this work, we propose an extensible framework to collectively yet categorically test diverse Logical reasoning capabilities required for NLI (and, by extension, NLU). Motivated by behavioral testing, we create a semi-synthetic large test bench (363 templates, 363k examples) and an associated framework that offers the following utilities: (1) individually test and analyze reasoning capabilities along 17 reasoning dimensions (including pragmatic reasoning); (2) design experiments to study cross-capability information content (leave one out or bring one in); and (3) the synthetic nature enables us to control for artifacts and biases. We extend a publicly available framework of automated test case instantiation from free-form natural language templates (CheckList) and a well-defined taxonomy of capabilities to cover a wide range of increasingly harder test cases while varying the complexity of natural language. Through our analysis of state-of-the-art NLI systems, we observe that our benchmark is indeed hard (and non-trivial even with training on additional resources). Some capabilities stand out as harder. Further, fine-grained analysis and fine-tuning experiments reveal more insights about these capabilities and the models – supporting and extending previous observations; thus showing the utility of the proposed testbench.","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"11 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135774512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Building the VisSE Corpus of Spanish SignWriting
Pub Date: 2023-10-26 | DOI: 10.1007/s10579-023-09694-9
Antonio F. G. Sevilla, Alberto Díaz Esteban, José María Lahoz-Bengoechea
{"title":"Building the VisSE Corpus of Spanish SignWriting","authors":"Antonio F. G. Sevilla, Alberto Díaz Esteban, José María Lahoz-Bengoechea","doi":"10.1007/s10579-023-09694-9","DOIUrl":"https://doi.org/10.1007/s10579-023-09694-9","url":null,"abstract":"","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"24 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134909333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beyond plain toxic: building datasets for detection of flammable topics and inappropriate statements
Pub Date: 2023-10-21 | DOI: 10.1007/s10579-023-09682-z
Nikolay Babakov, Varvara Logacheva, Alexander Panchenko
Toxicity on the Internet is an acknowledged problem. It includes a wide range of actions, from the use of obscene words to offenses and hate speech toward particular users or groups of people. However, there also exist other types of inappropriate messages which are usually not viewed as toxic because they do not contain swear words or explicit offenses. Such messages can contain covert toxicity or generalizations, incite harmful actions (crime, suicide, drug use), and provoke “heated” discussions. These messages are often related to particular sensitive topics, e.g. politics, sexual minorities, or social injustice. Such topics tend to yield toxic emotional reactions more often than other topics, e.g. cars or computing. At the same time, not all messages within “flammable” topics are inappropriate. This work focuses on automatically detecting inappropriate language in natural texts, which is crucial for monitoring user-generated content and developing dialogue systems and AI assistants. While many works focus on toxicity detection, we highlight the fact that texts can be harmful without being toxic or containing obscene language. Blind censorship based on keywords is a common approach to addressing these issues, but it limits a system’s functionality; this work proposes a safer and more effective alternative that serves broad user needs, along with the necessary resources and tools. Machinery for inappropriateness detection could thus be useful (i) for making communication on the Internet safer, more productive, and inclusive by flagging truly inappropriate content while not banning messages blindly by topic; (ii) for detecting inappropriate messages generated by automatic systems, e.g. neural chatbots, due to biases in training data; and (iii) for debiasing training data for language models (e.g. BERT and GPT-2). Towards this end, we present two text collections labeled according to a binary notion of inappropriateness (124,597 samples) and a multinomial notion of sensitive topic (33,904 samples). Assuming that the notion of inappropriateness is common among people of the same culture, we base our approach on a human intuitive understanding of what is not acceptable and harmful. To devise an objective view of inappropriateness, we define it in a data-driven way through crowdsourcing. Namely, we run a large-scale annotation study asking workers whether a given chatbot-generated utterance could harm the reputation of the company that created the chatbot. High inter-annotator agreement suggests that the notion of inappropriateness exists and is understood uniformly by different people. To define the notion of a sensitive topic in an objective way, we use guidelines suggested by specialists in the Legal and PR departments of a large company. We use the collected datasets to train inappropriateness and sensitive-topic classifiers employing both classic and Transformer-based models.
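The agreement check described above can be illustrated in a few lines: several crowd workers give each utterance a binary appropriate/inappropriate label, and chance-corrected agreement is computed over the vote counts. The vote matrix below is invented, and Fleiss’ kappa is used as one reasonable choice of agreement statistic; the paper does not necessarily use this exact measure.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# One row per utterance; columns are counts of ["appropriate",
# "inappropriate"] votes from 5 raters each (invented data).
votes = np.array([
    [5, 0],
    [4, 1],
    [0, 5],
    [1, 4],
    [5, 0],
    [2, 3],
])

# Values near 1 mean raters label utterances consistently;
# values near 0 mean agreement is no better than chance.
print(f"Fleiss' kappa: {fleiss_kappa(votes):.2f}")
```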
{"title":"Beyond plain toxic: building datasets for detection of flammable topics and inappropriate statements","authors":"Nikolay Babakov, Varvara Logacheva, Alexander Panchenko","doi":"10.1007/s10579-023-09682-z","DOIUrl":"https://doi.org/10.1007/s10579-023-09682-z","url":null,"abstract":"Toxicity on the Internet is an acknowledged problem. It includes a wide range of actions from the use of obscene words to offenses and hate speech toward particular users or groups of people. However, there also exist other types of inappropriate messages which are usually not viewed as toxic as they do not contain swear words or explicit offenses. Such messages can contain covert toxicity or generalizations, incite harmful actions (crime, suicide, drug use), and provoke “heated” discussions. These messages are often related to particular sensitive topics, e.g. politics, sexual minorities, or social injustice. Such topics tend to yield toxic emotional reactions more often than other topics, e.g. cars or computing. At the same time, not all messages within “flammable” topics are inappropriate. This work focuses on automatically detecting inappropriate language in natural texts. This is crucial for monitoring user-generated content and developing dialogue systems and AI assistants. While many works focus on toxicity detection, we highlight the fact that texts can be harmful without being toxic or containing obscene language. Blind censorship based on keywords is a common approach to address these issues, but it limits a system’s functionality. This work proposes a safe and effective solution to serve broad user needs and develop necessary resources and tools. Thus, machinery for inappropriateness detection could be useful (i) for making communication on the Internet safer, more productive, and inclusive by flagging truly inappropriate content while not banning messages blindly by topic; (ii) for detection of inappropriate messages generated by automatic systems, e.g. neural chatbots, due to biases in training data; (iii) for debiasing training data for language models (e.g. BERT and GPT-2). Towards this end, in this work, we present two text collections labeled according to a binary notion of inappropriateness (124,597 samples) and a multinomial notion of sensitive topic (33,904 samples). Assuming that the notion of inappropriateness is common among people of the same culture, we base our approach on a human intuitive understanding of what is not acceptable and harmful. To devise an objective view of inappropriateness, we define it in a data-driven way through crowdsourcing. Namely, we run a large-scale annotation study asking workers if a given chatbot-generated utterance could harm the reputation of the company that created this chatbot. High values of inter-annotator agreement suggest that the notion of inappropriateness exists and can be uniformly understood by different people. To define the notion of a sensitive topic in an objective way we use guidelines suggested by specialists in the Legal and PR departments of a large company. 
We use the collected datasets to train inappropriateness and sensitive topic classifiers employing both classic and Transformer-based models.","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"14 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135510980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text augmentation for semantic frame induction and parsing
Pub Date: 2023-10-21 | DOI: 10.1007/s10579-023-09679-8
Saba Anwar, Artem Shelmanov, Nikolay Arefyev, Alexander Panchenko, Chris Biemann
Semantic frames are formal structures describing situations, actions or events, e.g., Commerce_buy, Kidnapping, or Exchange. Each frame provides a set of frame elements or semantic roles corresponding to participants of the situation, and lexical units (LUs)—words and phrases that can evoke this particular frame in texts. For example, for the frame Kidnapping, two key roles are the Perpetrator and the Victim, and this frame can be evoked with the lexical units abduct, kidnap, or snatcher. While formally sound, the scarce availability of semantic frame resources and their limited lexical coverage hinder the wider adoption of frame semantics across languages and domains. To tackle this problem, we first propose a method that takes as input a few frame-annotated sentences and generates alternative lexical realizations of lexical units and semantic roles matching the original frame definition. Second, we show that the synthetically generated frame-annotated examples obtained in this way help to improve the quality of frame-semantic parsing. To evaluate our proposed approach, we decompose our work into two parts. In the first part, text augmentation for LUs and roles, we experiment with various types of models, such as distributional thesauri, non-contextualized word embeddings (word2vec, fastText, GloVe), and Transformer-based contextualized models such as BERT or XLNet. We perform an intrinsic evaluation of these induced lexical substitutes using FrameNet gold annotations. Models based on Transformers show overall superior performance; however, they do not always outperform simpler models (based on static embeddings) unless information about the target word is suitably injected, and we observe that non-contextualized models show comparable performance on the task of LU expansion. We also show that combining the substitutes of individual models can significantly improve the quality of the final substitutes. Because intrinsic evaluation scores depend heavily on the gold dataset and on frame preservation, which an automatic evaluation mechanism cannot ensure given the incompleteness of gold datasets, we also carried out experiments with manual evaluation on sample datasets to further analyze the usefulness of our approach. The results show that the manual evaluation framework significantly outperforms automatic evaluation for lexical substitution. For extrinsic evaluation, the second part of this work assesses the utility of these lexical substitutes for the improvement of frame-semantic parsing. We took a small set of frame-annotated sentences and augmented them by replacing the corresponding target words with their closest substitutes, obtained from the best-performing models. Our extensive experiments on the original and augmented sets of annotations with two semantic parsers show that our method is effective for improving the downstream parsing task by training set augmentation, as well as for quickly building FrameNet-like resources.
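One substitute-generation strategy of the kind described above, using a contextualized model, can be sketched with Hugging Face’s fill-mask pipeline: mask the target lexical unit in a frame-annotated sentence and take the model’s top predictions as candidate substitutes. This illustrates the general idea under assumed defaults (bert-base-uncased), not the authors’ exact setup or filtering.

```python
from transformers import pipeline

# Masked-language-model substitute generation for a lexical unit.
fill = pipeline("fill-mask", model="bert-base-uncased")

# A Kidnapping-frame sentence with the LU "abducted" masked out.
sentence = "The men [MASK] the diplomat and demanded a ransom."

# Top predictions serve as candidate lexical substitutes; a real system
# would still need to filter them for frame preservation.
for cand in fill(sentence, top_k=5):
    print(f"{cand['token_str']:>12}  score={cand['score']:.3f}")
```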
{"title":"Text augmentation for semantic frame induction and parsing","authors":"Saba Anwar, Artem Shelmanov, Nikolay Arefyev, Alexander Panchenko, Chris Biemann","doi":"10.1007/s10579-023-09679-8","DOIUrl":"https://doi.org/10.1007/s10579-023-09679-8","url":null,"abstract":"Abstract Semantic frames are formal structures describing situations, actions or events, e.g., Commerce buy , Kidnapping , or Exchange . Each frame provides a set of frame elements or semantic roles corresponding to participants of the situation and lexical units (LUs)—words and phrases that can evoke this particular frame in texts. For example, for the frame Kidnapping , two key roles are Perpetrator and the Victim , and this frame can be evoked with lexical units abduct, kidnap , or snatcher . While formally sound, the scarce availability of semantic frame resources and their limited lexical coverage hinders the wider adoption of frame semantics across languages and domains. To tackle this problem, firstly, we propose a method that takes as input a few frame-annotated sentences and generates alternative lexical realizations of lexical units and semantic roles matching the original frame definition. Secondly, we show that the obtained synthetically generated semantic frame annotated examples help to improve the quality of frame-semantic parsing. To evaluate our proposed approach, we decompose our work into two parts. In the first part of text augmentation for LUs and roles, we experiment with various types of models such as distributional thesauri, non-contextualized word embeddings (word2vec, fastText, GloVe), and Transformer-based contextualized models, such as BERT or XLNet. We perform the intrinsic evaluation of these induced lexical substitutes using FrameNet gold annotations. Models based on Transformers show overall superior performance, however, they do not always outperform simpler models (based on static embeddings) unless information about the target word is suitably injected. However, we observe that non-contextualized models also show comparable performance on the task of LU expansion. We also show that combining substitutes of individual models can significantly improve the quality of final substitutes. Because intrinsic evaluation scores are highly dependent on the gold dataset and the frame preservation, and cannot be ensured by an automatic evaluation mechanism because of the incompleteness of gold datasets, we also carried out experiments with manual evaluation on sample datasets to further analyze the usefulness of our approach. The results show that the manual evaluation framework significantly outperforms automatic evaluation for lexical substitution. For extrinsic evaluation, the second part of this work assesses the utility of these lexical substitutes for the improvement of frame-semantic parsing. We took a small set of frame-annotated sentences and augmented them by replacing corresponding target words with their closest substitutes, obtained from best-performing models. 
Our extensive experiments on the original and augmented set of annotations with two semantic parsers show that our method is effective for improving the downstream parsing task by training set augmentation, as well as for quickly building FrameNet-like r","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"51 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135510936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new corpus of geolocated ASR transcripts from Germany
Pub Date: 2023-10-21 | DOI: 10.1007/s10579-023-09686-9
Steven Coats
This report describes the Corpus of German Speech (CoGS), a 56-million-word corpus of automatic speech recognition transcripts from YouTube channels of local government entities in Germany. Transcripts have been annotated with latitude and longitude coordinates, making the resource potentially useful for geospatial analyses of lexical, morpho-syntactic, and pragmatic variation; this is exemplified with an exploratory geospatial analysis of grammatical variation in the encoding of past temporal reference. Additional corpus metadata include video identifiers and timestamps on individual word tokens, making it possible to search for specific discourse content or utterance sequences in the corpus and download the underlying video and audio from the web, using open-source tools. The discourse content of the transcripts in CoGS touches upon a wide range of topics, making the resource potentially interesting as a data source for research in digital humanities and social science. The report also briefly discusses the permissibility of reuse of data sourced from German municipalities for corpus-building purposes in the context of EU, German, and American law, which clearly authorize such a use case.
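The search-then-retrieve workflow enabled by the token-level timestamps and video identifiers might look like the sketch below: filter word tokens by keyword and coordinates, then hand the matching videos to an open-source downloader such as yt-dlp. The record layout is an assumption made for illustration, not the corpus’s actual schema, and the video IDs are placeholders.

```python
# Hypothetical token-level records: video ID, timestamp in seconds,
# word token, and the channel's latitude/longitude.
segments = [
    {"video_id": "abc123", "start": 14.2, "word": "Haushalt", "lat": 48.14, "lon": 11.58},
    {"video_id": "xyz789", "start": 3.7,  "word": "Haushalt", "lat": 53.55, "lon": 9.99},
]

# Keep occurrences of the query term from northern Germany (lat > 52).
hits = [s for s in segments if s["word"] == "Haushalt" and s["lat"] > 52.0]

for s in hits:
    url = f"https://www.youtube.com/watch?v={s['video_id']}"
    # yt-dlp's -x flag extracts the audio track; the command is printed
    # rather than executed because the IDs above are placeholders.
    print(f"yt-dlp -x {url}  # token at t={s['start']}s")
```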
{"title":"A new corpus of geolocated ASR transcripts from Germany","authors":"Steven Coats","doi":"10.1007/s10579-023-09686-9","DOIUrl":"https://doi.org/10.1007/s10579-023-09686-9","url":null,"abstract":"Abstract This report describes the Corpus of German Speech (CoGS), a 56-million-word corpus of automatic speech recognition transcripts from YouTube channels of local government entities in Germany. Transcripts have been annotated with latitude and longitude coordinates, making the resource potentially useful for geospatial analyses of lexical, morpho-syntactic, and pragmatic variation; this is exemplified with an exploratory geospatial analysis of grammatical variation in the encoding of past temporal reference. Additional corpus metadata include video identifiers and timestamps on individual word tokens, making it possible to search for specific discourse content or utterance sequences in the corpus and download the underlying video and audio from the web, using open-source tools. The discourse content of the transcripts in CoGS touches upon a wide range of topics, making the resource potentially interesting as a data source for research in digital humanities and social science. The report also briefly discusses the permissibility of reuse of data sourced from German municipalities for corpus-building purposes in the context of EU, German, and American law, which clearly authorize such a use case.","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"114 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135511735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}