Pub Date: 2023-05-01 | Epub Date: 2021-10-29 | DOI: 10.1017/s1351324921000292
Charles Chen, Razvan Bunescu, Cindy Marling
We propose a new setting for question answering in which users can query the system using both natural language and direct interactions within a graphical user interface that displays multiple time series associated with an entity of interest. The user interacts with the interface in order to understand the entity's state and behavior, entailing sequences of actions and questions whose answers may depend on previous factual or navigational interactions. We describe a pipeline implementation where spoken questions are first transcribed into text which is then semantically parsed into logical forms that can be used to automatically extract the answer from the underlying database. The speech recognition module is implemented by adapting a pre-trained LSTM-based architecture to the user's speech, whereas for the semantic parsing component we introduce an LSTM-based encoder-decoder architecture that models context dependency through copying mechanisms and multiple levels of attention over inputs and previous outputs. When evaluated separately, with and without data augmentation, both models are shown to substantially outperform several strong baselines. Furthermore, the full pipeline evaluation shows only a small degradation in semantic parsing accuracy, demonstrating that the semantic parser is robust to mistakes in the speech recognition output. The new question answering paradigm proposed in this paper has the potential to improve the presentation and navigation of the large amounts of sensor data and life events that are generated in many areas of medicine.
"A Semantic Parsing Pipeline for Context-Dependent Question Answering over Temporally Structured Data." Natural Language Engineering 29(2): 769-793. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10348695/pdf/
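The pipeline the abstract describes (spoken question, transcription, logical form, database lookup) can be sketched in a toy rule-based form. Everything below is illustrative: the single regex "parser", the logical-form schema, and the dict-backed database are stand-ins for the paper's LSTM-based modules, not the authors' implementation.

```python
import re

# Toy "database": one time series of blood glucose readings keyed by timestamp.
DATABASE = {
    "glucose": {"08:00": 110, "12:00": 145, "18:00": 130},
}

def transcribe(audio):
    # Stand-in for the speech recognition module: here the "audio" is already text.
    return audio.lower()

def semantic_parse(question):
    # Stand-in for the encoder-decoder semantic parser: a single regex rule
    # mapping a question pattern to a logical form (aggregation op + series name).
    m = re.match(r"what was the (max|min) (\w+)", question)
    if m:
        return {"op": m.group(1), "series": m.group(2)}
    return None

def execute(logical_form):
    # Evaluate the logical form against the underlying database.
    values = DATABASE[logical_form["series"]].values()
    return max(values) if logical_form["op"] == "max" else min(values)

answer = execute(semantic_parse(transcribe("What was the max glucose today?")))
```

The point of the decomposition is that each stage can be evaluated separately, which is how the paper measures robustness of the parser to transcription errors.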
Pub Date: 2023-05-01 | DOI: 10.1017/s1351324923000141
Kenneth Ward Church, Raman Chandrasekar
Abstract Our last emerging trends article introduced Risks 1.0 (fairness and bias) and Risks 2.0 (addictive, dangerous, deadly, and insanely profitable). This article introduces Risks 3.0 (spyware and cyber weapons). Risks 3.0 are less profitable, but more destructive. We summarize two recent books, Pegasus: How a Spy in Your Pocket Threatens the End of Privacy, Dignity, and Democracy and This Is How They Tell Me the World Ends: The Cyberweapons Arms Race. The first book starts with a leak of 50,000 phone numbers targeted by spyware named Pegasus. Pegasus uses a zero-click exploit to obtain root access to your phone, taking control of the microphone, camera, GPS, text messages, etc. The list of 50,000 numbers includes journalists, politicians, and academics, as well as their friends and family. Some of these people have been murdered. The second book describes the history of cyber weapons such as Stuxnet, which is described as crossing the Rubicon. In the short term, it set back Iran's nuclear program for less than the cost of conventional weapons, but it did not take long for Iran to build the fourth-biggest cyber army in the world. As spyware continues to proliferate, we envision a future dystopia where everyone spies on everyone. Nothing will be safe from hacking: not your identity, or your secrets, or your passwords, or your bank accounts. When the endpoints (phones) have been compromised, technologies such as end-to-end encryption and multi-factor authentication offer a false sense of security; encryption and authentication are as pointless as closing the proverbial barn door after the fact. To address Risks 3.0, journalists are using the tools of their trade to raise awareness in the court of public opinion. We should do what we can to support them. This paper is a small step in that direction.
"Emerging trends: Risks 3.0 and proliferation of spyware to 50,000 cell phones." Natural Language Engineering 29(1): 824-841.
Pub Date: 2023-04-28 | DOI: 10.1017/S1351324923000116
M. Tikhonova, V. Mikhailov, D. Pisarevskaya, Valentin Malykh, Tatiana Shavrina
"Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task - CORRIGENDUM." Natural Language Engineering 29(1): 1198.
Pub Date: 2023-04-11 | DOI: 10.1017/s1351324923000104
Soumitra Ghosh, Amit Priyankar, Asif Ekbal, P. Bhattacharyya
Moderators often face a double challenge in reducing offensive and harmful content on social media. Such content must be kept from circulating freely, yet strict censorship cannot be implemented, owing to a tricky dilemma: free speech on the Internet must be preserved while harmful posts are limited, without overreacting. Existing systems do not exploit the correlation between hate-offensive content and aggressive posts; instead, they address the tasks individually. Cost-effective, sophisticated multi-task systems that can effectively detect aggressive and offensive content on social media are therefore highly desirable. This work presents a novel multifaceted transformer-based framework to identify aggressive and hate posts on social media. Through an end-to-end transformer-based multi-task network, our proposed approach addresses the following tasks: (a) aggression identification, (b) misogynistic aggression identification, (c) identifying hate-offensive and non-hate-offensive content, (d) identifying hate, profane, and offensive posts, and (e) identifying the type of offense. We further investigate the role of emotion in improving the system's overall performance by learning the task of emotion detection jointly with the other tasks. We evaluate our approach on two popular benchmark datasets of aggression and hate speech, covering four languages, and compare the system performance with various state-of-the-art methods. Results indicate that our multi-task system performs well on all the tasks across multiple languages, outperforming several benchmark methods. Moreover, the secondary task of emotion detection substantially improves the system performance for all the tasks, indicating a strong correlation among the tasks of aggression, hate, and emotion, thus opening avenues for future research.
"A transformer-based multi-task framework for joint detection of aggression and hate on social media data." Natural Language Engineering.
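The joint setup the abstract describes — one shared encoder feeding several task-specific classification heads whose losses are combined — can be sketched with NumPy. The dimensions, class counts, and unweighted loss sum below are illustrative assumptions, not the paper's configuration, and the random "encoder output" stands in for a transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared "encoder" output for a batch of 4 posts (stand-in for a transformer).
hidden = rng.normal(size=(4, 16))

# One linear head per task: aggression (3 classes), hate (2), emotion (6).
tasks = {"aggression": 3, "hate": 2, "emotion": 6}
heads = {t: rng.normal(size=(16, k)) for t, k in tasks.items()}
labels = {"aggression": np.array([0, 2, 1, 0]),
          "hate": np.array([1, 0, 0, 1]),
          "emotion": np.array([3, 5, 0, 2])}

# Joint objective: sum of per-task cross-entropy losses over the shared encoding,
# so gradients from every task (including the auxiliary emotion task) update it.
total_loss = 0.0
for t, W in heads.items():
    probs = softmax(hidden @ W)
    total_loss += -np.log(probs[np.arange(4), labels[t]]).mean()
```

Training the emotion head jointly is what lets the auxiliary task shape the shared representation used by the aggression and hate heads.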
Abstract With the aid of recently proposed word embedding algorithms, the study of semantic relatedness has progressed rapidly. However, word-level representations remain insufficient for many natural language processing tasks, and various sense-level embedding learning algorithms have been proposed to address this issue. In this paper, we present a generalized model derived from existing sense retrofitting models. In this generalization, we take into account semantic relations between the senses, relation strength, and semantic strength. Experimental results show that the generalized model outperforms previous approaches on four tasks: semantic relatedness, contextual word similarity, semantic difference, and synonym selection. Based on the generalized sense retrofitting model, we also propose a standardization process on the dimensions with four settings, a neighbor expansion process from the nearest neighbors, and combinations of these two approaches. Finally, we propose a Procrustes analysis approach, inspired by bilingual mapping models, for learning representations of senses outside the ontology. The experimental results show the advantages of these approaches on semantic relatedness tasks.
"On generalization of the sense retrofitting model," by Yang-Yin Lee, Ting-Yu Yen, Hen-Hsen Huang, Yow-Ting Shiue, Hsin-Hsi Chen. Natural Language Engineering 29(1): 1097-1125. Published 2023-03-31. DOI: 10.1017/S1351324922000523.
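The basic update that sense retrofitting models generalize — pulling each sense vector toward its original embedding and toward its ontology neighbors, with strengths on each term — can be sketched as follows. The tiny graph, the alpha/beta strengths, and the iteration count are illustrative assumptions, not this paper's generalized model.

```python
import numpy as np

# Original embeddings for three senses; "bank_river" and "shore" are
# ontology neighbors, "bank_money" is isolated.
original = {"bank_river": np.array([1.0, 0.0]),
            "shore":      np.array([0.0, 1.0]),
            "bank_money": np.array([1.0, 1.0])}
neighbors = {"bank_river": ["shore"], "shore": ["bank_river"], "bank_money": []}
alpha, beta = 1.0, 1.0  # strength of the original vector vs. each relation edge

vectors = {w: v.copy() for w, v in original.items()}
for _ in range(10):  # Jacobi-style iterations toward the fixed point
    new = {}
    for w, nbrs in neighbors.items():
        if nbrs:
            num = alpha * original[w] + beta * sum(vectors[n] for n in nbrs)
            new[w] = num / (alpha + beta * len(nbrs))
        else:
            new[w] = original[w]  # no neighbors: stay at the original embedding
    vectors = new

# Neighboring senses end up closer to each other than they were originally.
def dist(a, b):
    return float(np.linalg.norm(a - b))

closer = dist(vectors["bank_river"], vectors["shore"]) < dist(original["bank_river"], original["shore"])
```

The generalized model in the paper varies exactly these knobs: which relations connect senses, how strong each relation is, and how strongly the original embedding anchors the sense.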
Pub Date: 2023-03-29 | DOI: 10.1017/s1351324923000098
Nina Seemann, Yeong Su Lee, Julian Höllig, Michaela Geierhos
With the increase in user-generated content on social media, the detection of abusive language has become crucial and is therefore reflected in several shared tasks performed in recent years. The development of automatic detection systems is desirable, and the classification of abusive social media content can be addressed with the help of machine learning. The basis for successful development of machine learning models is the availability of consistently labeled training data, but the diversity of terms and definitions of abusive language is a crucial barrier. In this work, we analyze a total of nine datasets—five English and four German—designed for detecting abusive online content. We provide a detailed description of the datasets: for which tasks each dataset was created, how the data were collected, and which annotation guidelines were used. Our analysis shows that there is no standard definition of abusive language, which often leads to inconsistent annotations. As a consequence, it is difficult to draw cross-domain conclusions, share datasets, or use models for other abusive social media language tasks. Furthermore, our manual inspection of a random sample of each dataset revealed controversial examples. We highlight challenges in data annotation by discussing those examples, and present common problems in the annotation process, such as contradictory annotations and missing context information. Finally, to complement our theoretical work, we conduct generalization experiments on three German datasets.
"The problem of varying annotations to identify abusive language in social media content." Natural Language Engineering.
Pub Date: 2023-03-29 | DOI: 10.1017/s1351324923000074
Fengyang Shi, Guohua Feng
Book review: Syntactic n-grams in Computational Linguistics, by Grigori Sidorov. Cham, Springer Nature, 2019. ISBN 9783030147716. IX + 92 pages. Natural Language Engineering.
Pub Date: 2023-03-16 | DOI: 10.1017/s1351324923000086
Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón, Otto Tarkka
In this paper, we study natural language paraphrasing from both corpus creation and modeling points of view. We focus in particular on a methodology that allows the extraction of challenging examples of paraphrase pairs in their natural textual context, leading to a dataset potentially more suitable for evaluating the models' ability to represent meaning, especially in document context, than those gathered using various sentence-level heuristics. To this end, we introduce the Turku Paraphrase Corpus, the first large-scale, fully manually annotated corpus of paraphrases in Finnish. The corpus contains 104,645 manually labeled paraphrase pairs, of which 98% are verified to be true paraphrases, either universally or within their present context. In order to control the diversity of the paraphrase pairs and avoid certain biases easily introduced in automatic candidate extraction, the paraphrases are manually collected from different paraphrase-rich text sources. This allows us to create a challenging dataset including longer and more lexically diverse paraphrases than can be expected from those collected through heuristics. In addition to quality, this also allows us to keep the original document context for each pair, making it possible to study paraphrasing in context. To our knowledge, this is the first paraphrase corpus which provides the original document context for the annotated pairs. We also study several paraphrase models trained and evaluated on the new data. Our initial paraphrase classification experiments indicate the challenging nature of the dataset when classifying using the detailed labeling scheme used in the corpus annotation, with accuracy substantially lagging behind human performance. However, when evaluating the models on a large-scale paraphrase retrieval task over almost 400M candidate sentences, the results are highly encouraging, with 29-53% of the pairs ranked in the top 10, depending on the paraphrase type.
The Turku Paraphrase Corpus is available at github.com/TurkuNLP/Turku-paraphrase-corpus as well as through the HuggingFace datasets library, under the CC-BY-SA license.
"Towards diverse and contextually anchored paraphrase modeling: A dataset and baselines for Finnish." Natural Language Engineering.
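The retrieval evaluation described above — for each query sentence, rank a large candidate pool by vector similarity and check whether the true paraphrase lands in the top 10 — can be sketched as follows. The random embeddings and the planted gold pairs are stand-ins for whatever sentence encoder and candidate pool are actually used; this is not the authors' evaluation code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy sentence embeddings: 5 queries, 1000 candidates; candidate i is the
# gold paraphrase of query i, planted near the query vector with small noise.
dim, n_candidates = 32, 1000
queries = rng.normal(size=(5, dim))
candidates = rng.normal(size=(n_candidates, dim))
for i in range(5):
    candidates[i] = queries[i] + 0.1 * rng.normal(size=dim)  # gold pair

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Cosine similarity of every query against every candidate, then a top-10 check:
# the fraction of queries whose gold paraphrase ranks among the 10 nearest.
sims = normalize(queries) @ normalize(candidates).T
top10 = np.argsort(-sims, axis=1)[:, :10]
hit_at_10 = np.mean([i in top10[i] for i in range(5)])
```

The reported 29-53% top-10 figures correspond to `hit_at_10` computed per paraphrase type, over a vastly larger candidate pool.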
Pub Date: 2023-02-28 | DOI: 10.1017/S1351324923000062
Henrique Lopes Cardoso, R. Sousa-Silva, Paula Carvalho, Bruno Martins
Abstract The study of argumentation is transversal to several research domains, from philosophy to linguistics, from the law to computer science and artificial intelligence. In discourse analysis, several distinct models have been proposed to harness argumentation, each with a different focus or aim. To analyze the use of argumentation in natural language, several corpus annotation efforts have been carried out, grounded more or less explicitly in one of these theoretical argumentation models. In fact, given the recent growing interest in argument mining applications, argument-annotated corpora are crucial for training machine learning models in a supervised way. However, the proliferation of such corpora has led to a wide disparity in the granularity of the argument annotations employed. In this paper, we review the most relevant theoretical argumentation models, after which we survey argument annotation projects closely following those theoretical models. We also highlight the main simplifications that are often introduced in practice. Furthermore, we briefly review other annotation efforts that are not so theoretically grounded but instead follow a shallower approach. It turns out that most argument annotation projects make their own assumptions and simplifications, both in terms of the textual genre they focus on and in terms of adapting the adopted theoretical argumentation model for their own agenda. Issues of compatibility among argument-annotated corpora are discussed by looking at the problem from a syntactic, semantic, and practical perspective.
{"title":"Argumentation models and their use in corpus annotation: Practice, prospects, and challenges","authors":"Henrique Lopes Cardoso, R. Sousa-Silva, Paula Carvalho, Bruno Martins","doi":"10.1017/S1351324923000062","journal":"Natural Language Engineering","volume":"29 1","pages":"1150 - 1187","publicationDate":"2023-02-28","publicationTypes":"Journal Article"}
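The disparity in annotation granularity that this survey discusses can be made concrete with a minimal data structure for an argument-annotated text: component spans carrying a label, plus directed relations between them. This is an illustrative sketch only, not the schema of any specific corpus; all class and label names here are hypothetical, and real projects distinguish many more component types (major claims, backings, rebuttals, etc.) at varying granularities.

```python
from dataclasses import dataclass, field

@dataclass
class ArgumentUnit:
    # A character span annotated as an argumentative component.
    # Corpora differ in granularity: some mark only "claim" vs "premise",
    # others use richer inventories drawn from a theoretical model.
    start: int
    end: int
    label: str

@dataclass
class ArgumentRelation:
    # A directed link between two units (indices into the units list),
    # e.g. a premise supporting or attacking a claim.
    source: int
    target: int
    kind: str

@dataclass
class AnnotatedText:
    text: str
    units: list = field(default_factory=list)
    relations: list = field(default_factory=list)

# A toy two-unit annotation: a premise supporting a claim.
doc = AnnotatedText(
    text="We should ban cars downtown. Air quality has worsened.",
    units=[
        ArgumentUnit(0, 28, "claim"),
        ArgumentUnit(29, 54, "premise"),
    ],
    relations=[ArgumentRelation(source=1, target=0, kind="support")],
)
```

Compatibility questions between corpora then become questions about mapping one label inventory and relation set onto another, which is rarely lossless.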
Pub Date : 2023-02-21DOI: 10.1017/S1351324923000050
Huizhe Su, Hao Wang, Xiangfeng Luo, Shaorong Xie
Abstract In recent years, the extraction of overlapping relations has received great attention in the field of natural language processing (NLP). However, most existing approaches treat the relational triples in a sentence as isolated, without considering the rich semantic correlations implied in the relational hierarchy. Extracting these overlapping relational triples is challenging, given that the overlap types are varied and relatively complex. In addition, these approaches do not highlight the semantic information in the sentence from coarse-grained to fine-grained levels. In this paper, we propose an end-to-end neural framework based on a decomposition model that incorporates multi-granularity relational features for the extraction of overlapping triples. Our approach employs an attention mechanism that combines relational hierarchy information at multiple granularities with pretrained textual representations, where the relational hierarchies are constructed manually or obtained by unsupervised clustering. We found that the different hierarchy construction strategies have little effect on the final extraction results. Experimental results on two public datasets, NYT and WebNLG, show that our model substantially outperforms baseline systems in extracting overlapping relational triples, especially for long-tailed relations.
{"title":"An end-to-end neural framework using coarse-to-fine-grained attention for overlapping relational triple extraction","authors":"Huizhe Su, Hao Wang, Xiangfeng Luo, Shaorong Xie","doi":"10.1017/S1351324923000050","journal":"Natural Language Engineering","volume":"29 1","pages":"1126 - 1149","publicationDate":"2023-02-21","publicationTypes":"Journal Article"}
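The overlap phenomenon and the decomposition idea behind such models can be illustrated with a toy example: two triples share a subject entity, so a sentence-level extractor that treats triples as isolated would miss one of them. A decomposition-style approach instead detects subjects first and then tags (relation, object) pairs per subject. This is a hand-written illustration of the decomposition pattern only, not the authors' neural model; the detector and tagger functions below are hypothetical rule-based stand-ins for learned components.

```python
# One sentence yielding two triples that overlap in the subject "Obama"
# (the SingleEntityOverlap case in the relation extraction literature).
sentence = "Obama was born in Honolulu and graduated from Columbia University."

def extract_triples(sentence, subject_detector, relation_object_tagger):
    # Stage 1: detect candidate subject entities.
    # Stage 2: for each subject, tag its (relation, object) pairs.
    triples = []
    for subj in subject_detector(sentence):
        for rel, obj in relation_object_tagger(sentence, subj):
            triples.append((subj, rel, obj))
    return triples

# Hypothetical rule-based stand-ins for the learned components:
def subject_detector(sent):
    return ["Obama"] if "Obama" in sent else []

def relation_object_tagger(sent, subj):
    pairs = []
    if "born in" in sent:
        pairs.append(("born_in", "Honolulu"))
    if "graduated from" in sent:
        pairs.append(("graduated_from", "Columbia University"))
    return pairs

triples = extract_triples(sentence, subject_detector, relation_object_tagger)
# Both extracted triples share the subject entity, i.e. they overlap.
```

In the neural setting, both stages are taggers over contextual token representations, and the paper's contribution lies in enriching those representations with coarse-to-fine relational hierarchy features via attention.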