Pub Date : 2022-04-06DOI: 10.1017/s1351324922000134
Kyungtae Lim, Jungyeul Park
We propose a novel approach for sentence boundary detection in text datasets in which boundaries are not evident (e.g., sentence fragments). Although detecting sentence boundaries without punctuation marks has rarely been explored in written text, current real-world textual data suffer from widespread lack of proper start/stop signaling. Herein, we annotate a dataset with linguistic information, such as parts of speech and named entity labels, to boost the sentence boundary detection task. Via experiments, we obtained F1 scores up to 98.07% using the proposed multitask neural model, including a score of 89.41% for sentences completely lacking punctuation marks. We also present an ablation study and provide a detailed analysis to demonstrate the effectiveness of the proposed multitask learning method.
{"title":"Real-world sentence boundary detection using multitask learning: A case study on French","authors":"Kyungtae Lim, Jungyeul Park","doi":"10.1017/s1351324922000134","DOIUrl":"https://doi.org/10.1017/s1351324922000134","url":null,"abstract":"\u0000 We propose a novel approach for sentence boundary detection in text datasets in which boundaries are not evident (e.g., sentence fragments). Although detecting sentence boundaries without punctuation marks has rarely been explored in written text, current real-world textual data suffer from widespread lack of proper start/stop signaling. Herein, we annotate a dataset with linguistic information, such as parts of speech and named entity labels, to boost the sentence boundary detection task. Via experiments, we obtained F1 scores up to 98.07% using the proposed multitask neural model, including a score of 89.41% for sentences completely lacking punctuation marks. We also present an ablation study and provide a detailed analysis to demonstrate the effectiveness of the proposed multitask learning method.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2022-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43668233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-03-30DOI: 10.1017/s1351324922000122
Nurullah Sevim, Furkan Şahinuç, Aykut Koç
Word embeddings have become important building blocks that are used profoundly in natural language processing (NLP). Despite their several advantages, word embeddings can unintentionally accommodate some gender- and ethnicity-based biases that are present within the corpora they are trained on. Therefore, ethical concerns have been raised since word embeddings are extensively used in several high-level algorithms. Studying such biases and debiasing them have recently become an important research endeavor. Various studies have been conducted to measure the extent of bias that word embeddings capture and to eradicate them. Concurrently, as another subfield that has started to gain traction recently, the applications of NLP in the field of law have started to increase and develop rapidly. As law has a direct and utmost effect on people’s lives, the issues of bias for NLP applications in legal domain are certainly important. However, to the best of our knowledge, bias issues have not yet been studied in the context of legal corpora. In this article, we approach the gender bias problem from the scope of legal text processing domain. Word embedding models that are trained on corpora composed by legal documents and legislation from different countries have been utilized to measure and eliminate gender bias in legal documents. Several methods have been employed to reveal the degree of gender bias and observe its variations over countries. Moreover, a debiasing method has been used to neutralize unwanted bias. The preservation of semantic coherence of the debiased vector space has also been demonstrated by using high-level tasks. Finally, overall results and their implications have been discussed in the scope of NLP in legal domain.
{"title":"Gender bias in legal corpora and debiasing it","authors":"Nurullah Sevim, Furkan Şahinuç, Aykut Koç","doi":"10.1017/s1351324922000122","DOIUrl":"https://doi.org/10.1017/s1351324922000122","url":null,"abstract":"\u0000 Word embeddings have become important building blocks that are used profoundly in natural language processing (NLP). Despite their several advantages, word embeddings can unintentionally accommodate some gender- and ethnicity-based biases that are present within the corpora they are trained on. Therefore, ethical concerns have been raised since word embeddings are extensively used in several high-level algorithms. Studying such biases and debiasing them have recently become an important research endeavor. Various studies have been conducted to measure the extent of bias that word embeddings capture and to eradicate them. Concurrently, as another subfield that has started to gain traction recently, the applications of NLP in the field of law have started to increase and develop rapidly. As law has a direct and utmost effect on people’s lives, the issues of bias for NLP applications in legal domain are certainly important. However, to the best of our knowledge, bias issues have not yet been studied in the context of legal corpora. In this article, we approach the gender bias problem from the scope of legal text processing domain. Word embedding models that are trained on corpora composed by legal documents and legislation from different countries have been utilized to measure and eliminate gender bias in legal documents. Several methods have been employed to reveal the degree of gender bias and observe its variations over countries. Moreover, a debiasing method has been used to neutralize unwanted bias. The preservation of semantic coherence of the debiased vector space has also been demonstrated by using high-level tasks. Finally, overall results and their implications have been discussed in the scope of NLP in legal domain.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"1 1","pages":"449-482"},"PeriodicalIF":2.5,"publicationDate":"2022-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78705462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-03-24DOI: 10.1017/s1351324922000109
Chérifa Ben Khelil, C. Zribi, D. Duchier, Y. Parmentier
Arabic presents many challenges for automatic processing. Although several research studies have addressed some issues, electronic resources for processing Arabic remain relatively rare or not widely available. In this paper, we propose a Tree-adjoining grammar with a syntax-semantic interface. It is applied to the modern standard Arabic, but it can be easily adapted to other languages. This grammar named “ArabTAG V2.0” (Arabic Tree Adjoining Grammar) is semi-automatically generated by means of an abstract representation called meta-grammar. To ensure its development, ArabTAG V2.0 benefits from a grammar testing environment that uses a corpus of phenomena. Further experiments were performed to check the coverage of this grammar as well as the syntax-semantic analysis. The results showed that ArabTAG V2.0 can cover the majority of syntactical structures and different linguistic phenomena with a precision rate of 88.76%. Moreover, we were able to semantically analyze sentences and build their semantic representations with a precision rate of about 95.63%.
{"title":"Generating Arabic TAG for syntax-semantics analysis","authors":"Chérifa Ben Khelil, C. Zribi, D. Duchier, Y. Parmentier","doi":"10.1017/s1351324922000109","DOIUrl":"https://doi.org/10.1017/s1351324922000109","url":null,"abstract":"\u0000 Arabic presents many challenges for automatic processing. Although several research studies have addressed some issues, electronic resources for processing Arabic remain relatively rare or not widely available. In this paper, we propose a Tree-adjoining grammar with a syntax-semantic interface. It is applied to the modern standard Arabic, but it can be easily adapted to other languages. This grammar named “ArabTAG V2.0” (Arabic Tree Adjoining Grammar) is semi-automatically generated by means of an abstract representation called meta-grammar. To ensure its development, ArabTAG V2.0 benefits from a grammar testing environment that uses a corpus of phenomena. Further experiments were performed to check the coverage of this grammar as well as the syntax-semantic analysis. The results showed that ArabTAG V2.0 can cover the majority of syntactical structures and different linguistic phenomena with a precision rate of 88.76%. Moreover, we were able to semantically analyze sentences and build their semantic representations with a precision rate of about 95.63%.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"255 1","pages":"386-424"},"PeriodicalIF":2.5,"publicationDate":"2022-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82292033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-03-18DOI: 10.1017/s1351324922000110
Ahmed Hamdi, Elvys Linhares Pontes, Nicolas Sidère, Mickaël Coustaty, A. Doucet
Named entities (NEs) are among the most relevant type of information that can be used to properly index digital documents and thus easily retrieve them. It has long been observed that NEs are key to accessing the contents of digital library portals as they are contained in most user queries. However, most digitized documents are indexed through their optical character recognition (OCRed) version which include numerous errors. Although OCR engines have considerably improved over the last few years, OCR errors still considerably impact document access. Previous works were conducted to evaluate the impact of OCR errors on named entity recognition (NER) and named entity linking (NEL) techniques separately. In this article, we experimented with a variety of OCRed documents with different levels and types of OCR noise to assess in depth the impact of OCR on named entity processing. We provide a deep analysis of OCR errors that impact the performance of NER and NEL. We then present the resulting exhaustive study and subsequent recommendations on the adequate documents, the OCR quality levels, and the post-OCR correction strategies required to perform reliable NER and NEL.
{"title":"In-depth analysis of the impact of OCR errors on named entity recognition and linking","authors":"Ahmed Hamdi, Elvys Linhares Pontes, Nicolas Sidère, Mickaël Coustaty, A. Doucet","doi":"10.1017/s1351324922000110","DOIUrl":"https://doi.org/10.1017/s1351324922000110","url":null,"abstract":"\u0000 Named entities (NEs) are among the most relevant type of information that can be used to properly index digital documents and thus easily retrieve them. It has long been observed that NEs are key to accessing the contents of digital library portals as they are contained in most user queries. However, most digitized documents are indexed through their optical character recognition (OCRed) version which include numerous errors. Although OCR engines have considerably improved over the last few years, OCR errors still considerably impact document access. Previous works were conducted to evaluate the impact of OCR errors on named entity recognition (NER) and named entity linking (NEL) techniques separately. In this article, we experimented with a variety of OCRed documents with different levels and types of OCR noise to assess in depth the impact of OCR on named entity processing. We provide a deep analysis of OCR errors that impact the performance of NER and NEL. We then present the resulting exhaustive study and subsequent recommendations on the adequate documents, the OCR quality levels, and the post-OCR correction strategies required to perform reliable NER and NEL.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"84 1","pages":"425-448"},"PeriodicalIF":2.5,"publicationDate":"2022-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83850798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-02-25DOI: 10.1017/s1351324922000055
J. Vivaldi, H. Rodríguez
Abstract This paper presents MHeTRep, a multilingual medical terminology and the methodology followed for its compilation. The multilingual terminology is organised into one vocabulary for each language. All the terms in the collection are semantically tagged with a tagset corresponding to the top categories of Snomed-CT ontology. When possible, the individual terms are linked to their equivalent in the other languages. Even though many NLP resources and tools claim to be domain independent, their application to specific tasks can be restricted to specific domains, otherwise their performance degrades notably. As the accuracy of NLP resources drops heavily when applied in environments different from which they were built, a tuning to the new environment is needed. Usually, having a domain terminology facilitates and accelerates the adaptation of general domain NLP applications to a new domain. This is particularly important in medicine, a domain living moments of great expansion. The proposed method takes Snomed-CT as starting point. From this point and using 13 multilingual resources, covering the most relevant medical concepts such as drugs, anatomy, clinical findings and procedures, we built a large resource covering seven languages totalling more than two million semantically tagged terms. The resulting collection has been intensively evaluated in several ways for the involved languages and domain categories. Our hypothesis is that MHeTRep can be used advantageously over the original resources for a number of NLP use cases and likely extended to other languages.
{"title":"MHeTRep: A multilingual semantically tagged health terms repository","authors":"J. Vivaldi, H. Rodríguez","doi":"10.1017/s1351324922000055","DOIUrl":"https://doi.org/10.1017/s1351324922000055","url":null,"abstract":"Abstract This paper presents MHeTRep, a multilingual medical terminology and the methodology followed for its compilation. The multilingual terminology is organised into one vocabulary for each language. All the terms in the collection are semantically tagged with a tagset corresponding to the top categories of Snomed-CT ontology. When possible, the individual terms are linked to their equivalent in the other languages. Even though many NLP resources and tools claim to be domain independent, their application to specific tasks can be restricted to specific domains, otherwise their performance degrades notably. As the accuracy of NLP resources drops heavily when applied in environments different from which they were built, a tuning to the new environment is needed. Usually, having a domain terminology facilitates and accelerates the adaptation of general domain NLP applications to a new domain. This is particularly important in medicine, a domain living moments of great expansion. The proposed method takes Snomed-CT as starting point. From this point and using 13 multilingual resources, covering the most relevant medical concepts such as drugs, anatomy, clinical findings and procedures, we built a large resource covering seven languages totalling more than two million semantically tagged terms. The resulting collection has been intensively evaluated in several ways for the involved languages and domain categories. Our hypothesis is that MHeTRep can be used advantageously over the original resources for a number of NLP use cases and likely extended to other languages.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1364 - 1401"},"PeriodicalIF":2.5,"publicationDate":"2022-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44880191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In training deep learning networks, the optimizer and related learning rate are often used without much thought or with minimal tuning, even though it is crucial in ensuring a fast convergence to a good quality minimum of the loss function that can also generalize well on the test dataset. Drawing inspiration from the successful application of cyclical learning rate policy to computer vision tasks, we explore how cyclical learning rate can be applied to train transformer-based neural networks for neural machine translation. From our carefully designed experiments, we show that the choice of optimizers and the associated cyclical learning rate policy can have a significant impact on the performance. In addition, we establish guidelines when applying cyclical learning rates to neural machine translation tasks.
{"title":"An empirical study of cyclical learning rate on neural machine translation","authors":"Weixuan Wang, Choon Meng Lee, Jianfeng Liu, Talha Çolakoğlu, Wei Peng","doi":"10.1017/s135132492200002x","DOIUrl":"https://doi.org/10.1017/s135132492200002x","url":null,"abstract":"\u0000 In training deep learning networks, the optimizer and related learning rate are often used without much thought or with minimal tuning, even though it is crucial in ensuring a fast convergence to a good quality minimum of the loss function that can also generalize well on the test dataset. Drawing inspiration from the successful application of cyclical learning rate policy to computer vision tasks, we explore how cyclical learning rate can be applied to train transformer-based neural networks for neural machine translation. From our carefully designed experiments, we show that the choice of optimizers and the associated cyclical learning rate policy can have a significant impact on the performance. In addition, we establish guidelines when applying cyclical learning rates to neural machine translation tasks.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"88 1","pages":"316-336"},"PeriodicalIF":2.5,"publicationDate":"2022-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80588155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-02-08DOI: 10.1017/S1351324922000043
Kenneth Ward Church, Valia Kordoni
Abstract Many papers are chasing state-of-the-art (SOTA) numbers, and more will do so in the future. SOTA-chasing comes with many costs. SOTA-chasing squeezes out more promising opportunities such as coopetition and interdisciplinary collaboration. In addition, there is a risk that too much SOTA-chasing could lead to claims of superhuman performance, unrealistic expectations, and the next AI winter. Two root causes for SOTA-chasing will be discussed: (1) lack of leadership and (2) iffy reviewing processes. SOTA-chasing may be similar to the replication crisis in the scientific literature. The replication crisis is yet another example, like evaluation, of over-confidence in accepted practices and the scientific method, even when such practices lead to absurd consequences.
{"title":"Emerging Trends: SOTA-Chasing","authors":"Kenneth Ward Church, Valia Kordoni","doi":"10.1017/S1351324922000043","DOIUrl":"https://doi.org/10.1017/S1351324922000043","url":null,"abstract":"Abstract Many papers are chasing state-of-the-art (SOTA) numbers, and more will do so in the future. SOTA-chasing comes with many costs. SOTA-chasing squeezes out more promising opportunities such as coopetition and interdisciplinary collaboration. In addition, there is a risk that too much SOTA-chasing could lead to claims of superhuman performance, unrealistic expectations, and the next AI winter. Two root causes for SOTA-chasing will be discussed: (1) lack of leadership and (2) iffy reviewing processes. SOTA-chasing may be similar to the replication crisis in the scientific literature. The replication crisis is yet another example, like evaluation, of over-confidence in accepted practices and the scientific method, even when such practices lead to absurd consequences.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"28 1","pages":"249 - 269"},"PeriodicalIF":2.5,"publicationDate":"2022-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49477443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-02-08DOI: 10.1017/s1351324922000067
R. Mitkov, B. Boguraev
{"title":"NLE volume 28 issue 2 Cover and Front matter","authors":"R. Mitkov, B. Boguraev","doi":"10.1017/s1351324922000067","DOIUrl":"https://doi.org/10.1017/s1351324922000067","url":null,"abstract":"","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":"f1 - f2"},"PeriodicalIF":2.5,"publicationDate":"2022-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47897633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-02-08DOI: 10.1017/s1351324922000079
{"title":"NLE volume 28 issue 2 Cover and Back matter","authors":"","doi":"10.1017/s1351324922000079","DOIUrl":"https://doi.org/10.1017/s1351324922000079","url":null,"abstract":"","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":"b1 - b2"},"PeriodicalIF":2.5,"publicationDate":"2022-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44865303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-02-04DOI: 10.1017/s1351324921000486
Can Udomcharoenchaikit, P. Boonkwan, P. Vateekul
Many fundamentaltasks in natural language processing (NLP) such as part-of-speech tagging, text chunking, and named-entity recognition can be formulated as sequence labeling problems. Although neural sequence labeling models have shown excellent results on standard test sets, they are very brittle when presented with misspelled texts. In this paper, we introduce an adversarial training framework that enhances the robustness against typographical adversarial examples. We evaluate the robustness of sequence labeling models with an adversarial evaluation scheme that includes typographical adversarial examples. We generate two types of adversarial examples without access (black-box) or with full access (white-box) to the target model’s parameters. We conducted a series of extensive experiments on three languages (English, Thai, and German) across three sequence labeling tasks. Experiments show that the proposed adversarial training framework provides better resistance against adversarial examples on all tasks. We found that we can further improve the model’s robustness on the chunking task by including a triplet loss constraint.
{"title":"Towards improving the robustness of sequential labeling models against typographical adversarial examples using triplet loss","authors":"Can Udomcharoenchaikit, P. Boonkwan, P. Vateekul","doi":"10.1017/s1351324921000486","DOIUrl":"https://doi.org/10.1017/s1351324921000486","url":null,"abstract":"\u0000 Many fundamentaltasks in natural language processing (NLP) such as part-of-speech tagging, text chunking, and named-entity recognition can be formulated as sequence labeling problems. Although neural sequence labeling models have shown excellent results on standard test sets, they are very brittle when presented with misspelled texts. In this paper, we introduce an adversarial training framework that enhances the robustness against typographical adversarial examples. We evaluate the robustness of sequence labeling models with an adversarial evaluation scheme that includes typographical adversarial examples. We generate two types of adversarial examples without access (black-box) or with full access (white-box) to the target model’s parameters. We conducted a series of extensive experiments on three languages (English, Thai, and German) across three sequence labeling tasks. Experiments show that the proposed adversarial training framework provides better resistance against adversarial examples on all tasks. We found that we can further improve the model’s robustness on the chunking task by including a triplet loss constraint.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"37 1","pages":"287-315"},"PeriodicalIF":2.5,"publicationDate":"2022-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91318432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}