Cluster-based ensemble learning model for improving sentiment classification of Arabic documents
Pub Date: 2023-06-01 | DOI: 10.1017/s135132492300027x
Rana Husni Al Mahmoud, B. Hammo, Hossam Faris
This article reports on designing and implementing a multiclass sentiment classification approach to handle the imbalanced class distribution of Arabic documents. The proposed approach, sentiment classification of Arabic documents (SCArD), combines the advantages of a clustering-based undersampling (CBUS) method and an ensemble learning model to aid machine learning (ML) classifiers in building accurate models on highly imbalanced datasets. The CBUS method applies two standard clustering algorithms, K-means and expectation–maximization, to balance the ratio between the majority and minority classes by decreasing the number of majority-class instances while maintaining the number of minority-class instances at the cluster level. The merit of the proposed approach is that it neither removes majority-class instances from the dataset nor injects artificial minority-class instances into it. The resulting balanced datasets are used to train two ML classifiers, random forest and updateable Naïve Bayes, to develop prediction models. The best prediction models are selected based on their F1 scores. We applied two techniques to test SCArD and generate new predictions from the imbalanced test dataset. The first technique uses the best prediction models directly. The second uses a majority-voting ensemble, which combines the best prediction models to generate the final predictions. The experimental results showed that SCArD is promising and outperformed the comparative classification models in terms of F1 score.
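To make the CBUS idea concrete, here is a minimal sketch of cluster-level undersampling with K-means. The cluster count, the per-cluster allocation, and the function name are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_undersample(X_major, n_minor, n_clusters=10, seed=0):
    """Shrink the majority class at the cluster level until its size
    roughly matches the minority-class size (n_minor)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_major)
    rng = np.random.default_rng(seed)
    per_cluster = max(1, n_minor // n_clusters)  # assumed even allocation
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        take = min(len(members), per_cluster)
        if take:
            keep.extend(rng.choice(members, size=take, replace=False))
    return X_major[np.sort(np.array(keep, dtype=int))]
```

Sampling within clusters, rather than from the whole majority class, preserves the variety of majority-class regions while reducing their count.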
{"title":"Cluster-based ensemble learning model for improving sentiment classification of Arabic documents","authors":"Rana Husni Al Mahmoud, B. Hammo, Hossam Faris","doi":"10.1017/s135132492300027x","DOIUrl":"https://doi.org/10.1017/s135132492300027x","url":null,"abstract":"\u0000 This article reports on designing and implementing a multiclass sentiment classification approach to handle the imbalanced class distribution of Arabic documents. The proposed approach, sentiment classification of Arabic documents (SCArD), combines the advantages of a clustering-based undersampling (CBUS) method and an ensemble learning model to aid machine learning (ML) classifiers in building accurate models against highly imbalanced datasets. The CBUS method applies two standard clustering algorithms: K-means and expectation–maximization, to balance the ratio between the major and the minor classes by decreasing the number of the major class instances and maintaining the number of the minor class instances at the cluster level. The merits of the proposed approach are that it does not remove the majority class instances from the dataset nor injects the dataset with artificial minority class instances. The resulting balanced datasets are used to train two ML classifiers, random forest and updateable Naïve Bayes, to develop prediction data models. The best prediction data models are selected based on F1-score rates. We applied two techniques to test SCArD and generate new predictions from the imbalanced test dataset. The first technique uses the best prediction data models. The second technique uses the majority voting ensemble learning model, which combines the best prediction data models to generate the final predictions. The experimental results showed that SCArD is promising and outperformed the other comparative classification models based on the F1-score rates.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46482512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving semantic coverage of data-to-text generation model using dynamic memory networks
Pub Date: 2023-05-31 | DOI: 10.1017/s1351324923000207
Elham Seifossadat, H. Sameti
This paper proposes a sequence-to-sequence model for data-to-text generation, called DM-NLG, which generates natural language text from structured nonlinguistic input. Specifically, by adding a dynamic memory module to an attention-based sequence-to-sequence model, DM-NLG can store the information that led to the generation of previous output words and use it to generate the next word. In this way, the decoder is aware of all previous decisions, which prevents the generation of duplicate words or incomplete semantic concepts. To improve the quality of the sentences generated by the DM-NLG decoder, a postprocessing step using pretrained language models is performed. To demonstrate the effectiveness of DM-NLG, we performed experiments on five different datasets and observed that our proposed model reduces the slot error rate by 50% and improves BLEU by 10% compared to state-of-the-art models.
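As a rough illustration of a decoder that remembers what produced earlier words, here is a minimal PyTorch sketch in which the memory accumulates past attention contexts. The wiring, dimensions, and names are our assumptions based on the abstract, not the DM-NLG implementation.

```python
import torch
import torch.nn as nn

class MemoryDecoderStep(nn.Module):
    """One decoding step whose input is enriched with a summary of the
    contexts that generated all previous words (the 'dynamic memory')."""
    def __init__(self, hidden=256, vocab=32000):  # sizes are illustrative
        super().__init__()
        self.cell = nn.GRUCell(hidden * 2, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, y_prev, h, enc_ctx, memory):
        # memory: (B, t, hidden) store of past attention contexts
        mem = memory.mean(dim=1) if memory.size(1) > 0 else torch.zeros_like(h)
        h = self.cell(torch.cat([y_prev, enc_ctx + mem], dim=-1), h)
        memory = torch.cat([memory, enc_ctx.unsqueeze(1)], dim=1)  # remember this step
        return self.out(h), h, memory
```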
{"title":"Improving semantic coverage of data-to-text generation model using dynamic memory networks","authors":"Elham Seifossadat, H. Sameti","doi":"10.1017/s1351324923000207","DOIUrl":"https://doi.org/10.1017/s1351324923000207","url":null,"abstract":"\u0000 This paper proposes a sequence-to-sequence model for data-to-text generation, called DM-NLG, to generate a natural language text from structured nonlinguistic input. Specifically, by adding a dynamic memory module to the attention-based sequence-to-sequence model, it can store the information that leads to generate previous output words and use it to generate the next word. In this way, the decoder part of the model is aware of all previous decisions, and as a result, the generation of duplicate words or incomplete semantic concepts is prevented. To improve the generated sentences quality by the DM-NLG decoder, a postprocessing step is performed using the pretrained language models. To prove the effectiveness of the DM-NLG model, we performed experiments on five different datasets and observed that our proposed model is able to reduce the slot error rate rate by 50% and improve the BLEU by 10%, compared to the state-of-the-art models.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48272916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Urdu paraphrase detection: A novel DNN-based implementation using a semi-automatically generated corpus
Pub Date: 2023-05-29 | DOI: 10.1017/s1351324923000189
Hafiz Rizwan Iqbal, Rashad Maqsood, Agha Ali Raza, Saeed-Ul Hassan
Automatic paraphrase detection is the task of measuring the semantic overlap between two given texts. A major hurdle in the development and evaluation of paraphrase detection approaches, particularly for South Asian languages like Urdu, is the lack of standard evaluation resources. The few available paraphrase corpora for these languages were created manually. As a result, they are small and poorly suited for evaluating mainstream data-driven and deep neural network (DNN)-based approaches. Consequently, there is a need for semi- or fully automated corpus generation approaches for resource-scarce languages. There is currently no semi- or fully automatically generated sentence-level Urdu paraphrase corpus. Moreover, no study localizes and compares approaches for Urdu paraphrase detection across mainstream deep neural architectures and pretrained language models. This study addresses these gaps by presenting a semi-automatic pipeline for generating paraphrase corpora for Urdu, along with a corpus generated using the proposed approach. The corpus contains 3147 semi-automatically extracted Urdu sentence pairs, manually tagged as paraphrased (854) or non-paraphrased (2293). Finally, this paper proposes two novel DNN-based approaches for paraphrase detection in Urdu text: Word Embeddings n-gram Overlap (WENGO) and a modified approach, Deep Text Reuse and Paraphrase Plagiarism Detection (D-TRAPPD). Both approaches were evaluated on two related tasks: (i) paraphrase detection and (ii) text reuse and plagiarism detection. D-TRAPPD ($F_1 = 96.80$ for paraphrase detection and $F_1 = 88.90$ for text reuse and plagiarism detection) outperformed WENGO ($F_1 = 81.64$ and $F_1 = 61.19$, respectively) as well as other state-of-the-art approaches for these two tasks. The corpus, models, and our implementations are freely available to the research community.
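Since the abstract names Word Embeddings n-gram Overlap, here is a minimal sketch of what such a score could look like: n-gram overlap measured by cosine similarity in embedding space rather than by exact match. The pooling, threshold, and normalization are our assumptions, not the paper's definition.

```python
import numpy as np

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def wengo_like_score(s1, s2, emb, n=2, threshold=0.8):
    """emb: dict token -> vector. Fraction of s1 n-grams whose mean
    embedding is close (cosine >= threshold) to some s2 n-gram."""
    dim = len(next(iter(emb.values())))

    def vec(gram):
        vs = [emb[t] for t in gram if t in emb]
        v = np.mean(vs, axis=0) if vs else np.zeros(dim)
        return v / (np.linalg.norm(v) + 1e-9)

    g1 = [vec(g) for g in ngrams(s1, n)]
    g2 = [vec(g) for g in ngrams(s2, n)]
    if not g1 or not g2:
        return 0.0
    return sum(any(float(a @ b) >= threshold for b in g2) for a in g1) / len(g1)
```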
{"title":"Urdu paraphrase detection: A novel DNN-based implementation using a semi-automatically generated corpus","authors":"Hafiz Rizwan Iqbal, Rashad Maqsood, Agha Ali Raza, Saeed-Ul Hassan","doi":"10.1017/s1351324923000189","DOIUrl":"https://doi.org/10.1017/s1351324923000189","url":null,"abstract":"\u0000 Automatic paraphrase detection is the task of measuring the semantic overlap between two given texts. A major hurdle in the development and evaluation of paraphrase detection approaches, particularly for South Asian languages like Urdu, is the inadequacy of standard evaluation resources. The very few available paraphrased corpora for these languages are manually created. As a result, they are constrained to smaller sizes and are not very feasible to evaluate mainstream data-driven and deep neural networks (DNNs)-based approaches. Consequently, there is a need to develop semi- or fully automated corpus generation approaches for the resource-scarce languages. There is currently no semi- or fully automatically generated sentence-level Urdu paraphrase corpus. Moreover, no study is available to localize and compare approaches for Urdu paraphrase detection that focus on various mainstream deep neural architectures and pretrained language models.\u0000 This research study addresses this problem by presenting a semi-automatic pipeline for generating paraphrased corpora for Urdu. It also presents a corpus that is generated using the proposed approach. This corpus contains 3147 semi-automatically extracted Urdu sentence pairs that are manually tagged as paraphrased (854) and non-paraphrased (2293). Finally, this paper proposes two novel approaches based on DNNs for the task of paraphrase detection in Urdu text. These are Word Embeddings n-gram Overlap (henceforth called WENGO), and a modified approach, Deep Text Reuse and Paraphrase Plagiarism Detection (henceforth called D-TRAPPD). Both of these approaches have been evaluated on two related tasks: (i) paraphrase detection, and (ii) text reuse and plagiarism detection. The results from these evaluations revealed that D-TRAPPD (\u0000 \u0000 \u0000 \u0000$F_1 = 96.80$\u0000\u0000 \u0000 for paraphrase detection and \u0000 \u0000 \u0000 \u0000$F_1 = 88.90$\u0000\u0000 \u0000 for text reuse and plagiarism detection) outperformed WENGO (\u0000 \u0000 \u0000 \u0000$F_1 = 81.64$\u0000\u0000 \u0000 for paraphrase detection and \u0000 \u0000 \u0000 \u0000$F_1 = 61.19$\u0000\u0000 \u0000 for text reuse and plagiarism detection) as well as other state-of-the-art approaches for these two tasks. The corpus, models, and our implementations have been made available as free to download for the research community.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47763438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Plot extraction and the visualization of narrative flow
Pub Date: 2023-05-23 | DOI: 10.1017/s1351324923000232
Michael DeBuse, Sean Warnick
This article discusses the development of an automated plot extraction system for narrative texts. Acknowledging the distinction between plot, as an object of study with its own rich history and literature, and the features of a text that may be automatically extractable, we begin by characterizing a text’s scatter plot of entities. This visualization reveals entity density patterns that characterize the particular telling of the story under investigation and leads to effective scene partitioning. We then introduce the concept of narrative flow, a graph representation of the narrative ordering of scenes (the syuzhet) that captures how entities move through scenes, and investigate the degree to which narrative flow can be automatically extracted given a glossary of plot-important objects, actors, and locations. Our subsequent analysis explores the correlation between subjective notions of plot and the information extracted through these visualizations. In particular, we discuss narrative structures commonly found within the graphs and compare them with ground-truth narrative flow graphs; the mixed results highlight the difficulty of plot extraction. However, the visual artifacts and common structural relationships seen in the graphs provide insight into narrative and its underlying plot.
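As a toy illustration of the "scatter plot of entities" described above, the following sketch plots each glossary entity's mentions against token position. Exact-string matching and the plotting choices are simplifying assumptions.

```python
import matplotlib.pyplot as plt

def entity_scatter(tokens, entities):
    """Plot one point per entity mention at (token position, entity index)."""
    xs, ys = [], []
    for i, ent in enumerate(entities):
        for pos, tok in enumerate(tokens):
            if tok.lower() == ent.lower():
                xs.append(pos)
                ys.append(i)
    plt.scatter(xs, ys, s=8)
    plt.yticks(range(len(entities)), entities)
    plt.xlabel("token position")
    plt.title("Scatter plot of entities")
    plt.show()
```

Dense vertical bands in such a plot correspond to passages where many entities co-occur, which is what makes the visualization useful for scene partitioning.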
{"title":"Plot extraction and the visualization of narrative flow","authors":"Michael DeBuse, Sean Warnick","doi":"10.1017/s1351324923000232","DOIUrl":"https://doi.org/10.1017/s1351324923000232","url":null,"abstract":"\u0000 This article discusses the development of an automated plot extraction system for narrative texts. Acknowledging the distinction between plot, as an object of study with its own rich history and literature, and features of a text that may be automatically extractable, we begin by characterizing a text’s scatter plot of entities. This visualization of a text reveals entity density patterns characterizing the particular telling of the story under investigation and leads to effective scene partitioning. We then introduce the concept of narrative flow, a graph representation of the narrative ordering of scenes (the syuzhet) that includes how entities move through scenes from the text, and investigate the degree to which narrative flow can be automatically extracted given a glossary of plot-important objects, actors, and locations. Our subsequent analysis then explores the correlation between subjective notions of plot and the information extracted through these visualizations. In particular, we discuss narrative structures commonly found within the graphs and make comparisons with ground truth narrative flow graphs, showing mixed results highlighting the difficulty of plot extraction. However, the visual artifacts and common structural relationships seen in the graphs provide insight into narrative and its underlying plot.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47790134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recommending tasks based on search queries and missions
Pub Date: 2023-05-17 | DOI: 10.1017/s1351324923000219
Darío Garigliotti, K. Balog, K. Hose, Johannes Bjerva
Web search is an experience that naturally lends itself to recommendations, including query suggestions and related entities. In this article, we propose to recommend specific tasks to users based on their search queries, such as planning a holiday trip or organizing a party. Specifically, we introduce the problem of query-based task recommendation and develop methods that combine well-established term-based ranking techniques with continuous semantic representations, including sentence representations from several transformer-based models. Using a purpose-built test collection, we find that our method significantly outperforms a strong text-based baseline. Further, we extend our approach to take as input a set of queries that all share the same underlying task, referred to as a search mission. The study is rounded off with a detailed feature and query analysis.
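A minimal sketch of the hybrid ranking idea, interpolating a TF-IDF term score with a sentence-embedding score. The embedding model choice and the interpolation weight are our assumptions, not the article's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer  # assumed dependency

def rank_tasks(query, tasks, alpha=0.5):
    """Score each candidate task by a mix of term overlap and semantics."""
    tfidf = TfidfVectorizer().fit(tasks + [query])
    term = cosine_similarity(tfidf.transform([query]), tfidf.transform(tasks))[0]
    model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is ours
    sem = cosine_similarity(model.encode([query]), model.encode(tasks))[0]
    scores = alpha * term + (1 - alpha) * sem
    return sorted(zip(tasks, scores), key=lambda ts: -ts[1])
```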
{"title":"Recommending tasks based on search queries and missions","authors":"Darío Garigliotti, K. Balog, K. Hose, Johannes Bjerva","doi":"10.1017/s1351324923000219","DOIUrl":"https://doi.org/10.1017/s1351324923000219","url":null,"abstract":"\u0000 Web search is an experience that naturally lends itself to recommendations, including query suggestions and related entities. In this article, we propose to recommend specific tasks to users, based on their search queries, such as planning a holiday trip or organizing a party. Specifically, we introduce the problem of query-based task recommendation and develop methods that combine well-established term-based ranking techniques with continuous semantic representations, including sentence representations from several transformer-based models. Using a purpose-built test collection, we find that our method is able to significantly outperform a strong text-based baseline. Further, we extend our approach to using a set of queries that all share the same underlying task, referred to as search mission, as input. The study is rounded off with a detailed feature and query analysis.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44859495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint learning of text alignment and abstractive summarization for long documents via unbalanced optimal transport
Pub Date: 2023-05-15 | DOI: 10.1017/s1351324923000177
Xin Shen, Wai Lam, Shumin Ma, Huadong Wang
Recently, neural abstractive text summarization (NATS) models based on the sequence-to-sequence architecture have drawn a lot of attention. Real-world texts that need to be summarized range from short news articles with dozens of words to long reports with thousands of words. However, most existing NATS models are poor at summarizing long documents, due to the inherent limitations of their underlying neural architectures. In this paper, we focus on the task of long document summarization (LDS). Based on the inherent section structure of source documents, we divide an abstractive LDS problem into several smaller problems. In this setting, how to provide a less-biased target summary as the supervision for each section is vital to the model’s performance. As a preliminary, we formally describe the section-to-summary-sentence (S2SS) alignment for LDS. Based on this, we propose a novel NATS framework for the LDS task, built on the theory of unbalanced optimal transport (UOT) and named UOTSumm. It jointly learns three targets in a unified training objective: the optimal S2SS alignment, a section-level NATS summarizer, and the number of aligned summary sentences for each section. In this way, UOTSumm learns the text alignment directly from summarization data, without resorting to any biased tool such as ROUGE. UOTSumm can be easily adapted to most existing NATS models, and we implement two versions of it, with and without the pretrain-finetune technique. We evaluate UOTSumm on three publicly available LDS benchmarks: PubMed, arXiv, and GovReport. UOTSumm clearly outperforms its counterparts that use ROUGE for text alignment. When combined with UOTSumm, the performance of two vanilla NATS models improves by a large margin. Moreover, UOTSumm achieves better or comparable performance compared with recent strong baselines.
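To give a flavour of S2SS alignment via unbalanced optimal transport, here is a sketch using the POT library with a cosine-distance cost. The uniform masses and regularization values are assumptions, and the real UOTSumm objective is learned jointly with the summarizer rather than computed post hoc.

```python
import numpy as np
import ot  # POT: pip install pot

def s2ss_alignment(section_vecs, sentence_vecs, reg=0.05, reg_m=1.0):
    """Align summary sentences to sections with an unbalanced OT plan."""
    S = np.asarray(section_vecs)    # (n_sections, d), assumed unit-normalized
    T = np.asarray(sentence_vecs)   # (n_sentences, d), assumed unit-normalized
    cost = 1.0 - S @ T.T            # cosine distance
    a = np.full(len(S), 1.0 / len(S))   # uniform section mass (assumption)
    b = np.full(len(T), 1.0 / len(T))   # uniform sentence mass (assumption)
    plan = ot.unbalanced.sinkhorn_unbalanced(a, b, cost, reg, reg_m)
    return plan.argmax(axis=0)      # most-aligned section per summary sentence
```

The "unbalanced" relaxation matters here because sections and summary sentences need not carry matching total mass: some sections contribute several summary sentences and others none.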
{"title":"Joint learning of text alignment and abstractive summarization for long documents via unbalanced optimal transport","authors":"Xin Shen, Wai Lam, Shumin Ma, Huadong Wang","doi":"10.1017/s1351324923000177","DOIUrl":"https://doi.org/10.1017/s1351324923000177","url":null,"abstract":"\u0000 Recently, neural abstractive text summarization (NATS) models based on sequence-to-sequence architecture have drawn a lot of attention. Real-world texts that need to be summarized range from short news with dozens of words to long reports with thousands of words. However, most existing NATS models are not good at summarizing long documents, due to the inherent limitations of their underlying neural architectures. In this paper, we focus on the task of long document summarization (LDS). Based on the inherent section structures of source documents, we divide an abstractive LDS problem into several smaller-sized problems. In this circumstance, how to provide a less-biased target summary as the supervision for each section is vital for the model’s performance. As a preliminary, we formally describe the section-to-summary-sentence (S2SS) alignment for LDS. Based on this, we propose a novel NATS framework for the LDS task. Our framework is built based on the theory of unbalanced optimal transport (UOT), and it is named as UOTSumm. It jointly learns three targets in a unified training objective, including the optimal S2SS alignment, a section-level NATS summarizer, and the number of aligned summary sentences for each section. In this way, UOTSumm directly learns the text alignment from summarization data, without resorting to any biased tool such as ROUGE. UOTSumm can be easily adapted to most existing NATS models. And we implement two versions of UOTSumm, with and without the pretrain-finetune technique. We evaluate UOTSumm on three publicly available LDS benchmarks: PubMed, arXiv, and GovReport. UOTSumm obviously outperforms its counterparts that use ROUGE for the text alignment. When combined with UOTSumm, the performance of two vanilla NATS models improves by a large margin. Besides, UOTSumm achieves better or comparable performance when compared with some recent strong baselines.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"1 1","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41342362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Determining sentiment views of verbal multiword expressions using linguistic features
Pub Date: 2023-05-15 | DOI: 10.1017/s1351324923000153
Michael Wiegand, Marc Schulder, Josef Ruppenhofer
We examine the binary classification of sentiment views for verbal multiword expressions (MWEs). Sentiment views denote the perspective of the holder of some opinion. We distinguish between MWEs conveying the view of the speaker of the utterance (e.g., in “The company reinvented the wheel,” the holder is the implicit speaker, who criticizes the company for creating something that already exists) and MWEs conveying the view of explicit entities participating in an opinion event (e.g., in “Peter threw in the towel,” the holder is Peter, who has given up something). The task has so far been examined only on unigram opinion words. Since many features found effective for unigrams are not usable for MWEs, we propose novel ones that take into account the internal structure of MWEs, a unigram sentiment-view lexicon, and various information from Wiktionary. We also examine distributional methods and show that the corpus on which a representation is induced has a notable impact on the classification. We perform an extrinsic evaluation on the task of opinion holder extraction and show that the learnt knowledge also improves a state-of-the-art classifier trained on BERT. Sentiment-view classification is typically framed as a task in which only a small amount of labeled training data is available. As in the case of unigrams, we show that for MWEs a feature-based approach beats state-of-the-art generic methods.
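As an illustration of a feature-based setup for this task, the sketch below builds a few toy features (MWE length, lookups in a unigram sentiment-view lexicon) and trains a linear classifier. The feature set is our invention, loosely inspired by the abstract, not the authors' feature set.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def mwe_features(mwe_tokens, view_lexicon):
    """view_lexicon: dict unigram -> 'speaker' or 'actor' (assumed resource)."""
    return {
        "length": len(mwe_tokens),
        "has_speaker_word": any(view_lexicon.get(t) == "speaker" for t in mwe_tokens),
        "has_actor_word": any(view_lexicon.get(t) == "actor" for t in mwe_tokens),
        "head=" + mwe_tokens[0]: 1,  # crude proxy for internal structure
    }

def train_view_classifier(mwes, labels, view_lexicon):
    """mwes: list of token lists; labels: 'speaker' / 'actor' per MWE."""
    vec = DictVectorizer()
    X = vec.fit_transform([mwe_features(m, view_lexicon) for m in mwes])
    return vec, LogisticRegression(max_iter=1000).fit(X, labels)
```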
{"title":"Determining sentiment views of verbal multiword expressions using linguistic features","authors":"Michael Wiegand, Marc Schulder, Josef Ruppenhofer","doi":"10.1017/s1351324923000153","DOIUrl":"https://doi.org/10.1017/s1351324923000153","url":null,"abstract":"\u0000 We examine the binary classification of sentiment views for verbal multiword expressions (MWEs). Sentiment views denote the perspective of the holder of some opinion. We distinguish between MWEs conveying the view of the speaker of the utterance (e.g., in “The company reinvented the wheel” the holder is the implicit speaker who criticizes the company for creating something already existing) and MWEs conveying the view of explicit entities participating in an opinion event (e.g., in “Peter threw in the towel” the holder is Peter having given up something). The task has so far been examined on unigram opinion words. Since many features found effective for unigrams are not usable for MWEs, we propose novel ones taking into account the internal structure of MWEs, a unigram sentiment-view lexicon and various information from Wiktionary. We also examine distributional methods and show that the corpus on which a representation is induced has a notable impact on the classification. We perform an extrinsic evaluation in the task of opinion holder extraction and show that the learnt knowledge also improves a state-of-the-art classifier trained on BERT. Sentiment-view classification is typically framed as a task in which only little labeled training data are available. As in the case of unigrams, we show that for MWEs a feature-based approach beats state-of-the-art generic methods.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44157069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What should be encoded by position embedding for neural network language models?
Pub Date: 2023-05-10 | DOI: 10.1017/s1351324923000128
Shuiyuan Yu, Zihao Zhang, Haitao Liu
Word order is one of the most important grammatical devices and a basis for language understanding. However, the Transformer, one of the most popular NLP architectures, does not explicitly encode word order. A solution to this problem is to incorporate position information by means of position encoding/embedding (PE). Although a variety of methods for incorporating position information have been proposed, the NLP community still lacks detailed statistical research on position information in real-life language. To understand the influence of position information on the correlation between words in more detail, we investigated the factors that affect the frequency of words and word sequences in large corpora. Our results show that absolute position, relative position, being at one of the two ends of a sentence, and sentence length all significantly affect the frequency of words and word sequences. Moreover, we observed that the frequency distribution of word sequences over relative position carries valuable grammatical information. Our study suggests that to accurately capture word–word correlations, it is not enough to focus merely on absolute and relative position: Transformers should have access to more types of position-related information, which may require improvements to the current architecture.
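For reference, the absolute-position signal the abstract refers to is typically injected with the standard sinusoidal position encoding of Vaswani et al. (2017); a NumPy version is below.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Standard sinusoidal position encoding: sin on even dims, cos on odd."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (max_len, d_model)
```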
{"title":"What should be encoded by position embedding for neural network language models?","authors":"Shuiyuan Yu, Zihao Zhang, Haitao Liu","doi":"10.1017/s1351324923000128","DOIUrl":"https://doi.org/10.1017/s1351324923000128","url":null,"abstract":"\u0000 Word order is one of the most important grammatical devices and the basis for language understanding. However, as one of the most popular NLP architectures, Transformer does not explicitly encode word order. A solution to this problem is to incorporate position information by means of position encoding/embedding (PE). Although a variety of methods of incorporating position information have been proposed, the NLP community is still in want of detailed statistical researches on position information in real-life language. In order to understand the influence of position information on the correlation between words in more detail, we investigated the factors that affect the frequency of words and word sequences in large corpora. Our results show that absolute position, relative position, being at one of the two ends of a sentence and sentence length all significantly affect the frequency of words and word sequences. Besides, we observed that the frequency distribution of word sequences over relative position carries valuable grammatical information. Our study suggests that in order to accurately capture word–word correlations, it is not enough to focus merely on absolute and relative position. Transformers should have access to more types of position-related information which may require improvements to the current architecture.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42388065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SAN-T2T: An automated table-to-text generator based on selective attention network
Pub Date: 2023-05-05 | DOI: 10.1017/s135132492300013x
Haijie Ding, Xiaolong Xu
Table-to-text generation aims to generate descriptions for structured data (i.e., tables) and has been applied in many fields, such as question-answering systems and search engines. Current approaches mostly use neural language models to learn the alignment between output and input based on attention mechanisms, but they still suffer from the gradual weakening of attention when processing long texts and from an inability to utilize the records’ structural information. To solve these problems, we propose a novel generative model, SAN-T2T, which consists of a field-content selective encoder and a descriptive decoder, connected by a selective attention network. In the encoding phase, the table’s structure is integrated into its field representation, and a content selector with self-aligned gates exploits the fact that different records can determine each other’s importance. In the decoding phase, the content selector’s semantic information enhances the alignment between descriptions and records, and a featured copy mechanism addresses the rare-word problem. Experiments on the WikiBio and WeatherGov datasets show that SAN-T2T outperforms the baselines by a large margin, and that the content selector indeed improves the model’s performance.
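A minimal sketch of a self-aligned gating layer in the spirit of the content selector described above; the dimensions and exact wiring are our assumptions, not the SAN-T2T architecture.

```python
import torch
import torch.nn as nn

class ContentSelector(nn.Module):
    """Gate each record representation using its alignment to all other records."""
    def __init__(self, hidden=256):  # size is illustrative
        super().__init__()
        self.align = nn.Linear(hidden, hidden, bias=False)
        self.gate = nn.Linear(hidden * 2, hidden)

    def forward(self, records):  # records: (B, n_records, hidden)
        scores = records @ self.align(records).transpose(1, 2)  # record-record scores
        ctx = torch.softmax(scores, dim=-1) @ records           # self-aligned context
        g = torch.sigmoid(self.gate(torch.cat([records, ctx], dim=-1)))
        return g * records  # gated ("selected") records
```

Letting each record attend to all others is one way to realize the abstract's observation that records can determine each other's importance.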
{"title":"SAN-T2T: An automated table-to-text generator based on selective attention network","authors":"Haijie Ding, Xiaolong Xu","doi":"10.1017/s135132492300013x","DOIUrl":"https://doi.org/10.1017/s135132492300013x","url":null,"abstract":"\u0000 Table-to-text generation aims to generate descriptions for structured data (i.e., tables) and has been applied in many fields like question-answering systems and search engines. Current approaches mostly use neural language models to learn alignment between output and input based on the attention mechanisms, which are still flawed by the gradual weakening of attention when processing long texts and the inability to utilize the records’ structural information. To solve these problems, we propose a novel generative model SAN-T2T, which consists of a field-content selective encoder and a descriptive decoder, connected with a selective attention network. In the encoding phase, the table’s structure is integrated into its field representation, and a content selector with self-aligned gates is applied to take advantage of the fact that different records can determine each other’s importance. In the decoding phase, the content selector’s semantic information enhances the alignment between description and records, and a featured copy mechanism is applied to solve the rare word problem. Experiments on WikiBio and WeatherGov datasets show that SAN-T2T outperforms the baselines by a large margin, and the content selector indeed improves the model’s performance.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.5,"publicationDate":"2023-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46933417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Obituary: Yorick Wilks
Pub Date: 2023-05-01 | DOI: 10.1017/s1351324923000256 | Natural Language Engineering 29, pp. 846–847
J. Tait
Yorick was a great friend of Natural Language Engineering. He was a member of the founding editorial board, but more to the point was a sage and encouraging advisor to the Founding Editors Roberto Garigliano, John Tait, and Branimir Boguraev right from the genesis of the project.

At the time of his death, Yorick was one of, if not the, doyen of computational linguists. He had been continuously active in the field since 1962. Having graduated in philosophy, he took up a position in Margaret Masterman’s Cambridge Language Research Unit, an eccentric and somewhat informal organisation which started the careers of many pioneers of artificial intelligence and natural language engineering, including Karen Spärck Jones, Martin Kay, Margaret Boden, and Roger Needham (thought by some to be the originator of machine learning, as well as much else in computing).

Yorick was awarded a PhD in 1968 for work on the use of interlingua in machine translation. His PhD thesis stands out not least for its bright yellow binding (Wilks, 1968). Wilks’ effective PhD supervisor was Margaret Masterman, a student of Wittgenstein’s, although his work was formally directed by the distinguished philosopher Richard Braithwaite, Masterman’s husband, as she lacked an appropriate established position in the University of Cambridge.

Inevitably, given the puny computers of the time, Yorick’s PhD work falls well short of the scientific standards of the 21st century. Despite its shortcomings, his pioneering work influenced many people who have ultimately contributed to the now widespread practical use of machine translation and other automatic language processing systems. In particular, it would be reasonable to surmise that the current success of deep learning systems is based on inferring or inducing a hidden interlingua of the sort Wilks and colleagues tried to handcraft in the 1960s and 1970s. Furthermore, all probabilistic language systems are based on selecting a better or more likely interpretation of a fragment of language over a less likely one, a development of the preference semantics notion originally invented and popularised by Wilks (1973, 1975). As a result, his early work continues to be worth studying, not least for the very deep insights careful reading often reveals.

Underlying this early work was an interest in metaphor, which Yorick recognised as a pervasive feature of language. This was a topic to which Yorick returned repeatedly throughout his life. Wilks (1978) began to develop his approach, with Barnden (2007) providing a useful summary of work to that date. However, there is much later work – for example, Wilks et al. (2013).

Wilks was an important figure in the attempt to utilise existing, published dictionaries as a knowledge source for automatic natural language processing systems (Wilks, Slator, and Guthrie, 1996). This endeavour ultimately foundered on the differing interests of commercial dictionary publishers and developers of natural language processing
{"title":"Obituary: Yorick Wilks","authors":"J. Tait","doi":"10.1017/s1351324923000256","DOIUrl":"https://doi.org/10.1017/s1351324923000256","url":null,"abstract":"Yorick was a great friend of Natural Language Engineering. He was a member of the founding editorial board, but more to the point was a sage and encouraging advisor to the Founding Editors Roberto Garigliano, John Tait, and Branimir Boguraev right from the genesis of the project. At the time of his death, Yorick was one of, if not the, doyen of computational linguists. He had been continuously active in the field since 1962. Having graduated in philosophy, he took up a position in Margaret Masterman’s Cambridge Language Research Unit, an eccentric and somewhat informal organisation which started the careers of many pioneers of artificial intelligence and natural language engineering including Karen Spärck Jones, Martin Kay, Margaret Boden, and Roger Needham (thought by some to be the originator of machine learning, as well as much else in computing). Yorick was awarded a PhD in 1968 for work on the use of interlingua in machine translation. His PhD thesis stands out not least for its bright yellow binding (Wilks, 1968). Wilks’ effective PhD supervisor was Margaret Masterman, a student of Wittgenstein’s, although his work was formally directed by the distinguished philosopher Richard Braithwaite, Masterman’s husband, as she lacked an appropriate established position in the University of Cambridge. Inevitably, given the puny computers of the time, Yorick’s PhD work falls well short of the scientific standards of the 21st Century. Despite its shortcomings, his pioneering work influenced many people who have ultimately contributed to the now widespread practical use of machine translation and other automatic language processing systems. In particular, it would be reasonable to surmise that the current success of deep learning systems is based on inferring or inducing a hidden interlingua of the sort Wilks and colleagues tried to handcraft in the 1960s and 1970s. Furthermore, all probabilistic language systems are based on selecting a better or more likely interpretation of a fragment of language over a less likely one, a development of the preference semantics notion originally invented and popularised byWillks (1973, 1975). As a result, his early work continues to be worth studying, not least for the very deep insights careful reading often reveals. Underlying this early work was an interest in metaphor, which Yorick recognised as a pervasive feature of language. This was a topic to which Yorick returned repeatedly throughout his life. Wilks (1978) began to develop his approach, with Barnden (2007) providing a useful summary of work to that date. However, there is much later work – for example Wilks et al. (2013). Wilks was an important figure in the attempt to utilise existing, published dictionaries as a knowledge source for automatic natural language processing systems (Wilks, Slator, and Guthrie, 1996). 
This endeavour ultimately foundered on the differing interests of commercial dictionary publishers and developers of natural language processing","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"846 - 847"},"PeriodicalIF":2.5,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47052800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}