Zhao Yang, Yuanzhe Zhang, Dianbo Sui, Yiming Ju, Jun Zhao, Kang Liu
Knowledge distillation is widely used in pre-trained language model compression, where it transfers knowledge from a cumbersome model to a lightweight one. Although knowledge-distillation-based model compression has achieved promising performance, we observe that the explanations of the teacher model and the student model are not consistent. We argue that the student model should learn not only the predictions of the teacher model but also its internal reasoning process. To this end, we propose Explanation Guided Knowledge Distillation (EGKD), which uses explanations to represent the thinking process and improve knowledge distillation. To obtain explanations in our distillation framework, we select three typical explanation methods rooted in different mechanisms, namely gradient-based, perturbation-based, and feature selection methods. Then, to improve computational efficiency, we propose different optimization strategies to utilize the explanations obtained by these three explanation methods, which provide the student model with better learning guidance. Experimental results on GLUE demonstrate that leveraging explanations can improve the performance of the student model. Moreover, EGKD can also be applied to model compression with different architectures.
{"title":"Explanation Guided Knowledge Distillation for Pre-trained Language Model Compression","authors":"Zhao Yang, Yuanzhe Zhang, Dianbo Sui, Yiming Ju, Jun Zhao, Kang Liu","doi":"10.1145/3639364","DOIUrl":"https://doi.org/10.1145/3639364","journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing"},"publicationDate":"2023-12-29","publicationTypes":"Journal Article"}
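The idea of having a student match both the teacher's predictions and its explanations can be illustrated with a small sketch. The code below is not the EGKD objective from the paper: it combines the standard temperature-scaled soft-label distillation term with an assumed alignment term (mean squared error between normalized per-token attribution scores). The function names, the `alpha` weight, and the choice of mean squared error are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalize(scores):
    """Scale attribution scores so their absolute values sum to 1."""
    total = sum(abs(s) for s in scores)
    return [s / total for s in scores] if total else list(scores)


def distillation_loss(teacher_logits, student_logits,
                      teacher_attr, student_attr,
                      temperature=2.0, alpha=0.5):
    """Soft-label KD loss plus a hypothetical explanation-alignment term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Standard soft-label term, scaled by T^2 as in common KD practice.
    kd = (temperature ** 2) * kl_divergence(p_teacher, p_student)
    # Assumed alignment term: mean squared error between the normalized
    # per-token attribution scores of teacher and student.
    t_attr, s_attr = normalize(teacher_attr), normalize(student_attr)
    align = sum((t - s) ** 2 for t, s in zip(t_attr, s_attr)) / len(t_attr)
    return kd + alpha * align
```

With identical logits and attributions the loss is zero; it grows when either the predictions or the attribution patterns diverge, which is the behavior the abstract argues a student should be penalized for.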
Entity linking is the task of assigning a unique identity to named entities mentioned in a text; it is a form of word sense disambiguation that automatically determines a pre-defined sense for a target entity. This study proposes the DGE (Dual Gloss Encoders) model for Chinese entity linking in the biomedical domain. We model a dual encoder architecture, comprising a context-aware gloss encoder and a lexical gloss encoder, for contextualized embedding representations. The dual gloss encoders are then jointly optimized to assign the nearest gloss with the highest score for target entity disambiguation. The experimental datasets consist of a total of 10,218 sentences that were manually annotated with glosses defined in BabelNet 5.0 across 40 distinct biomedical entities. Experimental results show that the DGE model achieved an F1-score of 97.81, outperforming other existing methods. A series of model analyses indicate that the proposed approach is effective for Chinese biomedical entity linking.
{"title":"Leveraging Dual Gloss Encoders in Chinese Biomedical Entity Linking","authors":"Tzu-Mi Lin, Man-Chen Hung, Lung-Hao Lee","doi":"10.1145/3638555","DOIUrl":"https://doi.org/10.1145/3638555","journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing"},"publicationDate":"2023-12-28","publicationTypes":"Journal Article"}
The distinctiveness and sparsity of low-resource multilingual South African abusive language necessitate the development of a novel solution to automatically detect different classes of abusive language instances using machine learning. Skip-gram has been used to address sparsity in machine learning classification problems but is inadequate in detecting South African abusive language due to the considerable amount of rare features and class imbalance. Joint Domain Adaptation has been used to enlarge features of a low-resource target domain for improved classification outcomes by jointly learning from the target domain and large-resource source domain. This paper, therefore, builds a Skip-gram model based on Joint Domain Adaptation to improve the detection of multilingual South African abusive language. Contrary to the existing Joint Domain Adaptation approaches, a Joint Multilevel Domain Adaptation model involving adaptation of monolingual source domain instances and multilingual target domain instances with high frequency of rare features was executed at the first level, and adaptation of target-domain features and first-level features at the next level. Both surface-level and embedding word features were used to evaluate the proposed model. In the evaluation of surface-level features, the Joint Multilevel Domain Adaptation model outperformed the state-of-the-art models with accuracy of 0.92 and F1-score of 0.68. In the evaluation of embedding features, the proposed model outperformed the state-of-the-art models with accuracy of 0.88 and F1-score of 0.64. The Joint Multilevel Domain Adaptation model significantly improved the average information gain of the rare features in different language categories and reduced class imbalance.
{"title":"Improving the Detection of Multilingual South African Abusive Language via Skip-gram using Joint Multilevel Domain Adaptation","authors":"Oluwafemi Oriola, Eduan Kotzé","doi":"10.1145/3638759","DOIUrl":"https://doi.org/10.1145/3638759","journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing"},"publicationDate":"2023-12-28","publicationTypes":"Journal Article"}
Waleed Nazih, Amany Fashwan, Amr El-Gendy, Yasser Hifny
Arabic is a morphologically rich language, which means that the Arabic language has a complicated system of word formation and structure. The affixes in the Arabic language (i.e., prefixes and suffixes) can be added to root words to generate different meanings and grammatical functions. These affixes can indicate aspects such as tense, gender, number, case, person, and more. In addition, the meaning and function of words can be modified in Arabic using an internal structure known as morphological patterns. Computational morphological analyzers of Arabic are vital to developing Arabic language processing toolkits. In this paper, we introduce a new morphological analyzer (Ibn-Ginni) that inherits the speed and quality of the Buckwalter Arabic Morphological Analyzer (BAMA). The BAMA has poor coverage of the classical Arabic language. Hence, the coverage of classical Arabic is improved by using the Alkhalil analyzer. Although it is slow, it was used to generate a huge number of solutions for 3 million unique Arabic words collected from different resources. These wordform-based solutions were converted to stem-based solutions, refined manually, and added to the database of BAMA, resulting in substantial improvements in the quality of the analysis. Hence, Ibn-Ginni is a hybrid system between BAMA and Alkhalil analyzers and may be considered an efficient large-scale analyzer. The Ibn-Ginni analyzer analyzed 0.6 million more words than the BAMA analyzer. Therefore, our analyzer significantly improves the coverage of the Arabic language. Besides, the Ibn-Ginni analyzer is high-speed at providing solutions; the average time to analyze a word is 0.3 ms. Using a corpus designed for benchmarking Arabic morphological analyzers, our analyzer was able to find all solutions for 72.72% of the words. Moreover, the analyzer did not provide all possible morphological solutions for 24.24% of the words. The analyzer and its morphological database are publicly available on GitHub.
{"title":"Ibn-Ginni: An Improved Morphological Analyzer for Arabic","authors":"Waleed Nazih, Amany Fashwan, Amr El-Gendy, Yasser Hifny","doi":"10.1145/3639050","DOIUrl":"https://doi.org/10.1145/3639050","journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing"},"publicationDate":"2023-12-28","publicationTypes":"Journal Article"}
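For readers unfamiliar with stem-based analysis, the following sketch shows the general Buckwalter-style procedure: every split of a word into prefix, stem, and suffix is looked up in three lexicons, and a split is accepted only if the three category pairs are mutually compatible. This is a toy illustration, not the Ibn-Ginni or BAMA code; the lexicon entries (in Latin transliteration), the category names, and the `analyze` function are invented for the example.

```python
# Toy lexicons mapping a morpheme to a morphological category.
# All entries and category names are hypothetical.
PREFIXES = {"": "Pref-0", "al": "Pref-Al"}      # "al" = definite article
STEMS = {"kitAb": "N"}                          # "kitAb" = book
SUFFIXES = {"": "Suff-0", "un": "Suff-Indef"}   # "un" = indefinite ending

# Compatibility tables: which category pairs may co-occur.
PREFIX_STEM = {("Pref-0", "N"), ("Pref-Al", "N")}
STEM_SUFFIX = {("N", "Suff-0"), ("N", "Suff-Indef")}
# The definite article is not allowed with the indefinite ending.
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-0", "Suff-Indef"),
                 ("Pref-Al", "Suff-0")}


def analyze(word):
    """Return every (prefix, stem, suffix) split licensed by the tables."""
    solutions = []
    for i in range(len(word) + 1):
        for j in range(i + 1, len(word) + 1):  # the stem must be non-empty
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if (prefix not in PREFIXES or stem not in STEMS
                    or suffix not in SUFFIXES):
                continue
            p, s, x = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
            if ((p, s) in PREFIX_STEM and (s, x) in STEM_SUFFIX
                    and (p, x) in PREFIX_SUFFIX):
                solutions.append((prefix, stem, suffix))
    return solutions
```

Because the lookup is a handful of dictionary and set probes per split, this style of analyzer is fast, and coverage depends almost entirely on the size of the stem lexicon, which is why adding stem-based solutions to the database can extend coverage without slowing analysis much.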
Modeling conversational context is an essential step for emotion recognition in conversations. Existing works still make insufficient use of local and remote context information. This paper designs a hypergraph neural network, namely HNN-ERC, to better utilize local and remote contextual information. HNN-ERC combines a recurrent neural network with the conventional hypergraph neural network to strengthen connections between utterances and help each utterance better receive information from other utterances. The proposed model has empirically achieved state-of-the-art results on three benchmark datasets, demonstrating its effectiveness.
{"title":"Hypergraph Neural Network for Emotion Recognition in Conversations","authors":"Cheng Zheng, Haojie Xu, Xiao Sun","doi":"10.1145/3638760","DOIUrl":"https://doi.org/10.1145/3638760","journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing"},"publicationDate":"2023-12-27","publicationTypes":"Journal Article"}
This paper proposes an approach for aspect-based sentiment analysis of Arabic social data, especially the considerable text corpus of Arabic-language tweets expressing opinions on Twitter during the COVID-19 pandemic. The proposed approach examines the performance of several pre-trained predictive and autoregressive language models, namely BERT (Bidirectional Encoder Representations from Transformers) and XLNet, along with topic modeling algorithms, namely LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization), for aspect-based sentiment analysis of online Arabic text. In addition, a Bi-LSTM (Bidirectional Long Short Term Memory) deep learning model is used to classify the aspects extracted from online reviews. The experimental results indicate that the combined XLNet-NMF model outperforms the other implemented state-of-the-art methods by improving the feature extraction of unstructured social media text, achieving an average sentiment classification accuracy of 0.946 and an F-measure of 0.938.
{"title":"Autoregressive Feature Extraction with Topic Modeling for Aspect-based Sentiment Analysis of Arabic as a Low-resource Language","authors":"Asmaa Hashem Sweidan, Nashwa El-Bendary, Esraa Elhariri","doi":"10.1145/3638050","DOIUrl":"https://doi.org/10.1145/3638050","journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing"},"publicationDate":"2023-12-27","publicationTypes":"Journal Article"}
VerbNet is a lexical resource for verbs that has many applications in natural language processing tasks, especially ones that require information about both the syntactic behavior and the semantics of verbs. This paper presents an attempt to construct the first version of a Thai VerbNet corpus via data enrichment of the existing lexical resource. This corpus contains the annotation at both the syntactic and semantic levels, where verbs are tagged with frames within the verb class hierarchy and their arguments are labeled with the semantic role. We discuss the technical aspect of the construction process of Thai VerbNet and survey different semantic role labeling methods to make this process fully automatic. We also investigate the linguistic aspect of the computed verb classes and the results show the potential in assisting semantic classification and analysis. At the current stage, we have built the verb class hierarchy consisting of 28 verb classes from 112 unique concept frames over 490 unique verbs using our association rule learning method on Thai verbs.
{"title":"The Computational Method for Supporting Thai VerbNet Construction","authors":"Krittanut Chungnoi, Rachada Kongkachandra, Sarun Gulyanon","doi":"10.1145/3638533","DOIUrl":"https://doi.org/10.1145/3638533","journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing"},"publicationDate":"2023-12-26","publicationTypes":"Journal Article"}
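Association rule learning, which the Thai VerbNet abstract mentions, rests on two quantities: the support of an itemset and the confidence of a rule. The sketch below mines pairwise rules from sets of labels; it is a generic illustration, not the authors' method. Treating each verb as a "transaction" of semantic-role labels is an assumption made for the example, as are the `mine_rules` function, its thresholds, and the sample data.

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return (lhs, rhs, support, confidence) for frequent item pairs."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the set.
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue  # prune infrequent pairs
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical role sets observed for four verbs.
verbs = [
    {"Agent", "Theme"},
    {"Agent", "Theme", "Recipient"},
    {"Agent", "Location"},
    {"Agent", "Theme"},
]
rules = mine_rules(verbs)
```

On this sample only the rule Theme -> Agent survives: the pair appears in three of four verbs, and every verb with a Theme also has an Agent, whereas Agent -> Theme falls below the confidence threshold.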
Haixia Wang, Qingran Miao, Qun Xiao, Yilong Zhang, Yingyu Mao
Chinese characters are complex and contain discriminative information, meaning that their writers have the potential to be recognized using less text. In this study, offline Chinese writer identification based on a single character was investigated. To extract comprehensive features to model Chinese characters, explicit and implicit information as well as global and local features are of interest. A dual-branch multitask fusion network is proposed which contains two branches for global and local feature extraction simultaneously, and introduces auxiliary tasks to help the main task. Content recognition, stroke number estimation, and stroke recognition are considered as three auxiliary tasks for explicit information. The main task extracts implicit information of writer identity. The experimental results validated the positive influences of auxiliary tasks on the writer identification task, with the stroke number estimation task being most helpful. In-depth research was conducted to investigate the influencing factors in Chinese writer identification, with respect to character complexity, stroke importance, and character number, which provides a systematic reference for the actual application of neural networks in Chinese writer identification.
{"title":"Dual-branch Multitask Fusion Network for Offline Chinese Writer Identification","authors":"Haixia Wang, Qingran Miao, Qun Xiao, Yilong Zhang, Yingyu Mao","doi":"10.1145/3638554","DOIUrl":"https://doi.org/10.1145/3638554","journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing"},"publicationDate":"2023-12-26","publicationTypes":"Journal Article"}
This study aims to develop a machine learning-based model to predict the readability of Gujarati texts. The dataset comprised fifty prose passages from Gujarati literature. Fourteen lexical and syntactic readability text features were extracted from the dataset using a unigram POS tagger and three Python scripts. Two samples of native Gujarati-speaking students, from secondary and higher education, rated the readability of the Gujarati texts on a 10-point scale from 'easy' to 'difficult' with interrater agreement. After dimensionality reduction, seven text features were used as the independent variables and the mean readability rating as the dependent variable to train the readability model. As the students' level of education and gender were related to their readability ratings, four readability models, for school students, university students, male students, and female students, were trained with backward stepwise multiple linear regression. The trained models are comparable across the rater groups. The best model is the university students' readability rating model. The model is cross-validated. It explains 91% and 88% of the variance in readability ratings at training and cross-validation, respectively; its effect size is large and its statistical power is high.
{"title":"A Machine Learning-Based Readability Model for Gujarati Texts","authors":"Chandrakant K. Bhogayata","doi":"10.1145/3637826","DOIUrl":"https://doi.org/10.1145/3637826","journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing"},"publicationDate":"2023-12-21","publicationTypes":"Journal Article"}
Multi-domain neural machine translation aims to construct a unified NMT model to translate sentences across various domains. Nevertheless, previous studies share one limitation: the inability to acquire both domain-general and domain-specific representations concurrently. To this end, we propose an ensemble strategy with gradient conflict for multi-domain neural machine translation that automatically learns model parameters by identifying both domain-shared and domain-specific features. Specifically, our approach consists of (1) a parameter-sharing framework, in which the parameters of all the layers are originally shared and equivalent for each domain, and (2) an ensemble strategy, for which we design an Extra Ensemble strategy via a piecewise condition function to learn direction and distance-based gradient conflict. In addition, we give a detailed theoretical analysis of the gradient conflict to further validate the effectiveness of our approach. Experimental results on two multi-domain datasets show the superior performance of our proposed model compared to previous work.
{"title":"An Ensemble Strategy with Gradient Conflict for Multi-Domain Neural Machine Translation","authors":"Zhibo Man, Yujie Zhang, Yu Li, Yuanmeng Chen, Yufeng Chen, Jinan Xu","doi":"10.1145/3638248","DOIUrl":"https://doi.org/10.1145/3638248","journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing"},"publicationDate":"2023-12-21","publicationTypes":"Journal Article"}
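The notion of direction-based and distance-based gradient conflict between two domains can be made concrete with a short sketch. The code below is not the paper's Extra Ensemble strategy or its piecewise condition function: it measures direction with cosine similarity and distance with the Euclidean norm, and resolves a conflict with the projection step from PCGrad (gradient surgery), a different, previously published technique swapped in for illustration. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Direction measure: a negative value means the gradients conflict."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def distance(u, v):
    """Distance measure: Euclidean gap between the two gradients."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def resolve_conflict(g_i, g_j):
    """PCGrad-style step: if g_i opposes g_j, remove the opposing component.

    When the dot product is negative, g_i is projected onto the plane
    orthogonal to g_j; otherwise g_i is returned unchanged.
    """
    d = dot(g_i, g_j)
    if d >= 0:
        return list(g_i)
    scale = d / dot(g_j, g_j)
    return [a - scale * b for a, b in zip(g_i, g_j)]


# Hypothetical gradients of one shared parameter block from two domains.
g_news = [1.0, 0.0]
g_medical = [-1.0, 1.0]
projected = resolve_conflict(g_news, g_medical)
```

Here the two gradients have negative cosine similarity, so the first is projected to `[0.5, 0.5]`, which no longer opposes the second. A shared parameter whose domain gradients keep conflicting in this way is a natural candidate for becoming domain-specific.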