ACM Transactions on Asian and Low-Resource Language Information Processing: Latest Articles

Explanation Guided Knowledge Distillation for Pre-trained Language Model Compression

IF 2, Zone 4 (Computer Science), Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2023-12-29 DOI: 10.1145/3639364

Zhao Yang, Yuanzhe Zhang, Dianbo Sui, Yiming Ju, Jun Zhao, Kang Liu

Knowledge distillation is widely used to compress pre-trained language models by transferring knowledge from a cumbersome teacher model to a lightweight student model. Although distillation-based compression has achieved promising performance, we observe that the explanations of the teacher model and the student model are not consistent. We argue that the student model should learn not only the predictions of the teacher model but also its internal reasoning process. To this end, we propose Explanation Guided Knowledge Distillation (EGKD), which uses explanations to represent the reasoning process and improve knowledge distillation. To obtain explanations in our distillation framework, we select three typical explanation methods rooted in different mechanisms: gradient-based, perturbation-based, and feature selection methods. Then, to improve computational efficiency, we propose different optimization strategies for using the explanations obtained by these three methods, which can give the student model better learning guidance. Experimental results on GLUE demonstrate that leveraging explanations can improve the performance of the student model. Moreover, EGKD can also be applied to model compression with different architectures.
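
The abstract does not give the EGKD objective, so the following is only a minimal sketch of the general idea: a standard temperature-softened distillation loss plus a term that penalises disagreement between teacher and student token attributions. The function names, the MSE form of the explanation term, and the weight `alpha` are assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax over the last axis with temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(T * T * kl.mean())

def explanation_loss(student_attr, teacher_attr):
    """MSE between L1-normalised absolute attribution scores (assumed form)."""
    def normalise(a):
        a = np.abs(np.asarray(a, dtype=float))
        return a / (a.sum(axis=-1, keepdims=True) + 1e-12)
    return float(np.mean((normalise(student_attr) - normalise(teacher_attr)) ** 2))

def explanation_guided_loss(s_logits, t_logits, s_attr, t_attr, alpha=0.5, T=2.0):
    """Hypothetical combined objective: distillation + alpha * explanation term."""
    return kd_loss(s_logits, t_logits, T) + alpha * explanation_loss(s_attr, t_attr)

# Toy batch: one example, three classes, three input tokens.
teacher_logits = np.array([[2.0, 0.5, -1.0]])
student_logits = np.array([[0.1, 0.2, 0.3]])
teacher_attr = [[1.0, 1.0, 3.0]]   # e.g. gradient-based saliency per token
student_attr = [[3.0, 1.0, 1.0]]
print(explanation_guided_loss(student_logits, teacher_logits, student_attr, teacher_attr))
```

The attribution vectors could come from any of the three explanation families named in the abstract; the sketch treats them as given inputs.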

Citations: 0

Leveraging Dual Gloss Encoders in Chinese Biomedical Entity Linking

Pub Date : 2023-12-28 DOI: 10.1145/3638555

Tzu-Mi Lin, Man-Chen Hung, Lung-Hao Lee

Entity linking is the task of assigning a unique identity to named entities mentioned in a text. It is a form of word sense disambiguation that automatically determines a pre-defined sense for a target entity. This study proposes the DGE (Dual Gloss Encoders) model for Chinese entity linking in the biomedical domain. We separately model a dual encoder architecture, comprising a context-aware gloss encoder and a lexical gloss encoder, for contextualized embedding representations. The dual gloss encoders are then jointly optimized to assign the nearest gloss with the highest score for target entity disambiguation. The experimental dataset consists of 10,218 sentences that were manually annotated with glosses defined in BabelNet 5.0 across 40 distinct biomedical entities. Experimental results show that the DGE model achieved an F1-score of 97.81, outperforming other existing methods. A series of model analyses indicates that the proposed approach is effective for Chinese biomedical entity linking.
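
As a rough illustration of "assigning the nearest gloss with the highest score", the sketch below scores each candidate gloss by cosine similarity under two separate embedding spaces and sums the two scores. The embeddings are toy vectors, and the additive combination and the name `link_entity` are assumptions; the abstract does not specify how DGE combines its two encoders.

```python
import numpy as np

def cosine_scores(query, candidates):
    """Cosine similarity between one query vector and each candidate row."""
    q = np.asarray(query, dtype=float)
    c = np.asarray(candidates, dtype=float)
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return c @ q

def link_entity(ctx_vec, lex_vec, ctx_gloss_vecs, lex_gloss_vecs):
    """Pick the gloss whose summed score from both encoders is highest (assumed rule)."""
    scores = cosine_scores(ctx_vec, ctx_gloss_vecs) + cosine_scores(lex_vec, lex_gloss_vecs)
    return int(np.argmax(scores)), scores

# Toy example with three candidate glosses and 2-dimensional embeddings.
ctx_gloss_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # context-aware gloss encoder output
lex_gloss_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # lexical gloss encoder output
best, scores = link_entity([0.9, 0.1], [1.0, 0.0], ctx_gloss_vecs, lex_gloss_vecs)
print(best, scores)
```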

Citations: 0

Improving the Detection of Multilingual South African Abusive Language via Skip-gram using Joint Multilevel Domain Adaptation: The Detection of Multilingual South African Abusive Language using Skip-gram and Domain Adaptation

Pub Date : 2023-12-28 DOI: 10.1145/3638759

Oluwafemi Oriola, Eduan Kotzé

The distinctiveness and sparsity of low-resource multilingual South African abusive language necessitate the development of a novel solution to automatically detect different classes of abusive language instances using machine learning. Skip-gram has been used to address sparsity in machine learning classification problems but is inadequate for detecting South African abusive language because of the considerable number of rare features and class imbalance. Joint Domain Adaptation has been used to enlarge the features of a low-resource target domain for improved classification outcomes by jointly learning from the target domain and a large-resource source domain. This paper, therefore, builds a Skip-gram model based on Joint Domain Adaptation to improve the detection of multilingual South African abusive language. In contrast to existing Joint Domain Adaptation approaches, the proposed Joint Multilevel Domain Adaptation model adapts monolingual source-domain instances and multilingual target-domain instances with a high frequency of rare features at the first level, and adapts target-domain features and first-level features at the next level. Both surface-level and embedding word features were used to evaluate the proposed model. In the evaluation of surface-level features, the Joint Multilevel Domain Adaptation model outperformed the state-of-the-art models with an accuracy of 0.92 and an F1-score of 0.68. In the evaluation of embedding features, the proposed model outperformed the state-of-the-art models with an accuracy of 0.88 and an F1-score of 0.64. The Joint Multilevel Domain Adaptation model significantly improved the average information gain of the rare features in different language categories and reduced class imbalance.
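
The abstract reports an improvement in the average information gain of rare features. For readers unfamiliar with the measure, here is a small sketch of the textbook definition for a binary (present/absent) feature; the labels and toy data are made up for illustration and are not from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """H(labels) minus the weighted entropy after splitting on a binary feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for f, y in zip(feature_present, labels) if f == value]
        if subset:
            gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy data: four messages with hypothetical class labels.
labels = ["abusive", "abusive", "clean", "clean"]
print(information_gain([True, True, False, False], labels))   # feature separates the classes
print(information_gain([True, False, True, False], labels))   # feature carries no information
```

A rare feature that occurs in only a handful of instances tends to have low information gain, which is one way to read the sparsity problem the abstract describes.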

Citations: 0

Ibn-Ginni: An Improved Morphological Analyzer for Arabic

Pub Date : 2023-12-28 DOI: 10.1145/3639050

Waleed Nazih, Amany Fashwan, Amr El-Gendy, Yasser Hifny

Arabic is a morphologically rich language with a complicated system of word formation and structure. Affixes (i.e., prefixes and suffixes) can be added to root words to generate different meanings and grammatical functions, indicating aspects such as tense, gender, number, case, and person. In addition, the meaning and function of words can be modified using an internal structure known as morphological patterns. Computational morphological analyzers are vital to developing Arabic language processing toolkits. In this paper, we introduce a new morphological analyzer (Ibn-Ginni) that inherits the speed and quality of the Buckwalter Arabic Morphological Analyzer (BAMA). Because BAMA has poor coverage of classical Arabic, we improve that coverage using the Alkhalil analyzer. Although Alkhalil is slow, it was used to generate a huge number of solutions for 3 million unique Arabic words collected from different resources. These wordform-based solutions were converted to stem-based solutions, refined manually, and added to the BAMA database, resulting in substantial improvements in the quality of the analysis. Ibn-Ginni is therefore a hybrid of the BAMA and Alkhalil analyzers and may be considered an efficient large-scale analyzer. It analyzed 0.6 million more words than BAMA, significantly improving coverage of the Arabic language. It is also fast: the average time to analyze a word is 0.3 ms. On a corpus designed for benchmarking Arabic morphological analyzers, our analyzer found all solutions for 72.72% of the words, and did not provide all possible morphological solutions for 24.24% of the words. The analyzer and its morphological database are publicly available on GitHub.
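
To make "stem-based solutions" concrete, the sketch below shows the general shape of a Buckwalter-style lookup: try every split of a word into prefix, stem, and suffix, look each part up in a lexicon, and keep the splits whose categories are mutually compatible. The tiny lexicons, the category names, and the compatibility tables are invented for illustration (words are in Buckwalter transliteration); this is not the Ibn-Ginni database or code.

```python
# Hypothetical toy lexicons: surface form -> morphological category.
PREFIXES = {"": "Pref-0", "w": "Pref-Wa", "Al": "NPref-Al"}
STEMS = {"ktAb": "N", "ktb": "PV"}
SUFFIXES = {"": "Suff-0", "At": "NSuff-At"}

# Hypothetical compatibility tables between category pairs.
PREFIX_STEM = {("Pref-0", "N"), ("Pref-Wa", "N"), ("NPref-Al", "N"),
               ("Pref-0", "PV"), ("Pref-Wa", "PV")}
STEM_SUFFIX = {("N", "Suff-0"), ("N", "NSuff-At"), ("PV", "Suff-0")}
PREFIX_SUFFIX = {(p, s) for p in PREFIXES.values() for s in SUFFIXES.values()}

def analyze(word):
    """Return every (prefix, stem, suffix) split licensed by the toy tables."""
    solutions = []
    for i in range(len(word)):                  # prefix = word[:i], may be empty
        for j in range(i + 1, len(word) + 1):   # stem = word[i:j], never empty
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES:
                a, b, c = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
                if (a, b) in PREFIX_STEM and (b, c) in STEM_SUFFIX and (a, c) in PREFIX_SUFFIX:
                    solutions.append((prefix, stem, suffix))
    return solutions

print(analyze("wktAb"))   # "and" + "book"
print(analyze("Alktb"))   # definite article + verb stem is rejected by the toy table
```

Because lookup is dictionary-based, adding stem entries (as the abstract describes doing with Alkhalil output) extends coverage without changing the analysis procedure.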

Citations: 0

Hypergraph Neural Network for Emotion Recognition in Conversations

Pub Date : 2023-12-27 DOI: 10.1145/3638760

Cheng Zheng, Haojie Xu, Xiao Sun

Modeling conversational context is an essential step for emotion recognition in conversations. Existing works still suffer from insufficient utilization of local and remote context information. This paper designs a hypergraph neural network, HNN-ERC, to better utilize both. HNN-ERC combines a recurrent neural network with the conventional hypergraph neural network to strengthen connections between utterances so that each utterance better receives information from other utterances. The proposed model achieves state-of-the-art results on three benchmark datasets, demonstrating its effectiveness.
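
For orientation, the sketch below implements one layer of the conventional hypergraph convolution, X' = ReLU(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta), on a toy conversation. It is not HNN-ERC itself: the recurrent component is omitted, and the way utterances are grouped into hyperedges here is an assumption made for illustration.

```python
import numpy as np

def hypergraph_conv(X, H, Theta, edge_weights=None):
    """One conventional hypergraph convolution layer with ReLU.

    X: (n_nodes, d_in) node features, H: (n_nodes, n_edges) incidence matrix,
    Theta: (d_in, d_out) weights. Every node must belong to at least one hyperedge.
    """
    H = np.asarray(H, dtype=float)
    n_edges = H.shape[1]
    w = np.ones(n_edges) if edge_weights is None else np.asarray(edge_weights, dtype=float)
    node_deg = H @ w              # weighted number of hyperedges per node
    edge_deg = H.sum(axis=0)      # number of nodes per hyperedge
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(node_deg))
    De_inv = np.diag(1.0 / edge_deg)
    A = Dv_inv_sqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt
    return np.maximum(A @ X @ Theta, 0.0)

# Toy conversation with four utterances (nodes) and three hyperedges:
# e0 = {0, 1} (local window), e1 = {1, 2, 3} (local window), e2 = {0, 2} (remote link).
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]])
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))       # utterance features, e.g. from a recurrent encoder
Theta = rng.normal(size=(3, 2))
print(hypergraph_conv(X, H, Theta))
```

A hyperedge can connect any number of utterances at once, which is what lets a single layer mix local and remote context.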

Citations: 0

Autoregressive Feature Extraction with Topic Modeling for Aspect-based Sentiment Analysis of Arabic as a Low-resource Language

Pub Date : 2023-12-27 DOI: 10.1145/3638050

Asmaa Hashem Sweidan, Nashwa El-Bendary, Esraa Elhariri

This paper proposes an approach for aspect-based sentiment analysis of Arabic social data, in particular the considerable text corpus of Arabic-language tweets expressing opinions on Twitter during the COVID-19 pandemic. The proposed approach examines the performance of several pre-trained predictive and autoregressive language models, namely BERT (Bidirectional Encoder Representations from Transformers) and XLNet, along with topic modeling algorithms, namely LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization), for aspect-based sentiment analysis of online Arabic text. In addition, a Bi-LSTM (Bidirectional Long Short-Term Memory) deep learning model is used to classify the aspects extracted from online reviews. The experimental results indicate that the combined XLNet-NMF model outperforms the other implemented state-of-the-art methods by improving the feature extraction of unstructured social media text, achieving an average sentiment classification accuracy of 0.946 and an F-measure of 0.938.
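
As background on the NMF component, here is a minimal sketch of non-negative matrix factorisation with the classic multiplicative update rules, applied to a toy document-term matrix. It stands in for whatever NMF implementation the authors used; the matrix, the number of topics, and the iteration count are illustrative assumptions, and the XLNet feature extraction step is not shown.

```python
import numpy as np

def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise non-negative V (docs x terms) as W (docs x topics) @ H (topics x terms)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_topics)) + eps
    H = rng.random((n_topics, n_terms)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term counts: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
V = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 4, 1],
              [0, 0, 1, 4]], dtype=float)
W, H = nmf(V, n_topics=2)
print("reconstruction error:", np.linalg.norm(V - W @ H))
print("top term index per topic:", H.argmax(axis=1))
```

In an aspect-extraction setting, the highest-weighted terms in each row of H would be read as candidate aspect terms for that topic.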

Citations: 0

The Computational Method for Supporting Thai VerbNet Construction

Pub Date : 2023-12-26 DOI: 10.1145/3638533

Krittanut Chungnoi, Rachada Kongkachandra, Sarun Gulyanon

VerbNet is a lexical resource for verbs that has many applications in natural language processing tasks, especially ones that require information about both the syntactic behavior and the semantics of verbs. This paper presents an attempt to construct the first version of a Thai VerbNet corpus via data enrichment of an existing lexical resource. This corpus contains annotation at both the syntactic and semantic levels, where verbs are tagged with frames within the verb class hierarchy and their arguments are labeled with semantic roles. We discuss the technical aspects of the construction process of Thai VerbNet and survey different semantic role labeling methods to make this process fully automatic. We also investigate the linguistic aspects of the computed verb classes, and the results show their potential to assist semantic classification and analysis. At the current stage, we have built a verb class hierarchy consisting of 28 verb classes from 112 unique concept frames over 490 unique verbs using our association rule learning method on Thai verbs.
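
The abstract mentions association rule learning without giving details, so the sketch below only shows the two basic quantities such methods rest on, support and confidence, computed over a toy table in which each verb is a "transaction" listing the frames it occurs with. The frame labels and the data are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical data: one set of syntactic frames per verb.
verb_frames = [
    {"NP V NP", "NP V"},
    {"NP V NP", "NP V NP PP"},
    {"NP V NP", "NP V"},
    {"NP V"},
]
print(support(verb_frames, {"NP V NP"}))
print(confidence(verb_frames, {"NP V"}, {"NP V NP"}))
```

Rules with high support and confidence indicate frames that tend to co-occur across verbs, which is one plausible signal for grouping verbs into classes.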

Citations: 0

Dual-branch Multitask Fusion Network for Offline Chinese Writer Identification

Pub Date : 2023-12-26 DOI: 10.1145/3638554

Haixia Wang, Qingran Miao, Qun Xiao, Yilong Zhang, Yingyu Mao

Chinese characters are complex and contain discriminative information, meaning that their writers have the potential to be recognized using less text. In this study, offline Chinese writer identification based on a single character was investigated. To extract comprehensive features to model Chinese characters, explicit and implicit information as well as global and local features are of interest. A dual-branch multitask fusion network is proposed which contains two branches for global and local feature extraction simultaneously, and introduces auxiliary tasks to help the main task. Content recognition, stroke number estimation, and stroke recognition are considered as three auxiliary tasks for explicit information. The main task extracts implicit information of writer identity. The experimental results validated the positive influences of auxiliary tasks on the writer identification task, with the stroke number estimation task being most helpful. In-depth research was conducted to investigate the influencing factors in Chinese writer identification, with respect to character complexity, stroke importance, and character number, which provides a systematic reference for the actual application of neural networks in Chinese writer identification.
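
One common way to let auxiliary tasks "help the main task" is to add their weighted losses to the main loss during training. The sketch below illustrates that pattern only; treating all four tasks as classification, the task names, and the weights are assumptions, and the dual-branch feature fusion is not modelled.

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy of a single example given raw logits and the true class index."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[target])

def multitask_loss(outputs, targets, aux_weights, main_task="writer"):
    """Main-task loss plus a weighted sum of auxiliary-task losses."""
    loss = cross_entropy(outputs[main_task], targets[main_task])
    for task, weight in aux_weights.items():
        loss += weight * cross_entropy(outputs[task], targets[task])
    return loss

# Toy logits for one character image; the head sizes are arbitrary.
outputs = {
    "writer": [2.0, 0.1, -1.0],          # main task: writer identity
    "content": [0.5, 1.5],               # auxiliary: which character was written
    "stroke_number": [0.0, 0.0, 0.0, 0.0],  # auxiliary: stroke count, binned
    "stroke": [1.0, -1.0],               # auxiliary: stroke type
}
targets = {"writer": 0, "content": 1, "stroke_number": 2, "stroke": 0}
weights = {"content": 0.3, "stroke_number": 0.3, "stroke": 0.3}
print(multitask_loss(outputs, targets, weights))
```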

Citations: 0

A Machine Learning-Based Readability Model for Gujarati Texts

Pub Date : 2023-12-21 DOI: 10.1145/3637826

Chandrakant K. Bhogayata

This study aims to develop a machine learning-based model to predict the readability of Gujarati texts. The dataset consisted of fifty prose passages from Gujarati literature. Fourteen lexical and syntactic readability features were extracted from the dataset using a unigram POS tagger and three Python scripts. Two samples of native Gujarati-speaking students, from secondary and higher education, rated the readability of the texts on a 10-point scale from 'easy' to 'difficult', with interrater agreement. After dimensionality reduction, seven text features were used as the independent variables and the mean readability rating as the dependent variable to train the readability model. Because the students' level of education and gender were related to their readability ratings, four readability models (for school students, university students, male students, and female students) were trained with a backward stepwise multiple linear regression algorithm. The trained models are comparable across the rater groups. The best model is the university students' readability rating model. The model is cross-validated. It explains 91% and 88% of the variance in readability ratings at training and cross-validation, respectively, and its effect size is large and its statistical power is high.
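
Backward stepwise regression starts from all predictors and repeatedly drops the one whose removal hurts the fit least. The sketch below uses adjusted R^2 as the stopping criterion, which is an assumption (the abstract does not state the criterion, and p-value or F-test thresholds are also common). The data are synthetic and the feature names are placeholders, not the study's text features.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of an ordinary least squares fit with an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def backward_stepwise(X, y, names):
    """Drop predictors one at a time while adjusted R^2 does not decrease."""
    keep = list(range(X.shape[1]))
    best = adjusted_r2(X[:, keep], y)
    while len(keep) > 1:
        trials = [(adjusted_r2(X[:, [k for k in keep if k != j]], y), j) for j in keep]
        score, drop = max(trials)
        if score < best:
            break
        best = score
        keep.remove(drop)
    return [names[k] for k in keep], best

# Synthetic example: the rating depends on x0 and x1; x2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)
selected, score = backward_stepwise(X, y, ["x0", "x1", "x2"])
print(selected, round(score, 3))
```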

Citations: 0

An Ensemble Strategy with Gradient Conflict for Multi-Domain Neural Machine Translation

IF 2 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2023-12-21 DOI: 10.1145/3638248

Zhibo Man, Yujie Zhang, Yu Li, Yuanmeng Chen, Yufeng Chen, Jinan Xu

Multi-domain neural machine translation aims to construct a unified NMT model that translates sentences across various domains. Nevertheless, previous studies share one limitation: they cannot acquire domain-general and domain-specific representations concurrently. To this end, we propose an ensemble strategy with gradient conflict for multi-domain neural machine translation that automatically learns model parameters by identifying both domain-shared and domain-specific features. Specifically, our approach consists of (1) a parameter-sharing framework, in which the parameters of all layers are initially shared and equivalent for each domain, and (2) an ensemble strategy, in which we design an Extra Ensemble strategy via a piecewise condition function to learn direction-based and distance-based gradient conflict. In addition, we give a detailed theoretical analysis of gradient conflict to further validate the effectiveness of our approach. Experimental results on two multi-domain datasets show that our proposed model outperforms previous work.
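To make the notions of direction-based and distance-based gradient conflict concrete, here is a small sketch. It is not the paper's Extra Ensemble strategy: the piecewise rule, the `max_distance` threshold, and the function names are hypothetical, chosen only to show one way two domain gradients could be compared by cosine similarity and by Euclidean distance.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; negative values mean the gradients point in opposing directions."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def distance(u, v):
    """Euclidean distance between two gradient vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def conflict_type(g_a, g_b, max_distance=1.0):
    """Hypothetical piecewise rule:
    'direction' if the gradients oppose each other,
    'distance'  if they are aligned but far apart,
    'none'      otherwise."""
    if cosine(g_a, g_b) < 0:
        return "direction"
    if distance(g_a, g_b) > max_distance:
        return "distance"
    return "none"

# Toy per-domain gradients for one shared parameter block (made-up numbers).
print(conflict_type([1.0, 0.5], [-0.8, 0.1]))  # opposing directions
print(conflict_type([1.0, 0.5], [3.0, 1.5]))   # same direction, large gap
print(conflict_type([1.0, 0.5], [1.1, 0.4]))   # nearly identical
```

In a multi-domain setting, a parameter block whose gradients conflict across domains would be a candidate for domain-specific treatment, while blocks without conflict could stay shared; how the paper actually makes that decision is not specified in the abstract.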

Citations: 0
