Multilingual Question Answering for Malaysia History with Transformer-based Language Model

Emerging Science Journal (Q1, Multidisciplinary) | Pub Date: 2024-04-01 | DOI: 10.28991/esj-2024-08-02-019

Qi Zhi Lim, C. Lee, K. Lim, Jing Xiang Ng, Eric Khang Heng Ooi, Nicole Kai Ning Loh



Abstract

In natural language processing (NLP), a Question Answering System (QAS) is a system or model designed to understand and respond to user queries in natural language. Recent advances in QAS show a paradigm shift from traditional machine learning and deep learning approaches towards transformer-based language models. Despite significant progress, the use of these models for historical QAS and the development of QAS for the Malay language remain largely unexplored. This research aims to bridge these gaps by developing a multilingual QAS for the history of Malaysia using a transformer-based language model. The system development process comprises data collection, knowledge representation, data loading and pre-processing, document indexing and storing, and the establishment of a querying pipeline with a retriever and a reader. A dataset of 100 articles related to the history of Malaysia, including web blogs, has been constructed to serve as the knowledge base for the proposed QAS. A significant aspect of this research is the use of the dataset translated into English instead of the raw dataset in Malay. This decision was made to leverage well-established retriever and reader models that were trained on English data. Moreover, an evaluation dataset of 100 question-answer pairs has been created to evaluate the performance of the models. A comparative analysis of six transformer-based language models, namely DeBERTaV3, BERT, ALBERT, ELECTRA, MiniLM, and RoBERTa, was conducted through a series of experiments to determine the best reader model for the proposed QAS. The experimental results reveal that the proposed QAS achieved the best performance when employing RoBERTa as the reader model. Finally, the proposed QAS was deployed on Discord and equipped with multilingual support through language detection and translation modules, enabling it to handle queries in both Malay and English.
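
The querying flow described in the abstract (language detection, translation, retriever, reader) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation. The abstract does not name the retriever, so a toy TF-IDF ranker stands in for it. The sentence-overlap `overlap_reader` is only a placeholder for a transformer reader such as RoBERTa. The `MALAY_HINTS` word list, the lookup-table `stub_translate`, and the two sample documents are all hypothetical. Translating the answer back into the query language is likewise an assumption, since the abstract only states that queries in both languages are handled.

```python
import math
import re
from collections import Counter

# Hypothetical hint words for a toy language detector (not from the paper).
MALAY_HINTS = {"siapakah", "apakah", "bilakah", "siapa", "bila", "yang", "adalah"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def detect_language(text):
    """Toy detector: 'ms' if any Malay hint word appears, otherwise 'en'."""
    return "ms" if MALAY_HINTS & set(tokenize(text)) else "en"


class TfidfRetriever:
    """Toy sparse retriever that ranks documents by a TF-IDF overlap score."""

    def __init__(self, documents):
        self.documents = documents
        self.term_counts = [Counter(tokenize(doc)) for doc in documents]
        doc_freq = Counter()
        for counts in self.term_counts:
            doc_freq.update(counts.keys())
        n_docs = len(documents)
        # Smoothed inverse document frequency, always positive.
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}

    def retrieve(self, query, top_k=1):
        """Return up to top_k documents with a non-zero score, best first."""
        query_tokens = tokenize(query)
        scored = []
        for index, counts in enumerate(self.term_counts):
            score = sum(counts[t] * self.idf.get(t, 0.0) for t in query_tokens)
            scored.append((score, index))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self.documents[i] for score, i in scored[:top_k] if score > 0]


def overlap_reader(question, passage):
    """Placeholder reader: pick the sentence sharing the most tokens with the question."""
    question_tokens = set(tokenize(question))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    return max(sentences, key=lambda s: len(question_tokens & set(tokenize(s))))


def stub_translate(text, source, target):
    """Hypothetical translator backed by a tiny lookup table; unknown text is returned unchanged."""
    table = {("ms", "en", "Bilakah Malaya mencapai kemerdekaan?"):
             "When did Malaya gain independence?"}
    return table.get((source, target, text), text)


def answer_query(query, retriever, reader, translate):
    """Detect the query language, answer over the English knowledge base, translate back."""
    language = detect_language(query)
    english_query = translate(query, "ms", "en") if language == "ms" else query
    passages = retriever.retrieve(english_query, top_k=1)
    if not passages:
        return None  # nothing relevant in the knowledge base
    answer = reader(english_query, passages[0])
    # Assumption: the answer is returned in the language of the query.
    return translate(answer, "en", "ms") if language == "ms" else answer


# Illustrative documents only; these are not taken from the paper's dataset.
DOCS = [
    "Malaya gained independence from Britain on 31 August 1957. "
    "Tunku Abdul Rahman was the first Prime Minister.",
    "Malacca was captured by the Portuguese in 1511. "
    "Afonso de Albuquerque led the Portuguese fleet.",
]
retriever = TfidfRetriever(DOCS)
```

Calling `answer_query("Who captured Malacca in 1511?", retriever, overlap_reader, stub_translate)` routes an English query straight to the retriever, while a Malay query is first passed through the translator. In a real deployment the three callables would be replaced by an actual language detector, a translation service, and a fine-tuned reader model. Keeping them as plain function arguments makes that swap straightforward.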
