Title: MF-Saudi: A multimodal framework for bridging the gap between audio and textual data for Saudi dialect detection
Author: Raed Alharbi
DOI: 10.1016/j.jksuci.2024.102084
Journal: Journal of King Saud University-Computer and Information Sciences (Q1, Computer Science, Information Systems; Impact Factor 5.2)
Publication date: 2024-06-14
Article URL: https://www.sciencedirect.com/science/article/pii/S1319157824001733
Open-access PDF: https://www.sciencedirect.com/science/article/pii/S1319157824001733/pdfft?md5=99b69313cadb5fce44b832f5ddaa2066&pid=1-s2.0-S1319157824001733-main.pdf
Code: https://github.com/raed19/MF-Saudi
MF-Saudi: A multimodal framework for bridging the gap between audio and textual data for Saudi dialect detection
Detecting variations in dialects within a language can be challenging, particularly in regions with rich linguistic diversity like Saudi Arabia. To our knowledge, no prior attempts have been made to develop a multimodal, audio–textual framework for Saudi dialect detection. Current approaches often concentrate on detecting dialects based only on audio or textual data, which fails to capture the complex relationship between the two modalities. In this paper, we propose a novel multimodal framework, called MF-Saudi, for Saudi dialect detection. The framework consists of three main components: (1) a pretrained BERT encoder for extracting and encoding textual information; (2) an acoustic model for representing audio signals and fusing them with textual information via a fusion layer; and (3) an alignment learning module that develops meaningful representations capturing the complexities of audio–text relationships, resulting in improved dialect detection. We conduct empirical evaluations on a real-world dataset, demonstrating that our solution outperforms several state-of-the-art baseline methods. The code for our experiments is available at: https://github.com/raed19/MF-Saudi.
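The abstract outlines a three-part pipeline: a text encoder, an acoustic model, a fusion layer combining the two, and an alignment objective tying the modalities together. The sketch below is a minimal, illustrative mock-up of that wiring, not the authors' implementation: the random projection matrices, dimensions, concatenation-based fusion, and cosine alignment loss are all assumptions standing in for the learned networks described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: BERT pooled output, acoustic features,
# shared embedding space, and number of Saudi dialect classes.
TEXT_DIM, AUDIO_DIM, SHARED_DIM, N_DIALECTS = 768, 128, 64, 5

# Random projections standing in for the trained BERT encoder head,
# the acoustic model, and the dialect classifier (all assumptions).
W_text = rng.standard_normal((TEXT_DIM, SHARED_DIM)) * 0.01
W_audio = rng.standard_normal((AUDIO_DIM, SHARED_DIM)) * 0.01
W_cls = rng.standard_normal((2 * SHARED_DIM, N_DIALECTS)) * 0.01

def encode_text(bert_embedding):
    # Project a pooled BERT embedding into the shared space.
    return bert_embedding @ W_text

def encode_audio(acoustic_features):
    # Project pooled acoustic features (e.g. MFCC frames) likewise.
    return acoustic_features @ W_audio

def fuse(t, a):
    # Fusion layer, sketched here as simple concatenation.
    return np.concatenate([t, a], axis=-1)

def alignment_loss(t, a):
    # Alignment objective: pull paired audio/text vectors together
    # via cosine similarity (loss in [0, 2]).
    cos = (t @ a) / (np.linalg.norm(t) * np.linalg.norm(a) + 1e-8)
    return 1.0 - cos

def classify(fused):
    # Softmax over dialect logits.
    logits = fused @ W_cls
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# One utterance: stand-in text and audio representations.
text_emb = rng.standard_normal(TEXT_DIM)
audio_feat = rng.standard_normal(AUDIO_DIM)

t, a = encode_text(text_emb), encode_audio(audio_feat)
probs = classify(fuse(t, a))
loss = alignment_loss(t, a)
```

In training, a network would minimize the classification loss jointly with the alignment term so that the shared space encodes dialect cues present in both modalities; here the forward pass only illustrates the data flow.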
Journal introduction:
In 2022 the Journal of King Saud University - Computer and Information Sciences became an author-paid open-access journal. Authors who submitted their manuscripts after October 31st, 2021 are asked to pay an Article Processing Charge (APC) after acceptance to make their work immediately, permanently, and freely accessible to all. The Journal of King Saud University - Computer and Information Sciences is a refereed, international journal that covers all aspects of both the foundations of computer science and its practical applications.