Multimodal Integration of Mel Spectrograms and Text Transcripts for Enhanced Automatic Speech Recognition: Leveraging Extractive Transformer-Based Approaches and Late Fusion Strategies

Computational Intelligence (IF 1.8; Zone 4, Computer Science; JCR Q3, Computer Science, Artificial Intelligence). Pub Date: 2024-12-19. DOI: 10.1111/coin.70012

Sunakshi Mehra, Virender Ranga, Ritu Agarwal

Computational Intelligence, volume 40, issue 6. Full text: https://onlinelibrary.wiley.com/doi/10.1111/coin.70012

Abstract

This study aims to improve Automatic Speech Recognition (ASR) by integrating multimodal data, specifically text transcripts and Mel spectrograms (2D images) derived from raw audio. It examines the underexplored potential of combining spectrograms with linguistic information to improve spoken word recognition accuracy. We propose two distinct transformer-based approaches. First, for the audio-centric approach, we use RegNet and ConvNeXt architectures, pretrained on a dataset of 14 million annotated images from ImageNet, to process Mel spectrograms as image inputs. Second, we use the Speech2Text transformer to obtain text transcripts from raw audio. We preprocess the Mel spectrogram images by resizing them to 224 × 224 pixels, which yields two-dimensional audio representations. RegNet and ConvNeXt, both pretrained on ImageNet, individually classify these images. The first channel generates embeddings for the visual modality (RegNet and ConvNeXt) from the 2D Mel spectrograms. In addition, we use Sentence-BERT (SBERT) embeddings, produced by Siamese BERT networks, to convert the Speech2Text transcripts into vectors. The image embeddings and the Sentence-BERT embeddings of the transcripts are then fine-tuned in a deep dense model with five layers and batch normalization for spoken word classification. Our experiments use the Google Speech Command Dataset (GSCD) version 2, which comprises 35 word categories. To gauge the impact of spectrograms and linguistic features, we conducted an ablation analysis. Our late fusion strategy combines word embeddings and image embeddings. Processed by the deep dense model with batch normalization, ConvNeXt, RegNet, and the text transcripts reach test accuracies of 95.87%, 99.95%, and 85.93%, respectively, across the 35 word categories. Late fusion of ConvNeXt + RegNet + SBERT reaches a test accuracy of 99.96% on the 35 word categories, which exceeds the results of other state-of-the-art methods.
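
The late fusion step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that fusion means concatenating the three embedding vectors, that "five layers" means four hidden dense layers plus one output layer, and that batch normalization uses plain batch statistics without learned parameters. The embedding sizes (768, 1024, 384), the hidden widths, the random weights, and all function names are hypothetical placeholders. Random vectors stand in for the real ConvNeXt, RegNet, and SBERT embeddings.

```python
import numpy as np

NUM_CLASSES = 35  # word categories in GSCD version 2, per the abstract


def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch (no learned scale or shift)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_dense_stack(in_dim, hidden_dims, num_classes, rng):
    """Create (weight, bias) pairs for the hidden layers plus the output layer."""
    dims = [in_dim, *hidden_dims, num_classes]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


def late_fusion_forward(convnext_emb, regnet_emb, sbert_emb, layers):
    """Concatenate the per-modality embeddings and classify the fused vector."""
    # Assumed fusion operator: simple concatenation along the feature axis.
    x = np.concatenate([convnext_emb, regnet_emb, sbert_emb], axis=1)
    for w, b in layers[:-1]:
        x = np.maximum(batch_norm(x @ w + b), 0.0)  # dense -> batch norm -> ReLU
    w, b = layers[-1]
    logits = x @ w + b
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
batch = 8
# Placeholder embeddings; real ones would come from the pretrained models.
convnext_emb = rng.normal(size=(batch, 768))
regnet_emb = rng.normal(size=(batch, 1024))
sbert_emb = rng.normal(size=(batch, 384))

layers = init_dense_stack(768 + 1024 + 384, [512, 256, 128, 64], NUM_CLASSES, rng)
probs = late_fusion_forward(convnext_emb, regnet_emb, sbert_emb, layers)
predicted = probs.argmax(axis=1)  # one predicted class index per utterance
```

Dropping one of the three inputs from the concatenation, and shrinking the input dimension to match, gives the kind of single-modality or pairwise variant an ablation analysis would compare against the full fusion.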

Source journal

Computational Intelligence (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore

6.90

Self-citation rate

3.60%

Publication volume

Review time

>12 weeks

Journal description: This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues, from the tools and languages of AI to its philosophical implications, Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.