Dong-Jin Kim, Hyung-Min Park, Harksoo Kim, Seung-Hoon Na, Gerard Jounghyun Kim
{"title":"语音和语言人工智能技术特刊","authors":"Dong-Jin Kim, Hyung-Min Park, Harksoo Kim, Seung-Hoon Na, Gerard Jounghyun Kim","doi":"10.4218/etr2.12666","DOIUrl":null,"url":null,"abstract":"<p>Recent advancements in artificial intelligence (AI) have substantially improved applications that depend on human speech and language comprehension. Human speech, characterized by the articulation of thoughts and emotions through sounds, relies on language, a complex system that uses words and symbols for interpersonal communication. The rapid evolution of AI has amplified the demand for related solutions to swiftly and efficiently process extensive amounts of speech and language data. Speech and language technologies have emerged as major topics in AI research, improving the capacity of computers to comprehend text and spoken language by resembling human cognition. These technological breakthroughs have enabled computers to interpret human language, whether expressed in textual or spoken forms, unveiling the comprehensive meaning of the intentions, nuances, and emotional cues expressed by writers or speakers.</p><p><i>Electronics and Telecommunications Research Institute (ETRI) Journal</i> is a peer-reviewed open-access journal launched in 1993 and published bimonthly by ETRI, Republic of Korea. It is intended to promote worldwide academic exchange of research on information, telecommunications, and electronics.</p><p>This special is devoted to all aspects and future research directions in the rapidly progressing subject of speech and language technologies. In particular, this special issue highlights recent outstanding results on the application of AI techniques to understand speech or natural language. We selected 12 outstanding papers on three topics of speech and language technologies. Below, we provide a summary of commitments to this special issue.</p><p>The first paper [<span>1</span>] “Towards a small language model powered chain-of-reasoning for open-domain question answering” by Roh and others focuses on open-domain question-answering tasks that involve a chain of reasoning primarily implemented using large language models. Emphasizing cost effectiveness, the authors introduce EffiChainQA, an architecture centered on the use of small language models. They employ a retrieval-based language model that is known to address the hallucination issue and incorporates up-to-date knowledge, thereby addressing common limitations of larger language models. In addition, they introduce a question decomposer that leverages a generative language model and is essential for enhanced chain of reasoning.</p><p>In the second paper in this special issue [<span>2</span>], “CR-M-SpanBERT: Multiple-embedding-based DNN Coreference Resolution Using Self-attention SpanBERT” by Jung, a model is proposed to incorporate multiple embeddings for coreference resolution based on the SpanBERT architecture. The experimental results show that multiple embeddings can improve the coreference resolution performance regardless of the employed baseline model, such as LSTM, BERT, and SpanBERT.</p><p>As automated essay scoring has evolved from handcrafted techniques to deep learning methods, holistic scoring has improved. However, assessing specific traits remains challenging because of the limited depth of existing methods to model dual assessments for holistic and multitrait tasks. 
To address this challenge, a paper in this special issue titled “Dual-Scale BERT using Multi-Trait Representations for Holistic and Trait-Specific Essay Grading” [<span>3</span>] by Cho and others explores comprehensive feedback while modeling the interconnections between holistic and trait representations. The authors introduce the DualBERT-Trans-CNN model, which combines transformer-based representations with a novel dual-scale BERT encoder at the document level. By explicitly leveraging multitrait representations in a multitask learning framework, they emphasize the interrelation between holistic and trait-based score predictions to improve accuracy.</p><p>The fourth paper in this special issue [<span>4</span>], “Named entity recognition using transfer learning and small human- and meta-pseudo-labeled datasets” by Bae and Lim, introduces a high-performance model for named entity recognition for written and spoken language. The authors use transfer learning to leverage the previously developed KorBERT model as the baseline to overcome the challenges related to labeled data scarcity and domain shifts. They also adopt a meta-pseudo-label method using a teacher/student framework with labeled and unlabeled data. Their model presents two innovations: the combination of loss functions from human- and pseudo-labeled data and the updating of the teacher model only when a threshold is not reached.</p><p>While deep learning approaches are of keen interest, combining and applying them to traditional language analysis is also worthy, especially to explain analysis outcomes. The fifth paper in this special issue [<span>5</span>], “Transformer-Based Reranking for Improving Korean Morphological Analysis Systems” by Ryu and others, introduces this approach to Korean morphological analysis by combining dictionary-based techniques with transformer-based deep learning models. In particular, they use the BERT-based reranking system that substantially enhances the accuracy of the traditional dictionary-based morphological analysis methods. Their results demonstrate considerable performance improvements and highlight advantages of combining analytical and probabilistic models for language processing applications.</p><p>The sixth paper in this special issue [<span>6</span>], “Framework for evaluating code generation ability of large language models” by Yeo and others, introduces a systematic framework for evaluating the code generation capabilities of large language models and presents the derivation of a new metric called <i>pass-rate</i>@<i>n</i>, which captures granular accuracy levels by considering test pass rates. The experimental results demonstrate the effectiveness of the evaluation framework, which can be integrated with real-world coding platforms.</p><p>Another notable contribution to this field is presented in the paper titled “KMSAV: Korean multi-speaker spontaneous audiovisual dataset” by Park and others [<span>7</span>]. This paper presents a rich and extensive database encompassing approximately 150 h of rigorously transcribed and annotated audio-visual data supplemented by a diverse trove of 2000 h of untranscribed YouTube videos. This open-access corpus, accompanied by a tailored open-source framework, is validated through an evaluation using cutting-edge automatic and audio-visual speech recognition techniques.</p><p>The application of speech and language AI techniques to the clinical and medical domains has gathered research interest. 
The eighth paper [<span>8</span>], “Alzheimer's disease recognition from spontaneous speech using large language models” by Bang and others, presents the innovation of using large language models for predicting Alzhemier's disease by extensively using evaluation feedback generated by ChatGPT from image descriptions provided by potential patients. The feedback is used as an additional feature for speech multimodal transformer blocks. Experimental results demonstrate substantial improvements by leveraging the evaluation feedback from ChatGPT, thereby motivating the use of large language models for diagnosing some diseases.</p><p>The ninth paper [<span>9</span>], “Joint streaming model for backchannel prediction and automatic speech recognition” by Choi and others, addresses a crucial aspect of human conversation: the timely use of conversation backchannels such as “uh-huh” or “yeah.” This paper introduces a novel method that combines backchannel prediction with real-time speech recognition using a streaming transformer and multitask learning. The results show substantial improvements over existing methods, particularly in streaming scenarios, marking a substantial advancement toward more natural and engaging human–machine interactions.</p><p>The use of high-quality and adequate data for addressed application tasks is key to achieve stable high performance. The tenth paper in this special issue [<span>10</span>], “Spoken-to-written text conversion for enhancement of Korean–English readability and machine translation” by Choi and others, addresses the problem that Korean text produced by automatic speech recognition is often not presented in the written but in the spoken form, particularly when including numeric expressions and English words. Consequently, frequent ambiguities occur in similar types of errors for automatic speech translation. To mitigate these common types of errors, the authors propose a Korean spoken-to-written transcription conversion method trained on a large-scale dataset containing 8.6 million sentences formatted in a transcription style that aligns the written and spoken forms of text segments. Using the transcription conversion, substantial improvements in automatic speech translation from Korean to English are achieved, demonstrating the importance of high-quality task-aware data for properly training AI models.</p><p>The landscape of multimodal speech recognition has been drastically reshaped by the latest breakthroughs in deep learning. The paper titled “Multimodal Audiovisual Speech Recognition Architecture Using a Three-feature Multifusion Method for Noise-robust Systems” by Jeon and others addresses challenges of speech recognition in diverse noisy environments [<span>11</span>]. This paper presents an audio-visual speech recognition model that emulates human dialogue recognition, showing remarkable robustness across synthesized environments at nine different noise levels. By integrating audio and visual elements through a dense spatial–temporal convolutional neural network, the model achieves a substantially lower error rate than traditional methods. 
This study may pave the way for enhanced speech recognition services with both stability and improved recognition rates in noisy environments.</p><p>Language tutoring systems for nonnative speakers have taken a significant leap forward with the development of advanced end-to-end methods for automatic speech recognition and proficiency evaluation, as presented in the paper [<span>12</span>], “AI-based language tutoring systems with end-to-end automatic speech recognition and proficiency evaluation” by Kang and others. This paper details the creation of systems that proficiently assess and provide feedback on pronunciation and fluency using a combination of semisupervised and transfer learning techniques with diverse speech data. Highlighting its practical application, this study showcases two deployed systems, EBS AI PengTalk and KSI Korean AI Tutor, which enhance language learning for Korean elementary students and foreigners learning Korean, respectively.</p><p>The guest editors would like to thank all the authors, reviewers, and editorial staff of ETRI Journal for making this special issue successful. We are pleased to have been a part of the effort to timely publish high-quality technical papers. The presented studies on speech and language models will certainly contribute to the design and implementation of future AI systems.</p><p>The authors declare that there are no conflicts of interest.</p>","PeriodicalId":11901,"journal":{"name":"ETRI Journal","volume":"46 1","pages":"7-10"},"PeriodicalIF":1.3000,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.4218/etr2.12666","citationCount":"0","resultStr":"{\"title\":\"Special issue on speech and language AI technologies\",\"authors\":\"Dong-Jin Kim, Hyung-Min Park, Harksoo Kim, Seung-Hoon Na, Gerard Jounghyun Kim\",\"doi\":\"10.4218/etr2.12666\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Recent advancements in artificial intelligence (AI) have substantially improved applications that depend on human speech and language comprehension. Human speech, characterized by the articulation of thoughts and emotions through sounds, relies on language, a complex system that uses words and symbols for interpersonal communication. The rapid evolution of AI has amplified the demand for related solutions to swiftly and efficiently process extensive amounts of speech and language data. Speech and language technologies have emerged as major topics in AI research, improving the capacity of computers to comprehend text and spoken language by resembling human cognition. These technological breakthroughs have enabled computers to interpret human language, whether expressed in textual or spoken forms, unveiling the comprehensive meaning of the intentions, nuances, and emotional cues expressed by writers or speakers.</p><p><i>Electronics and Telecommunications Research Institute (ETRI) Journal</i> is a peer-reviewed open-access journal launched in 1993 and published bimonthly by ETRI, Republic of Korea. It is intended to promote worldwide academic exchange of research on information, telecommunications, and electronics.</p><p>This special is devoted to all aspects and future research directions in the rapidly progressing subject of speech and language technologies. In particular, this special issue highlights recent outstanding results on the application of AI techniques to understand speech or natural language. 
We selected 12 outstanding papers on three topics of speech and language technologies. Below, we provide a summary of commitments to this special issue.</p><p>The first paper [<span>1</span>] “Towards a small language model powered chain-of-reasoning for open-domain question answering” by Roh and others focuses on open-domain question-answering tasks that involve a chain of reasoning primarily implemented using large language models. Emphasizing cost effectiveness, the authors introduce EffiChainQA, an architecture centered on the use of small language models. They employ a retrieval-based language model that is known to address the hallucination issue and incorporates up-to-date knowledge, thereby addressing common limitations of larger language models. In addition, they introduce a question decomposer that leverages a generative language model and is essential for enhanced chain of reasoning.</p><p>In the second paper in this special issue [<span>2</span>], “CR-M-SpanBERT: Multiple-embedding-based DNN Coreference Resolution Using Self-attention SpanBERT” by Jung, a model is proposed to incorporate multiple embeddings for coreference resolution based on the SpanBERT architecture. The experimental results show that multiple embeddings can improve the coreference resolution performance regardless of the employed baseline model, such as LSTM, BERT, and SpanBERT.</p><p>As automated essay scoring has evolved from handcrafted techniques to deep learning methods, holistic scoring has improved. However, assessing specific traits remains challenging because of the limited depth of existing methods to model dual assessments for holistic and multitrait tasks. To address this challenge, a paper in this special issue titled “Dual-Scale BERT using Multi-Trait Representations for Holistic and Trait-Specific Essay Grading” [<span>3</span>] by Cho and others explores comprehensive feedback while modeling the interconnections between holistic and trait representations. The authors introduce the DualBERT-Trans-CNN model, which combines transformer-based representations with a novel dual-scale BERT encoder at the document level. By explicitly leveraging multitrait representations in a multitask learning framework, they emphasize the interrelation between holistic and trait-based score predictions to improve accuracy.</p><p>The fourth paper in this special issue [<span>4</span>], “Named entity recognition using transfer learning and small human- and meta-pseudo-labeled datasets” by Bae and Lim, introduces a high-performance model for named entity recognition for written and spoken language. The authors use transfer learning to leverage the previously developed KorBERT model as the baseline to overcome the challenges related to labeled data scarcity and domain shifts. They also adopt a meta-pseudo-label method using a teacher/student framework with labeled and unlabeled data. Their model presents two innovations: the combination of loss functions from human- and pseudo-labeled data and the updating of the teacher model only when a threshold is not reached.</p><p>While deep learning approaches are of keen interest, combining and applying them to traditional language analysis is also worthy, especially to explain analysis outcomes. 
The fifth paper in this special issue [<span>5</span>], “Transformer-Based Reranking for Improving Korean Morphological Analysis Systems” by Ryu and others, introduces this approach to Korean morphological analysis by combining dictionary-based techniques with transformer-based deep learning models. In particular, they use the BERT-based reranking system that substantially enhances the accuracy of the traditional dictionary-based morphological analysis methods. Their results demonstrate considerable performance improvements and highlight advantages of combining analytical and probabilistic models for language processing applications.</p><p>The sixth paper in this special issue [<span>6</span>], “Framework for evaluating code generation ability of large language models” by Yeo and others, introduces a systematic framework for evaluating the code generation capabilities of large language models and presents the derivation of a new metric called <i>pass-rate</i>@<i>n</i>, which captures granular accuracy levels by considering test pass rates. The experimental results demonstrate the effectiveness of the evaluation framework, which can be integrated with real-world coding platforms.</p><p>Another notable contribution to this field is presented in the paper titled “KMSAV: Korean multi-speaker spontaneous audiovisual dataset” by Park and others [<span>7</span>]. This paper presents a rich and extensive database encompassing approximately 150 h of rigorously transcribed and annotated audio-visual data supplemented by a diverse trove of 2000 h of untranscribed YouTube videos. This open-access corpus, accompanied by a tailored open-source framework, is validated through an evaluation using cutting-edge automatic and audio-visual speech recognition techniques.</p><p>The application of speech and language AI techniques to the clinical and medical domains has gathered research interest. The eighth paper [<span>8</span>], “Alzheimer's disease recognition from spontaneous speech using large language models” by Bang and others, presents the innovation of using large language models for predicting Alzhemier's disease by extensively using evaluation feedback generated by ChatGPT from image descriptions provided by potential patients. The feedback is used as an additional feature for speech multimodal transformer blocks. Experimental results demonstrate substantial improvements by leveraging the evaluation feedback from ChatGPT, thereby motivating the use of large language models for diagnosing some diseases.</p><p>The ninth paper [<span>9</span>], “Joint streaming model for backchannel prediction and automatic speech recognition” by Choi and others, addresses a crucial aspect of human conversation: the timely use of conversation backchannels such as “uh-huh” or “yeah.” This paper introduces a novel method that combines backchannel prediction with real-time speech recognition using a streaming transformer and multitask learning. The results show substantial improvements over existing methods, particularly in streaming scenarios, marking a substantial advancement toward more natural and engaging human–machine interactions.</p><p>The use of high-quality and adequate data for addressed application tasks is key to achieve stable high performance. 
The tenth paper in this special issue [<span>10</span>], “Spoken-to-written text conversion for enhancement of Korean–English readability and machine translation” by Choi and others, addresses the problem that Korean text produced by automatic speech recognition is often not presented in the written but in the spoken form, particularly when including numeric expressions and English words. Consequently, frequent ambiguities occur in similar types of errors for automatic speech translation. To mitigate these common types of errors, the authors propose a Korean spoken-to-written transcription conversion method trained on a large-scale dataset containing 8.6 million sentences formatted in a transcription style that aligns the written and spoken forms of text segments. Using the transcription conversion, substantial improvements in automatic speech translation from Korean to English are achieved, demonstrating the importance of high-quality task-aware data for properly training AI models.</p><p>The landscape of multimodal speech recognition has been drastically reshaped by the latest breakthroughs in deep learning. The paper titled “Multimodal Audiovisual Speech Recognition Architecture Using a Three-feature Multifusion Method for Noise-robust Systems” by Jeon and others addresses challenges of speech recognition in diverse noisy environments [<span>11</span>]. This paper presents an audio-visual speech recognition model that emulates human dialogue recognition, showing remarkable robustness across synthesized environments at nine different noise levels. By integrating audio and visual elements through a dense spatial–temporal convolutional neural network, the model achieves a substantially lower error rate than traditional methods. This study may pave the way for enhanced speech recognition services with both stability and improved recognition rates in noisy environments.</p><p>Language tutoring systems for nonnative speakers have taken a significant leap forward with the development of advanced end-to-end methods for automatic speech recognition and proficiency evaluation, as presented in the paper [<span>12</span>], “AI-based language tutoring systems with end-to-end automatic speech recognition and proficiency evaluation” by Kang and others. This paper details the creation of systems that proficiently assess and provide feedback on pronunciation and fluency using a combination of semisupervised and transfer learning techniques with diverse speech data. Highlighting its practical application, this study showcases two deployed systems, EBS AI PengTalk and KSI Korean AI Tutor, which enhance language learning for Korean elementary students and foreigners learning Korean, respectively.</p><p>The guest editors would like to thank all the authors, reviewers, and editorial staff of ETRI Journal for making this special issue successful. We are pleased to have been a part of the effort to timely publish high-quality technical papers. 
The presented studies on speech and language models will certainly contribute to the design and implementation of future AI systems.</p><p>The authors declare that there are no conflicts of interest.</p>\",\"PeriodicalId\":11901,\"journal\":{\"name\":\"ETRI Journal\",\"volume\":\"46 1\",\"pages\":\"7-10\"},\"PeriodicalIF\":1.3000,\"publicationDate\":\"2024-02-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.4218/etr2.12666\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ETRI Journal\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.4218/etr2.12666\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ETRI Journal","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.4218/etr2.12666","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
摘要
他们特别使用了基于 BERT 的重排系统,该系统大大提高了传统的基于词典的形态分析方法的准确性。本特刊的第六篇论文[6]是由 Yeo 等人撰写的 "大型语言模型代码生成能力评估框架",该论文介绍了一个评估大型语言模型代码生成能力的系统框架,并提出了一个名为 pass-rate@n 的新指标,该指标通过考虑测试通过率来捕捉细粒度的准确度水平。实验结果证明了评估框架的有效性,该框架可与现实世界的编码平台集成。Park 等人发表的题为 "KMSAV:韩国多说话者自发视听数据集 "的论文[7]是对该领域的另一个显著贡献。这篇论文介绍了一个丰富而广泛的数据库,其中包括约 150 小时经过严格转录和注释的视听数据,以及 2000 小时未经转录的 YouTube 视频。这个开放访问的语料库,伴随着一个量身定制的开源框架,通过使用最先进的自动和视听语音识别技术进行评估得到了验证。Bang 等人的第八篇论文[8]"使用大型语言模型从自发语音中识别阿尔茨海默病",通过广泛使用 ChatGPT 从潜在患者提供的图像描述中生成的评估反馈,提出了使用大型语言模型预测阿尔茨海默病的创新方法。反馈被用作语音多模态转换器块的附加特征。实验结果表明,利用 ChatGPT 的评估反馈可以大大提高诊断效果,从而推动了大型语言模型在某些疾病诊断中的应用。第九篇论文[9]是 Choi 等人撰写的 "用于后信道预测和自动语音识别的联合流模型",该论文探讨了人类对话的一个重要方面:及时使用 "嗯 "或 "呀 "等对话后信道。本文介绍了一种新方法,它利用流转换器和多任务学习将后信道预测与实时语音识别结合起来。研究结果表明,与现有方法相比,尤其是在流场景中,语音识别率有了大幅提高,这标志着在实现更自然、更吸引人的人机交互方面取得了实质性进展。本特刊的第十篇论文[10],即 Choi 等人撰写的 "为提高韩英可读性和机器翻译而进行的口语到书面文本转换",解决了由自动语音识别生成的韩语文本通常不是以书面形式而是以口语形式呈现的问题,尤其是在包含数字表达式和英语单词时。因此,在自动语音翻译中经常出现类似类型的歧义错误。为了减少这些常见错误,作者提出了一种韩语口语到书面语转录转换方法,该方法是在一个大规模数据集上训练出来的,该数据集包含 860 万个句子,其格式为转录风格,将文本片段的书面和口语形式统一起来。通过转录转换,韩语到英语的自动语音翻译得到了大幅改善,这表明高质量的任务感知数据对于正确训练人工智能模型的重要性。Jeon 等人撰写的题为 "Multimodal Audiovisual Speech Recognition Architecture Using a Three-feature Multifusion Method for Noise-robust Systems"(多模态视听语音识别架构,用于噪声抑制系统的三特征多融合方法)的论文,解决了在各种噪声环境下语音识别所面临的挑战[11]。本文提出了一种视听语音识别模型,该模型可模仿人类对话识别,在九种不同噪音水平的合成环境中表现出显著的鲁棒性。通过密集的时空卷积神经网络整合音频和视觉元素,该模型的错误率大大低于传统方法。这项研究可为增强语音识别服务铺平道路,使其在嘈杂环境中既能保持稳定,又能提高识别率。 随着先进的端到端自动语音识别和能力评估方法的开发,针对非母语人士的语言辅导系统取得了重大飞跃,正如 Kang 等人在论文[12]"基于人工智能的端到端自动语音识别和能力评估语言辅导系统 "中所介绍的那样。这篇论文详细介绍了如何结合半监督和迁移学习技术,利用不同的语音数据,创建能够熟练评估和反馈发音和流利程度的系统。为了突出其实际应用,本研究展示了两个已部署的系统:EBS AI PengTalk 和 KSI Korean AI Tutor,这两个系统分别提高了韩国小学生和学习韩语的外国人的语言学习能力。我们很高兴能参与其中,及时发表高质量的技术论文。所介绍的语音和语言模型研究必将有助于未来人工智能系统的设计和实施。
Special issue on speech and language AI technologies
Recent advancements in artificial intelligence (AI) have substantially improved applications that depend on human speech and language comprehension. Human speech, characterized by the articulation of thoughts and emotions through sounds, relies on language, a complex system that uses words and symbols for interpersonal communication. The rapid evolution of AI has amplified the demand for solutions that swiftly and efficiently process extensive amounts of speech and language data. Speech and language technologies have emerged as major topics in AI research, improving the capacity of computers to comprehend text and spoken language in ways that resemble human cognition. These technological breakthroughs have enabled computers to interpret human language, whether expressed in textual or spoken form, revealing the intentions, nuances, and emotional cues expressed by writers or speakers.
Electronics and Telecommunications Research Institute (ETRI) Journal is a peer-reviewed open-access journal launched in 1993 and published bimonthly by ETRI, Republic of Korea. It is intended to promote worldwide academic exchange of research on information, telecommunications, and electronics.
This special issue is devoted to all aspects and future research directions of the rapidly progressing field of speech and language technologies. In particular, it highlights recent outstanding results on the application of AI techniques to understanding speech and natural language. We selected 12 outstanding papers on three topics of speech and language technologies. Below, we provide a summary of the contributions to this special issue.
The first paper [1], “Towards a small language model powered chain-of-reasoning for open-domain question answering” by Roh and others, focuses on open-domain question-answering tasks that involve a chain of reasoning, which has primarily been implemented using large language models. Emphasizing cost effectiveness, the authors introduce EffiChainQA, an architecture centered on the use of small language models. They employ a retrieval-based language model, which is known to mitigate hallucination and incorporate up-to-date knowledge, thereby addressing common limitations of larger language models. In addition, they introduce a question decomposer that leverages a generative language model and is essential for an enhanced chain of reasoning.
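To make the retrieve-and-reason flow concrete, the sketch below outlines one plausible arrangement of the components named above. The helper callables (decomposer, retriever, reader) are hypothetical stand-ins; the actual EffiChainQA modules will differ in detail.

```python
# A schematic decompose-retrieve-answer loop; all components are
# assumed callables, not the authors' actual implementation.
def chain_of_reasoning_qa(question, decomposer, retriever, reader):
    sub_questions = decomposer(question)  # generative small LM splits the question
    answers = []
    for sq in sub_questions:
        docs = retriever(sq)              # retrieval grounds each step, curbing hallucination
        answers.append(reader(sq, docs))  # small reader LM answers the sub-question
    return answers[-1]                    # the final hop answers the original question
```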
In the second paper in this special issue [2], “CR-M-SpanBERT: Multiple-embedding-based DNN Coreference Resolution Using Self-attention SpanBERT” by Jung, a model is proposed that incorporates multiple embeddings for coreference resolution based on the SpanBERT architecture. The experimental results show that multiple embeddings improve coreference resolution performance regardless of the baseline model employed, whether LSTM, BERT, or SpanBERT.
As automated essay scoring has evolved from handcrafted techniques to deep learning methods, holistic scoring has improved. However, assessing specific traits remains challenging because existing methods lack the depth to model dual assessments for holistic and multitrait tasks. To address this challenge, a paper in this special issue titled “Dual-Scale BERT using Multi-Trait Representations for Holistic and Trait-Specific Essay Grading” [3] by Cho and others explores comprehensive feedback while modeling the interconnections between holistic and trait representations. The authors introduce the DualBERT-Trans-CNN model, which combines transformer-based representations with a novel dual-scale BERT encoder at the document level. By explicitly leveraging multitrait representations in a multitask learning framework, they emphasize the interrelation between holistic and trait-based score predictions to improve accuracy.
The fourth paper in this special issue [4], “Named entity recognition using transfer learning and small human- and meta-pseudo-labeled datasets” by Bae and Lim, introduces a high-performance model for named entity recognition in written and spoken language. The authors use transfer learning, leveraging the previously developed KorBERT model as the baseline, to overcome the challenges of labeled-data scarcity and domain shifts. They also adopt a meta-pseudo-label method using a teacher/student framework with labeled and unlabeled data. Their model presents two innovations: the combination of loss functions from human- and pseudo-labeled data, and the updating of the teacher model only when a threshold is not reached.
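The two innovations lend themselves to a compact sketch. The PyTorch-style training step below is a paraphrase under stated assumptions: the loss-combination weight, the accuracy criterion, and the threshold rule are illustrative, not the authors' exact design.

```python
# Minimal meta-pseudo-label training step (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def train_step(teacher, student, opt_s, opt_t,
               labeled_x, labeled_y, unlabeled_x,
               alpha=0.5, threshold=0.9):
    # 1) Teacher produces pseudo-labels for unlabeled data.
    with torch.no_grad():
        pseudo_y = teacher(unlabeled_x).argmax(dim=-1)

    # 2) First innovation: combine losses from human- and pseudo-labeled data.
    loss_human = F.cross_entropy(student(labeled_x), labeled_y)
    loss_pseudo = F.cross_entropy(student(unlabeled_x), pseudo_y)
    loss_s = loss_human + alpha * loss_pseudo
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()

    # 3) Second innovation (paraphrased): update the teacher only while a
    #    threshold is not reached; the exact criterion may differ.
    with torch.no_grad():
        acc = (student(labeled_x).argmax(-1) == labeled_y).float().mean()
    if acc < threshold:
        loss_t = F.cross_entropy(teacher(labeled_x), labeled_y)
        opt_t.zero_grad(); loss_t.backward(); opt_t.step()
```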
While deep learning approaches attract keen interest, combining them with traditional language analysis is also worthwhile, especially for explaining analysis outcomes. The fifth paper in this special issue [5], “Transformer-Based Reranking for Improving Korean Morphological Analysis Systems” by Ryu and others, introduces this approach to Korean morphological analysis by combining dictionary-based techniques with transformer-based deep learning models. In particular, they use a BERT-based reranking system that substantially enhances the accuracy of traditional dictionary-based morphological analysis methods. Their results demonstrate considerable performance improvements and highlight the advantages of combining analytical and probabilistic models for language processing applications.
The sixth paper in this special issue [6], “Framework for evaluating code generation ability of large language models” by Yeo and others, introduces a systematic framework for evaluating the code generation capabilities of large language models and presents the derivation of a new metric called pass-rate@n, which captures granular accuracy levels by considering test pass rates. The experimental results demonstrate the effectiveness of the evaluation framework, which can be integrated with real-world coding platforms.
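Since the paper's exact formulation of pass-rate@n is not reproduced here, the snippet below shows one plausible reading: each of the n generated solutions is scored by the fraction of unit tests it passes, and these fractions are averaged. The function signature is hypothetical.

```python
from typing import Callable, Sequence

def pass_rate_at_n(
    candidates: Sequence[str],               # n generated solutions for one problem
    tests: Sequence[Callable[[str], bool]],  # each returns True if the solution passes
) -> float:
    """Average fraction of unit tests passed across the n candidates.

    Unlike pass@n, which only checks whether at least one candidate
    passes *all* tests, this credits partial correctness.
    """
    if not candidates or not tests:
        return 0.0
    rates = [sum(t(c) for t in tests) / len(tests) for c in candidates]
    return sum(rates) / len(rates)
```

Under this reading, a candidate passing 8 of 10 tests contributes 0.8 rather than 0, which is the kind of granular accuracy level the metric is said to capture.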
Another notable contribution to this field is presented in the paper titled “KMSAV: Korean multi-speaker spontaneous audiovisual dataset” by Park and others [7]. This paper presents a rich and extensive database encompassing approximately 150 h of rigorously transcribed and annotated audio-visual data supplemented by a diverse trove of 2000 h of untranscribed YouTube videos. This open-access corpus, accompanied by a tailored open-source framework, is validated through an evaluation using cutting-edge automatic and audio-visual speech recognition techniques.
The application of speech and language AI techniques to the clinical and medical domains has garnered research interest. The eighth paper [8], “Alzheimer's disease recognition from spontaneous speech using large language models” by Bang and others, presents an innovative use of large language models for predicting Alzheimer's disease: evaluation feedback generated by ChatGPT from image descriptions provided by potential patients is used extensively. The feedback serves as an additional feature for speech multimodal transformer blocks. Experimental results demonstrate substantial improvements from leveraging the evaluation feedback from ChatGPT, thereby motivating the use of large language models for diagnosing some diseases.
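How a text-based evaluation signal can enter a speech pipeline is easy to sketch. The module below is a minimal, assumption-laden illustration rather than the paper's architecture: a pooled embedding of the ChatGPT feedback is concatenated with frame-level speech features before the transformer blocks.

```python
# Illustrative fusion of a pooled feedback embedding with speech features.
import torch
import torch.nn as nn

class FeedbackFusion(nn.Module):
    def __init__(self, speech_dim: int, text_dim: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(speech_dim + text_dim, d_model)

    def forward(self, speech_feats, feedback_emb):
        # speech_feats: (batch, frames, speech_dim)
        # feedback_emb: (batch, text_dim) -- pooled embedding of the evaluation text
        expanded = feedback_emb.unsqueeze(1).expand(-1, speech_feats.size(1), -1)
        return self.proj(torch.cat([speech_feats, expanded], dim=-1))
```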
The ninth paper [9], “Joint streaming model for backchannel prediction and automatic speech recognition” by Choi and others, addresses a crucial aspect of human conversation: the timely use of conversational backchannels such as “uh-huh” or “yeah.” This paper introduces a novel method that combines backchannel prediction with real-time speech recognition using a streaming transformer and multitask learning. The results show substantial improvements over existing methods, particularly in streaming scenarios, marking a notable advancement toward more natural and engaging human–machine interactions.
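The core multitask design, a shared streaming encoder feeding two task heads, can be sketched as follows; module names, head shapes, and the loss weighting are assumptions for illustration, not the authors' exact configuration. Training would combine an ASR loss (for example, CTC on the token logits) with a cross-entropy loss on the backchannel logits.

```python
# Schematic joint model: one streaming encoder, two heads (illustrative).
import torch.nn as nn

class JointASRBackchannel(nn.Module):
    def __init__(self, encoder: nn.Module, d_model: int,
                 vocab_size: int, n_bc_classes: int = 2):
        super().__init__()
        self.encoder = encoder                            # streaming transformer encoder
        self.asr_head = nn.Linear(d_model, vocab_size)    # per-frame token logits (e.g., for CTC)
        self.bc_head = nn.Linear(d_model, n_bc_classes)   # backchannel prediction per frame

    def forward(self, feats):
        h = self.encoder(feats)          # assumed shape: (batch, frames, d_model)
        return self.asr_head(h), self.bc_head(h)
```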
Using high-quality, adequate data for the targeted application tasks is key to achieving stable, high performance. The tenth paper in this special issue [10], “Spoken-to-written text conversion for enhancement of Korean–English readability and machine translation” by Choi and others, addresses the problem that Korean text produced by automatic speech recognition is often presented not in the written but in the spoken form, particularly when it includes numeric expressions and English words. Consequently, ambiguities and related errors frequently arise in automatic speech translation. To mitigate these common error types, the authors propose a Korean spoken-to-written transcription conversion method trained on a large-scale dataset containing 8.6 million sentences formatted in a transcription style that aligns the written and spoken forms of text segments. Using the transcription conversion, substantial improvements in automatic speech translation from Korean to English are achieved, demonstrating the importance of high-quality task-aware data for properly training AI models.
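To illustrate the phenomenon, the constructed pairs below show spoken-form ASR output on the left and the desired written form on the right. These are illustrative examples in the spirit of the task, not items from the paper's dataset.

```python
# Constructed spoken-form -> written-form pairs (not from the paper's data).
spoken_to_written = {
    "이천이십사 년 이 월 이십팔 일": "2024년 2월 28일",  # spelled-out date -> digits
    "팔백육십만 문장": "860만 문장",                      # spelled-out numeral -> digits
    "에이아이 모델": "AI 모델",                           # transliterated English -> Latin script
}
```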
The landscape of multimodal speech recognition has been drastically reshaped by the latest breakthroughs in deep learning. The paper titled “Multimodal Audiovisual Speech Recognition Architecture Using a Three-feature Multifusion Method for Noise-robust Systems” by Jeon and others addresses challenges of speech recognition in diverse noisy environments [11]. This paper presents an audio-visual speech recognition model that emulates human dialogue recognition, showing remarkable robustness across synthesized environments at nine different noise levels. By integrating audio and visual elements through a dense spatial–temporal convolutional neural network, the model achieves a substantially lower error rate than traditional methods. This study may pave the way for enhanced speech recognition services with both stability and improved recognition rates in noisy environments.
Language tutoring systems for nonnative speakers have taken a significant leap forward with the development of advanced end-to-end methods for automatic speech recognition and proficiency evaluation, as presented in the paper [12], “AI-based language tutoring systems with end-to-end automatic speech recognition and proficiency evaluation” by Kang and others. This paper details the creation of systems that proficiently assess and provide feedback on pronunciation and fluency using a combination of semisupervised and transfer learning techniques with diverse speech data. Highlighting its practical application, this study showcases two deployed systems, EBS AI PengTalk and KSI Korean AI Tutor, which enhance language learning for Korean elementary students and foreigners learning Korean, respectively.
The guest editors would like to thank all the authors, reviewers, and editorial staff of ETRI Journal for making this special issue successful. We are pleased to have been a part of the effort to publish high-quality technical papers in a timely manner. The presented studies on speech and language models will certainly contribute to the design and implementation of future AI systems.
The authors declare that there are no conflicts of interest.
Journal introduction:
ETRI Journal is an international, peer-reviewed multidisciplinary journal published bimonthly in English. The main focus of the journal is to provide an open forum to exchange innovative ideas and technology in the fields of information, telecommunications, and electronics.
Key topics of interest include high-performance computing, big data analytics, cloud computing, multimedia technology, communication networks and services, wireless communications and mobile computing, material and component technology, as well as security.
With an international editorial committee and experts from around the world as reviewers, ETRI Journal publishes high-quality research papers on the latest and best developments from the global community.