Special issue on speech and language AI technologies

Dong-Jin Kim, Hyung-Min Park, Harksoo Kim, Seung-Hoon Na, Gerard Jounghyun Kim

DOI: 10.4218/etr2.12666
{"title":"Special issue on speech and language AI technologies","authors":"Dong-Jin Kim, Hyung-Min Park, Harksoo Kim, Seung-Hoon Na, Gerard Jounghyun Kim","doi":"10.4218/etr2.12666","DOIUrl":null,"url":null,"abstract":"<p>Recent advancements in artificial intelligence (AI) have substantially improved applications that depend on human speech and language comprehension. Human speech, characterized by the articulation of thoughts and emotions through sounds, relies on language, a complex system that uses words and symbols for interpersonal communication. The rapid evolution of AI has amplified the demand for related solutions to swiftly and efficiently process extensive amounts of speech and language data. Speech and language technologies have emerged as major topics in AI research, improving the capacity of computers to comprehend text and spoken language by resembling human cognition. These technological breakthroughs have enabled computers to interpret human language, whether expressed in textual or spoken forms, unveiling the comprehensive meaning of the intentions, nuances, and emotional cues expressed by writers or speakers.</p><p><i>Electronics and Telecommunications Research Institute (ETRI) Journal</i> is a peer-reviewed open-access journal launched in 1993 and published bimonthly by ETRI, Republic of Korea. It is intended to promote worldwide academic exchange of research on information, telecommunications, and electronics.</p><p>This special is devoted to all aspects and future research directions in the rapidly progressing subject of speech and language technologies. In particular, this special issue highlights recent outstanding results on the application of AI techniques to understand speech or natural language. We selected 12 outstanding papers on three topics of speech and language technologies. Below, we provide a summary of commitments to this special issue.</p><p>The first paper [<span>1</span>] “Towards a small language model powered chain-of-reasoning for open-domain question answering” by Roh and others focuses on open-domain question-answering tasks that involve a chain of reasoning primarily implemented using large language models. Emphasizing cost effectiveness, the authors introduce EffiChainQA, an architecture centered on the use of small language models. They employ a retrieval-based language model that is known to address the hallucination issue and incorporates up-to-date knowledge, thereby addressing common limitations of larger language models. In addition, they introduce a question decomposer that leverages a generative language model and is essential for enhanced chain of reasoning.</p><p>In the second paper in this special issue [<span>2</span>], “CR-M-SpanBERT: Multiple-embedding-based DNN Coreference Resolution Using Self-attention SpanBERT” by Jung, a model is proposed to incorporate multiple embeddings for coreference resolution based on the SpanBERT architecture. The experimental results show that multiple embeddings can improve the coreference resolution performance regardless of the employed baseline model, such as LSTM, BERT, and SpanBERT.</p><p>As automated essay scoring has evolved from handcrafted techniques to deep learning methods, holistic scoring has improved. However, assessing specific traits remains challenging because of the limited depth of existing methods to model dual assessments for holistic and multitrait tasks. 
To address this challenge, a paper in this special issue titled “Dual-Scale BERT using Multi-Trait Representations for Holistic and Trait-Specific Essay Grading” [<span>3</span>] by Cho and others explores comprehensive feedback while modeling the interconnections between holistic and trait representations. The authors introduce the DualBERT-Trans-CNN model, which combines transformer-based representations with a novel dual-scale BERT encoder at the document level. By explicitly leveraging multitrait representations in a multitask learning framework, they emphasize the interrelation between holistic and trait-based score predictions to improve accuracy.</p><p>The fourth paper in this special issue [<span>4</span>], “Named entity recognition using transfer learning and small human- and meta-pseudo-labeled datasets” by Bae and Lim, introduces a high-performance model for named entity recognition for written and spoken language. The authors use transfer learning to leverage the previously developed KorBERT model as the baseline to overcome the challenges related to labeled data scarcity and domain shifts. They also adopt a meta-pseudo-label method using a teacher/student framework with labeled and unlabeled data. Their model presents two innovations: the combination of loss functions from human- and pseudo-labeled data and the updating of the teacher model only when a threshold is not reached.</p><p>While deep learning approaches are of keen interest, combining and applying them to traditional language analysis is also worthy, especially to explain analysis outcomes. The fifth paper in this special issue [<span>5</span>], “Transformer-Based Reranking for Improving Korean Morphological Analysis Systems” by Ryu and others, introduces this approach to Korean morphological analysis by combining dictionary-based techniques with transformer-based deep learning models. In particular, they use the BERT-based reranking system that substantially enhances the accuracy of the traditional dictionary-based morphological analysis methods. Their results demonstrate considerable performance improvements and highlight advantages of combining analytical and probabilistic models for language processing applications.</p><p>The sixth paper in this special issue [<span>6</span>], “Framework for evaluating code generation ability of large language models” by Yeo and others, introduces a systematic framework for evaluating the code generation capabilities of large language models and presents the derivation of a new metric called <i>pass-rate</i>@<i>n</i>, which captures granular accuracy levels by considering test pass rates. The experimental results demonstrate the effectiveness of the evaluation framework, which can be integrated with real-world coding platforms.</p><p>Another notable contribution to this field is presented in the paper titled “KMSAV: Korean multi-speaker spontaneous audiovisual dataset” by Park and others [<span>7</span>]. This paper presents a rich and extensive database encompassing approximately 150 h of rigorously transcribed and annotated audio-visual data supplemented by a diverse trove of 2000 h of untranscribed YouTube videos. This open-access corpus, accompanied by a tailored open-source framework, is validated through an evaluation using cutting-edge automatic and audio-visual speech recognition techniques.</p><p>The application of speech and language AI techniques to the clinical and medical domains has gathered research interest. 
The eighth paper [<span>8</span>], “Alzheimer's disease recognition from spontaneous speech using large language models” by Bang and others, presents the innovation of using large language models for predicting Alzhemier's disease by extensively using evaluation feedback generated by ChatGPT from image descriptions provided by potential patients. The feedback is used as an additional feature for speech multimodal transformer blocks. Experimental results demonstrate substantial improvements by leveraging the evaluation feedback from ChatGPT, thereby motivating the use of large language models for diagnosing some diseases.</p><p>The ninth paper [<span>9</span>], “Joint streaming model for backchannel prediction and automatic speech recognition” by Choi and others, addresses a crucial aspect of human conversation: the timely use of conversation backchannels such as “uh-huh” or “yeah.” This paper introduces a novel method that combines backchannel prediction with real-time speech recognition using a streaming transformer and multitask learning. The results show substantial improvements over existing methods, particularly in streaming scenarios, marking a substantial advancement toward more natural and engaging human–machine interactions.</p><p>The use of high-quality and adequate data for addressed application tasks is key to achieve stable high performance. The tenth paper in this special issue [<span>10</span>], “Spoken-to-written text conversion for enhancement of Korean–English readability and machine translation” by Choi and others, addresses the problem that Korean text produced by automatic speech recognition is often not presented in the written but in the spoken form, particularly when including numeric expressions and English words. Consequently, frequent ambiguities occur in similar types of errors for automatic speech translation. To mitigate these common types of errors, the authors propose a Korean spoken-to-written transcription conversion method trained on a large-scale dataset containing 8.6 million sentences formatted in a transcription style that aligns the written and spoken forms of text segments. Using the transcription conversion, substantial improvements in automatic speech translation from Korean to English are achieved, demonstrating the importance of high-quality task-aware data for properly training AI models.</p><p>The landscape of multimodal speech recognition has been drastically reshaped by the latest breakthroughs in deep learning. The paper titled “Multimodal Audiovisual Speech Recognition Architecture Using a Three-feature Multifusion Method for Noise-robust Systems” by Jeon and others addresses challenges of speech recognition in diverse noisy environments [<span>11</span>]. This paper presents an audio-visual speech recognition model that emulates human dialogue recognition, showing remarkable robustness across synthesized environments at nine different noise levels. By integrating audio and visual elements through a dense spatial–temporal convolutional neural network, the model achieves a substantially lower error rate than traditional methods. 
This study may pave the way for enhanced speech recognition services with both stability and improved recognition rates in noisy environments.</p><p>Language tutoring systems for nonnative speakers have taken a significant leap forward with the development of advanced end-to-end methods for automatic speech recognition and proficiency evaluation, as presented in the paper [<span>12</span>], “AI-based language tutoring systems with end-to-end automatic speech recognition and proficiency evaluation” by Kang and others. This paper details the creation of systems that proficiently assess and provide feedback on pronunciation and fluency using a combination of semisupervised and transfer learning techniques with diverse speech data. Highlighting its practical application, this study showcases two deployed systems, EBS AI PengTalk and KSI Korean AI Tutor, which enhance language learning for Korean elementary students and foreigners learning Korean, respectively.</p><p>The guest editors would like to thank all the authors, reviewers, and editorial staff of ETRI Journal for making this special issue successful. We are pleased to have been a part of the effort to timely publish high-quality technical papers. The presented studies on speech and language models will certainly contribute to the design and implementation of future AI systems.</p><p>The authors declare that there are no conflicts of interest.</p>","PeriodicalId":11901,"journal":{"name":"ETRI Journal","volume":"46 1","pages":"7-10"},"PeriodicalIF":1.3000,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.4218/etr2.12666","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ETRI Journal","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.4218/etr2.12666","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
Abstract
Recent advancements in artificial intelligence (AI) have substantially improved applications that depend on human speech and language comprehension. Human speech, characterized by the articulation of thoughts and emotions through sounds, relies on language, a complex system that uses words and symbols for interpersonal communication. The rapid evolution of AI has amplified the demand for solutions that swiftly and efficiently process extensive amounts of speech and language data. Speech and language technologies have emerged as major topics in AI research, improving the capacity of computers to comprehend text and spoken language in ways that resemble human cognition. These technological breakthroughs have enabled computers to interpret human language, whether expressed in textual or spoken form, revealing the full meaning of the intentions, nuances, and emotional cues expressed by writers or speakers.
Electronics and Telecommunications Research Institute (ETRI) Journal is a peer-reviewed open-access journal launched in 1993 and published bimonthly by ETRI, Republic of Korea. It is intended to promote worldwide academic exchange of research on information, telecommunications, and electronics.
This special issue is devoted to all aspects of, and future research directions in, the rapidly progressing field of speech and language technologies. In particular, it highlights recent outstanding results on the application of AI techniques to understanding speech and natural language. We selected 12 outstanding papers spanning three topics in speech and language technologies. Below, we summarize the contributions to this special issue.
The first paper [1], “Towards a small language model powered chain-of-reasoning for open-domain question answering” by Roh and others, focuses on open-domain question-answering tasks that involve a chain of reasoning, which is primarily implemented using large language models. Emphasizing cost effectiveness, the authors introduce EffiChainQA, an architecture centered on small language models. They employ a retrieval-based language model, which is known to mitigate hallucination and to incorporate up-to-date knowledge, thereby addressing common limitations of larger language models. In addition, they introduce a question decomposer that leverages a generative language model and is essential for an enhanced chain of reasoning.
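For readers unfamiliar with this style of architecture, the following minimal sketch illustrates a decompose-retrieve-answer loop in the spirit of EffiChainQA. The function names and pipeline order are illustrative assumptions, not the authors' published interface.

```python
def chain_of_reasoning_qa(question, decomposer, retriever, reader):
    """Answer a multi-hop question with small, specialized models.

    `decomposer`, `retriever`, and `reader` are placeholders for the
    three small models; their exact interfaces here are assumptions."""
    answer, context = None, []
    # A generative small model splits the question into simpler hops.
    for sub_q in decomposer(question):
        # Retrieval grounds each hop in external documents, which helps
        # mitigate hallucination in the reader model.
        docs = retriever(sub_q)
        answer = reader(sub_q, docs + context)
        context.append(f"{sub_q} -> {answer}")
    # The answer to the final hop serves as the overall answer.
    return answer, context
```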
In the second paper in this special issue [2], “CR-M-SpanBERT: Multiple-embedding-based DNN Coreference Resolution Using Self-attention SpanBERT” by Jung, a model is proposed that incorporates multiple embeddings for coreference resolution based on the SpanBERT architecture. The experimental results show that multiple embeddings can improve coreference resolution performance regardless of the baseline model employed, whether LSTM, BERT, or SpanBERT.
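As a generic illustration of the multiple-embedding idea, the sketch below simply concatenates several per-token embedding views along the feature axis; CR-M-SpanBERT's exact fusion mechanism may differ.

```python
import torch

def combine_embeddings(views):
    """Concatenate per-token embedding views (e.g., contextual,
    character-level, and static word embeddings) along the feature axis."""
    # Each view: (batch, seq_len, dim_i) -> (batch, seq_len, sum(dim_i)).
    return torch.cat(views, dim=-1)

# Toy usage: three embedding views for a batch of 2 sentences, 5 tokens each.
contextual = torch.randn(2, 5, 768)  # SpanBERT-style contextual embeddings
characters = torch.randn(2, 5, 64)   # character-level embeddings
static_w2v = torch.randn(2, 5, 300)  # static word vectors
fused = combine_embeddings([contextual, characters, static_w2v])  # (2, 5, 1132)
```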
As automated essay scoring has evolved from handcrafted techniques to deep learning methods, holistic scoring has improved. However, assessing specific traits remains challenging because existing methods lack the depth to jointly model holistic and trait-specific assessments. To address this challenge, the third paper in this special issue [3], “Dual-Scale BERT using Multi-Trait Representations for Holistic and Trait-Specific Essay Grading” by Cho and others, explores comprehensive feedback while modeling the interconnections between holistic and trait representations. The authors introduce the DualBERT-Trans-CNN model, which combines transformer-based representations with a novel dual-scale BERT encoder at the document level. By explicitly leveraging multitrait representations in a multitask learning framework, they emphasize the interrelation between holistic and trait-based score predictions to improve accuracy.
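A multitask objective of this kind is often expressed as a weighted sum of a holistic-score loss and per-trait losses. The sketch below is a minimal illustration under that assumption; the loss functions and weighting are not the paper's exact formulation.

```python
import torch.nn.functional as F

def dual_scoring_loss(holistic_pred, holistic_gold,
                      trait_preds, trait_golds, trait_weight=0.5):
    """Combine a holistic essay-score loss with averaged per-trait losses.

    The MSE losses and the interpolation weight are illustrative
    assumptions, not the DualBERT-Trans-CNN paper's exact objective."""
    holistic_loss = F.mse_loss(holistic_pred, holistic_gold)
    trait_loss = sum(F.mse_loss(p, g) for p, g in zip(trait_preds, trait_golds))
    trait_loss = trait_loss / max(len(trait_preds), 1)
    return holistic_loss + trait_weight * trait_loss
```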
The fourth paper in this special issue [4], “Named entity recognition using transfer learning and small human- and meta-pseudo-labeled datasets” by Bae and Lim, introduces a high-performance model for named entity recognition in written and spoken language. The authors use transfer learning, with the previously developed KorBERT model as the baseline, to overcome the challenges of labeled-data scarcity and domain shift. They also adopt a meta-pseudo-label method that uses a teacher/student framework with labeled and unlabeled data. Their model presents two innovations: combining loss functions from human- and pseudo-labeled data, and updating the teacher model only when a threshold is not reached.
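At a high level, the two innovations can be sketched as follows. This is one plausible reading of the combined loss and the threshold-gated update, not the authors' exact training procedure.

```python
def student_loss(loss_human, loss_pseudo, pseudo_weight=0.3):
    """Blend the supervised loss on human labels with the loss on
    teacher-generated pseudo labels; the weight is an assumption."""
    return loss_human + pseudo_weight * loss_pseudo

def maybe_update_teacher(teacher_optimizer, teacher_feedback_loss,
                         threshold=0.1):
    """Update the teacher only while its feedback loss has not yet
    reached the quality threshold (one plausible reading of the rule)."""
    if teacher_feedback_loss.item() > threshold:
        teacher_feedback_loss.backward()
        teacher_optimizer.step()
        teacher_optimizer.zero_grad()
```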
While deep learning approaches are of keen interest, combining them with traditional language analysis is also worthwhile, especially for explaining analysis outcomes. The fifth paper in this special issue [5], “Transformer-Based Reranking for Improving Korean Morphological Analysis Systems” by Ryu and others, takes this approach to Korean morphological analysis by combining dictionary-based techniques with transformer-based deep learning models. In particular, they use a BERT-based reranking system that substantially enhances the accuracy of traditional dictionary-based morphological analysis methods. Their results demonstrate considerable performance improvements and highlight the advantages of combining analytical and probabilistic models for language processing applications.
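Reranking pipelines of this kind share a simple shape: a dictionary-based analyzer proposes candidates, and a learned model picks among them. The sketch below assumes a `scorer` callable that returns a plausibility score for a sentence/analysis pair; it is illustrative, not the paper's implementation.

```python
def rerank_analyses(sentence, candidates, scorer):
    """Rerank candidate morphological analyses from a dictionary-based
    analyzer using a learned scorer (e.g., a BERT-based model)."""
    scored = [(scorer(sentence, cand), cand) for cand in candidates]
    best_score, best_analysis = max(scored, key=lambda pair: pair[0])
    return best_analysis, best_score
```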
The sixth paper in this special issue [6], “Framework for evaluating code generation ability of large language models” by Yeo and others, introduces a systematic framework for evaluating the code generation capabilities of large language models and presents the derivation of a new metric called pass-rate@n, which captures granular accuracy levels by considering test pass rates. The experimental results demonstrate the effectiveness of the evaluation framework, which can be integrated with real-world coding platforms.
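As one plausible reading of the metric (not necessarily the authors' exact derivation), pass-rate@n averages the fraction of unit tests each of the n generated solutions passes, rather than checking only whether any solution passes all tests, as pass@n does:

```python
def pass_rate_at_n(test_pass_fractions):
    """Average the per-solution test pass fractions over n generations.

    `test_pass_fractions[i]` is the fraction of unit tests passed by the
    i-th generated solution for one problem. This is an illustrative
    reading of pass-rate@n, not the paper's exact definition."""
    n = len(test_pass_fractions)
    return sum(test_pass_fractions) / n if n else 0.0

# Toy usage: 3 solutions passing 100%, 40%, and 0% of the tests.
print(pass_rate_at_n([1.0, 0.4, 0.0]))  # ~0.467, whereas pass@3 would be 1
```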
Another notable contribution to this field is presented in the paper titled “KMSAV: Korean multi-speaker spontaneous audiovisual dataset” by Park and others [7]. This paper presents a rich and extensive database encompassing approximately 150 h of rigorously transcribed and annotated audio-visual data supplemented by a diverse trove of 2000 h of untranscribed YouTube videos. This open-access corpus, accompanied by a tailored open-source framework, is validated through an evaluation using cutting-edge automatic and audio-visual speech recognition techniques.
The application of speech and language AI techniques to clinical and medical domains has garnered research interest. The eighth paper [8], “Alzheimer's disease recognition from spontaneous speech using large language models” by Bang and others, presents the innovation of using large language models to predict Alzheimer's disease, extensively using evaluation feedback generated by ChatGPT from image descriptions provided by potential patients. The feedback is used as an additional feature for speech multimodal transformer blocks. Experimental results demonstrate substantial improvements from leveraging the evaluation feedback from ChatGPT, thereby motivating the use of large language models in diagnosing some diseases.
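One simple way to inject such feedback as an additional feature is to embed the feedback text and prepend it to the speech feature sequence before the transformer blocks. The sketch below is an illustrative assumption about the injection point, not the paper's architecture.

```python
import torch

def fuse_feedback(speech_features, feedback_embedding):
    """Prepend an LLM-feedback embedding to a speech feature sequence.

    speech_features: (batch, frames, dim); feedback_embedding: (batch, dim)."""
    feedback_token = feedback_embedding.unsqueeze(1)  # (batch, 1, dim)
    # The fused sequence then feeds the multimodal transformer blocks.
    return torch.cat([feedback_token, speech_features], dim=1)
```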
The ninth paper [9], “Joint streaming model for backchannel prediction and automatic speech recognition” by Choi and others, addresses a crucial aspect of human conversation: the timely use of conversation backchannels such as “uh-huh” or “yeah.” This paper introduces a novel method that combines backchannel prediction with real-time speech recognition using a streaming transformer and multitask learning. The results show substantial improvements over existing methods, particularly in streaming scenarios, marking a substantial advancement toward more natural and engaging human–machine interactions.
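In multitask setups like this, the shared streaming encoder is typically trained with an interpolated objective. The following one-liner is a generic illustration; the actual loss terms and weighting in the paper may differ.

```python
def joint_streaming_loss(asr_loss, backchannel_loss, alpha=0.2):
    """Interpolate the ASR loss (e.g., CTC or transducer) with the
    backchannel-prediction loss on a shared streaming encoder."""
    return (1 - alpha) * asr_loss + alpha * backchannel_loss
```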
The use of high-quality, task-appropriate data is key to achieving stable, high performance. The tenth paper in this special issue [10], “Spoken-to-written text conversion for enhancement of Korean–English readability and machine translation” by Choi and others, addresses the problem that Korean text produced by automatic speech recognition often appears in spoken rather than written form, particularly when it includes numeric expressions and English words. Consequently, such text frequently introduces ambiguities and errors into automatic speech translation. To mitigate these common error types, the authors propose a Korean spoken-to-written conversion method trained on a large-scale dataset of 8.6 million sentences whose transcription style aligns the written and spoken forms of text segments. Using this conversion, they achieve substantial improvements in automatic speech translation from Korean to English, demonstrating the importance of high-quality, task-aware data for properly training AI models.
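To make the task concrete, the toy lookup below shows the kind of spoken-to-written mappings such aligned data encodes (spelled-out numerals to digits, transliterated English back to Latin script). The paper trains a model on 8.6 million aligned sentences; this dictionary is only a hypothetical illustration.

```python
# Hypothetical spoken-form -> written-form mappings, for illustration only.
SPOKEN_TO_WRITTEN = {
    "이천이십사": "2024",  # spelled-out year -> digits
    "퍼센트": "%",         # spoken unit word -> symbol
    "에이아이": "AI",      # transliterated English -> Latin script
}

def to_written_form(tokens):
    """Replace spoken-form tokens with written forms where a mapping exists."""
    return [SPOKEN_TO_WRITTEN.get(token, token) for token in tokens]

print(" ".join(to_written_form(["이천이십사", "년", "에이아이", "연구"])))
# -> "2024 년 AI 연구"
```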
The landscape of multimodal speech recognition has been drastically reshaped by the latest breakthroughs in deep learning. The paper titled “Multimodal Audiovisual Speech Recognition Architecture Using a Three-feature Multifusion Method for Noise-robust Systems” by Jeon and others addresses challenges of speech recognition in diverse noisy environments [11]. This paper presents an audio-visual speech recognition model that emulates human dialogue recognition, showing remarkable robustness across synthesized environments at nine different noise levels. By integrating audio and visual elements through a dense spatial–temporal convolutional neural network, the model achieves a substantially lower error rate than traditional methods. This study may pave the way for enhanced speech recognition services with both stability and improved recognition rates in noisy environments.
Language tutoring systems for nonnative speakers have taken a significant leap forward with the development of advanced end-to-end methods for automatic speech recognition and proficiency evaluation, as presented in the paper [12], “AI-based language tutoring systems with end-to-end automatic speech recognition and proficiency evaluation” by Kang and others. This paper details the creation of systems that proficiently assess and provide feedback on pronunciation and fluency using a combination of semisupervised and transfer learning techniques with diverse speech data. Highlighting its practical application, this study showcases two deployed systems, EBS AI PengTalk and KSI Korean AI Tutor, which enhance language learning for Korean elementary students and foreigners learning Korean, respectively.
The guest editors would like to thank all the authors, reviewers, and editorial staff of ETRI Journal for making this special issue successful. We are pleased to have been part of the effort to publish high-quality technical papers in a timely manner. The presented studies on speech and language models will certainly contribute to the design and implementation of future AI systems.
The authors declare that there are no conflicts of interest.
Journal Introduction
ETRI Journal is an international, peer-reviewed multidisciplinary journal published bimonthly in English. The main focus of the journal is to provide an open forum to exchange innovative ideas and technology in the fields of information, telecommunications, and electronics.
Key topics of interest include high-performance computing, big data analytics, cloud computing, multimedia technology, communication networks and services, wireless communications and mobile computing, material and component technology, as well as security.
With an international editorial committee and experts from around the world as reviewers, ETRI Journal publishes high-quality research papers on the latest and best developments from the global community.