Researchers are increasingly eager to develop techniques for extracting emotional data from new sources, driven by the exponential growth of subjective information on Web 2.0. One of the most challenging aspects of textual emotion detection is collecting data with emotion labels, given the subjectivity involved in labeling emotions. Our research aims to aid the development of effective solutions to this problem. We propose a Deep Convolutional Belief-based Spatial Network Model (DCB-SNM) as a semi-automated technique to tackle this challenge. The model involves two basic phases of analysis, text and video, in which pre-trained annotators identify the dominant emotion. Our work evaluates the impact of this automatic pre-annotation approach on manual emotion annotation in terms of annotation time and agreement. The annotation-time data indicate an increase of roughly 20% when the pre-annotation procedure is used, without negatively affecting the annotators' skill, demonstrating the benefits of pre-annotation approaches. Additionally, pre-annotation proves particularly advantageous for contributors with low prediction accuracy, enhancing overall annotation efficiency and reliability.
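The agreement side of such an evaluation is typically quantified with a chance-corrected statistic. A minimal sketch in pure Python of Cohen's kappa between a manual pass and a pre-annotation-assisted pass (the label lists are hypothetical, not from the paper's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)

# Hypothetical emotion labels from a manual pass and an assisted pass
manual = ["joy", "joy", "anger", "anger"]
assisted = ["joy", "anger", "anger", "anger"]
print(cohens_kappa(manual, assisted))  # 0.5
```

Observed agreement here is 3/4 and chance agreement 1/2, so kappa is 0.5; comparing such scores with and without pre-annotation is one way to check that the procedure does not degrade agreement.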
"A Dense Spatial Network Model for Emotion Recognition Using Learning Approaches" by L. V. and Dinesh Kumar Anguraj. ACM Transactions on Asian and Low-Resource Language Information Processing, 10 August 2024. https://doi.org/10.1145/3688000
The advancement of medicine presents challenges for modern societies, especially unpredictable falling incidents among the elderly, which can occur anywhere due to serious health issues. Delayed rescue of at-risk elders can be dangerous. Traditional elder-safety methods such as video surveillance or wearable sensors are inefficient and burdensome, wasting human resources and requiring caregivers to monitor for falls constantly. Thus, a more effective and convenient solution is needed to ensure elderly safety. In this paper, a method is presented for detecting human falls in naturally occurring scenes in video, using a traditional Convolutional Neural Network (CNN) model, Inception-v3, VGG-19, and two versions of the You Only Look Once (YOLO) model. The primary focus of this work is human fall detection through deep learning models. Specifically, the YOLO approach is adopted for object detection and tracking in video scenes. By applying YOLO, human subjects are identified and bounding boxes are generated around them. The classification of various human activities, including falls, is accomplished through the analysis of deformation features extracted from these bounding boxes. The traditional CNN model achieves an impressive 99.83% accuracy in human fall detection, surpassing other state-of-the-art methods. Its training time is longer than that of YOLO-v2 and YOLO-v3, but significantly shorter than that of Inception-v3, taking only around 10% of the latter's total training time.
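The bounding-box deformation idea can be illustrated with a toy heuristic: an upright person's box is taller than wide, and a fall flips that abruptly. This is only a sketch of the intuition with hypothetical box coordinates, not the paper's classifier:

```python
def aspect_ratios(boxes):
    """Width/height ratio of each (x, y, w, h) bounding box."""
    return [w / h for (_, _, w, h) in boxes]

def detect_fall(boxes, ratio_jump=1.5):
    """Flag a fall when the box deforms from upright (ratio < 1) to lying
    (ratio > 1) with a sharp relative jump between consecutive frames."""
    r = aspect_ratios(boxes)
    for prev, cur in zip(r, r[1:]):
        if prev < 1.0 and cur > 1.0 and cur / prev > ratio_jump:
            return True
    return False

# Hypothetical track: a standing box (40x120) followed by a lying box (120x40)
print(detect_fall([(0, 0, 40, 120), (5, 40, 120, 40)]))  # True
```

A learned model would replace the hand-set threshold with features extracted from many such box sequences, but the deformation signal it consumes is of this shape.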
"Learning and Vision-based Approach for Human Fall Detection and Classification in Naturally Occurring Scenes Using Video Data" by Shashvat Singh, Kumkum Kumari, and A. Vaish. ACM Transactions on Asian and Low-Resource Language Information Processing, 10 August 2024. https://doi.org/10.1145/3687125
The study aims to present an in-depth Sentiment Analysis (SA) grounded in the emotions present in speech signals. Nowadays, web-based applications of all kinds, ranging from social media platforms and video-sharing sites to e-commerce applications, provide support for Human-Computer Interfaces (HCIs). These applications allow users to share their experiences in all forms, such as text, audio, video, and GIFs. The most natural and fundamental form of self-expression is speech. Speech-Based Sentiment Analysis (SBSA) is the task of gaining insights from speech signals, classifying a statement as neutral, negative, or positive. Speech Emotion Recognition (SER), in turn, categorizes speech signals into the emotions disgust, fear, sadness, anger, happiness, and neutral. It is necessary to recognize the sentiments along with the depth of the emotions in the speech signals. To this end, the proposed methodology defines a text-oriented SA model that combines CNN and Bi-LSTM layers with an embedding layer, applied to text obtained from speech signals, achieving an accuracy of 84.49%. The methodology also includes an Emotion Analysis (EA) model based on a CNN that identifies the type of emotion present in the speech signal, with an accuracy of 95.12%. The presented architecture can also be applied to other domains such as product review systems, video recommendation systems, education, health, and security.
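The CNN stage of such a text model slides a kernel over the token-embedding sequence and pools the result. A dependency-free sketch of that operation (the embeddings and kernel values are made up for illustration; a real model learns them):

```python
def conv1d(embeddings, kernel, bias=0.0):
    """Valid 1-D convolution over a token-embedding sequence.
    embeddings: list of equal-length vectors; kernel: list of vectors (one per
    position in the sliding window)."""
    k = len(kernel)
    out = []
    for i in range(len(embeddings) - k + 1):
        s = bias
        for j in range(k):
            s += sum(e * w for e, w in zip(embeddings[i + j], kernel[j]))
        out.append(s)
    return out

def max_pool(feature_map):
    """Global max pooling over the feature map."""
    return max(feature_map)

# Hypothetical 2-d embeddings for four tokens, kernel of width 2
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
ker = [[1.0, -1.0], [0.5, 0.5]]
fm = conv1d(emb, ker)
print(fm, max_pool(fm))  # [1.5, 0.0, 0.0] 1.5
```

In the described architecture this pooled feature would then feed the Bi-LSTM and classification layers.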
"CNN-Based Models for Emotion and Sentiment Analysis Using Speech Data" by Anjum Madan and Devender Kumar. ACM Transactions on Asian and Low-Resource Language Information Processing, 8 August 2024. https://doi.org/10.1145/3687303
With the development of artificial intelligence, natural language processing enables us to better understand and utilize semantic information. However, traditional object detection algorithms cannot achieve effective performance when dealing with Tibetan opera mask datasets, which are characterized by limited samples, symmetrical patterns, and high inter-class distances. To address this issue, we propose a novel feature representation model with a recall loss function for detecting different masks. In the model, we develop an adaptive feature extraction network with fused layers to extract features. Furthermore, a lightweight, efficient attention mechanism is designed to enhance the significance of key features. Additionally, a recall loss function is proposed to increase the differences among classes. Finally, experimental results on the Tibetan opera mask dataset demonstrate that our proposed model outperforms the compared models.
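The abstract does not define the recall loss, but one plausible instantiation is a cross-entropy reweighted by how poorly each class is currently recalled, so under-recalled classes push harder on the update. A sketch under that assumption (the weighting scheme and numbers are hypothetical):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def recall_weighted_ce(logits, target, class_recall):
    """Cross-entropy scaled by (1 - recall) of the target class: classes the
    model currently recalls poorly contribute more to the loss."""
    p = softmax(logits)
    return (1.0 - class_recall[target]) * -math.log(p[target])

# Hypothetical 3-class case: class 2 has low running recall (0.4),
# so mistakes on it are penalized more heavily
logits = [0.2, 0.1, 0.7]
print(recall_weighted_ce(logits, 2, [0.9, 0.8, 0.4]))
```

The running per-class recall would be re-estimated each epoch from validation predictions; when a class reaches perfect recall its weight drops to zero.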
"Adaptive Semantic Information Extraction of Tibetan Opera Mask with Recall Loss" by Yao Wen, Jie Li, Donghong Cai, Zhicheng Dong, Fangkai Cai, Ping Lan, and Quan Zhou. ACM Transactions on Asian and Low-Resource Language Information Processing, 26 July 2024. https://doi.org/10.1145/3666041
To capture and integrate the structural and temporal features contained in the social graph and the diffusion cascade more effectively, an information diffusion prediction model based on a Transformer and a Relational Graph Convolutional Network (TRGCN) is proposed. First, a dynamic heterogeneous graph composed of the social network graph and the diffusion cascade graph is constructed and fed into the Relational Graph Convolutional Network (RGCN) to extract the structural features of each node. Second, the time embedding of each node is re-encoded using a Bi-directional Long Short-Term Memory (Bi-LSTM) network, and a time decay function is introduced to give different weights to nodes at different time positions, yielding the temporal features of the nodes. Finally, the structural and temporal features are fed into the Transformer and merged into spatial-temporal features for information diffusion prediction. Experimental results on three real-world datasets, Twitter, Douban, and Memetracker, show that compared with the best baseline, the TRGCN model achieves an average improvement of 4.16% in the Hits@100 metric and 13.26% in the MAP@100 metric, demonstrating the model's validity.
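The time decay idea can be sketched concretely: weight each cascade node by an exponentially decaying function of its age, then normalize. The decay rate and timestamps below are hypothetical, since the abstract does not specify the exact function:

```python
import math

def time_decay_weights(timestamps, now, lam=0.1):
    """Exponential time-decay weights, normalized to sum to 1: more recently
    activated nodes in the cascade receive larger weights."""
    raw = [math.exp(-lam * (now - t)) for t in timestamps]
    z = sum(raw)
    return [w / z for w in raw]

# Hypothetical cascade: nodes activated at t = 0, 5, 9; current time t = 10
weights = time_decay_weights([0, 5, 9], now=10)
print(weights)  # the most recent node gets the largest weight
```

These weights would then scale the nodes' time embeddings before the Bi-LSTM re-encoding step.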
"TRGCN: A Prediction Model for Information Diffusion Based on Transformer and Relational Graph Convolutional Network" by Jinghua Zhao, Xiting Lyu, Haiying Rong, and Jiale Zhao. ACM Transactions on Asian and Low-Resource Language Information Processing, 26 July 2024. https://doi.org/10.1145/3672074
Current supervised word sense disambiguation models have achieved high disambiguation accuracy by using annotated information for different word senses together with pre-trained language models. However, the semantic data of supervised word sense disambiguation models take the form of short texts, and much of the corpus information is not rich enough to distinguish senses in different scenarios. This paper proposes a bi-encoder word sense disambiguation method that combines a knowledge graph with the hierarchical structure of text. It introduces structured knowledge from the knowledge graph to supply extended semantic information, uses the hierarchy of the contextual input text to describe the meaning of words and phrases, and constructs a BERT-based bi-encoder with a graph attention network that reduces noise in the contextual input text, thereby improving disambiguation accuracy for target words in phrase form and the overall effectiveness of the method. Compared with the nine latest algorithms on five test datasets, the proposed method mostly outperforms the comparison algorithms in disambiguation accuracy.
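At inference time a bi-encoder reduces to scoring the context embedding against each candidate sense's gloss embedding and taking the best match. A sketch of that final step with hypothetical precomputed embeddings (a real system would obtain them from the BERT encoders):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def disambiguate(context_vec, sense_vecs):
    """Pick the sense whose gloss embedding is closest to the context embedding."""
    return max(sense_vecs, key=lambda sense: cosine(context_vec, sense_vecs[sense]))

# Hypothetical embeddings for "bank" in a financial context
context = [0.9, 0.1, 0.0]
senses = {
    "bank%financial": [1.0, 0.0, 0.1],
    "bank%river": [0.0, 1.0, 0.2],
}
print(disambiguate(context, senses))  # bank%financial
```

The paper's contribution sits upstream of this step, in how the context and gloss representations are enriched with knowledge-graph structure before scoring.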
"Word Sense Disambiguation Combining Knowledge Graph and Text Hierarchical Structure" by Yukun Cao, Chengkun Jin, Yijia Tang, and Ziyue Wei. ACM Transactions on Asian and Low-Resource Language Information Processing, 25 July 2024. https://doi.org/10.1145/3677524
Mood, a long-lasting affective state detached from specific stimuli, plays an important role in behavior. Although sentiment analysis and emotion classification have garnered attention, research on mood classification remains in its early stages. This study adopts a two-dimensional structure of affect, comprising "pleasantness" and "activation," to classify mood patterns. Emojis, graphic symbols representing emotions and concepts, are widely used in computer-mediated communication. Unlike previous studies that treat emojis as direct labels for emotion or sentiment, this work uses a pre-trained large language model that integrates both text and emojis to develop a mood classification model. Our contributions are three-fold. First, we annotate 10,000 Thai tweets with mood to train the models and release the dataset to the public. Second, we show that emojis contribute to determining mood to a lesser extent than text, far from mapping directly to mood. Third, by applying the trained model, we observe the correlation of moods during the Thai political turmoil of 2019-2020 on Thai Twitter and find a significant correlation. These moods closely reflect the news events and reveal one side of Thai public opinion during the turmoil.
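The two-dimensional structure of affect partitions the pleasantness-activation plane into quadrants, each associated with a family of moods. A minimal sketch of that mapping (the quadrant labels below are conventional circumplex-style names, not necessarily the paper's label set):

```python
def mood_quadrant(pleasantness, activation):
    """Map a (pleasantness, activation) pair in [-1, 1]^2 to a quadrant of
    the two-dimensional structure of affect."""
    if pleasantness >= 0:
        return "excited" if activation >= 0 else "calm"
    return "distressed" if activation >= 0 else "depressed"

print(mood_quadrant(0.7, 0.5))    # excited
print(mood_quadrant(-0.6, -0.4))  # depressed
```

A classifier trained on the annotated tweets would predict along these two axes (or the quadrant directly) rather than a flat emotion inventory.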
"Exploring the Correlation between Emojis and Mood Expression in Thai Twitter Discourse" by Attapol T. Rutherford and Pawitsapak Akarajaradwong. ACM Transactions on Asian and Low-Resource Language Information Processing, 24 July 2024. https://doi.org/10.1145/3680543
Translation from the mother tongue, including the Tunisian dialect, into Modern Standard Arabic is a highly significant field in natural language processing due to its wide range of applications and associated benefits. Recently, researchers have shown increased interest in the Tunisian dialect, primarily driven by the massive volume of content generated spontaneously by Tunisians on social media following the revolution. This paper presents two distinct translators for converting the Tunisian dialect into Modern Standard Arabic. The first translator uses a rule-based approach, employing a collection of finite-state transducers and a bilingual dictionary derived from the study corpus. The second translator relies on deep learning models, specifically a sequence-to-sequence transformer model and a parallel corpus. To assess, evaluate, and compare the performance of the two translators, we conducted tests using a parallel corpus comprising 8,599 words. The results achieved by both translators are noteworthy: the translator based on finite-state transducers achieved a BLEU score of 56.65, while the transformer-based translator achieved a higher score of 66.07.
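The core behaviour of the rule-based pipeline, stripped of its transducer machinery, is lexical substitution with passthrough for out-of-vocabulary words. A toy sketch; the word pairs below are illustrative placeholders, not entries from the paper's bilingual dictionary:

```python
def translate(sentence, lexicon):
    """Word-by-word lexical transduction with out-of-vocabulary passthrough,
    the simplest behaviour of a dictionary-backed transducer cascade."""
    return " ".join(lexicon.get(tok, tok) for tok in sentence.split())

# Hypothetical toy lexicon mapping dialect tokens to MSA tokens
lexicon = {"barcha": "kathiran", "mselkha": "jayyid"}
print(translate("barcha mselkha xyz", lexicon))  # kathiran jayyid xyz
```

The actual system composes finite-state transducers, which additionally handle morphological rewriting and multi-word rules that a flat dictionary lookup cannot express.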
"Translation from Tunisian Dialect to Modern Standard Arabic: Exploring Finite-State Transducers and Sequence-to-Sequence Transformer Approaches" by Roua Torjmen and K. Haddar. ACM Transactions on Asian and Low-Resource Language Information Processing, 24 July 2024. https://doi.org/10.1145/3681788
Automatic speech recognition (ASR) has become an indispensable part of the AI domain, with various speech technologies reliant on it. The quality of speech recognition depends, among other factors, on the amount of annotated data used to train an ASR system. For a low-resource language this is a severe constraint, and ASR quality is thus often poor. Humans can read through text containing ASR errors, provided the context of the sentence is preserved. Yet in transcripts generated by ASR systems for low-resource languages, many important words are misrecognized and the context is mostly lost; discerning such a text becomes nearly impossible. This paper analyzes the types of transcription errors that occur when generating ASR transcripts of spoken documents in Bengali, an under-resourced language predominantly spoken in India and Bangladesh. The transcripts of the Bengali spoken documents are generated using the ASR of Google Cloud Speech. The paper also explores whether such transcription errors affect the generation of speech summaries of these spoken documents. Summarization is carried out extractively: sentences are selected from the ASR-generated text of the spoken document, and speech summaries are created by aggregating the speech segments of the selected sentences from the original recording. Subjective evaluation shows that the 'readability' of the spoken summaries is not degraded by ASR errors, but their quality is affected by the reliance on an intermediate text summary containing transcription errors.
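The extractive step can be sketched with a simple frequency-based scorer: rank sentences by the corpus frequency of their words and return the chosen indices in spoken order, so the matching speech segments can be concatenated. This is a generic baseline, not the paper's specific summarizer, and the sentences are hypothetical:

```python
from collections import Counter

def select_sentences(sentences, k=2):
    """Score each sentence by the average corpus frequency of its words and
    return the indices of the top-k sentences in their original (spoken) order."""
    words = [s.lower().split() for s in sentences]
    freq = Counter(w for ws in words for w in ws)
    scores = [sum(freq[w] for w in ws) / max(len(ws), 1) for ws in words]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # spoken order, so speech segments concatenate cleanly

sents = [
    "the flood warning was issued today",
    "officials said the flood may continue",
    "unrelated remark about lunch",
]
print(select_sentences(sents, k=2))  # [0, 1]
```

Because selection operates on the ASR transcript, a misrecognized keyword lowers a sentence's score even though its audio is clean, which is exactly the dependency on the intermediate text summary that the paper examines.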
"Analyzing the Effects of Transcription Errors on Summary Generation of Bengali Spoken Documents" by Priyanjana Chowdhury, Nabanika Sarkar, Sanghamitra Nath, and Utpal Sharma. ACM Transactions on Asian and Low-Resource Language Information Processing, 17 July 2024. https://doi.org/10.1145/3678005
The existence of noisy labels is inevitable in real-world large-scale corpora. Deep neural networks are notably vulnerable to overfitting on noisy samples, which highlights the importance of language models' ability to resist noise during training. However, little attention has been paid to alleviating the influence of label noise in natural language processing. To address this problem, we present CoMix, a robust noise-resistant training strategy that takes advantage of co-training to deal with textual annotation errors in text classification tasks. In our framework, the original training set is first split into labeled and unlabeled subsets according to a sample-partition criterion, and label refurbishment is then applied to the unlabeled subset. We perform textual interpolation in hidden space between samples of the updated subsets. Meanwhile, we train two diverged peer networks simultaneously, leveraging co-training strategies to avoid the accumulation of confirmation bias. Experimental results on three popular text classification benchmarks demonstrate the effectiveness of CoMix in bolstering the network's resistance to mislabeled samples under various noise types and ratios, outperforming state-of-the-art methods.
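The abstract does not spell out the partition criterion; a common instantiation in noisy-label learning is the small-loss criterion, where samples the network fits easily are presumed clean. A sketch under that assumption, with hypothetical per-sample losses:

```python
def partition_by_loss(losses, threshold):
    """Small-loss partition: samples whose loss falls below the threshold are
    treated as clean (keep their labels); the rest become unlabeled and are
    candidates for label refurbishment."""
    clean, noisy = [], []
    for idx, loss in enumerate(losses):
        (clean if loss < threshold else noisy).append(idx)
    return clean, noisy

# Hypothetical per-sample cross-entropy losses after a warm-up epoch
losses = [0.05, 1.9, 0.12, 2.4, 0.3]
print(partition_by_loss(losses, threshold=0.5))  # ([0, 2, 4], [1, 3])
```

In a co-training setup, each peer network would typically partition the data for the other, which is one way the accumulation of confirmation bias is avoided.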
"CoMix: Confronting with Noisy Label Learning with Co-training Strategies on Textual Mislabeling" by Shu Zhao, Zhuoer Zhao, Yangyang Xu, and Xiao Sun. ACM Transactions on Asian and Low-Resource Language Information Processing, 15 July 2024. https://doi.org/10.1145/3678175