Computer Speech and Language: latest articles (page 6)

Conversations in the wild: Data collection, automatic generation and evaluation

IF 3.1; Zone 3, Computer Science; Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-07-30 DOI: 10.1016/j.csl.2024.101699

Nimra Zaheer , Agha Ali Raza , Mudassir Shabbir

The aim of conversational speech processing is to analyze human conversations in natural settings. It finds numerous applications in personality traits identification, speech therapy, speaker identification and verification, speech emotion detection, and speaker diarization. However, large-scale annotated datasets required for feature extraction and conversational model training only exist for a handful of languages (e.g. English, Mandarin, and French) as the gathering, cleaning, and annotation of such datasets is tedious, time-consuming, and expensive. We propose two scalable, language-agnostic algorithms for automatically generating multi-speaker, variable-length, spontaneous conversations. These algorithms synthesize conversations using existing non-conversational speech datasets. We also contribute the resulting datasets (283 hours, 50 speakers). As a comparison, we also gathered the first spontaneous conversational dataset for Urdu (24 hours, 212 speakers) from public talk shows. Using speaker diarization as an example, we evaluate our datasets and report the first baseline diarization error rates (DER) for Urdu (25% for synthetic dataset-based models, and 29% for natural conversations). Our conversational speech generation technique allows training speaker diarization pipelines without the need for preparing huge conversational repositories.
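The diarization error rate quoted above is conventionally the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. The abstract does not say how the authors scored their systems, so the following is only a minimal sketch of that conventional definition; the function name and the durations are hypothetical and chosen for illustration.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_reference):
    """Conventional DER: erroneous time divided by total reference speech time.

    All arguments are durations in seconds. This is the textbook definition,
    not necessarily the exact scoring setup used in the paper.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed + confusion) / total_reference


# Hypothetical durations (seconds), for illustration only.
der = diarization_error_rate(
    false_alarm=30.0, missed=60.0, confusion=160.0, total_reference=1000.0
)
print(f"DER = {der:.1%}")
```

Because the three error terms are summed before dividing, DER can exceed 100% when a system produces a large amount of false-alarm speech.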


Citations: 0

Prompting large language models for user simulation in task-oriented dialogue systems

IF 3.1; Zone 3, Computer Science; Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-07-26 DOI: 10.1016/j.csl.2024.101697

Atheer Algherairy , Moataz Ahmed

Large Language Models (LLMs) have gained widespread popularity due to their instruction-following abilities. In this study, we evaluate their ability to simulate user interactions for task-oriented dialogue (TOD) systems. Our findings demonstrate that prompted LLMs show promising capabilities for training and testing dialogue policies, eliminating the need for domain expertise in crafting complex rules or for large annotated datasets, as required by traditional simulators. The results show that the dialogue system trained with the ChatGPT simulator achieves a success rate of 59%, comparable to the 62% success rate of the dialogue system trained with the manual-rules, agenda-based user simulator (ABUS). Furthermore, the dialogue system trained with the ChatGPT simulator demonstrates better generalization ability than the dialogue system trained with the ABUS. Its success rate exceeds that of the dialogue system trained with the ABUS by 4% on GenTUS, 5% on the ChatGPT simulator, and 3% on the Llama simulator. Nevertheless, LLM-based user simulators provide a challenging environment with lexically rich, diverse, and random responses. The Llama simulator outperforms the human reference in all lexical diversity metrics, with margins of 0.66 in SE, 0.39 in CE, 0.01 in MSTTR, 0.04 in HDD, and 0.55 in MTLD, while the ChatGPT simulator achieves comparable results. This ultimately helps the system generalize more effectively.


Citations: 0

Demystifying large language models in second language development research

IF 3.1; Zone 3, Computer Science; Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-07-26 DOI: 10.1016/j.csl.2024.101700

Yan Cong

Evaluating students' textual responses is a common and critical task in language research and education practice. However, manual assessment can be tedious and may lack consistency, posing challenges for both scientific discovery and frontline teaching. Leveraging state-of-the-art large language models (LLMs), we aim to define and operationalize LLM-Surprisal, a numeric representation of the interplay between lexical diversity and syntactic complexity, and to empirically and theoretically demonstrate its relevance for automatic writing assessment and Chinese L2 (second language) learners’ English writing development. We developed an LLM-based natural language processing pipeline that can automatically compute text Surprisal scores. By comparing Surprisal metrics with the classic indices widely used in L2 studies, we extended the usage of computational metrics in Chinese learners’ L2 English writing. Our analyses suggested that LLM-Surprisals can distinguish L2 from L1 (first language) writing, index L2 development stages, and predict scores provided by human professionals. This indicated that the Surprisal dimension may be a critical aspect of L2 development. The relative advantages and disadvantages of these approaches were discussed in depth. We concluded that LLMs are promising tools that can enhance L2 research. Our showcase paves the way for more nuanced approaches to computationally assessing and understanding L2 development. Our pipelines and findings will inspire language teachers, learners, and researchers to operationalize LLMs in an innovative and accessible manner.
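In information theory, the surprisal of a token is the negative log probability a model assigns to it given its context. The abstract does not spell out how LLM-Surprisal is operationalized, so the sketch below only illustrates that standard quantity. The function names are hypothetical, and the probabilities are made-up values standing in for what a language model would return.

```python
import math


def token_surprisals(token_probs):
    """Surprisal in bits, -log2(p), for each conditional token probability."""
    for p in token_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("each probability must lie in (0, 1]")
    return [-math.log2(p) for p in token_probs]


def mean_surprisal(token_probs):
    """Average per-token surprisal of a text, one possible text-level score."""
    values = token_surprisals(token_probs)
    return sum(values) / len(values)


# Made-up probabilities standing in for an LLM's next-token probabilities.
probs = [0.5, 0.25, 0.125]
print(token_surprisals(probs))  # [1.0, 2.0, 3.0]
print(mean_surprisal(probs))    # 2.0
```

Averaging is only one way to aggregate token values into a text-level score; sums, maxima, or variance are equally plausible choices.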


Citations: 0

The effect of preference elicitation methods on the user experience in conversational recommender systems

IF 3.1; Zone 3, Computer Science; Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-07-25 DOI: 10.1016/j.csl.2024.101696

Liv Ziegfeld , Daan Di Scala , Anita H.M. Cremers

The prevalence of conversational interfaces is rapidly rising, since improved algorithms allow for remarkable proficiency in understanding and generating natural language. This also holds for Conversational Recommender Systems (CRS), which use information provided by the user in the course of the dialogue to offer personalized recommendations. However, the challenge remains to elicit the user's characteristics and preferences in a way that leads to the best user experience. Hence, the current research was aimed at investigating the effect of different Preference Elicitation (PE) methods on the user experience of a CRS. We introduce two axes along which PE methods can be classified, namely the degree of system prompt guidance and the level of user input restriction. We built three versions of a CRS to conduct a between-subjects experiment that compared three conditions: high guidance-high restriction, high guidance-low restriction and low guidance-low restriction. We tested their effect on ten user experience constructs with 66 European participants, all working in agriculture or forestry.

The study did not find significant effects of the three preference elicitation methods on any of the user experience constructs collected through questionnaires. However, we did find significant differences in the objective measures of chat duration (Speed), response time (Cognitive Demand) and recommendation performance (Accuracy of Recommended Items). Regarding recommendation performance, the preference elicitation methods with high guidance led to a higher match score than the condition with low guidance. The certainty score was highest in the condition with high guidance and high input restriction. Finally, we found through a question at the end of the conversation that users who were satisfied with the recommendation responded more positively to six out of ten user experience constructs. This suggests that satisfaction with the recommendation performance is a crucial factor in the user experience of CRSs.


Citations: 0

Theory of mind performance of large language models: A comparative analysis of Turkish and English

IF 3.1; Zone 3, Computer Science; Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-07-25 DOI: 10.1016/j.csl.2024.101698

Burcu Ünlütabak, Onur Bal

Theory of mind (ToM), the ability to understand others’ mental states, is a defining human skill. Research assessing LLMs’ ToM performance yields conflicting findings and leads to discussions about whether and how they could show ToM understanding. Psychological research indicates that the characteristics of a specific language can influence how mental states are represented and communicated. Thus, it is reasonable to expect language characteristics to influence how LLMs communicate with humans, especially when the conversation involves references to mental states. This study examines how these characteristics affect LLMs’ ToM performance by evaluating the performance of GPT 3.5 and 4 in English and Turkish. Turkish provides an excellent contrast to English since Turkish has a different syntactic structure and special verbs, san- and zannet-, meaning “falsely believe.” Using OpenAI's Chat Completion API, we collected responses from GPT models for first- and second-order ToM scenarios in English and Turkish. Our innovative approach combined completion prompts and open-ended questions within the same chat session, offering deep insights into models’ reasoning processes. Our data showed that while GPT models can respond accurately to standard ToM tasks (100% accuracy), their performance deteriorates (below chance level) with slight modifications. This high sensitivity suggests a lack of robustness in ToM performance. GPT 4 outperformed its predecessor, GPT 3.5, showing some improvement in ToM performance. The models generally performed better when tasks were presented in English than in Turkish. These findings indicate that GPT models cannot yet reliably pass first-order and second-order ToM tasks in either language. The findings have significant implications for the explainability of LLMs by highlighting challenges and biases that they face when simulating human-like ToM understanding in different languages.


Citations: 0

ChatMatch: Exploring the potential of hybrid vision–language deep learning approach for the intelligent analysis and inference of racket sports

IF 3.1; Zone 3, Computer Science; Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-07-25 DOI: 10.1016/j.csl.2024.101694

Jiawen Zhang , Dongliang Han , Shuai Han , Heng Li , Wing-Kai Lam , Mingyu Zhang

Video understanding technology has become increasingly important in various disciplines, yet current approaches have primarily focused on lower-level comprehension of video content, making it difficult to provide comprehensive and professional insights at a higher level of comprehension. Video analysis plays a crucial role in athlete training and strategy development in racket sports. This study aims to demonstrate an innovative, higher-level video comprehension framework (ChatMatch), which integrates computer vision technologies with cutting-edge large language models (LLMs) to enable intelligent analysis and inference of racket sports videos. To examine the feasibility of this framework, we deployed a prototype of ChatMatch for badminton. A vision-based encoder was first proposed to extract meta-features, including the locations, actions, gestures, and action results of players in each frame of racket match videos, followed by a rule-based decoding method to transform the extracted information into both structured and unstructured knowledge. A set of LLM-based agents, namely a task identifier, a coach agent, a statistician agent, and a video manager, was developed through prompt engineering and driven by an automated mechanism. The automatic collaborative interaction among the agents enabled comprehensive responses to professional inquiries from users. The validation findings showed that our vision models achieved excellent performance in meta-feature extraction, with a location identification accuracy of 0.991, an action recognition accuracy of 0.902, and a gesture recognition accuracy of 0.950. Additionally, a total of 100 questions were gathered from four proficient badminton players and one coach to evaluate the performance of the LLM-based agents, and ChatMatch produced commendable results across general inquiries, statistical queries, and video retrieval tasks. These findings highlight the potential of this approach to offer valuable insights for athletes and coaches while significantly improving the efficiency of sports video analysis.


Citations: 0

On improving conversational interfaces in educational systems

IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-07-23 DOI: 10.1016/j.csl.2024.101693

Yuyan Wu, Romina Soledad Albornoz-De Luise, Miguel Arevalillo-Herráez

Conversational Intelligent Tutoring Systems (CITS) have drawn increasing interest in education because of their capacity to tailor learning experiences, improve user engagement, and contribute to the effective transfer of knowledge. Conversational agents employ advanced natural language techniques to engage in a convincing human-like tutorial conversation. In solving math word problems, a significant challenge arises in enabling the system to understand user utterances and accurately map extracted entities to the essential problem quantities required for problem-solving, despite the inherent ambiguity of human natural language. In this study, we propose two possible approaches to enhance the performance of a particular CITS designed to teach learners to solve arithmetic–algebraic word problems. Firstly, we propose an ensemble approach to intent classification and entity extraction, which combines the predictions made by two distinct individual models that use constraints defined by human experts. This approach leverages the intertwined nature of the intents and entities to yield a comprehensive understanding of the user’s utterance, ultimately aiming to enhance semantic accuracy. Secondly, we introduce an adapted Term Frequency-Inverse Document Frequency technique to associate entities with problem quantity descriptions. The evaluation was conducted on the AWPS and MATH-HINTS datasets, containing conversational data and a collection of arithmetical and algebraic math problems, respectively. The results demonstrate that the proposed ensemble approach outperforms individual models, and the proposed method for entity–quantity matching surpasses the performance of typical text semantic embedding models.
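
The entity-to-quantity matching step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the Term Frequency-Inverse Document Frequency technique was adapted, so the Python below uses plain TF-IDF with cosine similarity as a stand-in. The function names and the quantity descriptions are hypothetical and are not taken from the paper or from the AWPS or MATH-HINTS datasets.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({term: weight} per tokenized document, idf table) using plain TF-IDF."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so that terms found in every document keep a positive weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: (tf[term] / len(doc)) * idf[term] for term in tf})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_entity(entity, descriptions):
    """Return (index, score) of the description most similar to the entity text."""
    docs = [description.lower().split() for description in descriptions]
    vectors, idf = tfidf_vectors(docs)
    tokens = entity.lower().split()
    tf = Counter(tokens)
    # Terms that never occur in any description carry no weight in the query.
    query = {term: (tf[term] / len(tokens)) * idf[term] for term in tf if term in idf}
    scores = [cosine(query, vector) for vector in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical problem quantities for one word problem (illustration only).
QUANTITIES = [
    "number of apples bought by the customer",
    "price of one apple in euros",
    "total amount paid by the customer",
]

if __name__ == "__main__":
    index, score = match_entity("price of an apple", QUANTITIES)
    print(QUANTITIES[index], round(score, 3))
```

A lexical scheme like this rewards exact word overlap between the extracted entity and the quantity description, which is one plausible reason such a method could compete with general-purpose sentence embeddings on short, domain-specific descriptions.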



Citations: 0

A computational analysis of transcribed speech of people living with dementia: The Anchise 2022 Corpus

IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-07-22 DOI: 10.1016/j.csl.2024.101691

Francesco Sigona , Daniele P. Radicioni , Barbara Gili Fivela , Davide Colla , Matteo Delsanto , Enrico Mensa , Andrea Bolioli , Pietro Vigorelli

Introduction: Automatic linguistic analysis can provide cost-effective, valuable clues to the diagnosis of cognitive difficulties and to therapeutic practice, and hence have a positive impact on wellbeing. In this work, we analyzed transcribed conversations between elderly individuals living with dementia and healthcare professionals. The material came from the Anchise 2022 Corpus, a large collection of transcripts of conversations in Italian recorded in naturalistic conditions. The aim of the work was to test the effectiveness of a number of automatic analyses in finding correlations with the progression of dementia in individuals with cognitive decline, as measured by the Mini-Mental State Examination (MMSE) score, which is the only psychometric-clinical information available on the participants in the conversations. Healthy controls (HC) were not considered in this study, nor does the corpus itself include HCs. The main innovation and strength of the work lie in the high ecological validity of the language analyzed (most of the literature to date concerns controlled language experiments); in the use of Italian (there are few corpora for Italian); in the size of the analyzed data (more than 200 conversations were considered); and in the adoption of a wide range of NLP methods, spanning from traditional morphosyntactic investigation to deep linguistic models, for analyses based on perplexity, sentiment (polarity) and emotions.

Methods: Analyzing real-world interactions not designed with computational analysis in mind, as is the case for the Anchise Corpus, is particularly challenging. To achieve the research goals, a wide variety of tools were employed. These included traditional morphosyntactic analysis based on digital linguistic biomarkers (DLBs), transformer-based language models, sentiment and emotion analysis, and perplexity metrics. Analyses were conducted both on the continuous range of MMSE values and on the severe/moderate/mild categorization suggested by AIFA (Italian Medicines Agency) guidelines, based on MMSE threshold values.

Results and discussion: Correlations between MMSE and individual DLBs were weak, up to 0.19 for positive and -0.21 for negative correlation values. Nevertheless, some correlations were statistically significant and consistent with the literature, suggesting that people with a greater degree of impairment tend to show a reduced vocabulary, to have anomia, to adopt a more informal linguistic register, and to display a simplified use of verbs, with a decrease in the use of participles, gerunds, subjunctive moods and modal verbs, as well as a flattening in the use of tenses towards the present to the detriment of the past. The -0.26 inverse correlation between perplexity and MMSE suggests that perplexity captures slightly more specific linguistic information, which can complement the MMSE scores. In the categorization tasks, the classifiers based on DLBs achieved an F1 score of 0.79 in the binary classification between severe and mild, and 0.61 in the multi-label classification. Sentiment and emotion analysis showed an inverse trend between joy and MMSE score, suggesting that less impaired people are less happy, or more "negative", than the others. Given the real-world context, this is consistent with the hypothesis of a progressively reduced awareness in people affected by dementia. Finally, combining the various analysis methods proved effective in providing broader information on language and communication impairments, as well as more precise data on the progression of dementia.
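
The relation between perplexity and MMSE reported above can be made concrete with a short sketch. It assumes that a language model has already produced natural-log probabilities for each token of a transcript, and it uses Pearson's coefficient, although the abstract does not state which correlation measure the authors used. All names and numbers in the example are hypothetical and are not data from the Anchise 2022 Corpus.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of a token sequence given natural-log token probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical per-transcript token log-probabilities and MMSE scores.
    transcripts = [
        [math.log(0.20), math.log(0.10), math.log(0.05)],
        [math.log(0.30), math.log(0.25), math.log(0.20)],
        [math.log(0.50), math.log(0.40), math.log(0.30)],
    ]
    mmse_scores = [12, 18, 24]
    perplexities = [perplexity(t) for t in transcripts]
    print(pearson(perplexities, mmse_scores))
```

A negative coefficient means that transcripts the model finds harder to predict tend to come from speakers with lower MMSE scores, which is the direction of the -0.26 value in the abstract.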


Citations: 0

PaSCoNT - Parallel Speech Corpus of Northern-central Thai for automatic speech recognition

IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-07-22 DOI: 10.1016/j.csl.2024.101692

Supawat Taerungruang , Phimphaka Taninpong , Vataya Chunwijitra , Sumonmas Thatphithakkul , Sawit Kasuriya , Viroj Inthanon , Pawat Paksaranuwat , Salinee Thumronglaohapun , Nawapon Nakharutai , Papangkorn Inkeaw , Jakramate Bootkrajang

This paper presents the Parallel Speech Corpus of Northern-central Thai (PaSCoNT). The purpose of this research is not only to understand the linguistic differences between Northern and Central Thai, but also to use the corpus for automatic speech recognition (ASR). The corpus is composed of speech data from dialogues of daily life among Northern Thai people. We designed 2,000 Northern Thai sentences covering all phonemes, in collaboration with linguists specialized in the Northern Thai dialect. The sample comprises 200 Northern Thai dialect speakers who had been living in Chiang Mai province for more than 18 years. The speech was recorded in both open and closed environments. Each speaker read 100 pairs of Northern-Central Thai sentences, so that the speech data for both varieties come from the same speaker. In total, 100 h of speech were recorded: 50 h of Northern Thai and 50 h of Central Thai. Overall, PaSCoNT consists of 907,832 words and 6,279 vocabulary items. Statistical analysis of the corpus revealed that 49.64% of the words in the lexicon belong to the Northern Thai dialect and 50.36% to the Central Thai dialect, and that 1,621 vocabulary items appear in both. Statistical analysis was also used to examine differences in speech tempo, measured as time per phoneme (TTP) and syllables per minute (SPM), between Northern and Central Thai. The results revealed statistically significant differences in speech tempo between the two: the TTP speaking and articulation rates of Central Thai are lower than those of Northern Thai, whereas the SPM speaking and articulation rates of Central Thai are higher. The results also showed that an ASR model trained on the Northern Thai speech corpus yields a lower word error rate (WER) when tested on Northern Thai speech and a higher WER when tested on Central Thai speech, and vice versa. However, an ASR model trained on the PaSCoNT speech corpus yields a lower WER on both Northern Thai and Central Thai test data.
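
For readers unfamiliar with the metric, the word error rate compared above is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below shows that standard definition. The function name and the example sentences are illustrative, and the abstract does not describe the authors' scoring tool or how Thai text was segmented into words, so whitespace tokenization is an assumption here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative English strings; real use would pass segmented Thai transcripts.
    print(word_error_rate("the cat sat on the mat", "the cat sat mat"))
```

Because insertions are counted in the numerator, the value can exceed 1.0 when the hypothesis is much longer than the reference.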



Citations: 0

Generalizing Hate Speech Detection Using Multi-Task Learning: A Case Study of Political Public Figures

IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-07-17 DOI: 10.1016/j.csl.2024.101690

Lanqin Yuan, Marian-Andrei Rizoiu

Automatic identification of hateful and abusive content is vital in combating the spread of harmful online content and its damaging effects. Most existing work evaluates models by examining the generalization error on train–test splits of hate speech datasets. These datasets often differ in their definitions and labeling criteria, leading to poor generalization performance when predicting across new domains and datasets. This work proposes a new Multi-task Learning (MTL) pipeline that trains simultaneously across multiple hate speech datasets to construct a more encompassing classification model. Using a dataset-level leave-one-out evaluation (designating one dataset for testing and jointly training on all others), we trial the MTL detection on new, previously unseen datasets. Our results consistently outperform a large sample of existing work. We show strong results when examining the generalization error in train–test splits and substantial improvements when predicting on previously unseen datasets. Furthermore, we assemble a novel dataset, dubbed PubFigs, focusing on the problematic speech of American public political figures. Using Amazon MTurk, we crowdsource labels for more than 20,000 tweets, and we machine-label problematic speech in all 305,235 tweets in PubFigs. We find that abusive and hateful tweeting mainly originates from right-leaning figures and relates to six topics, including Islam, women, ethnicity, and immigrants. We show that MTL builds embeddings that can simultaneously separate abusive from hate speech and identify its topics.
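
The dataset-level leave-one-out protocol mentioned in the abstract can be sketched as a simple split generator. This is only an illustration of the evaluation scheme as the abstract describes it (hold out one dataset, train jointly on the rest). The function name, dataset names and example records are hypothetical, and the sketch says nothing about the authors' MTL model itself.

```python
def leave_one_dataset_out(datasets):
    """Yield (held_out_name, train_examples, test_examples) for each dataset.

    `datasets` maps a dataset name to a list of examples. Each dataset is held
    out for testing exactly once, and all remaining datasets are pooled for training.
    """
    for held_out in datasets:
        train = [
            example
            for name, examples in datasets.items()
            if name != held_out
            for example in examples
        ]
        yield held_out, train, list(datasets[held_out])


if __name__ == "__main__":
    # Hypothetical corpora of (text, label) pairs, for illustration only.
    corpora = {
        "dataset_a": [("example text 1", "hate"), ("example text 2", "none")],
        "dataset_b": [("example text 3", "abusive")],
        "dataset_c": [("example text 4", "none")],
    }
    for name, train, test in leave_one_dataset_out(corpora):
        print(name, len(train), len(test))
```

Because the held-out dataset contributes nothing to training, the test score reflects transfer to a dataset with its own definitions and labeling criteria rather than performance on an in-distribution split.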



Citations: 0