Persona Knowledge Extraction from Dialog Data in Russian Language
P. Posokhov, E. A. Rudaleva, S. Skrylnikov, O. V. Makhnytkina, V. I. Kabarov
Informacionnye Tehnologii, published 2024-04-24
DOI: 10.17587/it.30.190-197 (https://doi.org/10.17587/it.30.190-197)
Abstract
This article examines the combined use of linguistic rules and machine learning models for extracting knowledge from dialog data in Russian. Linguistic rules based on morphological, syntactic, and grammatical features are used to label the training dataset automatically. A neural network model based on the T5 architecture was trained in multitask mode on two tasks: (a) generating a response from the dialog history and the facts about the agent's persona found relevant to that history; (b) extracting facts about the persona, by generation, from the agent's last utterance. The experiments used the Toloka Persona Chat Rus dataset. The metrics for both approaches indicate that they are applicable to Russian, a language for which no such studies had previously been conducted.
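The two tasks described in the abstract can be illustrated as text-to-text training pairs of the kind a T5-style model consumes. The following is a minimal sketch, not the authors' implementation: the task prefixes, the `<fact>` and speaker markers, the `" | "` separator, the `"none"` placeholder, and all function names are assumptions made for illustration, since the abstract does not specify the input format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical task prefixes: the abstract does not state the actual prompt format.
RESPONSE_PREFIX = "generate response:"
PERSONA_PREFIX = "extract persona:"


@dataclass
class DialogTurn:
    speaker: str  # assumed labels: "user" or "agent"
    text: str


def build_response_example(
    history: List[DialogTurn], relevant_facts: List[str], agent_reply: str
) -> Tuple[str, str]:
    """Task (a): source = relevant persona facts + dialog history, target = reply."""
    parts = [RESPONSE_PREFIX]
    parts += [f"<fact> {fact}" for fact in relevant_facts]
    parts += [f"<{turn.speaker}> {turn.text}" for turn in history]
    return " ".join(parts), agent_reply


def build_persona_example(
    last_agent_utterance: str, persona_facts: List[str]
) -> Tuple[str, str]:
    """Task (b): source = the agent's last utterance, target = facts it reveals."""
    source = f"{PERSONA_PREFIX} {last_agent_utterance}"
    # "none" is an assumed placeholder for utterances that carry no persona fact.
    target = " | ".join(persona_facts) if persona_facts else "none"
    return source, target


def build_multitask_examples(
    history: List[DialogTurn],
    relevant_facts: List[str],
    agent_reply: str,
    facts_in_reply: List[str],
) -> List[Tuple[str, str]]:
    """Produce one pair per task so both can be mixed in the same training set."""
    return [
        build_response_example(history, relevant_facts, agent_reply),
        build_persona_example(agent_reply, facts_in_reply),
    ]


if __name__ == "__main__":
    history = [DialogTurn("user", "Привет! Чем занимаешься?")]
    for source, target in build_multitask_examples(
        history, ["Я люблю рисовать."], "Рисую пейзаж.", ["Я рисую пейзажи."]
    ):
        print(source, "->", target)
```

In this sketch the persona facts used as targets for task (b) would come from the rule-based automatic labeling step; the rules themselves are not reproduced here because the abstract does not describe them in detail.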