Typographic-Based Data Augmentation to Improve a Question Retrieval in Short Dialogue System

Helmi Satria Nugraha, S. Suyanto
Published in: 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), 2019-12-01
DOI: 10.1109/ISRITI48646.2019.9034594 (https://doi.org/10.1109/ISRITI48646.2019.9034594)
Citations: 10

Abstract

Many questions that users pose to a customer service through a short-dialogue system (such as a chatbot) are difficult to answer, which lowers user satisfaction with the service. A question answering (QA) system can address this problem by providing relevant answers to user questions. One commonly used method for building a QA system is question retrieval (QR), which returns the answer attached to the most relevant stored question. However, recognizing that two questions are essentially the same when they are worded differently is quite challenging. The limited amount of data available for learning is a further challenge. This paper investigates data augmentation based on typographic variations and synonyms, and evaluates the use of sub-word (instead of whole-word) features to obtain the best word embedding for the questions. The word embedding is then used to compute the cosine similarity between a query and the stored questions. Finally, the user receives the answer associated with the stored question that has the highest cosine similarity. Evaluation on a fairly small data set shows that the proposed data augmentation significantly improves system performance. Moreover, the sub-word feature is better than the whole-word feature for word embedding in short conversations.
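
The retrieval pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the typographic operations or the embedding model, so the sketch assumes three simple typing-error operations (swap, delete, duplicate a character) and uses a bag of character trigrams as a crude stand-in for a learned sub-word embedding. All function names (`typo_variants`, `subword_vector`, `build_index`, `retrieve`) are hypothetical, and synonym-based augmentation is omitted.

```python
import math
import random
from collections import Counter


def typo_variants(text, n=3, seed=0):
    """Return n noisy copies of text (assumed typographic operations)."""
    if len(text) < 2:
        return [text] * n
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        chars = list(text)
        i = rng.randrange(len(chars) - 1)
        op = rng.choice(["swap", "delete", "duplicate"])
        if op == "swap":
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "delete":
            del chars[i]
        else:
            chars.insert(i, chars[i])
        variants.append("".join(chars))
    return variants


def subword_vector(text, n=3):
    """Bag of character n-grams; a stand-in for a learned sub-word embedding."""
    padded = f"<{text.lower()}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(qa_pairs, n_variants=3):
    """Index each stored question together with its augmented variants."""
    index = []
    for question, answer in qa_pairs:
        for q in [question] + typo_variants(question, n_variants):
            index.append((subword_vector(q), question, answer))
    return index


def retrieve(query, index):
    """Return (stored question, answer, score) with the highest cosine similarity."""
    qv = subword_vector(query)
    best = max(index, key=lambda entry: cosine(qv, entry[0]))
    return best[1], best[2], cosine(qv, best[0])


if __name__ == "__main__":
    # Toy data invented for illustration only.
    pairs = [
        ("how do i reset my password", "Use the reset link."),
        ("what are your opening hours", "9 to 5."),
    ]
    index = build_index(pairs)
    print(retrieve("how to reste my pasword", index))
```

Because the vectors are built from character n-grams rather than whole words, a misspelled query still shares most of its features with the correctly spelled stored question, which is the intuition behind preferring sub-word features for short, noisy chat input.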