Title: Typographic-Based Data Augmentation to Improve a Question Retrieval in Short Dialogue System
Authors: Helmi Satria Nugraha, S. Suyanto
Venue: 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI)
Publication date: 2019-12-01
DOI: 10.1109/ISRITI48646.2019.9034594 (https://doi.org/10.1109/ISRITI48646.2019.9034594)
Citations: 10
Abstract
Many questions that users pose to a customer service with a short dialogue (such as a chatbot) are difficult to answer, which lowers user satisfaction with the service. A question answering (QA) system can address this problem by providing relevant answers to user questions. One commonly used method for building a QA system is question retrieval (QR), which provides answers based on the most relevant stored questions. However, recognizing that two questions are essentially the same when they are worded differently is quite challenging. The limited amount of data available for training is a further difficulty. This paper investigates data augmentation based on typographic variation and synonyms, and evaluates the use of sub-word (instead of whole-word) features to obtain the best word embedding for a question. The word embedding is then used to compute the cosine similarity between a query and the stored questions. Finally, the user receives the answer associated with the stored question that has the highest cosine similarity. Evaluation on a fairly small data set shows that the proposed data augmentation significantly improves system performance. In addition, sub-word features give better word embeddings for short conversations than whole-word features.
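The retrieval pipeline described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: character trigram counts stand in for a learned sub-word embedding, adjacent-character swaps stand in for the typographic augmentation (the abstract does not specify which typo operations are used), and the function names and sample question-answer pairs are hypothetical.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Sub-word features: counts of character n-grams of each padded token.

    Assumption: raw n-gram counts replace a learned sub-word embedding.
    """
    grams = []
    for token in text.lower().split():
        padded = f"<{token}>"
        grams.extend(padded[i:i + n] for i in range(max(1, len(padded) - n + 1)))
    return Counter(grams)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def typo_variants(question, rng, count=3):
    """Typographic augmentation (assumed form): swap two adjacent inner
    characters of one randomly chosen word, keeping first and last letters."""
    words = question.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    variants = []
    for _ in range(count):
        if not candidates:
            break
        idx = rng.choice(candidates)
        word = words[idx]
        pos = rng.randrange(1, len(word) - 2)
        swapped = word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
        variants.append(" ".join(words[:idx] + [swapped] + words[idx + 1:]))
    return variants


def build_index(qa_pairs, rng):
    """Store each question and its typo variants with the same answer."""
    index = []
    for question, answer in qa_pairs:
        for text in [question] + typo_variants(question, rng):
            index.append((char_ngrams(text), answer))
    return index


def retrieve(query, index):
    """Return the answer of the stored question most similar to the query."""
    query_vec = char_ngrams(query)
    best = max(index, key=lambda entry: cosine(query_vec, entry[0]))
    return best[1]


# Hypothetical stored question-answer pairs.
qa_pairs = [
    ("how do I reset my password", "Use the reset link on the login page."),
    ("what are your opening hours", "We are open 9am to 5pm on weekdays."),
]
index = build_index(qa_pairs, random.Random(0))
print(retrieve("how to rest my pasword", index))
```

Because the features are character n-grams, a misspelled query such as "pasword" still shares most of its trigrams with "password". This is one plausible reason why sub-word features would suit short, informal user questions better than whole-word features.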