Krathu-500: Post-Comments Thai Corpus
Pittawat Taveekitworachai, Jonathan H. Chan
2022 Research, Invention, and Innovation Congress: Innovative Electricals and Electronics (RI2C)
DOI: 10.1109/RI2C56397.2022.9910314
Publication date: 2021-12-17
Abstract
Krathu-500 contains 574 posts from Pantip.com, each with its title, body, and all of its comments. In total, the corpus holds 63,293 comments written in conversational Thai as it is used in real-life situations, across a variety of contexts and types. It is intended as a resource for improving machine learning techniques that handle the Thai language. A smaller sentiment-labeled corpus of 6,306 comments is also provided; it is human-annotated with three labels: negative, neutral, and positive. Three baseline models were developed for the labeled corpus using a simple LSTM, a CNN, and BERT, respectively. The project also includes an open-source repository so that anyone interested can modify and build on the current source code and dataset.
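To make the described structure concrete, the following is a minimal sketch of how a post with its comments and optional sentiment labels might be held in memory. The class names, field names, and the `label_distribution` helper are hypothetical and chosen only for illustration; the abstract does not specify the repository's actual file format or schema, and the sample strings are placeholders rather than corpus data.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional

# The three sentiment labels named in the abstract.
LABELS = ("negative", "neutral", "positive")


@dataclass
class Comment:
    """One comment on a post (hypothetical schema)."""
    text: str
    sentiment: Optional[str] = None  # None means the comment is unlabeled


@dataclass
class Post:
    """One post with its title, body, and comments (hypothetical schema)."""
    title: str
    body: str
    comments: List[Comment] = field(default_factory=list)


def label_distribution(posts: Iterable[Post]) -> Dict[str, int]:
    """Count labeled comments per sentiment label, skipping unlabeled ones."""
    counts = {label: 0 for label in LABELS}
    for post in posts:
        for comment in post.comments:
            if comment.sentiment is None:
                continue
            if comment.sentiment not in counts:
                raise ValueError(f"unknown label: {comment.sentiment!r}")
            counts[comment.sentiment] += 1
    return counts


# Toy placeholder records, not taken from the corpus.
toy_posts = [
    Post(
        title="placeholder title",
        body="placeholder body",
        comments=[
            Comment("placeholder comment A", "positive"),
            Comment("placeholder comment B", "neutral"),
            Comment("placeholder comment C"),  # unlabeled
        ],
    ),
    Post(
        title="another placeholder title",
        body="another placeholder body",
        comments=[Comment("placeholder comment D", "negative")],
    ),
]

distribution = label_distribution(toy_posts)
```

Checking the label distribution this way is a common first step before training baselines such as the LSTM, CNN, and BERT models mentioned above, since a skewed distribution affects how accuracy should be interpreted.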