A Turkish Topic Modeling Dataset For Multi-label Classification of Movie Genre
Elgun Jabrayilzade, Algın Poyraz Arslan, Hasan Para, Ozan Polatbilek, Erhan Sezerer, Selma Tekir
2020 28th Signal Processing and Communications Applications Conference (SIU), 2020-10-05
DOI: 10.1109/SIU49456.2020.9302027 (https://doi.org/10.1109/SIU49456.2020.9302027)
Statistical topic modeling aims to assign topics to documents in an unsupervised way. Latent Dirichlet Allocation (LDA) is the standard model for topic modeling. It performs well on collections of relatively long documents but poorly on short texts. Interest in topic modeling on short texts is rising because of the potential of social media, so approaches that can find topics in both short and long texts are sought. However, there is a lack of datasets that include both long and short texts sharing the same ground-truth categories. In this work, we release a Turkish movie dataset that contains both short film descriptions and long subscripts, where the film genre can be treated as the topic. Furthermore, we provide multi-label movie genre classification results using a Feed Forward Neural Network (FFNN) that takes either LDA document-topic or Doc2Vec dense representations as input.
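The classification setup described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architecture details, so the single tanh hidden layer, its size, the learning rate, the 0.5 decision threshold, and the three genre slots are all assumptions. The hand-written vectors in `X` merely stand in for LDA document-topic distributions (a Doc2Vec vector would be used the same way). The point it shows is the multi-label part: one sigmoid output per genre trained with binary cross-entropy, so a film can receive several genres at once.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyFFNN:
    """One tanh hidden layer and one sigmoid output per genre (multi-label)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.W1, self.b1)]
        p = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b)
             for row, b in zip(self.W2, self.b2)]
        return h, p

    def train_step(self, x, y, lr):
        h, p = self.forward(x)
        # With sigmoid outputs and binary cross-entropy, dLoss/dlogit = p - y.
        d_out = [pi - yi for pi, yi in zip(p, y)]
        # Backpropagate through tanh (derivative 1 - h^2) before touching W2.
        d_hid = [(1.0 - hj * hj) * sum(d_out[k] * self.W2[k][j] for k in range(len(d_out)))
                 for j, hj in enumerate(h)]
        for k, dk in enumerate(d_out):
            for j, hj in enumerate(h):
                self.W2[k][j] -= lr * dk * hj
            self.b2[k] -= lr * dk
        for j, dj in enumerate(d_hid):
            for i, xi in enumerate(x):
                self.W1[j][i] -= lr * dj * xi
            self.b1[j] -= lr * dj

    def predict(self, x, threshold=0.5):
        # Each genre is decided independently, so several can be active.
        _, p = self.forward(x)
        return [1 if pi >= threshold else 0 for pi in p]


def mean_bce(model, X, Y):
    """Mean binary cross-entropy per document, summed over genre outputs."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(X, Y):
        _, p = model.forward(x)
        total -= sum(yi * math.log(pi + eps) + (1 - yi) * math.log(1 - pi + eps)
                     for pi, yi in zip(p, y))
    return total / len(X)


# Hypothetical document-topic vectors over 3 topics (each row sums to 1).
X = [[0.80, 0.10, 0.10], [0.10, 0.80, 0.10], [0.10, 0.10, 0.80],
     [0.45, 0.45, 0.10], [0.10, 0.45, 0.45], [0.45, 0.10, 0.45]]
# Hypothetical multi-hot genre labels, one slot per genre.
Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1],
     [1, 1, 0], [0, 1, 1], [1, 0, 1]]

model = TinyFFNN(n_in=3, n_hidden=6, n_out=3, seed=0)
loss_before = mean_bce(model, X, Y)
for _ in range(1000):
    for x, y in zip(X, Y):
        model.train_step(x, y, lr=0.5)
loss_after = mean_bce(model, X, Y)

print("loss before:", round(loss_before, 3), "after:", round(loss_after, 3))
print("predicted genres for X[3]:", model.predict(X[3]))
```

In practice the input vectors would come from a fitted topic or embedding model, and the network would normally be built with a deep learning library rather than written by hand. The pure-Python version is used here only to keep the example dependency-free.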