{"title":"利用von Mises-Fisher分布的dirichlet过程混合模型进行文档聚类","authors":"N. K. Anh, Tam The Nguyen, Ngo Van Linh","doi":"10.1145/2542050.2542079","DOIUrl":null,"url":null,"abstract":"Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. This paper proposes a Dirichlet process mixture (DPM) model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. We have developed a mean-field variational inference algorithm for the DPM model of vMFs that is applied to clustering text documents. Using this model, the number of clusters is determined automatically after the clustering process rather than pre-estimated. We conducted extensive experiments to evaluate the proposed approach on a large number of high dimensional text datasets. Empirical experimental results over NMI (Normalized Mutual Information) and Purity evaluation measures demonstrate that our approach outperforms the four state-of-the-art clustering algorithms.","PeriodicalId":246033,"journal":{"name":"Proceedings of the 4th Symposium on Information and Communication Technology","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Document clustering using dirichlet process mixture model of von Mises-Fisher distributions\",\"authors\":\"N. K. Anh, Tam The Nguyen, Ngo Van Linh\",\"doi\":\"10.1145/2542050.2542079\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. This paper proposes a Dirichlet process mixture (DPM) model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. We have developed a mean-field variational inference algorithm for the DPM model of vMFs that is applied to clustering text documents. Using this model, the number of clusters is determined automatically after the clustering process rather than pre-estimated. We conducted extensive experiments to evaluate the proposed approach on a large number of high dimensional text datasets. Empirical experimental results over NMI (Normalized Mutual Information) and Purity evaluation measures demonstrate that our approach outperforms the four state-of-the-art clustering algorithms.\",\"PeriodicalId\":246033,\"journal\":{\"name\":\"Proceedings of the 4th Symposium on Information and Communication Technology\",\"volume\":\"27 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-12-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 4th Symposium on Information and Communication Technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2542050.2542079\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 4th Symposium on Information and Communication Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2542050.2542079","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Document clustering using dirichlet process mixture model of von Mises-Fisher distributions
Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. This paper proposes a Dirichlet process mixture (DPM) model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. We have developed a mean-field variational inference algorithm for the DPM model of vMFs that is applied to clustering text documents. Using this model, the number of clusters is determined automatically after the clustering process rather than pre-estimated. We conducted extensive experiments to evaluate the proposed approach on a large number of high dimensional text datasets. Empirical experimental results over NMI (Normalized Mutual Information) and Purity evaluation measures demonstrate that our approach outperforms the four state-of-the-art clustering algorithms.