{"title":"异构XML文档聚类的结构和词袋折衷方法研究","authors":"Nadia Zerida, Jin Yao","doi":"10.1109/ADVCOMP.2008.28","DOIUrl":null,"url":null,"abstract":"The presence of a large quantity of unlabeled documents on the web increases, and organizing related heterogeneous XML documents by using their structural and conceptual properties into clusters become a great need. In this paper, we consider the pre-processing step as a key step to improve clustering quality, we propose a new pre-processing method which is based on combining Hapax words and path-based descriptors. A constrained agglomerative clustering method is used, and a comparison between different document representations is performed. The effectiveness of the method is evaluated on the INEX corpus, and clustering quality is measured by using micro and macro average purity measures.","PeriodicalId":269090,"journal":{"name":"2008 The Second International Conference on Advanced Engineering Computing and Applications in Sciences","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Towards Compromising Structural and Bag of Words Approaches for Clustering Heterogeneous XML Documents\",\"authors\":\"Nadia Zerida, Jin Yao\",\"doi\":\"10.1109/ADVCOMP.2008.28\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The presence of a large quantity of unlabeled documents on the web increases, and organizing related heterogeneous XML documents by using their structural and conceptual properties into clusters become a great need. In this paper, we consider the pre-processing step as a key step to improve clustering quality, we propose a new pre-processing method which is based on combining Hapax words and path-based descriptors. A constrained agglomerative clustering method is used, and a comparison between different document representations is performed. The effectiveness of the method is evaluated on the INEX corpus, and clustering quality is measured by using micro and macro average purity measures.\",\"PeriodicalId\":269090,\"journal\":{\"name\":\"2008 The Second International Conference on Advanced Engineering Computing and Applications in Sciences\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-09-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2008 The Second International Conference on Advanced Engineering Computing and Applications in Sciences\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ADVCOMP.2008.28\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 The Second International Conference on Advanced Engineering Computing and Applications in Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ADVCOMP.2008.28","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Towards Compromising Structural and Bag of Words Approaches for Clustering Heterogeneous XML Documents
The presence of a large quantity of unlabeled documents on the web increases, and organizing related heterogeneous XML documents by using their structural and conceptual properties into clusters become a great need. In this paper, we consider the pre-processing step as a key step to improve clustering quality, we propose a new pre-processing method which is based on combining Hapax words and path-based descriptors. A constrained agglomerative clustering method is used, and a comparison between different document representations is performed. The effectiveness of the method is evaluated on the INEX corpus, and clustering quality is measured by using micro and macro average purity measures.