Authors: Tomás Kucecka, D. Chudá, P. Sladecek
Venue: 2013 IEEE 14th International Symposium on Computational Intelligence and Informatics (CINTI)
Published: 2013-11-01
DOI: 10.1109/CINTI.2013.6705250
FICWAN: Frequent Itemset Clustering of Web Articles by Analyzing the Article Neighborhood
Document clustering is the process of organizing text data into clusters, where a cluster usually represents a group of topic-related documents. The most effective text clustering approaches are based on frequent itemsets. A popular algorithm that uses this approach is FIHC (Frequent Itemset-based Hierarchical Clustering), and many modifications of it have been proposed in recent years. In this paper we focus on clustering web articles, a special type of text data: they contain hyperlinks that connect them to other articles on the web. We propose FICWAN, a modification of FIHC that is especially suited to web data. We show that by considering the neighborhood of a web article, together with its HTML tags and CSS, we can significantly improve the quality of the resulting clusters. We evaluated our approach on several corpora, and it clearly outperformed FIHC.
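The abstract does not give the details of FICWAN, so the following is only a toy sketch of the general idea it builds on: term sets that occur in enough documents (frequent itemsets) define candidate clusters, and an article's terms can be enriched with terms from the articles it links to. The function names, the brute-force enumeration, the `min_support` threshold, and the way neighborhood terms are merged are all assumptions made for illustration, not the authors' method.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Map each frequent term set to the ids of the documents that contain it.

    Brute-force enumeration for illustration only; real systems use
    Apriori- or FP-growth-style mining instead.
    """
    n = len(docs)
    vocabulary = sorted(set().union(*docs.values()))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(vocabulary, size):
            cover = {d for d, terms in docs.items() if set(candidate) <= terms}
            if len(cover) / n >= min_support:
                result[frozenset(candidate)] = cover
    return result


def with_neighborhood(docs, links):
    """Hypothetical enrichment: add the terms of linked articles to each article."""
    enriched = {}
    for doc_id, terms in docs.items():
        neighbor_terms = set()
        for target in links.get(doc_id, ()):
            neighbor_terms |= docs.get(target, set())
        enriched[doc_id] = terms | neighbor_terms
    return enriched


# Toy corpus: article id -> set of terms (made-up data).
docs = {
    "a": {"python", "code"},
    "b": {"python", "code", "web"},
    "c": {"football", "goal"},
    "d": {"football", "match"},
}
# Hyperlinks: article "d" links to article "c".
links = {"d": ["c"]}

plain = frequent_itemsets(docs, min_support=0.5)
enriched = frequent_itemsets(with_neighborhood(docs, links), min_support=0.5)

for itemset, members in sorted(enriched.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), "->", sorted(members))
```

In this toy corpus the term "goal" occurs in only one article, so it is not frequent on its own; once article "d" inherits the terms of the article it links to, the itemset {"football", "goal"} covers both sports articles. This only illustrates why link neighborhoods can help; how FICWAN actually weights neighbors, HTML tags, and CSS is not stated in the abstract.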