S. Zobaed, Md. Enamul Haque, Shahidullah Kaiser, R. Hussain
2018 21st International Conference of Computer and Information Technology (ICCIT), published 2018-12-01
DOI: 10.1109/ICCITECHN.2018.8631951 (https://doi.org/10.1109/ICCITECHN.2018.8631951)
Citations: 5
NoCS2: Topic-Based Clustering of Big Data Text Corpus in the Cloud
Cloud services are widely deployed to store and process big data. Organizations that deal with big data, especially large document sets, prefer cloud services for their storage and computational efficiency. However, inefficient processing of a large text corpus is computationally expensive for real-time systems. In addition, efficient memory utilization is important when clustering big data, including large text corpora. Clustering a large text corpus is an important component of various document retrieval systems such as PubMed. To address these challenges, we present NoCS2 (Number of Cluster and Seed Selection) for efficient topic-based clustering of unstructured big data in the cloud. NoCS2 relies on the computing and storage services of the cloud server. Traditional clustering solutions for text datasets assume a fixed number of clusters irrespective of the dataset's size and characteristics, such as its subject domain (for example, science or technology). In contrast, our solution dynamically determines an appropriate number of clusters, $k$, based on the characteristics of the dataset. Specifically, for a dataset whose full set of keywords is given in vector representation, we use a precomputed matrix trace as the number of clusters. Then, we build $k$ clusters using topic-based similarity among keywords. Finally, we compare our proposed method with two state-of-the-art clustering methods. Empirical results demonstrate that the average closeness score of NoCS2 is better than that of the other methods for large and sparse datasets.
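The abstract does not say which matrix's trace is used, how seeds are chosen, or how keywords are assigned to clusters, so the following is only a minimal Python sketch of the general idea and not the authors' implementation. It assumes that keywords are given as non-negative vectors, that the matrix is the row-normalized cosine-similarity matrix (whose trace equals the number of groups when keywords fall into perfectly separated groups of identical vectors), and that seeds are picked greedily. All function names (`estimate_k`, `pick_seeds`, `cluster_keywords`) and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarities between all keyword vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def estimate_k(sim):
    """ASSUMED heuristic: k = rounded trace of the row-normalized similarity matrix."""
    trace = sum(row[i] / sum(row) for i, row in enumerate(sim) if sum(row) > 0)
    return max(1, round(trace))


def pick_seeds(sim, k):
    """ASSUMED seed rule: start at the best-connected keyword, then repeatedly
    add the keyword that is least similar to the seeds chosen so far."""
    seeds = [max(range(len(sim)), key=lambda i: sum(sim[i]))]
    while len(seeds) < k:
        rest = [i for i in range(len(sim)) if i not in seeds]
        seeds.append(min(rest, key=lambda i: max(sim[i][s] for s in seeds)))
    return seeds


def cluster_keywords(keywords, vectors):
    """Estimate k, choose k seeds, and attach each keyword to its most similar seed."""
    sim = similarity_matrix(vectors)
    k = estimate_k(sim)
    seeds = pick_seeds(sim, k)
    clusters = {keywords[s]: [] for s in seeds}
    for i, word in enumerate(keywords):
        best = max(seeds, key=lambda s: sim[i][s])
        clusters[keywords[best]].append(word)
    return k, clusters


# Toy data (hypothetical): each keyword is a vector of counts over four documents.
keywords = ["gene", "protein", "cloud", "server"]
vectors = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 3]]
k, clusters = cluster_keywords(keywords, vectors)
print(k, clusters)
```

On this toy input the trace is about 2.36, so the sketch settles on two clusters, one per topic. The point of the design is that $k$ falls out of the data rather than being fixed in advance; any other matrix whose trace tracks the number of topics could be substituted in `estimate_k`.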