NoCS2: Topic-Based Clustering of Big Data Text Corpus in the Cloud

S. Zobaed, Md. Enamul Haque, Shahidullah Kaiser, R. Hussain
Published in: 2018 21st International Conference of Computer and Information Technology (ICCIT)
Publication date: 2018-12-01
DOI: 10.1109/ICCITECHN.2018.8631951 (https://doi.org/10.1109/ICCITECHN.2018.8631951)
Citations: 5

Abstract

Cloud services are widely deployed to store and process big data. Organizations that deal with big data, especially large document sets, prefer cloud services for their storage and computational efficiency. However, inefficient processing of a large text corpus is computationally expensive for real-time systems. In addition, efficient memory utilization is important when clustering big data, including large text corpora. Clustering a large text corpus is an important component of various document retrieval systems such as PubMed. To address these challenges, we present NoCS2 (Number of Cluster and Seed Selection) for efficient topic-based clustering of unstructured big data in the cloud. NoCS2 relies on the computing and storage services of the cloud server. Traditional clustering solutions for text datasets use a fixed number of clusters irrespective of the dataset's size and characteristics, such as its subject area (e.g., science and technology). In contrast, our solution dynamically determines the appropriate number of clusters $k$ based on the characteristics of the dataset. In particular, we use the trace of a precomputed matrix, which represents all of the dataset's keywords in vector form, as the number of clusters. Then, we build $k$ clusters using topic-based similarity among keywords. Finally, we compare the proposed method with two state-of-the-art clustering methods. Empirical results demonstrate that NoCS2 achieves a better average closeness score than the other methods on large and sparse datasets.
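The two stages described in the abstract (estimating $k$ from a matrix trace, then grouping keywords by similarity) can be illustrated with a small sketch. The abstract does not say which matrix the trace is taken over or how seeds are chosen, so everything below is an assumption for illustration only: the trace is computed over a cover-coefficient-style matrix built from a document-keyword count matrix, seeds are picked greedily as mutually dissimilar keywords, and similarity is plain cosine similarity. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy corpus: rows are documents, columns are keyword counts.
KEYWORDS = ["gene", "protein", "cloud", "server"]
DOC_TERM = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]

def estimate_k(doc_term):
    """Estimate k as the rounded trace of a cover-coefficient-style matrix.

    ASSUMPTION: the diagonal entry for document i is
    sum_j d_ij^2 / (row_sum_i * col_sum_j); the paper's matrix may differ.
    """
    n_docs, n_terms = len(doc_term), len(doc_term[0])
    row_sums = [sum(row) for row in doc_term]
    col_sums = [sum(doc_term[i][j] for i in range(n_docs)) for j in range(n_terms)]
    trace = 0.0
    for i in range(n_docs):
        for j in range(n_terms):
            if row_sums[i] and col_sums[j]:
                trace += doc_term[i][j] ** 2 / (row_sums[i] * col_sums[j])
    return max(1, round(trace))

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_keywords(doc_term, keywords, k):
    """Group keywords around k seed keywords by cosine similarity.

    ASSUMPTION: seeds are chosen greedily (most frequent keyword first, then
    the keyword least similar to the seeds chosen so far).
    """
    # Each keyword is represented by its column of per-document counts.
    vectors = {kw: [row[j] for row in doc_term] for j, kw in enumerate(keywords)}
    k = min(k, len(keywords))
    seeds = [max(keywords, key=lambda kw: sum(vectors[kw]))]
    while len(seeds) < k:
        candidates = [kw for kw in keywords if kw not in seeds]
        seeds.append(min(
            candidates,
            key=lambda kw: max(cosine(vectors[kw], vectors[s]) for s in seeds),
        ))
    clusters = {seed: [] for seed in seeds}
    for kw in keywords:
        nearest = max(seeds, key=lambda s: cosine(vectors[kw], vectors[s]))
        clusters[nearest].append(kw)
    return clusters

k = estimate_k(DOC_TERM)
print(k, cluster_keywords(DOC_TERM, KEYWORDS, k))
```

On this toy corpus the two topical groups of documents barely overlap, so the trace stays close to the number of groups; on real data the estimate depends heavily on how the matrix is normalized, which is exactly the part the abstract leaves unspecified.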