Research of Massive Internet Text Data Real-Time Loading and Index System
Weihong Han, Yan Jia, Shuqiang Yang
2009 Fifth International Joint Conference on INC, IMS and IDC. Published 2009-08-25.
DOI: 10.1109/NCM.2009.414 (https://doi.org/10.1109/NCM.2009.414)
Citation count: 0
Abstract
With the rapid development of the Internet and communication technology, massive amounts of text data have accumulated on the Internet, including text from web pages, emails, and instant messages. Growing data volumes, combined with the need to load data and create text indexes in real time, pose enormous challenges for data-loading techniques. This paper presents Text-loader, a real-time data loading system used in ITSR (Internet Text Data Real-time Storage and Retrieval System). Text-loader consists of an efficient algorithm for bulk data loading with an exchange partition mechanism, an incremental text index creation algorithm, optimized parallelism, and guidelines for system tuning. Performance studies show the positive effects of these techniques: the loading speed of each cluster increases from 220 million records per day to 1.2 billion records per day, and the system reaches a peak loading speed of 6 TB of data with 10 clusters running in parallel. This framework offers a promising approach for loading other large and complex text databases.
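The abstract names bulk loading with an exchange partition mechanism and incremental index creation but gives no implementation details. The following is a minimal sketch of that general pattern, not the authors' algorithm: a batch is loaded and indexed in a staging area, then swapped into the live store in one step, so only the new batch is ever indexed. The names `PartitionedStore`, `build_staging`, and `exchange_partition`, the date-style partition keys, and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict


class PartitionedStore:
    """Toy in-memory store: each partition holds its records and its own inverted index."""

    def __init__(self):
        self.partitions = {}  # partition key -> (records, index)

    def exchange_partition(self, key, staged):
        """Swap a fully built staging partition into the live store in one assignment."""
        previous = self.partitions.get(key)
        self.partitions[key] = staged
        return previous

    def search(self, term):
        """Return (partition key, record id) pairs for records containing the term."""
        hits = []
        for key, (records, index) in sorted(self.partitions.items()):
            hits.extend((key, rid) for rid in sorted(index.get(term, ())))
        return hits


def build_staging(texts):
    """Bulk-load a batch into a staging area and index only that batch."""
    records = dict(enumerate(texts))
    index = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():  # hypothetical tokenizer: whitespace split
            index[token].add(rid)
    return records, dict(index)


store = PartitionedStore()
# Hypothetical partition keys; the source does not say how ITSR partitions data.
store.exchange_partition("2009-08-24", build_staging(["hello world", "bulk load"]))
store.exchange_partition("2009-08-25", build_staging(["real-time load"]))
print(store.search("load"))
```

Because each batch carries its own index, adding a partition never touches the indexes of partitions already in place, which is the property that makes this style of loading attractive for continuous ingestion.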