A Multi-tenant Fair Share Approach to Full-text Search Engine
Zong Peng, Beth Plale
Pub Date: 2016-11-13 | DOI: 10.1109/DATACLOUD.2016.10
Full-text search engines underlie the search services of major content providers such as Google, Bing, and Yahoo. Open source search engines, such as Solr and ElasticSearch, are highly scalable and …
{"title":"A Multi-tenant Fair Share Approach to Full-text Search Engine","authors":"Zong Peng, Beth Plale","doi":"10.1109/DATACLOUD.2016.10","DOIUrl":"https://doi.org/10.1109/DATACLOUD.2016.10","url":null,"abstract":"Full text search engines underly the search of major content providers, Google, Bing and Yahoo. Open source search engines, such as Solr and ElasticSearch, are highly scalable and","PeriodicalId":325593,"journal":{"name":"2016 Seventh International Workshop on Data-Intensive Computing in the Clouds (DataCloud)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115769978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Asterism: Pegasus and Dispel4py Hybrid Workflows for Data-Intensive Science
Rosa Filgueira, Rafael Ferreira da Silva, A. Krause, E. Deelman, M. Atkinson
Pub Date: 2016-11-13 | DOI: 10.1109/DATACLOUD.2016.4
We present Asterism, an open source data-intensive framework that combines the strengths of traditional workflow management systems with new parallel stream-based dataflow systems to run data-intensive applications across multiple heterogeneous resources, without users having to: re-formulate their methods for different enactment engines; manage data distribution across systems; parallelize their methods; co-place and schedule their methods with computing resources; or store and transfer large/small volumes of data. We also present the Data-Intensive workflows as a Service (DIaaS) model, which enables easy data-intensive workflow composition and deployment on clouds using containers. The feasibility of Asterism and the DIaaS model has been evaluated using a real domain application on the NSF Chameleon cloud. Experimental results show how Asterism successfully and efficiently exploits combinations of diverse computational platforms, while DIaaS delivers specialized software to execute data-intensive applications in a scalable, efficient, and robust way, reducing engineering time and computational cost.
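The abstract does not include Asterism's or dispel4py's actual API, so the following is only a conceptual Python sketch of the core idea of composing a pipeline once and enacting it with interchangeable engines (sequential or a process pool); the stage names and functions are hypothetical, not part of either framework:

from multiprocessing import Pool

def tokenize(record):
    # split a text record into tokens
    return record.split()

def count_tokens(tokens):
    # reduce each token list to its length
    return len(tokens)

PIPELINE = [tokenize, count_tokens]  # composed once, independent of the enactment engine

def run_sequential(pipeline, data):
    for stage in pipeline:
        data = [stage(item) for item in data]
    return data

def run_with_process_pool(pipeline, data, workers=4):
    with Pool(workers) as pool:
        for stage in pipeline:
            data = pool.map(stage, data)
    return data

if __name__ == "__main__":
    records = ["a b c", "d e", "f"]
    # both engines produce the same result: [3, 2, 1]
    assert run_sequential(PIPELINE, records) == run_with_process_pool(PIPELINE, records)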
{"title":"Asterism: Pegasus and Dispel4py Hybrid Workflows for Data-Intensive Science","authors":"Rosa Filgueira, Rafael Ferreira da Silva, A. Krause, E. Deelman, M. Atkinson","doi":"10.1109/DATACLOUD.2016.4","DOIUrl":"https://doi.org/10.1109/DATACLOUD.2016.4","url":null,"abstract":"We present Asterism, an open source data-intensive framework, which combines the strengths of traditional workflow management systems with new parallel stream-based dataflow systems to run data-intensive applications across multiple heterogeneous resources, without users having to: re-formulate their methods according to different enactment engines; manage the data distribution across systems; parallelize their methods; co-place and schedule their methods with computing resources; and store and transfer large/small volumes of data. We also present the Data-Intensive workflows as a Service (DIaaS) model, which enables easy dataintensive workow composition and deployment on clouds using containers. The feasibility of Asterism and DIaaS model have been evaluated using a real domain application on the NSF-Chameleon cloud. Experimental results shows how Asterism successfully and efficiently exploits combinations of diverse computational platforms, whereas DIaaS delivers specialized software to execute data-intensive applications in a scalable, efficient, and robust way reducing the engineering time and computational cost.","PeriodicalId":325593,"journal":{"name":"2016 Seventh International Workshop on Data-Intensive Computing in the Clouds (DataCloud)","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122450540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved Data-Aware Task Dispatching for Batch Queuing Systems
Xieming Li, O. Tatebe
Pub Date: 2016-11-13 | DOI: 10.1109/DATACLOUD.2016.9
This paper describes a data-aware task dispatching strategy called Improved Data-Aware Task Dispatching (IDAD). The approach exploits the high performance of local file access in non-uniform storage-access (NUSA) file systems and builds on our previous work, Data-Aware Dispatch (DAD). In IDAD, the method for calculating data placement is revised and the CPU factor is removed, as it has no major impact on performance but significantly complicates parameter tuning. We evaluated our approach against DAD and the stock FIFO Torque scheduler using BLAST benchmarks, observing makespan reductions of 10.40% and 35.05% compared with DAD and the stock FIFO scheduler, respectively.
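As a rough illustration of the data-aware dispatching idea (not the authors' IDAD implementation, whose scoring details are not given in this abstract), a minimal Python sketch might pick the node that already holds the largest share of a task's input bytes, breaking ties by queue length:

def dispatch(task_inputs, placement, node_load):
    """task_inputs: {filename: size_in_bytes} for one task
       placement:   {filename: set of node ids holding a local replica}
       node_load:   {node id: number of queued tasks}"""
    local_bytes = {node: 0 for node in node_load}
    for fname, size in task_inputs.items():
        for node in placement.get(fname, ()):
            if node in local_bytes:
                local_bytes[node] += size
    # prefer the node with the most local input data; break ties by lighter load
    return max(node_load, key=lambda n: (local_bytes[n], -node_load[n]))

# Example with made-up nodes and files:
# dispatch({"db.fa": 8_000_000}, {"db.fa": {"n1"}}, {"n1": 3, "n2": 0})  -> "n1"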
{"title":"Improved Data-Aware Task Dispatching for Batch Queuing Systems","authors":"Xieming Li, O. Tatebe","doi":"10.1109/DATACLOUD.2016.9","DOIUrl":"https://doi.org/10.1109/DATACLOUD.2016.9","url":null,"abstract":"This paper describes a data-aware task dispatching strategy called Improved Data-Aware Task Dispatching (IDAD). This approach exploits the high-performance of local file access in non-uniform storage-access (NUSA) file systems and is based on our previous work, Data-Aware Dispatch (DAD). In IDAD, the method of calculating data placement is revised, and the CPU factor is removed, as it has no major impact on performance but significantly reduces the difficulty for tweaking parameter.We evaluated our approach in comparison with DAD and the stock FIFO Torque scheduler using BLAST benchmarks. We observed makespan reductions of 10.40% and 35.05% compared with DAD and stock FIFO schedulers, respectively.","PeriodicalId":325593,"journal":{"name":"2016 Seventh International Workshop on Data-Intensive Computing in the Clouds (DataCloud)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115038948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data-Intensive Supercomputing in the Cloud: Global Analytics for Satellite Imagery
Michael S. Warren, S. Skillman, R. Chartrand, T. Kelton, R. Keisler, D. Raleigh, M. Turk
Pub Date: 2016-11-13 | DOI: 10.1109/DataCloud.2016.7
We present our experiences using cloud computing to support data-intensive analytics on satellite imagery for commercial applications. Drawing on our background in high-performance computing, we note parallels between the early days of clustered computing systems and the current state of cloud computing and its potential to disrupt the HPC market. Using our own virtual file system layer on top of cloud remote object storage, we demonstrate an aggregate read bandwidth of 230 gigabytes per second using 512 Google Compute Engine (GCE) nodes accessing a USA multi-region standard storage bucket. This figure is comparable to the best HPC storage systems in existence. We also present several of our application results, including the identification of field boundaries in Ukraine and the generation of a global cloud-free base layer from Landsat imagery.
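For scale, 230 GB/s aggregated over 512 nodes is roughly 0.45 GB/s per node. The sketch below shows how per-node read throughput against a Google Cloud Storage bucket might be measured; it uses the standard google-cloud-storage client rather than the authors' virtual file system layer, and the bucket and object names are hypothetical.

import time
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage  # assumes the google-cloud-storage package is installed

def measure_read_bandwidth(bucket_name, blob_names, workers=32):
    # download a set of objects concurrently and report bytes/second on this node
    bucket = storage.Client().bucket(bucket_name)
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(pool.map(
            lambda name: len(bucket.blob(name).download_as_bytes()), blob_names))
    return sum(sizes) / (time.time() - start)

# Hypothetical usage:
# bw = measure_read_bandwidth("imagery-bucket", ["tile-%04d.tif" % i for i in range(256)])
# print(bw / 1e9, "GB/s on this node")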
{"title":"Data-Intensive Supercomputing in the Cloud: Global Analytics for Satellite Imagery","authors":"Michael S. Warren, S. Skillman, R. Chartrand, T. Kelton, R. Keisler, D. Raleigh, M. Turk","doi":"10.1109/DataCloud.2016.7","DOIUrl":"https://doi.org/10.1109/DataCloud.2016.7","url":null,"abstract":"We present our experiences using cloud computing to support data-intensive analytics on satellite imagery for commercial applications. Drawing from our background in highperformance computing, we draw parallels between the early days of clustered computing systems and the current state of cloud computing and its potential to disrupt the HPC market. Using our own virtual file system layer on top of cloud remote object storage, we demonstrate aggregate read bandwidth of 230 gigabytes per second using 512 Google Compute Engine (GCE) nodes accessing a USA multi-region standard storage bucket. This figure is comparable to the best HPC storage systems in existence. We also present several of our application results, including the identification of field boundaries in Ukraine, and the generation of a global cloud-free base layer from Landsat imagery.","PeriodicalId":325593,"journal":{"name":"2016 Seventh International Workshop on Data-Intensive Computing in the Clouds (DataCloud)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123312920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Model Driven Advanced Hybrid Cloud Services for Big Data: Paradigm and Practice
Xi Yang, T. Lehman
Pub Date: 2016-11-13 | DOI: 10.1109/DATACLOUD.2016.8
Advanced hybrid cloud services aim to serve big data applications by bridging multi-provider, high-performance cloud resources, including direct connects, hypervisor-bypassing VM interfaces, on-premises clusters, parallel storage, and high-speed inter-cloud networks. We present a new “full-stack model driven orchestration” paradigm that integrates these diverse resources through semantic modeling and provides complex high-end services through dynamically orchestrated workflows. We also present the architectural design of a real-world orchestration system, VersaStack, that implements this paradigm, along with a case study of providing full-scale advanced hybrid cloud services in practice.
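The abstract does not describe VersaStack's modeling language, so the following is only a toy Python illustration of the model-driven idea: resources and their dependencies are declared as data, and a provisioning order is derived from the model rather than hard-coded in a script. The resource names are invented for the example.

from graphlib import TopologicalSorter  # Python 3.9+

# Declarative resource model: each entry maps a resource to the resources it depends on.
model = {
    "vm-cluster":          ["direct-connect", "parallel-storage"],
    "direct-connect":      ["inter-cloud-network"],
    "parallel-storage":    [],
    "inter-cloud-network": [],
}

# Derive a valid provisioning order from the model (dependencies are created first).
provisioning_order = list(TopologicalSorter(model).static_order())
print(provisioning_order)
# e.g. ['parallel-storage', 'inter-cloud-network', 'direct-connect', 'vm-cluster']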
{"title":"Model Driven Advanced Hybrid Cloud Services for Big Data: Paradigm and Practice","authors":"Xi Yang, T. Lehman","doi":"10.1109/DATACLOUD.2016.8","DOIUrl":"https://doi.org/10.1109/DATACLOUD.2016.8","url":null,"abstract":"Advanced hybrid cloud services aim to serve big data applications by bridging multi-provider high performance cloud resources including direct connects, hypervisor bypassing VM interfaces, on premise clusters, parallel storage and high speed inter-cloud networks. We present a new “full-stack model driven orchestration” paradigm to integrate these diverse resources through semantic modeling and provide complex highend services through dynamic orchestrated workflows. We also present architectural design of a real-world orchestration system, VersaStack, that implements the paradigm as well as a case study for providing full-scale advanced hybrid cloud services in practice.","PeriodicalId":325593,"journal":{"name":"2016 Seventh International Workshop on Data-Intensive Computing in the Clouds (DataCloud)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129533770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pecos: A Scalable Solution for Analyzing and Managing Qualitative Data
R. Arora, Trung Nguyen Ba, Tiffany A. Connors
Pub Date: 2016-11-13 | DOI: 10.1109/DATACLOUD.2016.6
Large, heterogeneous, and complex data collections can be difficult to analyze and manage manually. There is a need for scalable and user-friendly approaches that can automate the …
{"title":"Pecos: A Scalable Solution for Analyzing and Managing Qualitative Data","authors":"R. Arora, Trung Nguyen Ba, Tiffany A. Connors","doi":"10.1109/DATACLOUD.2016.6","DOIUrl":"https://doi.org/10.1109/DATACLOUD.2016.6","url":null,"abstract":"Large, heterogeneous, and complex data collections can be difficult to analyze and manage manually. There is a need for scalable and user-friendly approaches that can automate the","PeriodicalId":325593,"journal":{"name":"2016 Seventh International Workshop on Data-Intensive Computing in the Clouds (DataCloud)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133117992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Ensemble-Based Recommendation Engine for Scientific Data Transfers
William Agnew, Michael Fischer, Ian T Foster, K. Chard
Pub Date: 2016-11-13 | DOI: 10.1109/DATACLOUD.2016.5
Big data scientists face the challenge of locating valuable datasets across a network of distributed storage locations. We explore methods for recommending storage locations (“endpoints”) for users based on a range of prediction models including collaborative filtering and heuristics that consider available information such as user, institution, access history, endpoint ownership, and endpoint usage. We combine the strengths of these models by training a deep recurrent neural network on their predictions. Collectively we show, via analysis of historical usage from the Globus research data management service, that our approach can predict the next storage location accessed by users with 80.3% and 95.3% accuracy for top-1 and top-3 recommendations, respectively. Additionally, our heuristics can predict the endpoints that users will use in the future with over 75% precision and recall.
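As an illustration of the kind of heuristic baseline and top-k evaluation described above (not the paper's actual ensemble, which also combines collaborative filtering and a recurrent neural network over the individual predictors), a minimal Python sketch could rank endpoints by a user's past access frequency and score top-k accuracy:

from collections import Counter

def predict_top_k(history, k=3):
    """Rank candidate endpoints by how often this user accessed them before."""
    counts = Counter(history)
    return [endpoint for endpoint, _ in counts.most_common(k)]

def top_k_accuracy(sessions, k=3):
    """sessions: list of (access_history, next_endpoint) pairs."""
    hits = sum(1 for history, nxt in sessions if nxt in predict_top_k(history, k))
    return hits / len(sessions)

# Example with hypothetical endpoint IDs:
# sessions = [(["ep1", "ep2", "ep1"], "ep1"), (["ep3"], "ep2")]
# print(top_k_accuracy(sessions, k=3))  # 0.5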
{"title":"An Ensemble-Based Recommendation Engine for Scientific Data Transfers","authors":"William Agnew, Michael Fischer, Ian T Foster, K. Chard","doi":"10.1109/DATACLOUD.2016.5","DOIUrl":"https://doi.org/10.1109/DATACLOUD.2016.5","url":null,"abstract":"Big data scientists face the challenge of locating valuable datasets across a network of distributed storage locations. We explore methods for recommending storage locations (“endpoints”) for users based on a range of prediction models including collaborative filtering and heuristics that consider available information such as user, institution, access history, endpoint ownership, and endpoint usage. We combine the strengths of these models by training a deep recurrent neural network on their predictions. Collectively we show, via analysis of historical usage from the Globus research data management service, that our approach can predict the next storage location accessed by users with 80.3% and 95.3% accuracy for top-1 and top-3 recommendations, respectively. Additionally, our heuristics can predict the endpoints that users will use in the future with over 75% precision and recall.","PeriodicalId":325593,"journal":{"name":"2016 Seventh International Workshop on Data-Intensive Computing in the Clouds (DataCloud)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124106996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Efficient Parallel Implementation of a Light-weight Data Privacy Method for Mobile Cloud Users
M. Bahrami, Dong Li, M. Singhal, A. Kundu
Pub Date: 2016-11-13 | DOI: 10.1109/DATACLOUD.2016.11
Cloud computing gives users an opportunity to outsource their data and applications. However, data privacy is one of the key challenges for users who outsource their data to cloud servers. Data encryption is the best option for protecting users' data privacy in the cloud, but the computational overhead of encryption can be expensive for small computing devices, such as mobile or IoT devices with limited resources such as battery life. In our previous study, we developed a light-weight Data Privacy Method (DPM) based on a chaos system that uses a Pseudo Random Permutation (PRP) to scramble the content of the original data. Although PRP does not parallelize naturally, we provide an efficient parallel algorithm that scrambles a file after splitting it into multiple chunks. Parallel DPM prevents an adversary from accessing the original data (e.g., via a brute-force attack) when each scrambled chunk is large enough. In this paper, we accelerate DPM on a Graphics Processing Unit (GPU) using the NVIDIA CUDA platform. We assess the pseudo-randomly generated shuffle addresses and the distribution of randomness when the computation is parallelized across multiple GPU cores. A set of rigorous evaluation results shows that parallel DPM outperforms the traditional DPM when the most time-consuming native CUDA parallel functions are monitored. We also perform a security analysis of parallel DPM to ensure that it is secure and that it is a cost-effective model for protecting users' data privacy in a cloud environment.
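The paper's DPM and its CUDA kernels are not reproduced in this abstract; the following Python sketch only illustrates the general idea of scrambling independent file chunks with a seeded pseudo-random permutation, which is what makes the scheme amenable to parallel (e.g., GPU) execution. The function names and chunking scheme are assumptions, not the authors' code.

import random

def scramble_chunk(chunk: bytes, seed: int) -> bytes:
    # permute the chunk's byte positions with a seeded pseudo-random permutation
    perm = list(range(len(chunk)))
    random.Random(seed).shuffle(perm)
    return bytes(chunk[i] for i in perm)

def unscramble_chunk(scrambled: bytes, seed: int) -> bytes:
    # rebuild the same permutation from the seed and invert it
    perm = list(range(len(scrambled)))
    random.Random(seed).shuffle(perm)
    out = bytearray(len(scrambled))
    for dst, src in enumerate(perm):
        out[src] = scrambled[dst]
    return bytes(out)

# Round-trip check; each (chunk, seed) pair is independent, so chunks can be
# processed in parallel (e.g., with multiprocessing.Pool or a GPU kernel).
assert unscramble_chunk(scramble_chunk(b"sensitive-bytes", 42), 42) == b"sensitive-bytes"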
{"title":"An Efficient Parallel Implementation of a Light-weight Data Privacy Method for Mobile Cloud Users","authors":"M. Bahrami, Dong Li, M. Singhal, A. Kundu","doi":"10.1109/DATACLOUD.2016.11","DOIUrl":"https://doi.org/10.1109/DATACLOUD.2016.11","url":null,"abstract":"Cloud computing provides an opportunity to users to outsource their data and applications. However, data privacy is one of the key challenges for the users who are outsourcing data on some transparent cloud servers. Data encryption is the best option to protect users' data privacy on the cloud. However, computation overheads of encryption methods could be expensive to some small computing machines, such as mobile or IoT devices with limited resources, such as battery. In our previous study, we developed a light-weight Data Privacy Method (DPM) based on a chaos system that uses a Pseudo Random Permutation (PRP) to scramble the content of original data. Although the nature of PRP is against parallelization, we provide an efficient parallel algorithm to scramble a file while the file splits into multiple chunks. The parallel DPM avoids an adversary to access the original data (e.g., by using a brute-force attack), when the size of each scrambled data is large enough. In this paper, we accelerate DPM on a Graphic Processing Unit (GPU) by using NVIDIA CUDA platform for implementation. We assess the generated shuffle addresses from pseudo-random and the distribution of randomness when the computation on data is parallelized on a multiple GPU-cores. A set of rigorous evaluation results shows that the parallel DPM provides a superior performance over tradition DPM when the most time consuming of native CUDA parallel functions have monitored. We also perform a security analysis of parallel DPM to ensure it is secure and it is a cost effective model to protect users' data privacy in a cloud environment.","PeriodicalId":325593,"journal":{"name":"2016 Seventh International Workshop on Data-Intensive Computing in the Clouds (DataCloud)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116773444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}