"File System Scalability with Highly Decentralized Metadata on Independent Storage Devices" by P. Lensing, Toni Cortes, J. Hughes, A. Brinkmann. CCGrid 2016. DOI: 10.1109/CCGrid.2016.28
This paper discusses using hard drives that integrate a key-value interface and network access in the drive hardware itself (the Kinetic storage platform) to provide file system functionality in a large-scale environment. Taking advantage of this higher-level functionality to handle metadata on the drives themselves, a serverless system architecture is proposed. Skipping path-component traversal during lookup is the key technique discussed to avoid performance degradation with highly decentralized metadata. Scalability implications are evaluated using a FUSE file system implementation.
"Seeking for the Optimal Energy Modelisation Accuracy to Allow Efficient Datacenter Optimizations" by E. Outin, Jean-Emile Dartois, Olivier Barais, Jean-Louis Pazat. CCGrid 2016. DOI: 10.1109/CCGrid.2016.67
As cloud computing sees ever-wider use, datacenters account for a large share of overall energy consumption. We propose to tackle this problem by continuously and autonomously optimizing the energy efficiency of cloud datacenters. To this end, modeling the energy consumption of these infrastructures is crucial to drive the optimization process, anticipate the effects of aggressive optimization policies, and determine precisely the gains brought by the planned optimization. Yet it is very complex to model the energy consumption of a physical device accurately, as it depends on several factors. Do we need a detailed, fine-grained energy model to perform good optimizations in the datacenter, or is a simple, naive energy model good enough to propose viable energy-efficient optimizations? Our experimental results show no energy savings compared to classical bin-packing strategies, but precise modeling does bring some gains: better utilization of the network and of the VM migration processes.
"A Formal Approach for Service Composition in a Cloud Resources Sharing Context" by Kais Klai, Hanen Ochi. CCGrid 2016. DOI: 10.1109/CCGrid.2016.74
Composition of Cloud services is necessary when a single component cannot satisfy all of a user's requirements. It is a complex task for Cloud managers, involving several operations such as discovery, compatibility checking, selection, and deployment. As in a non-Cloud environment, service composition raises the need for design-time approaches to check the correct interaction between the components of a composite service. For Cloud-based service composition, however, new constraints such as resource management, elasticity, and multi-tenancy have to be considered. In this work, we use Symbolic Observation Graphs (SOG) to abstract Cloud services and to check the correctness of their composition with respect to event- and state-based LTL formulae. The violation of such formulae can stem either from the stakeholders' interaction or from the shared Cloud resources. In the former case, the involved services are considered incompatible; in the latter, the problem can be solved by deploying additional resources. The approach proposed in this paper thus allows checking whether the resource provider service is able, at run time, to satisfy users' requests for Cloud resources.
"Flexible Data-Aware Scheduling for Workflows over an In-memory Object Store" by Francisco Rodrigo Duro, Francisco Javier García Blas, Florin Isaila, J. Wozniak, J. Carretero, R. Ross. CCGrid 2016. DOI: 10.1109/CCGrid.2016.40
This paper explores novel techniques for improving the performance of many-task workflows based on the Swift scripting language. We propose novel programmer options for automated distributed data placement and task scheduling. These options trigger a data placement mechanism that distributes intermediate workflow data over the servers of Hercules, a distributed key-value store that can be used to cache file system data. We demonstrate that these mechanisms can improve the aggregated throughput of many-task workflows by up to 86x, reduce contention on the shared file system, exploit data locality, and trade off locality against load balance.
"Indexing Blocks to Reduce Space and Time Requirements for Searching Large Data Files" by Tzu-Hsien Wu, Hao Shyng, J. Chou, Bin Dong, Kesheng Wu. CCGrid 2016. DOI: 10.1109/CCGrid.2016.18
Scientific discoveries increasingly rely on the analysis of massive amounts of data generated by scientific experiments, observations, and simulations. The ability to directly access the most relevant data records, without sifting through all of them, becomes essential. While many indexing techniques have been developed to quickly locate selected data records, the time and space required to build and store these indexes are often too expensive to meet the demands of in situ or real-time data analysis. Existing indexing methods generally capture information about each individual data record; however, when reading a data record, the I/O system typically has to access a whole block or page of data. In this work, we postulate that indexing blocks instead of individual data records can significantly reduce index size and index build time without increasing the I/O time for accessing the selected records. Our experiments with multiple real datasets on a supercomputer show that a block index can reduce query time by a factor of 2 to 50 over existing methods, including SciDB and FastQuery. Moreover, the size of the block index is almost negligible compared to the data size, and index building can proceed at peak I/O speed.
"KOALA-F: A Resource Manager for Scheduling Frameworks in Clusters" by Aleksandra Kuzmanovska, R. H. Mak, D. Epema. CCGrid 2016. DOI: 10.1109/CCGrid.2016.60
Due to the diversity of the applications that run in clusters, many different application frameworks have been developed, such as MapReduce for data-intensive batch jobs and Spark for interactive data analytics. A framework is first deployed in a cluster and then executes a large set of jobs submitted over time. When multiple such frameworks with time-varying resource demands are present in a single cluster, static allocation of resources on a per-framework basis leads to low system utilization and resource fragmentation. In this paper, we present KOALA-F, a resource manager that dynamically provides resources to frameworks by employing a feedback loop to collect their possibly different performance metrics. Frameworks periodically -- not necessarily with the same frequency -- report the values of their performance metrics to KOALA-F, which then either rebalances their resources individually against the idle-resource pool or, when the latter is empty, rebalances resources among the frameworks. We demonstrate the effectiveness of KOALA-F with experiments on a real system.
"GoDB: From Batch Processing to Distributed Querying over Property Graphs" by N. Jamadagni, Yogesh L. Simmhan. CCGrid 2016. DOI: 10.1109/CCGrid.2016.105
Property graphs with rich attributes on vertices and edges are becoming common, and querying and mining such linked Big Data is important for knowledge discovery. Distributed graph platforms like Pregel focus on batch execution on commodity clusters, but exploratory analytics requires platforms that are both responsive and scalable. We propose GoDB (Graph-oriented Database), a distributed graph database that supports declarative queries over large property graphs. GoDB builds upon our GoFFish subgraph-centric batch processing platform, leveraging its scalability while using execution heuristics to offer responsiveness. The GoDB declarative query model supports vertex, edge, path, and reachability queries, which are translated into a distributed execution plan on GoFFish. We also propose a novel cost model for choosing the query plan that minimizes execution latency. We evaluate GoDB deployed on the Azure IaaS cloud, over real-world property graphs and a diverse workload of 500 queries. The results show that the cost model selects the optimal execution plan at least 80% of the time and helps GoDB weakly scale with the graph size. A comparative study with Titan, a leading open-source graph database, shows that GoDB completes all queries, each in ≤1.6 s, while Titan fails to complete up to 42% of some query workloads.
"Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics" by Bogdan Nicolae, Carlos H. A. Costa, Claudia Misale, K. Katrinis, Yoonho Park. CCGrid 2016. DOI: 10.1109/CCGrid.2016.85
Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance, and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark are gaining popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on the one hand, it is a key part of the computation with a major impact on overall performance and scalability, so its efficiency is paramount; on the other hand, it needs to operate with scarce memory in order to leave as much memory as possible available for data caching. In this context, scheduling data transfers so as to address both dimensions of the problem simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield suboptimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation while minimizing memory utilization, which we outline as a series of design principles.
"An Automated Tool Profiling Service for the Cloud" by Ryan Chard, K. Chard, Bryan K. F. Ng, K. Bubendorfer, Alex Rodriguez, R. Madduri, Ian T. Foster. CCGrid 2016. DOI: 10.1109/CCGrid.2016.57
Cloud providers offer a diverse set of instance types with varying resource capacities, designed to meet a broad range of user requirements. While this flexibility is a major benefit of the cloud computing model, it also creates challenges when selecting the most suitable instance type for a given application. Sub-optimal instance selection can result in poor performance and/or increased cost, with significant impact when applications are executed repeatedly. Yet selecting an optimal instance type is challenging: each instance type can be configured differently, application performance depends on input data and configuration, and instance types and applications are frequently updated. We present a service that automatically profiles application performance on different instance types to create rich application profiles that can be used for comparison, provisioning, and scheduling. The service can dynamically provision cloud instances, automatically deploy and contextualize applications, transfer input datasets, monitor execution performance, and create a composite profile with fine-grained resource usage information. Using real usage data from four production genomics gateways, we estimate that using profiles in autonomic provisioning systems can decrease execution time by up to 15.7% and cost by up to 86.6%.
"Software Provisioning Inside a Secure Environment as Docker Containers Using Stroll File-System" by A. Azab, D. Domanska. CCGrid 2016. DOI: 10.1109/CCGrid.2016.106
TSD (Tjenester for Sensitive Data) is an isolated infrastructure for storing and processing sensitive research data, e.g., human patient genomics data. Because TSD is isolated, software cannot be installed in the traditional fashion. Docker is a platform implementing lightweight virtualization technology that applies the build-once-run-anywhere approach to software packaging and sharing. This paper describes our experience at USIT (the University Centre of Information Technology) at the University of Oslo with Docker containers as a solution for installing and running, inside a secure isolated infrastructure, software packages that require downloading dependencies and binaries during installation. Using Docker containers made it possible to package software as Docker images and run it smoothly inside our secure system, TSD. The paper describes Docker as a technology along with its benefits and weaknesses in terms of security, demonstrates our experience with a use case installing and running the Galaxy bioinformatics portal as a Docker container inside TSD, and investigates the use of the Stroll file-system as a proxy between the Galaxy portal and the HPC cluster.