
Latest publications from the 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)

File System Scalability with Highly Decentralized Metadata on Independent Storage Devices
P. Lensing, Toni Cortes, J. Hughes, A. Brinkmann
This paper discusses using hard drives that integrate a key-value interface and network access in the actual drive hardware (Kinetic storage platform) to supply file system functionality in a large-scale environment. Taking advantage of higher-level functionality to handle metadata on the drives themselves, a serverless system architecture is proposed. Skipping path component traversal during the lookup operation is the key technique discussed in this paper to avoid performance degradation with highly decentralized metadata. Scalability implications are reviewed based on a FUSE file system implementation.
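The key technique is easiest to see in a small sketch: if each file's metadata is keyed by (a hash of) its full path, a lookup becomes a single key-value get regardless of path depth, rather than one request per directory component. The `KVMetadataStore` class below is a hypothetical in-memory stand-in for a key-value drive, not the Kinetic API.

```python
import hashlib

class KVMetadataStore:
    """Toy in-memory stand-in for a key-value drive (not the Kinetic API)."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

def metadata_key(path):
    # Key the file's metadata by a hash of its full path, so a lookup
    # needs one get() instead of walking every path component.
    return hashlib.sha256(path.encode()).hexdigest()

def lookup(store, path):
    # One key-value round trip, independent of path depth.
    return store.get(metadata_key(path))

if __name__ == "__main__":
    store = KVMetadataStore()
    store.put(metadata_key("/home/user/data.bin"), {"size": 4096, "mode": 0o644})
    print(lookup(store, "/home/user/data.bin"))
```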
Citations: 18
Seeking for the Optimal Energy Modelisation Accuracy to Allow Efficient Datacenter Optimizations
E. Outin, Jean-Emile Dartois, Olivier Barais, Jean-Louis Pazat
As cloud computing is used more and more widely, datacenters play a large role in overall energy consumption. We propose to tackle this problem by continuously and autonomously optimizing the energy efficiency of cloud datacenters. To this end, modeling the energy consumption of these infrastructures is crucial to drive the optimization process, anticipate the effects of aggressive optimization policies, and determine precisely the gains brought by the planned optimization. Yet, it is very complex to model the energy consumption of a physical device with accuracy, as it depends on several factors. Do we need a detailed and fine-grained energy model to perform good optimizations in the datacenter? Or is a simple and naive energy model good enough to propose viable energy-efficient optimizations? Through experiments, our results show that we do not get energy savings compared to classical bin-packing strategies, but there are some gains in using precise modeling: better utilization of the network and of the VM migration processes.
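The accuracy question can be made concrete with two hypothetical power models of the kind the paper contrasts: a naive linear CPU-only model versus a finer-grained one with network and disk terms. All coefficients below are illustrative assumptions, not measurements from the paper.

```python
def power_simple(cpu_util, p_idle=100.0, p_max=250.0):
    """Naive linear model: power in watts scales with CPU utilization only."""
    return p_idle + (p_max - p_idle) * cpu_util

def power_detailed(cpu_util, net_gbps, disk_iops,
                   p_idle=95.0, p_cpu=140.0, p_net_per_gbps=4.0, p_disk_per_kiops=2.5):
    """Finer-grained model adding network and disk terms (illustrative coefficients)."""
    return (p_idle
            + p_cpu * cpu_util
            + p_net_per_gbps * net_gbps
            + p_disk_per_kiops * (disk_iops / 1000.0))

# Compare the two models for one host at the same CPU load but with
# non-trivial network and disk activity (e.g. during a VM migration).
print(power_simple(0.6), power_detailed(0.6, net_gbps=1.2, disk_iops=800))
```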
Citations: 5
A Formal Approach for Service Composition in a Cloud Resources Sharing Context
Kais Klai, Hanen Ochi
Composition of Cloud services is necessary when a single component is unable to satisfy all the user's requirements. It is a complex task for Cloud managers that involves several operations such as discovery, compatibility checking, selection, and deployment. As in a non-Cloud environment, service composition raises the need for design-time approaches to check the correct interaction between the different components of a composite service. However, for Cloud-based service composition, new specific constraints, such as resource management, elasticity and multi-tenancy, have to be considered. In this work, we use Symbolic Observation Graphs (SOG) in order to abstract Cloud services and to check the correctness of their composition with respect to event- and state-based LTL formulae. The violation of such formulae can come either from the stakeholders' interaction or from the shared Cloud resources perspective. In the former case, the involved services are considered incompatible, while in the latter case the problem can be solved by deploying additional resources. The approach we propose in this paper then makes it possible to check whether the resource provider service is able, at run time, to satisfy the users' requests in terms of Cloud resources.
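To illustrate the flavour of such a design-time check, the sketch below composes two toy services modelled as labelled transition systems and searches their synchronous product for deadlocks outside final states. This is a drastically simplified stand-in for the SOG-based LTL verification described in the paper; the data structures and the `compose_and_check` function are invented for illustration only.

```python
from itertools import product

# Toy services: initial state, final states, and transitions {(state, action): next_state}.
# Actions in `shared` must be taken by both services at once (synchronization).

def compose_and_check(svc_a, svc_b, shared):
    init = (svc_a["init"], svc_b["init"])
    finals = set(product(svc_a["final"], svc_b["final"]))
    seen, stack = {init}, [init]
    while stack:
        sa, sb = stack.pop()
        moves = []
        for (s, act), nxt in svc_a["trans"].items():
            if s != sa:
                continue
            if act in shared:
                nb = svc_b["trans"].get((sb, act))
                if nb is not None:
                    moves.append((nxt, nb))      # joint move on a shared action
            else:
                moves.append((nxt, sb))          # independent move of service A
        for (s, act), nxt in svc_b["trans"].items():
            if s == sb and act not in shared:
                moves.append((sa, nxt))          # independent move of service B
        if not moves and (sa, sb) not in finals:
            return False                         # deadlock outside final states: incompatible
        for m in moves:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return True

client = {"init": 0, "final": {2}, "trans": {(0, "request"): 1, (1, "reply"): 2}}
server = {"init": 0, "final": {2}, "trans": {(0, "request"): 1, (1, "reply"): 2}}
print(compose_and_check(client, server, shared={"request", "reply"}))
```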
Citations: 4
Flexible Data-Aware Scheduling for Workflows over an In-memory Object Store
Francisco Rodrigo Duro, Francisco Javier García Blas, Florin Isaila, J. Wozniak, J. Carretero, R. Ross
This paper explores novel techniques for improving the performance of many-task workflows based on the Swift scripting language. We propose novel programmer options for automated distributed data placement and task scheduling. These options trigger a data placement mechanism used for distributing intermediate workflow data over the servers of Hercules, a distributed key-value store that can be used to cache file system data. We demonstrate that these new mechanisms can significantly improve the aggregated throughput of many-task workflows by up to 86x, reduce the contention on the shared file system, exploit data locality, and trade off locality and load balance.
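A minimal sketch of the locality/load trade-off behind such data-aware placement follows: tasks prefer the node holding their intermediate data, but a tunable weight lets load override locality on congested nodes. The function, field names, and weight are illustrative assumptions and do not correspond to the Swift or Hercules APIs.

```python
def schedule(tasks, nodes, data_location, load, locality_weight=0.7):
    """Greedy data-aware placement over an in-memory object store (illustrative)."""
    placement = {}
    for task, data_key in tasks.items():
        best_node, best_score = None, None
        for node in nodes:
            # Reward placing the task where its intermediate data already lives,
            # penalize nodes that are already heavily loaded.
            locality = 1.0 if data_location.get(data_key) == node else 0.0
            score = locality_weight * locality - (1 - locality_weight) * load[node]
            if best_score is None or score > best_score:
                best_node, best_score = node, score
        placement[task] = best_node
        load[best_node] += 1
    return placement

nodes = ["n0", "n1"]
load = {"n0": 0, "n1": 0}
data_location = {"blob-a": "n0", "blob-b": "n0"}
tasks = {"t1": "blob-a", "t2": "blob-b", "t3": "blob-a"}
print(schedule(tasks, nodes, data_location, load))
```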
Citations: 13
Indexing Blocks to Reduce Space and Time Requirements for Searching Large Data Files
Tzu-Hsien Wu, Hao Shyng, J. Chou, Bin Dong, Kesheng Wu
Scientific discoveries increasingly rely on the analysis of massive amounts of data generated from scientific experiments, observations, and simulations. The ability to directly access the most relevant data records, without sifting through all of them, becomes essential. While many indexing techniques have been developed to quickly locate the selected data records, the time and space required for building and storing these indexes are often too expensive to meet the demands of in situ or real-time data analysis. Existing indexing methods generally capture information about each individual data record; however, when reading a data record, the I/O system typically has to access a block or a page of data. In this work, we postulate that indexing blocks instead of individual data records could significantly reduce index size and index building time without increasing the I/O time for accessing the selected data records. Our experiments using multiple real datasets on a supercomputer show that the block index can reduce query time by a factor of 2 to 50 over other existing methods, including SciDB and FastQuery. Moreover, the size of the block index is almost negligible compared to the data size, and index construction can proceed at peak I/O speed.
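One common way to realize a block index is a zone-map style summary, one (min, max) pair per block; a range query then reads only blocks whose summary overlaps the query range. The sketch below assumes this variant purely for illustration; the paper's exact index structure may differ.

```python
import numpy as np

def build_block_index(values, block_size):
    # One (min, max) summary per block instead of one index entry per record.
    n_blocks = (len(values) + block_size - 1) // block_size
    index = []
    for b in range(n_blocks):
        block = values[b * block_size:(b + 1) * block_size]
        index.append((block.min(), block.max()))
    return index

def query_blocks(index, lo, hi):
    # Blocks that may contain records in [lo, hi]; only these need to be read.
    return [b for b, (bmin, bmax) in enumerate(index) if bmax >= lo and bmin <= hi]

data = np.random.default_rng(0).normal(size=1_000_000)
idx = build_block_index(data, block_size=4096)
print(len(idx), "block summaries,", len(query_blocks(idx, 3.0, 4.0)), "candidate blocks")
```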
Citations: 9
KOALA-F: A Resource Manager for Scheduling Frameworks in Clusters
Aleksandra Kuzmanovska, R. H. Mak, D. Epema
Due to the diversity of the applications that run in clusters, many different application frameworks have been developed, such as MapReduce for data-intensive batch jobs and Spark for interactive data analytics. A framework is first deployed in a cluster, and then starts executing a large set of jobs that are submitted over time. When multiple such frameworks with time-varying resource demands are present in a single cluster, static allocation of resources on a per-framework basis leads to low system utilization and resource fragmentation. In this paper, we present KOALA-F, a resource manager that dynamically provides resources to frameworks by employing a feedback loop to collect their possibly different performance metrics. Frameworks periodically, though not necessarily with the same frequency, report the values of their performance metrics to KOALA-F, which then either rebalances their resources individually against the idle-resource pool or, when the latter is empty, rebalances their resources among them. We demonstrate the effectiveness of KOALA-F with experiments in a real system.
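The feedback loop can be sketched roughly as follows: each framework reports a demand signal, under-provisioned frameworks first draw from the idle pool, and only when the pool is empty are resources shifted away from over-provisioned frameworks. The field names and the single-resource abstraction are illustrative simplifications, not the KOALA-F implementation.

```python
def rebalance(frameworks, idle_pool):
    """One iteration of a simplified feedback-loop rebalancer (illustrative)."""
    # Phase 1: satisfy deficits from the idle-resource pool.
    for fw in frameworks:
        deficit = fw["demand"] - fw["allocated"]
        if deficit > 0 and idle_pool > 0:
            grant = min(deficit, idle_pool)
            fw["allocated"] += grant
            idle_pool -= grant
    # Phase 2: pool exhausted, so move resources from frameworks with a surplus
    # to frameworks that are still short of their reported demand.
    donors = [fw for fw in frameworks if fw["allocated"] > fw["demand"]]
    needy = [fw for fw in frameworks if fw["allocated"] < fw["demand"]]
    for fw in needy:
        for donor in donors:
            surplus = donor["allocated"] - donor["demand"]
            needed = fw["demand"] - fw["allocated"]
            move = min(surplus, needed)
            donor["allocated"] -= move
            fw["allocated"] += move
    return frameworks, idle_pool

fws = [{"name": "spark", "allocated": 8, "demand": 12},
       {"name": "mapreduce", "allocated": 10, "demand": 4}]
print(rebalance(fws, idle_pool=2))
```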
Citations: 17
GoDB: From Batch Processing to Distributed Querying over Property Graphs
N. Jamadagni, Yogesh L. Simmhan
Property Graphs with rich attributes over vertices and edges are becoming common. Querying and mining such linked Big Data is important for knowledge discovery. Distributed graph platforms like Pregel focus on batch execution on commodity clusters. But exploratory analytics requires platforms that are both responsive and scalable. We propose Graph-oriented Database (GoDB), a distributed graph database that supports declarative queries over large property graphs. GoDB builds upon our GoFFish subgraph-centric batch processing platform, leveraging its scalability while using execution heuristics to offer responsiveness. The GoDB declarative query model supports vertex, edge, path and reachability queries, and these are translated to a distributed execution plan on GoFFish. We also propose a novel cost model to choose a query plan that minimizes the execution latency. We evaluate GoDB deployed on the Azure IaaS Cloud, over real-world property graphs and for a diverse workload of 500 queries. These experiments show that the cost model selects the optimal execution plan at least 80% of the time, and helps GoDB weakly scale with the graph size. A comparative study with Titan, a leading open-source graph database, shows that we complete all queries, each in ≤ 1.6 seconds, while Titan cannot complete up to 42% of some query workloads.
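A cost-model-driven planner of this general shape estimates the latency of each candidate plan from per-partition statistics and picks the cheapest one. The sketch below uses an invented cost formula and statistics purely for illustration; GoDB's actual model is defined in the paper.

```python
def plan_cost(plan, stats, network_ms_per_mb=0.08):
    """Estimate plan latency as the sum of per-step scan and data-movement costs."""
    cost = 0.0
    for step in plan:
        part_stats = stats[step["partition"]]
        cost += part_stats["scan_ms_per_vertex"] * step["vertices_touched"]
        cost += network_ms_per_mb * step["mb_shipped"]
    return cost

def choose_plan(candidate_plans, stats):
    # Pick the candidate execution plan with the lowest estimated latency.
    return min(candidate_plans, key=lambda p: plan_cost(p, stats))

stats = {"p0": {"scan_ms_per_vertex": 0.002}, "p1": {"scan_ms_per_vertex": 0.004}}
plans = [
    [{"partition": "p0", "vertices_touched": 50_000, "mb_shipped": 12}],
    [{"partition": "p1", "vertices_touched": 20_000, "mb_shipped": 40}],
]
best = choose_plan(plans, stats)
print(plan_cost(best, stats))
```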
Citations: 4
Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics
Bogdan Nicolae, Carlos H. A. Costa, Claudia Misale, K. Katrinis, Yoonho Park
Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on the one hand, it is a key part of the computation that has a major impact on the overall performance and scalability, so its efficiency is paramount; on the other hand, it needs to operate with scarce memory in order to leave as much memory as possible available for data caching. In this context, efficiently scheduling data transfers so as to address both dimensions of the problem simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield sub-optimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly underline as a series of design principles.
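One way to keep shuffling within a memory budget is an admission loop that starts a transfer only when its buffer fits in the remaining budget and frees the buffer on completion. The sketch below is purely illustrative of that idea; the strategy in the paper is adaptive and considerably more involved.

```python
from collections import deque

def schedule_transfers(transfers, memory_budget_mb):
    """Toy admission loop for shuffle transfers under a fixed memory budget."""
    pending = deque(sorted(transfers, key=lambda t: t["mb"], reverse=True))
    in_flight, completed, used = [], [], 0
    while pending or in_flight:
        # Admit pending transfers whose buffers still fit in the remaining budget.
        still_pending = deque()
        for t in pending:
            if used + t["mb"] <= memory_budget_mb:
                in_flight.append(t)
                used += t["mb"]
            else:
                still_pending.append(t)
        pending = still_pending
        if not in_flight:
            # A single transfer larger than the whole budget: let it run alone.
            t = pending.popleft()
            in_flight.append(t)
            used += t["mb"]
        # Model completion of the oldest in-flight transfer, freeing its buffer
        # so that memory stays available for data caching.
        done = in_flight.pop(0)
        used -= done["mb"]
        completed.append(done["id"])
    return completed

transfers = [{"id": i, "mb": mb} for i, mb in enumerate([64, 32, 128, 16, 96])]
print(schedule_transfers(transfers, memory_budget_mb=160))
```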
Citations: 12
An Automated Tool Profiling Service for the Cloud
Ryan Chard, K. Chard, Bryan K. F. Ng, K. Bubendorfer, Alex Rodriguez, R. Madduri, Ian T Foster
Cloud providers offer a diverse set of instance types with varying resource capacities, designed to meet the needs of a broad range of user requirements. While this flexibility is a major benefit of the cloud computing model, it also creates challenges when selecting the most suitable instance type for a given application. Sub-optimal instance selection can result in poor performance and/or increased cost, with significant impacts when applications are executed repeatedly. Yet selecting an optimal instance type is challenging, as each instance type can be configured differently, application performance is dependent on input data and configuration, and instance types and applications are frequently updated. We present a service that supports automatic profiling of application performance on different instance types to create rich application profiles that can be used for comparison, provisioning, and scheduling. This service can dynamically provision cloud instances, automatically deploy and contextualize applications, transfer input datasets, monitor execution performance, and create a composite profile with fine grained resource usage information. We use real usage data from four production genomics gateways and estimate the use of profiles in autonomic provisioning systems can decrease execution time by up to 15.7% and cost by up to 86.6%.
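Given per-instance-type profiles, selection can be as simple as picking the cheapest type whose profiled runtime meets a deadline. The runtimes and prices below are made-up illustrations, not measurements from the genomics gateways studied in the paper.

```python
def pick_instance(profiles, price_per_hour, max_runtime_s=None):
    """Choose the cheapest instance type whose profiled runtime meets an optional deadline."""
    candidates = []
    for itype, runtime_s in profiles.items():
        if max_runtime_s is not None and runtime_s > max_runtime_s:
            continue  # too slow for the requested deadline
        cost = price_per_hour[itype] * runtime_s / 3600.0
        candidates.append((cost, runtime_s, itype))
    if not candidates:
        raise ValueError("no instance type satisfies the runtime constraint")
    cost, runtime_s, itype = min(candidates)
    return itype, cost

# Hypothetical profiled runtimes (seconds) and hourly prices for one application.
profiles = {"m4.large": 5400, "c4.xlarge": 2100, "c4.4xlarge": 900}
prices = {"m4.large": 0.10, "c4.xlarge": 0.20, "c4.4xlarge": 0.80}
print(pick_instance(profiles, prices, max_runtime_s=3600))
```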
Citations: 14
Software Provisioning Inside a Secure Environment as Docker Containers Using Stroll File-System
A. Azab, D. Domanska
TSD (Tjenester for Sensitive Data) is an isolated infrastructure for storing and processing sensitive research data, e.g. human patient genomics data. Due to the isolation of the TSD, it is not possible to install software in the traditional fashion. Docker is a container platform implementing lightweight virtualization technology for applying the build-once-run-anywhere approach to software packaging and sharing. This paper describes our experience at USIT (the University Centre of Information Technology) at the University of Oslo with Docker containers as a solution for installing and running software packages that require downloading of dependencies and binaries during installation, inside a secure isolated infrastructure. Using Docker containers made it possible to package software as Docker images and run it smoothly inside our secure system, TSD. The paper describes Docker as a technology, its benefits and weaknesses in terms of security, demonstrates our experience with a use case of installing and running the Galaxy bioinformatics portal as a Docker container inside the TSD, and investigates the use of the Stroll file-system as a proxy between the Galaxy portal and the HPC cluster.
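One common provisioning pattern for an isolated environment, sketched here around the standard Docker CLI, is to build and export the image on an internet-connected host, move the archive through the approved transfer channel, then load and run it inside the secure system. The image tag, paths, and helper functions are hypothetical; in the paper the transfer itself is mediated by the Stroll file-system.

```python
import subprocess

IMAGE = "galaxy-portal:latest"           # hypothetical image tag
ARCHIVE = "/transfer/galaxy-portal.tar"  # hypothetical path on the approved transfer area

def export_image_outside():
    # On the internet-connected build host: bake all dependencies into the image,
    # then export it as a self-contained archive.
    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)
    subprocess.run(["docker", "save", "-o", ARCHIVE, IMAGE], check=True)

def import_and_run_inside():
    # Inside the isolated environment: load the archive and run it without any
    # network downloads at install time.
    subprocess.run(["docker", "load", "-i", ARCHIVE], check=True)
    subprocess.run(["docker", "run", "--rm", "-d", "-p", "8080:8080", IMAGE], check=True)
```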
Citations: 4