While volunteer computing (VC) systems rank among the most powerful computing platforms, they still face the problem of guaranteeing computational correctness, due to the inherent unreliability of volunteer participants. The spot-checking technique, which checks each participant by allocating spotter jobs, is a promising approach to validating computation results. Current spot-checking and the sabotage-tolerance methods built on it rest on the implicit assumption that participants never detect the allocation of spotter jobs; however, generating undetectable spotter jobs is still an open problem. Hence, in real VC environments, where this implicit assumption does not always hold, spot-checking-based sabotage-tolerance methods (such as the well-known credibility-based voting) can hardly guarantee computational correctness. In this paper, we generalize the spot-checking technique by introducing the idea of imperfect checking. Using our new technique, it becomes possible to estimate the correct credibility of participant nodes even if they may detect spotter jobs. Moreover, building on imperfect checking, we propose a new credibility-based voting method that does not need to allocate spotter jobs. Simulation results show that the proposed method reduces computation time compared with the original credibility-based voting, while guaranteeing the same level of computational correctness.
{"title":"Generalized Spot-Checking for Sabotage-Tolerance in Volunteer Computing Systems","authors":"Kanno Watanabe, Masaru Fukushi","doi":"10.1109/CCGRID.2010.97","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.97","url":null,"abstract":"While volunteer computing (VC) systems reach the most powerful computing platforms, they still have the problem of guaranteeing computational correctness, due to the inherent unreliability of volunteer participants. Spot-checking technique, which checks each participant by allocating spotter jobs, is a promising approach to the validation of computation results. The current spot-checking technique and associated sabotage-tolerance methods are based on the implicit assumption that participants never detect the allocation of spotter jobs, however generating such spotter jobs is still an open problem. Hence, in the real VC environment where the implicit assumption does not always hold, spot-checking-based sabotage-tolerance methods (such as well-known credibility-based voting) become almost impossible to guarantee the computational correctness. In this paper, we generalize the spot-checking technique by introducing the idea of imperfect checking. Using our new technique, it becomes possible to estimate the correct credibility for participant nodes even if they may detect spotter jobs. Moreover, by the idea of imperfect checking, we propose a new credibility-based voting which does not need to allocate spotter jobs. 
Simulation results show that the proposed method reduces the computation time compared to the original credibility-based voting, while guaranteeing the same level of computational correctness.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"414 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132413779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
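The effect of detectable spotter jobs on credibility can be pictured with a small Bayesian sketch. Everything here is an illustrative assumption, not the paper's actual model: the prior saboteur fraction `f`, sabotage rate `s`, and detection probability `d` are hypothetical parameters, and a saboteur that detects a spotter job is assumed to answer it honestly.

```python
def credibility(k, f=0.1, s=0.2, d=0.5):
    """Posterior probability that a node is honest after passing k spot-checks.

    k : number of spot-checks the node has passed
    f : prior fraction of saboteurs in the population (assumed)
    s : per-job sabotage rate of a saboteur (assumed)
    d : probability a saboteur detects, and therefore passes, a spotter job
    """
    # Chance a saboteur survives one spot-check: it either detects the
    # spotter (d) or fails to detect it but happens not to sabotage (1-d)(1-s).
    p_pass = d + (1.0 - d) * (1.0 - s)
    honest = 1.0 - f              # honest nodes always pass spot-checks
    saboteur = f * p_pass ** k
    return honest / (honest + saboteur)
```

With d = 1 (a saboteur who always recognizes spotter jobs), `p_pass` is 1 and the estimate never rises above the prior no matter how many checks a node passes, which is exactly the failure mode that motivates generalizing spot-checking.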
K. Shams, M. Powell, T. Crockett, J. Norris, Ryan A. Rossi, T. Söderström
Cloud computing has delivered unprecedented compute capacity to NASA missions at affordable rates. Missions like the Mars Exploration Rovers (MER) and Mars Science Lab (MSL) are enjoying the elasticity that enables them to leverage hundreds, if not thousands, of machines for short durations without making any hardware procurements. In this paper, we describe Polyphony, a resilient, scalable, and modular framework that efficiently leverages a large set of computing resources to perform parallel computations. Polyphony can employ resources on the cloud, excess capacity on local machines, and spare resources at a supercomputing center, and it enables these resources to work in concert to accomplish a common goal. Polyphony is resilient to node failures, even if they occur in the middle of a transaction. We conclude with an evaluation of a production-ready application built on top of Polyphony to perform image-processing operations on images from around the solar system, including Mars, Saturn, and Titan.
{"title":"Polyphony: A Workflow Orchestration Framework for Cloud Computing","authors":"K. Shams, M. Powell, T. Crockett, J. Norris, Ryan A. Rossi, T. Söderström","doi":"10.1109/CCGRID.2010.117","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.117","url":null,"abstract":"Cloud Computing has delivered unprecedented compute capacity to NASA missions at affordable rates. Missions like the Mars Exploration Rovers (MER) and Mars Science Lab (MSL) are enjoying the elasticity that enables them to leverage hundreds, if not thousands, or machines for short durations without making any hardware procurements. In this paper, we describe Polyphony, a resilient, scalable, and modular framework that efficiently leverages a large set of computing resources to perform parallel computations. Polyphony can employ resources on the cloud, excess capacity on local machines, as well as spare resources on the supercomputing center, and it enables these resources to work in concert to accomplish a common goal. Polyphony is resilient to node failures, even if they occur in the middle of a transaction. We will conclude with an evaluation of a production-ready application built on top of Polyphony to perform image-processing operations of images from around the solar system, including Mars, Saturn, and Titan.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132460209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Good scheduling is important for ensuring effective use of Grid resources while maximising parallel performance. In this paper, we show how a basic "Random-Stealing" load balancing algorithm for computational Grids can be improved by using information about the task granularity of parallel programs. We propose several strategies (SSL, SLL and LLL) for using granularity information to improve load balancing, presenting results both from simulations and from a real implementation (the Grid-GUM Runtime System for Parallel Haskell). We assume a common model of task creation that subsumes both master/worker and data-parallel programming paradigms under a task-stealing work distribution strategy. Overall, we achieve runtime improvements of up to 19.4% for irregular problems in the real implementation, and up to 40% in the simulations (typical improvements of more than 15% for irregular programs, and 5-10% for regular ones). Our results show that, for computationally-uniform Grids, advanced load balancing methods that exploit granularity information generally have the greatest impact on reducing the runtimes of irregular parallel programs. Moreover, the more irregular the program is, the greater the improvement that can be achieved.
{"title":"Granularity-Aware Work-Stealing for Computationally-Uniform Grids","authors":"Vladimir Janjic, K. Hammond","doi":"10.1109/CCGRID.2010.49","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.49","url":null,"abstract":"Good scheduling is important for ensuring effective use of Grid resources, while maximising parallel performance. In this paper, we show how a basic ``Random-Stealing'' load balancing algorithm for computational Grids can be improved by using information about the task granularity of parallel programs. We propose several strategies (SSL, SLL and LLL) for using granularity information to improve load balancing, presenting results both from simulations and from a real implementation (the Grid-GUM Runtime System for Parallel Haskell). We assume a common model of task creation which subsumes both master/worker and data-parallel programming paradigms under a task-stealing work distribution strategy. Overall, we achieve improvement in runtime of up to 19.4% for irregular problems in the real implementation, and up to 40% for the simulations (typical improvements of more that 15% for irregular programs, and from 5-10% for regular ones). Our results show that, for computationally-uniform Grids, advanced load balancing methods that exploit granularity information generally have the greatest impact on reducing the runtimes of irregular parallel programs. 
Moreover, the more irregular the program is, the better the improvements that can be achieved.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128955157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
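To make the granularity idea concrete, here is a minimal sketch of a task pool in which remote steal requests are answered with the largest available task (whose run time best amortises the transfer cost) while the local worker consumes small tasks first. This illustrates the general principle only; it is not the Grid-GUM runtime or any of the SSL/SLL/LLL strategies, and the class and method names are invented.

```python
import bisect

class GranularityAwarePool:
    """Task pool ordered by estimated granularity: steals take the largest
    task, local execution takes the smallest (illustrative sketch)."""

    def __init__(self):
        self._tasks = []   # kept sorted by (granularity, insertion order)
        self._seq = 0      # tie-breaker so task payloads are never compared

    def add(self, task, granularity):
        self._seq += 1
        bisect.insort(self._tasks, (granularity, self._seq, task))

    def pop_local(self):
        # Smallest task: cheap to run here, poor candidate for export.
        return self._tasks.pop(0)[2] if self._tasks else None

    def answer_steal(self):
        # Largest task: its run time best hides the communication latency.
        return self._tasks.pop()[2] if self._tasks else None
```

A thief that receives `None` would simply pick another random victim, as in plain random stealing.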
To address the high-performance I/O needs of HPC and enterprise applications, modern interconnection fabrics, such as InfiniBand and, more recently, 10GigE, rely on network adapters with RDMA capabilities. In virtualized environments, these adapters are configured in a manner that bypasses the hypervisor and allows virtual machines (VMs) direct device access, so that they deliver near-native low-latency/high-bandwidth I/O. One challenge with the bypass approach is that the hypervisor loses control over VM-device interactions, including the ability to monitor such interactions and to ensure fair resource usage by VMs. Fairness violations, however, permit low-priority VMs to affect the I/O allocations of higher-priority VMs, and, more generally, lack of supervision can lead to inefficiencies in the usage of platform resources. This paper describes the FaReS system-level mechanisms for monitoring VMs' usage of bypass I/O devices. Monitoring information acquired with FaReS is then used to adjust VMM-level scheduling in order to improve resource utilization and/or ensure fairness across the sets of VMs sharing platform resources. FaReS employs a memory-introspection-based tool for asynchronously monitoring VMM-bypass devices, using InfiniBand HCAs as a concrete example. We evaluate FaReS and demonstrate its very low overhead.
{"title":"FaReS: Fair Resource Scheduling for VMM-Bypass InfiniBand Devices","authors":"A. Ranadive, Ada Gavrilovska, K. Schwan","doi":"10.1109/CCGRID.2010.11","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.11","url":null,"abstract":"In order to address the high performance I/O needs of HPC and enterprise applications, modern interconnection fabrics, such as InfiniBand and more recently, 10GigE, rely on network adapters with RDMA capabilities. In virtualized environments, these types of adapters are configured in a manner that bypasses the hypervisor and allows virtual machines (VMs) direct device access, so that they deliver near-native low-latency/high-bandwidth I/O. One challenge with the bypass approach is that it causes the hypervisor to lose control over VM-device interactions, including the ability to monitor such interactions and to ensure fair resource usage by VMs. Fairness violations, however, permit low-priority VMs to affect the I/O allocations of other higher priority VMs and more generally, lack of supervision can lead to inefficiencies in the usage of platform resources. This paper describes the FaReS system-level mechanisms for monitoring VMs' usage of bypass I/O devices. Monitoring information acquired with FaReS is then used to adjust VMM-level scheduling in order to improve resource utilization and/or ensure fairness properties across the sets of VMs sharing platform resources. FaReS employs a memory introspection-based tool for asynchronously monitoring VMM-bypass devices, using InfiniBand HCAs as a concrete example. 
FaReS and its very low overhead (","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122935769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Resource management is one of the focus areas of Grid computing, and job modeling is a very important part of it. Proper job modeling can help allocate jobs to their most suitable resource providers in a Grid. This paper presents a feedback-guided automatic job modeling technique that describes the process required to identify the most suitable resource provider for a particular job.
{"title":"Feedback-Guided Analysis for Resource Requirements in Large Distributed System","authors":"M. Sarkar, Sarbani Roy, N. Mukherjee","doi":"10.1109/CCGRID.2010.90","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.90","url":null,"abstract":"Resource management is one of the focus areas of Grid which identifies Job Modeling to be a very important part of it. A proper Job Modeling can be helpful in allocating jobs to their most suitable resource providers in Grid. This paper presents a feedback-guided Automatic Job Modeling technique that describes the process required to identify the most suitable resource provider for a particular job.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132943325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We are using GPUs to run a new weather model being developed at NOAA's Earth System Research Laboratory (ESRL). The parallelization approach is to run the entire model on the GPU and rely on the CPU only for model initialization, I/O, and inter-processor communications. We have written a compiler to convert Fortran into CUDA and used it to parallelize the dynamics portion of the model. Dynamics, the most computationally intensive part of the model, currently runs 34 times faster on a single GPU than on the CPU. We also describe our approach and progress to date in running NIM on multiple GPUs.
{"title":"Running the NIM Next-Generation Weather Model on GPUs","authors":"M. Govett, J. Middlecoff, T. Henderson","doi":"10.1109/CCGRID.2010.106","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.106","url":null,"abstract":"We are using GPUs to run a new weather model being developed at NOAA’s Earth System Research Laboratory (ESRL). The parallelization approach is to run the entire model on the GPU and only rely on the CPU for model initialization, I/O, and inter-processor communications. We have written a compiler to convert Fortran into CUDA, and used it to parallelize the dynamics portion of the model. Dynamics, the most computationally intensive part of the model, is currently running 34 times faster on a single GPU than the CPU. We also describe our approach and progress to date in running NIM on multiple GPUs.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130278035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud computing focuses on delivering reliable, fault-tolerant and scalable infrastructure for hosting Internet-based application services. Our work presents the implementation of an efficient Quality of Service (QoS) based meta-scheduler and a backfill-strategy-based lightweight Virtual Machine scheduler for dispatching jobs. The user-centric meta-scheduler deals with selecting the proper resources to execute high-level jobs. The system-centric Virtual Machine (VM) scheduler optimally dispatches jobs to processors for better resource utilization. We also present our proposals on scheduling heuristics that can be incorporated at the data-center level for selecting an ideal host for VM creation. The implementation can be further extended at the host level, using an inter-VM scheduler for adaptive load balancing in a cloud environment.
{"title":"Design and Implementation of an Efficient Two-Level Scheduler for Cloud Computing Environment","authors":"R. Jeyarani, R. Ram, N. Nagaveni","doi":"10.1109/CCGRID.2010.94","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.94","url":null,"abstract":"Cloud computing focuses on delivery of reliable, fault-tolerant and scalable infrastructure for hosting Internet based application services. Our work presents the implementation of an efficient Quality of Service (QoS) based meta-scheduler and Backfill strategy based light weight Virtual Machine Scheduler for dispatching jobs. The user centric meta-scheduler deals with selection of proper resources to execute high level jobs. The system centric Virtual Machine (VM) scheduler optimally dispatches the jobs to processors for better resource utilization. We also present our proposals on scheduling heuristics that can be incorporated at data center level for selecting ideal host for VM creation. The implementation can be further extended at the host level, using Inter VM scheduler for adaptive load balancing in cloud environment.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134338898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
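A common backfill strategy that such a VM scheduler could build on is EASY backfill: later jobs may jump the queue only if they do not delay the earliest possible start of the queue head. The single-pass sketch below is a generic textbook version over a fixed pool of cores, with invented data shapes; the paper does not publish its scheduler code.

```python
def easy_backfill(free, running, queue, now=0.0):
    """One pass of EASY backfill (illustrative sketch, mutates its arguments).

    free    : currently idle cores
    running : list of (end_time, cores) for jobs in execution
    queue   : FIFO list of (cores, est_runtime) waiting jobs
    Returns the list of (cores, est_runtime) jobs started this pass.
    """
    started = []
    # 1. Start queued jobs in FIFO order while they fit.
    while queue and queue[0][0] <= free:
        cores, run = queue.pop(0)
        free -= cores
        running.append((now + run, cores))
        started.append((cores, run))
    if not queue:
        return started
    # 2. The head job does not fit: find its earliest start ("shadow time")
    #    by releasing cores of running jobs in end-time order.
    head_cores = queue[0][0]
    avail, shadow = free, now
    for end, cores in sorted(running):
        avail += cores
        shadow = end
        if avail >= head_cores:
            break
    extra = avail - head_cores   # cores the head job will leave spare
    # 3. Backfill later jobs that fit now and either finish before the
    #    shadow time or only use cores the head job will not need.
    j = 1
    while j < len(queue):
        cores, run = queue[j]
        if cores <= free and (now + run <= shadow or cores <= extra):
            queue.pop(j)
            free -= cores
            if cores <= extra:
                extra -= cores
            running.append((now + run, cores))
            started.append((cores, run))
        else:
            j += 1
    return started
```

For example, on a machine with 4 idle cores and one 4-core job ending at t = 5, a queued 8-core head job cannot start, but a 2-core, 4-unit job behind it finishes before t = 5 and is backfilled.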
Soila Kavulya, Jiaqi Tan, R. Gandhi, P. Narasimhan
MapReduce is a programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both the service providers in the cloud and their users: the service provider can use this knowledge to make better scheduling decisions, while users can learn what aspects of their jobs impact performance. This paper analyzes 10 months of MapReduce logs from the M45 supercomputing cluster, which Yahoo! made freely available to select universities for academic research. We characterize resource utilization patterns, job patterns, and sources of failures. We use an instance-based learning technique that exploits temporal locality to predict job completion times from historical data and identify potential performance problems in our dataset.
{"title":"An Analysis of Traces from a Production MapReduce Cluster","authors":"Soila Kavulya, Jiaqi Tan, R. Gandhi, P. Narasimhan","doi":"10.1109/CCGRID.2010.112","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.112","url":null,"abstract":"MapReduce is a programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both the service providers in the cloud and users: the service provider can use this knowledge to make better scheduling decisions, while the user can learn what aspects of their jobs impact performance. This paper analyzes 10-months of MapReduce logs from the M45 supercomputing cluster which Yahoo! made freely available to select universities for academic research. We characterize resource utilization patterns, job patterns, and sources of failures. We use an instance-based learning technique that exploits temporal locality to predict job completion times from historical data and identify potential performance problems in our dataset.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"224 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132393341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
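Instance-based prediction with temporal locality can be sketched as a distance-weighted nearest-neighbour estimate in which recently submitted jobs receive exponentially larger weights. The feature vectors, decay rate, and weighting scheme below are illustrative assumptions, not the paper's actual predictor.

```python
import math

def predict_completion(history, job, k=3, decay=0.01):
    """k-nearest-neighbour estimate of a job's completion time
    (illustrative sketch; parameter names are assumptions).

    history : list of (features, submit_time, completion_time)
    job     : (features, submit_time) for the job to predict
    """
    feats, t = job
    scored = []
    for h_feats, h_t, h_done in history:
        dist = math.dist(feats, h_feats)        # similarity in feature space
        age = t - h_t
        # Temporal locality: similar jobs submitted recently matter most.
        weight = math.exp(-decay * age) / (1.0 + dist)
        scored.append((weight, h_done))
    scored.sort(reverse=True)                   # highest-weight neighbours first
    top = scored[:k]
    wsum = sum(w for w, _ in top)
    return sum(w * d for w, d in top) / wsum    # weighted average of neighbours
```

With k = 1 this reduces to "predict the completion time of the most similar recent job", a useful baseline before tuning `k` and `decay` on the trace.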
The technologies and ideas that underlie e-Science in providing seamless access to distributed resources are compelling and have been applied in many research domains. The clinical domain is one area in particular that, in principle, has much to gain from e-Science approaches. Until now, however, the practical realization, support and adoption of e-Science solutions in a clinical setting have been fraught with many hurdles. Not least is trust in the technologies and their use in the field, as opposed to demonstrator projects using non-real clinical data to prove the merit of e-Science ideas and solutions. The National e-Science Centre (NeSC, www.nesc.ac.uk) at the University of Glasgow has run a large number of clinical projects that have moved from proof-of-concept demonstrators to real systems used by real clinical researchers in real clinical trials and studies. In this paper we focus on the software systems developed to support two major international post-genomic clinical research projects in the area of rare diseases: the European Union 7th Framework project EuroDSD (www.eurodsd.eu) and the European Science Foundation project ENSAT (www.ensat.org). We outline the software platforms that have been rolled out and identify how the e-Science vision of secure access to clinical resources has been realized and subsequently used.
{"title":"Development and Support of Platforms for Research into Rare Diseases","authors":"R. Sinnott, Jipu Jiang, A. Stell, J. Watt","doi":"10.1109/CCGRID.2010.127","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.127","url":null,"abstract":"The technologies and ideas that underlie e-Science in providing seamless access to distributed resources is a compelling one and has been applied in many research domains. The clinical domain is one area in particular that, in principle has much to be gained from e-Science approaches. Until now however it has largely been the case that the practical realization, support and adoption of e-Science solutions in a clinical setting have been fraught by many hurdles. Not least is trust of technologies and their use in the field as opposed to demonstrator projects with non-real clinical data to prove the merit of e-Science ideas and solutions. The National e-Science Centre (NeSC– www.nesc.ac.uk) at the University of Glasgow have had a large number of clinical projects that have moved from the proof of concept demonstrators through to real systems used by real clinical researchers in real clinical trials and studies. In this paper we focus on the software systems that have been developed to support two major international post-genomic clinical research projects in the area of rare diseases: the European Union 7th Framework (EuroDSD – www.eurodsd.eu) project and the European Science Foundation (ENSAT – www.ensat.org) project. 
We outline the software platforms that have been rolled out and identify how the e-Science vision of secure access to clinical resources has been realized and subsequently used.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114264833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large high-dimensional datasets are of growing importance in many fields, and it is important to be able to visualize them, whether for understanding the results of data mining approaches or simply for browsing them in a way that distances between points in the visualization (2D or 3D) space track those in the original high-dimensional space. Dimension reduction is a well-understood approach, but it can be very time- and memory-intensive for large problems. Here we report on parallel algorithms for Scaling by MAjorizing a COmplicated Function (SMACOF), which solves the Multidimensional Scaling (MDS) problem, and for the Generative Topographic Mapping (GTM). The former is particularly time-consuming, with complexity that grows as the square of the dataset size, but has the advantage that it does not require explicit vectors for the dataset points, only measurements of inter-point dissimilarities. We compare SMACOF and GTM on a subset of the NIH PubChem database, which has binary vectors of length 166 bits. We find good parallel performance for both GTM and SMACOF, and a strong correlation between the dimension-reduced PubChem data produced by these two methods.
{"title":"High Performance Dimension Reduction and Visualization for Large High-Dimensional Data Analysis","authors":"J. Choi, S. Bae, Xiaohong Qiu, G. Fox","doi":"10.1109/CCGRID.2010.104","DOIUrl":"https://doi.org/10.1109/CCGRID.2010.104","url":null,"abstract":"Large high dimension datasets are of growing importance in many fields and it is important to be able to visualize them for understanding the results of data mining approaches or just for browsing them in a way that distance between points in visualization (2D or 3D) space tracks that in original high dimensional space. Dimension reduction is a well understood approach but can be very time and memory intensive for large problems. Here we report on parallel algorithms for Scaling by MAjorizing a Complicated Function (SMACOF) to solve Multidimensional Scaling problem and Generative Topographic Mapping (GTM). The former is particularly time consuming with complexity that grows as square of data set size but has advantage that it does not require explicit vectors for dataset points but just measurement of inter-point dissimilarities. We compare SMACOF and GTM on a subset of the NIH PubChem database which has binary vectors of length 166 bits. We find good parallel performance for both GTM and SMACOF and strong correlation between the dimension-reduced PubChem data from these two methods.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123163059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
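For reference, the sequential SMACOF iteration (the Guttman transform with unit weights) that such parallel implementations decompose can be written compactly. This pure-Python sketch is for small inputs only and omits the convergence test and the parallel decomposition; it is not the paper's implementation.

```python
import math, random

def smacof(delta, dim=2, iters=100, seed=0):
    """Plain SMACOF for metric MDS with unit weights (compact reference sketch).

    delta : n x n symmetric matrix of target dissimilarities, zero diagonal
    Returns an n x dim list of embedded coordinates.
    """
    n = len(delta)
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    for _ in range(iters):
        # Guttman transform: X <- (1/n) * B(X) X, valid for unit weights.
        B = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j:
                    d = math.dist(X[i], X[j])
                    B[i][j] = -delta[i][j] / d if d > 1e-12 else 0.0
            B[i][i] = -sum(B[i][j] for j in range(n) if j != i)
        X = [[sum(B[i][k] * X[k][c] for k in range(n)) / n
              for c in range(dim)]
             for i in range(n)]
    return X

def stress(delta, X):
    """Raw stress: squared error between target and embedded distances."""
    n = len(delta)
    return sum((delta[i][j] - math.dist(X[i], X[j])) ** 2
               for i in range(n) for j in range(i + 1, n))
```

The O(n^2) cost of building B(X) at every iteration is the quadratic growth the abstract refers to, and it is what makes the matrix-times-coordinates step the natural target for parallelization.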