Proceedings of XSEDE16: Diversity, Big Data, and Science at Scale: July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th: 2016: Miami, Fla.) - Latest Publications
"Applying Lessons from e-Discovery to Process Big Data using HPC," by Sukrit Sondhi and R. Arora. DOI: 10.1145/2616498.2616525 (pp. 8:1-8:2).

The term "Big Data" refers to datasets so large that they are difficult to use and manage with conventional software tools. Legal Electronic Discovery (e-Discovery) is a business domain that consumes Big Data on a massive scale: electronic records such as e-mail, documents, databases, and social media postings are processed to discover evidence that may be pertinent to legal/compliance needs, litigation, or other investigations. Numerous vendors provide organizations with services such as data collection, digital forensics, and electronic discovery. Meanwhile, high-end instrumentation and modern information technologies are creating data at an ever-increasing rate, and the challenges of managing large datasets span their capture, storage, search, sharing, analytics, and visualization. Big Data also offers unprecedented opportunities in other fields, ranging from astronomy and biology to marketing and e-commerce. This paper presents lessons learnt from the legal e-Discovery domain that can be adapted to process Big Data effectively on HPC resources, benefiting the disciplines of science, engineering, and business that are grappling with a deluge of Big Data challenges and opportunities.
"ECSS Experience: Particle Tracing Reinvented," by C. Rosales and R. McLay. DOI: 10.1145/2616498.2616527 (pp. 13:1-13:2).

This work describes an implementation of distributed particle tracking that provides a 10,000x speedup over traditional schemes. While none of the techniques used to achieve this result are completely new, they have been combined to great effect in this project. The implementation includes parallel I/O using HDF5, a flexible load-balancing scheme, and dynamic buffering to achieve excellent performance at scale. The use of HDF5 decouples the size of the simulation generating the data from the particle tracing, providing a more flexible and efficient workflow. The load-balancing scheme ensures that heterogeneous particle distributions do not waste computational resources, by keeping all MPI tasks occupied at any given time. Dynamic buffering minimizes exchanges across MPI tasks, a critical element in the performance improvements achieved.
"Calculation of Sensitivity Coefficients for Individual Airport Emissions in the Continental U.S. using CMAQ-DDM/PM," by S. Boone and S. Arunachalam. DOI: 10.1145/2616498.2616504 (pp. 10:1-10:8).

Fine particulate matter (PM2.5) is a federally regulated air pollutant with well-known impacts on human health. The FAA's Destination 2025 program seeks to decrease aviation-related health impacts across the U.S. by 50% by the year 2018. Atmospheric models, such as the Community Multiscale Air Quality model (CMAQ), are used to estimate the atmospheric concentration of pollutants such as PM2.5. Sensitivity analysis of these models has long been limited to finite-difference and regression-based methods, both of which require many computationally intensive model simulations to link changes in output with perturbations in input. Further, they are unable to offer detailed or ad hoc analysis for changes within a domain, such as changes in emissions on an airport-by-airport basis. To calculate the sensitivity of PM2.5 concentrations to emissions from individual airports, we utilize the Decoupled Direct Method in three dimensions (DDM-3D), an advanced sensitivity analysis tool recently implemented in CMAQ. DDM-3D allows calculation of sensitivity coefficients within a single simulation, eliminating the need for multiple model runs. However, while the output provides results for a variety of input perturbations in a single simulation, the processing time for each run is dramatically increased compared to simulations conducted without the DDM-3D module. Use of the XSEDE Stampede computing cluster allows us to calculate sensitivity coefficients for a large number of input parameters, enabling a much wider variety of ad hoc aviation policy scenarios to be generated and evaluated than would be possible using other sensitivity analysis methods or smaller-scale computing systems. We present a design of experiments to compute individual sensitivity coefficients for 139 major airports in the US, due to six different precursor emissions that form PM2.5 in the atmosphere. Simulations based on this design are currently in progress, with full results to be published at a later date.
"Launcher: A Shell-based Framework for Rapid Development of Parallel Parametric Studies," by Lucas A. Wilson and John M. Fonner. DOI: 10.1145/2616498.2616534 (pp. 40:1-40:8).

Petascale computing systems have enabled tremendous advances for traditional simulation and modeling algorithms that are built around parallel execution. Unfortunately, scientific domains using data-oriented or high-throughput paradigms have difficulty taking full advantage of these resources without custom software development. This paper describes our solution for rapidly developing parallel parametric studies using sequential or threaded tasks: the launcher. We detail how to get ensembles executing quickly through the common job schedulers SGE and SLURM, and the various user-customizable options that the launcher provides. We illustrate the efficiency of our tool by presenting execution results at large scale (over 65,000 cores) for varying workloads, including a virtual screening workload with indeterminate runtimes using the drug-docking software AutoDock Vina.
"Descriptive Data Analysis of File Transfer Data," by S. Srinivasan, Victor Hazlewood, and G. D. Peterson. DOI: 10.1145/2616498.2616550 (pp. 37:1-37:8).

Millions of files and multiple terabytes of data are transferred to and from the University of Tennessee's National Institute for Computational Sciences (NICS) each month. New capabilities available with GridFTP version 5.2.2 include additional transfer log information previously unavailable in the versions deployed within XSEDE. The transfer log data now available includes identification of source and destination endpoints, which unlocks a wealth of information that can be used to detail GridFTP activity across the Internet. This information can be used for a wide variety of reports of interest to individual XSEDE Service Providers and to XSEDE Operations. In this paper, we discuss the new transfer log capabilities in GridFTP 5.2.2, our initial attempt to organize, analyze, and report on this file transfer data for NICS, and its applicability to XSEDE Service Providers. Analysis of this new information can provide insight into effective and efficient utilization of GridFTP resources, including identification of potential areas for file transfer improvement (e.g., network and server tuning) and potential predictive analysis to improve efficiency.
"PGDB: A Debugger for MPI Applications," by Nikoli Dryden. DOI: 10.1145/2616498.2616535 (pp. 44:1-44:7).

As MPI applications scale to larger machines, errors that had been hidden from testing at smaller scales begin to manifest themselves. It is therefore necessary to extend debuggers to work at these scales, in order for efficient development of correct applications to proceed. PGDB is the Parallel GDB, an open-source debugger for MPI applications that provides such a capability. It is designed from the ground up to be a robust debugging environment at scale, while presenting an interface similar to that of the typical command-line GDB debugger. Its usage on representative debugging problems is demonstrated and its scalability on the Stampede supercomputer is evaluated.
"DNA Subway: Making Genome Analysis Egalitarian," by Uwe Hilgert, S. McKay, M. Khalfan, Jason J. Williams, Cornel Ghiban, and D. Micklos. DOI: 10.1145/2616498.2616575 (pp. 70:1-70:3).

DNA Subway bundles research-grade bioinformatics tools, high-performance computing, and databases into easy-to-use workflows. Students have been "riding" different lines since 2010: predicting and annotating genes in up to 150 kb of raw DNA sequence (Red Line), identifying homologs in sequenced genomes (Yellow Line), identifying species using DNA barcodes and constructing phylogenetic trees (Blue Line), and examining RNA sequencing (RNA-Seq) datasets for transcript abundance and differential expression (Green Line). With support for plant and animal genomes, DNA Subway engages students in their own learning, bringing to life key concepts in molecular biology, genetics, and evolution. Integrated DNA-barcoding and RNA-extraction wet-lab experiments support a variety of inquiry-based projects using student-generated data. Products of student research can be exported, published, and used in follow-up experiments. To date, DNA Subway has over 8,000 registered users who have produced 51,000 projects.

Based on the popular Tuxedo Protocol, the Green Line was introduced in January 2014 as an easy-to-use workflow for analyzing RNA-Seq datasets. The workflow uses iPlant's APIs (http://agaveapi.co/) to access high-performance compute resources of NSF's Extreme Science and Engineering Discovery Environment (XSEDE), providing the first easy "on ramp" to biological supercomputing.
"Incorporating Job Predictions into the SEAGrid Science Gateway," by Ye Fan, Sudhakar Pamidighantam, and Warren Smith. DOI: 10.1145/2616498.2616563 (pp. 57:1-57:3).

This paper describes the process of incorporating predictions of job queue wait times and run times into a science gateway. Science gateways that integrate multiple resources can use predictions of queue wait times and run times to advise users when they choose where a job is executed, or in an automated resource selection process. These predictions are also critical for executing workflows, where it is not feasible to have users specify where each task executes and the workflow management system therefore has to perform resource selection programmatically. The SEAGrid science gateway has partially integrated wait-time prediction based on the Karnak prediction service and is in the process of extending this to run-time prediction.
"An Integrated Analytic Pipeline for Identifying and Predicting Genetic Interactions based on Perturbation Data from High Content Double RNAi Screening," by Zheng Yin, Fuhai Li, and Stephen T. C. Wong. DOI: 10.1145/2616498.2616513 (pp. 7:1-7:2).

In this paper, we describe an integrated data analysis pipeline for identifying and predicting genetic interactions based on cellular responses to perturbations by single and multiple agents. This pipeline was developed in the context of genome-wide single-RNAi screens and smaller-scale double-RNAi screens using Drosophila KC-167 cell lines, with the aim of reconstructing the molecular pathways regulating changes in cell shape. Under the XSEDE framework, the Texas Advanced Computing Center (TACC) allocated 100,000 service units (SUs) on its Stampede system to facilitate image quantification and signaling pathway modeling using fluorescence images of Drosophila cells, and a kinome-wide single-RNAi screen has recently been reported [1].
"Statistical Performance Analysis for Scientific Applications," by Fei Xing, Haihang You, and Charng-Da Lu. DOI: 10.1145/2616498.2616555 (pp. 62:1-62:8).

As high-performance computing (HPC) heads towards the exascale era, application performance analysis becomes more complex and less tractable. It usually requires considerable training, experience, and a good working knowledge of hardware/software interaction to use performance tools effectively, which becomes a barrier for domain scientists. Moreover, instrumentation and profiling activities from a large run can easily generate a gigantic data volume, making data management and characterization another challenge. To cope with these challenges, we develop a statistical method to extract the principal performance features and produce easily interpretable results. This paper introduces a performance analysis methodology based on the combination of Variable Clustering (VarCluster) and Principal Component Analysis (PCA), describes the analysis process, and gives experimental results for scientific applications on a Cray XT5 system. As a visualization aid, we use Voronoi tessellations to map the numerical results into graphical forms to convey the performance information more clearly.