
Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.) - Latest Publications

Applying Lessons from e-Discovery to Process Big Data using HPC
Sukrit Sondhi, R. Arora
The term 'Big Data' defines large datasets that are difficult to use and manage through conventional software tools. Legal Electronic Discovery (e-Discovery) is a business domain which has massive consumption of Big Data, where electronic records such as e-mail, documents, databases and social media postings are processed in order to discover evidence that may be pertinent to legal/compliance needs, litigation or other investigations. Numerous vendors exist in the market to provide organizations with services such as data collection, digital forensics and electronic discovery. High-end instrumentation and modern information technologies are creating data at an ever increasing rate. The challenges associated with managing the large datasets are related to the capture, storage, search, sharing, analytics, and visualization of the data. Big Data also offers unprecedented opportunities in other fields, ranging from astronomy and biology to marketing and e-commerce. This paper presents lessons learnt from the legal e-Discovery domain that can be adapted to process Big Data effectively on HPC resources, thereby benefitting the various disciplines of science, engineering and business that are grappling with a deluge of Big Data challenges and opportunities.
{"title":"Applying Lessons from e-Discovery to Process Big Data using HPC","authors":"Sukrit Sondhi, R. Arora","doi":"10.1145/2616498.2616525","DOIUrl":"https://doi.org/10.1145/2616498.2616525","url":null,"abstract":"The term 'Big Data' defines large datasets that are difficult to use and manage through conventional software tools. Legal Electronic Discovery (e-Discovery) is a business domain which has massive consumption of Big Data, where electronic records such as e-mail, documents, databases and social media postings are processed in order to discover evidence that may be pertinent to legal/compliance needs, litigation or other investigations. Numerous vendors exist in the market to provide organizations with services such as data collection, digital forensics and electronic discovery. High-end instrumentation and modern information technologies are creating data at an ever increasing rate. The challenges associated with managing the large datasets are related to the capture, storage, search, sharing, analytics, and visualization of the data. Big Data also offers unprecedented opportunities in other fields, ranging from astronomy and biology to marketing and e-commerce. This paper presents lessons learnt from the legal e-Discovery domain that can be adapted to process Big Data effectively on HPC resources, thereby benefitting the various disciplines of science, engineering and business that are grappling with a deluge of Big Data challenges and opportunities.","PeriodicalId":93364,"journal":{"name":"Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.)","volume":"52 1","pages":"8:1-8:2"},"PeriodicalIF":0.0,"publicationDate":"2014-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87474752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
ECSS Experience: Particle Tracing Reinvented
C. Rosales, R. McLay
This work describes an implementation of distributed particle tracking that provides a 10,000x speedup over traditional schemes. While none of the techniques used to achieve this result are completely new, they have been used in combination to great effect in this project. The implementation includes parallel IO using HDF5, a flexible load balancing scheme, and dynamic buffering to achieve excellent performance at scale. The use of HDF5 decouples the size of the simulation generating the data from the particle tracing, providing a more flexible and efficient workflow. The load balancing scheme ensures that heterogeneous particle distributions do not result in a waste of computational resources by keeping all MPI tasks occupied at any given time. Dynamic buffering minimizes MPI exchanges across MPI tasks, a critical element in the performance improvements achieved.
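To make the load-balancing idea above concrete, here is a minimal sketch in Python with mpi4py and NumPy (not the authors' code, and it omits the HDF5 parallel IO and dynamic buffering): each rank compares its local particle count with the global average and hands any surplus to its neighbour, so that no MPI task sits idle.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Hypothetical local particles (x, y, z, weight); counts deliberately differ per rank.
rng = np.random.default_rng(rank)
local = rng.random((int(rng.integers(100, 1000)), 4))

counts = comm.allgather(len(local))            # global view of the imbalance
target = int(np.ceil(sum(counts) / size))      # ideal per-rank load

# Keep `target` particles and hand the surplus to the next rank (a simplified,
# one-directional rebalance; the paper's scheme is more general).
surplus = local[target:]
local = local[:target]
incoming = comm.sendrecv(surplus,
                         dest=(rank + 1) % size,
                         source=(rank - 1) % size)
local = np.concatenate([local, incoming])

print(f"rank {rank}: {len(local)} particles after rebalance")
```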
{"title":"ECSS Experience: Particle Tracing Reinvented","authors":"C. Rosales, R. McLay","doi":"10.1145/2616498.2616527","DOIUrl":"https://doi.org/10.1145/2616498.2616527","url":null,"abstract":"This work describes an implementation of distributed particle tracking that provides a factor 10000x speedup over traditional schemes. While none of the techniques used to achieve this result are completely new, they have been used in combination to great effect in this project. The implementation includes parallel IO using HDF5, a flexible load balancing scheme, and dynamic buffering to achieve excellent performance at scale. The use of HDF5 decouples the size of the simulation generating the data from the particle tracing, providing a more flexible and efficient workflow. The load balancing scheme ensures that heterogeneous particle distributions do not result in a waste of computational resources by maintaining all the MPI tasks occupied at any given time. Dynamic buffering minimizes MPI exchanges across MPI tasks, a critical element in the performance improvements achieved.","PeriodicalId":93364,"journal":{"name":"Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.)","volume":"22 1","pages":"13:1-13:2"},"PeriodicalIF":0.0,"publicationDate":"2014-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73688116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Calculation of Sensitivity Coefficients for Individual Airport Emissions in the Continental U.S. using CMAQ-DDM/PM
S. Boone, S. Arunachalam
Fine particulate matter (PM2.5) is a federally-regulated air pollutant with well-known impacts on human health. The FAA's Destination 2025 program seeks to decrease aviation-related health impacts across the U.S. by 50% by the year 2018. Atmospheric models, such as the Community Multiscale Air Quality model (CMAQ), are used to estimate the atmospheric concentration of pollutants such as PM2.5. Sensitivity analysis of these models has long been limited to finite difference and regression-based methods, both of which require many computationally intensive model simulations to link changes in output with perturbations in input. Further, they are unable to offer detailed or ad hoc analysis for changes within a domain, such as changes in emissions on an airport-by-airport basis. In order to calculate the sensitivity of PM2.5 concentrations to emissions from individual airports, we utilize the Decoupled Direct Method in three dimensions (DDM-3D), an advanced sensitivity analysis tool recently implemented in CMAQ. DDM-3D allows calculation of sensitivity coefficients within a single simulation, eliminating the need for multiple model runs. However, while the output provides results for a variety of input perturbations in a single simulation, the processing time for each run is dramatically increased compared to simulations conducted without the DDM-3D module. Use of the XSEDE Stampede computing cluster allows us to calculate sensitivity coefficients for a large number of input parameters. This allows for a much wider variety of ad hoc aviation policy scenarios to be generated and evaluated than would be possible using other sensitivity analysis methods or smaller-scaled computing systems. We present a design of experiments to compute individual sensitivity coefficients for 139 major airports in the US, due to six different precursor emissions that form PM2.5 in the atmosphere. Simulations based on this design are currently in progress, with full results to be published at a later date.
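For context on why the brute-force alternative scales poorly, the sketch below (a toy stand-in model, not CMAQ) computes finite-difference sensitivity coefficients S_j ≈ (C(E + δe_j) − C(E))/δ, which needs one perturbed model run per emission source; DDM-3D instead propagates the derivatives inside a single simulation.

```python
import numpy as np

def toy_concentration(emissions):
    """Stand-in for an air-quality model: a made-up nonlinear response."""
    return float(np.sqrt(emissions).sum() + 0.05 * emissions.prod() ** 0.25)

emissions = np.array([5.0, 2.0, 8.0])   # hypothetical source strengths
base = toy_concentration(emissions)
delta = 1e-6

sensitivities = []
for j in range(len(emissions)):          # one perturbed "model run" per source
    perturbed = emissions.copy()
    perturbed[j] += delta
    sensitivities.append((toy_concentration(perturbed) - base) / delta)

print("finite-difference sensitivity coefficients:", sensitivities)
# For 139 airports x 6 precursor species this loop would mean ~834 extra
# full CMAQ simulations, which is the cost DDM-3D avoids.
```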
{"title":"Calculation of Sensitivity Coefficients for Individual Airport Emissions in the Continental U.S. using CMAQ-DDM/PM","authors":"S. Boone, S. Arunachalam","doi":"10.1145/2616498.2616504","DOIUrl":"https://doi.org/10.1145/2616498.2616504","url":null,"abstract":"Fine particulate matter (PM2.5) is a federally-regulated air pollutant with well-known impacts on human health. The FAA's Destination 2025 program seeks to decrease aviation-related health impacts across the U.S. by 50% by the year 2018. Atmospheric models, such as the Community Multiscale Air Quality model (CMAQ), are used to estimate the atmospheric concentration of pollutants such as PM2.5. Sensitivity analysis of these models has long been limited to finite difference and regression-based methods, both of which require many computationally intensive model simulations to link changes in output with perturbations in input. Further, they are unable to offer detailed or ad hoc analysis for changes within a domain, such as changes in emissions on an airport-by-airport basis. In order to calculate the sensitivity of PM2.5 concentrations to emissions from individual airports, we utilize the Decoupled Direct Method in three dimensions (DDM-3D), an advanced sensitivity analysis tool recently implemented in CMAQ. DDM-3D allows calculation of sensitivity coefficients within a single simulation, eliminating the need for multiple model runs. However, while the output provides results for a variety of input perturbations in a single simulation, the processing time for each run is dramatically increased compared to simulations conducted without the DDM-3D module.\u0000 Use of the XSEDE Stampede computing cluster allows us to calculate sensitivity coefficients for a large number of input parameters. This allows for a much wider variety of ad hoc aviation policy scenarios to be generated and evaluated than would be possible using other sensitivity analysis methods or smaller-scaled computing systems. We present a design of experiments to compute individual sensitivity coefficients for 139 major airports in the US, due to six different precursor emissions that form PM2.5 in the atmosphere. Simulations based on this design are currently in progress, with full results to be published at a later date.","PeriodicalId":93364,"journal":{"name":"Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.)","volume":"54 1","pages":"10:1-10:8"},"PeriodicalIF":0.0,"publicationDate":"2014-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74824788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Launcher: A Shell-based Framework for Rapid Development of Parallel Parametric Studies
Lucas A. Wilson, John M. Fonner
Petascale computing systems have enabled tremendous advances for traditional simulation and modeling algorithms that are built around parallel execution. Unfortunately, scientific domains using data-oriented or high-throughput paradigms have difficulty taking full advantage of these resources without custom software development. This paper describes our solution for rapidly developing parallel parametric studies using sequential or threaded tasks: the launcher. We detail how to get ensembles executing quickly through the common job schedulers SGE and SLURM, and the various user-customizable options that the launcher provides. We illustrate the efficiency of our tool by presenting execution results at large scale (over 65,000 cores) for varying workloads, including a virtual screening workload with indeterminate runtimes using the drug docking software Autodock Vina.
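The pattern the launcher automates is a "bag of independent tasks". Below is a minimal local sketch, assuming a hypothetical paramlist file with one shell command per line; it only mimics the single-node case, whereas the launcher itself handles multi-node placement through SGE/SLURM.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_task(numbered_cmd):
    index, cmd = numbered_cmd
    completed = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return index, completed.returncode

# Hypothetical task file: one independent shell command per line.
with open("paramlist") as f:
    tasks = [line.strip() for line in f if line.strip()]

# Fan the tasks out over 8 local workers; the real launcher spreads the same
# kind of list over many nodes allocated by the scheduler.
with ThreadPoolExecutor(max_workers=8) as pool:
    for index, rc in pool.map(run_task, enumerate(tasks)):
        print(f"task {index}: exit code {rc}")
```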
{"title":"Launcher: A Shell-based Framework for Rapid Development of Parallel Parametric Studies","authors":"Lucas A. Wilson, John M. Fonner","doi":"10.1145/2616498.2616534","DOIUrl":"https://doi.org/10.1145/2616498.2616534","url":null,"abstract":"Petascale computing systems have enabled tremendous advances for traditional simulation and modeling algorithms that are built around parallel execution. Unfortunately, scientific domains using data-oriented or high-throughput paradigms have difficulty taking full advantage of these resources without custom software development. This paper describes our solution for rapidly developing parallel parametric studies using sequential or threaded tasks: The launcher. We detail how to get ensembles executing quickly through common job schedulers SGE and SLURM, and the various user-customizable options that the launcher provides. We illustrate the efficiency of or tool by presenting execution results at large scale (over 65,000 cores) for varying workloads, including a virtual screening workload with indeterminate runtimes using the drug docking software Autodock Vina.","PeriodicalId":93364,"journal":{"name":"Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.)","volume":"146 1","pages":"40:1-40:8"},"PeriodicalIF":0.0,"publicationDate":"2014-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86404758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Descriptive Data Analysis of File Transfer Data
S. Srinivasan, Victor Hazlewood, G. D. Peterson
There are millions of files and multi-terabytes of data transferred to and from the University of Tennessee's National Institute for Computational Sciences each month. New capabilities available with GridFTP version 5.2.2 include additional transfer log information previously unavailable in prior versions implemented within XSEDE. The transfer log data now available includes identification of source and destination endpoints which unlocks a wealth of information that can be used to detail GridFTP activities across the Internet. This information can be used for a wide variety of reports of interest to individual XSEDE Service Providers and to XSEDE Operations. In this paper, we discuss the new capabilities available for transfer logs in GridFTP 5.2.2, our initial attempt to organize, analyze, and report on this file transfer data for NICS, and its applicability to XSEDE Service Providers. Analysis of this new information can provide insight into effective and efficient utilization of GridFTP resources including identification of potential areas of GridFTP file transfer improvement (e.g., network and server tuning) and potential predictive analysis to improve efficiency.
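As an illustration of the kind of endpoint-level reporting the new source/destination fields make possible, here is a small sketch with made-up records; the field layout and hostnames are hypothetical, not the actual GridFTP 5.2.2 log format.

```python
from collections import defaultdict

# Hypothetical transfer records: (source endpoint, destination endpoint, bytes).
# A real report would parse these fields out of the GridFTP transfer log.
records = [
    ("dtn1.site-a.example.org", "dtn2.site-b.example.org", 7_500_000_000),
    ("dtn1.site-a.example.org", "dtn2.site-b.example.org", 1_200_000_000),
    ("dtn3.site-c.example.org", "dtn1.site-a.example.org",   650_000_000),
]

totals = defaultdict(lambda: {"bytes": 0, "transfers": 0})
for src, dst, nbytes in records:
    totals[(src, dst)]["bytes"] += nbytes
    totals[(src, dst)]["transfers"] += 1

for (src, dst), stats in sorted(totals.items(), key=lambda kv: -kv[1]["bytes"]):
    print(f"{src} -> {dst}: {stats['bytes'] / 1e9:.1f} GB "
          f"in {stats['transfers']} transfers")
```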
{"title":"Descriptive Data Analysis of File Transfer Data","authors":"S. Srinivasan, Victor Hazlewood, G. D. Peterson","doi":"10.1145/2616498.2616550","DOIUrl":"https://doi.org/10.1145/2616498.2616550","url":null,"abstract":"There are millions of files and multi-terabytes of data transferred to and from the University of Tennessee's National Institute for Computational Sciences each month. New capabilities available with GridFTP version 5.2.2 include additional transfer log information previously unavailable in prior versions implemented within XSEDE. The transfer log data now available includes identification of source and destination endpoints which unlocks a wealth of information that can be used to detail GridFTP activities across the Internet. This information can be used for a wide variety of reports of interest to individual XSEDE Service Providers and to XSEDE Operations. In this paper, we discuss the new capabilities available for transfer logs in GridFTP 5.2.2, our initial attempt to organize, analyze, and report on this file transfer data for NICS, and its applicability to XSEDE Service Providers. Analysis of this new information can provide insight into effective and efficient utilization of GridFTP resources including identification of potential areas of GridFTP file transfer improvement (e.g., network and server tuning) and potential predictive analysis to improve efficiency.","PeriodicalId":93364,"journal":{"name":"Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.)","volume":"112 1","pages":"37:1-37:8"},"PeriodicalIF":0.0,"publicationDate":"2014-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85777550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
PGDB: A Debugger for MPI Applications
Nikoli Dryden
As MPI applications scale to larger machines, errors that had been hidden from testing at smaller scales begin to manifest themselves. It is therefore necessary to extend debuggers to work at these scales, in order for efficient development of correct applications to proceed. PGDB is the Parallel GDB, an open-source debugger for MPI applications that provides such a capability. It is designed from the ground up to be a robust debugging environment at scale, while presenting an interface similar to that of the typical command-line GDB debugger. Its usage on representative debugging problems is demonstrated and its scalability on the Stampede supercomputer is evaluated.
{"title":"PGDB: A Debugger for MPI Applications","authors":"Nikoli Dryden","doi":"10.1145/2616498.2616535","DOIUrl":"https://doi.org/10.1145/2616498.2616535","url":null,"abstract":"As MPI applications scale to larger machines, errors that had been hidden from testing at smaller scales begin to manifest themselves. It is therefore necessary to extend debuggers to work at these scales, in order for efficient development of correct applications to proceed. PGDB is the Parallel GDB, an open-source debugger for MPI applications that provides such a capability. It is designed from the ground up to be a robust debugging environment at scale, while presenting an interface similar to that of the typical command-line GDB debugger. Its usage on representative debugging problems is demonstrated and its scalability on the Stampede supercomputer is evaluated.","PeriodicalId":93364,"journal":{"name":"Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.)","volume":"75 1","pages":"44:1-44:7"},"PeriodicalIF":0.0,"publicationDate":"2014-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77155305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
DNA Subway: Making Genome Analysis Egalitarian
Uwe Hilgert, S. McKay, M. Khalfan, Jason J. Williams, Cornel Ghiban, D. Micklos
DNA Subway bundles research-grade bioinformatics tools, high-performance computing, and databases into easy-to-use workflows. Students have been "riding" different lines since 2010, to predict and annotate genes in up to 150kb of raw DNA sequence (Red Line), identify homologs in sequenced genomes (Yellow Line), identify species using DNA barcodes and construct phylogenetic trees (Blue Line), and examine RNA sequence (RNA-Seq) datasets for transcript abundance and differential expression (Green Line). With support for plant and animal genomes, DNA Subway engages students in their own learning, bringing to life key concepts in molecular biology, genetics, and evolution. Integrated DNA barcoding and RNA extraction wet-lab experiments support a variety of inquiry-based projects using student-generated data. Products of student research can be exported, published, and used in follow-up experiments. To date, DNA Subway has over 8,000 registered users who have produced 51,000 projects. Based on the popular Tuxedo Protocol, the Green Line was introduced in January 2014 as an easy-to-use workflow to analyze RNA-Seq datasets. The workflow uses iPlant's APIs (http://agaveapi.co/) to access high-performance compute resources of NSF's Extreme Scientific and Engineering Discovery Environment (XSEDE), providing the first easy "on ramp" to biological supercomputing.
{"title":"DNA Subway: Making Genome Analysis Egalitarian","authors":"Uwe Hilgert, S. McKay, M. Khalfan, Jason J. Williams, Cornel Ghiban, D. Micklos","doi":"10.1145/2616498.2616575","DOIUrl":"https://doi.org/10.1145/2616498.2616575","url":null,"abstract":"DNA Subway bundles research-grade bioinformatics tools, high-performance computing, and databases into easy-to-use workflows. Students have been \"riding\" different lines since 2010, to predict and annotate genes in up to 150kb of raw DNA sequence (Red Line), identify homologs in sequenced genomes (Yellow Line), identify species using DNA barcodes and construct phylogenetic trees (Blue Line), and examine RNA sequence (RNA-Seq) datasets for transcript abundance and differential expression (Green Line). With support for plant and animal genomes, DNA Subway engages students in their own learning, bringing to life key concepts in molecular biology, genetics, and evolution. Integrated DNA barcoding and RNA extraction wet-lab experiments support a variety of inquiry-based projects using student-generated data. Products of student research can be exported, published, and used in follow-up experiments. To date, DNA Subway has over 8,000 registered users who have produced 51,000 projects.\u0000 Based on the popular Tuxedo Protocol, the Green Line was introduced in January 2014 as an easy-to-use workflow to analyze RNA-Seq datasets. The workflow uses iPlant's APIs (http://agaveapi.co/) to access high-performance compute resources of NSF's Extreme Scientific and Engineering Discovery Environment (XSEDE), providing the first easy \"on ramp\" to biological supercomputing.","PeriodicalId":93364,"journal":{"name":"Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.)","volume":"27 1","pages":"70:1-70:3"},"PeriodicalIF":0.0,"publicationDate":"2014-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82707674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Incorporating Job Predictions into the SEAGrid Science Gateway
Ye Fan, Sudhakar Pamidighantam, Warren Smith
This paper describes the process of incorporating predictions of job queue wait times and run times into a Science Gateway. Science Gateways that integrate multiple resources can use predictions of queue wait times and run times to advise users when they choose where a job is executed, or in an automated resource selection process. These predictions are also critical in executing workflows, where it isn't feasible to have users specify where each task executes and the workflow management system therefore has to perform resource selection programmatically. The SEAGrid science gateway has partly integrated wait time prediction based on the Karnak prediction service and is in the process of extending this to run time prediction.
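A minimal sketch of how such predictions can feed automated resource selection follows; the resource names, numbers, and "minimize wait plus run time" rule are assumptions for illustration, not SEAGrid's actual policy.

```python
# Hypothetical predictions, in seconds: (queue wait, run time) per resource.
# SEAGrid obtains wait-time estimates from the Karnak service; the values and
# resource names here are invented for illustration.
predictions = {
    "cluster-a": (3600, 7200),
    "cluster-b": (300, 10800),
    "cluster-c": (1800, 8100),
}

def pick_resource(preds):
    """Choose the resource with the smallest predicted wait + run time."""
    return min(preds.items(), key=lambda item: sum(item[1]))

best, (wait, run) = pick_resource(predictions)
print(f"submit to {best}: expected completion in {(wait + run) / 3600:.1f} h")
```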
{"title":"Incorporating Job Predictions into the SEAGrid Science Gateway","authors":"Ye Fan, Sudhakar Pamidighantam, Warren Smith","doi":"10.1145/2616498.2616563","DOIUrl":"https://doi.org/10.1145/2616498.2616563","url":null,"abstract":"This paper describes the process of incorporating predictions of job queue wait times and run times into a Science Gateway. Science Gateways that integrate multiple resources can use predictions of queue wait times and run times to advice users when they choose where a job is executed or in an automated resource selection process. These predictions are also critical in executing workflows were it isn't feasible to have users specify where each task executes and the workflow management system therefore has to perform resource selection programmatically. SEAGrid science gateway has partly integrated the estimation of wait time prediction based on Karnak prediction service and is in the process of extending this to run time prediction.","PeriodicalId":93364,"journal":{"name":"Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.)","volume":"31 1","pages":"57:1-57:3"},"PeriodicalIF":0.0,"publicationDate":"2014-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86019719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
An Integrated Analytic Pipeline for Identifying and Predicting Genetic Interactions based on Perturbation Data from High Content Double RNAi Screening
Zheng Yin, Fuhai Li, Stephen T. C. Wong
In this paper, we describe an integrated data analysis pipeline for identifying and predicting genetic interactions based on cellular responses to single- and multiple-agent perturbations. This pipeline was developed in the context of genome-wide single-RNAi screens and smaller-scale double-RNAi screens using Drosophila KC-167 cell lines, with the aim of reconstructing the molecular pathways regulating changes in cell shape. TACC (Texas Advanced Computing Center), under the XSEDE framework, allocated 100,000 service units (SUs) from its Stampede system to facilitate image quantification and signaling pathway modeling using fluorescence images of Drosophila cells, and recently a kinome-wide single RNAi screening has been reported [1].
{"title":"An Integrated Analytic Pipeline for Identifying and Predicting Genetic Interactions based on Perturbation Data from High Content Double RNAi Screening","authors":"Zheng Yin, Fuhai Li, Stephen T. C. Wong","doi":"10.1145/2616498.2616513","DOIUrl":"https://doi.org/10.1145/2616498.2616513","url":null,"abstract":"In this paper, we describe an integrated data analysis pipeline for identifying and predicting genetic interactions based on cellular responses to perturbations of single- and multiple-agents. This pipeline was developed in the context of genome wide single-RNAi screens and smaller scale double-RNAi screens using Drosophila KC-167 cell lines, with the aim to reconstruct the molecular pathways regulating changes in cell shape. The TACC (Texas Advanced Computing Center) under XSEDE framework allocated 100,000 service unites (SUs) from its Stampede system to facilitate image quantification and signaling pathway modeling using fluorescence images of Drosophila cells, and recently a kinome-wide single RNAi screening has been reported [1].","PeriodicalId":93364,"journal":{"name":"Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.)","volume":"05 1","pages":"7:1-7:2"},"PeriodicalIF":0.0,"publicationDate":"2014-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85910456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Statistical Performance Analysis for Scientific Applications
Fei Xing, Haihang You, Charng-Da Lu
As high-performance computing (HPC) heads towards the exascale era, application performance analysis becomes more complex and less tractable. It usually requires considerable training, experience, and a good working knowledge of hardware/software interaction to use performance tools effectively, which becomes a barrier for domain scientists. Moreover, instrumentation and profiling activities from a large run can easily generate gigantic data volume, making both data management and characterization another challenge. To cope with these, we develop a statistical method to extract the principal performance features and produce easily interpretable results. This paper introduces a performance analysis methodology based on the combination of Variable Clustering (VarCluster) and Principal Component Analysis (PCA), describes the analysis process, and gives experimental results of scientific applications on a Cray XT5 system. As a visualization aid, we use Voronoi tessellations to map the numerical results into graphical forms to convey the performance information more clearly.
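To sketch just the PCA step of such a pipeline (variable clustering and the Voronoi visualization are omitted, and the counter values below are synthetic), the principal components of a standardized process-by-metric matrix can be obtained from its SVD:

```python
import numpy as np

# Rows = processes, columns = performance metrics (e.g. FLOP rate, cache
# misses, bytes communicated). The numbers are synthetic, not measurements.
X = np.array([
    [2.1e9, 3.2e6, 1.1e8],
    [2.0e9, 3.1e6, 1.0e8],
    [0.9e9, 8.5e6, 4.0e8],
    [1.0e9, 8.1e6, 3.8e8],
])

Xc = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each metric
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / (S**2).sum()

scores = Xc @ Vt[:2].T                           # coordinates on the first 2 PCs
print("variance explained by PC1, PC2:", explained[:2])
print("per-process scores:\n", scores)
```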
{"title":"Statistical Performance Analysis for Scientific Applications","authors":"Fei Xing, Haihang You, Charng-Da Lu","doi":"10.1145/2616498.2616555","DOIUrl":"https://doi.org/10.1145/2616498.2616555","url":null,"abstract":"As high-performance computing (HPC) heads towards the exascale era, application performance analysis becomes more complex and less tractable. It usually requires considerable training, experience, and a good working knowledge of hardware/software interaction to use performance tools effectively, which becomes a barrier for domain scientists. Moreover, instrumentation and profiling activities from a large run can easily generate gigantic data volume, making both data management and characterization another challenge. To cope with these, we develop a statistical method to extract the principal performance features and produce easily interpretable results. This paper introduces a performance analysis methodology based on the combination of Variable Clustering (VarCluster) and Principal Component Analysis (PCA), describes the analysis process, and gives experimental results of scientific applications on a Cray XT5 system. As a visualization aid, we use Voronoi tessellations to map the numerical results into graphical forms to convey the performance information more clearly.","PeriodicalId":93364,"journal":{"name":"Proceedings of XSEDE16 : Diversity, Big Data, and Science at Scale : July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th : 2016 : Miami, Fla.)","volume":"2 1","pages":"62:1-62:8"},"PeriodicalIF":0.0,"publicationDate":"2014-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89355620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1