Scientific and statistical database management : International Conference, SSDBM ... : proceedings. Latest publications from the International Conference on Scientific and Statistical Database Management
Scientific data have a dual structure. Raw data are preponderantly ordered multi-dimensional arrays or sequences, while metadata and derived data are best represented as unordered relations. Scientific data processing requires complex operations over both arrays and relations, and these operations cannot be expressed using only standard linear algebra and relational algebra operators. Existing scientific data processing systems are designed for a single data model and handle complex processing at the application level. EXTASCID is a complete and extensible system for scientific data processing. It supports both array and relational data natively. Complex processing is handled by a metaoperator that can execute arbitrary user code. As a result, EXTASCID can process full scientific workflows inside the system, with minimal data movement and application code. We illustrate the overall process on a real dataset and workflow from astronomy: starting from a set of sky images, the goal is to identify and classify transient astrophysical objects.
{"title":"Astronomical data processing in EXTASCID","authors":"Yu Cheng, Florin Rusu","doi":"10.1145/2484838.2484875","DOIUrl":"https://doi.org/10.1145/2484838.2484875","url":null,"abstract":"Scientific data have dual structure. Raw data are preponderantly ordered multi-dimensional arrays or sequences while metadata and derived data are best represented as unordered relations. Scientific data processing requires complex operations over arrays and relations. These operations cannot be expressed using only standard linear and relational algebra operators, respectively. Existing scientific data processing systems are designed for a single data model and handle complex processing at the application level.\u0000 EXTASCID is a complete and extensible system for scientific data processing. It supports both array and relational data natively. Complex processing is handled by a metaoperator that can execute any user code. As a result, EXTASCID can process full scientific workflows inside the system, with minimal data movement and application code. We illustrate the overall process on a real dataset and workflow from astronomy---starting with a set of sky images, the goal is to identify and classify transient astrophysical objects.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"98 1","pages":"47:1-47:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82554087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Next-generation sequencing (NGS) technology allows us to peer inside the cell in exquisite detail, revealing new insights into biology, evolution, and disease that would have been impossible to find just a few years ago. The enormous volumes of data produced by NGS experiments present many computational challenges that we are working to address. In this talk, I will discuss solutions to two basic alignment problems: (1) mapping sequences onto the human genome at very high speed, and (2) mapping and assembling transcripts from RNA-seq experiments. I will also discuss some of the problems that can arise during alignment and how these can lead to mistaken conclusions about genetic variation and gene expression. My group has developed algorithms to solve each of these problems, including the widely-used Bowtie and Bowtie2 programs for fast alignment and the TopHat and Cufflinks programs for assembly and quantification of genes in transcriptome sequencing (RNA-seq) experiments. This talk describes joint work with current and former lab members including Ben Langmead, Cole Trapnell, Daehwan Kim, and Geo Pertea; and with collaborators including Mihai Pop and Lior Pachter.
{"title":"Computational challenges in next-generation genomics","authors":"S. Salzberg","doi":"10.1145/2484838.2484885","DOIUrl":"https://doi.org/10.1145/2484838.2484885","url":null,"abstract":"Next-generation sequencing (NGS) technology allows us to peer inside the cell in exquisite detail, revealing new insights into biology, evolution, and disease that would have been impossible to find just a few years ago. The enormous volumes of data produced by NGS experiments present many computational challenges that we are working to address. In this talk, I will discuss solutions to two basic alignment problems: (1) mapping sequences onto the human genome at very high speed, and (2) mapping and assembling transcripts from RNA-seq experiments. I will also discuss some of the problems that can arise during alignment and how these can lead to mistaken conclusions about genetic variation and gene expression.\u0000 My group has developed algorithms to solve each of these problems, including the widely-used Bowtie and Bowtie2 programs for fast alignment and the TopHat and Cufflinks programs for assembly and quantification of genes in transcriptome sequencing (RNA-seq) experiments. This talk describes joint work with current and former lab members including Ben Langmead, Cole Trapnell, Daehwan Kim, and Geo Pertea; and with collaborators including Mihai Pop and Lior Pachter.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"26 1","pages":"2:1"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87927254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Histograms provide effective synopses of large data sets and are thus used in a wide variety of applications, including query optimization, approximate query answering, distribution fitting, parallel database partitioning, and data mining. Moreover, very fast approximate algorithms are needed to compute accurate histograms on fast-arriving data streams, so that online queries can be supported within the given memory and computing resources. Many real-life applications require that the data distribution in certain regions be modeled with greater accuracy, and biased histograms are designed to address this need. In this paper, we define biased histograms over data streams and over sliding windows on data streams, and propose the Bar Splitting Biased Histogram (BSBH) algorithm to construct them efficiently and accurately. We prove that BSBH generates expected ε-approximate biased histograms for data streams with stationary distributions, and our experiments show that BSBH also achieves good approximation in the presence of concept shifts, even major ones. Additionally, BSBH employs a new biased sampling technique which outperforms uniform sampling in terms of accuracy, while using about the same amount of time and memory. Therefore, BSBH outperforms previously proposed algorithms for computing biased histograms over the whole data stream, and it is the first algorithm that supports sliding windows.
{"title":"Fast computation of approximate biased histograms on sliding windows over data streams","authors":"Hamid Mousavi, C. Zaniolo","doi":"10.1145/2484838.2484851","DOIUrl":"https://doi.org/10.1145/2484838.2484851","url":null,"abstract":"Histograms provide effective synopses of large data sets, and are thus used in a wide variety of applications, including query optimization, approximate query answering, distribution fitting, parallel database partitioning, and data mining. Moreover, very fast approximate algorithms are needed to compute accurate histograms on fast-arriving data streams, whereby online queries can be supported within the given memory and computing resources. Many real-life applications require that the data distribution in certain regions must be modeled with greater accuracy, and Biased Histograms are designed to address this need. In this paper, we define biased histograms over data streams and sliding windows on data streams, and propose the Bar Splitting Biased Histogram (BSBH) algorithm to construct them efficiently and accurately. We prove that BSBH generates expected ∈-approximate biased histograms for data streams with stationary distributions, and our experiments show that BSBH also achieves good approximation in the presence of concept shifts, even major ones. Additionally, BSBH employs a new biased sampling technique which outperforms uniform sampling in terms of accuracy, while using about the same amount of time and memory. Therefore, BSBH outperforms previously proposed algorithms for computing biased histograms over the whole data stream, and it is the first algorithm that supports windows.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"144 1","pages":"13:1-13:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90120300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FastQuery is a parallel indexing and querying system we developed for accelerating analysis and visualization of scientific data. We have applied it to a wide variety of HPC applications and demonstrated its capability and scalability using a petascale trillion-particle simulation in our previous work. Yet, through our experience, we found that the performance of reading and writing data with FastQuery, as with many other HPC applications, can be significantly affected by various tunable parameters throughout the parallel I/O stack. In this paper, we describe our success in tuning the performance of FastQuery on a Lustre parallel file system. We study and analyze the impact of parameters and tunable settings at the file system, MPI-IO library, and HDF5 library levels of the I/O stack. We demonstrate that a combined optimization strategy is able to improve the performance and I/O bandwidth of FastQuery significantly. In our tests with a trillion-particle dataset, the time to index the dataset was reduced by more than half.
{"title":"Optimizing fastquery performance on lustre file system","authors":"Kuan-Wu Lin, S. Byna, J. Chou, Kesheng Wu","doi":"10.1145/2484838.2484853","DOIUrl":"https://doi.org/10.1145/2484838.2484853","url":null,"abstract":"FastQuery is a parallel indexing and querying system we developed for accelerating analysis and visualization of scientific data. We have applied it to a wide variety of HPC applications and demonstrated its capability and scalability using a petascale trillion-particle simulation in our previous work. Yet, through our experience, we found that performance of reading and writing data with FastQuery, like many other HPC applications, could be significantly affected by various tunable parameters throughout the parallel I/O stack. In this paper, we describe our success in tuning the performance of FastQuery on a Lustre parallel file system. We study and analyze the impact of parameters and tunable settings at file system, MPI-IO library, and HDF5 library levels of the I/O stack. We demonstrate that a combined optimization strategy is able to improve performance and I/O bandwidth of FastQuery significantly. In our tests with a trillion-particle dataset, the time to index the dataset reduced by more than one half.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"35 1","pages":"29:1-29:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74668473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Lorenz, Lars Dannecker, Philipp J. Rösch, Wolfgang Lehner, Gregor Hackenbroich, B. Schlegel
Forecasting is an important data analysis technique and serves as the basis for business planning in many application areas such as energy, sales, and traffic management. The currently employed statistical models already provide very accurate predictions, but the forecast calculation process is very time consuming. This is especially true since many application domains deal with hierarchically organized data. Forecasting in these environments is especially challenging because forecast consistency between hierarchy levels must be ensured, which increases the data processing and communication effort. For this purpose, we introduce a novel hierarchical forecasting approach in which we push forecast models to the entities on the lowest hierarchy level and reuse these models to efficiently create forecast models on higher hierarchy levels. With that, we avoid the time-consuming parameter estimation process and allow an almost instant calculation of forecasts.
{"title":"Forecasting in hierarchical environments","authors":"R. Lorenz, Lars Dannecker, Philipp J. Rösch, Wolfgang Lehner, Gregor Hackenbroich, B. Schlegel","doi":"10.1145/2484838.2484849","DOIUrl":"https://doi.org/10.1145/2484838.2484849","url":null,"abstract":"Forecasting is an important data analysis technique and serves as the basis for business planning in many application areas such as energy, sales and traffic management. The currently employed statistical models already provide very accurate predictions, but the forecasting calculation process is very time consuming. This is especially true since many application domains deal with hierarchically organized data. Forecasting in these environments is especially challenging due to ensuring forecasting consistency between hierarchy levels, which leads to an increased data processing and communication effort. For this purpose, we introduce our novel hierarchical forecasting approach, where we propose to push forecast models to the entities on the lowest hierarch level and reuse these models to efficiently create forecast models on higher hierarchical levels. With that we avoid the time-consuming parameter estimation process and allow an almost instant calculation of forecasts.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"58 1","pages":"37:1-37:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90811203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Crankshaw, R. Burns, B. Falck, T. Budavári, A. Szalay, Jie-Shuang Wang
We describe the challenges arising from tracking dark matter particles in state-of-the-art cosmological simulations. We are in the process of running the Indra suite of simulations, with an aggregate count of more than 35 trillion particles and 1.1 PB of total raw data volume. However, it is not enough just to store the particle positions and velocities in an efficient manner; analyses also need to be able to track individual particles efficiently through the temporal history of the simulation. The required inverted indices can easily have raw sizes comparable to the original simulation. We explore various strategies for creating an efficient index for such data, using additional insight from the physical properties of the particle motions to obtain a greatly compressed data representation. The basic particle data are stored in a relational database in coarse-grained containers corresponding to the leaves of a fixed-depth octree labeled by their Peano-Hilbert (PH) index. Within each container the individual objects are sorted by their Lagrangian identifier. Thus each particle has a multi-level address: the PH key of the container and the index of the particle within the sorted array (the slot). Given the nature of the cosmological simulations and the choice of the PH-box sizes, in consecutive snapshots particles can only cross into spatially adjacent boxes. Also, the slot number of a particle in adjacent snapshots typically shifts up or down by only a small number. As a result, a special version of delta encoding over the multi-tier address already results in a dramatic reduction of the data that needs to be stored. We then apply an efficient bit compression, adapted to the statistical properties of the two-part addresses, achieving a final compression ratio better than a factor of 9. The final size of the full inverted index is projected to be 22.5 TB for a petabyte ensemble of simulations.
{"title":"Inverted indices for particle tracking in petascale cosmological simulations","authors":"D. Crankshaw, R. Burns, B. Falck, T. Budavári, A. Szalay, Jie-Shuang Wang","doi":"10.1145/2484838.2484882","DOIUrl":"https://doi.org/10.1145/2484838.2484882","url":null,"abstract":"We describe the challenges arising from tracking dark matter particles in state of the art cosmological simulations. We are in the process of running the Indra suite of simulations, with an aggregate count of more than 35 trillion particles and 1.1PB of total raw data volume. However, it is not enough just to store the particle positions and velocities in an efficient manner -- analyses also need to be able to track individual particles efficiently through the temporal history of the simulation. The required inverted indices can easily have raw sizes comparable to the original simulation.\u0000 We explore various strategies on how to create an efficient index for such data, using additional insight from the physical properties of the particle motions for a greatly compressed data representation. The basic particle data are stored in a relational database in course-grained containers corresponding to leaves of a fixed depth oct-tree labeled by their Peano-Hilbert index. Within each container the individual objects are sorted by their Lagrangian identifier. Thus each particle has a multi-level address: the PH key of the container and the index of the particle within the sorted array (the slot).\u0000 Given the nature of the cosmological simulations and choice of the PH-box sizes, in consecutive snapshots particles can only cross into spatially adjacent boxes. Also, the slot number of a particle in adjacent snapshots is adjusted up or down by typically a small number. As a result, a special version of delta encoding over the multi-tier address already results in a dramatic reduction of data that needs to be stored. We follow next with an efficient bit-compression, adapting to the statistical properties of the two-part addresses, achieving a final compression ratio better than a factor of 9. The final size of the full inverted index is projected to be 22.5 TB for a petabyte ensemble of simulations.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"50 1","pages":"25:1-25:10"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90884092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the critical design issues in implementing handshake-join hardware is result collection, which is performed by a merging network. To address this issue, we introduce an adaptive merging network. Our implementation achieves over 3 million tuples per second when the selectivity is 0.1. The proposed implementation attains up to 5.2x higher throughput than the original handshake-join hardware. In this demonstration, we apply the proposed technique to filter out malicious packets from packet streams. To the best of our knowledge, our system is the fastest handshake-join implementation on an FPGA.
{"title":"A fast handshake join implementation on FPGA with adaptive merging network","authors":"Yasin Oge, T. Miyoshi, H. Kawashima, T. Yoshinaga","doi":"10.1145/2484838.2484868","DOIUrl":"https://doi.org/10.1145/2484838.2484868","url":null,"abstract":"One of a critical design issues for implementing handshake-join hardware is result collection performed by a merging network. To address the issue, we introduce an adaptive merging network. Our implementation achieves over 3 million tuples per second when the selectivity is 0.1. The proposed implementation attains up to 5.2x higher throughput than original handshake-join hardware. In this demonstration, we apply the proposed technique to filter out malicious packets from packet streams. To the best of our knowledge, our system is the fastest handshake join implementation on FPGA.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"15 1","pages":"44:1-44:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88037040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Halperin, F. Ribalet, Konstantin Weitz, M. Saito, Bill Howe, E. Armbrust
We consider a case study using SQL-as-a-Service to support "instant analysis" of weakly structured relational data at a multi-investigator science retreat. Here, "weakly structured" means tabular, rows-and-columns datasets that share some common context but have limited a priori agreement on file formats, relationships, types, schemas, metadata, or semantics. In this case study, the data were acquired from hundreds of distinct locations during a multi-day oceanographic cruise using a variety of physical, biological, and chemical sensors and assays. Months after the cruise, when preliminary data processing was complete, 40+ researchers from a variety of disciplines participated in a two-day "data synthesis workshop." At this workshop, two computer scientists used a web-based query-as-a-service platform called SQLShare to perform "SQL stenography": capturing the scientific discussion in real time to integrate data, test hypotheses, and populate visualizations that then informed and enhanced further discussion. In this "field test" of our technology and approach, we found not only that it was feasible to support interactive science Q&A with essentially pure SQL, but also that we significantly increased the value of the "face time" at the meeting: researchers from different fields were able to validate assumptions and resolve ambiguity about each other's fields. As a result, new science emerged from a meeting that was originally just a planning meeting. In this paper, we describe the details of this experiment, discuss our major findings, and lay out a new research agenda for collaborative science database services.
{"title":"Real-time collaborative analysis with (almost) pure SQL: a case study in biogeochemical oceanography","authors":"D. Halperin, F. Ribalet, Konstantin Weitz, M. Saito, Bill Howe, E. Armbrust","doi":"10.1145/2484838.2484880","DOIUrl":"https://doi.org/10.1145/2484838.2484880","url":null,"abstract":"We consider a case study using SQL-as-a-Service to support \"instant analysis\" of weakly structured relational data at a multi-investigator science retreat. Here, \"weakly structured\" means tabular, rows-and-columns datasets that share some common context, but that have limited a priori agreement on file formats, relationships, types, schemas, metadata, or semantics. In this case study, the data were acquired from hundreds of distinct locations during a multi-day oceanographic cruise using a variety of physical, biological, and chemical sensors and assays. Months after the cruise when preliminary data processing was complete, 40+ researchers from a variety of disciplines participated in a two-day \"data synthesis workshop.\" At this workshop, two computer scientists used a web-based query-as-a-service platform called SQLShare to perform \"SQL stenography\": capturing the scientific discussion in real time to integrate data, test hypotheses, and populate visualizations to then inform and enhance further discussion. In this \"field test\" of our technology and approach, we found that it was not only feasible to support interactive science Q&A with essentially pure SQL, but that we significantly increased the value of the \"face time\" at the meeting: researchers from different fields were able to validate assumptions and resolve ambiguity about each others' fields. As a result, new science emerged from a meeting that was originally just a planning meeting. In this paper, we describe the details of this experiment, discuss our major findings, and lay out a new research agenda for collaborative science database services.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"95 1","pages":"28:1-28:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76100039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Forecast uncertainty information is not available in the immediate output of numerical weather prediction (NWP) models, yet such information is required for optimal decision-making processes in many domains. Prediction intervals are a prominent form of reporting forecast uncertainty. In this paper, a series of learning methods is investigated to obtain prediction interval models through a statistical post-processing procedure based on the historical performance of an NWP system. The article investigates the application of a number of different quantile regression algorithms, including kernel quantile regression, to compute prediction intervals for target weather attributes. These quantile regression methods, along with a recently proposed fuzzy clustering-based distribution fitting model, are benchmarked in a set of experiments involving a three-year database of hourly NWP forecast and observation records. The roles of different feature sets and model parameters are studied as well. The forecast skill of the obtained prediction intervals is evaluated not only by means of classical cross-validation experiments, but also subject to a new sampling variation process that assesses the uncertainty of the skill score measurements. The results also show how the different methods compare in terms of various quality aspects of prediction interval forecasts, such as sharpness and reliability.
{"title":"Learning uncertainty models from weather forecast performance databases using quantile regression","authors":"A. Zarnani, P. Musílek","doi":"10.1145/2484838.2484840","DOIUrl":"https://doi.org/10.1145/2484838.2484840","url":null,"abstract":"Forecast uncertainty information is not available in the immediate output of Numerical weather prediction (NWP) models. Such important information is required for optimal decision making processes in many domains. Prediction intervals are a prominent form of reporting the forecast uncertainty. In this paper, a series of learning methods are investigated to obtain prediction interval models by a statistical post-processing procedure involving the historical performance of an NWP system. The article investigates the application of a number of different quantile regression algorithms, including kernel quantile regression, to compute prediction intervals for target weather attributes. These quantile regression methods along with a recently proposed fuzzy clustering-based distribution fitting model are practically benchmarked in a set of experiments involving a three years long database of hourly NWP forecast and observation records. The role of different feature sets and parameters in the models are studied as well. The forecast skills of the obtained prediction intervals are evaluated not only by means of classical cross fold validation test experiments, but also subject to a new sampling variation process to assess the uncertainty of skill score measurements. The results show also how the different methods compare in terms of various quality aspects of prediction interval forecasts such as sharpness and reliability.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"9 1","pages":"16:1-16:9"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90976072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hardy Kremer, Stephan Günnemann, Simon Wollwage, T. Seidl
Cluster tracing algorithms are used to mine the temporal evolution of clusters. Generally, clusters represent groups of objects with similar values. In a temporal context like tracing, similar values correspond to similar behavior in one snapshot in time. Recently, tracing based on object-value similarity was introduced. In this new paradigm, the decision whether two clusters are considered similar is based on the similarity of the clusters' object values. Existing approaches of this paradigm, however, have a severe limitation: the mapping of clusters between snapshots in time is performed pairwise, i.e., global connections between a temporal snapshot's clusters are ignored; thus, impacts of other clusters that may affect the mapping are not considered, and incorrect cluster tracings may be obtained. In this vision paper, we present our ongoing work on a novel approach for cluster tracing that applies the object-value-similarity paradigm and is based on the well-known Earth Mover's Distance (EMD). The EMD enables a cluster tracing that uses global mapping: in the mapping process, all clusters of the compared snapshots are considered simultaneously. A special property of our approach is that we nest the EMD: we use it as a ground distance for itself to achieve effective value-based cluster tracing.
{"title":"Nesting the earth mover's distance for effective cluster tracing","authors":"Hardy Kremer, Stephan Günnemann, Simon Wollwage, T. Seidl","doi":"10.1145/2484838.2484881","DOIUrl":"https://doi.org/10.1145/2484838.2484881","url":null,"abstract":"Cluster tracing algorithms are used to mine temporal evolutions of clusters. Generally, clusters represent groups of objects with similar values. In a temporal context like tracing, similar values correspond to similar behavior in one snapshot in time. Recently, tracing based on object-value-similarity was introduced. In this new paradigm, the decision whether two clusters are considered similar is based on the similarity of the clusters' object values. Existing approaches of this paradigm, however, have a severe limitation. The mapping of clusters between snapshots in time is performed pairwise, i.e. global connections between a temporal snapshot's clusters are ignored; thus, impacts of other clusters that may affect the mapping are not considered and incorrect cluster tracings may be obtained.\u0000 In this vision paper, we present our ongoing work on a novel approach for cluster tracing that applies the object-value-similarity paradigm and is based on the well-known Earth Mover's Distance (EMD). The EMD enables a cluster tracing that uses global mapping: in the mapping process, all clusters of compared snapshots are considered simultaneously. A special property of our approach is that we nest the EMD: we use it as a ground distance for itself to achieve most effective value-based cluster tracing.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"49 1","pages":"34:1-34:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79124952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}