Scientific and statistical database management : International Conference, SSDBM ... : proceedings. Latest publications from the International Conference on Scientific and Statistical Database Management
Scientific data have a dual structure. Raw data are preponderantly ordered multi-dimensional arrays or sequences, while metadata and derived data are best represented as unordered relations. Scientific data processing requires complex operations over both arrays and relations, and these operations cannot be expressed using only standard linear algebra and relational algebra operators. Existing scientific data processing systems are designed for a single data model and handle complex processing at the application level. EXTASCID is a complete and extensible system for scientific data processing. It supports both array and relational data natively. Complex processing is handled by a metaoperator that can execute arbitrary user code. As a result, EXTASCID can process full scientific workflows inside the system, with minimal data movement and application code. We illustrate the overall process on a real dataset and workflow from astronomy: starting from a set of sky images, the goal is to identify and classify transient astrophysical objects.
{"title":"Astronomical data processing in EXTASCID","authors":"Yu Cheng, Florin Rusu","doi":"10.1145/2484838.2484875","DOIUrl":"https://doi.org/10.1145/2484838.2484875","url":null,"abstract":"Scientific data have dual structure. Raw data are preponderantly ordered multi-dimensional arrays or sequences while metadata and derived data are best represented as unordered relations. Scientific data processing requires complex operations over arrays and relations. These operations cannot be expressed using only standard linear and relational algebra operators, respectively. Existing scientific data processing systems are designed for a single data model and handle complex processing at the application level.\u0000 EXTASCID is a complete and extensible system for scientific data processing. It supports both array and relational data natively. Complex processing is handled by a metaoperator that can execute any user code. As a result, EXTASCID can process full scientific workflows inside the system, with minimal data movement and application code. We illustrate the overall process on a real dataset and workflow from astronomy---starting with a set of sky images, the goal is to identify and classify transient astrophysical objects.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"98 1","pages":"47:1-47:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82554087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Next-generation sequencing (NGS) technology allows us to peer inside the cell in exquisite detail, revealing new insights into biology, evolution, and disease that would have been impossible to find just a few years ago. The enormous volumes of data produced by NGS experiments present many computational challenges that we are working to address. In this talk, I will discuss solutions to two basic alignment problems: (1) mapping sequences onto the human genome at very high speed, and (2) mapping and assembling transcripts from RNA-seq experiments. I will also discuss some of the problems that can arise during alignment and how these can lead to mistaken conclusions about genetic variation and gene expression. My group has developed algorithms to solve each of these problems, including the widely-used Bowtie and Bowtie2 programs for fast alignment and the TopHat and Cufflinks programs for assembly and quantification of genes in transcriptome sequencing (RNA-seq) experiments. This talk describes joint work with current and former lab members including Ben Langmead, Cole Trapnell, Daehwan Kim, and Geo Pertea; and with collaborators including Mihai Pop and Lior Pachter.
{"title":"Computational challenges in next-generation genomics","authors":"S. Salzberg","doi":"10.1145/2484838.2484885","DOIUrl":"https://doi.org/10.1145/2484838.2484885","url":null,"abstract":"Next-generation sequencing (NGS) technology allows us to peer inside the cell in exquisite detail, revealing new insights into biology, evolution, and disease that would have been impossible to find just a few years ago. The enormous volumes of data produced by NGS experiments present many computational challenges that we are working to address. In this talk, I will discuss solutions to two basic alignment problems: (1) mapping sequences onto the human genome at very high speed, and (2) mapping and assembling transcripts from RNA-seq experiments. I will also discuss some of the problems that can arise during alignment and how these can lead to mistaken conclusions about genetic variation and gene expression.\u0000 My group has developed algorithms to solve each of these problems, including the widely-used Bowtie and Bowtie2 programs for fast alignment and the TopHat and Cufflinks programs for assembly and quantification of genes in transcriptome sequencing (RNA-seq) experiments. This talk describes joint work with current and former lab members including Ben Langmead, Cole Trapnell, Daehwan Kim, and Geo Pertea; and with collaborators including Mihai Pop and Lior Pachter.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"26 1","pages":"2:1"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87927254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Histograms provide effective synopses of large data sets and are thus used in a wide variety of applications, including query optimization, approximate query answering, distribution fitting, parallel database partitioning, and data mining. Moreover, very fast approximate algorithms are needed to compute accurate histograms on fast-arriving data streams, so that online queries can be supported within the given memory and computing resources. Many real-life applications require that the data distribution in certain regions be modeled with greater accuracy, and biased histograms are designed to address this need. In this paper, we define biased histograms over data streams and over sliding windows on data streams, and propose the Bar Splitting Biased Histogram (BSBH) algorithm to construct them efficiently and accurately. We prove that BSBH generates expected ε-approximate biased histograms for data streams with stationary distributions, and our experiments show that BSBH also achieves good approximation in the presence of concept shifts, even major ones. Additionally, BSBH employs a new biased sampling technique which outperforms uniform sampling in terms of accuracy, while using about the same amount of time and memory. Therefore, BSBH outperforms previously proposed algorithms for computing biased histograms over the whole data stream, and it is the first algorithm that supports sliding windows.
{"title":"Fast computation of approximate biased histograms on sliding windows over data streams","authors":"Hamid Mousavi, C. Zaniolo","doi":"10.1145/2484838.2484851","DOIUrl":"https://doi.org/10.1145/2484838.2484851","url":null,"abstract":"Histograms provide effective synopses of large data sets, and are thus used in a wide variety of applications, including query optimization, approximate query answering, distribution fitting, parallel database partitioning, and data mining. Moreover, very fast approximate algorithms are needed to compute accurate histograms on fast-arriving data streams, whereby online queries can be supported within the given memory and computing resources. Many real-life applications require that the data distribution in certain regions must be modeled with greater accuracy, and Biased Histograms are designed to address this need. In this paper, we define biased histograms over data streams and sliding windows on data streams, and propose the Bar Splitting Biased Histogram (BSBH) algorithm to construct them efficiently and accurately. We prove that BSBH generates expected ∈-approximate biased histograms for data streams with stationary distributions, and our experiments show that BSBH also achieves good approximation in the presence of concept shifts, even major ones. Additionally, BSBH employs a new biased sampling technique which outperforms uniform sampling in terms of accuracy, while using about the same amount of time and memory. Therefore, BSBH outperforms previously proposed algorithms for computing biased histograms over the whole data stream, and it is the first algorithm that supports windows.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"144 1","pages":"13:1-13:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90120300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FastQuery is a parallel indexing and querying system we developed for accelerating analysis and visualization of scientific data. We have applied it to a wide variety of HPC applications and demonstrated its capability and scalability using a petascale trillion-particle simulation in our previous work. Yet, through our experience, we found that the performance of reading and writing data with FastQuery, as with many other HPC applications, can be significantly affected by various tunable parameters throughout the parallel I/O stack. In this paper, we describe our success in tuning the performance of FastQuery on a Lustre parallel file system. We study and analyze the impact of parameters and tunable settings at the file system, MPI-IO library, and HDF5 library levels of the I/O stack. We demonstrate that a combined optimization strategy is able to improve the performance and I/O bandwidth of FastQuery significantly. In our tests with a trillion-particle dataset, the time to index the dataset was reduced by more than half.
{"title":"Optimizing fastquery performance on lustre file system","authors":"Kuan-Wu Lin, S. Byna, J. Chou, Kesheng Wu","doi":"10.1145/2484838.2484853","DOIUrl":"https://doi.org/10.1145/2484838.2484853","url":null,"abstract":"FastQuery is a parallel indexing and querying system we developed for accelerating analysis and visualization of scientific data. We have applied it to a wide variety of HPC applications and demonstrated its capability and scalability using a petascale trillion-particle simulation in our previous work. Yet, through our experience, we found that performance of reading and writing data with FastQuery, like many other HPC applications, could be significantly affected by various tunable parameters throughout the parallel I/O stack. In this paper, we describe our success in tuning the performance of FastQuery on a Lustre parallel file system. We study and analyze the impact of parameters and tunable settings at file system, MPI-IO library, and HDF5 library levels of the I/O stack. We demonstrate that a combined optimization strategy is able to improve performance and I/O bandwidth of FastQuery significantly. In our tests with a trillion-particle dataset, the time to index the dataset reduced by more than one half.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"35 1","pages":"29:1-29:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74668473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Lorenz, Lars Dannecker, Philipp J. Rösch, Wolfgang Lehner, Gregor Hackenbroich, B. Schlegel
Forecasting is an important data analysis technique and serves as the basis for business planning in many application areas such as energy, sales, and traffic management. The currently employed statistical models already provide very accurate predictions, but the forecast calculation process is very time consuming. This is especially true since many application domains deal with hierarchically organized data. Forecasting in these environments is especially challenging because forecast consistency between hierarchy levels must be ensured, which increases the data processing and communication effort. For this purpose, we introduce a novel hierarchical forecasting approach in which we push forecast models to the entities on the lowest hierarchy level and reuse these models to efficiently create forecast models on higher hierarchy levels. With that, we avoid the time-consuming parameter estimation process and allow an almost instant calculation of forecasts.
{"title":"Forecasting in hierarchical environments","authors":"R. Lorenz, Lars Dannecker, Philipp J. Rösch, Wolfgang Lehner, Gregor Hackenbroich, B. Schlegel","doi":"10.1145/2484838.2484849","DOIUrl":"https://doi.org/10.1145/2484838.2484849","url":null,"abstract":"Forecasting is an important data analysis technique and serves as the basis for business planning in many application areas such as energy, sales and traffic management. The currently employed statistical models already provide very accurate predictions, but the forecasting calculation process is very time consuming. This is especially true since many application domains deal with hierarchically organized data. Forecasting in these environments is especially challenging due to ensuring forecasting consistency between hierarchy levels, which leads to an increased data processing and communication effort. For this purpose, we introduce our novel hierarchical forecasting approach, where we propose to push forecast models to the entities on the lowest hierarch level and reuse these models to efficiently create forecast models on higher hierarchical levels. With that we avoid the time-consuming parameter estimation process and allow an almost instant calculation of forecasts.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"58 1","pages":"37:1-37:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90811203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Crankshaw, R. Burns, B. Falck, T. Budavári, A. Szalay, Jie-Shuang Wang
We describe the challenges arising from tracking dark matter particles in state-of-the-art cosmological simulations. We are in the process of running the Indra suite of simulations, with an aggregate count of more than 35 trillion particles and 1.1 PB of total raw data volume. However, it is not enough just to store the particle positions and velocities in an efficient manner; analyses also need to be able to track individual particles efficiently through the temporal history of the simulation. The required inverted indices can easily have raw sizes comparable to the original simulation. We explore various strategies for creating an efficient index for such data, using additional insight from the physical properties of the particle motions to obtain a greatly compressed data representation. The basic particle data are stored in a relational database in coarse-grained containers corresponding to the leaves of a fixed-depth octree labeled by their Peano-Hilbert (PH) index. Within each container the individual objects are sorted by their Lagrangian identifier. Thus each particle has a multi-level address: the PH key of the container and the index of the particle within the sorted array (the slot). Given the nature of the cosmological simulations and the choice of the PH-box sizes, in consecutive snapshots particles can only cross into spatially adjacent boxes. Also, the slot number of a particle in adjacent snapshots typically shifts up or down by only a small number. As a result, a special version of delta encoding over the multi-tier address already results in a dramatic reduction of the data that needs to be stored. We then apply an efficient bit compression, adapted to the statistical properties of the two-part addresses, achieving a final compression ratio better than a factor of 9. The final size of the full inverted index is projected to be 22.5 TB for a petabyte ensemble of simulations.
{"title":"Inverted indices for particle tracking in petascale cosmological simulations","authors":"D. Crankshaw, R. Burns, B. Falck, T. Budavári, A. Szalay, Jie-Shuang Wang","doi":"10.1145/2484838.2484882","DOIUrl":"https://doi.org/10.1145/2484838.2484882","url":null,"abstract":"We describe the challenges arising from tracking dark matter particles in state of the art cosmological simulations. We are in the process of running the Indra suite of simulations, with an aggregate count of more than 35 trillion particles and 1.1PB of total raw data volume. However, it is not enough just to store the particle positions and velocities in an efficient manner -- analyses also need to be able to track individual particles efficiently through the temporal history of the simulation. The required inverted indices can easily have raw sizes comparable to the original simulation.\u0000 We explore various strategies on how to create an efficient index for such data, using additional insight from the physical properties of the particle motions for a greatly compressed data representation. The basic particle data are stored in a relational database in course-grained containers corresponding to leaves of a fixed depth oct-tree labeled by their Peano-Hilbert index. Within each container the individual objects are sorted by their Lagrangian identifier. Thus each particle has a multi-level address: the PH key of the container and the index of the particle within the sorted array (the slot).\u0000 Given the nature of the cosmological simulations and choice of the PH-box sizes, in consecutive snapshots particles can only cross into spatially adjacent boxes. Also, the slot number of a particle in adjacent snapshots is adjusted up or down by typically a small number. As a result, a special version of delta encoding over the multi-tier address already results in a dramatic reduction of data that needs to be stored. We follow next with an efficient bit-compression, adapting to the statistical properties of the two-part addresses, achieving a final compression ratio better than a factor of 9. The final size of the full inverted index is projected to be 22.5 TB for a petabyte ensemble of simulations.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"50 1","pages":"25:1-25:10"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90884092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the critical design issues in implementing handshake-join hardware is result collection, which is performed by a merging network. To address this issue, we introduce an adaptive merging network. Our implementation achieves over 3 million tuples per second when the selectivity is 0.1. The proposed implementation attains up to 5.2x higher throughput than the original handshake-join hardware. In this demonstration, we apply the proposed technique to filter out malicious packets from packet streams. To the best of our knowledge, our system is the fastest handshake-join implementation on an FPGA.
{"title":"A fast handshake join implementation on FPGA with adaptive merging network","authors":"Yasin Oge, T. Miyoshi, H. Kawashima, T. Yoshinaga","doi":"10.1145/2484838.2484868","DOIUrl":"https://doi.org/10.1145/2484838.2484868","url":null,"abstract":"One of a critical design issues for implementing handshake-join hardware is result collection performed by a merging network. To address the issue, we introduce an adaptive merging network. Our implementation achieves over 3 million tuples per second when the selectivity is 0.1. The proposed implementation attains up to 5.2x higher throughput than original handshake-join hardware. In this demonstration, we apply the proposed technique to filter out malicious packets from packet streams. To the best of our knowledge, our system is the fastest handshake join implementation on FPGA.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"15 1","pages":"44:1-44:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88037040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Halperin, F. Ribalet, Konstantin Weitz, M. Saito, Bill Howe, E. Armbrust
We consider a case study using SQL-as-a-Service to support "instant analysis" of weakly structured relational data at a multi-investigator science retreat. Here, "weakly structured" means tabular, rows-and-columns datasets that share some common context but have limited a priori agreement on file formats, relationships, types, schemas, metadata, or semantics. In this case study, the data were acquired from hundreds of distinct locations during a multi-day oceanographic cruise using a variety of physical, biological, and chemical sensors and assays. Months after the cruise, when preliminary data processing was complete, 40+ researchers from a variety of disciplines participated in a two-day "data synthesis workshop." At this workshop, two computer scientists used a web-based query-as-a-service platform called SQLShare to perform "SQL stenography": capturing the scientific discussion in real time to integrate data, test hypotheses, and populate visualizations that then informed and enhanced further discussion. In this "field test" of our technology and approach, we found not only that it was feasible to support interactive science Q&A with essentially pure SQL, but also that we significantly increased the value of the "face time" at the meeting: researchers from different fields were able to validate assumptions and resolve ambiguity about each other's fields. As a result, new science emerged from a meeting that was originally just a planning meeting. In this paper, we describe the details of this experiment, discuss our major findings, and lay out a new research agenda for collaborative science database services.
{"title":"Real-time collaborative analysis with (almost) pure SQL: a case study in biogeochemical oceanography","authors":"D. Halperin, F. Ribalet, Konstantin Weitz, M. Saito, Bill Howe, E. Armbrust","doi":"10.1145/2484838.2484880","DOIUrl":"https://doi.org/10.1145/2484838.2484880","url":null,"abstract":"We consider a case study using SQL-as-a-Service to support \"instant analysis\" of weakly structured relational data at a multi-investigator science retreat. Here, \"weakly structured\" means tabular, rows-and-columns datasets that share some common context, but that have limited a priori agreement on file formats, relationships, types, schemas, metadata, or semantics. In this case study, the data were acquired from hundreds of distinct locations during a multi-day oceanographic cruise using a variety of physical, biological, and chemical sensors and assays. Months after the cruise when preliminary data processing was complete, 40+ researchers from a variety of disciplines participated in a two-day \"data synthesis workshop.\" At this workshop, two computer scientists used a web-based query-as-a-service platform called SQLShare to perform \"SQL stenography\": capturing the scientific discussion in real time to integrate data, test hypotheses, and populate visualizations to then inform and enhance further discussion. In this \"field test\" of our technology and approach, we found that it was not only feasible to support interactive science Q&A with essentially pure SQL, but that we significantly increased the value of the \"face time\" at the meeting: researchers from different fields were able to validate assumptions and resolve ambiguity about each others' fields. As a result, new science emerged from a meeting that was originally just a planning meeting. In this paper, we describe the details of this experiment, discuss our major findings, and lay out a new research agenda for collaborative science database services.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"95 1","pages":"28:1-28:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76100039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Forecast uncertainty information is not available in the immediate output of numerical weather prediction (NWP) models, yet such information is required for optimal decision-making processes in many domains. Prediction intervals are a prominent form of reporting forecast uncertainty. In this paper, a series of learning methods is investigated to obtain prediction interval models through a statistical post-processing procedure based on the historical performance of an NWP system. The article investigates the application of a number of different quantile regression algorithms, including kernel quantile regression, to compute prediction intervals for target weather attributes. These quantile regression methods, along with a recently proposed fuzzy clustering-based distribution fitting model, are benchmarked in a set of experiments involving a three-year database of hourly NWP forecast and observation records. The roles of different feature sets and model parameters are studied as well. The forecast skill of the obtained prediction intervals is evaluated not only by means of classical cross-validation experiments, but also subject to a new sampling variation process that assesses the uncertainty of the skill score measurements. The results also show how the different methods compare in terms of various quality aspects of prediction interval forecasts, such as sharpness and reliability.
{"title":"Learning uncertainty models from weather forecast performance databases using quantile regression","authors":"A. Zarnani, P. Musílek","doi":"10.1145/2484838.2484840","DOIUrl":"https://doi.org/10.1145/2484838.2484840","url":null,"abstract":"Forecast uncertainty information is not available in the immediate output of Numerical weather prediction (NWP) models. Such important information is required for optimal decision making processes in many domains. Prediction intervals are a prominent form of reporting the forecast uncertainty. In this paper, a series of learning methods are investigated to obtain prediction interval models by a statistical post-processing procedure involving the historical performance of an NWP system. The article investigates the application of a number of different quantile regression algorithms, including kernel quantile regression, to compute prediction intervals for target weather attributes. These quantile regression methods along with a recently proposed fuzzy clustering-based distribution fitting model are practically benchmarked in a set of experiments involving a three years long database of hourly NWP forecast and observation records. The role of different feature sets and parameters in the models are studied as well. The forecast skills of the obtained prediction intervals are evaluated not only by means of classical cross fold validation test experiments, but also subject to a new sampling variation process to assess the uncertainty of skill score measurements. The results show also how the different methods compare in terms of various quality aspects of prediction interval forecasts such as sharpness and reliability.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"9 1","pages":"16:1-16:9"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90976072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hardy Kremer, Stephan Günnemann, Simon Wollwage, T. Seidl
Cluster tracing algorithms are used to mine the temporal evolution of clusters. Generally, clusters represent groups of objects with similar values. In a temporal context like tracing, similar values correspond to similar behavior in one snapshot in time. Recently, tracing based on object-value similarity was introduced. In this new paradigm, the decision whether two clusters are considered similar is based on the similarity of the clusters' object values. Existing approaches of this paradigm, however, have a severe limitation: the mapping of clusters between snapshots in time is performed pairwise, i.e., global connections between a temporal snapshot's clusters are ignored; thus, impacts of other clusters that may affect the mapping are not considered, and incorrect cluster tracings may be obtained. In this vision paper, we present our ongoing work on a novel approach for cluster tracing that applies the object-value-similarity paradigm and is based on the well-known Earth Mover's Distance (EMD). The EMD enables a cluster tracing that uses global mapping: in the mapping process, all clusters of the compared snapshots are considered simultaneously. A special property of our approach is that we nest the EMD: we use it as a ground distance for itself to achieve effective value-based cluster tracing.
{"title":"Nesting the earth mover's distance for effective cluster tracing","authors":"Hardy Kremer, Stephan Günnemann, Simon Wollwage, T. Seidl","doi":"10.1145/2484838.2484881","DOIUrl":"https://doi.org/10.1145/2484838.2484881","url":null,"abstract":"Cluster tracing algorithms are used to mine temporal evolutions of clusters. Generally, clusters represent groups of objects with similar values. In a temporal context like tracing, similar values correspond to similar behavior in one snapshot in time. Recently, tracing based on object-value-similarity was introduced. In this new paradigm, the decision whether two clusters are considered similar is based on the similarity of the clusters' object values. Existing approaches of this paradigm, however, have a severe limitation. The mapping of clusters between snapshots in time is performed pairwise, i.e. global connections between a temporal snapshot's clusters are ignored; thus, impacts of other clusters that may affect the mapping are not considered and incorrect cluster tracings may be obtained.\u0000 In this vision paper, we present our ongoing work on a novel approach for cluster tracing that applies the object-value-similarity paradigm and is based on the well-known Earth Mover's Distance (EMD). The EMD enables a cluster tracing that uses global mapping: in the mapping process, all clusters of compared snapshots are considered simultaneously. A special property of our approach is that we nest the EMD: we use it as a ground distance for itself to achieve most effective value-based cluster tracing.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"49 1","pages":"34:1-34:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79124952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}