Scientific and Statistical Database Management: International Conference, SSDBM: Proceedings. Latest Publications
"Multi-query scheduling for time-critical data stream applications," Yongluan Zhou, Ji Wu, A. K. Leghari

Many data stream applications, such as network intrusion detection, online financial tickers, and environmental monitoring, exhibit "real-time" traits: users need strategies that ensure on-time delivery of query results. In this paper, we point out that traditional operator-based query scheduling strategies are insufficient for this class of problems. We therefore approach the issue from a new angle, modeling multi-query scheduling as a job-scheduling problem, a classical problem in real-time computing. Drawing on established results from the real-time computing community, we propose several new scheduling strategies and algorithms that improve overall data stream query scheduling performance. Through extensive experiments on both real and synthetic data, we identify the factors that matter most for scheduling performance and verify the effectiveness of our approaches.
{"title":"Multi-query scheduling for time-critical data stream applications","authors":"Yongluan Zhou, Ji Wu, A. K. Leghari","doi":"10.1145/2484838.2484864","DOIUrl":"https://doi.org/10.1145/2484838.2484864","url":null,"abstract":"Many data stream applications, such as network intrusion detection, on-line financial tickers and environmental monitoring, typically exhibit certain \"real-time\" traits. In such applications, people are interested in strategies that ensure on-time delivery of query results. In this paper, we point out that traditional operator-based query scheduling strategies are insufficient to handle this class of problem. Therefore we choose to approach the issue from a new angle by modeling multi-query scheduling as a job-scheduling problem, a classical problem in real-time computing. By taking advantage of the wisdom in the real-time computing community, we propose several new scheduling strategies and algorithms to enhance the overall data stream query scheduling performance. Through extensive experiments over both real and synthetic data, we identify the important factors for scheduling performance and verify the effectiveness of our approaches.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"21 1","pages":"15:1-15:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74173555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Towards a universal tracking database," Gereon Schüller, Andreas Behrend

In moving object databases, authors usually assume that the number and positions of the objects to be processed are known in advance. Detecting an unknown moving object and following its movement, however, is usually left to tracking algorithms that reside outside the database. Trackers are complex software systems that process sensor data and application-specific context information in order to detect, classify, monitor, and predict the course of moving objects. As there are no universal software tools for realizing a tracker, such systems are usually hand-coded from scratch for each tracking application. In this paper we present a framework for implementing universal trackers inside a database. As a use case, we consider the well-known probabilistic multiple hypothesis tracking (PMHT) approach and the interacting multiple model (IMM) filter for realizing typical tracking tasks. We show that incremental view maintenance techniques and Bregman ball trees are well suited to efficiently implementing state-of-the-art trackers for processing streams of radar data.
{"title":"Towards a universal tracking database","authors":"Gereon Schüller, Andreas Behrend","doi":"10.1145/2484838.2484845","DOIUrl":"https://doi.org/10.1145/2484838.2484845","url":null,"abstract":"In moving object databases, authors usually assume that number and position of objects to be processed are always known in advance. Detecting an unknown moving object and pursuing its movement, however, is usually left to tracking algorithms resting outside the database. Trackers are complex software systems which process sensor data and application-specific context information in order to detect, classify, monitor and predict the course of moving objects. As there are no universal software tools for realizing a tracker, such systems are usually hand-coded from scratch for each tracking application. In this paper we present a way how to implement a framework for implementing universal trackers inside a database. As a use case, we consider the well-known probabilistic multiple hypothesis tracking approach (PMHT) and the interacting multiple model filter (IMM) for realizing typical tracking tasks. We show that incremental view maintenance techniques and Bregman Ball trees are well-suited for efficiently implementing state-of-the-art trackers for processing streams of radar data.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"30 1","pages":"10:1-10:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85450877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Parameter-free and domain-independent similarity search with diversity," Lúcio F. D. Santos, Willian D. Oliveira, Mônica Ribeiro Porto Ferreira, A. Traina, C. Traina
New operators to execute similarity-based queries over multimedia data stored in Database Management Systems are increasingly in demand. However, when searching very large datasets, the basic operators often return elements that are too similar both to the query center and to one another, reducing the answer's utility. In this paper, we tackle the problem of adding diversity to similarity query results and define techniques to ensure that each element in the result set is sufficiently different from the others. Existing techniques compel the user to define either a parameter that trades off similarity against diversity or a minimum similarity between result elements. In contrast, our approach diversifies similarity queries using the influence concept, which automatically estimates the inherent diversity among the result set elements and requires no user-defined parameters. Furthermore, our technique can be applied to any data represented in a metric space, so it is both parameter-free and application-domain independent. The "Better Results with Influence Diversification" (BRID) technique is the basis of the k-Diverse Nearest Neighbor (BRIDk) and Range Diverse (BRIDr) algorithms, which execute k-nearest-neighbor and range queries with diversification, showing that the technique can diversify any type of similarity query. We also define a way to measure the degree of diversification in a result set. Through a detailed experimental evaluation, we show that BRID outperforms existing methods in both diversification quality and execution time, being at least two orders of magnitude faster than the best existing approaches.
{"title":"Parameter-free and domain-independent similarity search with diversity","authors":"Lúcio F. D. Santos, Willian D. Oliveira, Mônica Ribeiro Porto Ferreira, A. Traina, C. Traina","doi":"10.1145/2484838.2484854","DOIUrl":"https://doi.org/10.1145/2484838.2484854","url":null,"abstract":"New operators to execute similarity-based queries over multimedia data stored in Database Management Systems are increasingly demanded. However, searching in very large datasets, the basic operators often return elements too much similar both to the query center and to themselves, reducing the answer's utility. In this paper, we tackle the problem of providing diversity to similarity query results, and define techniques to assure that each element in the result set is different enough from the others. Existing techniques compel the user to define either a parameter to trade among similarity and diversity or a minimum similarity between result elements. Distinctly, our approach provides similarity queries with diversification using the influence concept, which automatically estimates the inherent diversity between the result set elements requiring no user-defined parameters. Furthermore, our technique can be applied over any data represented in a metric space, so it is both parameter and application-domain independent. The \"Better Results with Influence Diversification\" (BRID) technique is the basis to the k-Diverse Nearest Neighbor (BRIDk) and to the Range Diverse (BRIDr) algorithms, which execute k-nearest neighbor and range queries with diversification, showing that the technique can be applied to diversify any type of similarity queries. We also define a way to measure the diversification degree in a result set. Through a detailed experimental evaluation using our approach, we show that BRID outperforms the existing methods regarding both query diversification quality and execution times, being at least two orders of magnitude faster than the best existing approaches.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"44 1","pages":"5:1-5:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83241937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Research lattices: towards a scientific hypothesis data model," Bernardo Gonçalves, F. Porto

As problems of scientific interest grow in scale and complexity, scientists must tacitly manage ever more analytic elements. Hypotheses are worked out to drive research towards successful explanation and prediction, which characterizes science as a dynamic activity that is partially ordered towards progress. This paper motivates and introduces research lattices, a lattice-theoretic approach to hypothesis representation and management in large-scale science and engineering. The goal of this work is to equip scientists with tools to manipulate and query hypotheses while keeping track of research progress. We refer to SciDB's array data model and discuss how data and theories could be managed in a unified model management framework.
{"title":"Research lattices: towards a scientific hypothesis data model","authors":"Bernardo Gonçalves, F. Porto","doi":"10.1145/2484838.2484861","DOIUrl":"https://doi.org/10.1145/2484838.2484861","url":null,"abstract":"As the problems of scientific interest raise in scale and complexity, scientists have to tacitly manage too many analytic elements. Hypotheses are worked out to drive research towards successful explanation and prediction, which characterizes science as a dynamic activity that is partially ordered towards progress. This paper motivates and introduces research lattices, carrying out a lattice-theoretic approach for hypothesis representation and management in large-scale science and engineering. The goal of this work is to equip scientists with tools to manipulate and query hypotheses while keeping track of research progress. We refer to SciDB's array data model and discuss how data and theories could be managed in a unified model management framework.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"50 1","pages":"41:1-41:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83773759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Publishing trajectories with differential privacy guarantees," Kaifeng Jiang, Dongxu Shao, S. Bressan, Thomas Kister, K. Tan
The pervasiveness of location-acquisition technologies has made it possible to collect the movement data of individuals and vehicles. Such data, however, must be carefully managed to ensure that there is no privacy breach. In this paper, we investigate the problem of publishing trajectory data under the differential privacy model. A straightforward solution is to add noise to a trajectory, either to each coordinate of a position, to each position of the trajectory, or to the trajectory as a whole. Such naive approaches, however, produce trajectories with zigzag shapes and many crossings, making the published trajectories of little practical use. We introduce a mechanism called SDD (Sampling Distance and Direction), which is ε-differentially private. At each position, SDD samples a suitable direction and distance to publish the next possible position. Numerical experiments on real ship trajectories demonstrate that the proposed mechanism delivers ship trajectories of good practical utility.
{"title":"Publishing trajectories with differential privacy guarantees","authors":"Kaifeng Jiang, Dongxu Shao, S. Bressan, Thomas Kister, K. Tan","doi":"10.1145/2484838.2484846","DOIUrl":"https://doi.org/10.1145/2484838.2484846","url":null,"abstract":"The pervasiveness of location-acquisition technologies has made it possible to collect the movement data of individuals or vehicles. However, it has to be carefully managed to ensure that there is no privacy breach. In this paper, we investigate the problem of publishing trajectory data under the differential privacy model. A straightforward solution is to add noise to a trajectory - this can be done either by adding noise to each coordinate of the position, to each position of the trajectory, or to the whole trajectory. However, such naive approaches result in trajectories with zigzag shapes and many crossings, making the published trajectories of little practical use. We introduce a mechanism called SDD (Sampling Distance and Direction), which is ε-differentially private. SDD samples a suitable direction and distance at each position to publish the next possible position. Numerical experiments conducted on real ship trajectories demonstrate that our proposed mechanism can deliver ship trajectories that are of good practical utility.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"127 1","pages":"12:1-12:12"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73929761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Adaptive exploration for large-scale protein analysis in the molecular dynamics database," Sarana Nutanong, N. Carey, Yanif Ahmad, A. Szalay, T. Woolf
Molecular dynamics (MD) simulations generate detailed time-series data of all-atom motions. These simulations are leading users of the world's most powerful supercomputers and standard-bearers for a wide range of high-performance computing (HPC) methods. However, MD data exploration and analysis is in its infancy in terms of scalability, ease of use, and ultimately its ability to answer "grand challenge" science questions. This demonstration introduces the Molecular Dynamics Database (MDDB) project at Johns Hopkins, which studies the co-design of database methods for deep on-the-fly exploratory MD analyses with HPC simulations. Data exploration in MD suffers from a "human bottleneck": the laborious administration of simulations leaves domain experts little room to focus on the science questions themselves. MDDB exploits the data-rich nature of MD simulations to adaptively control the exploration process with machine learning techniques, specifically reinforcement learning (RL). We present MDDB's data and queries, its architecture, and its use of RL methods. Our audience will cooperate with our steering algorithm and science partners, and witness MDDB's ability to significantly reduce exploration times and direct computational resources to where they best address science questions.
{"title":"Adaptive exploration for large-scale protein analysis in the molecular dynamics database","authors":"Sarana Nutanong, N. Carey, Yanif Ahmad, A. Szalay, T. Woolf","doi":"10.1145/2484838.2484872","DOIUrl":"https://doi.org/10.1145/2484838.2484872","url":null,"abstract":"Molecular dynamics (MD) simulations generate detailed time-series data of all-atom motions. These simulations are leading users of the world's most powerful supercomputers, and are standard-bearers for a wide range of high-performance computing (HPC) methods. However, MD data exploration and analysis is in its infancy in terms of scalability, ease-of-use, and ultimately its ability to answer 'grand challenge' science questions. This demonstration introduces the Molecular Dynamics Database (MDDB) project at Johns Hopkins, to study the co-design of database methods for deep on-the-fly exploratory MD analyses with HPC simulations. Data exploration in MD suffers from a \"human bottleneck\", where the laborious administration of simulations leaves little room for domain experts to focus on tackling science questions. MDDB exploits the data-rich nature of MD simulations to provide adaptive control of the exploration process with machine learning techniques, specifically reinforcement learning (RL). We present MDDB's data and queries, architecture, and its use of RL methods. Our audience will co-operate with our steering algorithm and science partners, and witness MDDB's abilities to significantly reduce exploration times and direct computation resources to where they best address science questions.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"3 1","pages":"45:1-45:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75666587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Education and career paths for data scientists," M. Balazinska, S. Davidson, Bill Howe, Alexandros Labrinidis
MOTIVATION: As industry and science become increasingly data-driven, the need for skilled data scientists is outpacing what our universities produce. According to a McKinsey report, "By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills". Similarly, the ability to extract knowledge from scientific data is accelerating discovery, and we need the next generation of domain scientists to be experts not only in their domain but also in data management. At the same time, researchers in academia who focus on building instruments or data management tools are often less recognized for their contributions than researchers focusing purely on the science itself.

OVERVIEW: The goal of this panel is to discuss these challenges. We will discuss how we should educate both the emerging "data science" experts and the next generation of database and domain science experts. The panel will also discuss career paths for researchers who choose to specialize in developing new methods and tools for Big Data management in the domain sciences, with recommendations for how to better support these less traditional career paths.
{"title":"Education and career paths for data scientists","authors":"M. Balazinska, S. Davidson, Bill Howe, Alexandros Labrinidis","doi":"10.1145/2484838.2484886","DOIUrl":"https://doi.org/10.1145/2484838.2484886","url":null,"abstract":"MOTIVATION: As industry and science are increasingly data-driven, the need for skilled data scientists is exceeding what our universities are producing. According to a Mckinsey report: \"By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills\". Similarly, the ability to extract knowledge from scientific data is accelerating discovery and we need the next generation of domain scientists to be experts not only in their domain but also in data management. At the same time, however, researchers in academia who focus on building instruments or data management tools are often less recognized for their contributions than researchers focusing purely on the actual science.\u0000 OVERVIEW: The goal of this panel will be to discuss all these challenges. We will discuss various aspects of how we should be educating both the emerging \"data science\" experts and the next generation of database and domain science experts. The panel will also discuss career paths for researchers who choose to specialize in developing new methods and tools for Big Data management in domain sciences, with recommendations for how we should better support these less traditional career paths.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"82 1","pages":"3:1-3:2"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89023229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Parallel online aggregation in action," Chengjie Qin, Florin Rusu

Online aggregation provides continuous estimates of the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution, or can let the processing run to completion and obtain the exact result. In this demonstration, we introduce a general framework for parallel online aggregation in which estimation incurs no overhead on top of the actual processing. We define a generic interface that can express any estimation model while completely abstracting the execution details. We design multiple sampling-based estimators suited for parallel online aggregation and implement them inside the framework. Demonstration participants see how estimates for general SQL aggregation queries over terabytes of TPC-H data are generated throughout processing. Thanks to parallel execution, the estimate converges to the correct result in a matter of seconds even for the most difficult queries. The behavior of the estimators is evaluated under different operating regimes of the distributed cluster used in the demonstration.
{"title":"Parallel online aggregation in action","authors":"Chengjie Qin, Florin Rusu","doi":"10.1145/2484838.2484874","DOIUrl":"https://doi.org/10.1145/2484838.2484874","url":null,"abstract":"Online aggregation provides continuous estimates to the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution, or can let the processing terminate and obtain the exact result. In this demonstration, we introduce a general framework for parallel online aggregation in which estimation does not incur overhead on top of the actual processing. We define a generic interface to express any estimation model that abstracts completely the execution details. We design multiple sampling-based estimators suited for parallel online aggregation and implement them inside the framework. Demonstration participants are shown how estimates to general SQL aggregation queries over terabytes of TPC-H data are generated during the entire processing. Due to parallel execution, the estimate converges to the correct result in a matter of seconds even for the most difficult queries. The behavior of the estimators is evaluated under different operating regimes of the distributed cluster used in the demonstration.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"27 1","pages":"46:1-46:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84835885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Bulk sorted access for efficient top-k retrieval," Dustin Lange, Felix Naumann

Efficient top-k retrieval of records from a database has been an active research field for many years. We approach the problem from a real-world application point of view, in which the order of records according to some similarity function on an attribute is not unique: many records share the same values in several attributes, so their ranking within those attributes is arbitrary. For instance, in large person databases many individuals have the same first name, the same date of birth, or live in the same city. Existing algorithms, such as the Threshold Algorithm (TA), are ill-equipped to handle such cases efficiently.

We introduce a variation of TA, the Bulk Sorted Access Algorithm (BSA), which retrieves larger chunks of records from the sorted lists using fixed thresholds and focuses its efforts on records that are ranked high in more than one ordering and are thus more promising candidates. We show experimentally that our method outperforms TA and another previous method for top-k retrieval in these very common cases.
{"title":"Bulk sorted access for efficient top-k retrieval","authors":"Dustin Lange, Felix Naumann","doi":"10.1145/2484838.2484852","DOIUrl":"https://doi.org/10.1145/2484838.2484852","url":null,"abstract":"Efficient top-k retrieval of records from a database has been an active research field for many years. We approach the problem from a real-world application point of view, in which the order of records according to some similarity function on an attribute is not unique: Many records have same values in several attributes and thus their ranking in those attributes is arbitrary. For instance, in large person databases many individuals have the same first name, the same date of birth, or live in the same city. Existing algorithms, such as the Threshold Algorithm (TA), are ill-equipped to handle such cases efficiently.\u0000 We introduce a variation of TA, the Bulk Sorted Access Algorithm (BSA), which retrieves larger chunks of records from the sorted lists using fixed thresholds, and which focusses its efforts on records that are ranked high in more than one ordering and are thus more promising candidates. We experimentally show that our method outperforms TA and another previous method for top-k retrieval in those very common cases.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"199 1","pages":"39:1-39:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73557802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Sharing confidential data for algorithm development by multiple imputation," S. Verwer, S. V. D. Braak, Sunil Choenni

The availability of real-life data sets is of crucial importance for algorithm and application development, as these often require insight into the specific properties of the data. Often, however, such data are not released because of their proprietary and confidential nature. We propose to solve this problem using the statistical technique of multiple imputation, which serves as a powerful method for generating realistic synthetic data sets. Additionally, we show how the generated records can be combined into networked data using clustering techniques.
{"title":"Sharing confidential data for algorithm development by multiple imputation","authors":"S. Verwer, S. V. D. Braak, Sunil Choenni","doi":"10.1145/2484838.2484865","DOIUrl":"https://doi.org/10.1145/2484838.2484865","url":null,"abstract":"The availability of real-life data sets is of crucial importance for algorithm and application development, as these often require insight into the specific properties of the data. Often, however, such data are not released because of their proprietary and confidential nature. We propose to solve this problem using the statistical technique of multiple imputation, which is used as a powerful method for generating realistic synthetic data sets. Additionally, it is shown how the generated records can be combined into networked data using clustering techniques.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"47 1","pages":"42:1-42:4"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85191511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}