Scientific and statistical database management : International Conference, SSDBM ... : proceedings. Latest publications from the International Conference on Scientific and Statistical Database Management.
Towards Co-Evolution of Data-Centric Ecosystems
Robert Schuler, Karl Czajkowski, Mike D'Arcy, Hongsuda Tangmunarunkit, Carl Kesselman
SSDBM 2020 · DOI: 10.1145/3400903.3400908
Database evolution is a notoriously difficult task, and it is exacerbated by the necessity to evolve database-dependent applications. As science becomes increasingly dependent on sophisticated data management, the need to evolve an array of database-driven systems will only intensify. In this paper, we present an architecture for data-centric ecosystems that allows the components to seamlessly co-evolve by centralizing the models and mappings at the data service and pushing model-adaptive interactions to the database clients. Boundary objects fill the gap where applications are unable to adapt and need a stable interface to interact with the components of the ecosystem. Finally, evolution of the ecosystem is enabled via integrated schema modification and model management operations. We present use cases from actual experiences that demonstrate the utility of our approach.
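The architecture centralizes models and mappings at the data service and pushes model-adaptive behavior to the clients. As a rough illustration of that client-side idea, the sketch below (Python; the schema endpoint and response shape are hypothetical, since the paper's actual service API is not given in the abstract) derives its view of the schema from the live service on every load instead of hard-coding it:

```python
# Hypothetical sketch: a model-adaptive client that fetches the centralized
# model (schema + mappings) from the data service instead of baking it in.
# SERVICE_URL and the JSON shape are assumptions for illustration only.
import json
import urllib.request

SERVICE_URL = "https://data.example.org/service/schema"  # hypothetical endpoint

def fetch_model(url: str = SERVICE_URL) -> dict:
    """Retrieve the current model from the data service."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def columns_for(model: dict, table: str) -> list:
    """Derive the column list from the live model rather than a baked-in schema."""
    return [col["name"] for col in model["tables"][table]["columns"]]

# A form or query builder that calls columns_for() on every load picks up
# schema changes at the service without being redeployed.
```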
{"title":"Towards Co-Evolution of Data-Centric Ecosystems.","authors":"Robert Schuler, Karl Czajkowski, Mike D'Arcy, Hongsuda Tangmunarunkit, Carl Kesselman","doi":"10.1145/3400903.3400908","DOIUrl":"https://doi.org/10.1145/3400903.3400908","url":null,"abstract":"<p><p>Database evolution is a notoriously difficult task, and it is exacerbated by the necessity to evolve database-dependent applications. As science becomes increasingly dependent on sophisticated data management, the need to evolve an array of database-driven systems will only intensify. In this paper, we present an architecture for data-centric ecosystems that allows the components to seamlessly co-evolve by centralizing the models and mappings at the data service and pushing model-adaptive interactions to the database clients. Boundary objects fill the gap where applications are unable to adapt and need a stable interface to interact with the components of the ecosystem. Finally, evolution of the ecosystem is enabled via integrated schema modification and model management operations. We present use cases from actual experiences that demonstrate the utility of our approach.</p>","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"2020 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3400903.3400908","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10158370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient classification of billions of points into complex geographic regions using hierarchical triangular mesh
Dániel Kondor, L. Dobos, I. Csabai, A. Bodor, G. Vattay, T. Budavári, A. Szalay
SSDBM 2014 · DOI: 10.1145/2618243.2618245
We present a case study about the spatial indexing and regional classification of billions of geographic coordinates from geo-tagged social network data using Hierarchical Triangular Mesh (HTM) implemented for Microsoft SQL Server. Due to the lack of certain features of the HTM library, we use it in conjunction with the GIS functions of SQL Server to significantly increase the efficiency of pre-filtering of spatial filter and join queries. For example, we implemented a new algorithm to compute the HTM tessellation of complex geographic regions and precomputed the intersections of HTM triangles and geographic regions for faster false-positive filtering. With full control over the index structure, HTM-based pre-filtering of simple containment searches outperforms SQL Server spatial indices by a factor of ten and HTM-based spatial joins run about a hundred times faster.
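To illustrate the pre-filtering idea, here is a minimal sketch (Python; `htm_trixel_id` and `exact_contains` are hypothetical stand-ins for the HTM library's point-to-trixel lookup and an exact GIS containment test, not the authors' API). Points in trixels precomputed as fully inside or fully outside a region are classified immediately; only points in boundary trixels incur the expensive exact test:

```python
# Sketch of trixel-based pre-filtering. inside_trixels / partial_trixels are
# the precomputed sets of HTM triangle IDs fully inside / intersecting the
# region's boundary; htm_trixel_id and exact_contains are assumed helpers.
def classify_point(lat, lon, region, inside_trixels, partial_trixels,
                   htm_trixel_id, exact_contains):
    tid = htm_trixel_id(lat, lon)        # cheap coarse index lookup
    if tid in inside_trixels:            # trixel wholly inside: accept immediately
        return True
    if tid in partial_trixels:           # boundary trixel: run the exact GIS test
        return exact_contains(region, lat, lon)
    return False                         # trixel wholly outside: fast reject
```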
{"title":"Efficient classification of billions of points into complex geographic regions using hierarchical triangular mesh","authors":"Dániel Kondor, L. Dobos, I. Csabai, A. Bodor, G. Vattay, T. Budavári, A. Szalay","doi":"10.1145/2618243.2618245","DOIUrl":"https://doi.org/10.1145/2618243.2618245","url":null,"abstract":"We present a case study about the spatial indexing and regional classification of billions of geographic coordinates from geo-tagged social network data using Hierarchical Triangular Mesh (HTM) implemented for Microsoft SQL Server. Due to the lack of certain features of the HTM library, we use it in conjunction with the GIS functions of SQL Server to significantly increase the efficiency of pre-filtering of spatial filter and join queries. For example, we implemented a new algorithm to compute the HTM tessellation of complex geographic regions and precomputed the intersections of HTM triangles and geographic regions for faster false-positive filtering. With full control over the index structure, HTM-based pre-filtering of simple containment searches outperforms SQL Server spatial indices by a factor of ten and HTM-based spatial joins run about a hundred times faster.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"42 1","pages":"4:1-4:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77624359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SensorBench: benchmarking approaches to processing wireless sensor network data
I. Galpin, A. B. Stokes, G. Valkanas, A. Gray, N. Paton, A. Fernandes, K. Sattler, D. Gunopulos
SSDBM 2014 · DOI: 10.1145/2618243.2618252
Wireless sensor networks enable cost-effective data collection for tasks such as precision agriculture and environment monitoring. However, the resource-constrained nature of sensor nodes, which often have both limited computational capabilities and limited battery lifetimes, means that applications using them must make judicious use of these resources. Research that seeks to support data-intensive sensor applications has explored a range of approaches and developed many different techniques, including bespoke algorithms for specific analyses and generic sensor network query processors. However, all such proposals sit within a multi-dimensional design space, where it can be difficult to understand the implications of specific decisions and to identify optimal solutions. This paper presents a benchmark that seeks to support the systematic analysis and comparison of different techniques and platforms, enabling both development and user communities to make well-informed choices. The contributions of the paper include: (i) the identification of key variables and performance metrics; (ii) the specification of experiments that explore how different types of task perform under different metrics for the controlled variables; and (iii) an application of the benchmark to investigate the behavior of several representative platforms and techniques.
{"title":"SensorBench: benchmarking approaches to processing wireless sensor network data","authors":"I. Galpin, A. B. Stokes, G. Valkanas, A. Gray, N. Paton, A. Fernandes, K. Sattler, D. Gunopulos","doi":"10.1145/2618243.2618252","DOIUrl":"https://doi.org/10.1145/2618243.2618252","url":null,"abstract":"Wireless sensor networks enable cost-effective data collection for tasks such as precision agriculture and environment monitoring. However, the resource-constrained nature of sensor nodes, which often have both limited computational capabilities and battery lifetimes, means that applications that use them must make judicious use of these resources. Research that seeks to support data intensive sensor applications has explored a range of approaches and developed many different techniques, including bespoke algorithms for specific analyses and generic sensor network query processors. However, all such proposals sit within a multi-dimensional design space, where it can be difficult to understand the implications of specific decisions and to identify optimal solutions. This paper presents a benchmark that seeks to support the systematic analysis and comparison of different techniques and platforms, enabling both development and user communities to make well informed choices. The contributions of the paper include: (i) the identification of key variables and performance metrics; (ii) the specification of experiments that explore how different types of task perform under different metrics for the controlled variables; and (iii) an application of the benchmark to investigate the behavior of several representative platforms and techniques.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"31 1","pages":"21:1-21:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73367384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MR-microT: a MapReduce-based MicroRNA target prediction method
Ilias Kanellos, Thanasis Vergoulis, Dimitris Sacharidis, Theodore Dalamagas, A. Hatzigeorgiou, S. Sartzetakis, T. Sellis
SSDBM 2014 · DOI: 10.1145/2618243.2618289
MicroRNAs (miRNAs) are small RNA molecules that inhibit the expression of particular genes, a function that makes them useful in the treatment of many diseases. Computational methods that predict which genes are targeted by particular miRNA molecules are known as target prediction methods. In this paper, we present a MapReduce-based system, termed MR-microT, for one of the most popular and accurate, but computationally intensive, prediction methods. MR-microT offers a feature highly requested by life scientists: predicting the targets of ad hoc miRNA molecules in near-real time through an intuitive Web interface.
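As a toy illustration of the MapReduce decomposition only (the actual microT scoring model is not described in the abstract; `score_pair` is a hypothetical stand-in), candidate genes are scored against the query miRNA in the map phase and aggregated per gene in the reduce phase:

```python
# Toy sketch of the map/reduce split for target prediction. Each chunk of
# candidate genes would be handled by one mapper; here we iterate sequentially.
from collections import defaultdict

def map_phase(mirna, gene_chunks, score_pair):
    for chunk in gene_chunks:
        for gene in chunk:
            yield gene, score_pair(mirna, gene)   # emit (gene, score) pairs

def reduce_phase(scored, threshold):
    """Group scores per gene and keep genes whose best score clears the threshold."""
    best = defaultdict(float)
    for gene, score in scored:
        best[gene] = max(best[gene], score)
    return {g: s for g, s in best.items() if s >= threshold}

# Example with a trivial scoring stand-in (seed-sequence containment).
score = lambda mirna, gene: 1.0 if mirna[:4] in gene else 0.0
chunks = [["AAGGCUU", "CCCUAGG"], ["UAGGAAC"]]
print(reduce_phase(map_phase("UAGGCAA", chunks, score), 0.5))
```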
{"title":"MR-microT: a MapReduce-based MicroRNA target prediction method","authors":"Ilias Kanellos, Thanasis Vergoulis, Dimitris Sacharidis, Theodore Dalamagas, A. Hatzigeorgiou, S. Sartzetakis, T. Sellis","doi":"10.1145/2618243.2618289","DOIUrl":"https://doi.org/10.1145/2618243.2618289","url":null,"abstract":"MicroRNAs (miRNAs) are small RNA molecules that inhibit the expression of particular genes, a function that makes them useful towards the treatment of many diseases. Computational methods that predict which genes are targeted by particular miRNA molecules are known as target prediction methods. In this paper, we present a MapReduce-based system, termed MR-microT, for one of the most popular and accurate, but computational intensive, prediction methods. MR-microT offers the highly requested by life scientists feature of predicting the targets of ad-hoc miRNA molecules in near-real time through an intuitive Web interface.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"72 1","pages":"47:1-47:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87693453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
(k, Δ)-core anonymity: structural anonymization of massive networks
Roland Assam, Marwan Hassani, M. Brysch, T. Seidl
SSDBM 2014 · DOI: 10.1145/2618243.2618269

Networks contain vulnerable and sensitive information that poses serious privacy threats. In this paper, we introduce the k-core attack, a new attack model that stems from the k-core decomposition principle and undermines the privacy guarantees of several state-of-the-art techniques. We propose a novel structural anonymization technique called (k, Δ)-Core Anonymity, which harnesses the k-core attack and structurally anonymizes both small and large networks. Moreover, although real-world social networks are massive, most existing work focuses on anonymizing networks with fewer than one hundred thousand nodes; (k, Δ)-Core Anonymity is tailored for massive networks. To the best of our knowledge, this is the first work to provide empirical studies of structural network anonymization on massive networks. Using three real and two synthetic datasets, we demonstrate the effectiveness of our technique on small and large networks with up to 1.7 million nodes and 17.8 million edges. Our experiments reveal that our approach outperforms a state-of-the-art technique in several respects.
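For background, the k-core decomposition that the attack model builds on can be computed with a standard peeling algorithm; the sketch below illustrates that principle only and is not the authors' anonymization code:

```python
# Minimal k-core extraction by repeated peeling of nodes with degree < k.
def k_core(adj, k):
    """adj: dict mapping node -> set of neighbors. Returns nodes in the k-core."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            if len(adj[u]) < k:                   # peel nodes of degree < k
                for v in adj[u]:
                    adj[v].discard(u)
                del adj[u]
                changed = True
    return set(adj)

# Example: a triangle plus a pendant node; the 2-core is just the triangle.
g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
assert k_core(g, 2) == {1, 2, 3}
```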
{"title":"(k, d)-core anonymity: structural anonymization of massive networks","authors":"Roland Assam, Marwan Hassani, M. Brysch, T. Seidl","doi":"10.1145/2618243.2618269","DOIUrl":"https://doi.org/10.1145/2618243.2618269","url":null,"abstract":"Networks entail vulnerable and sensitive information that pose serious privacy threats. In this paper, we introduce, k-core attack, a new attack model which stems from the k-core decomposition principle. K-core attack undermines the privacy of some state-of-the-art techniques. We propose a novel structural anonymization technique called (k, Δ)-Core Anonymity, which harnesses the k-core attack and structurally anonymizes small and large networks. In addition, although real-world social networks are massive in nature, most existing works focus on the anonymization of networks with less than one hundred thousand nodes. (k, Δ)-Core Anonymity is tailored for massive networks. To the best of our knowledge, this is the first technique that provides empirical studies on structural network anonymization for massive networks. Using three real and two synthetic datasets, we demonstrate the effectiveness of our technique on small and large networks with up to 1.7 million nodes and 17.8 million edges. Our experiments reveal that our approach outperforms a state-of-the-art work in several aspects.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"6 1","pages":"17:1-17:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87360927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient processing of exploratory top-k joins
Orestis Gkorgkas, Akrivi Vlachou, C. Doulkeridis, K. Nørvåg
SSDBM 2014 · DOI: 10.1145/2618243.2618280
In this paper, we address the problem of discovering a ranked set of k distinct main objects, each combined with additional (accessory) objects, that best fit the given preferences. This problem is challenging because it considers object combinations of variable size, where objects are combined only if the combination produces a higher score and thus becomes more preferable to a user. In this way, users can explore overviews of combinations that suit their preferences better than single objects, without having to specify explicitly which objects should be combined. We model this problem as a rank-join problem in which each combination is represented by a set of tuples from different relations, and we call the respective query the eXploratory Top-k Join query. Existing approaches fall short of tackling this problem because they impose a fixed combination size, do not distinguish combinations by their main objects, or do not take user preferences into account. We introduce a more efficient bounding scheme, used within an adaptation of the rank-join algorithm, that exploits key properties of our problem and allows earlier termination of query processing. Our experimental evaluation demonstrates the efficiency of the proposed bounding technique.
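For context, the sketch below shows the generic rank-join threshold (in the spirit of the classic HRJN bound) that tighter bounding schemes like the paper's improve upon; the paper's own bound for variable-size combinations is not reproduced here, and all names are illustrative:

```python
# Schematic rank-join with threshold-based early termination over two inputs
# sorted by descending score; a join result scores the sum of its two inputs.
import heapq, itertools

def rank_join(left, right, match, k):
    """left/right: non-empty lists of (score, key), sorted by score descending.
    match(kl, kr): join predicate. Returns up to k results as (score, kl, kr)."""
    L, R, heap, out = [], [], [], []
    tie = itertools.count()                    # tie-breaker so keys never compare
    i = j = 0
    while (i < len(left) or j < len(right)) and len(out) < k:
        if i < len(left):                      # pull one tuple from the left input
            sl, kl = left[i]; i += 1
            for sr, kr in R:                   # join it against buffered right tuples
                if match(kl, kr):
                    heapq.heappush(heap, (-(sl + sr), next(tie), kl, kr))
            L.append((sl, kl))
        if j < len(right):                     # pull one tuple from the right input
            sr, kr = right[j]; j += 1
            for sl2, kl2 in L:
                if match(kl2, kr):
                    heapq.heappush(heap, (-(sl2 + sr), next(tie), kl2, kr))
            R.append((sr, kr))
        if i and j:
            # No unseen pair can score above this bound, so buffered results
            # at or above it are final and can be emitted early.
            bound = max(left[0][0] + right[j - 1][0], right[0][0] + left[i - 1][0])
            while heap and -heap[0][0] >= bound and len(out) < k:
                s, _, a, b = heapq.heappop(heap)
                out.append((-s, a, b))
    while heap and len(out) < k:               # inputs exhausted: drain what remains
        s, _, a, b = heapq.heappop(heap)
        out.append((-s, a, b))
    return out
```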
{"title":"Efficient processing of exploratory top-k joins","authors":"Orestis Gkorgkas, Akrivi Vlachou, C. Doulkeridis, K. Nørvåg","doi":"10.1145/2618243.2618280","DOIUrl":"https://doi.org/10.1145/2618243.2618280","url":null,"abstract":"In this paper, we address the problem of discovering a ranked set of k distinct main objects combined with additional (accessory) objects that best fit the given preferences. This problem is challenging because it considers object combinations of variable size, where objects are combined only if the combination produces a higher score, and thus becomes more preferable to a user. In this way, users can explore overviews of combinations that are more suited to their preferences than single objects, without the need to explicitly specify which objects should be combined. We model this problem as a rank-join problem where each combination is represented by a set of tuples from different relations and we call the respective query eXploratory Top-k Join query. Existing approaches fall short to tackle this problem because they impose a fixed size of combinations, they do not distinguish on combinations based on the main objects or they do not take into account user preferences. We introduce a more efficient bounding scheme that can be used on an adaptation of the rank-join algorithm, which exploits some key properties of our problem and allows earlier termination of query processing. Our experimental evaluation demonstrates the efficiency of the proposed bounding technique.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"30 1","pages":"35:1-35:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75193587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrating fault-tolerance and elasticity in a distributed data stream processing system
Kasper Grud Skat Madsen, Philip Thyssen, Yongluan Zhou
SSDBM 2014 · DOI: 10.1145/2618243.2618288
Recently, there has been increasing interest in building distributed platforms for processing fast data streams. In this demonstration, we highlight the need for elasticity in distributed data stream processing systems and present Enorm, a data stream processing platform with a focus on elasticity, i.e., the ability to dynamically scale resource usage according to runtime workload fluctuations. To achieve dynamic scaling with minimal overhead and latency, we take an integrated approach to fault-tolerance and elasticity: both essentially require replicating or migrating computation state among nodes. Integrating and sharing the state management operations between the two modules not only provides abundant opportunities to reduce the system's runtime overhead but also simplifies the system's architecture.
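A minimal sketch of the shared state-management idea (illustrative names only, not Enorm's API): the same checkpoint operation serves both fault-tolerance, by restoring state after a crash, and elasticity, by migrating an operator's state to a new node:

```python
# One checkpoint mechanism, two consumers: recovery and migration.
import pickle

class StatefulOperator:
    def __init__(self, state=None):
        self.state = state if state is not None else {}

    def process(self, key, value):
        self.state[key] = self.state.get(key, 0) + value  # running aggregate

    def checkpoint(self) -> bytes:
        return pickle.dumps(self.state)   # shared by BOTH modules

def recover(snapshot: bytes) -> StatefulOperator:
    return StatefulOperator(pickle.loads(snapshot))       # fault-tolerance path

def migrate(op: StatefulOperator) -> StatefulOperator:
    return recover(op.checkpoint())                       # elasticity path reuses it
```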
{"title":"Integrating fault-tolerance and elasticity in a distributed data stream processing system","authors":"Kasper Grud Skat Madsen, Philip Thyssen, Yongluan Zhou","doi":"10.1145/2618243.2618288","DOIUrl":"https://doi.org/10.1145/2618243.2618288","url":null,"abstract":"Recently there has been an increasing interest in building distributed platforms for processing of fast data streams. In this demonstration, we highlight the need for elasticity in distributed data stream processing systems and present Enorm, a data stream processing platform with focus on elasticity, i.e. the ability to dynamically scale resource usage according to the runtime workload fluctuations. In order to achieve dynamic scaling with minimal overhead and latency, we use an integrated approach for both fault-tolerance and elasticity. The idea is that both fault-tolerance and elasticity essentially require replicating or migrating computation states among different nodes. Integrating and sharing the state management operations between the two modules can not only provide abundant opportunities to reduce the system's runtime overhead but also simplify the system's architecture.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"16 1","pages":"48:1-48:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74953733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data patterns to alleviate the design of scientific workflows exemplified by a bone simulation
P. Reimann, H. Schwarz, B. Mitschang
SSDBM 2014 · DOI: 10.1145/2618243.2618279

Scientific workflows often have to process huge data sets in a multiplicity of data formats. For that purpose, they typically embed complex data provisioning tasks that transform these heterogeneous data into formats the underlying tools or services can handle. This increases the complexity of workflow design. As scientists typically design their workflows on their own, this complexity hinders them from concentrating on their core concerns, namely the experiments, analyses, or simulations they conduct. In this paper, we present the core idea of a pattern-based approach to alleviate the design of scientific workflows, targeted particularly at the needs of scientists. We exemplify and assess the pattern-based design approach by applying it to a complex scientific workflow realizing a real-world simulation of structural changes in bones.
{"title":"Data patterns to alleviate the design of scientific workflows exemplified by a bone simulation","authors":"P. Reimann, H. Schwarz, B. Mitschang","doi":"10.1145/2618243.2618279","DOIUrl":"https://doi.org/10.1145/2618243.2618279","url":null,"abstract":"Scientific workflows often have to process huge data sets in a multiplicity of data formats. For that purpose, they typically embed complex data provisioning tasks that transform these heterogeneous data into formats the underlying tools or services can handle. This results in an increased complexity of workflow design. As scientists typically design their scientific workflows on their own, this complexity hinders them to concentrate on their core issue, namely the experiments, analyses, or simulations they conduct. In this paper, we present the core idea of a pattern-based approach to alleviate the design of scientific workflows. This approach is particularly targeted at the needs of scientists. We exemplify and assess the pattern-based design approach by applying it to a complex scientific workflow realizing a real-world simulation of structure changes in bones.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"64 1","pages":"43:1-43:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83667896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Skew-resistant parallel in-memory spatial join
S. Ray, Bogdan Simion, Angela Demke Brown, Ryan Johnson
SSDBM 2014 · DOI: 10.1145/2618243.2618262
Spatial join is a crucial operation in many spatial analysis applications in scientific and geographical information systems. Due to the compute-intensive nature of spatial predicate evaluation, spatial join queries can be slow even with a moderately sized dataset. Efficient parallelization of spatial join is therefore essential to achieve acceptable performance for many spatial applications. Technological trends, including rising core counts and increasingly large main memories, hold great promise in this regard. Previous parallel spatial join approaches tried to partition the dataset so that the number of spatial objects in each partition was as equal as possible, and they focused only on the filter step. However, when the more compute-intensive refinement step is included, significant processing skew may arise due to the uneven sizes of the objects. This processing skew severely limits the achievable parallel performance of spatial join queries, as the longest-running partition determines the overall query execution time.

Our solution is SPINOJA, a skew-resistant parallel in-memory spatial join infrastructure. SPINOJA introduces MOD-Quadtree declustering, which partitions the spatial dataset such that the amount of computation demanded by each partition is equalized and the processing skew is minimized. We compare three work metrics used to create the partitions and three load-balancing strategies to assign the partitions to multiple cores. SPINOJA uses an in-memory column store for the spatial tables. Our evaluation shows that SPINOJA outperforms in-memory implementations of previous spatial join approaches by a significant margin, and a recently proposed in-memory spatial join algorithm by an order of magnitude.
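As an illustration of one plausible load-balancing strategy over work-weighted partitions (the paper compares three metrics and three strategies; this is a generic longest-processing-time-first sketch, not SPINOJA's code):

```python
# LPT assignment: place the heaviest partitions first, each on the
# currently least-loaded core, to keep per-core work roughly equal.
import heapq

def lpt_assign(partition_work, num_cores):
    """partition_work: estimated work per partition (e.g., some function of
    object count and object size). Returns the core index for each partition."""
    heap = [(0.0, core) for core in range(num_cores)]   # (load, core)
    heapq.heapify(heap)
    assignment = [None] * len(partition_work)
    for pid in sorted(range(len(partition_work)),
                      key=lambda p: partition_work[p], reverse=True):
        load, core = heapq.heappop(heap)                # least-loaded core
        assignment[pid] = core
        heapq.heappush(heap, (load + partition_work[pid], core))
    return assignment

# Example: six partitions with skewed work estimates on two cores.
print(lpt_assign([9.0, 1.0, 1.0, 4.0, 3.0, 2.0], 2))
```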
{"title":"Skew-resistant parallel in-memory spatial join","authors":"S. Ray, Bogdan Simion, Angela Demke Brown, Ryan Johnson","doi":"10.1145/2618243.2618262","DOIUrl":"https://doi.org/10.1145/2618243.2618262","url":null,"abstract":"Spatial join is a crucial operation in many spatial analysis applications in scientific and geographical information systems. Due to the compute-intensive nature of spatial predicate evaluation, spatial join queries can be slow even with a moderate sized dataset. Efficient parallelization of spatial join is therefore essential to achieve acceptable performance for many spatial applications. Technological trends, including the rising core count and increasingly large main memory, hold great promise in this regard. Previous parallel spatial join approaches tried to partition the dataset so that the number of spatial objects in each partition was as equal as possible. They also focused only on the filter step. However, when the more compute-intensive refinement step is included, significant processing skew may arise due to the uneven size of the objects. This processing skew significantly limits the achievable parallel performance of the spatial join queries, as the longest-running spatial partition determines the overall query execution time.\u0000 Our solution is SPINOJA, a skew-resistant parallel in-memory spatial join infrastructure. SPINOJA introduces MOD-Quadtree declustering, which partitions the spatial dataset such that the amount of computation demanded by each partition is equalized and the processing skew is minimized. We compare three work metrics used to create the partitions and three load-balancing strategies to assign the partitions to multiple cores. SPINOJA uses an in-memory column-store to store the spatial tables. Our evaluation shows that SPINOJA outperforms in-memory implementations of previous spatial join approaches by a significant margin and a recently proposed in-memory spatial join algorithm by an order of magnitude.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"6:1-6:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90175090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Schema matching over relations, attributes, and data values
Aibo Tian, M. Kejriwal, Daniel P. Miranker
SSDBM 2014 · DOI: 10.1145/2618243.2618248

Automatic schema matching algorithms are typically concerned only with finding attribute correspondences. However, real-world data integration problems often require matchings whose arguments span all three types of elements in relational databases: relations, attributes, and data values. This paper introduces the definitions and semantics of three additional correspondence types concerning both schema and data values. These correspondences cover the higher-order mappings identified in a seminal paper by Krishnamurthy, Litwin, and Kent. We show that these correspondences can be automatically translated to tuple-generating dependencies (tgds), making this research compatible with data integration applications that leverage tgds.

Two methods for automatically identifying these correspondences are developed. One requires a limited number of duplicates across data sources; the other is a general instance-based method with no such requirement. Experiments conducted on four real-world data sets demonstrate the effectiveness of the methods.
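As a minimal illustration of the instance-based flavor of matching (a generic value-overlap heuristic; the paper's actual methods are more involved), attribute correspondences can be scored by the Jaccard overlap of their observed value sets:

```python
# Score candidate attribute correspondences by value-set overlap.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def match_attributes(src_cols: dict, tgt_cols: dict, threshold=0.5):
    """src_cols/tgt_cols: attribute name -> set of observed data values.
    Returns candidate correspondences at or above the overlap threshold."""
    return [(s, t, round(jaccard(sv, tv), 3))
            for s, sv in src_cols.items()
            for t, tv in tgt_cols.items()
            if jaccard(sv, tv) >= threshold]

# Example: "zip" and "postal_code" share most values, so they match.
src = {"zip": {"02139", "10001", "94305"}}
tgt = {"postal_code": {"02139", "10001", "94305", "60601"}}
assert match_attributes(src, tgt) == [("zip", "postal_code", 0.75)]
```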
{"title":"Schema matching over relations, attributes, and data values","authors":"Aibo Tian, M. Kejriwal, Daniel P. Miranker","doi":"10.1145/2618243.2618248","DOIUrl":"https://doi.org/10.1145/2618243.2618248","url":null,"abstract":"Automatic schema matching algorithms are typically only concerned with finding attribute correspondences. However, real world data integration problems often require matchings whose arguments span all three types of elements in relational databases: relation, attribute and data value. This paper introduces the definitions and semantics of three additional correspondence types concerning both schema and data values. These correspondences cover the higher-order mappings identified in a seminal paper by Krishnamurthy, Litwin, and Kent. It is shown that these correspondences can be automatically translated to tuple generating dependencies (tgds), and thus this research is compatible with data integration applications that leverage tgds.\u0000 Two methods for automatically identifying these correspondences are developed. One requires a limited number of duplicates across data sources. The other is a general instance-based method with no such requirement. Experiments conducted on four real world data sets demonstrate the effectiveness of the methods.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"10 1","pages":"28:1-28:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89378139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}