MEVDT: Multi-Modal Event-Based Vehicle Detection and Tracking Dataset
Zaid A. El Shair, Samir A. Rawashdeh
arXiv:2407.20446 (2024-07-29)

In this data article, we introduce the Multi-Modal Event-based Vehicle Detection and Tracking (MEVDT) dataset. This dataset provides a synchronized stream of event data and grayscale images of traffic scenes, captured using the Dynamic and Active-Pixel Vision Sensor (DAVIS) 240c hybrid event-based camera. MEVDT comprises 63 multi-modal sequences with approximately 13k images, 5M events, 10k object labels, and 85 unique object tracking trajectories. Additionally, MEVDT includes manually annotated ground truth labels, consisting of object classifications, pixel-precise bounding boxes, and unique object IDs, provided at a labeling frequency of 24 Hz. Designed to advance research in event-based vision, MEVDT addresses the critical need for high-quality, real-world annotated datasets that enable the development and evaluation of object detection and tracking algorithms in automotive environments.
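
The abstract describes synchronized event and frame data with 24 Hz labels. The sketch below shows one plausible way to slice an event stream around a label timestamp; the file names, column layout, and label schema are assumptions made for illustration, not the dataset's documented format.

```python
# Minimal sketch, assuming events are stored as rows of (timestamp_us, x, y, polarity)
# and labels as rows of (timestamp_us, x, y, w, h, class_id, track_id).
# These paths and layouts are hypothetical; consult the MEVDT documentation.
import numpy as np

events = np.loadtxt("sequence_01/events.txt")            # assumed path and format
labels = np.loadtxt("sequence_01/labels.txt", ndmin=2)   # assumed path and format

def events_in_window(ev, t_start, t_end):
    """Return all events whose timestamps fall in [t_start, t_end)."""
    mask = (ev[:, 0] >= t_start) & (ev[:, 0] < t_end)
    return ev[mask]

# Example: gather the events accompanying one 24 Hz label interval (about 41.7 ms).
t0 = labels[0, 0]
window = events_in_window(events, t0, t0 + 41_667)
print(window.shape)
```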
{"title":"MEVDT: Multi-Modal Event-Based Vehicle Detection and Tracking Dataset","authors":"Zaid A. El Shair, Samir A. Rawashdeh","doi":"arxiv-2407.20446","DOIUrl":"https://doi.org/arxiv-2407.20446","url":null,"abstract":"In this data article, we introduce the Multi-Modal Event-based Vehicle\u0000Detection and Tracking (MEVDT) dataset. This dataset provides a synchronized\u0000stream of event data and grayscale images of traffic scenes, captured using the\u0000Dynamic and Active-Pixel Vision Sensor (DAVIS) 240c hybrid event-based camera.\u0000MEVDT comprises 63 multi-modal sequences with approximately 13k images, 5M\u0000events, 10k object labels, and 85 unique object tracking trajectories.\u0000Additionally, MEVDT includes manually annotated ground truth labels\u0000$unicode{x2014}$ consisting of object classifications, pixel-precise bounding\u0000boxes, and unique object IDs $unicode{x2014}$ which are provided at a labeling\u0000frequency of 24 Hz. Designed to advance the research in the domain of\u0000event-based vision, MEVDT aims to address the critical need for high-quality,\u0000real-world annotated datasets that enable the development and evaluation of\u0000object detection and tracking algorithms in automotive environments.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141868197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Shapley Value Computation in Ontology-Mediated Query Answering
Meghyn Bienvenu, Diego Figueira, Pierre Lafourcade
arXiv:2407.20058 (2024-07-29)
The Shapley value, originally introduced in cooperative game theory for wealth distribution, has found use in KR and databases for the purpose of assigning scores to formulas and database tuples based upon their contribution to obtaining a query result or inconsistency. In the present paper, we explore the use of Shapley values in ontology-mediated query answering (OMQA) and present a detailed complexity analysis of Shapley value computation (SVC) in the OMQA setting. In particular, we establish a PF/#P-hard dichotomy for SVC for ontology-mediated queries (T,q) composed of an ontology T formulated in the description logic ELHI_bot and a connected constant-free homomorphism-closed query q. We further show that the #P-hardness side of the dichotomy can be strengthened to cover possibly disconnected queries with constants. Our results exploit recently discovered connections between SVC and probabilistic query evaluation and allow us to generalize existing results on probabilistic OMQA.
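
As a concrete anchor for the quantity being studied, the following sketch computes exact Shapley values of database facts with respect to a toy Boolean query by enumerating coalitions. It illustrates the definition only; it is not the paper's algorithm, and the facts and query are invented.

```python
# Exact Shapley values by coalition enumeration (exponential; illustration only).
from itertools import combinations
from math import factorial

def shapley(players, value):
    n = len(players)
    scores = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                scores[p] += weight * (value(set(coalition) | {p}) - value(set(coalition)))
    return scores

# Toy database and Boolean query: does some x satisfy edge(a, x) and good(x)?
facts = [("edge", "a", "b"), ("edge", "a", "c"), ("good", "b")]
def query_holds(subset):
    return 1.0 if any(("edge", "a", x) in subset and ("good", x) in subset
                      for x in ("a", "b", "c")) else 0.0

print(shapley(facts, query_holds))  # edge(a,b) and good(b) split the credit; edge(a,c) gets 0
```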
{"title":"Shapley Value Computation in Ontology-Mediated Query Answering","authors":"Meghyn Bienvenu, Diego Figueira, Pierre Lafourcade","doi":"arxiv-2407.20058","DOIUrl":"https://doi.org/arxiv-2407.20058","url":null,"abstract":"The Shapley value, originally introduced in cooperative game theory for\u0000wealth distribution, has found use in KR and databases for the purpose of\u0000assigning scores to formulas and database tuples based upon their contribution\u0000to obtaining a query result or inconsistency. In the present paper, we explore\u0000the use of Shapley values in ontology-mediated query answering (OMQA) and\u0000present a detailed complexity analysis of Shapley value computation (SVC) in\u0000the OMQA setting. In particular, we establish a PF/#P-hard dichotomy for SVC\u0000for ontology-mediated queries (T,q) composed of an ontology T formulated in the\u0000description logic ELHI_bot and a connected constant-free homomorphism-closed\u0000query q. We further show that the #P-hardness side of the dichotomy can be\u0000strengthened to cover possibly disconnected queries with constants. Our results\u0000exploit recently discovered connections between SVC and probabilistic query\u0000evaluation and allow us to generalize existing results on probabilistic OMQA.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141868225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload
Limin Ma, Ken Pu, Ying Zhu
arXiv:2407.19517 (2024-07-28)

This study presents a comparative analysis of a complex SQL benchmark, TPC-DS, and two existing text-to-SQL benchmarks, BIRD and Spider. Our findings reveal that TPC-DS queries exhibit a significantly higher level of structural complexity than the other two benchmarks. This underscores the need for more intricate benchmarks to simulate realistic scenarios effectively. To facilitate this comparison, we devised several measures of structural complexity and applied them across all three benchmarks. The results of this study can guide future research in the development of more sophisticated text-to-SQL benchmarks.

We utilized 11 distinct large language models (LLMs) to generate SQL queries based on the query descriptions provided by the TPC-DS benchmark. The prompt engineering process incorporated both the query description as outlined in the TPC-DS specification and the database schema of TPC-DS. Our findings indicate that current state-of-the-art generative AI models fall short in generating accurate decision-making queries. We compared the generated queries with the TPC-DS gold standard queries using a series of fuzzy structure matching techniques based on query features. The results demonstrate that the accuracy of the generated queries is insufficient for practical real-world application.
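
To make the idea of feature-based fuzzy comparison concrete, here is a rough sketch that counts a few structural keywords in two SQL strings and scores their overlap. The feature set and similarity measure are illustrative guesses, not the measures defined in the study.

```python
# Keyword-count features and a Jaccard-like similarity over them (illustration only).
import re

FEATURES = ("join", "group by", "order by", "union", "case when", "exists", "select")

def sql_features(sql):
    """Count occurrences of a few structural keywords, case-insensitively."""
    text = re.sub(r"\s+", " ", sql.lower())
    return {f: text.count(f) for f in FEATURES}

def feature_similarity(a, b):
    """Overlap of keyword counts, in [0, 1]."""
    fa, fb = sql_features(a), sql_features(b)
    num = sum(min(fa[k], fb[k]) for k in FEATURES)
    den = sum(max(fa[k], fb[k]) for k in FEATURES) or 1
    return num / den

gold = "SELECT c, SUM(x) FROM t JOIN u ON t.id = u.id GROUP BY c ORDER BY c"
generated = "SELECT c, SUM(x) FROM t, u WHERE t.id = u.id GROUP BY c"
print(round(feature_similarity(gold, generated), 3))
```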
{"title":"Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload","authors":"Limin Ma, Ken Pu, Ying Zhu","doi":"arxiv-2407.19517","DOIUrl":"https://doi.org/arxiv-2407.19517","url":null,"abstract":"This study presents a comparative analysis of the a complex SQL benchmark,\u0000TPC-DS, with two existing text-to-SQL benchmarks, BIRD and Spider. Our findings\u0000reveal that TPC-DS queries exhibit a significantly higher level of structural\u0000complexity compared to the other two benchmarks. This underscores the need for\u0000more intricate benchmarks to simulate realistic scenarios effectively. To\u0000facilitate this comparison, we devised several measures of structural\u0000complexity and applied them across all three benchmarks. The results of this\u0000study can guide future research in the development of more sophisticated\u0000text-to-SQL benchmarks. We utilized 11 distinct Language Models (LLMs) to generate SQL queries based\u0000on the query descriptions provided by the TPC-DS benchmark. The prompt\u0000engineering process incorporated both the query description as outlined in the\u0000TPC-DS specification and the database schema of TPC-DS. Our findings indicate\u0000that the current state-of-the-art generative AI models fall short in generating\u0000accurate decision-making queries. We conducted a comparison of the generated\u0000queries with the TPC-DS gold standard queries using a series of fuzzy structure\u0000matching techniques based on query features. The results demonstrated that the\u0000accuracy of the generated queries is insufficient for practical real-world\u0000application.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141868200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Turning Multidimensional Big Data Analytics into Practice: Design and Implementation of ClustCube Big-Data Tools in Real-Life Scenarios
Alfredo Cuzzocrea, Abderraouf Hafsaoui, Ismail Benlaredj
arXiv:2407.18604 (2024-07-26)

Multidimensional Big Data Analytics is an emerging area that marries the capabilities of OLAP with modern Big Data Analytics. Essentially, the idea is to engraft multidimensional models into Big Data analytics processes in order to increase the expressive power of the overall discovery task. ClustCube is a state-of-the-art model that combines OLAP and clustering, offering practical and well-understood advantages in the context of real-life applications and systems. In this paper, we show how ClustCube can effectively and efficiently realize tools for supporting Multidimensional Big Data Analytics, and we assess these tools in the context of real-life research projects.
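
The following is a minimal sketch of the OLAP-plus-clustering idea as I read it: group a fact table by its dimension attributes and cluster the measures within each resulting cube cell. The column names, the choice of k-means, and the toy data are assumptions made for illustration, not ClustCube's actual design.

```python
# Cluster measures inside each OLAP cube cell (illustrative sketch).
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

# Hypothetical fact table rows: (region, year, measure1, measure2).
rows = [("EU", 2023, 1.0, 2.0), ("EU", 2023, 1.1, 2.1), ("EU", 2023, 9.0, 9.5),
        ("US", 2024, 0.5, 0.4), ("US", 2024, 0.6, 0.5), ("US", 2024, 5.0, 5.2)]

cells = defaultdict(list)
for region, year, m1, m2 in rows:
    cells[(region, year)].append((m1, m2))   # one cell per (region, year) combination

for cell, measures in cells.items():
    X = np.array(measures)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(cell, labels.tolist())
```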
{"title":"Turning Multidimensional Big Data Analytics into Practice: Design and Implementation of ClustCube Big-Data Tools in Real-Life Scenarios","authors":"Alfredo Cuzzocrea, Abderraouf Hafsaoui, Ismail Benlaredj","doi":"arxiv-2407.18604","DOIUrl":"https://doi.org/arxiv-2407.18604","url":null,"abstract":"Multidimensional Big Data Analytics is an emerging area that marries the\u0000capabilities of OLAP with modern Big Data Analytics. Essentially, the idea is\u0000engrafting multidimensional models into Big Data analytics processes to gain\u0000into expressive power of the overall discovery task. ClustCube is a\u0000state-of-the-art model that combines OLAP and Clustering, thus delving into\u0000practical and well-understood advantages in the context of real-life\u0000applications and systems. In this paper, we show how ClustCube can effectively\u0000and efficiently realizing nice tools for supporting Multidimensional Big Data\u0000Analytics, and assess these tools in the context of real-life research\u0000projects.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141868223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Towards A More Reasonable Semantic Web
Vleer Doing, Ryan Wisnesky
arXiv:2407.19095 (2024-07-26)

We aim to accelerate the original vision of the semantic web by revisiting design decisions that have defined the semantic web up until now. We propose a shift in direction that more broadly embraces existing data infrastructure by reconsidering the semantic web's logical foundations. We argue for shifting attention away from description logic, which has so far underpinned the semantic web, to a different fragment of first-order logic. We argue, using examples from the (geo)spatial domain, that by doing so, the semantic web can be approached as a traditional data migration and integration problem at a massive scale. That way, a large body of existing tools and theories can be deployed to the semantic web's benefit, and the original vision of ontology as shared abstraction can be reinvigorated.
{"title":"Towards A More Reasonable Semantic Web","authors":"Vleer Doing, Ryan Wisnesky","doi":"arxiv-2407.19095","DOIUrl":"https://doi.org/arxiv-2407.19095","url":null,"abstract":"We aim to accelerate the original vision of the semantic web by revisiting\u0000design decisions that have defined the semantic web up until now. We propose a\u0000shift in direction that more broadly embraces existing data infrastructure by\u0000reconsidering the semantic web's logical foundations. We argue to shift\u0000attention away from description logic, which has so far underpinned the\u0000semantic web, to a different fragment of first-order logic. We argue, using\u0000examples from the (geo)spatial domain, that by doing so, the semantic web can\u0000be approached as a traditional data migration and integration problem at a\u0000massive scale. That way, a huge amount of existing tools and theories can be\u0000deployed to the semantic web's benefit, and the original vision of ontology as\u0000shared abstraction be reinvigorated.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"88 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141868201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Partial Adaptive Indexing for Approximate Query Answering
Stavros Maroulis, Nikos Bikakis, Vassilis Stamatopoulos, George Papastefanatos
arXiv:2407.18702 (2024-07-26)
In data exploration, users need to analyze large data files quickly, aiming to minimize data-to-analysis time. While recent adaptive indexing approaches address this need, there are cases where they perform poorly: during the initial queries, in regions with a high density of objects, and over very large files on commodity hardware. This work introduces an approach for adaptive indexing driven by both the query workload and user-defined accuracy constraints in order to support approximate query answering. The approach is based on partial index adaptation, which reduces the costs associated with reading data files and refining indexes. We leverage a hierarchical tile-based indexing scheme and its stored metadata to provide efficient query evaluation while ensuring accuracy within user-specified bounds. Our preliminary evaluation demonstrates improvements in query evaluation time, especially during initial user exploration.
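
The sketch below illustrates the general flavor of metadata-driven approximate answering with on-demand refinement: a COUNT query is answered from per-tile counts when the possible error stays within the user's bound, and only partially covered tiles are read otherwise. The one-dimensional layout, error model, and refinement rule are my own simplifying assumptions, not the paper's index structure.

```python
# Toy approximate COUNT over tiles with partial, accuracy-driven refinement.
from dataclasses import dataclass

@dataclass
class Tile:
    lo: float      # tile range start (1-D for simplicity)
    hi: float      # tile range end
    count: int     # number of points in the tile (stored metadata)

def approx_count(tiles, q_lo, q_hi, max_rel_error, read_tile):
    """Estimate COUNT(*) over [q_lo, q_hi); refine partial tiles only if needed."""
    exact, uncertain, partial = 0, 0, []
    for t in tiles:
        if t.hi <= q_lo or t.lo >= q_hi:
            continue                        # tile does not overlap the query
        if q_lo <= t.lo and t.hi <= q_hi:
            exact += t.count                # fully covered: metadata is exact
        else:
            partial.append(t)
            uncertain += t.count            # partially covered: true value in [0, count]
    estimate = exact + uncertain / 2
    if estimate > 0 and (uncertain / 2) / estimate > max_rel_error:
        for t in partial:                   # accuracy bound violated: read raw data
            exact += read_tile(t, q_lo, q_hi)   # read_tile stands in for the expensive scan
        estimate = exact
    return estimate
```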
{"title":"Partial Adaptive Indexing for Approximate Query Answering","authors":"Stavros Maroulis, Nikos Bikakis, Vassilis Stamatopoulos, George Papastefanatos","doi":"arxiv-2407.18702","DOIUrl":"https://doi.org/arxiv-2407.18702","url":null,"abstract":"In data exploration, users need to analyze large data files quickly, aiming\u0000to minimize data-to-analysis time. While recent adaptive indexing approaches\u0000address this need, they are cases where demonstrate poor performance.\u0000Particularly, during the initial queries, in regions with a high density of\u0000objects, and in very large files over commodity hardware. This work introduces\u0000an approach for adaptive indexing driven by both query workload and\u0000user-defined accuracy constraints to support approximate query answering. The\u0000approach is based on partial index adaptation which reduces the costs\u0000associated with reading data files and refining indexes. We leverage a\u0000hierarchical tile-based indexing scheme and its stored metadata to provide\u0000efficient query evaluation, ensuring accuracy within user-specified bounds. Our\u0000preliminary evaluation demonstrates improvement on query evaluation time,\u0000especially during initial user exploration.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141868202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A survey of open-source data quality tools: shedding light on the materialization of data quality dimensions in practice
Vasileios Papastergios, Anastasios Gounaris
arXiv:2407.18649 (2024-07-26)

Data Quality (DQ) describes the degree to which data characteristics meet requirements and are fit for use by humans and/or systems. There are several aspects in which DQ can be measured, called DQ dimensions (e.g., accuracy, completeness, consistency), also referred to as characteristics in the literature. The ISO/IEC 25012 standard defines a data quality model with fifteen such dimensions, setting the requirements a data product should meet. In this short report, we aim to bridge the gap between the lower-level functionalities offered by DQ tools and the higher-level dimensions in a systematic manner, revealing the many-to-many relationships between them. To this end, we examine six open-source DQ tools, emphasizing a mapping between the functionalities they offer and the DQ dimensions as defined by the ISO standard. Wherever applicable, we also provide insights into the software engineering details that the tools leverage to address DQ challenges.
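
A many-to-many mapping of this kind can be represented very simply; the sketch below uses a plain dictionary from functionality names to ISO/IEC 25012 dimensions. The functionality names and their dimension assignments are hypothetical examples, not the survey's findings.

```python
# Hypothetical functionality-to-dimension mapping (illustration only).
FUNCTIONALITY_TO_DIMENSIONS = {
    "null_check":           ["completeness"],
    "regex_format_check":   ["accuracy", "consistency"],
    "cross_table_fk_check": ["consistency", "completeness"],
    "freshness_check":      ["currentness"],
    "duplicate_detection":  ["accuracy", "consistency"],
}

def dimensions_covered(functionalities):
    """ISO dimensions touched by a tool offering the given functionalities."""
    return sorted({d for f in functionalities
                     for d in FUNCTIONALITY_TO_DIMENSIONS.get(f, [])})

print(dimensions_covered(["null_check", "freshness_check"]))
# ['completeness', 'currentness']
```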
{"title":"A survey of open-source data quality tools: shedding light on the materialization of data quality dimensions in practice","authors":"Vasileios Papastergios, Anastasios Gounaris","doi":"arxiv-2407.18649","DOIUrl":"https://doi.org/arxiv-2407.18649","url":null,"abstract":"Data Quality (DQ) describes the degree to which data characteristics meet\u0000requirements and are fit for use by humans and/or systems. There are several\u0000aspects in which DQ can be measured, called DQ dimensions (i.e. accuracy,\u0000completeness, consistency, etc.), also referred to as characteristics in\u0000literature. ISO/IEC 25012 Standard defines a data quality model with fifteen\u0000such dimensions, setting the requirements a data product should meet. In this\u0000short report, we aim to bridge the gap between lower-level functionalities\u0000offered by DQ tools and higher-level dimensions in a systematic manner,\u0000revealing the many-to-many relationships between them. To this end, we examine\u00006 open-source DQ tools and we emphasize on providing a mapping between the\u0000functionalities they offer and the DQ dimensions, as defined by the ISO\u0000standard. Wherever applicable, we also provide insights into the software\u0000engineering details that tools leverage, in order to address DQ challenges.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141868204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Enhanced Privacy Bound for Shuffle Model with Personalized Privacy
Yixuan Liu, Yuhan Liu, Li Xiong, Yujie Gu, Hong Chen
arXiv:2407.18157 (2024-07-25)
The shuffle model of Differential Privacy (DP) is an enhanced privacy protocol which introduces an intermediate trusted server between local users and a central data curator. It significantly amplifies the central DP guarantee by anonymizing and shuffling the locally randomized data. Yet, deriving a tight privacy bound is challenging due to its complicated randomization protocol. While most existing work focuses on unified local privacy settings, this work focuses on deriving the central privacy bound for a more practical setting where personalized local privacy is required by each user. To bound the privacy after shuffling, we first need to capture the probability of each user generating clones of the neighboring data points. Second, we need to quantify the indistinguishability between two distributions of the number of clones on neighboring datasets. Existing works either inaccurately capture the probability or underestimate the indistinguishability between neighboring datasets. Motivated by this, we develop a more precise analysis, which yields a general and tighter bound for arbitrary DP mechanisms. Firstly, we derive the clone-generating probability by hypothesis testing, which leads to a more accurate characterization of the probability. Secondly, we analyze the indistinguishability in the context of $f$-DP, where the convexity of the distributions is leveraged to achieve a tighter privacy bound. Theoretical and numerical results demonstrate that our bound remarkably outperforms the existing results in the literature.
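
For readers unfamiliar with the protocol itself, here is a small simulation of the shuffle model with personalized local privacy: each user applies binary randomized response at their own epsilon, and the shuffler randomly permutes the reports to remove the report-to-user linkage. It only illustrates the mechanism; the paper's contribution, the tighter central bound, is not computed here.

```python
# Shuffle model with per-user (personalized) randomized response (illustration only).
import math
import random

def randomized_response(bit, epsilon):
    """Keep the true bit with probability e^eps / (e^eps + 1), otherwise flip it."""
    keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if random.random() < keep else 1 - bit

def shuffle_model(bits, epsilons):
    reports = [randomized_response(b, e) for b, e in zip(bits, epsilons)]
    random.shuffle(reports)   # the trusted shuffler anonymizes the reports
    return reports

# Hypothetical per-user data and personalized privacy budgets.
bits = [1, 0, 1, 1, 0, 0, 1, 0]
epsilons = [0.5, 1.0, 0.5, 2.0, 1.0, 0.5, 0.1, 1.0]
print(shuffle_model(bits, epsilons))
```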
{"title":"Enhanced Privacy Bound for Shuffle Model with Personalized Privacy","authors":"Yixuan Liu, Yuhan Liu, Li Xiong, Yujie Gu, Hong Chen","doi":"arxiv-2407.18157","DOIUrl":"https://doi.org/arxiv-2407.18157","url":null,"abstract":"The shuffle model of Differential Privacy (DP) is an enhanced privacy\u0000protocol which introduces an intermediate trusted server between local users\u0000and a central data curator. It significantly amplifies the central DP guarantee\u0000by anonymizing and shuffling the local randomized data. Yet, deriving a tight\u0000privacy bound is challenging due to its complicated randomization protocol.\u0000While most existing work are focused on unified local privacy settings, this\u0000work focuses on deriving the central privacy bound for a more practical setting\u0000where personalized local privacy is required by each user. To bound the privacy\u0000after shuffling, we first need to capture the probability of each user\u0000generating clones of the neighboring data points. Second, we need to quantify\u0000the indistinguishability between two distributions of the number of clones on\u0000neighboring datasets. Existing works either inaccurately capture the\u0000probability, or underestimate the indistinguishability between neighboring\u0000datasets. Motivated by this, we develop a more precise analysis, which yields a\u0000general and tighter bound for arbitrary DP mechanisms. Firstly, we derive the\u0000clone-generating probability by hypothesis testing %from a randomizer-specific\u0000perspective, which leads to a more accurate characterization of the\u0000probability. Secondly, we analyze the indistinguishability in the context of\u0000$f$-DP, where the convexity of the distributions is leveraged to achieve a\u0000tighter privacy bound. Theoretical and numerical results demonstrate that our\u0000bound remarkably outperforms the existing results in the literature.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

My Ontologist: Evaluating BFO-Based AI for Definition Support
Carter Benson, Alec Sculley, Austin Liebers, John Beverley
arXiv:2407.17657 (2024-07-24)
Generative artificial intelligence (AI), exemplified by the release of GPT-3.5 in 2022, has significantly advanced the potential applications of large language models (LLMs), including in the realms of ontology development and knowledge graph creation. Ontologies, which are structured frameworks for organizing information, and knowledge graphs, which combine ontologies with actual data, are essential for enabling interoperability and automated reasoning. However, current research has largely overlooked the generation of ontologies extending from established upper-level frameworks like the Basic Formal Ontology (BFO), risking the creation of non-integrable ontology silos. This study explores the extent to which LLMs, particularly GPT-4, can support ontologists trained in BFO. Through iterative development of a specialized GPT model named "My Ontologist," we aimed to generate BFO-conformant ontologies. Initial versions faced challenges in maintaining definition conventions and leveraging foundational texts effectively. My Ontologist 3.0 showed promise by adhering to structured rules and modular ontology suites, yet the release of GPT-4o disrupted this progress by altering the model's behavior. Our findings underscore the importance of aligning LLM-generated ontologies with top-level standards and highlight the complexities of integrating evolving AI capabilities in ontology engineering.
{"title":"My Ontologist: Evaluating BFO-Based AI for Definition Support","authors":"Carter Benson, Alec Sculley, Austin Liebers, John Beverley","doi":"arxiv-2407.17657","DOIUrl":"https://doi.org/arxiv-2407.17657","url":null,"abstract":"Generative artificial intelligence (AI), exemplified by the release of\u0000GPT-3.5 in 2022, has significantly advanced the potential applications of large\u0000language models (LLMs), including in the realms of ontology development and\u0000knowledge graph creation. Ontologies, which are structured frameworks for\u0000organizing information, and knowledge graphs, which combine ontologies with\u0000actual data, are essential for enabling interoperability and automated\u0000reasoning. However, current research has largely overlooked the generation of\u0000ontologies extending from established upper-level frameworks like the Basic\u0000Formal Ontology (BFO), risking the creation of non-integrable ontology silos.\u0000This study explores the extent to which LLMs, particularly GPT-4, can support\u0000ontologists trained in BFO. Through iterative development of a specialized GPT\u0000model named \"My Ontologist,\" we aimed to generate BFO-conformant ontologies.\u0000Initial versions faced challenges in maintaining definition conventions and\u0000leveraging foundational texts effectively. My Ontologist 3.0 showed promise by\u0000adhering to structured rules and modular ontology suites, yet the release of\u0000GPT-4o disrupted this progress by altering the model's behavior. Our findings\u0000underscore the importance of aligning LLM-generated ontologies with top-level\u0000standards and highlight the complexities of integrating evolving AI\u0000capabilities in ontology engineering.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Dynamic Subgraph Matching via Cost-Model-based Vertex Dominance Embeddings (Technical Report)
Yutong Ye, Xiang Lian, Nan Zhang, Mingsong Chen
arXiv:2407.16660 (2024-07-23)

In many real-world applications, such as social network analysis, knowledge graph discovery, and biological network analytics, graph data management has become increasingly important and has drawn much attention from the database community. Since many graphs (e.g., Twitter, Wikipedia) are usually evolving over time, it is of great importance to study the dynamic subgraph matching (DSM) problem, a fundamental yet challenging graph operator which continuously monitors subgraph matching results over dynamic graphs with a stream of edge updates. To efficiently tackle the DSM problem, we carefully design a novel vertex dominance embedding approach, which effectively encodes vertex labels and can be incrementally maintained upon graph updates. Inspired by the low pruning power for high-degree vertices, we propose a new degree grouping technique over basic subgraph patterns in different degree groups (i.e., groups of star substructures), and devise degree-aware star substructure synopses (DAS^3) to effectively facilitate our designed vertex dominance and range pruning strategies. We develop efficient algorithms to incrementally maintain dynamic graphs and answer DSM queries. Through extensive experiments, we confirm the efficiency of our proposed approaches over both real and synthetic graphs.
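
To ground the problem setting, the sketch below shows a naive baseline for DSM on a labeled star pattern: when an edge arrives, only matches that use the new edge are searched for. This illustrates the streaming formulation only; the pattern, labels, and data are invented, and nothing here implements the paper's vertex dominance embeddings or DAS^3 synopses.

```python
# Naive incremental matcher for a star pattern (center label + required leaf labels).
from collections import defaultdict
from itertools import product

class DynamicGraph:
    def __init__(self):
        self.adj = defaultdict(set)
        self.label = {}

    def add_vertex(self, v, lab):
        self.label[v] = lab

    def star_matches_at(self, center, center_lab, leaf_labs):
        """All embeddings of the star pattern whose center maps to `center`."""
        if self.label.get(center) != center_lab:
            return []
        cands = [[n for n in self.adj[center] if self.label.get(n) == lab] for lab in leaf_labs]
        return [(center,) + combo for combo in product(*cands)
                if len(set(combo)) == len(combo)]       # leaves must be distinct vertices

    def add_edge(self, u, v, center_lab, leaf_labs):
        self.adj[u].add(v)
        self.adj[v].add(u)
        # Only matches containing the fresh edge (u, v) can be new.
        return [m for c in (u, v) for m in self.star_matches_at(c, center_lab, leaf_labs)
                if u in m and v in m]

g = DynamicGraph()
for vid, lab in [(1, "car"), (2, "cam"), (3, "cam")]:
    g.add_vertex(vid, lab)
print(g.add_edge(1, 2, "car", ["cam"]))   # [(1, 2)] : first match appears
print(g.add_edge(1, 3, "car", ["cam"]))   # [(1, 3)] : only the newly created match
```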
{"title":"Dynamic Subgraph Matching via Cost-Model-based Vertex Dominance Embeddings (Technical Report)","authors":"Yutong Ye, Xiang Lian, Nan Zhang, Mingsong Chen","doi":"arxiv-2407.16660","DOIUrl":"https://doi.org/arxiv-2407.16660","url":null,"abstract":"In many real-world applications such as social network analysis, knowledge\u0000graph discovery, biological network analytics, and so on, graph data management\u0000has become increasingly important and has drawn much attention from the\u0000database community. While many graphs (e.g., Twitter, Wikipedia, etc.) are\u0000usually involving over time, it is of great importance to study the dynamic\u0000subgraph matching (DSM) problem, a fundamental yet challenging graph operator,\u0000which continuously monitors subgraph matching results over dynamic graphs with\u0000a stream of edge updates. To efficiently tackle the DSM problem, we carefully\u0000design a novel vertex dominance embedding approach, which effectively encodes\u0000vertex labels that can be incrementally maintained upon graph updates. Inspire\u0000by low pruning power for high-degree vertices, we propose a new degree grouping\u0000technique over basic subgraph patterns in different degree groups (i.e., groups\u0000of star substructures), and devise degree-aware star substructure synopses\u0000(DAS^3) to effectively facilitate our designed vertex dominance and range\u0000pruning strategies. We develop efficient algorithms to incrementally maintain\u0000dynamic graphs and answer DSM queries. Through extensive experiments, we\u0000confirm the efficiency of our proposed approaches over both real and synthetic\u0000graphs.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"351 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141776262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}