Incentive Mechanism Design for Responsible Data Governance: A Large-scale Field Experiment
Christina Timko, Malte Niederstadt, Naman Goel, Boi Faltings
A crucial building block of responsible artificial intelligence is responsible data governance, including data collection. Its importance is also underlined in the latest EU regulations. The data should be of high quality, foremost correct and representative, and individuals providing the data should have autonomy over what data is collected. In this article, we consider the setting of collecting personally measured fitness data (physical activity measurements), in which some individuals may not have an incentive to measure and report accurate data. This can significantly degrade the quality of the collected data. On the other hand, high-quality collective data of this nature could be used for reliable scientific insights or to build trustworthy artificial intelligence applications. We conduct a framed field experiment (N = 691) to examine the effect of offering fixed and quality-dependent monetary incentives on the quality of the collected data. We use a peer-based incentive-compatible mechanism for the quality-dependent incentives without spot-checking or surveilling individuals. We find that the incentive-compatible mechanism can elicit good-quality data while providing a good user experience and compensating fairly, although, in the specific study context, the data quality does not necessarily differ under the two incentive schemes. We contribute new design insights from the experiment and discuss directions that future field experiments and applications on explainable and transparent data collection may focus on.
{"title":"Incentive Mechanism Design for Responsible Data Governance: A Large-scale Field Experiment","authors":"Christina Timko, Malte Niederstadt, Naman Goel, Boi Faltings","doi":"10.1145/3592617","DOIUrl":"https://doi.org/10.1145/3592617","url":null,"abstract":"A crucial building block of responsible artificial intelligence is responsible data governance, including data collection. Its importance is also underlined in the latest EU regulations. The data should be of high quality, foremost correct and representative, and individuals providing the data should have autonomy over what data is collected. In this article, we consider the setting of collecting personally measured fitness data (physical activity measurements), in which some individuals may not have an incentive to measure and report accurate data. This can significantly degrade the quality of the collected data. On the other hand, high-quality collective data of this nature could be used for reliable scientific insights or to build trustworthy artificial intelligence applications. We conduct a framed field experiment (N = 691) to examine the effect of offering fixed and quality-dependent monetary incentives on the quality of the collected data. We use a peer-based incentive-compatible mechanism for the quality-dependent incentives without spot-checking or surveilling individuals. We find that the incentive-compatible mechanism can elicit good-quality data while providing a good user experience and compensating fairly, although, in the specific study context, the data quality does not necessarily differ under the two incentive schemes. We contribute new design insights from the experiment and discuss directions that future field experiments and applications on explainable and transparent data collection may focus on.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"23 1","pages":"1 - 18"},"PeriodicalIF":2.1,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83296585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Theory and Practice of Relational-to-RDF Temporal Data Exchange and Query Answering
J. Ao, Zehui Cheng, Rada Y. Chirkova, Phokion G. Kolaitis
We consider the problem of answering temporal queries on RDF stores in the presence of atemporal RDFS domain ontologies, of relational data sources that include temporal information, and of rules that map the domain information in the source schemas into the target ontology. Our proposed practice-oriented solution consists of two rule-based, domain-independent algorithms. The first algorithm materializes target RDF data via a version of data exchange that enriches both the data and the ontology with temporal information from the relational sources. The second algorithm accepts as inputs temporal queries expressed in terms of the domain ontology using a lightweight temporal extension of SPARQL, and ensures successful evaluation of the queries on the materialized temporally enriched RDF data. To study the quality of the information generated by the algorithms, we develop a general framework that formalizes the relational-to-RDF temporal data-exchange problem. The framework includes a chase formalism and a formal solution for the problem of answering temporal queries in the context of relational-to-RDF temporal data exchange. In this article, we present the algorithms and the formal framework that proves the correctness of the information output by the algorithms, and we also report on the algorithm implementation and experimental results for two application domains.
{"title":"Theory and Practice of Relational-to-RDF Temporal Data Exchange and Query Answering","authors":"J. Ao, Zehui Cheng, Rada Y. Chirkova, Phokion G. Kolaitis","doi":"10.1145/3591359","DOIUrl":"https://doi.org/10.1145/3591359","url":null,"abstract":"We consider the problem of answering temporal queries on RDF stores, in presence of atemporal RDFS domain ontologies, of relational data sources that include temporal information, and of rules that map the domain information in the source schemas into the target ontology. Our proposed practice-oriented solution consists of two rule-based domain-independent algorithms. The first algorithm materializes target RDF data via a version of data exchange that enriches both the data and the ontology with temporal information from the relational sources. The second algorithm accepts as inputs temporal queries expressed in terms of the domain ontology using a lightweight temporal extension of SPARQL, and ensures successful evaluation of the queries on the materialized temporally-enriched RDF data. To study the quality of the information generated by the algorithms, we develop a general framework that formalizes the relational-to-RDF temporal data-exchange problem. The framework includes a chase formalism and a formal solution for the problem of answering temporal queries in the context of relational-to-RDF temporal data exchange. In this article, we present the algorithms and the formal framework that proves correctness of the information output by the algorithms, and also report on the algorithm implementation and experimental results for two application domains.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"25 1","pages":"1 - 27"},"PeriodicalIF":2.1,"publicationDate":"2023-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72468768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To Link or Synthesize? An Approach to Data Quality Comparison
Duncan Smith, M. Elliot, J. Sakshaug
Linking administrative data to produce more informative data for subsequent analysis has become an increasingly common practice. However, there might be concomitant risks of disclosing sensitive information about individuals. One practice that reduces these risks is data synthesis. In data synthesis, the data are used to fit a model from which synthetic data are then generated. The synthetic data are then released to end users. There are some scenarios where an end user might have the option of using linked data or accepting synthesized data. However, linkage and synthesis are susceptible to errors that could limit their usefulness. Here, we investigate the problem of comparing the quality of linked data to synthesized data and demonstrate through simulations how the problem might be approached. These comparisons are important when considering how an end user can be supplied with the highest-quality data and in situations where one must consider risk/utility tradeoffs.
{"title":"To Link or Synthesize? An Approach to Data Quality Comparison","authors":"Duncan Smith, M. Elliot, J. Sakshaug","doi":"10.1145/3580487","DOIUrl":"https://doi.org/10.1145/3580487","url":null,"abstract":"Linking administrative data to produce more informative data for subsequent analysis has become an increasingly common practice. However, there might be concomitant risks of disclosing sensitive information about individuals. One practice that reduces these risks is data synthesis. In data synthesis the data are used to fit a model from which synthetic data are then generated. The synthetic data are then released to end users. There are some scenarios where an end user might have the option of using linked data or accepting synthesized data. However, linkage and synthesis are susceptible to errors that could limit their usefulness. Here, we investigate the problem of comparing the quality of linked data to synthesized data and demonstrate through simulations how the problem might be approached. These comparisons are important when considering how an end user can be supplied with the highest-quality data and in situations where one must consider risk/utility tradeoffs.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"58 1","pages":"1 - 20"},"PeriodicalIF":2.1,"publicationDate":"2023-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74240710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introduction to the Special Issue on Truth and Trust Online
Dustin Wright, Paolo Papotti, Isabelle Augenstein
This editorial summarizes the content of the Special Issue on Truth and Trust Online of the Journal of Data and Information Quality. We thank the authors for their exceptional contributions to this special issue.
{"title":"Introduction to the Special Issue on Truth and Trust Online","authors":"Dustin Wright, Paolo Papotti, Isabelle Augenstein","doi":"10.1145/3578242","DOIUrl":"https://doi.org/10.1145/3578242","url":null,"abstract":"This editorial summarizes the content of the Special Issue on Truth and Trust Online of the Journal of Data and Information Quality. We thank the authors for their exceptional contributions to this special issue.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"48 1","pages":"1 - 3"},"PeriodicalIF":2.1,"publicationDate":"2023-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82737811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Experimental Evaluation of Covariates Effects on Periocular Biometrics: A Robust Security Assessment Framework
Gautam Kumar, Sambit Bakshi, A. K. Sangaiah, Pankaj Kumar Sa
The growing integration of technology into our lives has resulted in unprecedented amounts of data being exchanged among devices in an Internet of Things (IoT) environment. Authentication, identification, and device heterogeneity are major security and privacy concerns in IoT. One of the most effective solutions to avoid unauthorized access to sensitive information is biometrics. Deep learning-based biometric systems have been proven to outperform traditional image processing and machine learning techniques. However, the image quality covariates associated with blur, resolution, illumination, and noise predominantly affect recognition performance. Therefore, assessing the robustness of the developed solution is another important concern that still needs to be investigated. This article proposes a periocular region-based biometric system and explores the effect of image quality covariates (artifacts) on the performance of periocular recognition. To simulate real-time scenarios and understand the consequences of blur, resolution, and bit depth on the recognition accuracy of periocular biometrics, we modeled out-of-focus blur, camera shake blur, low-resolution acquisition, and low bit-depth image acquisition using a Gaussian function, linear motion, interpolation, and bit plane slicing, respectively. All images of the UBIRIS.v1 database are degraded with varying strengths of the image quality covariates to obtain degraded versions of the database. Afterward, deep models are trained with each degraded version of the database. The performance of the model is evaluated by measuring statistical parameters calculated from a confusion matrix. Experimental results show that, among all types of covariates, camera shake blur has the least effect on recognition performance, while out-of-focus blur impacts it significantly. Irrespective of image quality, the convolutional neural network produces excellent results, which proves the robustness of the developed model.
{"title":"Experimental Evaluation of Covariates Effects on Periocular Biometrics: A Robust Security Assessment Framework","authors":"Gautam Kumar, Sambit Bakshi, A. K. Sangaiah, Pankaj Kumar Sa","doi":"10.1145/3579029","DOIUrl":"https://doi.org/10.1145/3579029","url":null,"abstract":"The growing integration of technology into our lives has resulted in unprecedented amounts of data that are being exchanged among devices in an Internet of Things (IoT) environment. Authentication, identification, and device heterogeneities are major security and privacy concerns in IoT. One of the most effective solutions to avoid unauthorized access to sensitive information is biometrics. Deep learning-based biometric systems have been proven to outperform traditional image processing and machine learning techniques. However, the image quality covariates associated with blur, resolution, illumination, and noise predominantly affect recognition performance. Therefore, assessing the robustness of the developed solution is another important concern that still needs to be investigated. This article proposes a periocular region-based biometric system and explores the effect of image quality covariates (artifacts) on the performance of periocular recognition. To simulate the real-time scenarios and understand the consequences of blur, resolution, and bit-depth of images on the recognition accuracy of periocular biometrics, we modeled out-of-focus blur, camera shake blur, low-resolution, and low bit-depth image acquisition using Gaussian function, linear motion, interpolation, and bit plan slicing, respectively. All the images of the UBIRIS.v1 database are degraded by varying strength of image quality covariates to obtain degraded versions of the database. Afterward, deep models are trained with each degraded version of the database. The performance of the model is evaluated by measuring statistical parameters calculated from a confusion matrix. Experimental results show that among all types of covariates, camera shake blur has less effect on the recognition performance, while out-of-focus blur significantly impacts it. Irrespective of image quality, the convolutional neural network produces excellent results, which proves the robustness of the developed model.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"74 1","pages":"1 - 25"},"PeriodicalIF":2.1,"publicationDate":"2023-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89379695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Survey on Soft Computing Techniques for Federated Learning - Applications, Challenges and Future Directions
Y. Supriya, T. Gadekallu
Federated learning is a distributed, privacy-preserving machine learning paradigm that has been gaining increasing attention. It has a vast number of applications in different fields. Despite its growing popularity, it also suffers from drawbacks such as high communication costs, privacy concerns, and data management issues. In this survey, we define federated learning systems and analyse them with the aim of ensuring a smooth flow and guiding future research with the help of soft computing techniques. We undertake a comprehensive review of approaches that combine federated learning systems with soft computing techniques. We also investigate the impact of integrating various nature-inspired techniques with federated learning to alleviate its flaws. Finally, this paper discusses possible future developments in integrating federated learning and soft computing techniques.
{"title":"A Survey on Soft Computing Techniques for Federated Learning- Applications, Challenges and Future Directions","authors":"Y. Supriya, T. Gadekallu","doi":"10.1145/3575810","DOIUrl":"https://doi.org/10.1145/3575810","url":null,"abstract":"Federated Learning is a distributed, privacy-preserving machine learning model that is gaining more attention these days. Federated Learning has a vast number of applications in different fields. While being more popular, it also suffers some drawbacks like high communication costs, privacy concerns, and data management issues. In this survey, we define federated learning systems and analyse the system to ensure a smooth flow and to guide future research with the help of soft computing techniques. We undertake a complete review of aggregating federated learning systems with soft computing techniques. We also investigate the impacts of collaborating various nature-inspired techniques with federated learning to alleviate its flaws. Finally, this paper discusses the possible future developments of integrating federated learning and soft computing techniques.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"208 1","pages":"1 - 28"},"PeriodicalIF":2.1,"publicationDate":"2023-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88583831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Survey on Edge Intelligence and Lightweight Machine Learning Support for Future Applications and Services
Kyle Hoffpauir, Jacob Simmons, Nikolas Schmidt, Rachitha Pittala, Isaac Briggs, Shanmukha Makani, Y. Jararweh
As the number of devices connected to the Internet has grown, so too has the intensity of the tasks that these devices need to perform. Modern networks increasingly have to perform computationally intensive tasks on low-power devices and low-end hardware. Current architectures and platforms tend towards centralized, resource-rich cloud computing approaches to address these deficits. However, edge computing presents a much more viable and flexible alternative. Edge computing refers to a distributed and decentralized network architecture in which demanding tasks such as image recognition, smart city services, and high-intensity data processing can be distributed over a number of integrated network devices. In this article, we provide a comprehensive survey of emerging edge intelligence applications, lightweight machine learning algorithms, and their support for future applications and services. We start by analyzing the rise of cloud computing, discuss its weak points, and identify situations in which edge computing provides advantages over traditional cloud computing architectures. We then present the survey in three sections: the first identifies opportunities and domains for edge computing growth, the second identifies algorithms and approaches that can be used to enhance edge intelligence implementations, and the third analyzes situations in which edge intelligence can be enhanced using any of the aforementioned algorithms or approaches. In this third section, lightweight machine learning approaches are detailed. A more in-depth analysis and discussion of future developments follows. The primary aim of this article is to ensure that appropriate approaches, chiefly lightweight machine learning, are applied effectively to artificial intelligence implementations in edge systems.
{"title":"A Survey on Edge Intelligence and Lightweight Machine Learning Support for Future Applications and Services","authors":"Kyle Hoffpauir, Jacob Simmons, Nikolas Schmidt, Rachitha Pittala, Isaac Briggs, Shanmukha Makani, Y. Jararweh","doi":"10.1145/3581759","DOIUrl":"https://doi.org/10.1145/3581759","url":null,"abstract":"As the number of devices connected to the Internet has grown larger, so too has the intensity of the tasks that these devices need to perform. Modern networks are more frequently working to perform computationally intensive tasks on low-power devices and low-end hardware. Current architectures and platforms tend towards centralized and resource-rich cloud computing approaches to address these deficits. However, edge computing presents a much more viable and flexible alternative. Edge computing refers to a distributed and decentralized network architecture in which demanding tasks such as image recognition, smart city services, and high-intensity data processing tasks can be distributed over a number of integrated network devices. In this article, we provide a comprehensive survey for emerging edge intelligence applications, lightweight machine learning algorithms, and their support for future applications and services. We start by analyzing the rise of cloud computing, discuss its weak points, and identify situations in which edge computing provides advantages over traditional cloud computing architectures. We then divulge details of the survey: the first section identifies opportunities and domains for edge computing growth, the second identifies algorithms and approaches that can be used to enhance edge intelligence implementations, and the third specifically analyzes situations in which edge intelligence can be enhanced using any of the aforementioned algorithms or approaches. In this third section, lightweight machine learning approaches are detailed. A more in-depth analysis and discussion of future developments follows. The primary discourse of this article is in service of an effort to ensure that appropriate approaches are applied adequately to artificial intelligence implementations in edge systems, mainly, the lightweight machine learning approaches.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"67 1","pages":"1 - 30"},"PeriodicalIF":2.1,"publicationDate":"2023-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79103940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Choice of Textual Knowledge Base in Automated Claim Checking
Dominik Stammbach, Boya Zhang, Elliott Ash
Automated claim checking is the task of determining the veracity of a claim given evidence retrieved from a textual knowledge base of trustworthy facts. While previous work has taken the knowledge base as given and optimized the claim-checking pipeline, we take the opposite approach—taking the pipeline as given, we explore the choice of the knowledge base. Our first insight is that a claim-checking pipeline can be transferred to a new domain of claims with access to a knowledge base from the new domain. Second, we do not find a “universally best” knowledge base—higher domain overlap of a task dataset and a knowledge base tends to produce better label accuracy. Third, combining multiple knowledge bases does not tend to improve performance beyond using the closest-domain knowledge base. Finally, we show that the claim-checking pipeline’s confidence score for selecting evidence can be used to assess whether a knowledge base will perform well for a new set of claims, even in the absence of ground-truth labels.
{"title":"The Choice of Textual Knowledge Base in Automated Claim Checking","authors":"Dominik Stammbach, Boya Zhang, Elliott Ash","doi":"10.1145/3561389","DOIUrl":"https://doi.org/10.1145/3561389","url":null,"abstract":"Automated claim checking is the task of determining the veracity of a claim given evidence retrieved from a textual knowledge base of trustworthy facts. While previous work has taken the knowledge base as given and optimized the claim-checking pipeline, we take the opposite approach—taking the pipeline as given, we explore the choice of the knowledge base. Our first insight is that a claim-checking pipeline can be transferred to a new domain of claims with access to a knowledge base from the new domain. Second, we do not find a “universally best” knowledge base—higher domain overlap of a task dataset and a knowledge base tends to produce better label accuracy. Third, combining multiple knowledge bases does not tend to improve performance beyond using the closest-domain knowledge base. Finally, we show that the claim-checking pipeline’s confidence score for selecting evidence can be used to assess whether a knowledge base will perform well for a new set of claims, even in the absence of ground-truth labels.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"40 1","pages":"1 - 22"},"PeriodicalIF":2.1,"publicationDate":"2023-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87095226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Multifactor Ring Signature based Authentication Scheme for Quality Assessment of IoMT Environment in COVID-19 Scenario
Kakali Chatterjee, Ashutosh Kumar Singh, Neha, K. Yu
The quality of the healthcare environment has become an essential factor for healthcare users to access quality services. Smart healthcare systems use Internet of Medical Things (IoMT) devices to capture patients' health data for treatment or diagnostic purposes. This sensitive patient data is shared among the different stakeholders across the network to provide quality services. As a result, healthcare systems are vulnerable to confidentiality, integrity, and privacy threats. In the COVID-19 scenario, when collaborative medical consultation is required, quality assessment of the framework is essential to protect the privacy of doctors and patients. In this paper, a ring signature-based anonymous authentication and quality assessment scheme is designed for collaborative medical consultation environments to assess quality and protect the privacy of doctors and patients. The scheme also uses a new KMOV cryptosystem to ensure the quality of the network and protect the system from different attacks that compromise data confidentiality.
{"title":"A Multifactor Ring Signature based Authentication Scheme for Quality Assessment of IoMT Environment in COVID-19 Scenario","authors":"Kakali Chatterjee, Ashutosh Kumar Singh, Neha, K. Yu","doi":"10.1145/3575811","DOIUrl":"https://doi.org/10.1145/3575811","url":null,"abstract":"The quality of the healthcare environment has become an essential factor for healthcare users to access quality services. Smart healthcare systems use the Internet of Medical Things (IoMT) devices to capture patients’ health data for treatment or diagnostic purposes. This sensitive collected patient data is shared between the different stakeholders across the network to provide quality services. Due to this, healthcare systems are vulnerable to confidentiality, integrity and privacy threats. In the COVID-19 scenario, when collaborative medical consultation is required, the quality assessment of the framework is essential to protect the privacy of doctors and patients. In this paper, a ring signature-based anonymous authentication and quality assessment scheme is designed for collaborative medical consultation environments for quality assessment and protection of the privacy of doctors and patients. This scheme also uses a new KMOV Cryptosystem to ensure the quality of the network and protect the system from different attacks that hamper data confidentiality.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"6 1","pages":"1 - 24"},"PeriodicalIF":2.1,"publicationDate":"2023-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75686593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uniqueness Constraints for Object Stores
Philipp Skavantzos, Uwe Leck, Kaiqi Zhao, S. Link
Object stores offer an increasingly popular choice for data management and analytics. As with every data model, managing the integrity of objects is fundamental for data quality but also important for the efficiency of update and query operations. In response to shortcomings of unique and existence constraints in object stores, we propose a new principled class of constraints that separates uniqueness from existence dimensions of data quality, and fully supports multiple labels and composite properties. We illustrate benefits of the constraints on real-world examples of property graphs where node integrity is enforced for better update and query performance. The benefits are quantified experimentally in terms of perfectly scaling the access to data through indices that result from the constraints. We establish axiomatic and algorithmic characterizations for the underlying implication problem. In addition, we fully characterize which non-redundant families of constraints attain maximum cardinality for any given finite sets of labels and properties. We exemplify further use cases of the constraints: elicitation of business rules, identification of data quality problems, and design for data quality. Finally, we propose extensions to managing the integrity of objects in object stores such as graph databases.
{"title":"Uniqueness Constraints for Object Stores","authors":"Philipp Skavantzos, Uwe Leck, Kaiqi Zhao, S. Link","doi":"10.1145/3581758","DOIUrl":"https://doi.org/10.1145/3581758","url":null,"abstract":"Object stores offer an increasingly popular choice for data management and analytics. As with every data model, managing the integrity of objects is fundamental for data quality but also important for the efficiency of update and query operations. In response to shortcomings of unique and existence constraints in object stores, we propose a new principled class of constraints that separates uniqueness from existence dimensions of data quality, and fully supports multiple labels and composite properties. We illustrate benefits of the constraints on real-world examples of property graphs where node integrity is enforced for better update and query performance. The benefits are quantified experimentally in terms of perfectly scaling the access to data through indices that result from the constraints. We establish axiomatic and algorithmic characterizations for the underlying implication problem. In addition, we fully characterize which non-redundant families of constraints attain maximum cardinality for any given finite sets of labels and properties. We exemplify further use cases of the constraints: elicitation of business rules, identification of data quality problems, and design for data quality. Finally, we propose extensions to managing the integrity of objects in object stores such as graph databases.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"15 1","pages":"1 - 29"},"PeriodicalIF":2.1,"publicationDate":"2023-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73458508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}