ACM Journal of Data and Information Quality: Latest Articles, Page 2

A data centric AI framework for automating exploratory data analysis and data quality tasks

IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Journal of Data and Information Quality

Pub Date : 2023-06-26 DOI: 10.1145/3603709

Hima Patel, Shanmukha C. Guttula, Nitin Gupta, Sandeep Hans, Ruhi Sharma Mittal, Lokesh N

Democratisation of machine learning (ML) has been an important theme in the research community for the last several years, with notable progress made by the model-building community on automated machine learning models. However, data plays a central role in building ML models, and there is a need to focus on data-centric AI innovations. In this paper, we first map the steps taken by data scientists in the data preparation phase and identify open areas and pain points via user interviews. We then propose a framework and four novel algorithms for the exploratory data analysis and data quality for AI steps, addressing the pain points from the user interviews. We also validate our algorithms with open-source datasets and show the effectiveness of our proposed methods. Next, we build a tool that automatically generates Python code encompassing the above algorithms and study the usefulness of these algorithms via two user studies with data scientists. We observe from the first study that the participants who used the tool achieved 2X productivity and 6% model improvement over the control group. The second study is performed in a more realistic environment to understand how the tool would be used in real-world scenarios. The results from this study are consistent with the first study and show average time savings of 30-50% that can be attributed to the tool.

Citations: 0

Editorial for the Special Issue on Quality Assessment of Data Security

Pub Date : 2023-06-22 DOI: 10.1145/3591360

Gautam Srivastava, Jerry Chun‐wei Lin, Zhihan Lv

Due to rapid technical advancements, many devices such as sensors, embedded systems, actuators, and mobile/smart devices receive huge amounts of information through data exchange and interconnectivity. This increase in data exchange correlates directly with the amount of sensitive information that moves through systems continuously. In this context, it is critical to ensure that private and personal data are not disclosed and that any confidential information can be successfully hidden. Therefore, security and privacy have attracted a great deal of attention in academia and industry in recent decades. Not only is there a reason to protect against the leakage of sensitive data, but it is also imperative to ensure that users of such systems trust the means by which their data is exchanged. Hundreds of security solutions have recently been discussed in the literature. However, properly managing the quality of security, so that developed models and algorithms can in fact secure data, is a very important task, and only a limited number of works have addressed this problem directly. Since exchanged data is usually complex, researchers should also develop and investigate security models to perform quality assessments of data security. These tasks will help ensure that threats from hackers or malware are minimized. Security solutions can take on many forms. From cryptographic primitives all the way to machine learning and artificial intelligence, these potential fail-safes need to be properly researched, disseminated, and discussed to ensure the next generation of systems will adhere to certain standards in the realm of security and privacy. This special issue saw a total of 21 submissions, from which five papers were published. A strict acceptance rate was applied intentionally to ensure that only the best papers in the scope of the special issue were accepted.
The following few paragraphs summarize the contributions that our special issue collection presents. In “A Survey on Edge Intelligence and Lightweight Machine Learning Support for Future Applications and Services,” Hoffpauir et al. provided a comprehensive survey of emerging edge intelligence applications, lightweight machine learning algorithms, and their support for future applications and services. The survey started by analyzing the rise of cloud computing, discussing its weak points, and identifying situations in which edge computing provides advantages over traditional cloud computing architectures. It then proceeded in three sections: the first identifying opportunities and domains for edge computing growth, the second identifying algorithms and approaches that can be used to enhance edge intelligence implementations, and the third specifically analyzing situations in which edge intelligence can be enhanced using any of the aforementioned algorithms or approaches. In this third section, lightweight machine learning approaches

Citations: 0

Clustering Heterogeneous Data Values for Data Quality Analysis

Pub Date : 2023-06-22 DOI: 10.1145/3603710

Viola Wenz, Arno Kesper, G. Taentzer

Data is of high quality if it is fit for its intended purpose. Data heterogeneity can be a major quality problem, as quality aspects such as understandability and consistency can be compromised. Heterogeneity of data values is particularly common when data is manually entered by different people using inadequate control rules. In this case, syntactic and semantic heterogeneity often go hand in hand. Heterogeneity of data values may be a direct result of problems in the acquisition process, quality problems of the underlying data model, or possibly erroneous data transformations. For example, in the cultural heritage domain, it is common to analyze data fields by manually searching lists of data values sorted alphabetically or by number of occurrences. Additionally, search functions such as regular expression matching are used to detect specific patterns. However, this requires a priori knowledge and technical skills that domain experts often do not have. Since such datasets often contain thousands of values, the entire process is very time-consuming. Outliers or subtle differences between values that may be critical to data quality can be easily overlooked. To improve this process of analyzing the quality of data values, we propose a bottom-up human-in-the-loop approach that clusters values of a data field according to syntactic similarity. The clustering is intended to help domain experts explore the heterogeneity of values in a data field and can be configured by domain experts according to their domain knowledge. The overview of the syntactic diversity of the data values gives an impression of the rules and practices of data acquisition as well as their violations. From this, experts can infer potential quality issues with the data acquisition process and system, as well as the data model and data transformations. We outline a proof-of-concept implementation of the approach. Our evaluation found that clustering adds value to data quality analysis, especially for detecting quality problems in data models.
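
To make the idea of grouping values by syntactic similarity concrete, here is a minimal Python sketch. It is not the authors' algorithm: it assumes a deliberately crude similarity notion in which two values belong together when they share the same character-class pattern (letter runs become `A`, digit runs become `9`). The function names and the sample date values are hypothetical.

```python
import re
from collections import defaultdict


def syntactic_pattern(value: str) -> str:
    """Abstract a value into a coarse pattern (assumed similarity notion)."""
    pattern = re.sub(r"\s+", " ", value.strip())   # normalize whitespace
    pattern = re.sub(r"[^\W\d_]+", "A", pattern)   # runs of letters -> "A"
    pattern = re.sub(r"\d+", "9", pattern)         # runs of digits  -> "9"
    return pattern


def cluster_by_pattern(values):
    """Group values that share a pattern, largest cluster first."""
    clusters = defaultdict(list)
    for value in values:
        clusters[syntactic_pattern(value)].append(value)
    return dict(sorted(clusters.items(), key=lambda item: -len(item[1])))


if __name__ == "__main__":
    # Hypothetical values of a free-text "date" field.
    dates = ["1850", "1720", "ca. 1850", "1850-1860", "um 1900", "1900?"]
    for pattern, members in cluster_by_pattern(dates).items():
        print(f"{pattern!r}: {members}")
```

Small clusters in such an overview are the candidates a domain expert would inspect first, since they may indicate outliers or competing entry conventions.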

Citations: 0

Soft Computing Techniques for Detecting Cyberbullying in Social Multimedia Data

Pub Date : 2023-06-20 DOI: 10.1145/3604617

Yang Jing, Haowei Ma, A. Ansari, G. Sucharitha, B. Omarov, Sandeep Kumar, M. Mohammadi, Khaled A. Z. Alyamani

Cyberbullying (CB) is a form of abuse, manipulation, or humiliation directed against a person via the Internet. CB makes use of nasty Internet comments and remarks. It occurs when someone publicly mocks, insults, slanders, or criticizes another person while remaining anonymous on the Internet. As a result, there is a rising need for new methods of sifting through data on social media sites for signs of cyberbullying, with the goal of lessening its negative consequences. This article discusses a soft computing-based methodology for detecting cyberbullying in social multimedia data. The model takes social media data as input. Normalization is performed to remove noise from the data. Particle Swarm Optimization (PSO) is then applied to optimize the features, which helps make cyberbullying detection more accurate. An LSTM model performs the classification. Trained on social media data, the PSO LSTM model improves at finding cyberbullying. The accuracy of PSO LSTM is 99.1%, which is 2.9% higher than the accuracy of the AdaBoost technique and 10.4% higher than that of the KNN technique. The specificity and sensitivity of the PSO-based LSTM are also higher than those of the KNN and AdaBoost algorithms.
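
For readers unfamiliar with Particle Swarm Optimization, the following is a minimal Python sketch of the basic update rule. It is an illustration under stated assumptions, not the pipeline from the article: it minimizes a toy sphere function instead of optimizing text features, and the hyperparameters (`inertia`, `cognitive`, `social`, swarm size) are common textbook defaults chosen here for illustration.

```python
import random


def sphere(position):
    """Toy objective (assumption): sum of squares, minimized at the origin."""
    return sum(x * x for x in position)


def pso(objective, dim=2, particles=20, iterations=100,
        inertia=0.7, cognitive=1.5, social=1.5, bound=5.0, seed=0):
    """Basic global-best PSO; returns (best position, best cost, initial best cost)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]                 # personal bests
    best_cost = [objective(p) for p in pos]
    g = min(range(particles), key=best_cost.__getitem__)
    global_pos, global_cost = best_pos[g][:], best_cost[g]
    initial_cost = global_cost

    for _ in range(iterations):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull to personal best, pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + cognitive * r1 * (best_pos[i][d] - pos[i][d])
                             + social * r2 * (global_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            cost = objective(pos[i])
            if cost < best_cost[i]:
                best_pos[i], best_cost[i] = pos[i][:], cost
                if cost < global_cost:
                    global_pos, global_cost = pos[i][:], cost
    return global_pos, global_cost, initial_cost


if __name__ == "__main__":
    position, cost, start = pso(sphere)
    print(f"initial best cost {start:.4f}, final best cost {cost:.3e}")
```

In a feature-optimization setting, the objective would instead score a candidate feature weighting or subset, for example by validation error of the downstream classifier; that substitution is left out here to keep the sketch self-contained.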

Citations: 0

A Method to Screen, Assess, and Prepare Open Data for Use

Pub Date : 2023-06-20 DOI: 10.1145/3603708

P. Krasikov, Christine Legner

Open data's value-creating capabilities and innovation potential are widely recognized, resulting in a notable increase in the number of published open data sources. A crucial challenge for companies intending to leverage open data is to identify suitable open datasets that support specific business scenarios and prepare these datasets for use. Researchers have developed several open data assessment techniques, but those are restricted in scope, do not consider the use context, and are not embedded in the complete set of activities required for open data consumption in enterprises. Therefore, our research aims to develop prescriptive knowledge in the form of a meaningful method to screen, assess, and prepare open data for use in an enterprise setting. Our findings complement existing open data assessment techniques by providing methodological guidance to prepare open data of uncertain quality for use in a value-adding and demand-oriented manner, enabled by knowledge graphs and linked data concepts. From an academic perspective, our research conceptualizes open data preparation as a purposeful and value-creating process.

Citations: 0

Transactional Services for Concurrent Mobile Agents over Edge/Cloud Computing-Assisted Social Internet of Things

Pub Date : 2023-06-15 DOI: 10.1145/3603714

Ahmad Al-qerem, A. Ali, S. Nashwan, Mohammad Alauthman, Ala Hamarsheh, Ahmad Nabot, Issam Jibreen

The Web of Things (WoT) is a concept that aims to create a network of intelligent devices capable of remote monitoring, service provisioning, and control. Virtual and physical Internet of Things (IoT) gateways facilitate communication, processing, and storage among the social nodes that form the social Web of Things (SWoT). Peripheral IoT services commonly use device data. However, due to the limited bandwidth and processing power of edge devices in the IoT, they must dynamically alter the quality of service provided to their connected clients to meet each user's needs while also meeting the service quality requirements of other devices that may access the same data. Consequently, deciding which transactions get access to which IoT data is a scheduling problem. Edge-cloud computing requires transaction management because several IoT transactions may access shared data simultaneously. However, cloud transaction management methods cannot be employed in edge-cloud computing settings. Transaction management models must account for the ACID properties of transactions, especially consistency. This study compares three execution strategies, Edge Host Strategy (EHS), Cloud Host Strategy (CHS), and a hybrid strategy (BHS), which execute all IoT transactions on the edge host, on the cloud, and on both hosts, respectively. These transactions affect the edge hosts as well. An IoTT framework is provided that views an IoT transaction as a collection of essential and additional subtransactions, which loosens atomicity. The execution strategy controls the essential and additional subtransactions. The integration of edge and cloud computing demonstrates that the execution approach significantly affects system performance. EHS and CHS can waste wireless bandwidth, while BHS outperforms CHS and EHS in many scenarios. These solutions enable edge transactions to complete without restarting due to outdated IoT data or other edge or cloud transactions. The properties of these approaches have been detailed, showing that they often outperform concurrent protocols and can improve edge-cloud computing.
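
The notion of essential and additional subtransactions that loosen atomicity can be sketched as follows. This is a hypothetical Python model, not the framework from the article: it assumes a transaction commits as long as every essential subtransaction succeeds, that a failed additional subtransaction is simply skipped, and it does not model rollback, scheduling, or the choice of host. The names `Subtransaction` and `run_transaction` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subtransaction:
    name: str
    action: Callable[[], bool]   # returns True on success
    essential: bool = True       # additional subtransactions may fail


def run_transaction(subtransactions: List[Subtransaction]) -> dict:
    """Commit if all essential subtransactions succeed (assumed rule)."""
    completed, skipped = [], []
    for sub in subtransactions:
        if sub.action():
            completed.append(sub.name)
        elif sub.essential:
            # An essential failure aborts the whole transaction.
            return {"status": "aborted", "failed": sub.name,
                    "completed": completed, "skipped": skipped}
        else:
            # An additional failure is tolerated: atomicity is relaxed.
            skipped.append(sub.name)
    return {"status": "committed", "failed": None,
            "completed": completed, "skipped": skipped}


if __name__ == "__main__":
    transaction = [
        Subtransaction("read_sensor", lambda: True),
        Subtransaction("push_analytics", lambda: False, essential=False),
        Subtransaction("update_state", lambda: True),
    ]
    print(run_transaction(transaction))
```

Under strict atomicity the failed `push_analytics` step would abort everything; marking it as additional lets the transaction finish without a restart.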

Citations: 0

Joint IoT/ML Platforms for Smart Societies and Environments: A Review on Multimodal Information-Based Learning for Safety and Security

Pub Date : 2023-06-15 DOI: 10.1145/3603713

Hani Attar

The Internet of Things (IoT) is widely expected to have far-reaching economic, business, and societal implications for our smart lives; indeed, IoT technologies play an essential role in creating a variety of smart applications that improve the quality and well-being of life in the real world. At the same time, the interconnected nature of IoT systems and the variety of components in their implementation have given rise to new security concerns. Cyber-attacks and threats in the IoT ecosystem significantly impact the development of new intelligent applications. Moreover, the IoT ecosystem suffers from inherent vulnerabilities that leave its devices unable to benefit from existing security techniques such as authentication, access control, encryption, and network security. Recently, great advances have been achieved in the fields of Machine Intelligence (MI), Deep Learning (DL), and Machine Learning (ML), which have been applied to many important applications. ML and DL are regarded as efficient data exploration techniques for discovering “normal” and “abnormal” IoT component and device behavior inside the IoT ecosystem. Therefore, ML/DL approaches are required to move the security of IoT systems from providing safe Device-to-Device (D2D) communication to providing security-based intelligence systems. This work examines ML/DL technologies that may be utilized to provide superior security solutions for IoT devices. The potential security risks associated with the IoT are discussed, including pre-existing and newly emerging threats. Furthermore, the benefits and challenges of DL and ML techniques for enhancing IoT security are examined.
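
As a rough illustration of separating "normal" from "abnormal" device behavior, the sketch below uses a plain z-score threshold. This is a simple statistical stand-in, not an ML or DL model and not a method from the review; the readings, the threshold of 3 standard deviations, and the function names are assumptions made for the example.

```python
from statistics import mean, stdev


def fit_baseline(normal_readings):
    """Summarize known-normal readings as (mean, standard deviation)."""
    return mean(normal_readings), stdev(normal_readings)


def is_abnormal(reading, baseline, threshold=3.0):
    """Flag a reading whose z-score exceeds the (assumed) threshold."""
    mu, sigma = baseline
    if sigma == 0:
        return reading != mu   # no spread observed: any deviation is flagged
    return abs(reading - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical packets-per-second readings from one device.
    baseline = fit_baseline([10.0, 11.0, 9.0, 10.5, 9.5])
    for reading in (10.2, 25.0):
        label = "abnormal" if is_abnormal(reading, baseline) else "normal"
        print(reading, label)
```

A learned model would replace the fixed mean-and-deviation baseline with one that captures multivariate or temporal structure, but the decision step, comparing new behavior against a profile of normal behavior, keeps the same shape.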

Citations: 0

Deep Learning–based Dynamic User Alignment in Social Networks

IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Journal of Data and Information Quality

Pub Date : 2023-06-14 DOI: 10.1145/3603711

K. Matrouk, Srikanth V, Sumit Kumar, Mohit Kumar Bhadla, Mirza Sabirov, M. Saadh

Academics and businesses are paying intense attention to social network alignment, which aligns different social networks through their shared members. All studies to date treat the social network as static and ignore its innate dynamism. In reality, an individual's discriminative pattern is embedded in the dynamics of social networks, and this information may be used to improve social network alignment. This study finds that these dynamics can reveal more apparent patterns better suited to aligning the social web of things (SWoT). The correlation between the user structure and attributes of each social network must be maintained in order to combine the binary dynamics and build the original synthetic embedding representation. Finally, the initial embedding of each network is projected to a target subspace as part of the semi-supervised spatial transformation learning process. In an extensive series of experiments on real-world datasets, the Dynamic Social Network Alignment approach outperforms the current mainstream algorithm by 10%. The findings of this study show that this alignment of enormous networks addresses the volume, variety, velocity, and veracity (the 4Vs) of vast networks. Adversarial learning techniques can be applied to improve the efficacy and resilience of an adversarial network alignment. The results show that the model with structure, attribute, and time information performs best, the model without attribute information comes second, the model without time information ranks third, and the model without structure information performs worst.

Citations: 0

Fusion-based Representation Learning Model for Multimode User-generated Social Network Content

IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Journal of Data and Information Quality

Pub Date : 2023-06-14 DOI: 10.1145/3603712

R. J. Martin, Rajvardhan Oak, Mukesh Soni, V. Mahalakshmi, Arsalan Muhammad Soomar, Anjali Joshi

As mobile networks and apps have developed, user-generated content (UGC), which includes multi-source heterogeneous data such as user reviews, tags, ratings, scores, images, and videos, has become an essential basis for improving the quality of personalized services. Because the data are multi-source and heterogeneous, big data fusion offers both promise and drawbacks. The key to applying UGC successfully is representation learning that fuses and vectorizes the multi-source heterogeneous content. To this end, a fusion representation learning method for multi-source text and images is proposed. The convolutional fusion technique, in contrast to splicing and fusion, can take into account the varied data characteristics at each size. This research proposes a new data feature fusion strategy based on the convolution operation, inspired by the convolutional neural network. The Doc2vec and LDA models provide the vectorized representation of multi-source text, and a deep convolutional network is used to obtain it. Finally, the proposed algorithm is applied to an Amazon product dataset containing UGC and evaluated by the classification accuracy of items under the vectorized UGC representation; the results show the feasibility and impact of the proposed algorithm.

Citations: 0

Context-aware Big Data Quality Assessment: A Scoping Review

IF 2.1 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Journal of Data and Information Quality

Pub Date : 2023-06-13 DOI: 10.1145/3603707

Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali H. Jaber

The term data quality refers to measuring the fitness of data for its intended usage. Poor data quality leads to inadequate, inconsistent, and erroneous decisions that could escalate the computational cost, cause a decline in profits, and cause customer churn. Thus, data quality is crucial for researchers and industry practitioners. Different factors drive the assessment of data quality. Data context is deemed one of the key factors due to the contextual diversity of real-world use cases of various entities such as people and organizations. Data used in a specific context (e.g., an organization policy) may need to be more efficacious for another context. Hence, implementing a data quality assessment solution in different contexts is challenging. Traditional technologies for data quality assessment have reached the pinnacle of maturity. Existing solutions can solve most of the quality issues. The data context in these solutions is defined as validation rules applied within the ETL (extract, transform, load) process, i.e., the data warehousing process. In contrast to traditional data quality management, it is impossible to specify all the data semantics beforehand for big data. We need context-aware data quality rules to detect semantic errors in a massive amount of heterogeneous data generated at high speed. While many researchers tackle the quality issues of big data, they define the data context from a specific standpoint. Although data quality is a longstanding research issue in academia and industry, it remains an open issue, especially with the advent of big data, which has fostered the challenge of data quality assessment more than ever. This article provides a scoping review of the existing context-aware data quality assessment solutions, starting with the existing big data quality solutions in general and then covering context-aware solutions. The strengths and weaknesses of such solutions are outlined and discussed. The survey showed that none of the existing data quality assessment solutions could guarantee context awareness with the ability to handle big data. Notably, each solution dealt only with a partial view of the context. We compared the existing quality models and solutions to reach a comprehensive view covering the aspects of context awareness when assessing data quality. This led us to a set of recommendations framed in a methodological framework shaping the design and implementation of any context-aware data quality service for big data. Open challenges are then identified and discussed.

Citations: 2