BIGQA: Declarative Big Data Quality Assessment
Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali Jaber
In the big data domain, data quality assessment operations are often complex and must be implementable in a distributed and timely manner. This article generalizes quality assessment operations by providing a new ISO-based declarative data quality assessment framework (BIGQA). BIGQA is a flexible solution that supports data quality assessment in different domains and contexts. It facilitates the planning and execution of big data quality assessment operations for data domain experts and data management specialists at any phase of the data life cycle. This work implements BIGQA to demonstrate its ability to produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans using straightforward operators designed to handle big data and guarantee a high degree of parallelism when executed. Moreover, it allows incremental data quality assessment, avoiding a full scan of the dataset each time a quality assessment operation is required. The results were validated using wireless radiation sensor data and Stack Overflow users’ data to show that the framework can be applied in different contexts. The experiments show a 71% performance improvement on a 1 GB flat file on a single processing machine compared with a non-parallel application, and a 75% performance improvement on a 25 GB flat file in a distributed environment compared with a non-distributed application.
{"title":"BIGQA: Declarative Big Data Quality Assessment","authors":"Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali Jaber","doi":"10.1145/3603706","DOIUrl":"https://doi.org/10.1145/3603706","url":null,"abstract":"In the big data domain, data quality assessment operations are often complex and must be implementable in a distributed and timely manner. This article tries to generalize the quality assessment operations by providing a new ISO-based declarative data quality assessment framework (BIGQA). BIGQA is a flexible solution that supports data quality assessment in different domains and contexts. It facilitates the planning and execution of big data quality assessment operations for data domain experts and data management specialists at any phase in the data life cycle. This work implements BIGQA to demonstrate its ability to produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans using straightforward operators designed to handle big data and guarantee a high degree of parallelism when executed. Moreover, it allows incremental data quality assessment to avoid reading the whole dataset each time the quality assessment operation is required. The result was validated using radiation wireless sensor data and Stack Overflow users’ data to show that it can be implemented within different contexts. The experiments show a 71% performance improvement over a 1 GB flat file on a single processing machine compared with a non-parallel application and a 75% performance improvement over a 25 GB flat file within a distributed environment compared to a non-distributed application.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"56 1","pages":"1 - 30"},"PeriodicalIF":2.1,"publicationDate":"2023-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78304350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Synthetic Generation of Multidimensional Data to Improve Classification Model Validity
Ahmad Al-qerem, A. Ali, Hani Attar, S. Nashwan, Lianyong Qi, Mohammad Kazem Moghimi, A. Solyman
This article compares Generative Adversarial Network (GAN) models and feature selection methods for generating synthetic data to improve the validity of a classification model. Synthetic data generation produces new data samples from existing data to increase the diversity of the data and help the model generalize better. The multidimensional aspect of the data refers to the fact that it can have multiple features or variables describing it. GAN models have proven effective at preserving the statistical properties of the original data. However, the order in which data augmentation and feature selection are applied is crucial for building robust and accurate predictive models. By comparing different GAN models combined with feature selection methods on multidimensional datasets, this article aims to determine the best combination to support the validity of a classification model on multidimensional data.
{"title":"Synthetic Generation of Multidimensional Data to Improve Classification Model Validity","authors":"Ahmad Al-qerem, A. Ali, Hani Attar, S. Nashwan, Lianyong Qi, Mohammad Kazem Moghimi, A. Solyman","doi":"10.1145/3603715","DOIUrl":"https://doi.org/10.1145/3603715","url":null,"abstract":"This article aims to compare Generative Adversarial Network (GAN) models and feature selection methods for generating synthetic data in order to improve the validity of a classification model. The synthetic data generation technique involves generating new data samples from existing data to increase the diversity of the data and help the model generalize better. The multidimensional aspect of the data refers to the fact that it can have multiple features or variables that describe it. The GAN models have proven to be effective in preserving the statistical properties of the original data. However, the order of data augmentation and feature selection is crucial to build robust and accurate predictive models. By comparing the different GAN models with feature selection methods on multidimensional datasets, this article aims to determine the best combination to support the validity of a classification model in multidimensional data.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"16 1","pages":"1 - 20"},"PeriodicalIF":2.1,"publicationDate":"2023-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74930745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Novel Feature Selection Method for Risk Management in High-Dimensional Time Series of Cryptocurrency Market
Erfan Varedi, R. Boostani
In this study, we present a novel feature selection approach to address the challenge of classifying positive and negative risk predictions in the highly volatile cryptocurrency market. The approach maximizes information gain while simultaneously minimizing the similarity of the selected features, yielding a feature set that improves classification accuracy. The proposed method was compared with other feature selection techniques, such as sequential and bidirectional feature selection, univariate feature selection, and the least absolute shrinkage and selection operator. To evaluate the feature selection techniques, several classifiers were employed: XGBoost, k-nearest neighbor, support vector machine, random forest, logistic regression, long short-term memory, and deep neural networks. The features were derived from the time series of the Bitcoin, Binance, and Ethereum cryptocurrencies. Applying the selected features to the different classifiers showed that XGBoost and random forest performed best on the time series datasets; furthermore, the proposed feature selection method achieved the best results on two of the three cryptocurrencies. The best-case accuracy varied between 55% and 68% across the different time series. Note that preprocessed features were used in this research: raw candle data were used to derive efficient features that explain the problem and help the classifiers predict the labels.
{"title":"A Novel Feature Selection Method for Risk Management in High-Dimensional Time Series of Cryptocurrency Market","authors":"Erfan Varedi, R. Boostani","doi":"10.1145/3597309","DOIUrl":"https://doi.org/10.1145/3597309","url":null,"abstract":"In this study, a novel approach for feature selection has been presented in order to overcome the challenge of classifying positive and negative risk prediction in the cryptocurrency market, which contains high fluctuation. This approach is based on maximizing information gain with simultaneously minimizing the similarity of selected features to achieve a proper feature set for improving classification accuracy. The proposed method was compared with other feature selection techniques, such as sequential and bidirectional feature selection, univariate feature selection, and least absolute shrinkage and selection operator. To evaluate the feature selection techniques, several classifiers were employed: XGBoost, k-nearest neighbor, support vector machine, random forest, logistic regression, long short-term memory, and deep neural networks. The features were elicited from the time series of Bitcoin, Binance, and Ethereum cryptocurrencies. The results of applying the selected features to different classifiers indicated that XGBoost and random forest provided better results on the time series datasets. Furthermore, the proposed feature selection method achieved the best results on two (out of three) cryptocurrencies. The accuracy in the best state varied between 55% to 68% for different time series. It is worth mentioning that preprocessed features were used in this research, meaning that raw data (candle data) were used to derive efficient features that can explain the problem and help the classifiers in predicting the labels.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"27 1","pages":"1 - 14"},"PeriodicalIF":2.1,"publicationDate":"2023-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81231253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pipeline Design for Data Preparation for Social Media Analysis
Carlo A. Bono, C. Cappiello, B. Pernici, Edoardo Ramalli, Monica Vitali
In a data-driven culture, in which analytics applications are the main resources for supporting decision-making, the use of high-quality datasets is mandatory to minimize errors and risks. For this reason, data analysis tasks need to be preceded by a data preparation pipeline. The design of such a pipeline is not trivial: the data analyst must carefully choose the appropriate operations, considering several aspects. This is often done through a trial-and-error approach that does not always lead to the most effective solution. In addition, extracting information from social media poses specific problems: only posts relevant to the analysis should be considered, relevance depends on the context under study, the content is multimedia, and automatic filters risk discarding informative posts. In this paper, we propose a systematic approach to support the design of pipelines that can effectively extract a dataset relevant to the goal of a social media analysis. We provide a conceptual model for designing and annotating the data preparation pipeline with quality and performance information, thus giving the data analyst preliminary, context-aware information on the expected quality of the resulting dataset. The generation of metadata describing the processing tasks has been recognized as essential for enabling data sharing and reusability. To this end, the dataset resulting from the pipeline application is automatically annotated with provenance metadata that give a detailed description of all the activities performed by the pipeline. As a case study, we consider the design of a pipeline for creating datasets of images extracted from social media in order to analyze behavioural aspects during the COVID-19 pandemic.
{"title":"Pipeline Design for Data Preparation for Social Media Analysis","authors":"Carlo A. Bono, C. Cappiello, B. Pernici, Edoardo Ramalli, Monica Vitali","doi":"10.1145/3597305","DOIUrl":"https://doi.org/10.1145/3597305","url":null,"abstract":"In a data-driven culture, in which analytics applications are the main resources for supporting decision-making, the use of high-quality datasets is mandatory to minimize errors and risks. For this reason, data analysis tasks need to be preceded by a data preparation pipeline. The design of such a pipeline is not trivial: the data analyst must carefully choose the appropriate operations considering several aspects. This is often performed by adopting a trial-and-error approach that does not always lead to the most effective solution. In addition, extracting information from social media poses specific problems due to the need to consider only posts relevant for the analysis, for its dependence from the context being considered, for its multimedia contents, and for the risk of filtering out informative posts with automatic filters. In this paper, we propose a systematic approach to support the design of pipelines that are able to effectively extract a relevant dataset for the goal of the analysis of data from social media. We provide a conceptual model for designing and annotating the data preparation pipeline with quality and performance information, thus providing the data analyst preliminary information on the expected quality of the resulting dataset in a context-aware manner. The generation of metadata related to the processing tasks has been recognized as essential for enabling data sharing and reusability. To this aim, the dataset resulting from the pipeline application is automatically annotated with provenance metadata to get a detailed description of all the activities performed by the pipeline on them. As a case study, we consider the design of a pipeline for creating datasets of images extracted from social media in order to analyze behavioural aspects during COVID-19.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"1 1","pages":""},"PeriodicalIF":2.1,"publicationDate":"2023-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88878734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Novel Curated Scholarly Graph Connecting Textual and Data Publications
Ornella Irrera, A. Mannocci, P. Manghi, G. Silvello
In the last decade, scholarly graphs have become fundamental to storing and managing scholarly knowledge in a structured and machine-readable way. Methods and tools for the discovery and impact assessment of science rely on such graphs and their quality to serve scientists, policymakers, and publishers. As research data have become central to scholarly communication, scholarly graphs have started to include dataset metadata and the relationships of datasets to publications. Such graphs are the foundation for Open Science investigations, data-article publishing workflows, discovery, and assessment indicators. However, due to the heterogeneity of practices (FAIRness is indeed in the making), they often lack the complete and reliable metadata necessary for accurate data analysis; e.g., dataset metadata is inaccurate, author names are not uniform, and the semantics of the relationships are unknown, ambiguous, or incomplete. This work describes an open and curated scholarly graph that we built and published as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. Overall, the graph contains 4,047 publications, 5,488 datasets, 22 software products, and 21,561 authors; 9,692 edges interconnect publications with datasets and software and are labeled with semantics that indicate whether a publication cites, references, documents, or supplements another product. To ensure high-quality metadata and semantics, we relied on information extracted from the PDFs of the publications and from the dataset and software webpages to curate and enrich node metadata and edge semantics. To the best of our knowledge, this is the first published resource that includes publications and datasets with manually validated and curated metadata.
{"title":"A Novel Curated Scholarly Graph Connecting Textual and Data Publications","authors":"Ornella Irrera, A. Mannocci, P. Manghi, G. Silvello","doi":"10.1145/3597310","DOIUrl":"https://doi.org/10.1145/3597310","url":null,"abstract":"In the last decade, scholarly graphs became fundamental to storing and managing scholarly knowledge in a structured and machine-readable way. Methods and tools for discovery and impact assessment of science rely on such graphs and their quality to serve scientists, policymakers, and publishers. Since research data became very important in scholarly communication, scholarly graphs started including dataset metadata and their relationships to publications. Such graphs are the foundations for Open Science investigations, data-article publishing workflows, discovery, and assessment indicators. However, due to the heterogeneity of practices (FAIRness is indeed in the making), they often lack the complete and reliable metadata necessary to perform accurate data analysis; e.g., dataset metadata is inaccurate, author names are not uniform, and the semantics of the relationships is unknown, ambiguous or incomplete. This work describes an open and curated scholarly graph we built and published as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. Overall the graph contains 4,047 publications, 5,488 datasets, 22 software, 21,561 authors; 9,692 edges interconnect publications to datasets and software and are labeled with semantics that outline whether a publication is citing, referencing, documenting, supplementing another product. To ensure high-quality metadata and semantics, we relied on the information extracted from PDFs of the publications and the datasets and software webpages to curate and enrich nodes metadata and edges semantics. To the best of our knowledge, this is the first ever published resource, including publications and datasets with manually validated and curated metadata.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"42 1","pages":"1 - 24"},"PeriodicalIF":2.1,"publicationDate":"2023-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78953755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biases in Large Language Models: Origins, Inventory, and Discussion
Roberto Navigli, Simone Conia, Björn Ross
In this article, we introduce and discuss the pervasive issue of bias in the large language models that are currently at the core of mainstream approaches to Natural Language Processing (NLP). We first introduce data selection bias, that is, the bias caused by the choice of texts that make up a training corpus. Then, we survey the different types of social bias evidenced in the text generated by language models trained on such corpora, ranging from gender to age, from sexual orientation to ethnicity, and from religion to culture. We conclude with directions focused on measuring, reducing, and tackling the aforementioned types of bias.
{"title":"Biases in Large Language Models: Origins, Inventory, and Discussion","authors":"Roberto Navigli, Simone Conia, Björn Ross","doi":"10.1145/3597307","DOIUrl":"https://doi.org/10.1145/3597307","url":null,"abstract":"In this article, we introduce and discuss the pervasive issue of bias in the large language models that are currently at the core of mainstream approaches to Natural Language Processing (NLP). We first introduce data selection bias, that is, the bias caused by the choice of texts that make up a training corpus. Then, we survey the different types of social bias evidenced in the text generated by language models trained on such corpora, ranging from gender to age, from sexual orientation to ethnicity, and from religion to culture. We conclude with directions focused on measuring, reducing, and tackling the aforementioned types of bias.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"44 1","pages":"1 - 21"},"PeriodicalIF":2.1,"publicationDate":"2023-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88626153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Deep Learning with Discriminant Descriptors for Offensive Memes Detection
A. Alzu’bi, Lojin Bani Younis, A. Abuarqoub, M. Hammoudeh
A meme is a visual representation that illustrates a thought or concept. Memes are spreading steadily in this era of rapidly expanding social media platforms and are becoming increasingly popular forms of expression. In the domain of meme and emotion analysis, the detection of offensive content is a crucial task. However, identifying and comprehending the underlying emotion of a meme can be difficult because its content is multimodal. Additionally, there is a lack of meme datasets that address how offensive a meme is, and the existing ones are biased towards the dominant labels or categories, leading to imbalanced training sets. In this article, we present a descriptive, balanced dataset to help detect the offensive nature of meme content using a proposed multimodal deep learning model. Two deep semantic models, baseline BERT and hateXplain-BERT, are systematically combined with several deep ResNet architectures to estimate the severity of offensive memes. This process is based on the Meme-Merge collection, which we construct from two publicly available datasets. The experimental results demonstrate the model’s effectiveness in classifying offensive memes, achieving F1 scores of 0.7315 and 0.7140 on the baseline datasets and Meme-Merge, respectively. The proposed multimodal deep learning approach also outperformed the baseline model in three meme tasks: metaphor understanding, sentiment understanding, and intention detection.
{"title":"Multimodal Deep Learning with Discriminant Descriptors for Offensive Memes Detection","authors":"A. Alzu’bi, Lojin Bani Younis, A. Abuarqoub, M. Hammoudeh","doi":"10.1145/3597308","DOIUrl":"https://doi.org/10.1145/3597308","url":null,"abstract":"A meme is a visual representation that illustrates a thought or concept. Memes are spreading steadily among people in this era of rapidly expanding social media platforms, and they are becoming increasingly popular forms of expression. In the domain of meme and emotion analysis, the detection of offensives is a crucial task. However, it can be difficult to identify and comprehend the underlying emotion of a meme because its content is multimodal. Additionally, there is a lack of memes datasets that address how offensive a meme is, and the existing ones in this context have a bias towards the dominant labels or categories, leading to an imbalanced training set. In this article, we present a descriptive balanced dataset to help detect the offensive nature of the meme’s content using a proposed multimodal deep learning model. Two deep semantic models, baseline BERT and hateXplain-BERT, are systematically combined with several deep ResNet architectures to estimate the severity of the offensive memes. This process is based on the Meme-Merge collection that we construct from two publicly available datasets. The experimental results demonstrate the model’s effectiveness in classifying offensive memes, achieving F1 scores of 0.7315 and 0.7140 for the baseline datasets and Meme-Merge, respectively. The proposed multimodal deep learning approach also outperformed the baseline model in three meme tasks: metaphor understanding, sentiment understanding, and intention detection.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"63 2 1","pages":"1 - 16"},"PeriodicalIF":2.1,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78623211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Social Data Analytics on the Design and Implementation of an EEG-Mechatronic System Interface
Cameron Aume, S. Pal, Alireza Jolfaei, S. Mukhopadhyay
Devices that read electroencephalography (EEG) signals are widely used for brain-computer interfaces (BCIs). The popularity of BCIs has increased in recent years with the development of several consumer-grade EEG devices that can detect human cognitive states in real time and deliver feedback to enhance human performance. Several previous studies have investigated the fundamentals and essential aspects of EEG in BCIs. However, the significant issue of how consumer-grade EEG devices can be used to effectively control mechatronic systems has received less attention. In this article, we design and implement an EEG BCI system using the OpenBCI Cyton headset and a user interface running a game, exploring how a BCI EEG-mechatronic system interface can streamline the interaction between humans and mechatronic systems. Big Multimodal Social Data (BMSD) analytics can be applied to the high-frequency, high-volume EEG data, allowing us to explore aspects of data acquisition, data processing, and data validation, and to evaluate the Quality of Experience (QoE) of our system. Real-world participants played a game to gather training data that was later used to train multiple machine learning models, including linear discriminant analysis (LDA), k-nearest neighbours (KNN), and a convolutional neural network (CNN). After training the machine learning models, a validation phase took place in which participants tried to play the same game without direct control, with the outputs of the machine learning models determining how the game moved. We find that a CNN trained for the specific user was able to control the game and achieved the highest activation accuracy of the machine learning models tested, along with the highest user-rated QoE, providing significant insight for a future implementation with a mechatronic system.
{"title":"Multimodal Social Data Analytics on the Design and Implementation of an EEG-Mechatronic System Interface","authors":"Cameron Aume, S. Pal, Alireza Jolfaei, S. Mukhopadhyay","doi":"10.1145/3597306","DOIUrl":"https://doi.org/10.1145/3597306","url":null,"abstract":"The devices that can read Electroencephalography (EEG) signals have been widely used for Brain-Computer Interfaces (BCIs). Popularity in the field of BCIs has increased in recent years with the development of several consumer-grade EEG devices that can detect human cognitive states in real-time and deliver feedback to enhance human performance. Several previous studies have been conducted to understand the fundamentals and essential aspects of EEG in BCIs. However, the significant issue of how consumer-grade EEG devices can be used to control mechatronic systems effectively has been given less attention. In this article, we have designed and implemented an EEG BCI system using the OpenBCI Cyton headset and a user interface running a game to explore the concept of streamlining the interaction between humans and mechatronic systems with a BCI EEG-mechatronic system interface. Big Multimodal Social Data (BMSD) analytics can be applied to the high-frequency and high-volume EEG data, allowing us to explore aspects of data acquisition, data processing, and data validation and evaluate the Quality of Experience (QoE) of our system. We employ real-world participants to play a game to gather training data that was later put into multiple machine learning models, including a linear discriminant analysis (LDA), k-nearest neighbours (KNN), and a convolutional neural network (CNN). After training the machine learning models, a validation phase of the experiment took place where participants tried to play the same game but without direct control, utilising the outputs of the machine learning models to determine how the game moved. We find that a CNN trained to the specific user was able to control the game and performed with the highest activation accuracy from the machine learning models tested, along with the highest user-rated QoE, which gives us significant insight for future implementation with a mechatronic system.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"52 1","pages":"1 - 25"},"PeriodicalIF":2.1,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84697711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Method to Classify Data Quality for Decision Making Under Uncertainty
Vanessa Simard, M. Rönnqvist, L. Lebel, N. Lehoux
Every decision-making process is subject to a certain degree of uncertainty. In sectors where the outcomes of planned operations are uncertain and difficult to control, such as forestry, the data describing the available resources can have a large impact on productivity. When planning activities, such data are often assumed to be accurate, which leads to additional replanning effort. Data verification is kept to a minimum even though using erroneous information increases the level of uncertainty. In this context, it is relevant to develop a process for evaluating whether the data used for planning decisions are appropriate, so as to ensure decision validity and provide information for better understanding and action. However, the level of data quality alone can be difficult to interpret and needs to be put into perspective. This article proposes an extension to most data quality assessment techniques that compares current data quality to past quality levels. A classification method is proposed to evaluate the level of data quality in order to support decision making; the classification provides insight into the level of uncertainty associated with the data. The method is then demonstrated using a theoretical case based on the literature and a practical case from the forest sector. Finally, an example shows how classified data quality can improve decisions in a transportation problem.
{"title":"A Method to Classify Data Quality for Decision Making Under Uncertainty","authors":"Vanessa Simard, M. Rönnqvist, L. Lebel, N. Lehoux","doi":"10.1145/3592534","DOIUrl":"https://doi.org/10.1145/3592534","url":null,"abstract":"Every decision-making process is subject to a certain degree of uncertainty. In sectors where the outcomes of the operations planned are uncertain and difficult to control such as in forestry, data describing the available resources can have a large impact on productivity. When planning activities, it is often assumed that such data are accurate, which causes a need for more replanning efforts. Data verification is kept to a minimum even though using erroneous information increases the level of uncertainty. In this context, it is relevant to develop a process to evaluate whether the data used for planning decisions are appropriate, so as to ensure the decision validity and provide information for better understanding and actions. However, the level of data quality alone can sometimes be difficult to interpret and needs to be put into perspective. This article proposes an extension to most data quality assessment techniques by comparing data to past quality levels. A classification method is proposed to evaluate the level of data quality in order to support decision making. Such classification provides insights into the level of uncertainty associated with the data. The method developed is then exploited using a theoretical case based on the literature and a practical case based on the forest sector. An example of how classified data quality can improve decisions in a transportation problem is finally shown.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"26 1","pages":"1 - 27"},"PeriodicalIF":2.1,"publicationDate":"2023-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73809409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Survey of Data Quality Requirements That Matter in ML Development Pipelines
Margaret A. Priestley, Fionntán O'Donnell, E. Simperl
The fitness of the systems in which Machine Learning (ML) is used depends greatly on good-quality data. Specifications of what makes a good-quality dataset have traditionally been defined by the needs of the data users—typically analysts and engineers. Our article critically examines the extent to which established data quality frameworks are applicable to contemporary use cases in ML. Using a review of recent literature at the intersection of ML, data management, and human-computer interaction, we find that the classical “fitness-for-use” view of data quality can benefit from a more stage-specific approach that is sensitive to where in the ML lifecycle the data are encountered. This helps practitioners to plan their data quality tasks in a manner that meets the needs of the stakeholders who will encounter the dataset, whether they be data subjects, software developers, or organisations. We therefore propose a new treatment of traditional data quality criteria, structuring them along two dimensions: (1) the stage of the ML lifecycle where the use case occurs, and (2) the main categories of data quality that can be pursued (intrinsic, contextual, representational, and accessibility). To illustrate how this works in practice, we contribute a temporal mapping of the various data quality requirements that are important at different stages of the ML data pipeline. We also share some implications for data practitioners and organisations that wish to enhance their data management routines in preparation for ML.
{"title":"A Survey of Data Quality Requirements That Matter in ML Development Pipelines","authors":"Margaret A. Priestley, Fionntán O'Donnell, E. Simperl","doi":"10.1145/3592616","DOIUrl":"https://doi.org/10.1145/3592616","url":null,"abstract":"The fitness of the systems in which Machine Learning (ML) is used depends greatly on good-quality data. Specifications on what makes a good-quality dataset have traditionally been defined by the needs of the data users—typically analysts and engineers. Our article critically examines the extent to which established data quality frameworks are applicable to contemporary use cases in ML. Using a review of recent literature at the intersection of ML, data management, and human-computer interaction, we find that the classical “fitness-for-use” view of data quality can benefit from a more stage-specific approach that is sensitive to where in the ML lifecycle the data are encountered. This helps practitioners to plan their data quality tasks in a manner that meets the needs of the stakeholders who will encounter the dataset, whether it be data subjects, software developers or organisations. We therefore propose a new treatment of traditional data quality criteria by structuring them according to two dimensions: (1) the stage of the ML lifecycle where the use case occurs vs. (2) the main categories of data quality that can be pursued (intrinsic, contextual, representational and accessibility). To illustrate how this works in practice, we contribute a temporal mapping of the various data quality requirements that are important at different stages of the ML data pipeline. We also share some implications for data practitioners and organisations that wish to enhance their data management routines in preparation for ML.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"33 1","pages":"1 - 39"},"PeriodicalIF":2.1,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81012919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}