Student Academic Success Prediction Using Learning Management Multimedia Data With Convoluted Features and Ensemble Model
Abdullah Al-Ameri, Waleed Al-Shammari, Aniello Castiglione, Michele Nappi, Chiara Pero, Muhammad Umer
Journal of Data and Information Quality, https://doi.org/10.1145/3687268, published 2024-08-10
Predicting students’ academic success is crucial for educational institutions to provide targeted support and interventions to those at risk of underperforming. With the increasing adoption of digital learning management systems (LMS), there has been a surge in multimedia data, opening new avenues for predictive analytics in education. Anticipating students’ academic performance can function as an early alert system for those facing potential failure, enabling educational institutions to implement interventions proactively. This study proposes leveraging features extracted by a convolutional neural network (CNN) in conjunction with machine learning models to enhance predictive accuracy. This approach obviates the need for manual feature extraction and yields superior outcomes compared to using machine learning and deep learning models independently. Initially, nine machine learning models are applied to both the original and the convoluted features. The top-performing individual models are then combined into an ensemble model; specifically, this work combines a support vector machine (SVM) and a random forest (RF) for academic performance prediction. The efficacy of the proposed method is validated against existing models, demonstrating its superior performance. With an accuracy of 97.88% and precision, recall, and F1 scores of 98%, the proposed approach attains outstanding results in forecasting student academic success. This study contributes to the burgeoning field of predictive analytics in education by showcasing the effectiveness of combining multimedia data from learning management systems with convoluted features and ensemble modeling techniques.
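The final ensemble step can be illustrated with a short, hedged sketch: a scikit-learn soft-voting combination of SVM and RF. It assumes the CNN-derived ("convoluted") features are already available as a numeric matrix (synthetic data stands in below) and is not the authors' actual pipeline; soft voting, which averages the two models' class probabilities, is just one common way such an ensemble can be built.

```python
# Minimal sketch of an SVM + RF soft-voting ensemble, assuming the
# CNN-extracted ("convoluted") features are already available as a
# numeric matrix X with labels y. Synthetic data stands in here.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for CNN-extracted features of LMS activity records.
X, y = make_classification(n_samples=1000, n_features=64, n_informative=20,
                           n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", probability=True, random_state=0))
rf = RandomForestClassifier(n_estimators=300, random_state=0)

# Soft voting averages the predicted class probabilities of SVM and RF.
ensemble = VotingClassifier(estimators=[("svm", svm), ("rf", rf)], voting="soft")
ensemble.fit(X_train, y_train)
print(classification_report(y_test, ensemble.predict(X_test)))
```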
Active Learning for Data Quality Control: A Survey
Na Li, Yiyang Qi, Chaoran Li, Zhiming Zhao
Journal of Data and Information Quality, https://doi.org/10.1145/3663369, published 2024-05-11

Data quality plays a vital role in scientific research and decision-making across industries. It is therefore crucial to incorporate a data quality control (DQC) process, which comprises various actions and operations to detect and correct data errors. The increasing adoption of machine learning (ML) techniques in different domains has raised concerns about data quality in the ML field. On the other hand, ML’s capability to uncover complex patterns makes it suitable for addressing challenges involved in the DQC process. However, supervised learning methods demand abundant labeled data, while unsupervised learning methods rely heavily on the underlying distribution of the data. Active learning (AL) provides a promising solution by proactively selecting data points for inspection, thus reducing the burden of data labeling for domain experts. Therefore, this survey focuses on applying AL to DQC. Starting with a review of common data quality issues and solutions in the ML field, we aim to enhance the understanding of current quality assessment methods. We then present two scenarios, pool-based and stream-based, that illustrate how AL can be adopted in DQC systems for the anomaly detection task. Finally, we outline the remaining challenges and research opportunities in this field.
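To make the pool-based scenario concrete, the sketch below shows a generic uncertainty-sampling loop in which an oracle array plays the role of the domain expert inspecting suspicious records. It is an illustrative assumption of how AL plugs into a DQC inspection workflow, not code from the survey.

```python
# Generic pool-based active learning loop with uncertainty sampling.
# The y_oracle array stands in for the domain expert who labels queried points.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_pool, y_oracle = make_classification(n_samples=2000, n_features=10,
                                       weights=[0.9, 0.1], random_state=0)

# Seed set with a few examples of each class (normal vs. anomalous).
labeled = list(np.where(y_oracle == 0)[0][:10]) + list(np.where(y_oracle == 1)[0][:10])
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)
for _ in range(10):                                    # 10 query rounds
    model.fit(X_pool[labeled], y_oracle[labeled])
    proba = model.predict_proba(X_pool[unlabeled])[:, 1]
    uncertainty = -np.abs(proba - 0.5)                 # closest to 0.5 = most uncertain
    query = np.argsort(uncertainty)[-10:]              # pick the 10 most uncertain points
    queried = [unlabeled[i] for i in query]
    labeled.extend(queried)                            # the expert labels them
    unlabeled = [i for i in unlabeled if i not in set(queried)]

print(f"inspected {len(labeled)} of {len(X_pool)} records after 10 rounds")
```

In the stream-based setting, the same uncertainty criterion would instead be applied to each arriving record to decide on the spot whether to route it to the expert.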
Data Validation Utilizing Expert Knowledge and Shape Constraints
F. Bachinger, Lisa Ehrlinger, G. Kronberger, Wolfram Wöß
Journal of Data and Information Quality, https://doi.org/10.1145/3661826, published 2024-05-11
Data validation is a primary concern in any data-driven application, as undetected data errors may negatively affect machine learning models and lead to suboptimal decisions. Data quality issues are usually detected manually by experts, which becomes infeasible and uneconomical for large volumes of data. To enable automated data validation, we propose “shape constraint-based data validation”, a novel approach based on machine learning that incorporates expert knowledge in the form of shape constraints. Shape constraints can be used to describe expected (multivariate and nonlinear) patterns in valid data, and enable the detection of invalid data which deviates from these expected patterns. Our approach can be divided into two steps: (1) shape-constrained prediction models are trained on data, and (2) their training error is analyzed to identify invalid data. The training error can be used as an indicator for invalid data because shape-constrained models can fit valid data better than invalid data. We evaluate the approach on a benchmark suite consisting of synthetic datasets, which we have published for benchmarking similar data validation approaches. Additionally, we demonstrate the capabilities of the proposed approach with a real-world dataset consisting of measurements from a friction test bench in an industrial setting. Our approach detects subtle data errors that are difficult to identify even for domain experts.
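The two-step idea, fit a shape-constrained model and then inspect its training error, can be sketched as follows. The authors' specific shape-constrained models are not reproduced here; as a stand-in, this hypothetical example expresses a monotonicity constraint through scikit-learn's HistGradientBoostingRegressor and flags the largest training residuals as suspect.

```python
# Sketch of residual-based data validation with a shape-constrained model.
# Assumption: the valid relationship y = f(x) is monotonically increasing in x;
# records violating this expectation are hard to fit and show large residuals.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(500, 1))
y = 2.0 * x[:, 0] + rng.normal(0, 0.3, size=500)      # valid, increasing pattern
y[:20] = 20.0 - 2.0 * x[:20, 0]                       # inject invalid (decreasing) records

# monotonic_cst=[1] constrains the model to be non-decreasing in the feature.
model = HistGradientBoostingRegressor(monotonic_cst=[1], max_iter=200, random_state=0)
model.fit(x, y)

residuals = np.abs(y - model.predict(x))              # training error per record
threshold = np.quantile(residuals, 0.95)
suspect = np.where(residuals > threshold)[0]          # candidates for invalid data
print("flagged records:", suspect[:10], "...")
```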
{"title":"Data Validation Utilizing Expert Knowledge and Shape Constraints","authors":"F. Bachinger, Lisa Ehrlinger, G. Kronberger, Wolfram Wöß","doi":"10.1145/3661826","DOIUrl":"https://doi.org/10.1145/3661826","url":null,"abstract":"Data validation is a primary concern in any data-driven application, as undetected data errors may negatively affect machine learning models and lead to suboptimal decisions. Data quality issues are usually detected manually by experts, which becomes infeasible and uneconomical for large volumes of data.\u0000 To enable automated data validation, we propose “shape constraint-based data validation”, a novel approach based on machine learning that incorporates expert knowledge in the form of shape constraints. Shape constraints can be used to describe expected (multivariate and nonlinear) patterns in valid data, and enable the detection of invalid data which deviates from these expected patterns. Our approach can be divided into two steps: (1) shape-constrained prediction models are trained on data, and (2) their training error is analyzed to identify invalid data. The training error can be used as an indicator for invalid data because shape-constrained models can fit valid data better than invalid data.\u0000 We evaluate the approach on a benchmark suite consisting of synthetic datasets, which we have published for benchmarking similar data validation approaches. Additionally, we demonstrate the capabilities of the proposed approach with a real-world dataset consisting of measurements from a friction test bench in an industrial setting. Our approach detects subtle data errors that are difficult to identify even for domain experts.","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":" 1163","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140988959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Editorial: Special Issue on Human in the Loop Data Curation
Gianluca Demartini, Shazia Sadiq, Jie Yang
Journal of Data and Information Quality, https://doi.org/10.1145/3650209, published 2024-03-23

This Special Issue of the Journal of Data and Information Quality (JDIQ) contains novel theoretical and methodological contributions on data curation involving humans in the loop. In this editorial, we summarize the scope of the issue and briefly describe its content.
{"title":"Editorial: Special Issue on Human in the Loop Data Curation","authors":"Gianluca Demartini, Shazia Sadiq, Jie Yang","doi":"10.1145/3650209","DOIUrl":"https://doi.org/10.1145/3650209","url":null,"abstract":"This Special Issue of the Journal of Data and Information Quality (JDIQ) contains novel theoretical and methodological contributions on data curation involving humans in the loop. In this editorial, we summarize the scope of the issue and briefly describe its content.","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":" 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140210008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editor-in-Chief (June 2017–November 2023) Farewell Report","authors":"Tiziana Catarci","doi":"10.1145/3651229","DOIUrl":"https://doi.org/10.1145/3651229","url":null,"abstract":"","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":" 21","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140210958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Connected Components for Scaling Partial-Order Blocking to Billion Entities
Tobias Backes, Stefan Dietze
Journal of Data and Information Quality, https://doi.org/10.1145/3646553, published 2024-02-20

In entity resolution, blocking pre-partitions data for further processing by more expensive methods. Two entity mentions are in the same block if they share identical or related blocking keys. Previous work has sometimes related blocking keys by grouping or alphabetically sorting them, but, as was shown for author disambiguation, the respective equivalences or total orders are not necessarily well suited to model the logical matching relation between blocking keys. To address this, we present a novel blocking approach that exploits the subset partial order over entity representations to build a matching-based bipartite graph, using connected components as blocks. To prevent over- and under-connectedness, we allow specification of overly general representations and generalization of overly specific ones. To build the bipartite graph, we contribute a new parallelized algorithm with a configurable time/space tradeoff for minimal-element search in the subset partial order. As a job-based approach, it combines dynamic scalability with easier integration, making it more convenient than previously described approaches. Experiments on large gold standards for publication records, author mentions, and affiliation strings suggest that our approach is competitive in performance and better addresses domain-specific problems. For duplicate detection and author disambiguation, our method offers the expected performance, as defined by the vector-similarity baseline used in another work on the same dataset and by the common surname, first-initial baseline. For top-level institution resolution, we have reproduced the challenges described in prior work, strengthening the conclusion that for affiliation data, overlapping blocks under minimal elements are more suitable than connected components.
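As a toy illustration of using connected components as blocks (omitting the paper's parallelized minimal-element search and its handling of over- and under-connectedness), the sketch below links entity representations whose blocking-key sets stand in a subset relation and reads off the connected components with a small union-find.

```python
# Toy sketch of blocking via connected components over a subset relation
# between blocking-key sets. O(n^2) pairwise comparison; illustration only.

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]   # path halving
        i = parent[i]
    return i

def union(parent, i, j):
    ri, rj = find(parent, i), find(parent, j)
    if ri != rj:
        parent[rj] = ri

# Each mention is represented by a set of blocking keys (e.g., name tokens).
mentions = [
    {"backes", "t"},
    {"backes", "t", "tobias"},      # superset of mention 0 -> same block
    {"dietze", "s"},
    {"dietze", "s", "stefan"},      # superset of mention 2 -> same block
    {"backes", "t", "dietze"},      # also a superset of mention 0
]

parent = list(range(len(mentions)))
for i in range(len(mentions)):
    for j in range(i + 1, len(mentions)):
        if mentions[i] <= mentions[j] or mentions[j] <= mentions[i]:
            union(parent, i, j)

blocks = {}
for i in range(len(mentions)):
    blocks.setdefault(find(parent, i), []).append(i)
print(list(blocks.values()))        # -> [[0, 1, 4], [2, 3]]
```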
{"title":"Connected Components for Scaling Partial-Order Blocking to Billion Entities","authors":"Tobias Backes, Stefan Dietze","doi":"10.1145/3646553","DOIUrl":"https://doi.org/10.1145/3646553","url":null,"abstract":"\u0000 In entity resolution,\u0000 blocking\u0000 pre-partitions data for further processing by more expensive methods. Two entity mentions are in the same block if they share identical or related\u0000 blocking-keys\u0000 . Previous work has sometimes related blocking keys by grouping or alphabetically sorting them, but – as was shown for author disambiguation – the respective equivalences or total orders are not necessarily well-suited to model the logical matching-relation between blocking keys. To address this, we present a novel blocking approach that exploits the subset\u0000 partial\u0000 order over entity representations to build a matching-based bipartite graph, using connected components as blocks. To prevent over- and underconnectedness, we allow specification of overly general and generalization of overly specific representations. To build the bipartite graph, we contribute a new parallellized algorithm with configurable time/space tradeoff for minimal element search in the subset partial order. As a job-based approach, it combines dynamic scalability and easier integration to make it more convenient than the previously described approaches. Experiments on large gold standards for publication records, author mentions and affiliation strings suggest that our approach is competitive in performance and allows better addressing of domain-specific problems. For duplicate detection and author disambiguation, our method offers the expected performance as defined by the vector-similarity baseline used in another work on the same dataset and the common surname, first-initial baseline. For top-level institution resolution, we have reproduced the challenges described in prior work, strengthening the conclusion that for affiliation data, overlapping blocks under minimal elements are more suitable than connected components.\u0000","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":"10 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139958252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cleenex: Support for User Involvement During an Iterative Data Cleaning Process
João L. M. Pereira, Manuel J. Fonseca, Antónia Lopes, H. Galhardas
Journal of Data and Information Quality, https://doi.org/10.1145/3648476, published 2024-02-15
The existence of large amounts of data increases the probability of data quality problems occurring. A data cleaning process that corrects these problems is usually iterative, because it may need to be re-executed and refined to produce high-quality data. Moreover, due to the specificity of some data quality problems and the inability of data cleaning programs to cover all of them, a user often has to be involved during program execution by manually repairing data. However, no existing data cleaning framework appropriately supports this involvement, a form of human-in-the-loop, in such an iterative process for cleaning structured data. Moreover, data preparation tools that somehow involve the user in data cleaning processes have not been evaluated with real users to assess their effort. Therefore, we propose Cleenex, a data cleaning framework that supports user involvement during an iterative data cleaning process, and we conducted two experimental evaluations: an assessment, with a simulated user, of the Cleenex components that support manual data repair, and a comparison, in terms of user involvement, of data preparation tools with real users. Results show that Cleenex components reduce user effort during manual cleaning; for example, the number of tuples visualized is reduced by 99%. Moreover, when performing data cleaning tasks with Cleenex, real users need less time and effort (e.g., half the clicks) and, based on questionnaires, prefer it to the other tools used for comparison, OpenRefine and Pentaho Data Integration.
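Cleenex's interface is not described in the abstract; purely to illustrate the iterative, human-in-the-loop pattern it refers to, the hypothetical sketch below alternates automated cleaning rules with manual repairs of the tuples those rules cannot handle, re-running until no flagged tuples remain.

```python
# Hypothetical illustration of an iterative, human-in-the-loop cleaning loop.
# This is NOT Cleenex's actual API, only the general pattern the abstract describes.

def automated_rules(record):
    """Automated transformation, e.g., normalising whitespace and casing."""
    fixed = dict(record)
    fixed["name"] = " ".join(fixed.get("name", "").split()).title()
    return fixed

def needs_manual_repair(record):
    """Flag tuples the automated rules cannot resolve (here: a non-numeric year)."""
    return not str(record.get("year", "")).isdigit()

def ask_expert(record):
    """Stand-in for the user repairing a flagged tuple in the tool's interface."""
    repaired = dict(record)
    repaired["year"] = "1912"   # value the expert would supply manually
    return repaired

def iterative_clean(records, max_rounds=3):
    for _ in range(max_rounds):
        records = [automated_rules(r) for r in records]
        dirty = [i for i, r in enumerate(records) if needs_manual_repair(r)]
        if not dirty:                      # nothing left to inspect: stop iterating
            break
        for i in dirty:                    # only flagged tuples reach the expert
            records[i] = ask_expert(records[i])
    return records

print(iterative_clean([{"name": "  ada   lovelace ", "year": "1815"},
                       {"name": "alan turing", "year": None}]))
```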
{"title":"Cleenex: Support for User Involvement During an Iterative Data Cleaning Process","authors":"João L. M. Pereira, Manuel J. Fonseca, Antónia Lopes, H. Galhardas","doi":"10.1145/3648476","DOIUrl":"https://doi.org/10.1145/3648476","url":null,"abstract":"The existence of large amounts of data increases the probability of occurring data quality problems. A data cleaning process that corrects these problems is usually an iterative process because it may need to be re-executed and refined to produce high quality data. Moreover, due to the specificity of some data quality problems and the limitation of data cleaning programs to cover all problems, often a user has to be involved during the program executions by manually repairing data. However, there is no data cleaning framework that appropriately supports this involvement in such an iterative process, a form of human-in-the-loop, to clean structured data. Moreover, data preparation tools that somehow involve the user in data cleaning processes have not been evaluated with real users to assess their effort.\u0000 Therefore, we propose Cleenex, a data cleaning framework with support for user involvement during an iterative data cleaning process and conducted two data cleaning experimental evaluations: an assessment of the Cleenex components that support the user when manually repairing data with a simulated user, and a comparison, in terms of user involvement, of data preparation tools with real users.\u0000 Results show that Cleenex components reduce the user effort when manually cleaning data during a data cleaning process, for example the number of tuples visualized is reduced in 99%. Moreover, when performing data cleaning tasks with Cleenex, real users need less time/effort (e.g., half the clicks) and, based on questionnaires, prefer it to the other tools used for comparison, OpenRefine and Pentaho Data Integration.","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":"867 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139894342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}