Hima Patel, Shanmukha C. Guttula, Nitin Gupta, Sandeep Hans, Ruhi Sharma Mittal, Lokesh N
Democratisation of machine learning (ML) has been an important theme in the research community for several years, with notable progress made by the model-building community through automated machine learning. However, data plays a central role in building ML models, and there is a need to focus on data-centric AI innovations. In this paper, we first map the steps data scientists take during the data preparation phase and identify open areas and pain points via user interviews. We then propose a framework and four novel algorithms for the exploratory data analysis and data-quality-for-AI steps, addressing the pain points surfaced in the interviews. We validate our algorithms on open-source datasets and show the effectiveness of the proposed methods. Next, we build a tool that automatically generates Python code encompassing these algorithms and study their usefulness via two user studies with data scientists. In the first study, participants who used the tool achieved a 2X productivity gain and a 6% model improvement over the control group. The second study was performed in a more realistic environment to understand how the tool would be used in real-world scenarios. Its results are coherent with the first study and show average time savings of 30-50% attributable to the tool.
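As a rough illustration of the kind of Python code such a tool might generate for the exploratory data analysis and data quality steps, the sketch below computes a few generic checks on a pandas DataFrame; the function name, the specific checks, and the thresholds are illustrative assumptions and are not the paper's four algorithms.

```python
import pandas as pd

# Illustrative, auto-generatable data-quality summary for a tabular dataset.
# The checks below are generic examples, not the paper's proposed algorithms.
def data_quality_report(df: pd.DataFrame, label_col=None) -> dict:
    report = {
        "n_rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),           # exact duplicate records
        "missing_ratio": df.isna().mean().round(3).to_dict(),   # per-column missingness
        "constant_columns": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
    }
    if label_col is not None:
        # class balance, useful before model building
        report["label_distribution"] = (
            df[label_col].value_counts(normalize=True).round(3).to_dict()
        )
    return report

df = pd.DataFrame({"age": [25, None, 25, 40], "city": ["A", "A", "A", "A"], "y": [0, 1, 0, 1]})
print(data_quality_report(df, label_col="y"))
```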
{"title":"A data centric AI framework for automating exploratory data analysis and data quality tasks","authors":"Hima Patel, Shanmukha C. Guttula, Nitin Gupta, Sandeep Hans, Ruhi Sharma Mittal, Lokesh N","doi":"10.1145/3603709","DOIUrl":"https://doi.org/10.1145/3603709","url":null,"abstract":"Democratisation of machine learning (ML) has been an important theme in the research community for the last several years with notable progress made by the model-building community with automated machine learning models. However, data plays a central role in building ML models and there is a need to focus on data-centric AI innovations. In this paper, we first map the steps taken by data scientists for the data preparation phase and identify open areas and pain points via user interviews. We then propose a framework and four novel algorithms for exploratory data analysis and data quality for AI steps addressing the pain points from user interviews. We also validate our algorithms with open-source datasets and show the effectiveness of our proposed methods. Next, we build a tool that automatically generates python code encompassing the above algorithms and study the usefulness of these algorithms via two user studies with data scientists. We observe from the first study results that the participants who used the tool were able to gain 2X productivity and 6% model improvement over the control group. The second study is performed in a more realistic environment to understand how the tool would be used in real-world scenarios. The results from this study are coherent with the first study and show an average of 30-50% of time savings that can be attributed to the tool.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"105 1","pages":""},"PeriodicalIF":2.1,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80653634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to rapid technical advancements, many devices such as sensors, embedded systems, actuators, and mobile/smart devices receive huge amounts of information through data exchange and interconnectivity. With this increase in data exchange, sensitive information also moves through systems continuously. In this context, it is critical to ensure that private and personal data are not disclosed and that confidential information can be successfully hidden. Therefore, security and privacy have attracted a great deal of attention in academia and industry in recent decades. Not only must sensitive data be protected against leakage, but users of such systems must also be able to trust the means by which their data is exchanged. Hundreds of security solutions have recently been discussed in the literature. However, properly managing the quality of security, i.e., ensuring that developed models and algorithms can actually secure data, is a very important task, and only a limited number of works have addressed this problem directly. Since exchanged data is usually complex, researchers should also develop and investigate security models that perform quality assessments of data security. These tasks will ensure that threats from hackers or malware can be minimized. Security solutions take many forms, from cryptographic primitives to machine learning and artificial intelligence, and these potential fail-safes need to be properly researched, disseminated, and discussed to ensure that the next generation of systems adheres to rigorous standards of security and privacy. This special issue received a total of 21 submissions, from which five papers were published; we intentionally adhered to a strict acceptance rate to ensure that only the best papers within the scope of the special issue were accepted. The following paragraphs summarize the contributions in our special issue collection. In “A Survey on Edge Intelligence and Lightweight Machine Learning Support for Future Applications and Services,” Hoffpauir et al. provided a comprehensive survey of emerging edge intelligence applications, lightweight machine learning algorithms, and their support for future applications and services. The survey started by analyzing the rise of cloud computing, discussing its weak points, and identifying situations in which edge computing provides advantages over traditional cloud computing architectures. It then proceeded in three sections: the first identifying opportunities and domains for edge computing growth, the second identifying algorithms and approaches that can be used to enhance edge intelligence implementations, and the third specifically analyzing situations in which edge intelligence can be enhanced using any of the aforementioned algorithms or approaches. In this third section, lightweight machine learning approaches
{"title":"Editorial for the Special Issue on Quality Assessment of Data Security","authors":"Gautam Srivastava, Jerry Chun‐wei Lin, Zhihan Lv","doi":"10.1145/3591360","DOIUrl":"https://doi.org/10.1145/3591360","url":null,"abstract":"Due to rapid technical advancements, many devices such as sensors, embedded systems, actuators, and mobile/smart devices receive huge amounts of information through data exchange and interconnectivity. From this increase in the exchange of data, there has also been a direct correlation to sensitive information that also moves through systems continuously. In this context, it is critical to ensure that both private and personal data is not disclosed and that any confidential information can be successfully hidden. Therefore, security and privacy have attracted a great deal of attention in academia and industry in recent decades. Not only is there a reason to protect against data leakage that is sensitive in nature, but it is also imperative to ensure that users of such systems trust the means by which their data is exchanged. Hundreds of security solutions have recently been discussed in the literature. However, the ability to properly manage the quality of security to ensure that developed models and algorithms can secure data is a very important task. To that end, only a limited number of works have addressed this problem directly. Since exchanged data usually is complex, researchers should also develop and investigate security models to perform quality assessments of data security. These tasks will ensure that threats from hackers or malware can be minimized. Security solutions can take on many forms. From cryptographic primitives all the way to machine learning and artificial intelligence, these potential fail-safes need to be properly researched, disseminated and discussed to ensure the next generation of systems will adhere to certain standards in the realm of security and privacy. This special issue saw a total of 21 submissions, from which five papers were published. It was intentional to adhere to a strict acceptance rate and ensure that only the best papers in the scope of the special issue were accepted. The following few paragraphs summarize the contributions that our special issue collection presents. In “A Survey on Edge Intelligence and Lightweight Machine Learning Support for Future Applications and Services,” Hoffpauir et al. provided a comprehensive survey of the emerging edge intelligence applications, lightweight machine learning algorithms, and their support for future applications and services. The survey started by analyzing the rise of cloud computing discussing its weak points, and identifying situations in which edge computing provides advantages over traditional cloud computing architectures. Then it dove into the survey the first section identifying opportunities and domains for edge computing growth, the second identifying algorithms and approaches that can be used to enhance edge intelligence implementations, and the third specifically analyzing situations in which edge intelligence can be enhanced using any of the aforementioned algorithms or approaches. 
In this third section, lightweight machine learning approaches ","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"25 1","pages":"1 - 3"},"PeriodicalIF":2.1,"publicationDate":"2023-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82676180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data is of high quality if it is fit for its intended purpose. Data heterogeneity can be a major quality problem, as quality aspects such as understandability and consistency can be compromised. Heterogeneity of data values is particularly common when data is manually entered by different people using inadequate control rules. In this case, syntactic and semantic heterogeneity often go hand in hand. Heterogeneity of data values may be a direct result of problems in the acquisition process, quality problems of the underlying data model, or possibly erroneous data transformations. For example, in the cultural heritage domain, it is common to analyze data fields by manually searching lists of data values sorted alphabetically or by number of occurrences. Additionally, search functions such as regular expression matching are used to detect specific patterns. However, this requires a priori knowledge and technical skills that domain experts often do not have. Since such datasets often contain thousands of values, the entire process is very time-consuming. Outliers or subtle differences between values that may be critical to data quality can be easily overlooked. To improve this process of analyzing the quality of data values, we propose a bottom-up human-in-the-loop approach that clusters values of a data field according to syntactic similarity. The clustering is intended to help domain experts explore the heterogeneity of values in a data field and can be configured by domain experts according to their domain knowledge. The overview of the syntactic diversity of the data values gives an impression of the rules and practices of data acquisition as well as their violations. From this, experts can infer potential quality issues with the data acquisition process and system, as well as the data model and data transformations. We outline a proof-of-concept implementation of the approach. Our evaluation found that clustering adds value to data quality analysis, especially for detecting quality problems in data models.
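A minimal sketch of the core idea, clustering the distinct values of a data field by syntactic similarity, is shown below; it uses character n-gram features and agglomerative clustering as stand-ins and is not the authors' proof-of-concept implementation. The distance threshold and feature choice are assumptions that a domain expert would configure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

def cluster_values(values, distance_threshold=0.7):
    """Group distinct values of a data field by syntactic similarity."""
    # Character n-grams capture syntactic structure: digits, punctuation, casing patterns.
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = vec.fit_transform(values).toarray()
    clustering = AgglomerativeClustering(
        n_clusters=None,                      # let the threshold decide the number of clusters
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    ).fit(X)
    clusters = {}
    for value, label in zip(values, clustering.labels_):
        clusters.setdefault(label, []).append(value)
    return list(clusters.values())

# Example: date-like values entered with inconsistent conventions (and one likely typo).
field_values = ["1905", "ca. 1905", "1905-06-01", "06/01/1905", "circa 1910", "19o5"]
for group in cluster_values(field_values):
    print(group)
```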
{"title":"Clustering Heterogeneous Data Values for Data Quality Analysis","authors":"Viola Wenz, Arno Kesper, G. Taentzer","doi":"10.1145/3603710","DOIUrl":"https://doi.org/10.1145/3603710","url":null,"abstract":"Data is of high quality if it is fit for its intended purpose. Data heterogeneity can be a major quality problem, as quality aspects such as understandability and consistency can be compromised. Heterogeneity of data values is particularly common when data is manually entered by different people using inadequate control rules. In this case, syntactic and semantic heterogeneity often go hand in hand. Heterogeneity of data values may be a direct result of problems in the acquisition process, quality problems of the underlying data model, or possibly erroneous data transformations. For example, in the cultural heritage domain, it is common to analyze data fields by manually searching lists of data values sorted alphabetically or by number of occurrences. Additionally, search functions such as regular expression matching are used to detect specific patterns. However, this requires a priori knowledge and technical skills that domain experts often do not have. Since such datasets often contain thousands of values, the entire process is very time-consuming. Outliers or subtle differences between values that may be critical to data quality can be easily overlooked. To improve this process of analyzing the quality of data values, we propose a bottom-up human-in-the-loop approach that clusters values of a data field according to syntactic similarity. The clustering is intended to help domain experts explore the heterogeneity of values in a data field and can be configured by domain experts according to their domain knowledge. The overview of the syntactic diversity of the data values gives an impression of the rules and practices of data acquisition as well as their violations. From this, experts can infer potential quality issues with the data acquisition process and system, as well as the data model and data transformations. We outline a proof-of-concept implementation of the approach. Our evaluation found that clustering adds value to data quality analysis, especially for detecting quality problems in data models.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"4 1","pages":"1 - 33"},"PeriodicalIF":2.1,"publicationDate":"2023-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91281672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yang Jing, Haowei Ma, A. Ansari, G. Sucharitha, B. Omarov, Sandeep Kumar, M. Mohammadi, Khaled A. Z. Alyamani
Cyberbullying (CB) is a form of abuse, manipulation, or humiliation directed against a single person via the Internet, carried out through hostile online comments and remarks. It occurs when someone publicly mocks, insults, slanders, or criticizes another person while remaining anonymous on the Internet. As a result, there is a rising need for new methods to sift through data on social media sites for symptoms of cyberbullying, with the goal of lessening its negative consequences. This article discusses a soft-computing-based methodology for detecting cyberbullying in social multimedia data. The model ingests social media data, and normalization is performed to remove noise. Particle swarm optimization (PSO) is applied for feature optimization, which helps make cyberbullying detection more accurate, and an LSTM model is used for classification. The PSO-LSTM model achieves an accuracy of 99.1%, which is 2.9% higher than the AdaBoost technique and 10.4% higher than the KNN technique. The specificity and sensitivity of the PSO-based LSTM are also higher than those of the KNN and AdaBoost algorithms.
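The sketch below illustrates the general shape of the feature-optimization step with a generic binary particle swarm search over feature subsets; the paper couples PSO with an LSTM classifier, whereas here a logistic regression serves as a cheap stand-in fitness function, and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pso_feature_selection(X, y, n_particles=10, n_iter=20, seed=0):
    """Generic binary PSO over feature masks; X, y are numpy arrays."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    pos = rng.random((n_particles, n_features))   # each entry ~ probability of keeping a feature
    vel = np.zeros_like(pos)
    pbest, pbest_score = pos.copy(), np.full(n_particles, -np.inf)
    gbest, gbest_score = None, -np.inf

    def fitness(p):
        mask = p > 0.5
        if not mask.any():
            return 0.0
        # Cheap proxy classifier; the paper's pipeline would train the LSTM here instead.
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, mask], y, cv=3).mean()

    for _ in range(n_iter):
        for i in range(n_particles):
            score = fitness(pos[i])
            if score > pbest_score[i]:
                pbest_score[i], pbest[i] = score, pos[i].copy()
            if score > gbest_score:
                gbest_score, gbest = score, pos[i].copy()
        # Standard PSO velocity/position update with illustrative coefficients.
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)

    return gbest > 0.5, gbest_score   # selected feature mask and its cross-validated score
```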
{"title":"Soft Computing Techniques for Detecting Cyberbullying in Social Multimedia Data","authors":"Yang Jing, Haowei Ma, A. Ansari, G. Sucharitha, B. Omarov, Sandeep Kumar, M. Mohammadi, Khaled A. Z. Alyamani","doi":"10.1145/3604617","DOIUrl":"https://doi.org/10.1145/3604617","url":null,"abstract":"Cyberbullying is a form of abuse, manipulation, or humiliation directed against a single person via the Internet. CB makes use of nasty Internet comments and remarks. It occurs when someone publicly mocks, insults, slanders, criticizes, or mocks another person while remaining anonymous on the Internet. As a result, there is a rising need to create new methods for sifting through data on social media sites for symptoms of cyberbullying. The goal is to lessen the negative consequences of this condition. This article discusses a soft computing-based methodology for detecting cyberbullying in social multimedia data. This model incorporates social media data. Normalization is performed to remove noise from data. To improve a feature, the Particle Swarm Optimization Technique is applied. Feature optimization helps to make cyberbullying detection more accurate. The LSTM model is used to classify things. With the help of social media data, the PSO LSTM model is getting better at finding cyberbullying. The accuracy of PSO LSTM is 99.1%. It is 2.9% higher than the accuracy of the AdaBoost technique and 10.4% more than the accuracy of the KNN technique. The specificity and sensitivity of PSO-based LSTM is also higher in percentage than KNN and AdaBoost algorithm.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"3 1","pages":"1 - 14"},"PeriodicalIF":2.1,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74690404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Open data's value-creating capabilities and innovation potential are widely recognized, resulting in a notable increase in the number of published open data sources. A crucial challenge for companies intending to leverage open data is to identify suitable open datasets that support specific business scenarios and prepare these datasets for use. Researchers have developed several open data assessment techniques, but those are restricted in scope, do not consider the use context, and are not embedded in the complete set of activities required for open data consumption in enterprises. Therefore, our research aims to develop prescriptive knowledge in the form of a meaningful method to screen, assess, and prepare open data for use in an enterprise setting. Our findings complement existing open data assessment techniques by providing methodological guidance to prepare open data of uncertain quality for use in a value-adding and demand-oriented manner, enabled by knowledge graphs and linked data concepts. From an academic perspective, our research conceptualizes open data preparation as a purposeful and value-creating process.
{"title":"A Method to Screen, Assess, and Prepare Open Data for Use","authors":"P. Krasikov, Christine Legner","doi":"10.1145/3603708","DOIUrl":"https://doi.org/10.1145/3603708","url":null,"abstract":"Open data's value-creating capabilities and innovation potential are widely recognized, resulting in a notable increase in the number of published open data sources. A crucial challenge for companies intending to leverage open data is to identify suitable open datasets that support specific business scenarios and prepare these datasets for use. Researchers have developed several open data assessment techniques, but those are restricted in scope, do not consider the use context, and are not embedded in the complete set of activities required for open data consumption in enterprises. Therefore, our research aims to develop prescriptive knowledge in the form of a meaningful method to screen, assess, and prepare open data for use in an enterprise setting. Our findings complement existing open data assessment techniques by providing methodological guidance to prepare open data of uncertain quality for use in a value-adding and demand-oriented manner, enabled by knowledge graphs and linked data concepts. From an academic perspective, our research conceptualizes open data preparation as a purposeful and value-creating process.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"1 1","pages":""},"PeriodicalIF":2.1,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77837398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ahmad Al-qerem, A. Ali, S. Nashwan, Mohammad Alauthman, Ala Hamarsheh, Ahmad Nabot, Issam Jibreen
The Web of Things (WoT) aims to create a network of intelligent devices capable of remote monitoring, service provisioning, and control. Virtual and physical Internet of Things (IoT) gateways facilitate communication, processing, and storage among the social nodes that form the social Web of Things (SWoT). Peripheral IoT services commonly use device data. However, due to the limited bandwidth and processing power of edge devices in the IoT, they must dynamically adjust the quality of service provided to their connected clients to meet each user's needs while also meeting the service-quality requirements of other devices that may access the same data. Consequently, deciding which transactions get access to which IoT data is a scheduling problem. Edge-cloud computing requires transaction management because several IoT transactions may access shared data simultaneously, yet cloud transaction management methods cannot be employed directly in edge-cloud settings. Transaction management models must consider the ACID properties of transactions, especially consistency. This study compares three implementation strategies, the Edge Host Strategy (EHS), the Cloud Host Strategy (CHS), and a hybrid strategy (BHS), which execute IoT transactions on the edge host, in the cloud, and on both hosts, respectively; in every strategy the transactions also affect the edge hosts. An IoT transaction (IoTT) framework is provided that views an IoT transaction as a collection of fundamental (essential) and additional subtransactions, loosening atomicity; the execution strategy controls both the essential and the additional subtransactions. Experiments integrating edge and cloud computing demonstrate that the execution approach significantly affects system performance: EHS and CHS can waste wireless bandwidth, while BHS outperforms both in many scenarios. These solutions enable edge transactions to complete without restarting due to outdated IoT data or conflicts with other edge or cloud transactions. The properties of these approaches are detailed, showing that they often outperform concurrent protocols and can improve edge-cloud computing.
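A toy sketch of how the three strategies might place subtransactions is given below; the rule that sends essential subtransactions to the edge and additional ones to the cloud under BHS is an assumption made for illustration, not the paper's exact placement policy.

```python
from dataclasses import dataclass
from enum import Enum

class Host(Enum):
    EDGE = "edge"
    CLOUD = "cloud"

@dataclass
class SubTransaction:
    name: str
    essential: bool   # essential subtransactions must commit; additional ones may be relaxed

def plan(strategy: str, subs):
    """Toy placement of subtransactions under the three strategies described above.
    EHS runs everything on the edge host, CHS everything in the cloud; the hybrid
    strategy (BHS) is assumed here to keep essential work on the edge and offload the rest."""
    if strategy == "EHS":
        return {s.name: Host.EDGE for s in subs}
    if strategy == "CHS":
        return {s.name: Host.CLOUD for s in subs}
    if strategy == "BHS":
        return {s.name: Host.EDGE if s.essential else Host.CLOUD for s in subs}
    raise ValueError(f"unknown strategy: {strategy}")

txn = [SubTransaction("read_sensor", True), SubTransaction("update_history", False)]
print(plan("BHS", txn))
```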
{"title":"Transactional Services for Concurrent Mobile Agents over Edge/Cloud Computing-Assisted Social Internet of Things","authors":"Ahmad Al-qerem, A. Ali, S. Nashwan, Mohammad Alauthman, Ala Hamarsheh, Ahmad Nabot, Issam Jibreen","doi":"10.1145/3603714","DOIUrl":"https://doi.org/10.1145/3603714","url":null,"abstract":"The Web of Things (WoT) is a concept that aims to create a network of intelligent devices capable of remote monitoring, service provisioning, and control. Virtual and Physical Internet of Things (IoT) gateways facilitate communication, processing, and storage among social nodes that form the social Web of Things (SWoT). Peripheral IoT services commonly use device data. However, due to the limited bandwidth and processing power of edge devices in the IoT, they must dynamically alter the quality of service provided to their connected clients to meet each user's needs while also meeting the service quality requirements of other devices that may access the same data. Consequently, deciding which transactions get access to which Internet of Things data is a scheduling problem. Edge-cloud computing requires transaction management because several Internet of Things transactions may access shared data simultaneously. However, cloud transaction management methods cannot be employed in edge-cloud computing settings. Transaction management models must be consistent and consider ACIDity of transactions, especially consistency. This study compares three implementation strategies, Edge Host Strategy (EHS), Cloud Host Strategy (CHS), and Hybrid BHS (BHS), which execute all IoT transactions on the Edge host, the cloud, and both hosts, respectively. These transactions affect the Edge hosts as well. An IoTT framework is provided, viewing an Internet of Things transaction as a collection of fundamental and additional subtransactions that loosen atomicity. Execution strategy controls essential and additional subtransactions. The integration of edge and cloud computing demonstrates that the execution approach significantly affects system performance. EHS and CHS can waste wireless bandwidth, while BHS outperforms CHS and EHS in many scenarios. These solutions enable edge transactions to complete without restarting due to outdated IoT data or other edge or cloud transactions. The properties of these approaches have been detailed, showing that they often outperform concurrent protocols and can improve edge-cloud computing.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"34 1","pages":"1 - 20"},"PeriodicalIF":2.1,"publicationDate":"2023-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82719770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Internet of Things (IoT) is widely expected to have comprehensive economic, business, and societal implications for our smart lives; indeed, IoT technologies play an essential role in creating a variety of smart applications that improve well-being in the real world. At the same time, the interconnected nature of IoT systems and the variety of components involved in their implementation have given rise to new security concerns. Cyber-attacks and threats in the IoT ecosystem significantly impact the development of new intelligent applications. Moreover, the IoT ecosystem suffers from inherent vulnerabilities that limit the ability of its devices to benefit from established security techniques such as authentication, access control, encryption, and network security. Recently, great advances have been achieved in Machine Intelligence (MI), Deep Learning (DL), and Machine Learning (ML), which have been applied to many important applications. ML and DL are regarded as efficient data exploration techniques for discovering “normal” and “abnormal” behavior of IoT components and devices inside the IoT ecosystem. Therefore, ML/DL approaches are required to move IoT security beyond safe Device-to-Device (D2D) communication toward intelligence-based security systems. This work examines ML/DL technologies that may be utilized to provide superior security solutions for IoT devices. The potential security risks associated with the IoT are discussed, including pre-existing and newly emerging threats, and the benefits and challenges of DL and ML techniques for enhancing IoT security are examined.
{"title":"Joint IoT/ML Platforms for Smart Societies and Environments: A Review on Multimodal Information-Based Learning for Safety and Security","authors":"Hani Attar","doi":"10.1145/3603713","DOIUrl":"https://doi.org/10.1145/3603713","url":null,"abstract":"The application of the Internet of Things (IoT) is highly expected to have comprehensive economic, business, and societal implications for our smart lives; indeed, IoT technologies play an essential role in creating a variety of smart applications that improve the nature and well-being of life in the real world. Consequently, the interconnected nature of IoT systems and the variety of components of their implementation have given rise to new security concerns. Cyber-attacks and threats in the IoT ecosystem significantly impact the development of new intelligent applications. Moreover, the IoT ecosystem suffers from inheriting vulnerabilities that make its devices inoperable to benefit from instigating security techniques such as authentication, access control, encryption, and network security. Recently, great advances have been achieved in the field of Machine Intelligence (MI), Deep Learning (DL), and Machine Learning (ML), which have been applied to many important applications. ML and DL are regarded as efficient data exploration techniques for discovering “normal” and “abnormal” IoT component and device behavior inside the IoT ecosystem. Therefore, ML/DL approaches are required to convert the security of IoT systems from providing safe Device-to-Device (D2D) communication to providing security-based intelligence systems. The proposed work examines ML/DL technologies that may be utilized to provide superior security solutions for IoT devices. The potential security risks associated with the IoT are discussed, including pre-existing and newly emerging threats. Furthermore, the benefits and challenges of DL and ML techniques are examined to enhance IoT security.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"20 1","pages":"1 - 26"},"PeriodicalIF":2.1,"publicationDate":"2023-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79894842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Matrouk, Srikanth V, Sumit Kumar, Mohit Kumar Bhadla, Mirza Sabirov, M. Saadh
Academia and industry are paying intense attention to social network alignment, which links various social networks through their shared members. Existing studies treat social networks as static and ignore their innate dynamism. In reality, an individual's discriminative pattern is embedded in the dynamics of social networks, and this information can be used to improve alignment. This study finds that these dynamics reveal more apparent patterns that are better suited to aligning the social Web of Things (SWoT). To combine the two kinds of dynamics into an initial joint embedding representation, the correlation between user structure and attributes must be maintained for each social network. Finally, the initial embedding of each network is projected into a target subspace as part of a semi-supervised spatial transformation learning process. In an extensive series of trials on real-world datasets, the proposed Dynamic Social Network Alignment approach outperforms the current mainstream algorithm by 10%. The findings show that aligning such enormous networks addresses their volume, variety, velocity, and veracity (the 4Vs), and that adversarial learning techniques can be applied to improve the efficacy and resilience of adversarial network alignment. The results show that the model using structure, attribute, and time information performs best; the model without attribute information comes second; the model without time information performs in the middle; and the model without structure information performs worst.
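The sketch below illustrates one generic form the projection step can take: given node embeddings for two networks and a handful of anchor users known to be the same person, an orthogonal Procrustes mapping projects one network's embedding space into the other's, after which candidate alignments are nearest neighbours in the shared space. This is a simplified stand-in for the paper's semi-supervised spatial transformation learning, not its actual model.

```python
import numpy as np

def align_embeddings(emb_a, emb_b, anchors_a, anchors_b):
    """Learn an orthogonal map W minimising ||A W - B|| over anchor pairs,
    then project all of network A's embeddings into network B's space."""
    A = emb_a[anchors_a]              # anchor embeddings in network A, shape (k, d)
    B = emb_b[anchors_b]              # corresponding anchor embeddings in network B
    U, _, Vt = np.linalg.svd(A.T @ B)
    W = U @ Vt                        # orthogonal Procrustes solution
    return emb_a @ W

def match_users(emb_a_proj, emb_b, top_k=5):
    """Rank candidate counterparts in B for every user in A by cosine similarity."""
    a = emb_a_proj / np.linalg.norm(emb_a_proj, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T
    return np.argsort(-sim, axis=1)[:, :top_k]
```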
{"title":"Deep Learning–based Dynamic User Alignment in Social Networks","authors":"K. Matrouk, Srikanth V, Sumit Kumar, Mohit Kumar Bhadla, Mirza Sabirov, M. Saadh","doi":"10.1145/3603711","DOIUrl":"https://doi.org/10.1145/3603711","url":null,"abstract":"Academics and businesses are paying intense attention to social network alignment, which centers various social networks around their shared members. All studies to date treat the social network as static and ignore its innate dynamism. In reality, an individual's discriminative pattern is embedded in the dynamics of social networks, and this information may be used to improve social network alignment. This study finds that these dynamics can reveal more apparent patterns better suited to lining up the social web of things (SWoT). The correlation between the user structure and attributes for each social network must be maintained to combine the binary dynamics and make the original synthetic embedding representation. Finally, the initial embedding of each network is projected to a target subspace as part of the semi-supervised spatial transformation learning process. The Dynamic Social Network Alignment approach outperforms the current mainstream algorithm by 10% in this article's extensive series of trials using real-world datasets. The findings of this study show that this alignment of enormous networks addresses the volume, variety, velocity, and veracity (or 4Vs) of vast networks. To improve the efficacy and resilience of an adversarial network alignment, adversarial learning techniques can be applied. The results show that the model with structure, attribute, and time information performs the best, while the model without attribute information comes in second, the model without time information performs mediocrely, and the model without structure information performs the worst.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"39 1","pages":"1 - 26"},"PeriodicalIF":2.1,"publicationDate":"2023-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75080609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. J. Martin, Rajvardhan Oak, Mukesh Soni, V. Mahalakshmi, Arsalan Muhammad Soomar, Anjali Joshi
As mobile networks and apps have developed, user-generated content (UGC), which includes multi-source heterogeneous data such as user reviews, tags, scores, images, and videos, has become an essential basis for improving the quality of personalized services. Because of this multi-source heterogeneity, big data fusion offers both promise and drawbacks, and representation learning that fuses and vectorizes multi-source heterogeneous UGC is the key to applying it successfully. To this end, a fusion representation learning approach for multi-source text and images is proposed. Inspired by convolutional neural networks, this research proposes a data feature fusion strategy based on the convolution operation; in contrast to splicing-based fusion, convolutional fusion can take into account the varied data characteristics in each dimension. Doc2vec and LDA models provide the vectorized representation of the multi-source text, and a deep convolutional network is used to fuse the representations. Finally, the proposed algorithm is applied to an Amazon product dataset containing UGC; the classification accuracy achieved with the fused UGC representations demonstrates the feasibility and impact of the proposed approach.
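A minimal PyTorch sketch of convolution-based fusion is shown below: a Doc2vec vector and a length-matched LDA topic vector are stacked as two channels and mixed by 1-D convolution filters rather than simply concatenated. The layer sizes and the classification head are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Fuse two text representations by treating them as channels of a 1-D signal."""
    def __init__(self, dim=128, n_filters=32, n_classes=5):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=2, out_channels=n_filters, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, doc2vec_vec, lda_vec):
        x = torch.stack([doc2vec_vec, lda_vec], dim=1)   # (batch, 2, dim): two modality channels
        x = torch.relu(self.conv(x))                     # filters mix the modalities per dimension
        x = self.pool(x).squeeze(-1)                     # (batch, n_filters)
        return self.fc(x)                                # classification logits

# usage: logits = ConvFusion()(torch.randn(8, 128), torch.randn(8, 128))
```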
{"title":"Fusion-based Representation Learning Model for Multimode User-generated Social Network Content","authors":"R. J. Martin, Rajvardhan Oak, Mukesh Soni, V. Mahalakshmi, Arsalan Muhammad Soomar, Anjali Joshi","doi":"10.1145/3603712","DOIUrl":"https://doi.org/10.1145/3603712","url":null,"abstract":"As mobile networks and APPs are developed, user-generated content (UGC), which includes multi-source heterogeneous data like user reviews, tags, scores, images, and videos, has become an essential basis for improving the quality of personalized services. Due to the multi-source heterogeneous nature of the data, big data fusion offers both promise and drawbacks. With the rise of mobile networks and applications, UGC, which includes multi-source heterogeneous data including ratings, marks, scores, images, and videos, has gained importance. This information is very important for improving the calibre of customized services. The key to the application's success is representational learning of fusing and vectorization on the multi-source heterogeneous UGC. Multi-source text fusion and representation learning have become the key to its application. In this regard, a fusion representation learning for multi-source text and image is proposed. The convolutional fusion technique, in contrast to splicing and fusion, may take into consideration the varied data characteristics in each size. This research proposes a new data feature fusion strategy based on the convolution operation, which was inspired by the convolutional neural network. Using Doc2vec and LDA model, the vectorized representation of multi-source text is given, and the deep convolutional network is used to obtain it. Finally, the proposed algorithm is applied to Amazon's commodity dataset containing UGC content based on the classification accuracy of UGC vectorized representation items and shows the feasibility and impact of the proposed algorithm.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"57 1","pages":"1 - 21"},"PeriodicalIF":2.1,"publicationDate":"2023-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91381668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali H. Jaber
The term data quality refers to measuring the fitness of data for its intended usage. Poor data quality leads to inadequate, inconsistent, and erroneous decisions that can escalate computational costs, cause a decline in profits, and drive customer churn. Thus, data quality is crucial for researchers and industry practitioners. Different factors drive the assessment of data quality. Data context is deemed one of the key factors, owing to the contextual diversity of real-world use cases involving entities such as people and organizations: data that is adequate in one context (e.g., under a given organizational policy) may not be efficacious in another. Hence, implementing a data quality assessment solution across different contexts is challenging. Traditional technologies for data quality assessment have reached a high degree of maturity, and existing solutions can solve most quality issues. In these solutions, the data context is defined as validation rules applied within the ETL (extract, transform, load) process, i.e., the data warehousing process. In contrast to traditional data quality management, it is impossible to specify all the data semantics beforehand for big data. Context-aware data quality rules are needed to detect semantic errors in massive amounts of heterogeneous data generated at high speed. While many researchers tackle the quality issues of big data, they define the data context from a specific standpoint. Although data quality is a longstanding research issue in academia and industry, it remains an open problem, especially with the advent of big data, which has made data quality assessment more challenging than ever. This article provides a scoping review of existing context-aware data quality assessment solutions, starting with big data quality solutions in general and then covering context-aware solutions. The strengths and weaknesses of these solutions are outlined and discussed. The survey shows that none of the existing data quality assessment solutions can guarantee context awareness while also handling big data; notably, each solution deals with only a partial view of the context. We compare the existing quality models and solutions to reach a comprehensive view of the aspects of context awareness relevant to assessing data quality. This leads to a set of recommendations framed in a methodological framework that shapes the design and implementation of any context-aware data quality service for big data. Open challenges are then identified and discussed.
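As a small illustration of what a context-aware quality rule looks like in practice, the sketch below parameterizes a completeness check by a usage context, so the same data passes or fails depending on the consuming scenario; the contexts, columns, and thresholds are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Context:
    """A usage context: which fields matter and how strict the completeness requirement is."""
    name: str
    required_columns: List[str]
    min_completeness: float

def completeness(records: List[Dict], column: str) -> float:
    filled = sum(1 for r in records if r.get(column) not in (None, ""))
    return filled / len(records) if records else 0.0

def assess(records: List[Dict], ctx: Context) -> Dict[str, bool]:
    # The same metric is evaluated against context-specific expectations.
    return {
        col: completeness(records, col) >= ctx.min_completeness
        for col in ctx.required_columns
    }

marketing = Context("marketing-campaign", ["email", "country"], 0.95)
billing = Context("billing", ["email", "iban", "country"], 1.0)

data = [{"email": "a@x.com", "country": "FR"}, {"email": "", "country": "DE", "iban": "DE89..."}]
print(assess(data, marketing))   # same data, different verdicts per context
print(assess(data, billing))
```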
{"title":"Context-aware Big Data Quality Assessment: A Scoping Review","authors":"Hadi Fadlallah, R. Kilany, Houssein Dhayne, Rami El Haddad, R. Haque, Y. Taher, Ali H. Jaber","doi":"10.1145/3603707","DOIUrl":"https://doi.org/10.1145/3603707","url":null,"abstract":"The term data quality refers to measuring the fitness of data regarding the intended usage. Poor data quality leads to inadequate, inconsistent, and erroneous decisions that could escalate the computational cost, cause a decline in profits, and cause customer churn. Thus, data quality is crucial for researchers and industry practitioners. Different factors drive the assessment of data quality. Data context is deemed one of the key factors due to the contextual diversity of real-world use cases of various entities such as people and organizations. Data used in a specific context (e.g., an organization policy) may need to be more efficacious for another context. Hence, implementing a data quality assessment solution in different contexts is challenging. Traditional technologies for data quality assessment reached the pinnacle of maturity. Existing solutions can solve most of the quality issues. The data context in these solutions is defined as validation rules applied within the ETL (extract, transform, load) process, i.e., the data warehousing process. In contrast to traditional data quality management, it is impossible to specify all the data semantics beforehand for big data. We need context-aware data quality rules to detect semantic errors in a massive amount of heterogeneous data generated at high speed. While many researchers tackle the quality issues of big data, they define the data context from a specific standpoint. Although data quality is a longstanding research issue in academia and industries, it remains an open issue, especially with the advent of big data, which has fostered the challenge of data quality assessment more than ever. This article provides a scoping review to study the existing context-aware data quality assessment solutions, starting with the existing big data quality solutions in general and then covering context-aware solutions. The strength and weaknesses of such solutions are outlined and discussed. The survey showed that none of the existing data quality assessment solutions could guarantee context awareness with the ability to handle big data. Notably, each solution dealt only with a partial view of the context. We compared the existing quality models and solutions to reach a comprehensive view covering the aspects of context awareness when assessing data quality. This led us to a set of recommendations framed in a methodological framework shaping the design and implementation of any context-aware data quality service for big data. Open challenges are then identified and discussed.","PeriodicalId":44355,"journal":{"name":"ACM Journal of Data and Information Quality","volume":"8 1","pages":"1 - 33"},"PeriodicalIF":2.1,"publicationDate":"2023-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90308221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}