Chaoneng Li, Guanwen Feng, Yiran Jia, Yunan Li, Jian Ji, Qiguang Miao
Due to the rapid advancement of wireless sensor and location technologies, a large amount of mobile agent trajectory data has become available. Applications such as intelligent city systems and video surveillance all benefit from trajectory anomaly detection. The authors propose an unsupervised reconstruction error-based trajectory anomaly detection (RETAD) method for vehicles to address the shortcomings of conventional anomaly detection: difficulty in extracting features, susceptibility to overfitting, and poor detection performance. RETAD reconstructs the original vehicle trajectories with an autoencoder based on recurrent neural networks. The model learns the movement patterns of normal trajectories by minimizing the gap between the reconstruction results and the initial inputs. Trajectories whose reconstruction error exceeds the anomaly threshold are flagged as anomalous. Experimental results demonstrate that RETAD outperforms traditional distance-based, density-based, and machine learning classification algorithms on multiple metrics.
"RETAD: Vehicle Trajectory Anomaly Detection Based on Reconstruction Error." International Journal of Data Warehousing and Mining, pp. 1-14, 2023. DOI: 10.4018/ijdwm.316460
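The reconstruction-error idea lends itself to a compact illustration. Below is a minimal PyTorch sketch, assuming fixed-length trajectories of (x, y) points; the layer sizes and the quantile-based threshold are illustrative choices, not RETAD's actual configuration.

```python
# Minimal sketch of reconstruction-error anomaly scoring with an LSTM
# autoencoder. Trajectories are (batch, T, 2) tensors of (x, y) points;
# all hyperparameters here are assumptions for the demo.
import torch
import torch.nn as nn

class TrajectoryAutoencoder(nn.Module):
    def __init__(self, n_features=2, hidden=32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):                      # x: (batch, T, 2)
        _, (h, _) = self.encoder(x)            # h: (1, batch, hidden)
        # Repeat the final hidden state as the decoder input at every step.
        z = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        dec, _ = self.decoder(z)
        return self.out(dec)                   # reconstruction: (batch, T, 2)

def anomaly_scores(model, trajectories):
    """Per-trajectory mean squared reconstruction error."""
    with torch.no_grad():
        recon = model(trajectories)
        return ((recon - trajectories) ** 2).mean(dim=(1, 2))

# Trajectories whose score exceeds a threshold (e.g., a high quantile of
# scores on normal training data) are flagged as anomalous.
model = TrajectoryAutoencoder()
scores = anomaly_scores(model, torch.randn(8, 50, 2))
flagged = scores > scores.quantile(0.95)
```

In practice the model would first be trained to minimize the same reconstruction loss on normal trajectories, so that anomalous movement patterns reconstruct poorly.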
A fast, reliable, and energy-efficient connection is essential within wireless sensor networks (WSNs). Control specifications, networking layers, media access control, and physical layers should be optimised or co-designed. Health insurance will become more expensive for individuals with lower incomes. There are privacy and cybersecurity issues, an increased risk of malpractice lawsuits, and higher costs in both time and money for doctors and patients. In this paper, personal health biomedical clothing based on wireless sensor networks (PH-BC-WSN) was used to enhance access to quality health care, boost food production through precision agriculture, and improve the quality of human resources. The Internet of Things enables the creation of more efficient healthcare and medical asset monitoring systems. The paper discusses at length the eavesdropping on and manipulation of medical data, fabrication of warnings, denial of service, position tracking of users, physical interference with devices, and electromagnetic attacks.
Ge Zhang, Zubin Ning. "Personal Health and Illness Management and the Future Vision of Biomedical Clothing Based on WSN." International Journal of Data Warehousing and Mining, pp. 1-21, 2023. DOI: 10.4018/ijdwm.316126
Q. Zhu, Wenhao Ding, Mingsen Xiang, M. Hu, Ning Zhang
With changing consumption patterns, credit consumption has gradually become a new trend, and frequent loan defaults have drawn increasing attention to default prediction. This paper proposes a new comprehensive method for predicting loan defaults that combines a convolutional neural network (CNN) with the LightGBM algorithm. First, the strong feature extraction ability of the convolutional neural network is used to extract features from the original loan data and generate a new feature matrix. Second, the new feature matrix is used as input data, and the parameters of the LightGBM algorithm are tuned through grid search to build the LightGBM model. Finally, the LightGBM model is trained on the new feature matrix, yielding the CNN-LightGBM loan default prediction model. To verify the effectiveness and superiority of the model, a series of experiments compared the proposed prediction model with four classical models. The results show that the CNN-LightGBM model is superior to the other models on all evaluation metrics.
"Loan Default Prediction Based on Convolutional Neural Network and LightGBM." International Journal of Data Warehousing and Mining, pp. 1-16, 2023. DOI: 10.4018/ijdwm.315823
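The two-stage pipeline can be sketched briefly. In the hedged example below, an untrained 1-D CNN stands in for the feature extractor and a small grid is searched over two LightGBM parameters; the network shape, the parameter grid, and the random data are assumptions for illustration, not the paper's setup.

```python
# Sketch of the CNN-to-LightGBM pipeline: a 1-D CNN turns each raw loan
# record into a feature vector, then LightGBM is tuned by grid search on
# that feature matrix.
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

cnn = nn.Sequential(                     # feature extractor (assumed shape)
    nn.Conv1d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(16),
    nn.Flatten(),                        # -> 8 * 16 = 128 features per loan
)

X_raw = np.random.rand(200, 30).astype("float32")   # 200 loans, 30 fields
y = np.random.randint(0, 2, size=200)               # default / no default

with torch.no_grad():                    # build the new feature matrix
    X_feat = cnn(torch.from_numpy(X_raw).unsqueeze(1)).numpy()

search = GridSearchCV(                   # grid search over LightGBM params
    LGBMClassifier(),
    {"num_leaves": [15, 31], "learning_rate": [0.05, 0.1]},
    cv=3,
)
search.fit(X_feat, y)
print(search.best_params_, search.best_score_)
```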
In this paper, a top-k pseudo labeling method for semi-supervised self-learning is proposed. Pseudo labeling is a key technique in semi-supervised self-learning: the quality of the generated pseudo labels largely determines the convergence of the neural network and the accuracy obtained. The authors use a method called top-k pseudo labeling to generate pseudo labels during the training of a semi-supervised neural network model. The proposed labeling method substantially helps in learning features from unlabeled data. It is easy to implement and relies only on the neural network's predictions and a hyperparameter k. Experimental results show that the proposed method works well with semi-supervised learning on the CIFAR-10 and CIFAR-100 datasets. In addition, a variant of top-k labeling for supervised learning, named top-k regulation, is proposed. Experiments show that various models achieve higher accuracy on the test set when trained with top-k regulation.
Yi Jiang, Hui Sun. "Top-K Pseudo Labeling for Semi-Supervised Image Classification." International Journal of Data Warehousing and Mining, pp. 1-18, 2022. DOI: 10.4018/ijdwm.316150
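One plausible reading of top-k pseudo labeling is sketched below: for each unlabeled sample, the k highest-probability classes form a candidate set, and the pseudo target spreads mass over that set. The exact target construction in the paper may differ; this only illustrates the dependence on predictions and k.

```python
# Hedged NumPy sketch of top-k pseudo label construction.
import numpy as np

def top_k_pseudo_labels(probs, k=3):
    """probs: (n_samples, n_classes) softmax outputs -> (n, n_classes) targets."""
    n, _ = probs.shape
    topk = np.argsort(probs, axis=1)[:, -k:]        # indices of the k best classes
    targets = np.zeros_like(probs)
    rows = np.repeat(np.arange(n), k)
    targets[rows, topk.ravel()] = probs[rows, topk.ravel()]
    # Renormalize so each pseudo target is a distribution over its top k.
    return targets / targets.sum(axis=1, keepdims=True)

probs = np.array([[0.50, 0.30, 0.15, 0.05],
                  [0.40, 0.10, 0.25, 0.25]])
print(top_k_pseudo_labels(probs, k=2))
# First row keeps classes 0 and 1, renormalized to [0.625, 0.375, 0, 0].
```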
A. Giacometti, Béatrice Bouchou-Markhoff, Arnaud Soulet
This paper presents Versus, the first automatic method for generating comparison tables from knowledge bases of the Semantic Web. For this purpose, it introduces the contextual reference level to evaluate whether a feature is relevant for comparing a set of entities. This measure relies on contexts, which are sets of entities similar to the compared entities. Its principle is to favor the features whose values for the compared entities are reference (or frequent) in these contexts. The proposal efficiently evaluates the contextual reference level from a public SPARQL endpoint limited by a fair-use policy. Using a new benchmark based on Wikidata, the experiments show the value of the contextual reference level for identifying the features deemed relevant by users, with high precision and recall. In addition, the proposed optimizations significantly reduce the number of queries required for properties as well as for inverse relations. Interestingly, the experimental study also shows that inverse relations yield a large number of numerical comparison features.
"A Method for Generating Comparison Tables From the Semantic Web." International Journal of Data Warehousing and Mining, pp. 1-20, 2022. DOI: 10.4018/ijdwm.298008
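This is not the Versus algorithm itself, but a minimal illustration of the setting it operates in: querying the public Wikidata SPARQL endpoint (which enforces a fair-use policy) for properties shared by two compared entities, here Paris (wd:Q90) and Lyon (wd:Q456). The query and entity choices are assumptions for the demo.

```python
# Count how often each property appears across two compared entities on
# Wikidata; frequent shared properties are natural comparison candidates.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?prop (COUNT(*) AS ?n) WHERE {
  VALUES ?entity { wd:Q90 wd:Q456 }
  ?entity ?prop ?value .
}
GROUP BY ?prop
ORDER BY DESC(?n)
LIMIT 20
"""

resp = requests.get(ENDPOINT,
                    params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "comparison-table-demo/0.1"})
for row in resp.json()["results"]["bindings"]:
    print(row["prop"]["value"], row["n"]["value"])
```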
Wissam Siblini, Mohamed Challal, Charlotte Pasqual
Open Domain Question Answering (ODQA) over a large-scale corpus of documents (e.g., Wikipedia) is a key challenge in computer science. Although Transformer-based language models such as BERT can outperform humans at extracting answers from small pre-selected passages of text, their high complexity becomes a burden when the search space is much larger. The most common way to deal with this problem is to add a preliminary information retrieval step that strongly filters the corpus and keeps only the relevant passages. In this article, the authors consider a more direct and complementary solution, which consists in restricting the attention mechanism in Transformer-based models to allow more efficient management of computations. The resulting variants are competitive with the original models on the extractive task and, in the ODQA setting, allow a significant acceleration of predictions and sometimes even an improvement in response quality.
"Efficient Open Domain Question Answering With Delayed Attention in Transformer-Based Models." International Journal of Data Warehousing and Mining, pp. 1-16, 2022. DOI: 10.4018/ijdwm.298005
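A hedged sketch of the restriction idea: in lower layers, a block-diagonal mask keeps question tokens and passage tokens from attending to each other, so passage representations can be computed independently (and cached); full cross-attention is only enabled in upper layers. This illustrates the mechanism, not the paper's exact architecture.

```python
# Build boolean attention masks (True = attention allowed) for one
# concatenated question+passage sequence under delayed interaction.
import torch

def delayed_attention_mask(q_len, p_len, interact):
    n = q_len + p_len
    if interact:                      # upper layers: full attention
        return torch.ones(n, n, dtype=torch.bool)
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:q_len, :q_len] = True       # question attends within itself
    mask[q_len:, q_len:] = True       # passage attends within itself
    return mask

lower = delayed_attention_mask(4, 6, interact=False)   # layers 1..k
upper = delayed_attention_mask(4, 6, interact=True)    # layers k+1..L
print(lower.int())
```

Because the lower-layer mask makes the two segments independent, the expensive passage encoding for the whole corpus can be done once, ahead of time, which is where the prediction speed-up comes from.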
Nazha Selmaoui-Folcher, Jannaï Tokotoko, Samuel Gorohouna, Laïsa Roi, C. Leschi, Catherine Ris
Pretopology is a mathematical model developed by weakening the axioms of topology. It was initially used in the economic, social, and biological sciences, then in pattern recognition and image analysis; more recently, it has been applied to the analysis of complex networks. Pretopology makes it possible to work in a mathematical framework with weak properties, and its non-idempotent pseudo-closure operator supports iterative algorithms. It offers a formalism that generalizes graph theory concepts and allows problems to be modeled in a unified way. In this paper, the authors extend this mathematical model to analyze complex data with spatiotemporal dimensions. They define the notion of a temporal pretopology based on a temporal function, give an example of a temporal function based on a binary relation, and construct a temporal pretopology from it. They define two new notions of temporal substructure that aim to represent the evolution of substructures, and they propose algorithms to extract these substructures. They evaluate the proposal on two datasets and two real economic datasets.
"Concept of Temporal Pretopology for the Analysis for Structural Changes: Application to Econometrics." International Journal of Data Warehousing and Mining, pp. 1-17, 2022. DOI: 10.4018/ijdwm.298004
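The non-idempotent pseudo-closure is the key object here, and it is small enough to sketch. Below, the pseudo-closure a(A) built from a binary relation R adds every element related to some element of A; iterating it until stabilization yields the closure. The temporal variant in the paper indexes the relation by time, which this static sketch omits.

```python
# Pseudo-closure from a binary relation, and its fixed-point closure.
def pseudo_closure(A, R):
    """A: set of elements; R: set of (x, y) pairs meaning x relates to y."""
    return A | {y for (x, y) in R if x in A}

def closure(A, R):
    """Iterate the pseudo-closure until a fixed point is reached."""
    while True:
        B = pseudo_closure(A, R)
        if B == A:
            return A
        A = B

R = {(1, 2), (2, 3), (4, 5)}
print(pseudo_closure({1}, R))   # {1, 2}: a single expansion step
print(closure({1}, R))          # {1, 2, 3}: the fixed point
```

The gap between the two outputs is exactly the non-idempotence the abstract mentions: one application of a() is not the closure, which is what makes stepwise, iterative structure-tracking algorithms possible.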
Erwan Schild, Gautier Durantin, Jean-Charles Lamirel, F. Miconi
Chatbots are a promising tool for automating the processing of requests in a business context. However, despite major progress in natural language processing technologies, constructing a dataset deemed relevant by business experts is a manual, iterative, and error-prone process. To assist these experts during modelling and labelling, the authors propose an active learning methodology coined Interactive Clustering. It relies on interactions between computer-guided segmentation of the data into intents and response-driven human annotations that impose constraints on clusters to improve relevance. This article applies Interactive Clustering to a realistic dataset and measures the optimal settings required for relevant segmentation with a minimal number of annotations. The usability of the method is discussed in terms of computation time and the compromise achieved between business relevance and classification performance during training. In this context, Interactive Clustering appears to be a suitable methodology combining human and computer initiative to efficiently develop a usable chatbot.
"Iterative and Semi-Supervised Design of Chatbots Using Interactive Clustering." International Journal of Data Warehousing and Mining, pp. 1-19, 2022. DOI: 10.4018/ijdwm.298007
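The interaction loop can be sketched in a few lines. In this hedged example, utterance embeddings are clustered into intents, a stand-in annotator answers questions about a few pairs, and cannot-link answers are folded back as corrections; the post-hoc reassignment used here is a deliberate simplification of proper constrained clustering, and all names and data are assumptions.

```python
# One round of an interactive clustering loop with pairwise constraints.
import numpy as np
from sklearn.cluster import KMeans

def annotate(pairs):
    """Stand-in for the human annotator: one answer per reviewed pair."""
    return {p: "cannot" for p in pairs}          # demo assumption

X = np.random.rand(50, 8)                        # utterance embeddings
km = KMeans(n_clusters=4, n_init=10).fit(X)
labels, centers = km.labels_.copy(), km.cluster_centers_

pairs = [(0, 1), (2, 3)]                         # pairs selected for review
for (i, j), answer in annotate(pairs).items():
    if answer == "cannot" and labels[i] == labels[j]:
        # Enforce the cannot-link by moving j to its second-closest
        # centroid; the next clustering round starts from these corrections.
        d = np.linalg.norm(centers - X[j], axis=1)
        labels[j] = int(np.argsort(d)[1])
```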
Romane Scherrer, Erwan Aulnette, T. Quiniou, J. Kasarhérou, Pierre Kolb, Nazha Selmaoui-Folcher
An autonomous acoustic system based on two bottom-moored hydrophones, a two-input audio board, and a small single-board computer was installed at the entrance of a marina to detect entering and exiting boats. The system computes windowed time-lagged cross-correlations to find the consecutive time delays between the hydrophone signals and derives a signal that is a function of the boats' angular trajectories. Since its installation, the single-board computer has performed online prediction with a signal processing-based algorithm that achieved an accuracy of 80%. To improve system performance, a convolutional neural network (CNN) was trained on the acquired data to perform real-time detection. Two classification tasks were considered (binary and multiclass) to detect both a boat and its direction of navigation. Finally, the trained CNN was implemented on the single-board computer to ensure that prediction can be performed in real time.
"Boat Detection in Marina Using Time-Delay Analysis and Deep Learning." International Journal of Data Warehousing and Mining, pp. 1-16, 2022. DOI: 10.4018/ijdwm.298006
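The time-delay step is standard enough to sketch. Assuming two synchronized hydrophone channels sampled at fs, the lag maximizing the cross-correlation within each window estimates the inter-hydrophone delay, whose evolution over windows traces the boat's angular trajectory. Window and lag parameters below are illustrative, not the deployed system's values.

```python
# Windowed time-lagged cross-correlation between two hydrophone channels.
import numpy as np

def windowed_delays(sig_a, sig_b, fs, win=1024, hop=512, max_lag=64):
    delays = []
    for start in range(0, len(sig_a) - win, hop):
        a = sig_a[start:start + win] - np.mean(sig_a[start:start + win])
        b = sig_b[start:start + win] - np.mean(sig_b[start:start + win])
        corr = np.correlate(a, b, mode="full")    # lags -win+1 .. win-1
        lags = np.arange(-win + 1, win)
        keep = np.abs(lags) <= max_lag            # physically plausible lags
        best = lags[keep][np.argmax(corr[keep])]
        delays.append(best / fs)                  # delay in seconds
    return np.array(delays)

fs = 48_000
t = np.arange(fs) / fs
a = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)
b = np.roll(a, 10)                                # simulate a 10-sample lag
print(windowed_delays(a, b, fs)[:5])
```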
Social media data have become an integral part of business data and should be integrated into the decision-making process, since they better reflect the true situation of a business in any field. However, social media data are unstructured and generated at a very high frequency, which exceeds the capacity of the data warehouse. In this work, we propose to extend the data warehousing process with a staging area whose heart is a large-scale system implementing an information extraction process using the Storm and Hadoop frameworks to better manage the volume and frequency of the data. For structured information extraction, mainly events, we combine a set of techniques from NLP, linguistic rules, and machine learning to accomplish the task. Finally, we propose an adequate data warehouse conceptual model for modeling events and integrating them with the enterprise data warehouse using an intermediate table called a bridge table. For application and experiments, we focus on extracting drug abuse events from Twitter data and modeling them in the event data warehouse.
Ferdaous Jenhani, M. Gouider. "Large-Scale System for Social Media Data Warehousing: The Case of Twitter-Related Drug Abuse Events Integration." International Journal of Data Warehousing and Mining, pp. 1-18, 2022. DOI: 10.4018/ijdwm.290890
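The bridge-table idea can be shown with a small relational sketch using Python's standard sqlite3 module: an event fact table stores extracted events, and a bridge table links each event to related enterprise-warehouse dimension rows. All table and column names here are illustrative assumptions, not the paper's schema.

```python
# Many-to-many linkage of extracted events to warehouse dimensions via a
# bridge table, demonstrated in an in-memory SQLite database.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE event_fact (
    event_id   INTEGER PRIMARY KEY,
    event_type TEXT,                 -- e.g., 'drug_abuse'
    tweet_id   TEXT,
    event_date TEXT
);
CREATE TABLE dim_location (
    location_id INTEGER PRIMARY KEY,
    city        TEXT
);
CREATE TABLE bridge_event_location (  -- the intermediate bridge table
    event_id    INTEGER REFERENCES event_fact(event_id),
    location_id INTEGER REFERENCES dim_location(location_id)
);
""")
con.execute("INSERT INTO event_fact VALUES (1, 'drug_abuse', '99871', '2021-06-01')")
con.execute("INSERT INTO dim_location VALUES (10, 'Tunis')")
con.execute("INSERT INTO bridge_event_location VALUES (1, 10)")

rows = con.execute("""
SELECT e.event_type, d.city
FROM event_fact e
JOIN bridge_event_location b ON b.event_id = e.event_id
JOIN dim_location d ON d.location_id = b.location_id
""").fetchall()
print(rows)   # [('drug_abuse', 'Tunis')]
```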