Modeling Data Lake Metadata with a Data Vault
I. D. Nogueira, Maram Romdhane, J. Darmont. DOI: 10.1145/3216122.3216130

With the rise of big data, business intelligence had to find solutions for managing even greater data volumes and variety than in data warehouses, which proved ill-adapted to the task. Data lakes answer these needs from a storage point of view, but require adequate metadata management to guarantee efficient access to the data. Starting from a multidimensional metadata model designed for an industrial heritage data lake, whose schema lacks the ability to evolve, we propose in this paper to use ensemble modeling, and more precisely a data vault, to address this issue. To illustrate the feasibility of this approach, we instantiate our metadata conceptual model into relational and document-oriented logical and physical models. We also compare the physical models in terms of metadata storage and query response time.

3D Visualization of data using SuperSQL and Unity
Tatsuki Fujimoto, Kento Goto, Motomichi Toyama. DOI: 10.1145/3216122.3216145

When exploring data or communicating it to other people, data is currently visualized through flat diagrams, tables, graphs, etc. Visualizing data in three dimensions (3D) offers more immersive and intuitive representations and, through the added dimension, allows for more compact representations. Still, when representing large amounts of data in 3D, fine control of the layout becomes a must, and current tools for 3D visualization do not allow easy, fine-tuned control of this layout. SuperSQL is an extension of SQL that allows users to declaratively and concisely specify the layout of structured documents, such as web pages, and to generate them. In this work we extend SuperSQL to generate 3D data representations in the Unity game engine. With this system, users can represent their data through basic shapes, colors, and animations, or even their own custom 3D assets, by writing simple SQL-like queries.

An Approach for Testing the Extract-Transform-Load Process in Data Warehouse Systems
Hajar Homayouni, Sudipto Ghosh, I. Ray. DOI: 10.1145/3216122.3216149

The Extract-Transform-Load (ETL) process in data warehousing involves extracting data from source databases, transforming it into a form suitable for research and analysis, and loading it into a data warehouse. ETL processes can use complex transformations involving sources and targets that use different schemas, databases, and technologies, which makes ETL implementations fault-prone. In this paper, we present an approach for validating ETL processes using automated balancing tests that check for various types of discrepancies between the source and target data. We formalize three categories of properties, namely completeness, consistency, and syntactic validity, that must be checked during testing. Our approach uses the rules provided in the ETL specifications to generate source-to-target mappings, from which balancing test assertions are generated for each property. We evaluated the approach on a real-world health data warehouse project, where it revealed 11 previously undetected faults. Using mutation analysis, we demonstrated that our auto-generated assertions can detect faults in the data inside the target data warehouse.

A Predictive Learning Framework for Monitoring Aggregated Performance Indicators over Business Process Events
A. Cuzzocrea, Francesco Folino, M. Guarascio, L. Pontieri. DOI: 10.1145/3216122.3216143

In many application contexts, the executions of a business process are subject to performance constraints expressed in aggregated form, usually over predefined time windows, and detecting a likely violation of such a constraint in advance can help undertake corrective measures to prevent it. This paper illustrates a prediction-aware event processing framework that estimates whether the process instances of a given (unfinished) window w will violate an aggregate performance constraint, based on the continuous learning and application of an ensemble of models, each capable of making and integrating two kinds of predictions: single-instance predictions concerning the ongoing process instances of w, and time-series predictions concerning the "future" process instances of w (i.e., those that have not started yet but will start by the end of w). Notably, the framework can continuously update the ensemble, fully exploiting the raw event data produced by the process under monitoring, suitably lifted to an adequate level of abstraction. The framework has been validated against historical event data from real-life business processes, showing promising results in terms of both accuracy and efficiency.

A paradigm for the cooperation of objects belonging to different IoTs
Giorgio Baldassarre, Paolo Lo Giudice, Lorenzo Musarella, D. Ursino. DOI: 10.1145/3216122.3216171

The Internet of Things (IoT) is currently considered the new frontier of the Internet. One of the most effective ways to investigate and implement the IoT is based on the social network paradigm. In recent years, social network researchers have introduced new models capable of capturing the growing complexity of this scenario. One of the best known is the Social Internetworking System, which models a scenario comprising several related social networks. In this paper, we investigate the possibility of applying the ideas characterizing the Social Internetworking System to the IoT, and we propose a new paradigm capable of modeling this scenario and of favoring the cooperation of objects belonging to different IoTs. Furthermore, to give an idea of both the potential and the complexity of this new paradigm, we illustrate in more detail one of its most interesting issues, namely the redefinition of the betweenness centrality measure.

Top-k Query Processing over Distributed Sensitive Data
S. Mahboubi, Reza Akbarinia, P. Valduriez. DOI: 10.1145/3216122.3216153

Distributed systems provide users with powerful capabilities to store and process their data in third-party machines. However, the privacy of the outsourced data is not guaranteed. One solution for protecting the user data against privacy attacks is to encrypt the sensitive data before sending it to the nodes of the distributed system. The main problem is then to evaluate user queries over the encrypted data. The problem of distributed top-k query processing has been well addressed over plaintext (non-encrypted) data, but the proposed approaches cannot be used on encrypted data. In this paper, we propose a complete solution for processing top-k queries over encrypted databases stored across the nodes of a distributed system.

CELPB: A Cache Invalidation Policy for Location Dependent Data in Mobile Environment
Ajay K. Gupta, Udai Shanker. DOI: 10.1145/3216122.3216147

Location dependent information services (LDIS) are applications that combine a mobile client's location or position with other data to deliver enhanced services to the client at the right place and time, from anywhere. In this paper, an algorithm, Caching Efficiency with Next Location Prediction Based (CELPB), is developed. It uses a newly developed metric, caching efficiency with next location prediction (CELP), for the computation of valid scopes over the prediction interval. The metric takes into account the client's future movement behavior with the help of sequential pattern mining and clustering. Mobility rules are also framed for accurately predicting the next location, which can be used to estimate the client's future movement path (edges) once the client enters the valid scope area of a data item. Simulation results show that the proposed policy achieves up to 10 percent performance improvement over the earlier cache invalidation policy CEBAB for LDIS.

Feature Reduction Improves Classification Accuracy in Healthcare
Maha Asiri, Hamid R. Nemati, F. Sadri. DOI: 10.1145/3216122.3216165

Our work focuses on inductive transfer learning, a setting in which the source and target tasks are assumed to share the same feature and label spaces. We demonstrate that transfer learning can be successfully used for feature reduction, and hence for more efficient classification. Our experiments further show that this approach also increases the precision of the classification task.

A useful four-valued database logic
G. Grahne, A. Moallemi. DOI: 10.1145/3216122.3216157

Recently there has been an effort to solve the problems caused by the infamous NULL in relational databases by systematically applying Kleene's three-valued logic, whose third truth-value is unknown, to SQL. In this paper we show that by adding a fourth truth-value, inconsistent, all the advantages of the three-valued approach can be retained, and negation can be given a constructive, intuitionistic meaning that allows negative knowledge to be specified explicitly in the logic, without resorting to extra-logical notions of stratification or to non-monotonic reasoning. The four-valued approach also allows for a computationally efficient treatment of query answering in the presence of inconsistencies, in contrast to the computationally intractable repair approach to inconsistency management. From a practical viewpoint, we show that the Cylindric Star Algebra, developed by the authors, is particularly well suited for evaluating first-order queries on four-valued databases, and that the framework of data exchange can be smoothly adapted to the four truth-values.

Twitter-based Influenza Surveillance: An Analysis of the 2016-2017 and 2017-2018 Seasons in Italy
C. Comito, Agostino Forestiero, C. Pizzuti. DOI: 10.1145/3216122.3216128

Influenza surveillance through social media data is becoming an important research topic because it could enhance the capabilities of official surveillance systems in monitoring seasonal flu outbreaks, providing healthcare organizations with improved situational awareness. In this paper, the 2016-2017 and 2017-2018 influenza seasons in Italy are investigated by analyzing tweets posted by users about influenza-like illness. Two types of analysis are performed. The first studies the correlation between tweets containing the most frequent flu-related words and the data provided by the Italian InfluNet surveillance system. The second examines people's sentiment towards the medicines used to treat flu. We show that there is a strong correlation between the reports published in the InfluNet system and the content posted by Twitter users about their symptoms and health state. Moreover, we found that the sentiment expressed about flu medication is rather negative.