Leveraging linked entities to estimate focus time of short texts
C. Morbidoni, A. Cucchiarelli, D. Ursino. DOI: 10.1145/3216122.3216158

Time is a useful dimension to explore in text databases, especially where historical and factual information is concerned. As documents generally refer to different events and time periods, understanding the focus time of key sentences, defined as the time period the content refers to, is a crucial task in temporally annotating a document. In this paper, we leverage a bag-of-linked-entities representation of sentences, together with temporal information from Wikipedia and DBpedia, to implement a novel approach to focus time estimation. We evaluate our approach on sample datasets and compare it with a state-of-the-art method, measuring improvements in MRR.
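The abstract gives no code, but the core idea can be made concrete. Below is a minimal sketch, assuming a hypothetical `ENTITY_YEARS` lookup distilled from Wikipedia/DBpedia temporal facts; the entity names and the simple year-voting scheme are illustrative assumptions, not the authors' actual method.

```python
from collections import Counter

# Hypothetical entity -> associated-years mapping, as might be distilled
# from Wikipedia/DBpedia temporal facts (names are illustrative only).
ENTITY_YEARS = {
    "Battle_of_Waterloo": [1815],
    "Napoleon": [1799, 1804, 1815, 1821],
    "Duke_of_Wellington": [1815, 1828],
}

def estimate_focus_time(linked_entities, top_k=3):
    """Score candidate years by how many linked entities vote for them."""
    votes = Counter()
    for entity in linked_entities:
        for year in ENTITY_YEARS.get(entity, []):
            votes[year] += 1
    return votes.most_common(top_k)

# A sentence represented as a bag of linked entities:
print(estimate_focus_time(["Battle_of_Waterloo", "Napoleon", "Duke_of_Wellington"]))
# -> [(1815, 3), ...]: 1815 is the most plausible focus time
```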
On Improving Data Skew Resilience In Main-memory Hash Joins
Puya Memarzia, S. Ray, V. Bhavsar. DOI: 10.1145/3216122.3216156

Main-memory hash joins are an important category of in-memory joins. However, the performance of these joins can be hindered by dataset skew, shuffling, and load balancing. We conducted a comprehensive study of the effects of dataset skew on four hash join algorithms. We show that hash joins are acutely affected by dataset skew, and that performance gets worse with shuffled data. To address these issues, we propose non-partitioning hash joins using two different hash tables. First, we use a separate-chaining hash table based on an existing implementation that we have modified. This version outperforms the original implementation on skewed datasets by up to three orders of magnitude. Second, we propose a novel hash table for hash joins, called the Maple hash table. We demonstrate that this hash table is better suited to skewed and/or shuffled datasets. Moreover, this approach further improves performance by up to 17.3×.
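Neither the authors' modified implementation nor the Maple hash table is described in the abstract, so the sketch below only illustrates the baseline structure the first variant builds on: a non-partitioning hash join over a separate-chaining table, where heavy (skewed) keys simply grow their chain.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Non-partitioning hash join using a separate-chaining hash table:
    all build-side rows with the same key hash into one bucket (chain),
    so a skewed key grows its chain instead of overflowing a partition."""
    table = defaultdict(list)          # key -> chain of matching rows
    for row in build_rows:             # build phase
        table[row[build_key]].append(row)
    for row in probe_rows:             # probe phase
        for match in table.get(row[probe_key], ()):
            yield {**match, **row}

# Toy relations (illustrative data):
custs  = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bo"}]
orders = [{"cust": 1, "item": "a"}, {"cust": 1, "item": "b"}, {"cust": 2, "item": "c"}]
print(list(hash_join(custs, orders, "cust", "cust")))
```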
Data Volume Based Data Gathering in WSNs using Mobile Data Collector
Syed Muhammad Abrar Akber, I. Khan, S. S. Muhammad, Syed Muhammad Mohsin, I. Khan, Shahaboddin Shamshirband, Anthony T. Chronopoulos. DOI: 10.1145/3216122.3216166

Data collection and transmission are the fundamental operations of WSNs. The performance of WSNs relies on these essential tasks, because data gathering directly affects the efficiency and lifetime of the network. This paper presents a data-volume-based data collection technique using a Mobile Data Collector (MDC). In this technique, the MDC uses data volume information to plan its visits to the nodes: it visits only those nodes that have generated data, while the rest of the nodes are ignored. The scheme is validated with the help of simulations, and the results are compared with well-known existing techniques. The results show that the proposed scheme is energy efficient.
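The abstract does not specify how the MDC plans its route, so the following sketch makes an assumption: a greedy nearest-neighbour tour over only the nodes that reported a positive data volume. The node table and field names are hypothetical.

```python
import math

# Hypothetical node table: position and volume of buffered data (bytes).
nodes = {
    "n1": {"pos": (0, 4), "data": 512},
    "n2": {"pos": (3, 0), "data": 0},     # nothing buffered -> skipped
    "n3": {"pos": (5, 5), "data": 2048},
}

def plan_tour(nodes, base=(0, 0)):
    """Greedy nearest-neighbour tour over nodes that reported data > 0."""
    pending = {n: v["pos"] for n, v in nodes.items() if v["data"] > 0}
    tour, here = [], base
    while pending:
        nxt = min(pending, key=lambda n: math.dist(here, pending[n]))
        tour.append(nxt)
        here = pending.pop(nxt)
    return tour

print(plan_tour(nodes))   # visits n1 and n3 only; n2 is ignored
```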
Algorithms for Computing Approximate Certain Answers over Incomplete Databases
S. Greco, Cristian Molinaro, I. Trubitsyna. DOI: 10.1145/3216122.3220542

Incomplete information arises in many database applications, such as data integration, data exchange, inconsistency management, data cleaning, ontological reasoning, and many others. A principled way of answering queries over incomplete databases is to compute certain answers, which are query answers that can be obtained from every complete database represented by the incomplete one. For databases containing (labeled) nulls, certain answers to positive queries can be easily computed in polynomial time, but for more general queries with negation the problem becomes coNP-hard. To make query answering feasible in practice, one might resort to SQL's evaluation, but unfortunately, the way SQL behaves in the presence of nulls may result in wrong answers. Thus, on the one hand, SQL's evaluation is efficient but flawed; on the other hand, certain answers are a principled semantics but carry high complexity. To deal with this issue, recent research has focused on developing polynomial-time approximation algorithms for computing (approximate) certain answers. This paper surveys recent advances in this area.
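To make the semantics concrete, here is a toy example (not one of the surveyed algorithms): certain answers computed by brute-force enumeration of the valuations of a labeled null over an assumed finite domain. The exponential enumeration is exactly the cost that the polynomial-time approximation algorithms surveyed in the paper are designed to avoid; note that naively treating the null as an ordinary value would return {2}, which is not certain.

```python
# Toy incomplete relation R(x) with a labeled null "N"; S(x) is complete.
R = [1, "N"]
S = [1, 2]
DOMAIN = [1, 2, 3]   # assumed finite active domain, for illustration only

# Query with negation: Q = { x in S | x not in R }
def q(r, s):
    return {x for x in s if x not in r}

# Certain answers: tuples returned by Q over *every* completion of R.
certain = None
for v in DOMAIN:                       # every valuation of the null N
    completion = [v if t == "N" else t for t in R]
    ans = q(completion, S)
    certain = ans if certain is None else certain & ans
print(certain)   # set() -- nothing is certain, since N might equal 2

# SQL's three-valued evaluation of NOT IN happens to agree here, but in
# general its answers can be both unsound and incomplete, which is the
# "efficient but flawed" behaviour the survey discusses.
```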
A Tensor Based Data Model for Polystore: An Application to Social Networks Data
É. Leclercq, M. Savonnet. DOI: 10.1145/3216122.3216152

In this article, we show how the mathematical notion of a tensor can be used to build a multi-paradigm model for the storage of social data in data warehouses. From an architectural point of view, our approach makes it possible to link different storage systems (a polystore) and limits the impact of ETL tools performing the model transformations required to feed different analysis algorithms. Systems can therefore take advantage of multiple data models, both in terms of query execution performance and in terms of the semantic expressiveness of the data representation. The proposed model achieves logical independence between the data and the programs implementing analysis algorithms. With a concrete case study on message virality on Twitter during the 2017 French presidential election, we highlight some of the contributions of our model.
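A minimal sketch of the underlying idea, under assumptions of my own: a 3-way tensor (users × hashtags × hours) holding tweet counts, from which different paradigm-specific views are derived by slicing and summing rather than by ETL round-trips. The axes and data are illustrative, not the paper's model.

```python
import numpy as np

# Hypothetical 3-way tensor: users x hashtags x hours, holding tweet counts.
users, tags = ["u1", "u2"], ["#electionFR", "#debate"]
T = np.zeros((len(users), len(tags), 3))
T[0, 0, 1] = 5   # u1 used #electionFR five times in hour 1
T[1, 0, 1] = 2
T[1, 1, 2] = 7

# One stored object feeds several data models without transformation tools:
adjacency = T.sum(axis=2)          # graph-style view: user-hashtag matrix
timeline  = T.sum(axis=(0, 1))     # time-series view: activity per hour
per_tag   = {t: T[:, i, :] for i, t in enumerate(tags)}  # per-topic slices
print(adjacency, timeline, sep="\n")
```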
Continuous Time-Dependent kNN Join by Binary Sketches
Filip Nálepa, Michal Batko, P. Zezula. DOI: 10.1145/3216122.3216159

An important functionality of current social applications is real-time recommendation, which is responsible for suggesting relevant published data to users based on their preferences. By representing the users and the published data in a metric space, each user can be recommended their k nearest neighbors among the published data. We consider the scenario in which the relevance of a published data item to a user decreases as the item gets older, i.e., a time-dependent distance function is applied. We define this problem as the continuous time-dependent kNN join and provide a solution for a broad range of time-dependent functions. In addition, we propose a binary sketch-based approximation technique that speeds up the join evaluation by replacing expensive metric distance computations with cheap Hamming distances.
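The filter-and-refine idea behind binary sketches can be shown in a few lines: rank candidates by cheap Hamming distance on bit sketches, and reserve the expensive metric distance for the survivors. The `slack` over-retrieval factor is a hypothetical parameter of this sketch, not taken from the paper.

```python
# Minimal sketch of the filtering idea, not the paper's algorithm.

def hamming(a: int, b: int) -> int:
    """Hamming distance between two sketches packed into ints."""
    return bin(a ^ b).count("1")

def knn_candidates(query_sketch, items, k, slack=2):
    """items: list of (sketch, payload). Keep the k*slack best by Hamming
    distance; only these are refined with the true metric distance."""
    ranked = sorted(items, key=lambda it: hamming(query_sketch, it[0]))
    return ranked[: k * slack]

items = [(0b1010, "a"), (0b1011, "b"), (0b0101, "c"), (0b1110, "d")]
print(knn_candidates(0b1010, items, k=1))
```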
Distributed Learning of Process Models for Next Activity Prediction
Michelangelo Ceci, Michele Spagnoletta, Pasqua Fabiana Lanotte, D. Malerba. DOI: 10.1145/3216122.3216125

Process mining is a research discipline that aims to discover, monitor, and improve real processes using event logs. In this paper we tackle the problem of next-activity prediction/recommendation via "nested prediction model" learning: we first identify recurrent and frequent sequences of activities, and we then learn a prediction model for each frequent sequence. The key principle underlying the design of the proposed solution is the ability to process massive logs by means of a parallel and distributed solution (exploiting the Spark parallel computation framework) that can make reasonable decisions in the absence of perfect models. Indeed, given the classical minimum-support threshold and a user-specified error bound, our approach exploits the Chernoff bound to mine "approximate" frequent sequences with statistical error guarantees on their actual supports. Experiments on real-world log data prove the effectiveness of the proposed approach.
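The abstract does not give the exact bound used, so the sketch below shows the standard Hoeffding-style slack often labeled a "Chernoff bound" in frequent-pattern mining: lower the support threshold by an error term so that, with high probability, no truly frequent sequence is missed. Function names and the example counts are my own.

```python
import math

def support_slack(n, delta):
    """Hoeffding/Chernoff-style slack: with probability >= 1 - delta, an
    empirical support computed over n traces is within eps of the true one."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

def approx_frequent(seq_counts, n, min_sup, delta=0.01):
    """Keep sequences whose empirical support clears min_sup - eps, so that
    truly frequent sequences are retained (up to failure probability delta)."""
    eps = support_slack(n, delta)
    return {s: c / n for s, c in seq_counts.items() if c / n >= min_sup - eps}

counts = {("login", "search"): 480, ("search", "buy"): 260, ("buy", "logout"): 90}
print(approx_frequent(counts, n=1000, min_sup=0.3))
# ("search", "buy") survives at support 0.26 thanks to the slack
```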
Practical Study of Deterministic Regular Expressions from Large-scale XML and Schema Data
Yeting Li, Xinyu Chu, Xiaoying Mou, Chunmei Dong, H. Chen. DOI: 10.1145/3216122.3216126

Regular expressions are a fundamental concept in computer science and widely used in various applications. In this paper we focus on deterministic regular expressions (DREs). Since researchers previously lacked large datasets as evidence, we first harvested a large corpus of real data from the Web and then conducted a practical study of the usage of DREs. One feature of our work is that the dataset, obtained using several data collection strategies we propose, is substantially larger than those of previous work. The results show that more than 98% of the expressions in Relax NG are DREs and more than 56% of the expressions from RegExLib are DREs, even though neither Relax NG nor RegExLib imposes the determinism constraint. These observations indicate that DREs are commonly used in practice. The results also show that further study of the subclasses of DREs is necessary; we find that current research on new subclasses of DREs is insufficient. To the best of our knowledge, we are the first to analyze determinism and the subclasses of DREs for Relax NG and RegExLib and to report these results. Furthermore, we discuss applications of the dataset. We also analyze the referencing relationships among XSDs and define SchemaRank, which can be used in XML Schema design.
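For readers unfamiliar with determinism (one-unambiguity), here is a minimal checker based on the classical Brüggemann-Klein/Wood characterization, not the authors' tool: mark each symbol occurrence with a position, compute first/last/follow sets, and reject if two distinct positions with the same symbol ever compete. The tiny AST supports only union, concatenation, and star.

```python
# A regex is deterministic (one-unambiguous) iff no two distinct positions
# carrying the same symbol compete, neither in first(E) nor in follow(E, p).

class Sym:
    nullable = False
    def __init__(self, ch, pos): self.ch, self.pos = ch, pos
    def first(self): return {self.pos}
    def last(self):  return {self.pos}
    def follow(self, f): pass

class Star:
    nullable = True
    def __init__(self, e): self.e = e
    def first(self): return self.e.first()
    def last(self):  return self.e.last()
    def follow(self, f):
        self.e.follow(f)
        for p in self.e.last():            # loop back: last -> first
            f.setdefault(p, set()).update(self.e.first())

class Alt:
    def __init__(self, l, r): self.l, self.r = l, r
    @property
    def nullable(self): return self.l.nullable or self.r.nullable
    def first(self): return self.l.first() | self.r.first()
    def last(self):  return self.l.last() | self.r.last()
    def follow(self, f): self.l.follow(f); self.r.follow(f)

class Cat:
    def __init__(self, l, r): self.l, self.r = l, r
    @property
    def nullable(self): return self.l.nullable and self.r.nullable
    def first(self):
        return self.l.first() | (self.r.first() if self.l.nullable else set())
    def last(self):
        return self.r.last() | (self.l.last() if self.r.nullable else set())
    def follow(self, f):
        self.l.follow(f); self.r.follow(f)
        for p in self.l.last():            # left's last -> right's first
            f.setdefault(p, set()).update(self.r.first())

def is_dre(e, labels):
    """labels: position -> symbol. Deterministic iff no symbol clash."""
    def clash(ps): return len(ps) != len({labels[p] for p in ps})
    f = {}
    e.follow(f)
    return not clash(e.first()) and not any(clash(ps) for ps in f.values())

# (a|b)*a -- the classic non-deterministic expression: two competing 'a's.
labels = {1: "a", 2: "b", 3: "a"}
expr = Cat(Star(Alt(Sym("a", 1), Sym("b", 2))), Sym("a", 3))
print(is_dre(expr, labels))   # False; a(a|b)* would instead be deterministic
```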
The Deployment of an Enhanced Model-Driven Architecture for Business Process Management
R. McClatchey. DOI: 10.1145/3216122.3216155

Business systems these days need to be agile to address the needs of a changing world. Business modelling requires process management to be highly adaptable, with the ability to support dynamic workflows, inter-application integration (potentially between businesses), and process reconfiguration. Designing in the ability to cater for evolution is critical to success: to handle change, systems need the capability to adapt as and when necessary to changes in users' requirements. Using our implementation of a self-describing system, a so-called description-driven approach, new versions of data structures or processes can be created alongside older versions, providing a log of changes to the underlying data schema and enabling the gathering of traceable ("provenance") data. The CRISTAL software, which originated at CERN for handling physics data, uses versions of stored descriptions to define data and workflows that can be evolved over time, thereby handling evolving system needs. It has been customised for use in business as the Agilium-NG product. This paper reports on how the Agilium-NG software has enabled the deployment of a unique business process management solution that can be dynamically evolved to cater for changing user requirements.
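As a rough illustration of the description-driven idea (a toy of my own devising, not CRISTAL's actual data model): descriptions are never overwritten, each change appends a new version plus a log entry, and older instances keep referring to the version they were built from.

```python
import datetime

class DescriptionStore:
    """Toy description-driven store: item descriptions are append-only;
    every change creates a new version and a log entry, so old and new
    versions coexist and changes remain traceable (provenance)."""
    def __init__(self):
        self.versions, self.log = {}, []    # name -> [desc0, desc1, ...]

    def define(self, name, description):
        vs = self.versions.setdefault(name, [])
        vs.append(description)
        self.log.append((datetime.datetime.now(), name, len(vs) - 1))
        return len(vs) - 1                  # version number for provenance

    def get(self, name, version=-1):
        return self.versions[name][version]

store = DescriptionStore()
v0 = store.define("Order", {"fields": ["id", "total"]})
v1 = store.define("Order", {"fields": ["id", "total", "currency"]})
print(store.get("Order", v0), store.get("Order"))  # old and new coexist
```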
Proceedings of the 22nd International Database Engineering & Applications Symposium. DOI: 10.1145/3216122