Load balance for semantic cluster-based data integration systems
Edemberg Rocha da Silva, G. H. B. Souza, A. Salgado
DOI: 10.1145/2513591.2513648 · pp. 174-179

Data integration systems based on Peer-to-Peer environments have been developed to integrate dynamic, autonomous and heterogeneous data sources on the Web. Some of these systems adopt semantic approaches for clustering their data sources, reducing the search space. However, the clusters may become overloaded, and traditional load-balancing strategies are not suitable for semantic clusters. In this paper, we discuss the limitations of load-balancing strategies in semantic clusters, propose a load-balancing solution, and present experimental results.
On-the-fly generation of multidimensional data cubes for web of things
Muntazir Mehdi, Ratnesh Sahay, Wassim Derguech, E. Curry
DOI: 10.1145/2513591.2513655 · pp. 28-37

The dynamicity of sensor data sources, and the publishing of real-time sensor data over a generalised infrastructure like the Web, pose a new set of integration challenges. Semantic Sensor Networks demand high expressivity for efficient formal analysis of sensor data. This article specifically addresses the problem of adapting data-model-specific or context-specific properties in the automatic generation of multidimensional data cubes. The idea is to generate data cubes on the fly from syntactic sensor data to sustain decision making and event processing, and to publish this data as Linked Open Data.
Matching bounds for the all-pairs MapReduce problem
F. Afrati, J. Ullman
DOI: 10.1145/2513591.2513663 · pp. 3-4

The all-pairs problem is an input-output relationship in which each output corresponds to a pair of inputs and each pair of inputs has a corresponding output. It models similarity joins where no simplification of the search for similar pairs (e.g., locality-sensitive hashing) is possible, and each input must be compared with every other input to determine the pairs that are "similar." For MapReduce algorithms, there was a factor-of-2 gap between the lower bound on the necessary communication and the communication required by the best known algorithm. In this brief paper we show that the lower bound can essentially be met.
Near real-time with traditional data warehouse architectures: factors and how-to
Nickerson Ferreira, P. Martins, P. Furtado
DOI: 10.1145/2513591.2513650 · pp. 68-75

Traditional data warehouses integrate new data during lengthy offline periods, with indexes being dropped and rebuilt for efficiency reasons. These and other factors are commonly believed to make them unfit for real-time warehousing. We analyze how a set of factors influences near-real-time and frequent loading capabilities, and what can be done to improve near-real-time capacity using a traditional architecture. We analyze how the query workload affects and is affected by the ETL process, and the influence of factors such as the type of load strategy, the size of the load data, indexing, integrity constraints, refresh activity over summary data, and fact-table partitioning. We evaluate these factors experimentally and show that partitioning is an important factor in delivering near-real-time capacity.
Content-based annotation and classification framework: a general multi-purpose approach
Michal Batko, Jan Botorek, Petra Budíková, P. Zezula
DOI: 10.1145/2513591.2513651 · pp. 58-67

Unprecedented amounts of digital data are becoming available nowadays, but the data frequently lack the semantic information necessary to organize these resources effectively. For images in particular, textual annotations that represent the semantics are highly desirable. Only a small percentage of images is created with reliable annotations; therefore, a lot of effort is being invested in automatic image annotation. In this paper, we address the annotation problem from a general perspective and introduce a new annotation model that is applicable to many text-assignment problems. We also provide experimental results from several implemented instances of our model.
Evaluating skyline queries over vertically partitioned tables
J. Subero, Marlene Goncalves
DOI: 10.1145/2513591.2513647 · pp. 180-185

In recent years, many researchers have been interested in the problem of Skyline query evaluation, because this kind of query can filter high volumes of data. Skyline queries return the objects that are best according to multiple user-specified criteria. In this work, we propose two algorithms to evaluate Skyline queries over Vertically Partitioned Tables (VPTs). Additionally, we have performed an experimental study showing that our algorithms outperform existing state-of-the-art algorithms.
Self-managing online partitioner for databases (SMOPD): a vertical database partitioning system with a fully automatic online approach
Liangzhe Li, L. Gruenwald
DOI: 10.1145/2513591.2513649 · pp. 168-173

A key factor in measuring database performance is query response time, which is dominated by I/O time. Database partitioning is among the techniques that can reduce I/O time significantly. However, partitioning the tables in a database efficiently is not an easy problem, especially when the partitioning task should be done automatically by the system itself. This paper introduces an algorithm called Self-Managing Online Partitioner for Databases (SMOPD), which performs vertical partitioning based on closed item sets mined from a query set and on statistics mined from system statistic views. The algorithm dynamically monitors database performance using user-configured parameters and automatically detects the performance trend, so that it can decide when to perform a re-partitioning action without feedback from DBAs. It thus frees DBAs from the heavy tasks of continuously monitoring the system and struggling with large statistics tables. The paper also presents experimental results evaluating the performance of the algorithm using the TPC-H benchmark.
Read optimisations for append storage on flash
R. Gottstein, Ilia Petrov, A. Buchmann
DOI: 10.1145/2513591.2513640 · pp. 106-113

Append-/Log-based Storage Managers (LbSM) for database systems are a good match for the characteristics and behaviour of Flash technology: they alleviate random writes, reducing the impact of the Flash read/write asymmetry and increasing endurance and performance. A recently proposed combination of multi-versioning database approaches and LbSM called SIAS [9] offers further benefits: it substantially lowers the write rate by appending in tuple-version granularity, and therefore improves performance. In SIAS, a page contains versions of tuples of the same table; once appended, such a page is immutable. The only allowable operations are reads (lookups, scans, version-visibility checks) in tuple-version granularity, so optimising for them offers an essential performance increase. In this work-in-progress paper we propose two types of read optimisations: Multi-Version Index and Ordered Log Storage.

Benefits of Ordered Log Storage: (i) read efficiency due to the use of parallel read streams; (ii) write efficiency, since larger amounts of data are appended sequentially; (iii) fast garbage collection: read multiple sorted runs, filter dead tuples, and write one single, large (combined) sorted run; (iv) possible cache-efficiency optimisations (for large scans).

Benefits of Multi-Version Indexing: (i) index-only visibility checks; (ii) postponing of index reorganisations; (iii) no invalid-tuple bits in the index (no in-place updates); (iv) pre-filtering of invisible tuple versions; (v) easy identification of tuple versions to be garbage collected.

Benefits of combining both approaches: (i) indexed and ordered access; (ii) range searches in sorted runs; (iii) on-the-fly garbage collection (checking of one bit).
Breaking skyline computation down to the metal: the skyline breaker algorithm
D. Köppl
DOI: 10.1145/2513591.2513637 · pp. 132-141

Given a sequential input connection, we tackle parallel skyline computation over the incoming data by means of a spatial tree structure that indexes fine-grained feature vectors. For this purpose, multiple local split decision trees are filled simultaneously before the actual computation starts. We exploit the special tree structure to clip parts of the tree without a depth-first search. The split of the data allows us to perform this step in a divide-and-conquer manner. With this schedule we seek to provide an algorithm that is robust against the "curse of dimensionality" and different data distributions.
Verification of k-coverage on query line segments
Kun-Han Juang, En Tzu Wang, Chieh-Feng Chiang, Arbee L. P. Chen
DOI: 10.1145/2513591.2513639 · pp. 114-121

The coverage problem is one of the fundamental problems in sensor networks; it reflects the degree to which a region is monitored by sensors. In this paper, we make the first attempt to address the k-coverage verification problem for a given query line segment, which returns all sub-segments of the line segment that are covered by at least k sensors. To deal with the problem, we propose three methods based on the R-tree index. The first method is the most primitive one: it identifies all intersection points of the query line segment with the circumferences of the sensors' covering regions and then checks each sub-segment to see whether it is k-covered. Improving on the first method, the second method calculates a lower bound on the number of sensors covering a specific sub-segment to reduce the computation costs. The third method partitions the query line segment into sub-segments of equal length and then verifies each of them. A series of experiments on a real dataset and two synthetic datasets evaluates these methods; the results demonstrate that the third method has the best performance of the three.