Summarizing ontology-based schemas in PDMS
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452706
Carlos Eduardo S. Pires, Paulo Orlando Queiroz-Sousa, Zoubida Kedad, A. Salgado
Quickly understanding the content of a data source is very useful in several contexts. In a Peer Data Management System (PDMS), peers can be semantically clustered, each cluster being represented by a schema obtained by merging the local schemas of the peers in this cluster. In this paper, we present a process for summarizing schemas of peers participating in a PDMS. We assume that all the schemas are represented by ontologies and we propose a summarization algorithm which produces a summary containing the maximum number of relevant concepts and the minimum number of non-relevant concepts of the initial ontology. The relevance of a concept is determined using the notions of centrality and frequency. Since several possible candidate summaries can be identified during the summarization process, classical Information Retrieval metrics are employed to determine the best summary.
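As an illustration of how centrality and frequency might be combined into a concept relevance score, and how IR metrics can pick among candidate summaries, here is a minimal Python sketch. It is not the authors' algorithm: the weighted scoring, the candidate generation over a few weights `alpha`, and the assumption that a reference set of relevant concepts is available for the F-measure are all illustrative choices.

```python
def summarize_ontology(concepts, edges, frequency, size, relevant):
    """Score concepts by degree centrality and frequency, build candidate
    summaries for several weightings, and keep the one with the best
    F-measure against a reference set `relevant` (a set of concept names).
    Illustrative only, not the paper's algorithm."""
    degree = {c: 0 for c in concepts}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    max_deg = max(degree.values(), default=0) or 1
    max_freq = max(frequency.values(), default=0) or 1

    def relevance(c, alpha):
        # alpha weights centrality against frequency; both normalized to [0, 1]
        return alpha * degree[c] / max_deg + (1 - alpha) * frequency.get(c, 0) / max_freq

    def f_measure(summary):
        hits = len(set(summary) & relevant)
        if hits == 0:
            return 0.0
        precision, recall = hits / len(summary), hits / len(relevant)
        return 2 * precision * recall / (precision + recall)

    candidates = [
        sorted(concepts, key=lambda c: relevance(c, a), reverse=True)[:size]
        for a in (0.25, 0.5, 0.75)          # several candidate summaries
    ]
    return max(candidates, key=f_measure)

# hypothetical usage on a tiny ontology
print(summarize_ontology(
    concepts=["Person", "Author", "Paper", "Venue", "Keyword"],
    edges=[("Author", "Person"), ("Author", "Paper"),
           ("Paper", "Venue"), ("Paper", "Keyword")],
    frequency={"Person": 5, "Author": 4, "Paper": 4, "Venue": 2, "Keyword": 1},
    size=3,
    relevant={"Person", "Author", "Paper"},
))
```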
{"title":"Summarizing ontology-based schemas in PDMS","authors":"Carlos Eduardo S. Pires, Paulo Orlando Queiroz-Sousa, Zoubida Kedad, A. Salgado","doi":"10.1109/ICDEW.2010.5452706","DOIUrl":"https://doi.org/10.1109/ICDEW.2010.5452706","url":null,"abstract":"Quickly understanding the content of a data source is very useful in several contexts. In a Peer Data Management System (PDMS), peers can be semantically clustered, each cluster being represented by a schema obtained by merging the local schemas of the peers in this cluster. In this paper, we present a process for summarizing schemas of peers participating in a PDMS. We assume that all the schemas are represented by ontologies and we propose a summarization algorithm which produces a summary containing the maximum number of relevant concepts and the minimum number of non-relevant concepts of the initial ontology. The relevance of a concept is determined using the notions of centrality and frequency. Since several possible candidate summaries can be identified during the summarization process, classical Information Retrieval metrics are employed to determine the best summary.","PeriodicalId":442345,"journal":{"name":"2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121095924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph indexing for reachability queries
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452724
Hilmi Yildirim, Mohammed J. Zaki
Reachability queries appear very frequently in many important applications that work with graph-structured data, and in several of them testing reachability between two nodes is a central problem. For example, in protein-protein interaction networks such a query can answer whether two proteins are related, whereas in ontological databases it might correspond to asking whether one concept subsumes another. Given the huge databases against which reachability queries are often run, it is important to devise a scalable indexing scheme with almost constant query time. In this paper, we bring a new dimension to the well-known interval labeling approach: we label each node with multiple intervals instead of a single interval, so that each labeling represents a hyper-rectangle. Our new approach, BOX, can index DAGs in linear time and space while keeping the query time admissible. In experiments, we show that BOX is not vulnerable to increasing edge-to-node ratios, which are a problem for existing approaches.
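A minimal Python sketch of the multi-interval idea: each node gets one interval per randomized post-order pass, non-containment of intervals rejects a query immediately, and a guided DFS resolves the remaining (possibly false-positive) cases. This follows the general multi-interval recipe; the exact labeling and pruning rules of BOX may differ, and `label_dag` / `reachable` are illustrative names.

```python
import random
from collections import defaultdict

def label_dag(nodes, edges, k=3, seed=0):
    """Give every node k interval labels, one per randomized post-order pass.
    A label (low, post) spans the post-order ranks of everything reachable
    from the node in that pass."""
    rng = random.Random(seed)
    children, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        children[u].append(v)
        indeg[v] += 1
    roots = [n for n in nodes if indeg[n] == 0]
    labels = {n: [] for n in nodes}
    for _ in range(k):
        rank, visited = [0], set()

        def dfs(u):
            visited.add(u)
            low = float("inf")
            order = children[u][:]
            rng.shuffle(order)                 # randomization diversifies the labels
            for v in order:
                if v not in visited:
                    low = min(low, dfs(v))
                else:
                    low = min(low, labels[v][-1][0])   # v already finished this pass
            rank[0] += 1
            low = min(low, rank[0])
            labels[u].append((low, rank[0]))
            return low

        for r in roots:
            if r not in visited:
                dfs(r)
    return labels, children

def reachable(u, v, labels, children):
    """u can reach v only if every interval of v is contained in the matching
    interval of u; containment can be a false positive, so verify with a DFS
    that only descends into nodes whose intervals still contain v's."""
    def contains(a, b):
        return all(bl >= al and bh <= ah
                   for (al, ah), (bl, bh) in zip(labels[a], labels[b]))
    if u == v:
        return True
    if not contains(u, v):
        return False                           # pruned without touching the graph
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in children[x]:
            if y not in seen and contains(y, v):
                seen.add(y)
                stack.append(y)
    return False
```

Containment of v's intervals in u's is a necessary condition for reachability, so the pruned DFS never misses a true path; the multiple randomized passes shrink the hyper-rectangles and cut down the false positives that force a DFS.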
{"title":"Graph indexing for reachability queries","authors":"Hilmi Yildirim, Mohammed J. Zaki","doi":"10.1109/ICDEW.2010.5452724","DOIUrl":"https://doi.org/10.1109/ICDEW.2010.5452724","url":null,"abstract":"Reachability queries appear very frequently in many important applications that work with graph structured data. In some of them, testing reachability between two nodes corresponds to an important problem. For example, in proteinprotein interaction networks one can use it to answer whether two proteins are related, whereas in ontological databases such queries might correspond to the question of whether a concept subsumes another one. Given the huge databases that are often tested with reachability queries, it is important problem to come up with a scalable indexing scheme that has almost constant query time. In this paper, we bring a new dimension to the well-known interval labeling approach. Our approach labels each node with multiple intervals instead of a single interval so that each labeling represents a hyper-rectangle. Our new approach BOX can index dags in linear time and space while retaining the querying time admissible. In experiments, we show that BOX is not vulnerable to increasing edge to node ratios which is a problem for the existing approaches.","PeriodicalId":442345,"journal":{"name":"2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116410351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DIVERSUM: Towards diversified summarisation of entities in knowledge graphs
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452707
M. Sydow, Mariusz Pikula, Ralf Schenkel
A problem of diversified entity summarisation in RDF-like knowledge graphs, with a limited "presentation budget", is formulated and studied. A greedy algorithm that adapts previous ideas from IR is proposed, and preliminary but promising experimental results on a real dataset extracted from the IMDB database are presented.
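One possible reading of a greedy, budget-limited selection is sketched below in Python: at each step pick the fact whose predicate is least represented in the summary so far. The representation of facts as (predicate, object, weight) triples and the tie-breaking by weight are assumptions, not the paper's exact algorithm.

```python
def diversified_summary(facts, budget):
    """Greedily pick up to `budget` facts about one entity, preferring
    predicates not yet represented in the summary (diversity first),
    then higher-weighted facts. `facts` are (predicate, object, weight)."""
    summary, used, remaining = [], {}, list(facts)
    while remaining and len(summary) < budget:
        best = min(remaining, key=lambda f: (used.get(f[0], 0), -f[2]))
        summary.append(best)
        used[best[0]] = used.get(best[0], 0) + 1
        remaining.remove(best)
    return summary

# hypothetical example: facts about a movie entity
facts = [("directedBy", "R. Scott", 0.9), ("actedIn", "S. Weaver", 0.8),
         ("actedIn", "J. Hurt", 0.7), ("genre", "sci-fi", 0.6)]
print(diversified_summary(facts, budget=3))
```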
{"title":"DIVERSUM: Towards diversified summarisation of entities in knowledge graphs","authors":"M. Sydow, Mariusz Pikula, Ralf Schenkel","doi":"10.1109/ICDEW.2010.5452707","DOIUrl":"https://doi.org/10.1109/ICDEW.2010.5452707","url":null,"abstract":"A problem of diversified entity summarisation in RDF-like knowledge graphs, with limited ¿presentation budget¿, is formulated and studied. A greedy algorithm that adapts previous ideas from IR is proposed and preliminary but promising experimental results on real dataset extracted from IMDB database are presented.","PeriodicalId":442345,"journal":{"name":"2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127497090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Streaming data integration: Challenges and opportunities
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452751
Nesime Tatbul
In this position paper, we motivate the need for streaming data integration in three main forms: across multiple streaming data sources, over multiple stream processing engine instances, and between stream processing engines and traditional database systems. We argue that this need presents a broad range of challenges and opportunities for new research. We provide an overview of the young state of the art in this area and further discuss a selected set of concrete research topics that are currently under investigation within the scope of our MaxStream federated stream processing project at ETH Zurich.
{"title":"Streaming data integration: Challenges and opportunities","authors":"Nesime Tatbul","doi":"10.1109/ICDEW.2010.5452751","DOIUrl":"https://doi.org/10.1109/ICDEW.2010.5452751","url":null,"abstract":"In this position paper, we motivate the need for streaming data integration in three main forms including across multiple streaming data sources, over multiple stream processing engine instances, and between stream processing engines and traditional database systems. We argue that this need presents a broad range of challenges and opportunities for new research. We provide an overview of the young state of the art in this area and further discuss a selected set of concrete research topics that are currently under investigation within the scope of our MaxStream federated stream processing project at ETH Zurich.","PeriodicalId":442345,"journal":{"name":"2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125537137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Subspace similarity search using the ideas of ranking and top-k retrieval
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452771
T. Bernecker, Tobias Emrich, Franz Graf, H. Kriegel, Peer Kröger, M. Renz, Erich Schubert, A. Zimek
There are abundant application scenarios for similarity search in databases where the similarity of objects is defined only for a subset of attributes, i.e., in a subspace. While much research has been done on efficiently supporting single-column similarity queries or similarity queries in the full space, scarcely any support for similarity search in subspaces has been provided so far. The three existing approaches are variations of the sequential scan. Here, we propose the first index-based solution to subspace similarity search in arbitrary subspaces, which is based on the concepts of nearest-neighbor ranking and top-k retrieval.
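One way to combine per-dimension rankings into subspace k-NN is Fagin-style threshold aggregation; the sketch below illustrates that general idea and is not the authors' index structure. Squared Euclidean distance, in-memory per-dimension sorted lists, and the names `subspace_knn`/`rankings` are simplifying assumptions.

```python
import heapq

def subspace_knn(data, query, dims, k):
    """Threshold-algorithm-style sketch for k-NN restricted to `dims`:
    one ranking per dimension (ids sorted by per-dimension distance),
    merged by round-robin sorted access with a monotone stopping bound.
    `data` maps id -> point tuple; distance is squared Euclidean over `dims`."""
    rankings = {
        d: sorted(data, key=lambda i: (data[i][d] - query[d]) ** 2)
        for d in dims
    }
    positions = {d: 0 for d in dims}
    seen, best = set(), []                     # best: max-heap of (-dist, id), size <= k
    while any(positions[d] < len(rankings[d]) for d in dims):
        threshold = 0.0
        for d in dims:                         # one round of sorted access per dimension
            if positions[d] >= len(rankings[d]):
                continue
            i = rankings[d][positions[d]]
            positions[d] += 1
            threshold += (data[i][d] - query[d]) ** 2
            if i not in seen:
                seen.add(i)
                dist = sum((data[i][e] - query[e]) ** 2 for e in dims)
                heapq.heappush(best, (-dist, i))
                if len(best) > k:
                    heapq.heappop(best)
        # stop once no unseen object can beat the current k-th distance
        if len(best) == k and -best[0][0] <= threshold:
            break
    return sorted((-d, i) for d, i in best)

# usage: 3-NN of the query in the subspace spanned by dimensions 0 and 2
data = {0: (1.0, 5.0, 2.0), 1: (2.0, 1.0, 2.5), 2: (9.0, 0.0, 9.0), 3: (1.5, 7.0, 1.0)}
print(subspace_knn(data, query=(1.0, 0.0, 2.0), dims=(0, 2), k=3))
```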
{"title":"Subspace similarity search using the ideas of ranking and top-k retrieval","authors":"T. Bernecker, Tobias Emrich, Franz Graf, H. Kriegel, Peer Kröger, M. Renz, Erich Schubert, A. Zimek","doi":"10.1109/ICDEW.2010.5452771","DOIUrl":"https://doi.org/10.1109/ICDEW.2010.5452771","url":null,"abstract":"There are abundant scenarios for applications of similarity search in databases where the similarity of objects is defined for a subset of attributes, i.e., in a subspace, only. While much research has been done in efficient support of single column similarity queries or of similarity queries in the full space, scarcely any support of similarity search in subspaces has been provided so far. The three existing approaches are variations of the sequential scan. Here, we propose the first index-based solution to subspace similarity search in arbitrary subspaces which is based on the concepts of nearest neighbor ranking and top-k retrieval.","PeriodicalId":442345,"journal":{"name":"2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122270723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On novelty in publish/subscribe delivery
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452770
D. Souravlias, Marina Drosou, K. Stefanidis, E. Pitoura
In publish/subscribe systems, users express their interests in specific items of information and get notified when relevant data items are produced. Such systems allow users to stay informed without having to go through huge amounts of data. However, as the volume of data being created increases, some form of ranking of matched events is needed to avoid overwhelming the users. In this work-in-progress paper, we explore novelty as a ranking criterion: an event is considered novel if it matches a subscription that has rarely been matched in the past.
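A toy sketch of such a novelty score in Python: a matched event is ranked by the inverse match frequency of the subscriptions it satisfies, in the spirit of IDF. The scoring function, the `deliver` threshold, and the class name are illustrative assumptions, not the paper's definition.

```python
from collections import defaultdict

class NoveltyRanker:
    """Score a published event by how rarely its matched subscriptions have
    been matched before (rare subscription -> more novel event)."""
    def __init__(self):
        self.match_count = defaultdict(int)

    def score(self, matched_subscriptions):
        # inverse-frequency novelty; +1 avoids division by zero for fresh subscriptions
        return max(1.0 / (1 + self.match_count[s]) for s in matched_subscriptions)

    def deliver(self, event, matched_subscriptions, threshold=0.2):
        novelty = self.score(matched_subscriptions)
        for s in matched_subscriptions:
            self.match_count[s] += 1
        return novelty >= threshold            # e.g., notify only for novel events

ranker = NoveltyRanker()
print(ranker.deliver("event-1", ["sub-sports", "sub-finance"]))
```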
{"title":"On novelty in publish/subscribe delivery","authors":"D. Souravlias, Marina Drosou, K. Stefanidis, E. Pitoura","doi":"10.1109/ICDEW.2010.5452770","DOIUrl":"https://doi.org/10.1109/ICDEW.2010.5452770","url":null,"abstract":"In publish/subscribe systems, users express their interests in specific items of information and get notified when relevant data items are produced. Such systems allow users to stay informed without the need of going through huge amounts of data. However, as the volume of data being created increases, some form of ranking of matched events is needed to avoid overwhelming the users. In this work-in-progress paper, we explore novelty as a ranking criterion. An event is considered novel, if it matches a subscription that has rarely been matched in the past.","PeriodicalId":442345,"journal":{"name":"2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128363350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward large scale data-aware search: Ranking, indexing, resolution and beyond
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452729
Tao Cheng, K. Chang
As the Web has evolved into a data-rich repository, current search engines, with their standard “page view,” are becoming increasingly inadequate. To realize data-aware search, i.e., searching for data entities on the Web, we have been developing the various aspects of an entity search system, including entity ranking, entity indexing and parallelization, entity resolution, as well as generalization and customization. Preliminary results show the promise of our proposals, achieving high accuracy, efficiency and scalability. We also summarize our contributions and point out interesting future directions along the lines of enabling data-aware search on the Web.
{"title":"Toward large scale data-aware search: Ranking, indexing, resolution and beyond","authors":"Tao Cheng, K. Chang","doi":"10.1109/ICDEW.2010.5452729","DOIUrl":"https://doi.org/10.1109/ICDEW.2010.5452729","url":null,"abstract":"As the Web has evolved into a data-rich repository, with the standard “page view,” current search engines are becoming increasingly inadequate. To realize data-aware search, toward searching for data entities on the Web, we have been developing the various aspects of an entity search system, including: entity ranking, entity indexing and parallelization, entity resolution, as well as generalization and customization. Preliminary results show the promise of our proposals, achieving high accuracy, efficiency and scalability. We will also summarize our contributions and point out interesting future directions along the line of enabling data-aware search on the Web.","PeriodicalId":442345,"journal":{"name":"2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129898981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Constrained frequent itemset mining from uncertain data streams
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452736
C. Leung, Boyu Hao, Fan Jiang
Frequent itemset mining is a common data mining task in many real-life applications. The mined frequent itemsets can serve as building blocks for various patterns, including association rules and frequent sequences. Many existing algorithms mine frequent itemsets from traditional static transaction databases, in which the contents of each transaction (namely, its items) are definitely known and precise. However, there are many situations in which one is uncertain about the contents of transactions; this calls for the mining of uncertain data. Moreover, there are also situations in which users are interested in only some portions of the mined frequent itemsets (i.e., itemsets satisfying user-specified constraints, which express the user's interest); this leads to constrained mining. Furthermore, due to advances in technology, a flood of data can be produced in many situations; this calls for the mining of data streams. To deal with all these situations, we propose tree-based algorithms that efficiently mine streams of uncertain data for frequent itemsets satisfying user-specified constraints.
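The standard notion behind mining uncertain data is expected support under item independence; the Python sketch below computes it over a sliding window and filters itemsets with a user-supplied constraint. The naive enumeration is only a stand-in for the paper's tree-based algorithms, and `mine_window`, `max_size`, and the window length are illustrative choices.

```python
from itertools import combinations
from collections import deque

def expected_support(itemset, window):
    """Expected support of an itemset over a window of uncertain transactions.
    Each transaction maps item -> existence probability; assuming item
    independence, the itemset's probability per transaction is the product."""
    total = 0.0
    for txn in window:
        p = 1.0
        for item in itemset:
            p *= txn.get(item, 0.0)
        total += p
    return total

def mine_window(window, minsup, constraint, max_size=3):
    """Naive enumeration sketch: report itemsets that satisfy the user
    constraint and reach `minsup` expected support in the current window."""
    items = sorted({i for txn in window for i in txn})
    result = {}
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            if not constraint(itemset):
                continue
            sup = expected_support(itemset, window)
            if sup >= minsup:
                result[itemset] = sup
    return result

# usage: keep the last N uncertain transactions of the stream in a bounded window
window = deque(maxlen=4)
window.extend([{"a": 0.9, "b": 0.6}, {"a": 0.5, "c": 1.0}, {"a": 0.8, "b": 0.7}])
print(mine_window(window, minsup=1.0, constraint=lambda s: "a" in s))
```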
{"title":"Constrained frequent itemset mining from uncertain data streams","authors":"C. Leung, Boyu Hao, Fan Jiang","doi":"10.1109/ICDEW.2010.5452736","DOIUrl":"https://doi.org/10.1109/ICDEW.2010.5452736","url":null,"abstract":"Frequent itemset mining is a common data mining task for many real-life applications. The mined frequent itemsets can be served as building blocks for various patterns including association rules and frequent sequences. Many existing algorithms mine for frequent itemsets from traditional static transaction databases, in which the contents of each transaction (namely, items) are definitely known and precise. However, there are many situations in which ones are uncertain about the contents of transactions. This calls for the mining of uncertain data. Moreover, there are also situations in which users are interested in only some portions of the mined frequent itemsets (i.e., itemsets satisfying user-specified constraints, which express the user interest). This leads to constrained mining. Furthermore, due to advances in technology, a flood of data can be produced in many situations. This calls for the mining of data streams. To deal with all these situations, we propose tree-based algorithms to efficiently mine streams of uncertain data for frequent itemsets that satisfy user-specified constraints.","PeriodicalId":442345,"journal":{"name":"2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)","volume":"172 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132485477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Privometer: Privacy protection in social networks
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452715
N. Talukder, M. Ouzzani, A. Elmagarmid, Hazem Elmeleegy, M. Yakout
The increasing popularity of social networks, such as Facebook and Orkut, has raised several privacy concerns. Traditional ways of safeguarding the privacy of personal information by hiding sensitive attributes are no longer adequate. Research shows that probabilistic classification techniques can effectively infer such private information. The sensitive information disclosed by friends, group affiliations, and even participation in activities such as tagging and commenting are considered background knowledge in this process. In this paper, we present a privacy protection tool, called Privometer, that measures the amount of sensitive information leakage in a user profile and suggests self-sanitization actions to regulate the amount of leakage. In contrast to previous research, where inference techniques use publicly available profile information, we consider an augmented model in which a potentially malicious application installed in the profiles of the user's friends can access substantially more information. In our model, merely hiding the sensitive information is not sufficient to protect the user's privacy. We present an implementation of Privometer in Facebook.
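A toy sketch of what measuring leakage could look like: the hidden attribute is "inferred" from friends' disclosed values with a smoothed relative frequency, and the confidence of the most likely value is reported as the leakage. Privometer itself relies on probabilistic classifiers over richer background knowledge; the functions and the prior below are purely illustrative.

```python
from collections import Counter

def inferred_distribution(friend_values, prior):
    """Estimate a hidden attribute (e.g., political view) from friends'
    disclosed values via smoothed relative frequency -- a stand-in for the
    probabilistic classifiers used in the paper."""
    counts = Counter(friend_values)
    denom = sum(counts.values()) + sum(prior.values())
    return {v: (counts[v] + prior.get(v, 0.0)) / denom
            for v in set(prior) | set(counts)}

def leakage(friend_values, prior):
    """Leakage = confidence of the most likely inferred value; a high value
    suggests sanitization actions (hide friends, leave groups, etc.)."""
    return max(inferred_distribution(friend_values, prior).values())

# hypothetical example: most friends disclose the same affiliation
print(leakage(["left", "left", "left", "right"], prior={"left": 1.0, "right": 1.0}))
```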
{"title":"Privometer: Privacy protection in social networks","authors":"N. Talukder, M. Ouzzani, A. Elmagarmid, Hazem Elmeleegy, M. Yakout","doi":"10.1109/ICDEW.2010.5452715","DOIUrl":"https://doi.org/10.1109/ICDEW.2010.5452715","url":null,"abstract":"The increasing popularity of social networks, such as Facebook and Orkut, has raised several privacy concerns. Traditional ways of safeguarding privacy of personal information by hiding sensitive attributes are no longer adequate. Research shows that probabilistic classification techniques can effectively infer such private information. The disclosed sensitive information of friends, group affiliations and even participation in activities, such as tagging and commenting, are considered background knowledge in this process. In this paper, we present a privacy protection tool, called Privometer, that measures the amount of sensitive information leakage in a user profile and suggests self-sanitization actions to regulate the amount of leakage. In contrast to previous research, where inference techniques use publicly available profile information, we consider an augmented model where a potentially malicious application installed in the user's friend profiles can access substantially more information. In our model, merely hiding the sensitive information is not sufficient to protect the user privacy. We present an implementation of Privometer in Facebook.","PeriodicalId":442345,"journal":{"name":"2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130015689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards enterprise software as a service in the cloud
Pub Date: 2010-03-01 | DOI: 10.1109/ICDEW.2010.5452748
J. Schaffner, D. Jacobs, B. Eckart, Jan Brunnert, A. Zeier
Traditional data warehouses mostly run on large and expensive server and storage systems. For small- and medium-sized companies in particular, it is often too expensive to run or rent such systems, and these companies might need analytical services only from time to time, for example at the end of a billing period. A solution to these problems is to use cloud computing. In this paper, we report on work in progress towards building an OLAP cluster of multi-tenant main-memory column databases on the Amazon EC2 cloud computing environment, for which purpose we ported SAP's in-memory column database TREX to run in the Amazon cloud. We discuss early findings on cost/performance tradeoffs between reliably storing the data of a tenant on a single node using highly available network-attached storage, such as Amazon EBS, and replicating tenant data to a secondary node where the data resides on less resilient storage. We also describe a mechanism to support historical queries across older snapshots of tenant data, which are lazy-loaded from Amazon's S3 near-line archiving storage and cached on the local VM disks.
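A small sketch of the lazy-loading-with-local-cache pattern described for historical queries: a snapshot is fetched from the archive store on first access and then served from local disk. `fetch_fn` is a hypothetical stand-in for the actual S3 download, and the pickle-based on-disk format is an assumption, not the paper's mechanism.

```python
import os
import pickle

class SnapshotCache:
    """Lazily load historical snapshots of a tenant's data from an archive
    store and cache them on the local VM disk (illustrative sketch)."""
    def __init__(self, cache_dir, fetch_fn):
        self.cache_dir = cache_dir
        self.fetch_fn = fetch_fn               # e.g., a function that downloads from S3
        os.makedirs(cache_dir, exist_ok=True)

    def load(self, tenant_id, snapshot_date):
        path = os.path.join(self.cache_dir, f"{tenant_id}_{snapshot_date}.snap")
        if os.path.exists(path):               # cache hit on the local disk
            with open(path, "rb") as f:
                return pickle.load(f)
        snapshot = self.fetch_fn(tenant_id, snapshot_date)   # cache miss -> archive
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)
        return snapshot

# usage with a dummy fetch function standing in for the archive download
cache = SnapshotCache("/tmp/snapshots", lambda t, d: {"tenant": t, "date": d, "rows": []})
print(cache.load("tenant42", "2009-12-31"))
```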
{"title":"Towards enterprise software as a service in the cloud","authors":"J. Schaffner, D. Jacobs, B. Eckart, Jan Brunnert, A. Zeier","doi":"10.1109/ICDEW.2010.5452748","DOIUrl":"https://doi.org/10.1109/ICDEW.2010.5452748","url":null,"abstract":"For traditional data warehouses, mostly large and expensive server and storage systems are used. In particular, for small- and medium size companies, it is often too expensive to run or rent such systems. These companies might need analytical services only from time to time, for example at the end of a billing period. A solution to overcome these problems is to use Cloud Computing. In this paper, we report on work-in-progress towards building an OLAP cluster of multi-tenant main memory column databases on the Amazon EC2 cloud computing environment, for which purpose we ported SAP's in-memory column database TREX to run in the Amazon cloud. We discuss early findings on cost/performance tradeoffs between reliably storing the data of a tenant on a single node using a highly-available network attached storage, such as Amazon EBS, vs. replication of tenant data to a secondary node where the data resides on less resilient storage. We also describe a mechanism to provide support for historical queries across older snapshots of tenant data which is lazy-loaded from Amazon's S3 near-line archiving storage and cached on the local VM disks.","PeriodicalId":442345,"journal":{"name":"2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114800566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}