FFTM: optimized frequent tree mining with soft embedding constraints on siblings
M. Sghaier, S. Yahia, Anne Laurent, M. Teisseire
International Conference on Soft Computing as Transdisciplinary Science and Technology, 2008-10-28
DOI: 10.1145/1456223.1456309 (https://doi.org/10.1145/1456223.1456309)
Citations: 3
Abstract
Databases keep growing, and the data they contain is increasingly voluminous. Knowledge extraction has therefore become a significant problem, one that requires multiple techniques for processing the available data in order to extract the information it contains. We particularly consider data available on the web. For data exchange on the internet, XML plays an increasingly important role and has become a dominant standard for handling huge volumes of electronic documents. We are especially interested in extracting knowledge from complex tree structures such as XML documents.
Because such documents are heterogeneous and have complex structures, the resources they contain are difficult to query. Automatic tools are therefore needed. We consider in particular the problem of constructing a mediator schema, whose role is to provide the necessary information about the structure of the resources and through which the data can be queried. In this paper, we present a new approach to schema mining, called FFTM, which uses the concept of soft embedding in order to extract more relevant knowledge. Indeed, crisp methods often discard interesting approximate patterns. For this purpose, we adopt fuzzy constraints for discovering and validating frequent substructures in a large collection of semi-structured data, where both the patterns and the data are modeled as labeled trees. The FFTM approach has been tested and validated on synthetic and XML document databases. The experimental results show that our approach is relevant and mitigates the limitations of the crisp approach.
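To make the contrast between crisp and soft matching concrete, here is a minimal Python sketch. It is not the FFTM algorithm, whose definitions the abstract does not give. It assumes a hypothetical membership function (the fraction of required sibling labels found under a matching parent) and a hypothetical fuzzy support (the mean membership over the database). The names `Node`, `sibling_degree`, `crisp_support`, and `fuzzy_support`, and the toy data, are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a labeled tree (e.g. one XML element)."""
    label: str
    children: list = field(default_factory=list)


def walk(node):
    """Yield every node of the tree, root first."""
    yield node
    for child in node.children:
        yield from walk(child)


def sibling_degree(tree, parent, siblings):
    """Degree in [0, 1] to which `tree` holds `parent` with `siblings` as children.

    Assumption (not from the paper): the degree is the fraction of required
    sibling labels present under the best-matching `parent` node.
    """
    best = 0.0
    for node in walk(tree):
        if node.label == parent:
            present = {child.label for child in node.children}
            found = sum(1 for label in siblings if label in present)
            best = max(best, found / len(siblings))
    return best


def crisp_support(db, parent, siblings):
    """Fraction of trees that contain every required sibling."""
    return sum(sibling_degree(t, parent, siblings) == 1.0 for t in db) / len(db)


def fuzzy_support(db, parent, siblings):
    """Mean membership degree over the database (assumed aggregation)."""
    return sum(sibling_degree(t, parent, siblings) for t in db) / len(db)


# Toy database of four documents with heterogeneous structure.
db = [
    Node("book", [Node("title"), Node("author"), Node("year")]),
    Node("book", [Node("title"), Node("author")]),
    Node("book", [Node("title"), Node("year")]),
    Node("article", [Node("title")]),
]
pattern = ("book", ["title", "author", "year"])

print("crisp support:", crisp_support(db, *pattern))
print("fuzzy support:", fuzzy_support(db, *pattern))
```

With a minimum support of 0.5, the crisp count rejects this pattern because only one document in four matches it exactly, while the graded count accepts it because two further documents match it partially. This is the kind of approximate pattern the abstract says crisp methods discard.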