A graph based method for Arabic document indexing
M.S. El Bazzi, D. Mammass, T. Zaki, A. Ennaji
2016 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), published 2016-12-01
DOI: 10.1109/SETIT.2016.7939885 (https://doi.org/10.1109/SETIT.2016.7939885)
Citations: 8
Abstract
Extracting knowledge from text data and exploiting it fully is an important way to reduce computation and accelerate processing, especially for large amounts of data. Accordingly, different approaches and methodologies for modeling and representing textual data have been proposed. This paper proposes a graph-based approach for the automatic indexing of unstructured data from an Arabic corpus. First, each document in the collection is represented by a graph. Once the document graph has been generated, term weights are computed to estimate the relevance of each term to the document. The graph representation allows much more expressive document modeling than the standard bag-of-words approach and consequently improves classification performance. Experimental results show that the graph-based indexing method is a promising approach for semantic and contextual indexing, and that it outperforms a statistical method (TF-IDF) by 12% in F-measure.
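The abstract does not say how the document graph is built or how terms are weighted, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a sliding-window co-occurrence graph and uses normalized weighted degree as the term weight. The function names, the `window` parameter, and the toy token list are all hypothetical, and Arabic-specific preprocessing (tokenization, stemming, stop-word removal) is left out.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Build an undirected weighted graph from a token sequence.

    Assumption: two terms are linked when they appear within `window`
    positions of each other; the edge weight counts such co-occurrences.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, term in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            other = tokens[j]
            if other == term:
                continue  # skip self-loops
            graph[term][other] += 1
            graph[other][term] += 1
    return graph


def term_weights(graph):
    """Weight each term by its weighted degree, normalized to sum to 1.

    Assumption: a well-connected term is more relevant to the document.
    """
    strength = {term: sum(nbrs.values()) for term, nbrs in graph.items()}
    total = sum(strength.values())
    if total == 0:
        return {}
    return {term: s / total for term, s in strength.items()}


# Toy, already-preprocessed document (placeholder tokens).
tokens = ["graph", "index", "document", "graph", "term", "document"]
graph = build_cooccurrence_graph(tokens, window=2)
weights = term_weights(graph)

# Terms ranked by graph-based weight, highest first.
top_terms = sorted(weights, key=weights.get, reverse=True)
```

Unlike a bag-of-words count, the weight here depends on which terms occur near each other, which is one simple way a graph can carry contextual information into the index.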