Enhanced Search for Arabic Language Using Latent Semantic Indexing (LSI)
Fawaz S. Al-Anzi, Dia AbuZeina
2018 International Conference on Intelligent and Innovative Computing Applications (ICONIC), 2018-12-01
DOI: 10.1109/ICONIC.2018.8601096 (https://doi.org/10.1109/ICONIC.2018.8601096)
The Vector Space Model (VSM) is a common document representation model that is widely used in data mining and information retrieval (IR) systems. However, it poses challenges such as a high-dimensional feature space and a loss of semantic information in the representation. Latent semantic indexing (LSI) was therefore proposed to reduce the feature dimensions and to generate semantically rich features that represent conceptual term-document associations. In particular, LSI has been successfully applied in search engines and text classification tasks. In this paper, we propose a novel approach to enhance the quality of the documents retrieved by search engines for the Arabic language. That is, we propose to use a new extension of the LSI technique instead of the standard LSI technique alone. The proposed LSI method employs word co-occurrences to form the term-by-document matrix, and documents are evaluated using cosine similarity measures computed on that matrix. We will empirically evaluate the performance using an Arabic data collection that contains at least 500 documents and at least 30,000 unique words. A test set containing keywords from a specific domain will be used to evaluate the quality of the top 20-30 retrieved documents using different numbers of singular values (i.e., different numbers of dimensions). The results will be judged by comparing the performance of the proposed method with that of standard LSI.
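For readers unfamiliar with the baseline, the following is a minimal sketch of standard LSI retrieval, not the extension proposed in the paper. It assumes raw term counts, whitespace tokenization, and NumPy's SVD. The toy English corpus and the function names (`build_term_document_matrix`, `lsi_fit`, `lsi_scores`) are hypothetical and chosen only for illustration. The parameter `k` is the number of singular values retained, that is, the number of latent dimensions the abstract refers to.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return a raw-count term-by-document matrix and the term -> row index map."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc.split():
            A[index[tok], j] += 1.0
    return A, index


def lsi_fit(A, k):
    """Truncated SVD: keep the k largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def lsi_scores(query, index, U_k, s_k, Vt_k):
    """Cosine similarity between the query and every document in the k-dim space."""
    q = np.zeros(U_k.shape[0])
    for tok in query.split():
        if tok in index:  # terms outside the vocabulary are ignored
            q[index[tok]] += 1.0
    q_hat = U_k.T @ q                # query projected onto the k latent dimensions
    D_hat = (s_k[:, None] * Vt_k).T  # documents in the same space, i.e. (U_k.T @ A).T
    norms = np.linalg.norm(D_hat, axis=1) * np.linalg.norm(q_hat)
    # Zero-norm vectors get a score of 0 instead of triggering a division by zero.
    return np.divide(D_hat @ q_hat, norms, out=np.zeros_like(norms), where=norms > 0)


# Hypothetical toy corpus: two "finance" documents and two "river" documents.
docs = [
    "bank money loan",
    "money loan interest",
    "river bank water",
    "water river fish",
]
A, index = build_term_document_matrix(docs)
U_k, s_k, Vt_k = lsi_fit(A, k=2)
scores = lsi_scores("interest", index, U_k, s_k, Vt_k)
ranking = np.argsort(-scores)  # document indices, best match first
print(ranking, np.round(scores, 3))
```

In this toy corpus the query term "interest" occurs only in the second document, yet the first document should also score highly because it shares "money" and "loan" with it. This is the kind of conceptual term-document association that LSI is meant to capture. Varying `k` corresponds to evaluating retrieval quality with different numbers of singular values.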