Enhanced Search for Arabic Language Using Latent Semantic Indexing (LSI)

Fawaz S. Al-Anzi, Dia AbuZeina
2018 International Conference on Intelligent and Innovative Computing Applications (ICONIC), published 2018-12-01
DOI: 10.1109/ICONIC.2018.8601096 (https://doi.org/10.1109/ICONIC.2018.8601096)
Citations: 4

Abstract

The Vector Space Model (VSM) is a common document representation model that is widely used in data mining and information retrieval (IR) systems. However, it poses challenges such as a high-dimensional feature space and the loss of semantic information in the representation. Latent semantic indexing (LSI) was therefore proposed to reduce the feature dimensions and to generate semantically rich features that represent conceptual term-document associations. In particular, LSI has been successfully applied in search engines and text classification tasks. In this paper, we propose a novel approach to enhance the quality of the documents retrieved by search engines for the Arabic language. That is, we propose a new extension of the LSI technique rather than relying on standard LSI alone. The proposed LSI method employs word co-occurrences to form the term-by-document matrix, and it evaluates documents using cosine similarity measures computed on that matrix. We will empirically evaluate the performance using an Arabic data collection of no fewer than 500 documents with no fewer than 30,000 unique words. A test set containing keywords from a specific domain will be used to evaluate the quality of the top 20-30 retrieved documents using different numbers of singular values (i.e., different numbers of dimensions). The results will be judged by comparing the performance of the proposed method with that of standard LSI.
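
For orientation, the standard LSI baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the extension proposed in the paper, whose details the abstract does not give. The toy corpus uses English placeholder tokens rather than Arabic text. The function names (`build_term_document_matrix`, `lsi_fit`, `lsi_rank`) are hypothetical, raw term counts stand in for whatever weighting the authors use, and the query is projected as `U_k^T q`, one common variant among several.

```python
import numpy as np

def build_term_document_matrix(docs):
    """Return (vocab, A), where A[i, j] is the count of term i in document j."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc.split():
            A[index[tok], j] += 1.0
    return vocab, A

def lsi_fit(A, k):
    """Truncated SVD of A, keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def lsi_rank(query, vocab, U_k, s_k, Vt_k):
    """Rank documents by cosine similarity to the query in the k-dim space."""
    index = {term: i for i, term in enumerate(vocab)}
    q = np.zeros(len(vocab))
    for tok in query.split():
        if tok in index:          # out-of-vocabulary tokens are ignored
            q[index[tok]] += 1.0
    q_k = q @ U_k                              # query projected onto the k axes
    docs_k = (s_k[:, None] * Vt_k).T           # one row per document (U_k^T A)
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    sims = docs_k @ q_k / (norms + 1e-12)      # small epsilon avoids 0/0
    order = np.argsort(-sims)
    return [(int(j), float(sims[j])) for j in order]

# Placeholder corpus: two finance documents and two river documents.
docs = [
    "bank money loan",
    "money loan interest",
    "river bank water",
    "water river fish",
]
vocab, A = build_term_document_matrix(docs)
U_k, s_k, Vt_k = lsi_fit(A, k=2)   # k is the number of singular values kept
ranked = lsi_rank("interest", vocab, U_k, s_k, Vt_k)
for doc_id, score in ranked:
    print(doc_id, round(score, 3), docs[doc_id])
```

In this toy corpus the query term "interest" occurs only in the second document, yet the first document also scores highly because it shares "money" and "loan" with it. Varying `k` corresponds to the abstract's "different numbers of singular values".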