Automatic Arabic term extraction from special domain corpora

A. Al-Thubaity, Marwa Khan, Saad Alotaibi, Badriyya Alonazi
DOI: 10.1109/IALP.2014.6973468
Published in: 2014 International Conference on Asian Language Processing (IALP), 2014-12-04
Citations: 4

Abstract

The availability of machine-readable Arabic special domain text in digital libraries, websites of Arabic university publications, and refereed journals fosters numerous interesting studies and applications. Among these applications is automatic term extraction from special domain corpora. The extracted terms can serve as a foundation for other applications and research, such as special domain dictionary building, terminology resource creation, and special domain ontology construction. Our literature survey shows a lack of such studies for Arabic special domain text; moreover, the few studies that have been identified use complex and computationally expensive methods. In this study, we use two basic methods to automatically extract terms from Arabic special domain corpora. Our methods are based on two simple heuristics: the most frequent words and n-grams in special domain corpora are typically terms, and terms are typically bounded by function words. We applied our methods to a corpus of applied Arabic linguistics. Our results are comparable to those of other Arabic term extraction studies, exhibiting 87% accuracy when only terms strictly pertaining to the field of applied Arabic linguistics were considered, and 93.7% when related terms were included.
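The two heuristics above can be sketched in a few lines: cut the token stream at function words, then count the n-grams inside each resulting stretch and rank them by frequency. This is a minimal illustration, not the authors' exact implementation; the function-word list, the sample text, and the n-gram ceiling are all illustrative assumptions (a toy English corpus stands in for Arabic, since the heuristic is language-independent once a function-word list is supplied).

```python
import re
from collections import Counter

def extract_candidate_terms(text, function_words, max_n=3):
    """Sketch of the boundary heuristic: split the token stream into
    stretches bounded by function words, then count word n-grams
    (1..max_n) within each stretch as term candidates."""
    tokens = re.findall(r"\w+", text.lower())
    # Function words act as boundaries: cut the token stream at them.
    stretches, current = [], []
    for tok in tokens:
        if tok in function_words:
            if current:
                stretches.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        stretches.append(current)
    # Rank candidates by corpus frequency (the first heuristic).
    counts = Counter()
    for stretch in stretches:
        for n in range(1, max_n + 1):
            for i in range(len(stretch) - n + 1):
                counts[" ".join(stretch[i:i + n])] += 1
    return counts

# Hypothetical toy corpus and function-word list, for illustration only.
text = ("the corpus linguistics method and the corpus linguistics "
        "approach in applied linguistics")
stopwords = {"the", "and", "in"}
counts = extract_candidate_terms(text, stopwords)
print(counts["corpus linguistics"])  # multiword candidate, frequency 2
```

In practice the ranked list would be cut off at a frequency threshold and reviewed by a domain expert, which is how accuracy figures like the 87% reported above are typically assessed.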