WikiDocsAligner: An Off-the-Shelf Wikipedia Documents Alignment Tool

Motaz Saad, B. Alijla
Published in: 2017 Palestinian International Conference on Information and Communication Technology (PICICT)
Publication date: 2017-05-01
DOI: 10.1109/PICICT.2017.27 (https://doi.org/10.1109/PICICT.2017.27)
Citations: 4

Abstract

Wikipedia is an attractive source of comparable corpora in many languages. Most researchers develop their own scripts to perform the document alignment task, which costs time and effort. In this paper, we present WikiDocsAligner, an off-the-shelf tool for aligning Wikipedia articles. WikiDocsAligner does not require researchers to import or export interlanguage link databases. The user only needs to download the Wikipedia dumps (interlanguage links and articles) and provide them to the tool, which performs the alignment. The software can easily be used to align Wikipedia documents in any language pair. Finally, we use WikiDocsAligner to align comparable documents from Arabic Wikipedia and Egyptian Wikipedia, thereby shedding light on Wikipedia as a source of language resources for Arabic dialects. The resulting resources are useful because demand for Arabic and dialectal language resources has increased over the last decade.
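
The alignment idea described in the abstract can be illustrated with a minimal sketch. This is not WikiDocsAligner's actual implementation or interface. It assumes the dumps have already been parsed into memory, with interlanguage links as (ll_from, ll_lang, ll_title) rows following the columns of MediaWiki's langlinks table. The function name `align_documents`, the toy titles, and the data layout are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# One interlanguage link row: (source page id, target language code, target title).
# This mirrors the ll_from, ll_lang, ll_title columns of the langlinks table.
LangLink = Tuple[int, str, str]


def align_documents(
    source_titles: Dict[int, str],
    target_titles: Iterable[str],
    langlinks: Iterable[LangLink],
    target_lang: str,
) -> List[Tuple[str, str]]:
    """Pair source articles with target articles through interlanguage links.

    A pair is kept only when the source page id is known and the linked
    title actually exists among the target-language articles.
    """
    known_targets = set(target_titles)
    pairs: List[Tuple[str, str]] = []
    for page_id, lang, title in langlinks:
        if lang != target_lang:
            continue  # link points to some other language edition
        source_title = source_titles.get(page_id)
        if source_title is not None and title in known_targets:
            pairs.append((source_title, title))
    return pairs


# Toy data standing in for parsed dumps (hypothetical titles and ids).
ar_titles = {1: "ar:Egypt", 2: "ar:Cairo", 3: "ar:Nile"}
arz_titles = ["arz:Egypt", "arz:Cairo"]
links = [
    (1, "arz", "arz:Egypt"),
    (1, "en", "Egypt"),       # different target language, skipped
    (2, "arz", "arz:Cairo"),
    (3, "arz", "arz:Nile"),   # target article missing from the dump, skipped
    (9, "arz", "arz:Egypt"),  # unknown source page id, skipped
]

pairs = align_documents(ar_titles, arz_titles, links, target_lang="arz")
for source, target in pairs:
    print(source, "<->", target)
```

Here "ar" and "arz" are the language codes of the Arabic and Egyptian Arabic Wikipedias. A real run would read the article and interlanguage link dumps from disk rather than from in-memory literals.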