A Novel Framework for Plagiarism Detection: A Case Study for Urdu Language
Waqar Ali, Tanveer Ahmad, Zobia Rehman, A. Rehman, M. A. Shah, Ansar Abbas, Ghulam Dustgeer
2018 24th International Conference on Automation and Computing (ICAC)
Publication date: 2018-09-01
DOI: https://doi.org/10.23919/IConAC.2018.8749122
Citations: 1
Abstract
Plagiarism is the act of presenting someone else's ideas, words, or original work as one's own without acknowledging the original source. It creates many problems, especially for academic institutions and researchers. Many plagiarism detection tools are publicly available to address these problems; however, they mainly work for particular languages such as Arabic and English. In South Asian countries, specifically India and Pakistan, a large part of research content is written in Hindi and Urdu. Unfortunately, plagiarism detection for Urdu text has not received adequate attention from the research community, because Urdu has a complex sentence structure and lacks linguistic resources. In this paper, we propose a novel framework for plagiarism detection specifically for the Urdu language. No benchmark corpus is available for Urdu plagiarism detection, so we developed an Urdu corpus. We applied a distance-measuring method along with a vector space method to measure the similarity between suspicious and source texts. For evaluation, we defined different classes of plagiarized text, such as paraphrased, heavily plagiarized, lightly plagiarized, and direct copy-paste. We then evaluated each class in terms of precision, recall, and F-measure. The experimental results show that the Levenshtein distance and Jaccard containment methods significantly improved plagiarism detection performance compared with existing methods.
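
The abstract names Levenshtein distance, Jaccard containment, and precision/recall/F-measure, but gives no implementation details. The following is a minimal sketch of those three computations, not the authors' implementation. It assumes whitespace tokenization and word n-grams (the paper may use different preprocessing for Urdu), and all function names are hypothetical.

```python
from typing import Sequence, Set, Tuple


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter row
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of word n-grams. Assumption: tokens are whitespace-separated."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_containment(suspicious: str, source: str, n: int = 1) -> float:
    """Fraction of the suspicious text's n-grams that also occur in the source."""
    suspicious_grams = word_ngrams(suspicious, n)
    if not suspicious_grams:
        return 0.0
    shared = suspicious_grams & word_ngrams(source, n)
    return len(shared) / len(suspicious_grams)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure (F1) from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative sentences only; they are not taken from the paper's corpus.
    source = "یہ ایک کتاب ہے"      # "This is a book"
    suspicious = "یہ ایک قلم ہے"   # "This is a pen"

    # Word-level edit distance: one token is substituted.
    print(levenshtein(suspicious.split(), source.split()))  # 1
    # Three of the four suspicious words also appear in the source.
    print(jaccard_containment(suspicious, source, n=1))     # 0.75
```

Containment is asymmetric by design: it is normalized by the suspicious text alone, so a short copied passage still scores highly against a much longer source. Passing token lists to `levenshtein` gives a word-level distance; passing the raw strings gives a character-level one.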