Cross-lingual similar documents retrieval based on co-occurrence projection
Jiao Liu, Rong-yi Cui, Yahui Zhao
2017 6th International Conference on Computer Science and Network Technology (ICCSNT), 2017-10-01
DOI: 10.1109/ICCSNT.2017.8343468 (https://doi.org/10.1109/ICCSNT.2017.8343468)
Citation count: 2
Abstract
This paper studies an approach to computing the similarity between cross-lingual documents in Chinese, English, and Korean. First, each document is represented as a vector in the space of another language by co-occurrence projection. Then, latent semantic analysis is used to compensate for the loss in the projected vector caused by polysemy across languages. Finally, the cross-lingual cosine similarity of the documents is computed in a single language space carrying equivalent semantic information. No external dictionary or knowledge base is needed: the lexical correspondence among Chinese, English, and Korean is established from a translation corpus. The results show that co-occurrence projection is effective for computing cross-lingual document similarity, and the retrieval accuracy for translations reaches 95%, which supports the effectiveness of the proposed method.
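The three steps named in the abstract (co-occurrence projection, latent semantic analysis, cosine similarity) can be illustrated with a minimal sketch. This is one plausible reading, not the authors' formulation, which the abstract does not give. It assumes the translation corpus is available as two term-document matrices `A` (source language) and `B` (target language) whose columns are aligned translations, takes the co-occurrence matrix to be `A @ B.T`, and uses a truncated SVD of `B` as the latent space. All function names and the toy data are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(A, B):
    """Assumed definition: C[i, j] sums, over aligned document pairs,
    the product of source-term i and target-term j counts."""
    return A @ B.T

def project(d_src, C):
    """Map a source-language document vector into the target vocabulary space."""
    return C.T @ d_src

def lsa_basis(B, k):
    """Top-k left singular vectors of the target term-document matrix."""
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    return U[:, :k]

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def cross_lingual_similarity(d_src, d_tgt, C, U_k):
    """Cosine similarity of both documents in the shared latent space."""
    return cosine(U_k.T @ project(d_src, C), U_k.T @ d_tgt)

# Toy aligned corpus (made-up counts): 4 document pairs,
# 3 source-language terms, 4 target-language terms.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0]], dtype=float)
B = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

C = cooccurrence_matrix(A, B)   # shape (3, 4)
U_k = lsa_basis(B, k=2)         # two latent topics

query = np.array([1.0, 0.0, 0.0])            # source document using term 0
same_topic = np.array([0.0, 1.0, 0.0, 0.0])  # target document, same topic
other_topic = np.array([0.0, 0.0, 1.0, 0.0]) # target document, other topic

# The first score should be close to 1 and the second close to 0.
print(cross_lingual_similarity(query, same_topic, C, U_k))
print(cross_lingual_similarity(query, other_topic, C, U_k))
```

In this toy setting the query never shares a literal term with `same_topic`. The match comes entirely from the co-occurrence statistics and the latent space, which is the effect the abstract attributes to combining projection with latent semantic analysis.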