NWJC2Vec
Masayuki Asahara
Terminology (journal article), published 2018-05-31
DOI: https://doi.org/10.1075/term.00011.asa
Citation count: 1
Abstract
In this paper, we present NWJC2Vec, a word embedding dataset constructed from the NINJAL Web Japanese Corpus (NWJC). NWJC is a Web-crawled text corpus that contains 25.8 billion tokens. We construct two versions of the dataset: one based on the surface form, and the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the ‘Word List by Semantic Principles (Bunrui Goihyo)’.
About the journal:
Terminology is an independent journal with a cross-cultural and cross-disciplinary scope. It focusses on the discussion of (systematic) solutions not only of language problems encountered in translation, but also, for example, of (monolingual) problems of ambiguity, reference and developments in multidisciplinary communication. Particular attention will be given to new and developing subject areas such as knowledge representation and transfer, information technology tools, expert systems and terminological databases. Terminology encompasses terminology both in general (theory and practice) and in specialized fields (LSP), such as physics.