{"title":"Semantic Preserving Siamese Autoencoder for Binary Quantization of Word Embeddings","authors":"Wouter Mostard, Lambert Schomaker, M. Wiering","doi":"10.1145/3508230.3508235","DOIUrl":null,"url":null,"abstract":"Word embeddings are used as building blocks for a wide range of natural language processing and information retrieval tasks. These embeddings are usually represented as continuous vectors, requiring significant memory capacity and computationally expensive similarity measures. In this study, we introduce a novel method for semantic hashing continuous vector representations into lower-dimensional Hamming space while explicitly preserving semantic information between words. This is achieved by introducing a Siamese autoencoder combined with a novel semantic preserving loss function. We show that our quantization model induces only a 4% loss of semantic information over continuous representations and outperforms the baseline models on several word similarity and sentence classification tasks. Finally, we show through cluster analysis that our method learns binary representations where individual bits hold interpretable semantic information. In conclusion, binary quantization of word embeddings significantly decreases time and space requirements while offering new possibilities through exploiting semantic information of individual bits in downstream information retrieval tasks.","PeriodicalId":252146,"journal":{"name":"Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3508230.3508235","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Word embeddings are used as building blocks for a wide range of natural language processing and information retrieval tasks. These embeddings are usually represented as continuous vectors, requiring significant memory capacity and computationally expensive similarity measures. In this study, we introduce a novel method for semantically hashing continuous vector representations into a lower-dimensional Hamming space while explicitly preserving semantic information between words. This is achieved by introducing a Siamese autoencoder combined with a novel semantic-preserving loss function. We show that our quantization model incurs only a 4% loss of semantic information relative to the continuous representations and outperforms the baseline models on several word similarity and sentence classification tasks. Finally, we show through cluster analysis that our method learns binary representations in which individual bits hold interpretable semantic information. In conclusion, binary quantization of word embeddings significantly decreases time and space requirements while offering new possibilities for exploiting the semantic information of individual bits in downstream information retrieval tasks.
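The abstract describes the approach only at a high level, so the following is a minimal sketch (PyTorch) of what a Siamese autoencoder for binary quantization of word embeddings could look like. The network layout, code size, and the "semantic preserving" term below are assumptions: here the loss simply pushes the cosine similarity of the soft binary codes toward the cosine similarity of the original continuous embeddings, which is one plausible reading of the abstract, not the paper's actual formulation.

```python
# Hypothetical sketch of a Siamese autoencoder that binarizes word embeddings.
# Not the paper's implementation; architecture and loss terms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinaryAutoencoder(nn.Module):
    def __init__(self, dim_in: int = 300, dim_code: int = 128):
        super().__init__()
        self.encoder = nn.Linear(dim_in, dim_code)
        self.decoder = nn.Linear(dim_code, dim_in)

    def encode(self, x):
        # tanh yields soft codes in (-1, 1); thresholding at 0 gives bits.
        return torch.tanh(self.encoder(x))

    def forward(self, x):
        code = self.encode(x)
        return code, self.decoder(code)


def siamese_loss(model, x1, x2, alpha: float = 1.0):
    """Reconstruction loss on both branches plus a (hypothetical)
    semantic-preserving term that keeps the similarity of the codes
    close to the similarity of the continuous input embeddings."""
    c1, r1 = model(x1)
    c2, r2 = model(x2)
    recon = F.mse_loss(r1, x1) + F.mse_loss(r2, x2)
    sim_cont = F.cosine_similarity(x1, x2)   # similarity of inputs
    sim_code = F.cosine_similarity(c1, c2)   # similarity of soft codes
    semantic = F.mse_loss(sim_code, sim_cont)
    return recon + alpha * semantic


# Usage with a placeholder batch of embedding pairs (e.g. 300-d GloVe vectors).
model = BinaryAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x1, x2 = torch.randn(32, 300), torch.randn(32, 300)
loss = siamese_loss(model, x1, x2)
opt.zero_grad(); loss.backward(); opt.step()

# At inference time, binarize the codes and compare them with Hamming
# distance, which is far cheaper than cosine similarity on float vectors.
with torch.no_grad():
    bits = model.encode(x1) > 0               # boolean codes, 128 bits each
    hamming = (bits[0] ^ bits[1]).sum()       # bitwise XOR + popcount
```

The design intuition is that the shared (Siamese) encoder sees both words of a pair, so the loss can directly compare code similarity against embedding similarity; whether the paper uses exactly this pairing scheme, code length, or similarity measure is not stated in the abstract.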