jina-embeddings-v3: Multilingual Embeddings With Task LoRA
Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao
arXiv:2409.10173 [cs.IR] · 2024-09-16
Abstract
We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks and surpasses multilingual-e5-large-instruct across all multilingual tasks.
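
The task-specific adapters mentioned in the abstract follow the standard LoRA scheme: each task adds a trainable low-rank update BA to a frozen weight matrix W, so switching tasks means swapping a small pair of matrices rather than reloading the backbone. Below is a minimal NumPy sketch of that idea; the dimensions, rank, and task names are illustrative placeholders, not the model's actual configuration.

```python
import numpy as np

d_in, d_out, rank = 1024, 1024, 8  # illustrative sizes, not the real config

# Frozen backbone weight, shared across all tasks.
W = np.random.randn(d_out, d_in) * 0.02

# One low-rank adapter pair (A, B) per task; each adds only
# rank * (d_in + d_out) parameters instead of d_in * d_out.
adapters = {
    task: (
        np.random.randn(rank, d_in) * 0.01,  # A: down-projection
        np.zeros((d_out, rank)),             # B: up-projection, zero-initialized
    )
    for task in ["retrieval", "clustering", "classification", "text-matching"]
}

def forward(x: np.ndarray, task: str) -> np.ndarray:
    """Apply the frozen weight plus the task's low-rank update: (W + B @ A) @ x."""
    A, B = adapters[task]
    return W @ x + B @ (A @ x)

x = np.random.randn(d_in)
y = forward(x, "retrieval")  # same backbone, task-specific behavior
```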
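Matryoshka Representation Learning trains the model so that the leading dimensions of an embedding carry the most information, which is what makes truncation cheap at inference time. A hedged sketch of how a client would truncate and re-normalize, assuming embeddings are compared by cosine similarity (typical for this model family):

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length,
    so cosine similarity still behaves as expected after truncation."""
    truncated = emb[:dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(1024)           # stand-in for a full-size embedding
small = truncate_embedding(full, 256)  # 4x smaller index, modest quality loss
```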
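For completeness, a usage sketch against the public Hugging Face checkpoint. The `encode()` helper, its `task` argument, and the task names shown here follow the model card at the time of writing and may change, so treat this as a hedged example rather than a stable API reference.

```python
# Hedged usage sketch; verify the encode() signature and task names
# against the current jinaai/jina-embeddings-v3 model card.
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3",
                                  trust_remote_code=True)

# Select the query-side retrieval adapter; other documented task names include
# "retrieval.passage", "separation", "classification", and "text-matching".
embeddings = model.encode(["What is Matryoshka Representation Learning?"],
                          task="retrieval.query")
```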