Augmenting Large Language Models via Vector Embeddings to Improve Domain-specific Responsiveness.

IF 1.2, JCR Q3 (Multidisciplinary Sciences, Region 4), Jove-Journal of Visualized Experiments, Vol. 214, Pub Date: 2024-12-06, DOI: 10.3791/66796
Nathan M Wolfrath, Nathaniel B Verhagen, Bradley H Crotty, Melek Somai, Anai N Kothari
Citations: 0

Abstract


Large language models (LLMs) have emerged as a popular resource for generating information relevant to a user query. Such models are created through a resource-intensive training process utilizing an extensive, static corpus of textual data. This static nature results in limitations for adoption in domains with rapidly changing knowledge, proprietary information, and sensitive data. In this work, methods are outlined for augmenting general-purpose LLMs, known as foundation models, with domain-specific information using an embeddings-based approach for incorporating up-to-date, peer-reviewed scientific manuscripts. This is achieved through open-source tools such as Llama-Index and publicly available models such as Llama-2 to maximize transparency, user privacy and control, and replicability. While scientific manuscripts are used as an example use case, this approach can be extended to any text data source. Additionally, methods for evaluating model performance following this enhancement are discussed. These methods enable the rapid development of LLM systems for highly specialized domains regardless of the comprehensiveness of information in the training corpus.
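The embeddings-based augmentation the abstract outlines reduces to a small retrieval loop: embed each document, embed the user query the same way, rank documents by cosine similarity, and prepend the best matches to the prompt before it reaches the foundation model. The sketch below is a minimal, self-contained illustration of that pipeline shape only; the hashed bag-of-words `embed` function is a toy stand-in for a real embedding model (the paper itself uses open-source tools such as Llama-Index with models such as Llama-2), and all function names here are illustrative, not the authors' implementation.

```python
import math
import zlib

def embed(text, dim=64):
    """Toy embedding: a hashed bag-of-words vector, L2-normalized.
    A real system would call an embedding model here instead."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # Strip trailing punctuation so "models." and "models" share a bucket.
        vec[zlib.crc32(word.strip(".,?!").encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query, corpus, k=2):
    """Return the k corpus texts whose embeddings have the highest
    cosine similarity to the query embedding."""
    q = embed(query)
    ranked = sorted(
        corpus,
        key=lambda doc: -sum(a * b for a, b in zip(q, embed(doc))),
    )
    return ranked[:k]

def build_prompt(query, corpus, k=2):
    """Prepend retrieved context to the user query; the combined prompt
    would then be sent to the foundation model."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Because the domain knowledge lives in the retrieved context rather than in the model weights, the corpus can be updated (for example, with newly published manuscripts) without retraining the underlying model, which is the property the abstract emphasizes.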

Source journal
Jove-Journal of Visualized Experiments
CiteScore: 2.10
Self-citation rate: 0.00%
Articles published: 992
Journal description: JoVE, the Journal of Visualized Experiments, is the world's first peer-reviewed scientific video journal. Established in 2006, JoVE is devoted to publishing scientific research in a visual format to help researchers overcome two of the biggest challenges facing the scientific research community today: poor reproducibility and the time- and labor-intensive nature of learning new experimental techniques.