Multi-modal Language models in bioacoustics with zero-shot transfer: a case study.

IF 3.9 2区 综合性期刊 Q1 MULTIDISCIPLINARY SCIENCES Scientific Reports Pub Date : 2025-02-28 DOI:10.1038/s41598-025-89153-3
Zhongqi Miao, Benjamin Elizalde, Soham Deshmukh, Justin Kitzes, Huaming Wang, Rahul Dodhia, Juan Lavista Ferres
{"title":"Multi-modal Language models in bioacoustics with zero-shot transfer: a case study.","authors":"Zhongqi Miao, Benjamin Elizalde, Soham Deshmukh, Justin Kitzes, Huaming Wang, Rahul Dodhia, Juan Lavista Ferres","doi":"10.1038/s41598-025-89153-3","DOIUrl":null,"url":null,"abstract":"<p><p>Automatically detecting sound events with Artificial Intelligence (AI) has become increas- ingly popular in the field of bioacoustics, ecoacoustics, and soundscape ecology, particularly for wildlife monitoring and conservation. Conventional methods predominantly employ supervised learning techniques that depend on substantial amounts of manually annotated bioacoustic data. However, manual annotation in bioacoustics is tremendously resource- intensive in terms of both human labor and financial resources, and it requires considerable domain expertise. Moreover, the supervised learning framework limits the application scope to predefined categories within a closed setting. The recent advent of Multi-Modal Language Models has markedly enhanced the versatility and possibilities within the realm of AI appli- cations, as this technique addresses many of the challenges that inhibit the deployment of AI in real-world applications. In this paper, we explore the potential of Multi-Modal Language Models in the context of bioacoustics through a case study. We aim to showcase the potential and limitations of Multi-Modal Language Models in bioacoustic applications. In our case study, we applied an Audio-Language Model--a type of Multi-Modal Language Model that aligns language with audio / sound recording data--named CLAP (Contrastive Language-Audio Pretraining) to eight bioacoustic benchmarks covering a wide variety of sounds previously unfamiliar to the model. We demonstrate that CLAP, after simple prompt engineering, can effectively recognize group-level categories such as birds, frogs, and whales across the benchmarks without the need for specific model fine-tuning or additional training, achieving a zero-shot transfer recognition performance comparable to supervised learning baselines. Moreover, we show that CLAP has the potential to perform tasks previously unattainable with supervised bioacoustic approaches, such as estimating relative distances and discovering unknown animal species. On the other hand, we also identify limitations of CLAP, such as the model's inability to recognize fine-grained species-level categories and the reliance on manually engineered text prompts in real-world applications.</p>","PeriodicalId":21811,"journal":{"name":"Scientific Reports","volume":"15 1","pages":"7242"},"PeriodicalIF":3.9000,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11871236/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific Reports","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1038/s41598-025-89153-3","RegionNum":2,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0

Abstract

Automatically detecting sound events with Artificial Intelligence (AI) has become increas- ingly popular in the field of bioacoustics, ecoacoustics, and soundscape ecology, particularly for wildlife monitoring and conservation. Conventional methods predominantly employ supervised learning techniques that depend on substantial amounts of manually annotated bioacoustic data. However, manual annotation in bioacoustics is tremendously resource- intensive in terms of both human labor and financial resources, and it requires considerable domain expertise. Moreover, the supervised learning framework limits the application scope to predefined categories within a closed setting. The recent advent of Multi-Modal Language Models has markedly enhanced the versatility and possibilities within the realm of AI appli- cations, as this technique addresses many of the challenges that inhibit the deployment of AI in real-world applications. In this paper, we explore the potential of Multi-Modal Language Models in the context of bioacoustics through a case study. We aim to showcase the potential and limitations of Multi-Modal Language Models in bioacoustic applications. In our case study, we applied an Audio-Language Model--a type of Multi-Modal Language Model that aligns language with audio / sound recording data--named CLAP (Contrastive Language-Audio Pretraining) to eight bioacoustic benchmarks covering a wide variety of sounds previously unfamiliar to the model. We demonstrate that CLAP, after simple prompt engineering, can effectively recognize group-level categories such as birds, frogs, and whales across the benchmarks without the need for specific model fine-tuning or additional training, achieving a zero-shot transfer recognition performance comparable to supervised learning baselines. Moreover, we show that CLAP has the potential to perform tasks previously unattainable with supervised bioacoustic approaches, such as estimating relative distances and discovering unknown animal species. On the other hand, we also identify limitations of CLAP, such as the model's inability to recognize fine-grained species-level categories and the reliance on manually engineered text prompts in real-world applications.

Abstract Image

Abstract Image

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
生物声学中的多模态语言模型:一个案例研究。
利用人工智能(AI)自动检测声音事件在生物声学、生态声学和声景生态学领域越来越受欢迎,尤其是在野生动物监测和保护方面。传统方法主要采用监督学习技术,这种技术依赖于大量人工标注的生物声学数据。然而,生物声学中的人工标注在人力和财力方面都是资源密集型的,需要大量的领域专业知识。此外,监督学习框架将应用范围限制在封闭环境中的预定义类别。最近出现的多模态语言模型显著提高了人工智能应用领域的多样性和可能性,因为这种技术解决了阻碍人工智能在现实世界应用中部署的许多难题。在本文中,我们将通过一个案例研究来探讨多模态语言模型在生物声学领域的潜力。我们旨在展示多模态语言模型在生物声学应用中的潜力和局限性。在案例研究中,我们将一种名为 CLAP(对比语言-音频预训练)的音频-语言模型(一种多模态语言模型,将语言与音频/声音记录数据相匹配)应用于八个生物声学基准,涵盖了该模型以前不熟悉的各种声音。我们证明,在经过简单的提示工程后,CLAP 可以有效地识别各种基准中的群体级类别,如鸟类、青蛙和鲸鱼,而无需对模型进行特定的微调或额外的训练,从而实现了与监督学习基准相当的零点转移识别性能。此外,我们还展示了 CLAP 的潜力,它可以执行以前无法通过有监督生物声学方法实现的任务,如估计相对距离和发现未知动物物种。另一方面,我们也发现了 CLAP 的局限性,例如该模型无法识别细粒度的物种类别,以及在实际应用中对人工设计文本提示的依赖。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Scientific Reports
Scientific Reports Natural Science Disciplines-
CiteScore
7.50
自引率
4.30%
发文量
19567
审稿时长
3.9 months
期刊介绍: We publish original research from all areas of the natural sciences, psychology, medicine and engineering. You can learn more about what we publish by browsing our specific scientific subject areas below or explore Scientific Reports by browsing all articles and collections. Scientific Reports has a 2-year impact factor: 4.380 (2021), and is the 6th most-cited journal in the world, with more than 540,000 citations in 2020 (Clarivate Analytics, 2021). •Engineering Engineering covers all aspects of engineering, technology, and applied science. It plays a crucial role in the development of technologies to address some of the world''s biggest challenges, helping to save lives and improve the way we live. •Physical sciences Physical sciences are those academic disciplines that aim to uncover the underlying laws of nature — often written in the language of mathematics. It is a collective term for areas of study including astronomy, chemistry, materials science and physics. •Earth and environmental sciences Earth and environmental sciences cover all aspects of Earth and planetary science and broadly encompass solid Earth processes, surface and atmospheric dynamics, Earth system history, climate and climate change, marine and freshwater systems, and ecology. It also considers the interactions between humans and these systems. •Biological sciences Biological sciences encompass all the divisions of natural sciences examining various aspects of vital processes. The concept includes anatomy, physiology, cell biology, biochemistry and biophysics, and covers all organisms from microorganisms, animals to plants. •Health sciences The health sciences study health, disease and healthcare. This field of study aims to develop knowledge, interventions and technology for use in healthcare to improve the treatment of patients.
期刊最新文献
Benchmarking agreement between large language models and published clinical trial conclusions across four artificial intelligence platforms. Lipidomics of prolonged labor duration in African American birthing people. AI-driven adaptive vibration control in smart plate systems: a sustainable approach for next-generation sports engineering. A possible role of Painless channel in the regulation of immune system activity in Tenebrio molitor L. Advancing cardiovascular disease diagnosis with an interpretable and responsible AI framework.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1