ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

arXiv - CS - Sound | Pub Date: 2024-09-14 | DOI: arxiv-2409.09506

Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe


Citations: 0



ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch-Lightning, Hugging Face transformers and datasets, and Lhotse. By replacing ESPnet design choices inherited from Kaldi with a Python-only, Bash-free interface, we dramatically reduce the effort required to build, debug, and use a new model. For example, to fine-tune a speech foundation model, ESPnet-EZ reduces the amount of newly written code by 2.7x and the amount of dependent code by 6.7x compared to ESPnet, while dramatically reducing Bash script dependencies. The codebase of ESPnet-EZ is publicly available.
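To make the idea of a Python-only, Bash-free data interface more concrete, the following is a minimal sketch using only the standard library. It is not the ESPnet-EZ API: the class `MappedDataset`, the parameter `field_extractors`, and the toy corpus are all hypothetical names chosen for illustration. The sketch assumes that, instead of writing Kaldi-style index files from shell scripts, an in-memory dataset is mapped to named model inputs through plain Python callables.

```python
from typing import Any, Callable, Dict, List, Sequence


class MappedDataset:
    """Wrap any indexable collection and expose named fields per example.

    Hypothetical helper for illustration; not part of ESPnet or ESPnet-EZ.
    """

    def __init__(
        self,
        examples: Sequence[Any],
        field_extractors: Dict[str, Callable[[Any], Any]],
    ) -> None:
        self.examples = examples
        self.field_extractors = field_extractors

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        raw = self.examples[index]
        # Apply every extractor to the raw example to build the model inputs.
        return {name: fn(raw) for name, fn in self.field_extractors.items()}


# Toy in-memory corpus (made-up values) standing in for a real dataset object.
corpus: List[Dict[str, Any]] = [
    {"audio": [0.0, 0.1, -0.1], "transcript": "hello world"},
    {"audio": [0.2, 0.0], "transcript": "good morning"},
]

# The mapping from raw fields to model inputs lives entirely in Python.
dataset = MappedDataset(
    corpus,
    field_extractors={
        "speech": lambda ex: ex["audio"],
        "text": lambda ex: ex["transcript"].upper(),
    },
)

first = dataset[0]
print(len(dataset), first["text"])
```

Because the wrapper only requires `__len__` and `__getitem__` on the underlying collection, the same pattern could in principle sit on top of any indexable dataset object, which is one plausible reading of how a Python-only interface eases integration with other frameworks.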
