ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration
Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe
arXiv - CS - Sound, 2024-09-14, arxiv-2409.09506
We introduce ESPnet-EZ, an extension of the open-source speech processing
toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ
focuses on two major aspects: (i) easy fine-tuning and inference of existing
ESPnet models on various tasks and (ii) easy integration with popular deep
neural network frameworks such as PyTorch-Lightning, Hugging Face transformers
and datasets, and Lhotse. By replacing ESPnet design choices inherited from
Kaldi with a Python-only, Bash-free interface, we dramatically reduce the
effort required to build, debug, and use a new model. For example, to fine-tune
a speech foundation model, ESPnet-EZ, compared to ESPnet, reduces the amount of
newly written code by 2.7x and the amount of dependent code by 6.7x, while
greatly reducing the Bash script dependencies. The codebase of ESPnet-EZ
is publicly available.
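
To make the reduction factors concrete, the short sketch below shows how an N-fold reduction maps a baseline line count to a smaller one. The baseline counts (270 and 6700 lines) are hypothetical values chosen only so the arithmetic is easy to read; they are not figures from the paper, and the helper name `lines_after_reduction` is likewise an illustration.

```python
def lines_after_reduction(baseline_lines: int, factor: float) -> float:
    """Return the line count that remains after an N-fold reduction."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return baseline_lines / factor


# Hypothetical baselines (NOT from the paper), used only for illustration.
new_code = lines_after_reduction(270, 2.7)         # newly written code
dependent_code = lines_after_reduction(6700, 6.7)  # dependent code

print(round(new_code), round(dependent_code))  # prints: 100 1000
```

Read this way, a 2.7x reduction leaves roughly 37% of the original newly written code, and a 6.7x reduction leaves roughly 15% of the dependent code.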