Recent neural speech compression models have adopted residual vector quantization (RVQ). However, these models typically operate at fixed bitrates, allocating the same number of time frames at a constant scale to every speech segment. This can waste bits, particularly on simpler segments of the audio. To address this limitation, we introduce a multi-scale variable-bitrate approach that incorporates a relative importance map, adaptive threshold masks, and a gradient estimation function into the RVQ-GAN model. This method allocates time frames at varying time scales depending on the complexity of the audio: more complex audio receives more time frames, while simpler segments receive fewer. Additionally, we propose both symmetric and asymmetric decoding methods. Asymmetric decoding is easier to implement and integrates seamlessly into the system, while symmetric decoding delivers superior audio quality at lower bitrates. Subjective and objective experiments demonstrate that, compared to EnCodec, both of our decoding methods deliver excellent audio quality at lower bitrates across various speech and singing datasets, with only a slight increase in computational cost. Compared to the VRVQ method, we achieve comparable audio quality at even lower bitrates, at lower computational cost.
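The core mechanism described above — a per-frame importance map compared against adaptive thresholds to decide how many bits each region of the signal receives — can be sketched in a toy form. This is a hypothetical illustration of thresholded importance masking in general, not the paper's exact multi-scale formulation; the function name, the threshold values, and the per-level masking scheme are all assumptions for demonstration.

```python
import numpy as np

def variable_bitrate_masks(importance, thresholds):
    """Toy sketch of importance-driven bit allocation (hypothetical,
    not the paper's exact method).

    importance: per-frame scores in [0, 1] predicted from the audio.
    thresholds: ascending cutoffs, one per quantizer level.

    masks[k, t] = 1 if frame t is important enough to use level k,
    so high-importance (complex) frames pass through more quantizer
    levels and consume more bits, while simple frames are coded
    coarsely. In training, the non-differentiable comparison would be
    paired with a gradient estimation (straight-through) trick so the
    importance predictor still receives gradients.
    """
    importance = np.asarray(importance, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Broadcast: compare every frame against every threshold.
    masks = (importance[None, :] > thresholds[:, None]).astype(int)
    return masks

# Example: 3 quantizer levels, 4 time frames.
masks = variable_bitrate_masks([0.9, 0.2, 0.6, 0.05], [0.0, 0.3, 0.7])
# Levels used per frame (column sums): [3, 1, 2, 1] — the complex
# first frame gets the most bits, the near-silent last frame the fewest.
```

Summing each column of the mask gives the effective bit allocation per frame, which is how a variable-bitrate scheme departs from the fixed allocation criticized above.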
