Blended seismic acquisition can improve acquisition efficiency and cut acquisition costs, but its potential has yet to be fully realized; doing so depends on efficient deblending algorithms that provide accurate deblended data for subsequent processing. In recent years, deep learning algorithms, particularly supervised ones, have drawn more attention than conventional deblending algorithms because they can characterize seismic data nonlinearly and achieve more accurate deblended results. However, supervised algorithms require large amounts of labeled data for training, and accurate labels are rarely accessible in field cases. We present a self-supervised multistep deblending framework that does not require clean labels and can quantitatively characterize the decreasing blending noise level in a flexible multistep manner. To achieve this, we leverage the coherence similarity between the common shot gathers (CSGs) and the common receiver gathers (CRGs) after pseudo-deblending. The CSGs are used to construct the training data adaptively: the raw CSGs are regarded as labels, and the corresponding artificially pseudo-deblended data serve as the initial training input. With the help of a blending noise estimation–subtraction strategy, we employ a different network at each step to quantitatively characterize the decreasing blending noise level and thereby achieve accurate deblending. The training of each network is efficiently initialized by transfer learning from the optimized parameters of the previous network. The networks trained on CSGs are then used to deblend all CRGs of the raw pseudo-deblended data in a multistep manner. Tests on synthetic and field data validate the proposed self-supervised multistep deblending algorithm, which outperforms the multilevel blending noise strategy.
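The multistep blending noise estimation–subtraction loop described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: it assumes a continuous-record blending geometry with known firing times, it estimates the blending noise as the crosstalk produced by re-blending and pseudo-deblending the current signal estimate, and it treats each trained network as an arbitrary callable. All function and parameter names (`blend`, `pseudo_deblend`, `multistep_deblend`, `fire_times`, `denoisers`) are hypothetical.

```python
import numpy as np


def blend(gather, fire_times, n_total):
    """Sum time-shifted traces into one continuous record (assumed blending operator)."""
    nt, n_shots = gather.shape
    record = np.zeros(n_total)
    for i in range(n_shots):
        start = int(fire_times[i])
        record[start:start + nt] += gather[:, i]
    return record


def pseudo_deblend(record, fire_times, nt):
    """Cut the continuous record back into traces (adjoint of the blending operator)."""
    return np.stack([record[int(t):int(t) + nt] for t in fire_times], axis=1)


def multistep_deblend(pseudo, fire_times, n_total, denoisers):
    """Apply one denoiser per step, subtracting the estimated blending noise each time.

    `denoisers` is a list of callables standing in for the per-step networks
    (an assumption of this sketch); each maps a noisy gather to a signal estimate.
    """
    nt = pseudo.shape[0]
    current = pseudo
    for denoise in denoisers:
        signal = denoise(current)
        # Crosstalk that this signal estimate would generate under the same blending.
        reblended = pseudo_deblend(blend(signal, fire_times, n_total), fire_times, nt)
        noise = reblended - signal
        # The next step starts from the pseudo-deblended data minus the estimated noise.
        current = pseudo - noise
    return current
```

In this sketch, the better each step's signal estimate, the weaker the blending noise left in the input to the next step, which mirrors the decreasing noise level that the successive networks are meant to characterize. If a denoiser returned the true signal, the subtraction would recover the unblended gather exactly.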