Collecting and annotating large-scale bird datasets is resource-intensive and time-consuming, which significantly limits the scalability and accuracy of biodiversity monitoring systems. While self-supervised learning (SSL) has emerged as a promising approach for leveraging unannotated data, current SSL methods face two critical challenges in bird species recognition: (1) long-tailed data distributions that result in poor performance on underrepresented species; and (2) domain shift introduced by the data augmentation strategies designed to mitigate class imbalance. Here we present SDNet, a novel SSL-based bird recognition framework that integrates diffusion models with large language models (LLMs) to overcome these limitations. SDNet prompts LLMs with species taxonomy, morphological attributes, and habitat information to generate semantically rich textual descriptions for tail-class species. The resulting natural language priors capture fine-grained visual characteristics (e.g., plumage patterns, body proportions, and distinctive markings). A conditional diffusion model then uses these descriptions to synthesize new bird images: cross-attention mechanisms fuse the textual embeddings with intermediate visual feature representations during the denoising process, so that generated images preserve species-specific morphological details while maintaining photorealistic quality. Additionally, we adopt a Swin Transformer as the feature extraction backbone. Its hierarchical window-based attention and shifted windowing scheme enable multi-scale local feature extraction, which proves particularly effective at capturing fine-grained discriminative patterns (such as beak shape and feather texture) and mitigates domain shift between synthetic and original images by producing consistent feature representations across both data sources.
SDNet is validated on both a self-constructed dataset (Bird_BXS) and a publicly available benchmark (Birds_25), demonstrating substantial improvements over conventional SSL approaches. Our results indicate that the synergistic integration of LLMs, diffusion models, and the Swin Transformer architecture contributes significantly to recognition accuracy, particularly for rare and morphologically similar species. These findings highlight the potential of SDNet for addressing fundamental limitations of existing SSL methods in avian recognition tasks and establishing a new paradigm for efficient self-supervised learning in large-scale ornithological vision applications.
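To make the text-conditioning step concrete, the following is a minimal NumPy sketch of single-head cross-attention, in which visual tokens act as queries and text-embedding tokens supply keys and values. It is an illustration under stated assumptions, not SDNet's implementation: the function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and all dimensions are hypothetical choices made here for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(visual, text, w_q, w_k, w_v):
    """Fuse text embeddings into visual features (single head, illustrative).

    visual: (n_vis, d_vis) intermediate visual feature tokens
    text:   (n_txt, d_txt) text-embedding tokens from the description
    w_q:    (d_vis, d_attn) query projection (hypothetical weights)
    w_k:    (d_txt, d_attn) key projection (hypothetical weights)
    w_v:    (d_txt, d_vis)  value projection back to the visual width
    """
    q = visual @ w_q                           # queries come from the image side
    k = text @ w_k                             # keys come from the text side
    v = text @ w_v                             # values come from the text side
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product similarity
    weights = softmax(scores, axis=-1)         # each visual token attends over text tokens
    fused = visual + weights @ v               # residual connection (an assumption here)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(16, 32))         # e.g. a 4x4 feature map, flattened
    text = rng.normal(size=(8, 24))            # e.g. 8 description tokens
    w_q = rng.normal(size=(32, 16)) * 0.1
    w_k = rng.normal(size=(24, 16)) * 0.1
    w_v = rng.normal(size=(24, 32)) * 0.1
    fused, weights = cross_attention(visual, text, w_q, w_k, w_v)
    print(fused.shape, weights.shape)
```

In a denoising network a block of this kind would typically be applied at several resolutions and with multiple heads; the single-head form is used here only to keep the data flow visible.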
