Owing to its ability to jointly perform representation learning and clustering via deep neural networks, deep clustering has gained significant attention in recent years. Despite considerable progress, most previous deep clustering methods still suffer from three critical limitations. First, they tend to attach a distribution-based clustering loss to the neural network, which often overlooks the sample-wise contrastiveness needed for discriminative representation learning. Second, they generally use the features learned at a single layer for clustering, and thus fail to exploit multiple layers for joint multi-layer (multi-stage) learning. Third, they typically use a convolutional neural network (CNN) for clustering images, which focuses on local information but cannot capture global dependencies well. To tackle these issues, this paper presents a new deep clustering method called pyramid contrastive learning for clustering (PCLC), which incorporates a pyramidal contrastive architecture to jointly enforce contrastive learning and clustering at multiple network layers (or stages). Specifically, for an input image, two types of augmentations are first performed to generate two parallel augmented views. To bridge the gap between the CNN (for capturing local information) and the Transformer (for reflecting global dependencies), a mixed CNN-Transformer encoder is used as the backbone, whose CNN-Transformer blocks are further divided into four stages, thus giving rise to a pyramid of multi-stage feature representations. Thereafter, multiple stages of twin contrastive learning are conducted simultaneously at both the instance level and the cluster level, and the final clustering is obtained by optimizing the resulting objective. Extensive experiments on multiple challenging image datasets demonstrate the superior clustering performance of PCLC over state-of-the-art methods.
The source code is available at https://github.com/Zachary-Chow/PCLC.
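
To make the multi-stage twin contrastive objective more concrete, the following is a minimal sketch in plain Python and is not taken from the released code. It assumes an NT-Xent (InfoNCE-style) loss with cosine similarity, assumes that cluster-level representations are the columns of each view's soft assignment matrix, and assumes that the per-stage losses are summed with equal weight. The names `nt_xent`, `pyramid_loss`, and the `temperature` default are hypothetical.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent loss: row i of view_a and row i of view_b form a positive pair.

    All other rows from both views act as negatives (assumed formulation).
    """
    n = len(view_a)
    rows = view_a + view_b  # 2n vectors in total
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the matching row in the other view
        logits = [
            _cosine(rows[i], rows[j]) / temperature
            for j in range(2 * n)
            if j != i
        ]
        pos_logit = _cosine(rows[i], rows[pos]) / temperature
        # Numerically stable log-sum-exp over all non-self similarities.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


def _transpose(matrix):
    """Turn an n-by-k matrix into k column vectors of length n."""
    return [list(col) for col in zip(*matrix)]


def pyramid_loss(stages, temperature=0.5):
    """Sum instance-level and cluster-level losses over all stages.

    `stages` is a list of (z_a, z_b, p_a, p_b) tuples, one per stage, where
    z_* are per-sample features and p_* are per-sample soft cluster
    assignments for the two augmented views. Equal stage weights are an
    assumption of this sketch.
    """
    total = 0.0
    for z_a, z_b, p_a, p_b in stages:
        instance_term = nt_xent(z_a, z_b, temperature)
        # Columns of the assignment matrices serve as cluster representations.
        cluster_term = nt_xent(_transpose(p_a), _transpose(p_b), temperature)
        total += instance_term + cluster_term
    return total
```

In a four-stage setting, `stages` would hold four tuples, one per pyramid level, and the returned scalar would be the quantity minimized during training under the assumptions stated above.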