Object cosegmentation aims to extract common objects from multiple images or videos, typically by employing handcrafted features to evaluate region similarity or by learning higher-level semantic information via deep learning. However, methods based on handcrafted features are sensitive to illumination, appearance changes, and cluttered backgrounds owing to the domain gap, while methods based on deep learning require pixel-wise segmentation ground truth to train a co-attention model that highlights common object regions across different domains. This paper proposes an adversarial domain adaptation-based video object cosegmentation method that requires no pixel-wise supervision. Intuitively, high-level semantic similarity is beneficial for recognizing common objects; however, the feature distributions of different video sources are inconsistent, i.e., there is a domain gap. We propose an adversarial learning method to align the feature distributions of different videos, which maintains the feature similarity of common objects and thus overcomes dataset bias. Specifically, a feature encoder built as a Siamese network is trained to fool a discriminator network, yielding a domain-adapted feature mapping. To further assist the feature embedding of common objects, we define a latent label-generation task to train a classification network, which makes full use of high-level semantic information. Experimental results on several video cosegmentation datasets suggest that domain adaptation based on adversarial learning significantly improves the extraction of common semantic features.
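To make the adversarial alignment step concrete, the following is a minimal PyTorch-style sketch: a shared-weight (Siamese) encoder maps frames from two videos into a common feature space, while a domain discriminator tries to tell the two videos apart; alternating updates push the encoder to produce features the discriminator cannot attribute to either source. All names (Encoder, Discriminator, adversarial_step), layer sizes, and the alternating binary cross-entropy scheme are illustrative assumptions, not the paper's exact architecture or losses.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared-weight (Siamese) feature encoder; both video streams use the same instance."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Predicts which video (domain) a feature vector came from."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),  # logit: video A vs. video B
        )

    def forward(self, f):
        return self.net(f)

encoder, disc = Encoder(), Discriminator()
opt_e = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def adversarial_step(frames_a, frames_b):
    """One alternating update: D learns to separate the domains, E learns to fool D."""
    feat_a, feat_b = encoder(frames_a), encoder(frames_b)
    ones = torch.ones(feat_a.size(0), 1)
    zeros = torch.zeros(feat_b.size(0), 1)

    # Discriminator update: label video A as 1, video B as 0 (encoder detached).
    d_loss = bce(disc(feat_a.detach()), ones) + bce(disc(feat_b.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Encoder update: invert the domain labels so E aligns the two feature
    # distributions, preserving the similarity of common-object features.
    e_loss = bce(disc(feat_a), zeros) + bce(disc(feat_b), ones)
    opt_e.zero_grad(); e_loss.backward(); opt_e.step()
    return d_loss.item(), e_loss.item()
```

An equivalent formulation replaces the alternating updates with a gradient reversal layer so that both networks are trained in a single backward pass; the latent label-generation task described in the abstract would add a classification loss on top of the encoder features, which is omitted here.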