Language-Driven Spatial-Semantic Cross-Attention for Face Attribute Recognition With Limited Labeled Data
Authors: Young-Eun Kim; Gyeong-Min Bak; Seong-Whan Lee
Journal: IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 6, pp. 10981-10992 (Q1, Computer Science, Artificial Intelligence; Impact Factor 8.9)
DOI: 10.1109/TNNLS.2024.3514836
Publication date: 2024-12-18 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10806625/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10806625
Citation count: 0
Abstract
Recent advances in deep learning have demonstrated excellent results for face attribute recognition (FAR), which is generally trained with large-scale labeled data. Despite the significant progress in this field, most existing works rely on such large-scale labeled data, which is impractical in many real-world FAR applications. Numerous studies have been conducted to address this problem, but they require either large external face datasets or complex auxiliary tasks for pretraining the backbone network. In this article, we propose a new method named language-driven spatial-semantic cross-attention (LSA) that does not require any pretraining steps with additional datasets or auxiliary tasks. Motivated by the impressive outcomes of recent computer vision studies using language models, we harness language-based relational information to enhance attribute recognition. The core of LSA is to combine and balance the learned scaled dot-product attention with attention constructed from language-driven knowledge. To this end, we propose a correlation dictionary, obtained from the similarity between text embeddings of facial attributes and facial regions, to represent their relationships. The correlation dictionary is then cast into a cross-attention form and combined with the learned cross-attention via balancing parameters. Thus, we can compensate for the lack of data information by providing prior knowledge directly to the network. Extensive experiments demonstrate that our method surpasses state-of-the-art techniques, achieving an average improvement of 0.29% on the CelebA dataset and 0.39% on the LFWA dataset with limited labeled data, even without additional dataset training.
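The mechanism described in the abstract can be sketched as follows. This is a minimal, hedged illustration of the stated idea, not the authors' implementation: a "correlation dictionary" is built from attribute-region text-embedding similarities, row-normalized into an attention-shaped prior, and blended with learned scaled dot-product attention through a balancing parameter. All embeddings, attribute/region names, and the fixed balancing value below are illustrative placeholders.

```python
# Sketch (assumed details, not the paper's code): blend a language-driven
# correlation prior with learned scaled dot-product attention.
import math
import random

random.seed(0)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

D = 8  # embedding dimension (illustrative)
attributes = ["smiling", "eyeglasses", "wavy_hair"]   # hypothetical attribute names
regions = ["mouth", "eyes", "hair"]                   # hypothetical facial regions

# Stand-ins for frozen text embeddings of attribute and region phrases.
attr_emb = {a: [random.gauss(0, 1) for _ in range(D)] for a in attributes}
reg_emb = {r: [random.gauss(0, 1) for _ in range(D)] for r in regions}

# Correlation dictionary: attribute-to-region similarity, row-softmaxed
# so each row acts like a prior attention distribution.
corr = [softmax([cosine(attr_emb[a], reg_emb[r]) for r in regions])
        for a in attributes]

# Learned scaled dot-product attention (queries = attribute tokens,
# keys = region features); random values stand in for learned projections.
Q = [[random.gauss(0, 1) for _ in range(D)] for _ in attributes]
K = [[random.gauss(0, 1) for _ in range(D)] for _ in regions]
learned = [softmax([sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(D)
                    for j in range(len(regions))])
           for i in range(len(attributes))]

# Balance the two attention maps; in the paper this weight is a learnable
# parameter, here it is fixed for illustration.
alpha = 0.5
blended = [[alpha * learned[i][j] + (1 - alpha) * corr[i][j]
            for j in range(len(regions))]
           for i in range(len(attributes))]

# Convex combination of two row-stochastic maps stays row-stochastic.
for row in blended:
    assert abs(sum(row) - 1.0) < 1e-9
```

Because both maps are row-stochastic, any convex combination remains a valid attention distribution, which is what lets the language prior inject knowledge without breaking the attention semantics when labeled data is scarce.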
Journal Description:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.