Transformer-Based Light Field Salient Object Detection and Its Application to Autofocus

Yao Jiang;Xin Li;Keren Fu;Qijun Zhao
{"title":"Transformer-Based Light Field Salient Object Detection and Its Application to Autofocus","authors":"Yao Jiang;Xin Li;Keren Fu;Qijun Zhao","doi":"10.1109/TIP.2024.3498331","DOIUrl":null,"url":null,"abstract":"Existing light field salient object detection (LFSOD) models predominantly rely on convolutional neural networks or local attention to process light field data, consequently encountering difficulties in modeling intra-slice and cross-slice long-range dependencies within focal stacks. In this paper, we ponder the feasibility of relying solely on the pure Transformer architecture to address this dilemma and propose a novel quasi-pure Transformer-based framework for LFSOD, termed TLFNet. TLFNet incorporates innovative Transformer-based fusion modules (PGFormer) along with an edge enhancement module. The PGFormer employs a perpendicular self-attention (PSA) mechanism to capture long-range dependencies along both cross-slice and intra-slice axes within the focal stack, and integrates multi-modal features using a guided feature fusion (GFF) module. To address the issue of blurry edges arising from the Transformer-based encoder-decoder architecture, the edge enhancement module combines detailed texture and body information and employs focal loss to improve the edge precision of salient objects. TLFNet is a nearly pure Transformer-based approach (with approximately 99.01% of its parameters belonging to the Transformer), while the edge enhancement module significantly boosts accuracy with only around 0.99% of parameters. Comprehensive benchmarks demonstrate that TLFNet outperforms 14 light field models and achieves new state-of-the-art performance. Last but not least, we show in this paper a new application scheme of TLFNet, by cooperating with the deep autofocus technique proposed by Herrmann et al. (2020), leading to light field salient object autofocus (LFSOA). LFSOA aims to identify and output the focal slice with a salient object in focus while keeping other irrelevant background blurred (out-of-focus), yielding an autonomous bokeh effect in photography. The code for the model and application will be publicly available at \n<uri>https://github.com/jiangyao-scu/TLFNet</uri>\n.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6647-6659"},"PeriodicalIF":13.7000,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10759590/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Existing light field salient object detection (LFSOD) models predominantly rely on convolutional neural networks or local attention to process light field data, and consequently struggle to model intra-slice and cross-slice long-range dependencies within focal stacks. In this paper, we explore the feasibility of relying solely on the pure Transformer architecture to address this dilemma and propose a novel quasi-pure Transformer-based framework for LFSOD, termed TLFNet. TLFNet incorporates innovative Transformer-based fusion modules (PGFormer) along with an edge enhancement module. PGFormer employs a perpendicular self-attention (PSA) mechanism to capture long-range dependencies along both the cross-slice and intra-slice axes of the focal stack, and integrates multi-modal features using a guided feature fusion (GFF) module. To address the blurry edges produced by the Transformer-based encoder-decoder architecture, the edge enhancement module combines detailed texture and body information and employs a focal loss to improve the edge precision of salient objects. TLFNet is a nearly pure Transformer-based approach (approximately 99.01% of its parameters belong to the Transformer), while the edge enhancement module significantly boosts accuracy with only around 0.99% of the parameters. Comprehensive benchmarks demonstrate that TLFNet outperforms 14 light field models and achieves new state-of-the-art performance. Finally, we present a new application of TLFNet: combined with the deep autofocus technique proposed by Herrmann et al. (2020), it enables light field salient object autofocus (LFSOA). LFSOA identifies and outputs the focal slice in which the salient object is in focus while the irrelevant background remains blurred (out of focus), yielding an automatic bokeh effect in photography. The code for the model and application will be publicly available at https://github.com/jiangyao-scu/TLFNet.
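As a rough illustration of the perpendicular self-attention idea described in the abstract, the sketch below applies standard multi-head attention along two perpendicular axes of a focal-stack feature tensor: across slices at each spatial position (cross-slice) and across spatial positions within each slice (intra-slice). The module name, tensor shapes, and the residual combination of the two branches are illustrative assumptions, not the actual TLFNet/PGFormer implementation.

```python
# A minimal, illustrative sketch of perpendicular self-attention (PSA) over a
# focal-stack feature tensor, written with PyTorch. Names, shapes, and the way
# the two attention branches are combined are assumptions for illustration,
# not details taken from the TLFNet implementation.
import torch
import torch.nn as nn


class PerpendicularSelfAttention(nn.Module):
    """Attend along the cross-slice axis and the intra-slice (spatial) axis."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # One attention block per axis; batch_first=True expects (B, L, C).
        self.cross_slice_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.intra_slice_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) -- N focal slices of a single light field sample.
        n, c, h, w = x.shape
        tokens = x.flatten(2).permute(0, 2, 1)                     # (N, H*W, C)

        # Intra-slice branch: tokens within each slice attend to each other.
        intra, _ = self.intra_slice_attn(tokens, tokens, tokens)   # (N, H*W, C)

        # Cross-slice branch: at each spatial position, the N slices attend to
        # each other, so sequences have length N and the batch is H*W.
        cross_in = tokens.permute(1, 0, 2)                         # (H*W, N, C)
        cross, _ = self.cross_slice_attn(cross_in, cross_in, cross_in)
        cross = cross.permute(1, 0, 2)                             # (N, H*W, C)

        # Residual combination of the two perpendicular branches.
        out = self.norm(tokens + intra + cross)
        return out.permute(0, 2, 1).reshape(n, c, h, w)


if __name__ == "__main__":
    feats = torch.randn(12, 64, 16, 16)          # 12 focal slices, 64 channels
    psa = PerpendicularSelfAttention(dim=64)
    print(psa(feats).shape)                      # torch.Size([12, 64, 16, 16])
```

In this toy formulation the intra-slice branch handles spatial context while the cross-slice branch compares focus cues at the same location across the stack, which is the kind of long-range dependency the abstract attributes to PSA.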