Pixel shuffling is all you need: Spatially aware ConvMixer for dense prediction tasks

IF 7.5 · CAS Tier 1 (Computer Science) · Q1 (Computer Science, Artificial Intelligence) · Pattern Recognition · Pub Date: 2024-10-09 · DOI: 10.1016/j.patcog.2024.111068
Hatem Ibrahem, Ahmed Salem, Hyun-Soo Kang
{"title":"只需像素洗牌:用于密集预测任务的空间感知卷积混频器","authors":"Hatem Ibrahem,&nbsp;Ahmed Salem,&nbsp;Hyun-Soo Kang","doi":"10.1016/j.patcog.2024.111068","DOIUrl":null,"url":null,"abstract":"<div><div>ConvMixer is an extremely simple model that could perform better than the state-of-the-art convolutional-based and vision transformer-based methods thanks to mixing the input image patches using a standard convolution. The global mixing process of the patches is only valid for the classification tasks, but it cannot be used for dense prediction tasks as the spatial information of the image is lost in the mixing process. We propose a more efficient technique for image patching, known as pixel shuffling, as it can preserve spatial information. We downsample the input image using the pixel shuffle downsampling in the same form of image patches so that the ConvMixer can be extended for the dense prediction tasks. This paper proves that pixel shuffle downsampling is more efficient than the standard image patching as it outperforms the original ConvMixer architecture in the CIFAR10 and ImageNet-1k classification tasks. We also suggest spatially-aware ConvMixer architectures based on efficient pixel shuffle downsampling and upsampling operations for semantic segmentation and monocular depth estimation. We performed extensive experiments to test the proposed architectures on several datasets; Pascal VOC2012, Cityscapes, and ADE20k for semantic segmentation, NYU-depthV2, and Cityscapes for depth estimation. We show that SA-ConvMixer is efficient enough to get relatively high accuracy at many tasks in a few training epochs (150<span><math><mo>∼</mo></math></span>400). The proposed SA-ConvMixer could achieve an ImageNet-1K Top-1 classification accuracy of 87.02%, mean intersection over union (mIOU) of 87.1% in the PASCAL VOC2012 semantic segmentation task, and absolute relative error of 0.096 in the NYU depthv2 depth estimation task. The implementation code of the proposed method is available at: <span><span>https://github.com/HatemHosam/SA-ConvMixer/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111068"},"PeriodicalIF":7.5000,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Pixel shuffling is all you need: spatially aware convmixer for dense prediction tasks\",\"authors\":\"Hatem Ibrahem,&nbsp;Ahmed Salem,&nbsp;Hyun-Soo Kang\",\"doi\":\"10.1016/j.patcog.2024.111068\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>ConvMixer is an extremely simple model that could perform better than the state-of-the-art convolutional-based and vision transformer-based methods thanks to mixing the input image patches using a standard convolution. The global mixing process of the patches is only valid for the classification tasks, but it cannot be used for dense prediction tasks as the spatial information of the image is lost in the mixing process. We propose a more efficient technique for image patching, known as pixel shuffling, as it can preserve spatial information. We downsample the input image using the pixel shuffle downsampling in the same form of image patches so that the ConvMixer can be extended for the dense prediction tasks. 
This paper proves that pixel shuffle downsampling is more efficient than the standard image patching as it outperforms the original ConvMixer architecture in the CIFAR10 and ImageNet-1k classification tasks. We also suggest spatially-aware ConvMixer architectures based on efficient pixel shuffle downsampling and upsampling operations for semantic segmentation and monocular depth estimation. We performed extensive experiments to test the proposed architectures on several datasets; Pascal VOC2012, Cityscapes, and ADE20k for semantic segmentation, NYU-depthV2, and Cityscapes for depth estimation. We show that SA-ConvMixer is efficient enough to get relatively high accuracy at many tasks in a few training epochs (150<span><math><mo>∼</mo></math></span>400). The proposed SA-ConvMixer could achieve an ImageNet-1K Top-1 classification accuracy of 87.02%, mean intersection over union (mIOU) of 87.1% in the PASCAL VOC2012 semantic segmentation task, and absolute relative error of 0.096 in the NYU depthv2 depth estimation task. The implementation code of the proposed method is available at: <span><span>https://github.com/HatemHosam/SA-ConvMixer/</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"158 \",\"pages\":\"Article 111068\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2024-10-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320324008197\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320324008197","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

ConvMixer is an extremely simple model that can outperform state-of-the-art convolution-based and vision-transformer-based methods by mixing input image patches with a standard convolution. This global patch-mixing process is valid only for classification tasks; it cannot be used for dense prediction tasks, because the spatial information of the image is lost during mixing. We propose a more efficient patching technique, pixel shuffling, which preserves spatial information. We downsample the input image with pixel shuffle downsampling in the same form as image patches, so that ConvMixer can be extended to dense prediction tasks. This paper shows that pixel shuffle downsampling is more effective than standard image patching: it outperforms the original ConvMixer architecture on the CIFAR10 and ImageNet-1k classification tasks. We also propose spatially aware ConvMixer (SA-ConvMixer) architectures, built on efficient pixel shuffle downsampling and upsampling operations, for semantic segmentation and monocular depth estimation. We performed extensive experiments on several datasets: Pascal VOC2012, Cityscapes, and ADE20k for semantic segmentation; NYU-DepthV2 and Cityscapes for depth estimation. We show that SA-ConvMixer is efficient enough to reach relatively high accuracy on many tasks within a few training epochs (150∼400). The proposed SA-ConvMixer achieves an ImageNet-1K Top-1 classification accuracy of 87.02%, a mean intersection over union (mIoU) of 87.1% on the PASCAL VOC2012 semantic segmentation task, and an absolute relative error of 0.096 on the NYU-DepthV2 depth estimation task. The implementation code of the proposed method is available at: https://github.com/HatemHosam/SA-ConvMixer/.
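To make the mechanism in the abstract concrete, here is a minimal sketch (assuming PyTorch) of the idea it describes: replace ConvMixer's strided patch-embedding convolution, which discards spatial detail, with a lossless PixelUnshuffle stem, and mirror it with a PixelShuffle head so dense per-pixel outputs come back at full resolution. This is not the authors' code (their implementation lives at the GitHub link above); the width, depth, kernel size, and class count below are illustrative placeholders.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wraps a module with a skip connection, as in the original ConvMixer."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def sa_convmixer_seg(dim=256, depth=8, kernel_size=9, patch_size=4, num_classes=21):
    """Sketch of a spatially aware ConvMixer for semantic segmentation.

    Hyperparameters are illustrative, not the paper's reported settings.
    """
    return nn.Sequential(
        # Lossless downsampling: (B, 3, H, W) -> (B, 3*p^2, H/p, W/p).
        # Each p x p block is rearranged into channels; nothing is discarded,
        # unlike a stride-p patch-embedding convolution.
        nn.PixelUnshuffle(patch_size),
        nn.Conv2d(3 * patch_size ** 2, dim, kernel_size=1),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        # Standard ConvMixer blocks: a depthwise conv mixes spatial locations,
        # then a pointwise conv mixes channels.
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        # Pixel shuffle upsampling back to full-resolution per-pixel logits:
        # (B, dim, H/p, W/p) -> (B, num_classes, H, W).
        nn.Conv2d(dim, num_classes * patch_size ** 2, kernel_size=1),
        nn.PixelShuffle(patch_size),
    )

if __name__ == "__main__":
    model = sa_convmixer_seg()
    x = torch.randn(1, 3, 224, 224)
    print(model(x).shape)  # torch.Size([1, 21, 224, 224])
```

The design point this sketch illustrates is that PixelUnshuffle is a pure rearrangement: every input pixel survives as a channel at the lower resolution, so the mixing layers can still reason about exact spatial positions, which is precisely what a strided patch embedding throws away.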
Source journal: Pattern Recognition (Engineering & Technology – Engineering: Electrical & Electronic)
CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months
About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas such as biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.
Latest articles in this journal:
Learning accurate and enriched features for stereo image super-resolution
Semi-supervised multi-view feature selection with adaptive similarity fusion and learning
DyConfidMatch: Dynamic thresholding and re-sampling for 3D semi-supervised learning
CAST: An innovative framework for Cross-dimensional Attention Structure in Transformers
Embedded feature selection for robust probability learning machines