{"title":"Deformable Capsules for Object Detection","authors":"Rodney LaLonde, Naji Khosravan, Ulas Bagci","doi":"10.1002/aisy.202400044","DOIUrl":null,"url":null,"abstract":"<p>Capsule networks promise significant benefits over convolutional neural networks (CNNs) by storing stronger internal representations and routing information based on the agreement between intermediate representations’ projections. Despite this, their success has been limited to small-scale classification datasets because of their computational expense. Though memory-efficient, convolutional capsules impose geometric constraints that fundamentally limit the ability of capsules to model the pose/deformation of objects. Further, they do not address the larger memory concern of class capsules scaling up to bigger tasks such as detection or large-scale classification. Herein, a new family of capsule networks, deformable capsules (<i>DeformCaps</i>), is introduced to address the object detection problem in computer vision. Two new algorithms associated with <i>DeformCaps</i> are proposed: a novel capsule structure (<i>SplitCaps</i>) and a novel dynamic routing algorithm (<i>SE-Routing</i>), which balance computational efficiency with the need to model a large number of objects and classes. This has never been achieved with capsule networks before. The proposed methods scale up efficiently to create the first capsule network for object detection in the literature. The proposed architecture is a one-stage detection framework that obtains results on Microsoft Common Objects in Context (MS COCO) on par with state-of-the-art one-stage CNN-based methods, while producing fewer false-positive detections and generalizing better to unusual poses/viewpoints of objects.</p>","PeriodicalId":93858,"journal":{"name":"Advanced intelligent systems (Weinheim an der Bergstrasse, Germany)","volume":null,"pages":null},"PeriodicalIF":6.8000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/aisy.202400044","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advanced intelligent systems (Weinheim an der Bergstrasse, Germany)","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/aisy.202400044","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Capsule networks promise significant benefits over convolutional neural networks (CNNs) by storing stronger internal representations and routing information based on the agreement between intermediate representations’ projections. Despite this, their success has been limited to small-scale classification datasets because of their computational expense. Though memory-efficient, convolutional capsules impose geometric constraints that fundamentally limit the ability of capsules to model the pose/deformation of objects. Further, they do not address the larger memory concern of class capsules scaling up to bigger tasks such as detection or large-scale classification. Herein, a new family of capsule networks, deformable capsules (DeformCaps), is introduced to address the object detection problem in computer vision. Two new algorithms associated with DeformCaps are proposed: a novel capsule structure (SplitCaps) and a novel dynamic routing algorithm (SE-Routing), which balance computational efficiency with the need to model a large number of objects and classes. This has never been achieved with capsule networks before. The proposed methods scale up efficiently to create the first capsule network for object detection in the literature. The proposed architecture is a one-stage detection framework that obtains results on Microsoft Common Objects in Context (MS COCO) on par with state-of-the-art one-stage CNN-based methods, while producing fewer false-positive detections and generalizing better to unusual poses/viewpoints of objects.
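The abstract does not spell out SE-Routing, but the "routing information based on agreement" it refers to is the classic dynamic routing-by-agreement of Sabour et al. (2017), which the paper builds on. Below is a minimal NumPy sketch of that baseline algorithm, not the paper's SE-Routing: prediction vectors `u_hat` from lower-level capsules are iteratively combined, and coupling coefficients grow for predictions that agree with the resulting output capsule.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing nonlinearity: short vectors shrink toward 0, long ones toward unit length."""
    n2 = np.sum(s * s, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement (Sabour et al., 2017) -- a baseline sketch, not SE-Routing.

    u_hat: (num_in, num_out, dim) prediction vectors from lower-level capsules.
    Returns (num_out, dim) output capsule vectors.
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                            # routing logits
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # softmax over output capsules
        s = (c[..., None] * u_hat).sum(axis=0)                 # weighted sum of predictions
        v = squash(s)                                          # (num_out, dim)
        b = b + (u_hat * v[None]).sum(axis=-1)                 # reward agreeing predictions
    return v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u_hat = rng.normal(size=(8, 4, 16))   # 8 input capsules, 4 output capsules, dim 16
    v = dynamic_routing(u_hat)
    print(v.shape)  # (4, 16); each row has norm < 1 by construction of squash
```

The memory cost the abstract highlights is visible here: `u_hat` grows with (number of input capsules) × (number of classes/objects), which is what SplitCaps and DeformCaps are designed to keep tractable at detection scale.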