Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision: latest articles (page 6)

IDa-Det: An Information Discrepancy-aware Distillation for 1-bit Detectors

Pub Date : 2022-10-07 DOI: 10.48550/arXiv.2210.03477

Sheng Xu, Yanjing Li, Bo-Wen Zeng, Teli Ma, Baochang Zhang, Xianbin Cao, Penglei Gao, Jinhu Lv

Knowledge distillation (KD) has proven useful for training compact object detection models. However, we observe that KD is often effective when the teacher model and its student counterpart share similar proposal information. This explains why existing KD methods are less effective for 1-bit detectors: there is a significant information discrepancy between the real-valued teacher and the 1-bit student. This paper presents an Information Discrepancy-aware strategy (IDa-Det) for distilling 1-bit detectors that can effectively eliminate information discrepancies and significantly reduce the performance gap between a 1-bit detector and its real-valued counterpart. We formulate the distillation process as a bi-level optimization problem. At the inner level, we select the representative proposals with maximum information discrepancy. We then introduce a novel entropy distillation loss to reduce the disparity based on the selected proposals. Extensive experiments demonstrate IDa-Det's superiority over state-of-the-art 1-bit detectors and KD methods on both the PASCAL VOC and COCO datasets. IDa-Det achieves 76.9% mAP for a 1-bit Faster-RCNN with a ResNet-18 backbone. Our code is open-sourced at https://github.com/SteveTsui/IDa-Det.
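
The proposal-selection step described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the mean squared feature difference is a stand-in for the paper's information-discrepancy measure, and the simple feature-gap loss is a stand-in for its entropy distillation loss. The authors' actual method is in their repository.

```python
import numpy as np


def select_discrepant_proposals(teacher_feats, student_feats, k):
    """Pick the k proposals whose teacher and student features differ most.

    teacher_feats, student_feats: arrays of shape (num_proposals, feat_dim).
    The mean squared difference is a stand-in discrepancy measure (assumption),
    not the measure defined in the paper.
    """
    diff = teacher_feats - student_feats
    discrepancy = np.mean(diff ** 2, axis=1)
    # Sort proposals from largest to smallest discrepancy.
    order = np.argsort(-discrepancy, kind="stable")
    return order[:k], discrepancy


def feature_gap_loss(teacher_feats, student_feats, selected):
    """Stand-in distillation loss (assumption): MSE on the selected proposals only."""
    gap = teacher_feats[selected] - student_feats[selected]
    return float(np.mean(gap ** 2))
```

In a bi-level setup of this kind, the selection would play the role of the inner step and the loss on the selected proposals would drive the outer update of the student.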

Volume: 60 1. Pages: 346-361.

Citations: 7

FloatingFusion: Depth from ToF and Image-stabilized Stereo Cameras

Pub Date : 2022-10-06 DOI: 10.1007/978-3-031-19769-7_35

Andreas Meuleman, Hak-Il Kim, J. Tompkin, Min H. Kim

Citations: 2

Differentiable Raycasting for Self-supervised Occupancy Forecasting

Pub Date : 2022-10-04 DOI: 10.48550/arXiv.2210.01917

Tarasha Khurana, Peiyun Hu, Achal Dave, Jason Ziglar, David Held, Deva Ramanan

Motion planning for safe autonomous driving requires learning how the environment around an ego-vehicle evolves with time. Ego-centric perception of driveable regions in a scene changes not only with the motion of actors in the environment, but also with the movement of the ego-vehicle itself. Self-supervised representations proposed for large-scale planning, such as ego-centric freespace, confound these two motions, making the representation difficult to use for downstream motion planners. In this paper, we use geometric occupancy as a natural alternative to view-dependent representations such as freespace. Occupancy maps naturally disentangle the motion of the environment from the motion of the ego-vehicle. However, one cannot directly observe the full 3D occupancy of a scene (due to occlusion), making it difficult to use as a signal for learning. Our key insight is to use differentiable raycasting to "render" future occupancy predictions into future LiDAR sweep predictions, which can be compared with ground-truth sweeps for self-supervised learning. The use of differentiable raycasting allows occupancy to emerge as an internal representation within the forecasting network. In the absence of ground-truth occupancy, we quantitatively evaluate the forecasting of raycasted LiDAR sweeps and show improvements of up to 15 F1 points. For downstream motion planners, where emergent occupancy can be directly used to guide non-driveable regions, this representation reduces the number of collisions with objects by up to 17% (relative) compared to freespace-centric motion planners.
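
The idea of "rendering" occupancy into a LiDAR-like range can be sketched as follows. This is a generic formulation and an assumption, not necessarily the authors' exact renderer: a ray stops at sample i with probability equal to the occupancy at i times the probability that all earlier samples were free, and a ray that hits nothing is assigned the farthest depth. The function name and the miss-handling rule are hypothetical. Only sums and products are used, so the same computation would stay differentiable if written in an autodiff framework.

```python
import numpy as np


def render_expected_depth(occupancy, depths):
    """Render per-ray expected depth from occupancy probabilities along each ray.

    occupancy: (num_rays, num_samples) values in [0, 1], ordered near to far.
    depths:    (num_samples,) increasing sample distances along the ray.
    """
    # Probability that samples 0..i are all free.
    free_through = np.cumprod(1.0 - occupancy, axis=1)
    # Probability that all samples strictly before i are free.
    ones = np.ones((occupancy.shape[0], 1))
    free_before = np.concatenate([ones, free_through[:, :-1]], axis=1)
    # Probability that the ray terminates exactly at sample i.
    hit_prob = free_before * occupancy
    # Rays that pass through everything get the maximum depth (assumption).
    miss_prob = free_through[:, -1]
    expected_depth = hit_prob @ depths + miss_prob * depths[-1]
    return expected_depth, hit_prob
```

The rendered depths can then be compared with the measured sweep, for example with an L1 difference, to supervise the occupancy prediction without occupancy labels.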

Volume: 61 1. Pages: 353-369.

Citations: 18

From Face to Natural Image: Learning Real Degradation for Blind Image Super-Resolution

Pub Date : 2022-10-03 DOI: 10.48550/arXiv.2210.00752

Xiaoming Li, Chaofeng Chen, Xianhui Lin, W. Zuo, Lei Zhang

Designing proper training pairs is critical for super-resolving real-world low-quality (LQ) images, yet it is hampered by the difficulty of either acquiring paired ground-truth high-quality (HQ) images or synthesizing photo-realistic degraded LQ observations. Recent works mainly focus on modeling the degradation with handcrafted or estimated degradation parameters, which are, however, incapable of modeling complicated real-world degradation types, resulting in limited quality improvement. Notably, LQ face images, which may have the same degradation process as natural images, can be robustly restored with photo-realistic textures by exploiting their strong structural priors. This motivates us to use real-world LQ face images and their restored HQ counterparts to model the complex real-world degradation (namely ReDegNet), and then transfer it to HQ natural images to synthesize their realistic LQ counterparts. By taking these paired HQ-LQ face images as inputs to explicitly predict degradation-aware and content-independent representations, we can control the degraded image generation, and subsequently transfer these degradation representations from face to natural images to synthesize degraded LQ natural images. Experiments show that our ReDegNet learns the real degradation process from face images well. The restoration network trained with our synthetic pairs performs favorably against state-of-the-art methods. More importantly, our method provides a new way to handle complex real-world scenarios by learning their degradation representations from the facial portions, which can be used to significantly improve the quality of non-facial areas. The source code is available at https://github.com/csxmli2016/ReDegNet.

Volume: 56 1. Pages: 376-392.

Citations: 8

Anatomy-Aware Contrastive Representation Learning for Fetal Ultrasound

Pub Date : 2022-10-01 DOI: 10.1007/978-3-031-25066-8_23

Zeyu Fu, Jianbo Jiao, Robail Yasrab, Lior Drukker, Aris T Papageorghiou, J Alison Noble

Self-supervised contrastive representation learning offers the advantage of learning meaningful visual representations from unlabeled medical datasets for transfer learning. However, applying current contrastive learning approaches to medical data without considering its domain-specific anatomical characteristics may lead to visual representations that are inconsistent in appearance and semantics. In this paper, we propose to improve visual representations of medical images via anatomy-aware contrastive learning (AWCL), which incorporates anatomy information to augment the positive/negative pair sampling in a contrastive learning manner. The proposed approach is demonstrated for automated fetal ultrasound imaging tasks, enabling anatomically similar positive pairs from the same or different ultrasound scans to be pulled together, thus improving the representation learning. We empirically investigate the effect of including anatomy information at coarse and fine granularity for contrastive learning, and find that learning with fine-grained anatomy information, which preserves intra-class differences, is more effective than its coarse-grained counterpart. We also analyze the impact of anatomy ratio on our AWCL framework and find that using more distinct but anatomically similar samples to compose positive pairs results in better-quality representations. Extensive experiments on a large-scale fetal ultrasound dataset demonstrate that our approach is effective for learning representations that transfer well to three clinical downstream tasks, and achieves superior performance compared to ImageNet-supervised and current state-of-the-art contrastive learning methods. In particular, AWCL outperforms the ImageNet-supervised method by 13.8% and the state-of-the-art contrastive method by 7.1% on a cross-domain segmentation task. The code is available at https://github.com/JianboJiao/AWCL.
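
One way to picture anatomy-driven positive sampling is a contrastive loss in which every other sample sharing the anchor's anatomy label counts as a positive. The sketch below follows the general form of a supervised-contrastive loss. It is an assumption made for illustration, with hypothetical names, and is not taken from the AWCL code.

```python
import numpy as np


def anatomy_aware_contrastive_loss(embeddings, anatomy_labels, temperature=0.1):
    """Contrastive loss where samples with the same anatomy label are positives.

    embeddings:     (n, d) array of feature vectors, n >= 2.
    anatomy_labels: length-n sequence of anatomy category ids.
    Assumes at least one anchor has a positive; otherwise the mean is undefined.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = sim.shape[0]
    labels = np.asarray(anatomy_labels)
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Numerically stable log-sum-exp over all samples except the anchor itself.
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denominator

    # Average the log-probability over each anchor's positives, skipping anchors with none.
    pos_counts = positives.sum(axis=1)
    has_pos = pos_counts > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / pos_counts[has_pos]
    return float(per_anchor.mean())
```

With this form, the granularity of `anatomy_labels` controls which pairs are pulled together, which is the knob the abstract describes varying between coarse and fine anatomy categories.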

Volume: 2022. Pages: 422-436. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7614575/pdf/EMS176131.pdf

Citations: 0

On the Versatile Uses of Partial Distance Correlation in Deep Learning

Pub Date : 2022-10-01 Epub Date: 2022-11-01 DOI: 10.1007/978-3-031-19809-0_19

Xingjian Zhen, Zihang Meng, Rudrasis Chakraborty, Vikas Singh

Comparing the functional behavior of neural network models, whether it is a single network over time or two (or more) networks during or post-training, is an essential step in understanding what they are learning (and what they are not), and for identifying strategies for regularization or efficiency improvements. Despite recent progress, e.g., comparing vision transformers to CNNs, systematic comparison of function, especially across different networks, remains difficult and is often carried out layer by layer. Approaches such as canonical correlation analysis (CCA) are applicable in principle, but have been sparingly used so far. In this paper, we revisit a (less widely known) measure from statistics, called distance correlation (and its partial variant), designed to evaluate correlation between feature spaces of different dimensions. We describe the steps necessary to carry out its deployment for large-scale models. This opens the door to a surprising array of applications, ranging from conditioning one deep model w.r.t. another and learning disentangled representations to optimizing diverse models that would directly be more robust to adversarial attacks. Our experiments suggest a versatile regularizer (or constraint) with many advantages, which avoids some of the common difficulties one faces in such analyses.
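
For readers unfamiliar with distance correlation, the following sketch computes the standard (biased) sample statistic from double-centered pairwise distance matrices. It shows why the measure applies to feature spaces of different dimensions: only the n-by-n distance matrices are compared. The partial variant discussed in the paper is not shown, and the function names are hypothetical.

```python
import numpy as np


def _centered_distances(x):
    """Pairwise Euclidean distance matrix of the rows of x, double-centered."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between x (n, p) and y (n, q); p and q may differ."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2_xy = (a * b).mean()
    dcov2_xx = (a * a).mean()
    dcov2_yy = (b * b).mean()
    denom = np.sqrt(dcov2_xx * dcov2_yy)
    if denom == 0.0:
        return 0.0
    # Guard against tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))
```

As a usage example, `x` and `y` could hold the activations of two different networks on the same batch of n inputs, one row per input. Note that the pairwise distance matrices cost O(n^2) memory, so for large models this would be computed per mini-batch.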

Volume: 13686. Pages: 327-346. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10228573/pdf/nihms-1894550.pdf

Citations: 0

Target-absent Human Attention

Pub Date : 2022-10-01 Epub Date: 2022-10-23 DOI: 10.1007/978-3-031-19772-7_4

Zhibo Yang, Sounak Mondal, Seoyoung Ahn, Gregory Zelinsky, Minh Hoai, Dimitris Samaras

The prediction of human gaze behavior is important for building human-computer interaction systems that can anticipate the user's attention. Computer vision models have been developed to predict the fixations made by people as they search for target objects. But what about when the target is not in the image? Equally important is to know how people search when they cannot find a target, and when they would stop searching. In this paper, we propose a data-driven computational model that addresses the search-termination problem and predicts the scanpath of search fixations made by people searching for targets that do not appear in images. We model visual search as an imitation learning problem and represent the internal knowledge that the viewer acquires through fixations using a novel state representation that we call Foveated Feature Maps (FFMs). FFMs integrate a simulated foveated retina into a pretrained ConvNet that produces an in-network feature pyramid, all with minimal computational overhead. Our method integrates FFMs as the state representation in inverse reinforcement learning. Experimentally, we improve the state of the art in predicting human target-absent search behavior on the COCO-Search18 dataset. Code is available at: https://github.com/cvlab-stonybrook/Target-absent-Human-Attention.

Volume: 13664. Pages: 52-68. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10745181/pdf/

Citations: 0

CryoAI: Amortized Inference of Poses for Ab Initio Reconstruction of 3D Molecular Volumes from Real Cryo-EM Images

Pub Date : 2022-10-01 Epub Date: 2022-10-23 DOI: 10.1007/978-3-031-19803-8_32

Axel Levy, Frédéric Poitevin, Julien Martel, Youssef Nashed, Ariana Peck, Nina Miolane, Daniel Ratner, Mike Dunne, Gordon Wetzstein

Cryo-electron microscopy (cryo-EM) has become a tool of fundamental importance in structural biology, helping us understand the basic building blocks of life. The algorithmic challenge of cryo-EM is to jointly estimate the unknown 3D poses and the 3D electron scattering potential of a biomolecule from millions of extremely noisy 2D images. Existing reconstruction algorithms, however, cannot easily keep pace with the rapidly growing size of cryo-EM datasets due to their high computational and memory cost. We introduce cryoAI, an ab initio reconstruction algorithm for homogeneous conformations that uses direct gradient-based optimization of particle poses and the electron scattering potential from single-particle cryo-EM data. CryoAI combines a learned encoder that predicts the poses of each particle image with a physics-based decoder to aggregate each particle image into an implicit representation of the scattering potential volume. This volume is stored in the Fourier domain for computational efficiency and leverages a modern coordinate network architecture for memory efficiency. Combined with a symmetrized loss function, this framework achieves results of a quality on par with state-of-the-art cryo-EM solvers for both simulated and experimental data, one order of magnitude faster for large datasets and with significantly lower memory requirements than existing methods.
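
A common reason for keeping a volume in the Fourier domain is the Fourier slice theorem: the 2D Fourier transform of a projection equals a central slice of the volume's 3D Fourier transform, so a projection costs a 2D slice rather than a 3D integral. The abstract does not name this theorem, so connecting it to CryoAI's efficiency claim is an assumption. The sketch below checks the theorem numerically for the simplest case, a projection along the first axis. Arbitrary poses would require sampling a rotated slice, which is not shown.

```python
import numpy as np


def project_real_space(volume):
    """Project a 3D volume to 2D by summing along the first axis."""
    return volume.sum(axis=0)


def project_via_fourier_slice(volume):
    """Same projection, obtained from the central slice of the 3D Fourier transform."""
    volume_ft = np.fft.fftn(volume)
    # The zero-frequency plane along axis 0 is the 2D transform of the projection.
    central_slice = volume_ft[0]
    return np.fft.ifft2(central_slice).real
```

Both functions return the same image up to floating-point error. In a reconstruction setting the volume's transform would be stored once, so each simulated projection needs only a slice lookup and a 2D inverse transform.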

Volume 13681, pp. 540-557. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9897229/pdf/nihms-1824058.pdf

Citations: 0

k-SALSA: k-anonymous synthetic averaging of retinal images via local style alignment.

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-01 DOI: 10.1007/978-3-031-19803-8_39

Minkyu Jeon, Hyeonjin Park, Hyunwoo J Kim, Michael Morley, Hyunghoon Cho

The application of modern machine learning to retinal image analyses offers valuable insights into a broad range of human health conditions beyond ophthalmic diseases. Additionally, data sharing is key to fully realizing the potential of machine learning models by providing a rich and diverse collection of training data. However, the personally identifying nature of retinal images, encompassing the unique vascular structure of each individual, often prevents this data from being shared openly. While prior works have explored image de-identification strategies based on synthetic averaging of images in other domains (e.g., facial images), existing techniques face difficulty in preserving both privacy and clinical utility in retinal images, as we demonstrate in our work. We therefore introduce $k$-SALSA, a generative adversarial network (GAN)-based framework for synthesizing retinal fundus images that summarize a given private dataset while satisfying the privacy notion of $k$-anonymity. $k$-SALSA brings together state-of-the-art techniques for training and inverting GANs to achieve practical performance on retinal images. Furthermore, $k$-SALSA leverages a new technique, called local style alignment, to generate a synthetic average that maximizes the retention of fine-grained visual patterns in the source images, thus improving the clinical utility of the generated images. On two benchmark datasets of diabetic retinopathy (EyePACS and APTOS), we demonstrate our improvement upon existing methods with respect to image fidelity, classification performance, and mitigation of membership inference attacks. Our work represents a step toward broader sharing of retinal images for scientific collaboration. Code is available at https://github.com/hcholab/k-salsa.
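
The idea of summarizing a private dataset by averages that each cover at least $k$ records can be sketched on plain feature vectors. The toy example below is an assumption-laden illustration, not the k-SALSA pipeline: it uses no GAN, no GAN inversion, and no local style alignment, and it groups records by a crude sort on the first feature instead of a real clustering step. The function name `k_anonymous_averages` is hypothetical.

```python
import numpy as np


def k_anonymous_averages(records: np.ndarray, k: int):
    """Replace records by group means so that every mean covers >= k records.

    records: array of shape (n, d), one feature vector per individual.
    Returns (averages, group_sizes).
    """
    n = len(records)
    if k < 1 or n < k:
        raise ValueError("need 1 <= k <= number of records")

    # Crude similarity ordering (assumption): sort by the first feature.
    order = np.argsort(records[:, 0])

    # n // k groups of near-equal size; each group then has at least k members.
    n_groups = n // k
    groups = np.array_split(order, n_groups)

    averages = np.stack([records[g].mean(axis=0) for g in groups])
    group_sizes = [len(g) for g in groups]
    return averages, group_sizes


# Toy data: 10 individuals with 4 features each.
rng = np.random.default_rng(0)
records = rng.standard_normal((10, 4))

averages, sizes = k_anonymous_averages(records, k=3)
print("group sizes:", sizes)
print("released averages shape:", averages.shape)
```

Only the group averages would be released in this sketch, so no released vector corresponds to fewer than `k` individuals.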

Volume 13681, pp. 661-678. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10388376/pdf/nihms-1918399.pdf

Citations: 1

INT: Towards Infinite-frames 3D Detection with An Efficient Framework

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-30 DOI: 10.48550/arXiv.2209.15215

Jianyun Xu, Zhenwei Miao, Da Zhang, Hongyu Pan, Kai Liu, Peihan Hao, Jun Zhu, Zhengyang Sun, Hongming Li, Xin Zhan

It is natural to construct a multi-frame instead of a single-frame 3D detector for a continuous-time stream. Although increasing the number of frames might improve performance, previous multi-frame studies used only a very limited number of frames to build their systems because of the dramatically increased computational and memory cost. To address these issues, we propose a novel on-stream training and prediction framework that, in theory, can employ an infinite number of frames while keeping the same amount of computation as a single-frame detector. This infinite framework (INT), which can be used with most existing detectors, is applied, for example, to the popular CenterPoint, with significant latency reductions and performance improvements. We have also conducted extensive experiments on two large-scale datasets, nuScenes and Waymo Open Dataset, to demonstrate the scheme's effectiveness and efficiency. By employing INT on CenterPoint, we obtain performance boosts of around 7% (Waymo) and 15% (nuScenes) with only 2~4 ms of latency overhead, and INT is currently SOTA on the Waymo 3D Detection leaderboard.
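
One generic way to use an unbounded number of past frames at a fixed per-frame cost is to fold each new frame into a fixed-size running state instead of reprocessing a growing window. The sketch below shows that pattern with an exponential moving average over per-frame feature vectors. It is an assumption made for illustration only: the abstract does not specify INT's memory mechanism, and the class name `StreamingMemory` and the `decay` parameter are hypothetical.

```python
import numpy as np


class StreamingMemory:
    """Fixed-size running summary of all frames seen so far."""

    def __init__(self, feature_dim: int, decay: float = 0.9):
        self.state = np.zeros(feature_dim)
        self.decay = decay
        self.frames_seen = 0

    def update(self, frame_feature: np.ndarray) -> np.ndarray:
        # One multiply-add per feature, regardless of how many frames came before.
        self.state = self.decay * self.state + (1.0 - self.decay) * frame_feature
        self.frames_seen += 1
        return self.state


# Feed a long stream of identical toy features; memory use stays constant.
memory = StreamingMemory(feature_dim=8, decay=0.9)
for _ in range(1000):
    fused = memory.update(np.ones(8))

print("frames seen:", memory.frames_seen)
print("state shape:", memory.state.shape)
```

The per-frame work and the stored state have the same size after 1000 frames as after one, which is the property the abstract describes, although the actual framework may achieve it differently.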

Volume 106 1, pp. 193-209.

Citations: 7