Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition: Latest Articles, Page 3

End-to-end globally consistent registration of multiple point clouds

Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Pub Date : 2020-06-14 DOI: 10.3929/ETHZ-B-000458888

Zan Gojcic, Caifa Zhou, J. D. Wegner, L. Guibas, Tolga Birdal

We present a novel, end-to-end learnable, multiview 3D point cloud registration algorithm. Registration of multiple scans typically follows a two-stage pipeline: the initial pairwise alignment and the globally consistent refinement. The former is often ambiguous due to the low overlap of neighboring point clouds, symmetries and repetitive scene parts. Therefore, the latter global refinement aims at establishing the cyclic consistency across multiple scans and helps in resolving the ambiguous cases. In this paper we propose the first end-to-end algorithm for joint learning of both parts of this two-stage problem. Experimental evaluation on benchmark datasets shows that our approach outperforms the state of the art by a significant margin, while being end-to-end trainable and computationally less costly. A more detailed description of the method, further analysis, and ablation studies are provided in the original CVPR 2020 paper [11].
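As a rough illustration of the cyclic consistency that the global refinement stage aims for, the sketch below composes pairwise rigid transforms around a loop of three scans and measures how far the result is from the identity. The 2D setting, the function names, and the example poses are assumptions made for illustration; the paper itself works with 3D point clouds and a learned pipeline, which this sketch does not reproduce.

```python
import math

def rigid(theta, tx, ty):
    """3x3 homogeneous matrix: 2D rotation by theta, then translation (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inverse(m):
    """Inverse of a rigid transform [R | t], namely [R^T | -R^T t]."""
    r00, r01, tx = m[0]
    r10, r11, ty = m[1]
    return [[r00, r10, -(r00 * tx + r10 * ty)],
            [r01, r11, -(r01 * tx + r11 * ty)],
            [0.0, 0.0, 1.0]]

def relative(pose_i, pose_j):
    """Transform mapping scan i coordinates into scan j coordinates."""
    return matmul(inverse(pose_j), pose_i)

def cycle_error(loop):
    """Frobenius distance between the composed loop transform and identity."""
    acc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for t in loop:
        acc = matmul(t, acc)  # apply transforms in loop order
    return math.sqrt(sum((acc[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(3) for j in range(3)))

# Hypothetical ground-truth scan poses (scan frame to world frame).
poses = [rigid(0.0, 0.0, 0.0), rigid(0.4, 1.0, 0.0), rigid(-0.3, 0.5, 2.0)]
t01 = relative(poses[0], poses[1])
t12 = relative(poses[1], poses[2])
t20 = relative(poses[2], poses[0])

consistent_error = cycle_error([t01, t12, t20])

# Simulate an ambiguous pairwise estimate: an extra 0.1 rad rotation on t12.
t12_bad = matmul(rigid(0.1, 0.0, 0.0), t12)
inconsistent_error = cycle_error([t01, t12_bad, t20])
```

With exact pairwise transforms the loop closes and `consistent_error` is zero up to rounding, while the perturbed estimate leaves a clearly nonzero residual. A residual of this kind is the signal a global refinement step can use to detect and correct bad pairwise alignments.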

Citations: 1

Gum-Net: Unsupervised Geometric Matching for Fast and Accurate 3D Subtomogram Image Alignment and Averaging.

Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Pub Date : 2020-06-01 Epub Date: 2020-08-05 DOI: 10.1109/cvpr42600.2020.00413

Xiangrui Zeng, Min Xu

We propose a Geometric unsupervised matching Network (Gum-Net) for finding the geometric correspondence between two images with application to 3D subtomogram alignment and averaging. Subtomogram alignment is the most important task in cryo-electron tomography (cryo-ET), a revolutionary 3D imaging technique for visualizing the molecular organization of unperturbed cellular landscapes in single cells. However, subtomogram alignment and averaging are very challenging due to severe imaging limits such as noise and missing wedge effects. We introduce an end-to-end trainable architecture with three novel modules specifically designed for preserving feature spatial information and propagating feature matching information. The training is performed in a fully unsupervised fashion to optimize a matching metric. No ground truth transformation information nor category-level or instance-level matching supervision information is needed. After systematic assessments on six real and nine simulated datasets, we demonstrate that Gum-Net reduced the alignment error by 40 to 50% and improved the averaging resolution by 10%. Gum-Net also achieved 70 to 110 times speedup in practice with GPU acceleration compared to state-of-the-art subtomogram alignment methods. Our work is the first 3D unsupervised geometric matching method for images of strong transformation variation and high noise level. The training code, trained model, and datasets are available in our open-source software AITom.

Volume 2020, pp. 4072-4082. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7955792/pdf/nihms-1675395.pdf

Citations: 0

Generating Accurate Pseudo-labels in Semi-Supervised Learning and Avoiding Overconfident Predictions via Hermite Polynomial Activations.

Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Pub Date : 2020-06-01 Epub Date: 2020-08-05 DOI: 10.1109/cvpr42600.2020.01145

Vishnu Suresh Lokhande, Songwong Tasneeyapant, Abhay Venkatesh, Sathya N Ravi, Vikas Singh

Rectified Linear Units (ReLUs) are among the most widely used activation functions in a broad variety of tasks in vision. Recent theoretical results suggest that despite their excellent practical performance, in various cases, a substitution with basis expansions (e.g., polynomials) can yield significant benefits from both the optimization and generalization perspective. Unfortunately, the existing results remain limited to networks with a couple of layers, and the practical viability of these results is not yet known. Motivated by some of these results, we explore the use of Hermite polynomial expansions as a substitute for ReLUs in deep networks. While our experiments with supervised learning do not provide a clear verdict, we find that this strategy offers considerable benefits in semi-supervised learning (SSL) / transductive learning settings. We carefully develop this idea and show how the use of Hermite-polynomial-based activations can yield improvements in pseudo-label accuracies and sizable financial savings (due to concurrent runtime benefits). Further, we show via theoretical analysis that the networks (with Hermite activations) offer robustness to noise and other attractive mathematical properties. Code is available on GitHub.
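To make the idea of a Hermite polynomial expansion used as an activation more concrete, here is a minimal sketch that evaluates a weighted sum of the first few probabilists' Hermite polynomials using the standard three-term recurrence He_{n+1}(x) = x·He_n(x) − n·He_{n−1}(x). The function names and the coefficient values are hypothetical; the abstract does not state the normalization, the number of terms, or how the coefficients are initialized or trained, so none of that should be read into this example.

```python
def hermite_basis(x, degree):
    """Probabilists' Hermite polynomials He_0..He_degree evaluated at x."""
    values = [1.0]                      # He_0(x) = 1
    if degree >= 1:
        values.append(x)                # He_1(x) = x
    for n in range(1, degree):
        # Recurrence: He_{n+1}(x) = x * He_n(x) - n * He_{n-1}(x)
        values.append(x * values[n] - n * values[n - 1])
    return values

def hermite_activation(x, coeffs):
    """Weighted sum of Hermite polynomials; coeffs[k] multiplies He_k(x)."""
    basis = hermite_basis(x, len(coeffs) - 1)
    return sum(c * h for c, h in zip(coeffs, basis))

# Hypothetical coefficients; in a network these would typically be trainable.
example_coeffs = [0.5, 1.0, 0.25]
example_output = hermite_activation(2.0, example_coeffs)
```

In a deep network this scalar function would be applied elementwise in place of ReLU, with one coefficient vector per layer or per unit depending on the design.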

Volume 2020, pp. 11432-11440. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7811889/pdf/nihms-1656467.pdf

Citations: 0

Discovering Synchronized Subsets of Sequences: A Large Scale Solution.

Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Pub Date : 2020-06-01 Epub Date: 2020-08-05 DOI: 10.1109/cvpr42600.2020.00951

Evangelos Sariyanidi, Casey J Zampella, Keith G Bartley, John D Herrington, Theodore D Satterthwaite, Robert T Schultz, Birkan Tunc

Finding the largest subset of sequences (i.e., time series) that are correlated above a certain threshold, within large datasets, is of significant interest for computer vision and pattern recognition problems across domains, including behavior analysis, computational biology, neuroscience, and finance. Maximal clique algorithms can be used to solve this problem, but they are not scalable. We present an approximate, but highly efficient and scalable, method that represents the search space as a union of sets called ϵ-expanded clusters, one of which is theoretically guaranteed to contain the largest subset of synchronized sequences. The method finds synchronized sets by fitting a Euclidean ball on ϵ-expanded clusters, using Jung's theorem. We validate the method on data from the three distinct domains of facial behavior analysis, finance, and neuroscience, where we respectively discover the synchrony among pixels of face videos, stock market item prices, and dynamic brain connectivity data. Experiments show that our method produces results comparable to, but up to 300 times faster than, maximal clique algorithms, with speed gains increasing exponentially with the number of input sequences.
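Two standard facts help explain why fitting a Euclidean ball is a natural tool here. For sequences that are centered and scaled to unit norm, the squared Euclidean distance satisfies ||x − y||² = 2(1 − ρ), so a correlation threshold becomes a distance threshold. Jung's theorem then bounds the radius of the smallest ball enclosing a set of diameter d in n dimensions by d·sqrt(n / (2(n + 1))). The sketch below checks both numerically; the function names and the example sequences are illustrative assumptions, and the ϵ-expanded cluster construction from the paper is not reproduced.

```python
import math

def normalize(seq):
    """Center a sequence and scale it to unit Euclidean norm."""
    mean = sum(seq) / len(seq)
    centered = [v - mean for v in seq]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def correlation(a, b):
    """Pearson correlation as the dot product of the normalized sequences."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def distance(a, b):
    """Euclidean distance between the normalized sequences."""
    return math.sqrt(sum((x - y) ** 2
                         for x, y in zip(normalize(a), normalize(b))))

def max_distance_for(rho):
    """Distance threshold equivalent to requiring correlation >= rho."""
    return math.sqrt(2.0 * (1.0 - rho))

def jung_radius(diameter, dim):
    """Jung's bound on the enclosing-ball radius of a set with this diameter."""
    return diameter * math.sqrt(dim / (2.0 * (dim + 1)))

# Two hypothetical time series.
series_a = [1.0, 2.0, 3.0, 4.0, 5.0]
series_b = [1.0, 3.0, 2.0, 5.0, 4.0]
rho = correlation(series_a, series_b)
dist = distance(series_a, series_b)
```

For these two series the correlation is 0.8, and `dist` matches `max_distance_for(rho)` up to rounding. In two dimensions, `jung_radius(d, 2)` equals d/√3, the circumradius of an equilateral triangle with side d, which is the case where the bound is tight.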

Volume 2020, pp. 9490-9499. DOI link: https://doi.org/10.1109/cvpr42600.2020.00951

Citations: 3

Augmenting Colonoscopy using Extended and Directional CycleGAN for Lossy Image Translation.

Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Pub Date : 2020-06-01 Epub Date: 2020-08-05 DOI: 10.1109/cvpr42600.2020.00475

Shawn Mathew, Saad Nadeem, Sruti Kumari, Arie Kaufman

Colorectal cancer screening modalities, such as optical colonoscopy (OC) and virtual colonoscopy (VC), are critical for diagnosing and ultimately removing polyps (precursors of colon cancer). The non-invasive VC is normally used to inspect a 3D reconstructed colon (from CT scans) for polyps and, if any are found, the OC procedure is performed to physically traverse the colon via endoscope and remove these polyps. In this paper, we present a deep learning framework, Extended and Directional CycleGAN, for lossy unpaired image-to-image translation between OC and VC to augment OC video sequences with scale-consistent depth information from VC, and augment VC with patient-specific textures, color and specular highlights from OC (e.g., for realistic polyp synthesis). Both OC and VC contain structural information, but it is obscured in OC by additional patient-specific texture and specular highlights, hence making the translation from OC to VC lossy. The existing CycleGAN approaches do not handle lossy transformations. To address this shortcoming, we introduce an extended cycle consistency loss, which compares the geometric structures from OC in the VC domain. This loss removes the need for the CycleGAN to embed OC information in the VC domain. To handle a stronger removal of the textures and lighting, a Directional Discriminator is introduced to differentiate the direction of translation (by creating paired information for the discriminator), as opposed to the standard CycleGAN which is direction-agnostic. Combining the extended cycle consistency loss and the Directional Discriminator, we show state-of-the-art results on scale-consistent depth inference for phantom, textured VC and for real polyp and normal colon video sequences. We also present results for realistic pedunculated and flat polyp synthesis from bumps introduced in 3D VC models.

Volume 2020, pp. 4695-4704. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7811175/pdf/nihms-1660601.pdf

Citations: 0

Can Facial Pose and Expression Be Separated with Weak Perspective Camera?

Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Pub Date : 2020-06-01 Epub Date: 2020-08-05 DOI: 10.1109/cvpr42600.2020.00720

Evangelos Sariyanidi, Casey J Zampella, Robert T Schultz, Birkan Tunc

Separating facial pose and expression within images requires a camera model for 3D-to-2D mapping. The weak perspective (WP) camera has been the most popular choice; it is the default, if not the only option, in state-of-the-art facial analysis methods and software. The WP camera is justified by the supposition that its errors are negligible when the subjects are relatively far from the camera, yet this claim has never been tested despite nearly 20 years of research. This paper critically examines the suitability of the WP camera for separating facial pose and expression. First, we theoretically show that WP causes pose-expression ambiguity, as it leads to estimation of spurious expressions. Next, we experimentally quantify the magnitude of spurious expressions. Finally, we test whether spurious expressions have detrimental effects on a common facial analysis application, namely Action Unit (AU) detection. Contrary to conventional wisdom, we find that severe pose-expression ambiguity exists even when subjects are not close to the camera, leading to large false positive rates in AU detection. We also demonstrate that the magnitude and characteristics of spurious expressions depend on the point distribution model used to model the expressions. Our results suggest that common assumptions about WP need to be revisited in facial expression modeling, and that facial analysis software should encourage and facilitate the use of the true camera model whenever possible.
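For readers unfamiliar with the two camera models, the following sketch contrasts full perspective projection, which divides each point by its own depth, with weak perspective projection, which divides every point by a single average depth. The landmark coordinates, focal length, and distances are made-up values for illustration only. The sketch shows just the geometric gap between the two projections and does not reproduce the paper's analysis of spurious expressions.

```python
import math

def perspective(points, focal):
    """Full perspective: each point is scaled by its own depth z."""
    return [(focal * x / z, focal * y / z) for x, y, z in points]

def weak_perspective(points, focal):
    """Weak perspective: all points share one scale, set by the mean depth."""
    z_mean = sum(z for _, _, z in points) / len(points)
    return [(focal * x / z_mean, focal * y / z_mean) for x, y, _ in points]

def max_projection_gap(points, focal):
    """Largest 2D distance between the two projections, in pixels."""
    full = perspective(points, focal)
    weak = weak_perspective(points, focal)
    return max(math.hypot(p[0] - w[0], p[1] - w[1])
               for p, w in zip(full, weak))

def place(shape, distance):
    """Shift a shape along the optical axis so it sits at the given distance."""
    return [(x, y, z + distance) for x, y, z in shape]

# Hypothetical facial landmarks in meters, with depth relief around z = 0.
shape = [(-0.07, 0.0, 0.03), (0.07, 0.0, 0.03),
         (0.0, 0.0, -0.05), (0.0, -0.08, 0.0)]
focal = 1000.0  # assumed focal length in pixels

gap_near = max_projection_gap(place(shape, 0.5), focal)
gap_far = max_projection_gap(place(shape, 2.0), focal)
```

In this toy setup the gap shrinks as the subject moves away but does not vanish. Whether the remaining error matters for expression estimation is the question the paper examines.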

Volume 2020, pp. 7171-7180. DOI link: https://doi.org/10.1109/cvpr42600.2020.00720

Citations: 4

Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning.

Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Pub Date : 2020-06-01 Epub Date: 2020-08-05 DOI: 10.1109/cvpr42600.2020.00027

Zhibo Yang, Lihan Huang, Yupei Chen, Zijun Wei, Seoyoung Ahn, Gregory Zelinsky, Dimitris Samaras, Minh Hoai

Human gaze behavior prediction is important for behavioral vision and for computer vision applications. Most models mainly focus on predicting free-viewing behavior using saliency maps, but do not generalize to goal-directed behavior, such as when a person searches for a visual target object. We propose the first inverse reinforcement learning (IRL) model to learn the internal reward function and policy used by humans during visual search. We modeled the viewer's internal belief states as dynamic contextual belief maps of object locations. These maps were learned and then used to predict behavioral scanpaths for multiple target categories. To train and evaluate our IRL model we created COCO-Search18, which is now the largest dataset of high-quality search fixations in existence. COCO-Search18 has 10 participants searching for each of 18 target-object categories in 6202 images, making about 300,000 goal-directed fixations. When trained and evaluated on COCO-Search18, the IRL model outperformed baseline models in predicting search fixation scanpaths, both in terms of similarity to human search behavior and search efficiency. Finally, reward maps recovered by the IRL model reveal distinctive target-dependent patterns of object prioritization, which we interpret as a learned object context.

Volume 2020, pp. 190-199. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8218821/pdf/nihms-1715372.pdf

Citations: 0

Putting visual object recognition in context.

Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Pub Date : 2020-06-01 Epub Date: 2020-08-05 DOI: 10.1109/CVPR42600.2020.01300

Mengmi Zhang, Claire Tseng, Gabriel Kreiman

Context plays an important role in visual recognition. Recent studies have shown that visual recognition networks can be fooled by placing objects in inconsistent contexts (e.g., a cow in the ocean). To understand and model the role of contextual information in visual recognition, we systematically and quantitatively investigated ten critical properties of where, when, and how context modulates recognition, including amount of context, context and object resolution, geometrical structure of context, context congruence, time required to incorporate contextual information, and temporal dynamics of contextual modulation. The tasks involve recognizing a target object surrounded with context in a natural image. As an essential benchmark, we first describe a series of psychophysics experiments, where we alter one aspect of context at a time, and quantify human recognition accuracy. To computationally assess performance on the same tasks, we propose a biologically inspired context-aware object recognition model consisting of a two-stream architecture. The model processes visual information at the fovea and periphery in parallel, dynamically incorporates both object and contextual information, and sequentially reasons about the class label for the target object. Across a wide range of behavioral tasks, the model approximates human-level performance without retraining for each task, captures the dependence of context enhancement on image properties, and provides initial steps towards integrating scene and object information for visual recognition.

Citations: 36

Indian Plant Recognition in the Wild

Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Pub Date : 2019-12-22 DOI: 10.1007/978-981-15-8697-2_41

Vamsidhar Muthireddy, C. V. Jawahar

Citations: 0

Metric Learning for Image Registration.

Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Pub Date : 2019-06-01 Epub Date: 2020-01-09 DOI: 10.1109/cvpr.2019.00866

Marc Niethammer, Roland Kwitt, François-Xavier Vialard

Image registration is a key technique in medical image analysis to estimate deformations between image pairs. A good deformation model is important for high-quality estimates. However, most existing approaches use ad hoc deformation models chosen for mathematical convenience rather than to capture observed data variation. Recent deep learning approaches learn deformation models directly from data. However, they provide limited control over the spatial regularity of transformations. Instead of learning the entire registration approach, we learn a spatially adaptive regularizer within a registration model. This allows controlling the desired level of regularity and preserving structural properties of a registration model. For example, diffeomorphic transformations can be attained. Our approach departs radically from existing deep learning approaches to image registration: it embeds a deep learning model in an optimization-based registration algorithm to parameterize the registration model itself and adapt it to data. Source code is publicly available at https://github.com/uncbiag/registration.
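
The idea of a regularizer whose strength varies across space can be shown with a one-dimensional toy example. This is a minimal sketch under stated assumptions, not the regularizer from the paper or the code in the linked repository: the function `adaptive_regularity`, the finite-difference penalty, and the hand-picked weights (standing in for learned ones) are all hypothetical.

```python
def adaptive_regularity(displacement, weights):
    """Weighted squared finite differences of a 1D displacement field.

    weights[i] scales the penalty on the jump between samples i and i+1,
    so len(weights) must equal len(displacement) - 1.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(weights) != len(displacement) - 1:
        raise ValueError("need one weight per neighboring pair")
    return sum(
        w * (b - a) ** 2
        for w, a, b in zip(weights, displacement, displacement[1:])
    )


# A displacement field with a single unit jump between samples 2 and 3.
u = [0.0, 0.0, 0.0, 1.0, 1.0]

uniform = [1.0, 1.0, 1.0, 1.0]   # same regularity everywhere
adaptive = [1.0, 1.0, 0.1, 1.0]  # stand-in for learned weights, relaxed at the jump

print(adaptive_regularity(u, uniform))   # 1.0
print(adaptive_regularity(u, adaptive))  # 0.1
```

With uniform weights the jump is penalized the same wherever it occurs. With spatially varying weights the same deformation becomes cheap where the data call for it and stays expensive elsewhere, while the rest of the optimization-based registration model is left unchanged.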

Citations: 0