Pub Date: 2024-04-03 | DOI: 10.1007/s41095-024-0422-4
Multi3D: 3D-aware multimodal image synthesis
Abstract
3D-aware image synthesis has attained high quality and robust 3D consistency. Existing 3D controllable generative models are designed to synthesize 3D-aware images through a single modality, such as 2D segmentation or sketches, but lack the ability to finely control generated content, such as texture and age. To enhance user-guided controllability, we propose Multi3D, a 3D-aware controllable image synthesis model that supports multi-modal input. Our model can govern the geometry of the generated image using a 2D label map, such as a segmentation or sketch map, while concurrently regulating the appearance of the generated image through a textual description. To demonstrate the effectiveness of our method, we conducted experiments on multiple datasets, including CelebAMask-HQ, AFHQ-cat, and ShapeNet-car. Qualitative and quantitative evaluations show that our method outperforms existing state-of-the-art methods.
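To make the multimodal conditioning concrete, here is a minimal sketch of how a 2D label map (geometry condition) and a text embedding (appearance condition) might be fused into a single conditioning code for a 3D-aware generator. The module names, layer sizes, and the assumption of a precomputed CLIP-style text embedding are illustrative, not details taken from the paper.

```python
# Hypothetical sketch: combining a 2D label map (geometry condition) with a
# text embedding (appearance condition) before a 3D-aware generator.
# Module names and sizes are illustrative, not the paper's actual design.
import torch
import torch.nn as nn

class MultiModalCondition(nn.Module):
    def __init__(self, num_classes=19, text_dim=512, cond_dim=256):
        super().__init__()
        # Encode the one-hot segmentation / sketch map into a geometry code.
        self.label_encoder = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, cond_dim),
        )
        # Project a precomputed text embedding (e.g., CLIP-style) to an appearance code.
        self.text_proj = nn.Linear(text_dim, cond_dim)

    def forward(self, label_map, text_emb):
        geometry_code = self.label_encoder(label_map)
        appearance_code = self.text_proj(text_emb)
        # The concatenated code would condition a 3D-aware generator (e.g., tri-planes).
        return torch.cat([geometry_code, appearance_code], dim=-1)

cond = MultiModalCondition()
z = cond(torch.randn(2, 19, 128, 128), torch.randn(2, 512))
print(z.shape)  # torch.Size([2, 512])
```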
{"title":"Multi3D: 3D-aware multimodal image synthesis","authors":"","doi":"10.1007/s41095-024-0422-4","DOIUrl":"https://doi.org/10.1007/s41095-024-0422-4","url":null,"abstract":"<h3>Abstract</h3> <p>3D-aware image synthesis has attained high quality and robust 3D consistency. Existing 3D controllable generative models are designed to synthesize 3D-aware images through a single modality, such as 2D segmentation or sketches, but lack the ability to finely control generated content, such as texture and age. In pursuit of enhancing user-guided controllability, we propose Multi3D, a 3D-aware controllable image synthesis model that supports multi-modal input. Our model can govern the geometry of the generated image using a 2D label map, such as a segmentation or sketch map, while concurrently regulating the appearance of the generated image through a textual description. To demonstrate the effectiveness of our method, we have conducted experiments on multiple datasets, including CelebAMask-HQ, AFHQ-cat, and shapenet-car. Qualitative and quantitative evaluations show that our method outperforms existing state-of-the-art methods. <span> <span> <img alt=\"\" src=\"https://static-content.springer.com/image/MediaObjects/41095_2024_422_Fig1_HTML.jpg\"/> </span> </span></p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":null,"pages":null},"PeriodicalIF":6.9,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140593367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-22 | DOI: 10.1007/s41095-022-0311-7
Active self-training for weakly supervised 3D scene semantic segmentation
Gengxin Liu, Oliver van Kaick, Hui Huang, Ruizhen Hu
Since the preparation of labeled data for training semantic segmentation networks of point clouds is a time-consuming process, weakly supervised approaches have been introduced to learn from only a small fraction of data. These methods are typically based on learning with contrastive losses while automatically deriving per-point pseudo-labels from a sparse set of user-annotated labels. In this paper, our key observation is that the selection of which samples to annotate is as important as how these samples are used for training. Thus, we introduce a method for weakly supervised segmentation of 3D scenes that combines self-training with active learning. Active learning selects points for annotation that are likely to result in improvements to the trained model, while self-training makes efficient use of the user-provided labels for learning the model. We demonstrate that our approach leads to an effective method that provides improvements in scene segmentation over previous work and baselines, while requiring only a few user annotations.
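As a rough illustration of how active learning and self-training can interact within one round, the sketch below selects the most uncertain unlabeled points for annotation (an entropy-based criterion, which is an assumption; the paper's acquisition criterion may differ) and keeps high-confidence predictions as pseudo-labels. Here `probs` stands in for the current model's per-point class probabilities.

```python
# Schematic sketch of one active self-training round for point clouds,
# assuming entropy-based point selection; the paper's actual acquisition
# criterion and training details may differ.
import numpy as np

rng = np.random.default_rng(0)
num_points, num_classes = 1000, 5
# Stand-in for per-point class probabilities predicted by the current model.
logits = rng.normal(size=(num_points, num_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

labeled = np.zeros(num_points, dtype=bool)
labeled[rng.choice(num_points, size=20, replace=False)] = True  # sparse user labels

# Active learning: ask the user to annotate the most uncertain unlabeled points.
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
entropy[labeled] = -np.inf
to_annotate = np.argsort(entropy)[-10:]

# Self-training: keep high-confidence predictions as pseudo-labels.
confidence = probs.max(axis=1)
pseudo_mask = (~labeled) & (confidence > 0.9)
pseudo_labels = probs.argmax(axis=1)

print(f"request {len(to_annotate)} annotations, {pseudo_mask.sum()} pseudo-labels")
```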
{"title":"Active self-training for weakly supervised 3D scene semantic segmentation","authors":"Gengxin Liu, Oliver van Kaick, Hui Huang, Ruizhen Hu","doi":"10.1007/s41095-022-0311-7","DOIUrl":"https://doi.org/10.1007/s41095-022-0311-7","url":null,"abstract":"<p>Since the preparation of labeled data for training semantic segmentation networks of point clouds is a time-consuming process, weakly supervised approaches have been introduced to learn from only a small fraction of data. These methods are typically based on learning with contrastive losses while automatically deriving per-point pseudo-labels from a sparse set of user-annotated labels. In this paper, our key observation is that the selection of which samples to annotate is as important as how these samples are used for training. Thus, we introduce a method for weakly supervised segmentation of 3D scenes that combines self-training with active learning. Active learning selects points for annotation that are likely to result in improvements to the trained model, while self-training makes efficient use of the user-provided labels for learning the model. We demonstrate that our approach leads to an effective method that provides improvements in scene segmentation over previous work and baselines, while requiring only a few user annotations.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":null,"pages":null},"PeriodicalIF":6.9,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-22 | DOI: 10.1007/s41095-023-0362-4
Class-conditional domain adaptation for semantic segmentation
Yue Wang, Yuke Li, James H. Elder, Runmin Wu, Huchuan Lu
Semantic segmentation is an important sub-task for many applications. However, pixel-level ground-truth labeling is costly, and there is a tendency to overfit to training data, thereby limiting the generalization ability. Unsupervised domain adaptation can potentially address these problems by allowing systems trained on labeled datasets from the source domain (including less expensive synthetic domains) to be adapted to a novel target domain. The conventional approach involves automatic extraction and global alignment of the representations of the source and target domains. One limitation of this approach is that it tends to neglect differences between classes: representations of certain classes can be more easily extracted and aligned between the source and target domains than others, limiting the adaptation over all classes. Here, we address this problem by introducing a Class-Conditional Domain Adaptation (CCDA) method. It incorporates a class-conditional multi-scale discriminator and class-conditional losses for both segmentation and adaptation. Together, they measure the segmentation, shift the domain in a class-conditional manner, and equalize the loss over classes. Experimental results demonstrate that the performance of our CCDA method matches, and in some cases surpasses, that of state-of-the-art methods.
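The sketch below illustrates one way a class-conditional adversarial term could be formed: discriminator scores on target-domain images are weighted by per-class soft masks derived from the segmentation prediction, and the loss is averaged within each class before being averaged over classes so that every class contributes equally. The shapes, the binary cross-entropy formulation, and the weighting scheme are assumptions, not the paper's exact losses.

```python
# Hedged sketch of a class-conditional adversarial alignment term: the
# discriminator scores are weighted by per-class soft masks so that each
# class contributes its own, equalized alignment loss.
import torch
import torch.nn.functional as F

def class_conditional_adv_loss(disc_map, seg_probs):
    """disc_map: (B, 1, H, W) discriminator logits on target-domain images.
    seg_probs: (B, C, H, W) predicted class probabilities, used as soft class masks."""
    b, c, h, w = seg_probs.shape
    # Adversarial target: make target-domain pixels look like source-domain pixels.
    source_label = torch.ones(b, c, h, w)
    per_pixel = F.binary_cross_entropy_with_logits(
        disc_map.expand(-1, c, -1, -1), source_label, reduction="none")
    # Average the loss inside each class's soft mask, then equally over classes.
    mask_sum = seg_probs.sum(dim=(0, 2, 3)).clamp(min=1e-6)
    per_class = (per_pixel * seg_probs).sum(dim=(0, 2, 3)) / mask_sum
    return per_class.mean()

loss = class_conditional_adv_loss(torch.randn(2, 1, 32, 32),
                                  torch.softmax(torch.randn(2, 5, 32, 32), dim=1))
print(loss.item())
```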
{"title":"Class-conditional domain adaptation for semantic segmentation","authors":"Yue Wang, Yuke Li, James H. Elder, Runmin Wu, Huchuan Lu","doi":"10.1007/s41095-023-0362-4","DOIUrl":"https://doi.org/10.1007/s41095-023-0362-4","url":null,"abstract":"<p>Semantic segmentation is an important sub-task for many applications. However, pixel-level ground-truth labeling is costly, and there is a tendency to overfit to training data, thereby limiting the generalization ability. Unsupervised domain adaptation can potentially address these problems by allowing systems trained on labelled datasets from the source domain (including less expensive synthetic domain) to be adapted to a novel target domain. The conventional approach involves automatic extraction and alignment of the representations of source and target domains globally. One limitation of this approach is that it tends to neglect the differences between classes: representations of certain classes can be more easily extracted and aligned between the source and target domains than others, limiting the adaptation over all classes. Here, we address this problem by introducing a Class-Conditional Domain Adaptation (CCDA) method. This incorporates a class-conditional multi-scale discriminator and class-conditional losses for both segmentation and adaptation. Together, they measure the segmentation, shift the domain in a class-conditional manner, and equalize the loss over classes. Experimental results demonstrate that the performance of our CCDA method matches, and in some cases, surpasses that of state-of-the-art methods.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":null,"pages":null},"PeriodicalIF":6.9,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-22 | DOI: 10.1007/s41095-023-0379-8
Geometry-aware 3D pose transfer using transformer autoencoder
Shanghuan Liu, Shaoyan Gai, Feipeng Da, Fazal Waris
3D pose transfer over unorganized point clouds is a challenging generation task that transfers a source's pose to a target shape while keeping the target's identity. Recent deep models have learned deformations and used the target's identity as a style to modulate the combined features of the two shapes or the aligned vertices of the source shape. However, all operations in these models are point-wise and independent, and they ignore the geometric information of the surface and structure of the input shapes. This disadvantage severely limits their generation and generalization capabilities. In this study, we propose a geometry-aware method based on a novel transformer autoencoder to solve this problem. An efficient self-attention mechanism, cross-covariance attention, is used throughout our framework to capture correlations between points at different distances. Specifically, the transformer encoder extracts the target shape's local geometry details for identity attributes and the source shape's global geometry structure for pose information. Our transformer decoder efficiently learns deformations and recovers identity properties by fusing and decoding the extracted features in a geometry-attentional manner, without requiring correspondence information or modulation steps. Experiments demonstrate that the geometry-aware method achieves state-of-the-art performance on the 3D pose transfer task. The implementation code and data are available at https://github.com/SEULSH/Geometry-Aware-3D-Pose-Transfer-Using-Transformer-Autoencoder.
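Cross-covariance attention (XCA) computes attention between feature channels rather than between points, so its cost grows with the channel dimension instead of the number of points. The sketch below follows the standard XCA formulation (L2-normalized queries and keys, a learnable per-head temperature); the head count and feature sizes are illustrative, and the paper's surrounding encoder/decoder blocks are not shown.

```python
# Minimal sketch of cross-covariance attention (XCA): attention is computed
# between channels, so the attention map is (C_head x C_head) instead of (N x N).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossCovarianceAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head temperature, as in XCiT.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x):                                    # x: (B, N, C) point features
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, c // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)                 # each: (B, heads, C_head, N)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, C_head, C_head)
        out = attn.softmax(dim=-1) @ v                       # (B, heads, C_head, N)
        return self.proj(out.permute(0, 3, 1, 2).reshape(b, n, c))

xca = CrossCovarianceAttention(dim=64)
print(xca(torch.randn(2, 1024, 64)).shape)                   # torch.Size([2, 1024, 64])
```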
{"title":"Geometry-aware 3D pose transfer using transformer autoencoder","authors":"Shanghuan Liu, Shaoyan Gai, Feipeng Da, Fazal Waris","doi":"10.1007/s41095-023-0379-8","DOIUrl":"https://doi.org/10.1007/s41095-023-0379-8","url":null,"abstract":"<p>3D pose transfer over unorganized point clouds is a challenging generation task, which transfers a source’s pose to a target shape and keeps the target’s identity. Recent deep models have learned deformations and used the target’s identity as a style to modulate the combined features of two shapes or the aligned vertices of the source shape. However, all operations in these models are point-wise and independent and ignore the geometric information on the surface and structure of the input shapes. This disadvantage severely limits the generation and generalization capabilities. In this study, we propose a geometry-aware method based on a novel transformer autoencoder to solve this problem. An efficient self-attention mechanism, that is, cross-covariance attention, was utilized across our framework to perceive the correlations between points at different distances. Specifically, the transformer encoder extracts the target shape’s local geometry details for identity attributes and the source shape’s global geometry structure for pose information. Our transformer decoder efficiently learns deformations and recovers identity properties by fusing and decoding the extracted features in a geometry attentional manner, which does not require corresponding information or modulation steps. The experiments demonstrated that the geometry-aware method achieved state-of-the-art performance in a 3D pose transfer task. The implementation code and data are available at https://github.com/SEULSH/Geometry-Aware-3D-Pose-Transfer-Using-Transformer-Autoencoder.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":null,"pages":null},"PeriodicalIF":6.9,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-22 | DOI: 10.1007/s41095-023-0340-x
Multi-scale hash encoding based neural geometry representation
Abstract
Recently, neural implicit function-based representations have attracted increasing attention and have been widely used to represent surfaces with differentiable neural networks. However, surface reconstruction from point clouds or multi-view images using existing neural geometry representations still suffers from slow computation and poor accuracy. To alleviate these issues, we propose a multi-scale hash encoding-based neural geometry representation which effectively and efficiently represents the surface as a signed distance field. Our novel neural network structure carefully combines low-frequency Fourier position encoding with multi-scale hash encoding. The initialization of the geometry network and the geometry features of the rendering module are redesigned accordingly. Our experiments demonstrate that the proposed representation is at least 10 times faster when reconstructing point clouds with millions of points, and it significantly improves the speed and accuracy of multi-view reconstruction. Our code and models are available at https://github.com/Dengzhi-USTC/Neural-Geometry-Reconstruction.
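For readers unfamiliar with multi-resolution hash encoding, the sketch below shows the basic mechanism in the spirit of Instant-NGP: each query point is looked up at several grid resolutions, the grid-cell corners are hashed into small learnable feature tables, and the trilinearly interpolated features from all levels are concatenated. Table sizes, level counts, and the hash are illustrative assumptions; the paper additionally combines this with low-frequency Fourier position encoding, which is omitted here.

```python
# Hedged sketch of multi-resolution hash encoding; hyperparameters are illustrative.
import torch
import torch.nn as nn

class HashEncoding(nn.Module):
    PRIMES = (1, 2654435761, 805459861)          # per-axis primes, as in Instant-NGP

    def __init__(self, num_levels=8, features=2, table_size=2**16,
                 base_res=16, growth=1.5):
        super().__init__()
        self.res = [int(base_res * growth ** l) for l in range(num_levels)]
        self.table_size = table_size
        self.tables = nn.ParameterList(
            nn.Parameter(1e-4 * torch.randn(table_size, features))
            for _ in range(num_levels))
        # The 8 corner offsets of a grid cell.
        self.register_buffer("corners", torch.tensor(
            [[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)]))

    def _hash(self, ijk):                        # ijk: (N, 8, 3) integer corners
        h = ijk[..., 0] * self.PRIMES[0]
        h = h ^ (ijk[..., 1] * self.PRIMES[1])
        h = h ^ (ijk[..., 2] * self.PRIMES[2])
        return h % self.table_size

    def forward(self, x):                        # x: (N, 3), coordinates in [0, 1]
        feats = []
        for res, table in zip(self.res, self.tables):
            p = x * res
            p0 = p.floor().long()
            w = (p - p0)[:, None, :]             # (N, 1, 3) fractional offsets
            c = self.corners[None].float()       # (1, 8, 3)
            weight = (c * w + (1 - c) * (1 - w)).prod(dim=-1)              # (N, 8)
            feat = table[self._hash(p0[:, None, :] + self.corners[None])]  # (N, 8, F)
            feats.append((feat * weight[..., None]).sum(dim=1))
        return torch.cat(feats, dim=-1)          # (N, num_levels * F)

enc = HashEncoding()
print(enc(torch.rand(1024, 3)).shape)            # torch.Size([1024, 16])
```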
{"title":"Multi-scale hash encoding based neural geometry representation","authors":"","doi":"10.1007/s41095-023-0340-x","DOIUrl":"https://doi.org/10.1007/s41095-023-0340-x","url":null,"abstract":"<h3>Abstract</h3> <p>Recently, neural implicit function-based representation has attracted more and more attention, and has been widely used to represent surfaces using differentiable neural networks. However, surface reconstruction from point clouds or multi-view images using existing neural geometry representations still suffer from slow computation and poor accuracy. To alleviate these issues, we propose a multi-scale hash encoding-based neural geometry representation which effectively and efficiently represents the surface as a signed distance field. Our novel neural network structure carefully combines low-frequency Fourier position encoding with multi-scale hash encoding. The initialization of the geometry network and geometry features of the rendering module are accordingly redesigned. Our experiments demonstrate that the proposed representation is at least 10 times faster for reconstructing point clouds with millions of points. It also significantly improves speed and accuracy of multi-view reconstruction. Our code and models are available at https://github.com/Dengzhi-USTC/Neural-Geometry-Reconstruction. <span> <span> <img alt=\"\" src=\"https://static-content.springer.com/image/MediaObjects/41095_2023_340_Fig1_HTML.jpg\"/> </span> </span></p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":null,"pages":null},"PeriodicalIF":6.9,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140204009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-22 | DOI: 10.1007/s41095-024-0398-z
Erratum to: Dynamic ocean inverse modeling based on differentiable rendering
Xueguang Xie, Yang Gao, Fei Hou, Aimin Hao, Hong Qin
The authors apologize for an error in the article: the images in Figs. 14(a) and 14(d) were mistakenly presented as left–right mirror images. The authors have flipped them so that these panels now correspond correctly with the other subfigures (b, c, e, f). The corrected version of Fig. 14 is provided below.
{"title":"Erratum to: Dynamic ocean inverse modeling based on differentiable rendering","authors":"Xueguang Xie, Yang Gao, Fei Hou, Aimin Hao, Hong Qin","doi":"10.1007/s41095-024-0398-z","DOIUrl":"https://doi.org/10.1007/s41095-024-0398-z","url":null,"abstract":"<p>The authors apologize for a hidden error in the article. It is that the images in Figs. 14(a) and 14(d) were mistakenly presented as left–right mirror images. The authors have flipped them to ensure that the figures now correspond correctly with others in the subfigures (b, c, e, f). The accurate version of Fig. 14 is provided as below.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":null,"pages":null},"PeriodicalIF":6.9,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140204015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-09 | DOI: 10.1007/s41095-023-0352-6
Delving into high-quality SVBRDF acquisition: A new setup and method
Chuhua Xian, Jiaxin Li, Hao Wu, Zisen Lin, Guiqing Li
In this study, we present a new framework for acquiring high-quality SVBRDF maps that addresses the limitations of current methods. The core of our method is a simple hardware setup consisting of a consumer-level camera, LED lights, and a carefully designed network that can accurately recover the high-quality SVBRDF properties of a nearly planar object. Given a flexible number of captured images of an object, our network uses different subnetworks to train different property maps and employs an appropriate loss function for each of them. To further enhance the quality of the maps, we improve the network structure by adding a novel skip connection that links the encoder and decoder with global features. Extensive experiments on both synthetic and real-world materials demonstrate that our method outperforms previous methods and produces superior results. Furthermore, our proposed setup can also be used to acquire physically based rendering maps of special materials.
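A minimal sketch of the "one subnetwork and one loss per property map" idea: a shared encoder feeds separate heads for albedo, normal, roughness, and specular maps, each supervised by its own loss. The layer sizes, the choice of L1 and cosine losses, and the single-image input are assumptions; the actual network takes a flexible number of images and uses a global-feature skip connection not shown here.

```python
# Illustrative sketch: shared encoder with one lightweight head per SVBRDF map,
# each map trained with its own loss. Not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVBRDFNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        # One decoder head per SVBRDF property map.
        heads = {"albedo": 3, "normal": 3, "roughness": 1, "specular": 3}
        self.heads = nn.ModuleDict(
            {k: nn.Conv2d(64, c, 3, padding=1) for k, c in heads.items()})

    def forward(self, x):
        z = self.encoder(x)
        return {k: head(z) for k, head in self.heads.items()}

def svbrdf_loss(pred, target):
    # Per-map losses: L1 on albedo/roughness/specular, cosine distance on normals.
    loss = sum(F.l1_loss(pred[k], target[k]) for k in ("albedo", "roughness", "specular"))
    loss += (1 - F.cosine_similarity(pred["normal"], target["normal"], dim=1)).mean()
    return loss

net = SVBRDFNet()
maps = net(torch.randn(1, 3, 64, 64))
target = {k: torch.randn_like(v) for k, v in maps.items()}
print({k: tuple(v.shape) for k, v in maps.items()}, svbrdf_loss(maps, target).item())
```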
{"title":"Delving into high-quality SVBRDF acquisition: A new setup and method","authors":"Chuhua Xian, Jiaxin Li, Hao Wu, Zisen Lin, Guiqing Li","doi":"10.1007/s41095-023-0352-6","DOIUrl":"https://doi.org/10.1007/s41095-023-0352-6","url":null,"abstract":"<p>In this study, we present a new and innovative framework for acquiring high-quality SVBRDF maps. Our approach addresses the limitations of the current methods and proposes a new solution. The core of our method is a simple hardware setup consisting of a consumer-level camera, LED lights, and a carefully designed network that can accurately obtain the high-quality SVBRDF properties of a nearly planar object. By capturing a flexible number of images of an object, our network uses different subnetworks to train different property maps and employs appropriate loss functions for each of them. To further enhance the quality of the maps, we improved the network structure by adding a novel skip connection that connects the encoder and decoder with global features. Through extensive experimentation using both synthetic and real-world materials, our results demonstrate that our method outperforms previous methods and produces superior results. Furthermore, our proposed setup can also be used to acquire physically based rendering maps of special materials.</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":null,"pages":null},"PeriodicalIF":6.9,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139765949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-08 | DOI: 10.1007/s41095-023-0369-x
CF-DAN: Facial-expression recognition based on cross-fusion dual-attention network
Abstract
Recently, facial-expression recognition (FER) has primarily focused on images in the wild, with factors such as face occlusion and image blurring, rather than laboratory images. Complex field environments have introduced new challenges to FER. To address these challenges, this study proposes a cross-fusion dual-attention network. The network comprises three parts: (1) a cross-fusion grouped dual-attention mechanism that refines local features and obtains global information; (2) a proposed C² activation function construction method, a piecewise cubic polynomial with three degrees of freedom, which requires less computation, improves flexibility and recognition ability, and better addresses slow running speeds and neuron inactivation; and (3) a closed-loop operation between the self-attention distillation process and residual connections that suppresses redundant information and improves the generalization ability of the model. The recognition accuracies on the RAF-DB, FERPlus, and AffectNet datasets were 92.78%, 92.02%, and 63.58%, respectively. Experiments show that this model can provide more effective solutions for FER tasks.
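As a toy illustration of a piecewise cubic activation that is C² at its junction (not the paper's exact construction), the function below uses a different cubic coefficient on each side of zero while sharing the value, slope, and curvature at the junction, so it is twice continuously differentiable there.

```python
# Toy C^2 piecewise cubic activation; coefficients and form are illustrative only.
import numpy as np

def piecewise_cubic(x, a_neg=0.05, a_pos=0.5, b=0.5, c=1.0):
    # f(x) = a*x^3 + b*x^2 + c*x with a different cubic coefficient on each side;
    # the value (0), slope (c), and curvature (2b) agree at x = 0, so f is C^2.
    a = np.where(x < 0, a_neg, a_pos)
    return a * x**3 + b * x**2 + c * x

x = np.linspace(-2, 2, 9)
print(piecewise_cubic(x))
```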
{"title":"CF-DAN: Facial-expression recognition based on cross-fusion dual-attention network","authors":"","doi":"10.1007/s41095-023-0369-x","DOIUrl":"https://doi.org/10.1007/s41095-023-0369-x","url":null,"abstract":"<h3>Abstract</h3> <p>Recently, facial-expression recognition (FER) has primarily focused on images in the wild, including factors such as face occlusion and image blurring, rather than laboratory images. Complex field environments have introduced new challenges to FER. To address these challenges, this study proposes a cross-fusion dual-attention network. The network comprises three parts: (1) a cross-fusion grouped dual-attention mechanism to refine local features and obtain global information; (2) a proposed <em>C</em><sup>2</sup> activation function construction method, which is a piecewise cubic polynomial with three degrees of freedom, requiring less computation with improved flexibility and recognition abilities, which can better address slow running speeds and neuron inactivation problems; and (3) a closed-loop operation between the self-attention distillation process and residual connections to suppress redundant information and improve the generalization ability of the model. The recognition accuracies on the RAF-DB, FERPlus, and AffectNet datasets were 92.78%, 92.02%, and 63.58%, respectively. Experiments show that this model can provide more effective solutions for FER tasks. <span> <span> <img alt=\"\" src=\"https://static-content.springer.com/image/MediaObjects/41095_2023_369_Fig1_HTML.jpg\"/> </span> </span></p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":null,"pages":null},"PeriodicalIF":6.9,"publicationDate":"2024-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139765951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-08 | DOI: 10.1007/s41095-022-0319-z
Multi-task learning and joint refinement between camera localization and object detection
Junyi Wang, Yue Qi
Visual localization and object detection both play important roles in various tasks. In many indoor application scenarios where some detected objects have fixed positions, the two techniques work closely together. However, few researchers consider these two tasks simultaneously, because of a lack of datasets and the little attention paid to such environments. In this paper, we explore multi-task network design and joint refinement of detection and localization. To address the dataset problem, we construct a medium-scale indoor scene dataset of an aviation exhibition hall through a semi-automatic process. The dataset provides localization and detection information, and is publicly available at https://drive.google.com/drive/folders/1U28zkuN4_I0dbzkqyIAKlAl5k9oUK0jI?usp=sharing for benchmarking localization and object detection tasks. Targeting this dataset, we have designed a multi-task network, JLDNet, based on YOLO v3, that outputs a target point cloud and object bounding boxes. For dynamic environments, the detection branch also promotes the perception of dynamics. JLDNet includes image feature learning, point feature learning, feature fusion, detection construction, and point cloud regression. Moreover, object-level bundle adjustment is used to further improve localization and detection accuracy. To test JLDNet and compare it to other methods, we conducted experiments on 7 static scenes, our constructed dataset, and the dynamic TUM RGB-D and Bonn datasets. Our results show state-of-the-art accuracy on both tasks and demonstrate the benefit of addressing them jointly.
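A highly simplified sketch of the multi-task idea: a shared image backbone with one head regressing a point cloud for localization and another head predicting object boxes, trained with a weighted joint loss. The real JLDNet builds on YOLO v3 and fuses image and point features with object-level bundle adjustment; every module and weight below is a placeholder.

```python
# Placeholder multi-task network: shared backbone, localization and detection heads.
import torch
import torch.nn as nn

class JointLocDetNet(nn.Module):
    def __init__(self, num_points=1024, num_boxes=10):
        super().__init__()
        self.num_points, self.num_boxes = num_points, num_boxes
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.point_head = nn.Linear(64, num_points * 3)   # localization branch
        self.det_head = nn.Linear(64, num_boxes * 5)      # (x, y, w, h, conf) per box

    def forward(self, img):
        f = self.backbone(img)
        points = self.point_head(f).view(-1, self.num_points, 3)
        boxes = self.det_head(f).view(-1, self.num_boxes, 5)
        return points, boxes

net = JointLocDetNet()
points, boxes = net(torch.randn(2, 3, 128, 128))
# Stand-in weighted joint loss; real training would compare against ground truth.
loss = 1.0 * points.abs().mean() + 0.5 * boxes.abs().mean()
print(points.shape, boxes.shape, loss.item())
```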
{"title":"Multi-task learning and joint refinement between camera localization and object detection","authors":"Junyi Wang, Yue Qi","doi":"10.1007/s41095-022-0319-z","DOIUrl":"https://doi.org/10.1007/s41095-022-0319-z","url":null,"abstract":"<p>Visual localization and object detection both play important roles in various tasks. In many indoor application scenarios where some detected objects have fixed positions, the two techniques work closely together. However, few researchers consider these two tasks simultaneously, because of a lack of datasets and the little attention paid to such environments. In this paper, we explore multi-task network design and joint refinement of detection and localization. To address the dataset problem, we construct a medium indoor scene of an aviation exhibition hall through a semi-automatic process. The dataset provides localization and detection information, and is publicly available at https://drive.google.com/drive/folders/1U28zkuN4_I0dbzkqyIAKlAl5k9oUK0jI?usp=sharing for benchmarking localization and object detection tasks. Targeting this dataset, we have designed a multi-task network, JLDNet, based on YOLO v3, that outputs a target point cloud and object bounding boxes. For dynamic environments, the detection branch also promotes the perception of dynamics. JLDNet includes image feature learning, point feature learning, feature fusion, detection construction, and point cloud regression. Moreover, object-level bundle adjustment is used to further improve localization and detection accuracy. To test JLDNet and compare it to other methods, we have conducted experiments on 7 static scenes, our constructed dataset, and the dynamic TUM RGB-D and Bonn datasets. Our results show state-of-the-art accuracy for both tasks, and the benefit of jointly working on both tasks is demonstrated.\u0000</p>","PeriodicalId":37301,"journal":{"name":"Computational Visual Media","volume":null,"pages":null},"PeriodicalIF":6.9,"publicationDate":"2024-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139766066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-02-08 | DOI: 10.1007/s41095-022-0318-0
DualSmoke: Sketch-based smoke illustration design with two-stage generative model
Haoran Xie, Keisuke Arihara, Syuhei Sato, Kazunori Miyata
The dynamic effects of smoke are impressive in illustration design, but designing smoke effects without domain knowledge of fluid simulation is a troublesome and challenging task for inexpert users. In this work, we propose DualSmoke, a two-stage global-to-local generation framework for interactive smoke illustration design. In the global stage, the proposed approach uses fluid patterns to generate Lagrangian coherent structures from the user's hand-drawn sketches. In the local stage, detailed flow patterns are obtained from the generated coherent structure. Finally, we apply a guiding force field to the smoke simulator to produce the desired smoke illustration. To construct the training dataset, DualSmoke generates flow patterns using finite-time Lyapunov exponents of the velocity fields. The synthetic sketch data are generated from the flow patterns by skeleton extraction. Our user study verifies that the proposed design interface provides a variety of smoke illustration designs with good usability. Our code is available at https://github.com/shasph/DualSmoke.