
2023 18th International Conference on Machine Vision and Applications (MVA): Latest Publications

CG-based dataset generation and adversarial image conversion for deep cucumber recognition
Pub Date : 2023-07-23 DOI: 10.23919/MVA57639.2023.10215910
Hiroaki Masuzawa, Chuo Nakano, Jun Miura
This paper deals with deep cucumber recognition using CG (Computer Graphics)-based dataset generation. The variety and the size of the dataset are crucial in deep learning. Although there are many public datasets for common situations like traffic scenes, we need to build a dataset for a particular scene like cucumber farms. As it is costly and time-consuming to annotate a large amount of data manually, we propose generating images by CG and converting them to realistic ones using adversarial learning approaches. We compare several image conversion methods using real cucumber plant images.
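The conversion step described here, mapping CG renderings to realistic-looking images with adversarial learning, can be illustrated with a minimal GAN-style training step. This is only a sketch assuming PyTorch; the `Generator` and `Discriminator` architectures and the L1 weight are illustrative choices, not the authors' models.

```python
# Minimal sketch of adversarial CG-to-real image conversion (GAN-style),
# assuming PyTorch; network sizes are illustrative, not the authors' models.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a CG-rendered image to a (hopefully) more realistic image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Scores whether an image patch looks real or CG-converted."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def train_step(gen, disc, cg_batch, real_batch, opt_g, opt_d):
    bce = nn.BCEWithLogitsLoss()
    # 1) Discriminator: real cucumber images -> 1, converted CG images -> 0.
    fake = gen(cg_batch).detach()
    d_real, d_fake = disc(real_batch), disc(fake)
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator: fool the discriminator while staying close to the CG input
    #    (the L1 term keeps the plant structure; the weight 10.0 is arbitrary here).
    fake = gen(cg_batch)
    d_fake = disc(fake)
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + 10.0 * nn.functional.l1_loss(fake, cg_batch)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```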
Citations: 0
Can you read lips with a masked face?
Pub Date : 2023-07-23 DOI: 10.23919/MVA57639.2023.10215925
Taiki Arakane, Chihiro Kai, T. Saitoh
We have been working on lip-reading, which estimates the content of utterances using only visual information. When most people started wearing masks due to the coronavirus pandemic, several people asked us whether machine-based lip-reading was possible even when wearing a mask. Taking this as a research question, we worked on the world's first word-level lip-reading for masked face images. Since no utterance scene dataset with masks is open to the public, we developed our own dataset together with face detection for masked faces, facial landmark detection, and deep-learning-based lip-reading. We collected speech scenes of 20 people for 15 Japanese words and obtained a recognition accuracy of 88.3% in the recognition experiment. This paper reports that lip-reading is possible for masked face images.
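The pipeline described here (mouth-region crops obtained via face and landmark detection, then deep-learning-based word classification) can be sketched as a small per-frame encoder followed by a recurrent classifier. This is a hedged illustration assuming PyTorch; `LipReader` and its layer sizes are hypothetical and not the authors' network.

```python
# Minimal sketch of a word-level lip-reading classifier over mouth-region crops,
# assuming PyTorch; the architecture is illustrative, not the authors' network.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, num_words=15, feat_dim=128):
        super().__init__()
        # Per-frame encoder for grayscale mouth crops (e.g. 64x64).
        self.frame_enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Temporal model over the frame features, then a word classifier.
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_words)

    def forward(self, clips):                 # clips: (B, T, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.frame_enc(clips.flatten(0, 1)).view(b, t, -1)
        _, h = self.gru(feats)                # h: (1, B, feat_dim)
        return self.head(h[-1])               # word logits: (B, num_words)

logits = LipReader()(torch.randn(2, 30, 1, 64, 64))   # 2 clips of 30 frames each
```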
Citations: 0
Ensemble Fusion for Small Object Detection
Pub Date : 2023-07-23 DOI: 10.23919/MVA57639.2023.10215748
Hao-Yu Hou, Mu-Yi Shen, Chia-Chi Hsu, En-Ming Huang, Yu-Chen Huang, Yu-Cheng Xia, Chien-Yao Wang, Chun-Yi Lee
Detecting small objects is often impeded by blurriness and low resolution, which poses substantial challenges for accurately detecting and localizing such objects. In addition, conventional feature extraction methods usually face difficulties in capturing effective representations for these entities, as down-sampling and convolutional operations contribute to the blurring of small object details. To tackle these challenges, this study introduces an approach for detecting tiny objects through ensemble fusion, which leverages the advantages of multiple diverse model variants and combines their predictions. Experimental results reveal that the proposed method effectively harnesses the strengths of each model via ensemble fusion, leading to enhanced accuracy and robustness in small object detection. Our model achieves the highest score of 0.776 in terms of average precision (AP) at an IoU threshold of 0.5 in the MVA Challenge on Small Object Detection for Birds.
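A minimal way to combine the predictions of several detector variants is to pool their boxes and suppress duplicates with NMS. The sketch below, assuming PyTorch and torchvision, shows only this generic idea; the paper's actual fusion scheme is not reproduced here, and the example inputs are hypothetical.

```python
# Generic sketch of fusing detections from several model variants with NMS;
# the authors' fusion scheme may differ (this shows only the basic idea).
import torch
from torchvision.ops import nms

def ensemble_detections(per_model_outputs, iou_thr=0.5):
    """per_model_outputs: list of (boxes [N, 4], scores [N]) from different models."""
    boxes = torch.cat([b for b, _ in per_model_outputs], dim=0)
    scores = torch.cat([s for _, s in per_model_outputs], dim=0)
    keep = nms(boxes, scores, iou_thr)        # suppress overlapping duplicates
    return boxes[keep], scores[keep]

# Example with two hypothetical models' outputs for one image:
m1 = (torch.tensor([[10., 10., 50., 50.]]), torch.tensor([0.9]))
m2 = (torch.tensor([[12., 11., 52., 49.]]), torch.tensor([0.8]))
fused_boxes, fused_scores = ensemble_detections([m1, m2])
```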
Citations: 1
Combining Knowledge Distillation and Transfer Learning for Sensor Fusion in Visible and Thermal Camera-based Person Classification
Pub Date : 2023-07-23 DOI: 10.23919/MVA57639.2023.10215818
Vijay John, Yasutomo Kawanishi
Visible and thermal camera-based sensor fusion has been shown to address the limitations and enhance the robustness of visible camera-based person classification. In this paper, we propose to further enhance the accuracy of visible-thermal person classification using transfer learning, knowledge distillation, and the vision transformer. In our work, the visible-thermal person classifier is implemented using the vision transformer. The proposed classifier is trained using transfer learning and knowledge distillation. To train the proposed classifier, visible and thermal teacher models are implemented using vision transformers. The multimodal classifier learns from the two teachers using a novel loss function that incorporates knowledge distillation. The proposed method is validated on the public Speaking Faces dataset. A comparative analysis with baseline algorithms and an ablation study are performed. The results show that the proposed framework achieves enhanced classification accuracy.
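The idea of a multimodal student learning from visible and thermal teachers through a distillation-aware loss can be sketched as hard-label cross-entropy plus KL terms toward each teacher's softened outputs. This is an assumed formulation for illustration (the temperature `T` and the weight `alpha` are placeholders), not the paper's exact loss function.

```python
# Sketch of a distillation loss combining ground-truth supervision with soft
# targets from visible and thermal teachers (assumed weighting, not the paper's).
import torch
import torch.nn.functional as F

def multimodal_kd_loss(student_logits, visible_logits, thermal_logits,
                       labels, T=4.0, alpha=0.5):
    # Hard-label cross-entropy on the fused student.
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence to each teacher's temperature-softened distribution.
    log_p = F.log_softmax(student_logits / T, dim=1)
    kd_vis = F.kl_div(log_p, F.softmax(visible_logits / T, dim=1), reduction="batchmean")
    kd_thm = F.kl_div(log_p, F.softmax(thermal_logits / T, dim=1), reduction="batchmean")
    return (1 - alpha) * ce + alpha * (T * T) * 0.5 * (kd_vis + kd_thm)

loss = multimodal_kd_loss(torch.randn(8, 2), torch.randn(8, 2),
                          torch.randn(8, 2), torch.randint(0, 2, (8,)))
```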
Citations: 0
Black-box Adversarial Attack against Visual Interpreters for Deep Neural Networks
Pub Date : 2023-07-23 DOI: 10.23919/MVA57639.2023.10215758
Yudai Hirose, Satoshi Ono
With the rapid development of deep neural networks (DNNs), eXplainable AI, which provides a rationale for predictions on given inputs, has become increasingly important. In addition, DNNs are vulnerable to Adversarial Examples (AEs), which cause incorrect outputs by applying specially crafted perturbations to inputs. Potential vulnerabilities can also exist in image interpreters such as GradCAM, necessitating their investigation, as these vulnerabilities could potentially result in misdiagnosis within medical imaging. Therefore, this study proposes a black-box adversarial attack method that misleads the image interpreter using Sep-CMA-ES. The proposed method deceptively shifts the focus area of the image interpreter to a different location from that of the original image while maintaining the same predicted labels.
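The attack can be framed as a black-box optimization problem: keep the predicted label fixed while pushing the interpreter's focus away from its original location. Below is a conceptual sketch using the `cma` package with its diagonal (separable) option; `predict_label` and `saliency_peak` are hypothetical hooks for the target model and its visual interpreter, and the penalty weights are arbitrary illustration values, not the paper's objective.

```python
# Conceptual sketch of a Sep-CMA-ES attack objective: preserve the label but
# shift the interpreter's focus. The model/interpreter hooks are placeholders.
import numpy as np
import cma

def fitness(delta, image, predict_label, saliency_peak, orig_label, orig_peak):
    adv = np.clip(image + delta.reshape(image.shape), 0.0, 1.0)
    if predict_label(adv) != orig_label:           # the label must not change
        return 1e6                                  # heavy penalty if it does
    shift = np.linalg.norm(np.array(saliency_peak(adv)) - np.array(orig_peak))
    return -shift + 0.01 * np.linalg.norm(delta)    # move focus far, keep delta small

def run_attack(image, predict_label, saliency_peak, orig_label, orig_peak, sigma=0.05):
    es = cma.CMAEvolutionStrategy(np.zeros(image.size), sigma, {"CMA_diagonal": True})
    while not es.stop():
        deltas = es.ask()
        es.tell(deltas, [fitness(d, image, predict_label, saliency_peak,
                                 orig_label, orig_peak) for d in deltas])
    return np.clip(image + es.result.xbest.reshape(image.shape), 0.0, 1.0)
```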
Citations: 0
Malware detection using Kernel Constrained Subspace Method
Pub Date : 2023-07-23 DOI: 10.23919/MVA57639.2023.10215631
DJAFER YAHIA M BENCHADI, Messaoud Benchadi, Bojan Batalo, K. Fukui
This paper proposes a novel approach based on subspace representation for malware detection, the important task of distinguishing between safe and malicious file classes. Our solution is to utilize a target software's byte-level visualization (image pattern) and represent the two classes by low-dimensional subspaces in a high-dimensional vector space. We use the kernel constrained subspace method (KCSM) as a classifier, which has shown excellent results in various pattern recognition tasks. However, its computational cost may be high due to the use of the kernel trick, which makes it difficult to achieve real-time detection. To address this issue, we introduce Random Fourier Features (RFF), which we can handle directly like standard vectors, bypassing the kernel trick. This approach reduces execution time by around 99% while retaining a high recognition rate. We conduct extensive experiments on several public malware datasets and demonstrate superior results against several baselines and previous approaches.
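Random Fourier Features replace kernel evaluations with an explicit low-dimensional map whose inner products approximate an RBF kernel, so class subspaces can be built directly from the mapped vectors. A minimal NumPy sketch, with illustrative dimensions and gamma (not the paper's settings):

```python
# Minimal sketch of Random Fourier Features approximating an RBF kernel,
# the trick used to avoid explicit kernel evaluations.
import numpy as np

def rff_map(X, n_features=512, gamma=0.05, seed=0):
    """Map X (n_samples, d) to z(X) so that z(x) @ z(y) ~= exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.rand(100, 64)            # e.g. flattened byte-level malware images of one class
Z = rff_map(X)                          # explicit features: no kernel matrix needed
U, _, _ = np.linalg.svd(Z.T, full_matrices=False)   # left singular vectors span the
basis = U[:, :10]                                    # class subspace in RFF space
```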
Citations: 0
Hardware-Aware Zero-Shot Neural Architecture Search
Pub Date : 2023-07-23 DOI: 10.23919/MVA57639.2023.10216205
Yutaka Yoshihama, Kenichi Yadani, Shota Isobe
Designing a convolutional neural network architecture that achieves low latency and high accuracy on edge devices with constrained computational resources is a difficult challenge. Neural architecture search (NAS) is used to optimize the architecture in a large design space, but at huge computational cost. As a countermeasure, we use the zero-shot NAS method here. A drawback of the previous method was a discrepancy between the evaluation score of a neural architecture and its actual accuracy. To address this problem, we refined the neural architecture search space from previous zero-shot NAS. The neural architecture obtained using the proposed method achieves ImageNet top-1 accuracy of 75.3% under latency conditions equivalent to MobileNetV2 (ImageNet top-1 accuracy of 71.8%) on the Qualcomm SA8155 platform.
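Zero-shot NAS scores candidate architectures without training them. The sketch below uses a generic zero-cost proxy (summed gradient magnitudes at initialization on a single batch) purely to illustrate the idea; it is not the proxy proposed in the paper, and the candidate architectures are hypothetical.

```python
# Illustration of ranking untrained candidate architectures with a generic
# zero-cost proxy; this is not the paper's scoring function.
import torch
import torch.nn as nn

def zero_cost_score(model, batch, targets):
    model.train()
    loss = nn.CrossEntropyLoss()(model(batch), targets)
    loss.backward()
    # Sum of gradient magnitudes at initialization as a training-free score.
    score = sum(p.grad.abs().sum().item() for p in model.parameters()
                if p.grad is not None)
    model.zero_grad()
    return score

# Rank two hypothetical candidates on a single random batch.
candidates = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)),
              nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128),
                            nn.ReLU(), nn.Linear(128, 10))]
x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
best = max(candidates, key=lambda m: zero_cost_score(m, x, y))
```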
Citations: 0
Monocular Blind Spot Estimation with Occupancy Grid Mapping
Pub Date : 2023-07-23 DOI: 10.23919/MVA57639.2023.10215609
Kazuya Odagiri, K. Onoguchi
We present a low-cost method for detecting blind spots in front of the ego vehicle. In low visibility conditions, blind spot estimation is crucial to avoid the risk of pedestrians or vehicles appearing suddenly. However, most blind spot estimation methods require expensive range sensors or neural networks trained with data measured by them. Our method only uses a monocular camera throughout all phases from training to inference, since it is cheaper and more versatile. We assume that a blind spot is a depth discontinuity region. Occupancy probabilities of these regions are integrated using the occupancy grid mapping algorithm. Instead of using range sensors, we leverage the self-supervised monocular depth estimation method for the occupancy grid mapping. 2D blind spot labels are created from occupancy grids and a blind spot estimation network is trained using these labels. Our experiments show quantitative and qualitative performance and demonstrate an ability to learn with arbitrary videos.
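Occupancy grid mapping integrates per-frame evidence with a log-odds update on each cell; here the "occupied" evidence would come from the depth-discontinuity (blind spot) regions estimated from monocular depth. A minimal NumPy sketch with illustrative log-odds constants:

```python
# Minimal sketch of the occupancy-grid log-odds update used to accumulate
# blind-spot evidence over time; constants are illustrative.
import numpy as np

L_OCC, L_FREE, L_MIN, L_MAX = 0.85, -0.4, -4.0, 4.0

def update_grid(log_odds, observed_cells, occupied_mask):
    """log_odds: (H, W) grid; observed_cells: (N, 2) cell indices seen this frame;
    occupied_mask: (N,) True where a depth discontinuity (blind spot) was observed."""
    r, c = observed_cells[:, 0], observed_cells[:, 1]
    log_odds[r, c] += np.where(occupied_mask, L_OCC, L_FREE)
    np.clip(log_odds, L_MIN, L_MAX, out=log_odds)   # keep the update bounded
    return log_odds

def occupancy_prob(log_odds):
    return 1.0 - 1.0 / (1.0 + np.exp(log_odds))     # convert back to probability

grid = np.zeros((200, 200))
cells = np.array([[100, 50], [100, 51]])
grid = update_grid(grid, cells, np.array([True, False]))
```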
Citations: 0
Interpreting Art by Leveraging Pre-Trained Models
Pub Date : 2023-07-23 DOI: 10.23919/MVA57639.2023.10216010
Niklas Penzel, J. Denzler
In many domains, so-called foundation models have recently been proposed. These models are trained on immense amounts of data, resulting in impressive performance on various downstream tasks and benchmarks. Later works focus on leveraging this pre-trained knowledge by combining these models. To reduce data and compute requirements, we utilize and combine foundation models in two ways. First, we use language and vision models to extract and generate a challenging language-vision task in the form of artwork interpretation pairs. Second, we combine and fine-tune CLIP as well as GPT-2 to reduce the compute requirements for training interpretation models. We perform a qualitative and quantitative analysis of our data and conclude that generating artwork leads to improvements in visual-text alignment and, therefore, to more proficient interpretation models. Our approach addresses how to leverage and combine pre-trained models to tackle tasks where existing data is scarce or difficult to obtain.
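One common way to wire a frozen CLIP encoder into GPT-2 is to project the image embedding into a short prefix of GPT-2 input embeddings (ClipCap-style). The sketch below assumes the Hugging Face `transformers` models; the projection layer and prefix length are illustrative assumptions, not the authors' exact design.

```python
# Sketch of feeding a CLIP image embedding to GPT-2 as a learned prefix;
# requires `transformers` and `torch`. The bridge layer is an assumption.
import torch
import torch.nn as nn
from transformers import CLIPModel, GPT2LMHeadModel, GPT2Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

prefix_len, gpt_dim = 4, gpt2.config.n_embd
project = nn.Linear(clip.config.projection_dim, prefix_len * gpt_dim)  # trainable bridge

def interpretation_logits(pixel_values, text):
    img_emb = clip.get_image_features(pixel_values=pixel_values)        # (B, 512)
    prefix = project(img_emb).view(-1, prefix_len, gpt_dim)             # (B, P, 768)
    ids = tok(text, return_tensors="pt").input_ids                      # (1, T)
    tok_emb = gpt2.transformer.wte(ids)                                 # (1, T, 768)
    inputs = torch.cat([prefix, tok_emb], dim=1)
    return gpt2(inputs_embeds=inputs).logits    # next-token logits over prefix + text

logits = interpretation_logits(torch.randn(1, 3, 224, 224), "This painting depicts")
```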
Citations: 0
Bottleneck Transformer model with Channel Self-Attention for skin lesion classification
Pub Date : 2023-07-23 DOI: 10.23919/MVA57639.2023.10215720
Masato Tada, X. Han
Early diagnosis of skin diseases is an important and challenging task for proper treatment; even the deadliest skin cancer, malignant melanoma, can be cured, increasing the survival rate of a disease that otherwise carries a life expectancy of less than five years. Manual diagnosis of skin lesions by specialists is not only time-consuming but also often leads to large variation in diagnosis results. Recently, deep learning networks built mainly on convolution operations have been widely employed for visual recognition, including medical image analysis and classification, and have demonstrated great effectiveness. However, the convolution operation extracts features within a limited receptive field and cannot capture the long-range dependencies needed to model global context. Therefore, the transformer, whose self-attention module offers an alternative for global feature modeling, has become a prevalent network architecture for lifting performance in various vision tasks. This study aims to construct a hybrid skin lesion recognition model that combines convolution operations and self-attention structures. Specifically, we first employ a backbone CNN to extract high-level feature maps and then leverage a transformer block to capture global correlations. Because the high-level features carry diverse contexts in the channel domain but reduced information in the spatial domain, we instead incorporate self-attention that models long-range dependencies along the channel direction, rather than the spatial self-attention of the conventional transformer block, and then model spatial relations with a depth-wise convolution block in the feature feed-forward module. To demonstrate the effectiveness of the proposed method, we conduct experiments on the HAM10000 and ISIC2019 skin lesion datasets and verify its superior performance over the baseline model and state-of-the-art methods.
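The channel self-attention idea, letting channels attend to channels so that the attention matrix is C x C instead of (HW) x (HW), can be sketched in a few lines of PyTorch. The layer sizes and residual wiring below are illustrative, not the paper's exact bottleneck block.

```python
# Minimal sketch of self-attention applied across channels rather than
# spatial positions; illustrative only, not the paper's exact block.
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, HW, C)
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        # Attention matrix is C x C: channels attend to channels.
        attn = torch.softmax(q.transpose(1, 2) @ k / (h * w) ** 0.5, dim=-1)  # (B, C, C)
        out = (v @ attn.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)
        return x + out                           # residual connection

y = ChannelSelfAttention(64)(torch.randn(2, 64, 16, 16))
```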
Citations: 1