Just Dance: detection of human body reenactment fake videos
Pub Date: 2024-08-14 | DOI: 10.1186/s13640-024-00635-2
Omran Alamayreh, Carmelo Fascella, Sara Mandelli, Benedetta Tondi, Paolo Bestagini, Mauro Barni
In the last few years, research on the detection of AI-generated videos has focused almost exclusively on detecting facial manipulations known as deepfakes. Much less attention has been paid to the detection of artificial non-facial fake videos. In this paper, we address a new forensic task, namely, the detection of fake videos of human body reenactment. To this end, we consider videos generated by the “Everybody Dance Now” framework. To accomplish our task, we have constructed and released a novel dataset of fake videos of this kind, referred to as the FakeDance dataset. Additionally, we propose two forgery detectors to study the detectability of FakeDance-style videos. The first exploits spatial–temporal clues of a given video by means of hand-crafted descriptors, whereas the second is an end-to-end detector based on Convolutional Neural Networks (CNNs) trained specifically for this task. Both detectors have their own peculiarities and strengths, working well in different operative scenarios. We believe that our proposed dataset, together with the two detectors, will contribute to research on the detection of non-facial fake videos generated by means of AI.
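To make the first detector's idea concrete, below is a minimal Python sketch of a hand-crafted spatial-temporal descriptor feeding a binary real/fake classifier; the frame statistics, the SVM, and the dummy data are illustrative assumptions, not the descriptors actually used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

def spatiotemporal_descriptor(frames):
    """Toy hand-crafted descriptor: statistics of inter-frame residuals.

    frames: (T, H, W) grayscale video as a float array.
    """
    residuals = np.abs(np.diff(frames, axis=0))   # temporal differences between frames
    return np.array([
        residuals.mean(),                         # average temporal activity
        residuals.std(),                          # variability of the motion energy
        np.percentile(residuals, 95),             # strength of the largest changes
        frames.std(),                             # overall spatial contrast
    ])

# Dummy training data: random arrays standing in for real and fake clips.
rng = np.random.default_rng(0)
videos = [rng.random((30, 64, 64)) for _ in range(40)]
labels = [0] * 20 + [1] * 20                      # 0 = real, 1 = fake (placeholder labels)

X = np.stack([spatiotemporal_descriptor(v) for v in videos])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:5]))
```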
{"title":"Just Dance: detection of human body reenactment fake videos","authors":"Omran Alamayreh, Carmelo Fascella, Sara Mandelli, Benedetta Tondi, Paolo Bestagini, Mauro Barni","doi":"10.1186/s13640-024-00635-2","DOIUrl":"https://doi.org/10.1186/s13640-024-00635-2","url":null,"abstract":"<p>In the last few years, research on the detection of AI-generated videos has focused exclusively on detecting facial manipulations known as deepfakes. Much less attention has been paid to the detection of artificial non-facial fake videos. In this paper, we address a new forensic task, namely, the detection of fake videos of human body reenactment. To this purpose, we consider videos generated by the “Everybody Dance Now” framework. To accomplish our task, we have constructed and released a novel dataset of fake videos of this kind, referred to as FakeDance dataset. Additionally, we propose two forgery detectors to study the detectability of FakeDance kind of videos. The first one exploits spatial–temporal clues of a given video by means of hand-crafted descriptors, whereas the second detector is an end-to-end detector based on Convolutional Neural Networks (CNNs) trained on purpose. Both detectors have their peculiarities and strengths, working well in different operative scenarios. We believe that our proposed dataset together with the two detectors will contribute to the research on the detection of non-facial fake videos generated by means of AI.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"27 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PointPCA: point cloud objective quality assessment using PCA-based descriptors
Pub Date: 2024-08-09 | DOI: 10.1186/s13640-024-00626-3
Evangelos Alexiou, Xuemei Zhou, Irene Viola, Pablo Cesar
Point clouds are a prominent solution for representing 3D photo-realistic content in immersive applications. As with other imaging modalities, quality predictions for point cloud contents are vital for a wide range of applications, enabling trade-off optimizations between data quality and data size in every processing step from acquisition to rendering. In this work, we focus on use cases that consider human end-users consuming point cloud contents and, hence, we concentrate on visual quality metrics. In particular, we propose a set of perceptually relevant descriptors based on principal component analysis (PCA) decomposition, which is applied to both geometry and texture data for full-reference point cloud quality assessment. Statistical features are derived from these descriptors to characterize local shape and appearance properties for both a reference and a distorted point cloud. The extracted statistical features are subsequently compared to provide corresponding predictions of visual quality for the distorted point cloud. As part of our method, a learning-based approach is proposed to fuse these individual predictors into a unified perceptual score. We validate the accuracy of the individual predictors, as well as the unified quality scores obtained after regression against subjectively annotated datasets, showing that our metric outperforms state-of-the-art solutions. Insights regarding design decisions are provided through exploratory studies, evaluating the performance of our metric under different parameter configurations, attribute domains, color spaces, and regression models. A software implementation of the proposed metric is made available at the following link: https://github.com/cwi-dis/pointpca.
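As a rough illustration of PCA-based local descriptors for full-reference quality assessment, the sketch below computes eigenvalue features of k-nearest-neighbour covariances and compares their pooled statistics between a reference and a distorted cloud; the feature set, pooling, and learned fusion used by PointPCA are not reproduced here.

```python
import numpy as np
from scipy.spatial import cKDTree

def pca_features(points, k=16):
    """Per-point eigenvalue features of the k-nearest-neighbour covariance matrices."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    feats = []
    for nbrs in idx:
        w = np.sort(np.linalg.eigvalsh(np.cov(points[nbrs].T)))[::-1]  # l1 >= l2 >= l3
        w = w / (w.sum() + 1e-12)
        feats.append([w[0], w[1], w[2], w[0] - w[1], w[1] - w[2]])     # crude shape cues
    return np.asarray(feats)

rng = np.random.default_rng(1)
reference = rng.random((500, 3))
distorted = reference + rng.normal(scale=0.01, size=reference.shape)   # toy distortion

# Compare pooled feature statistics as a crude full-reference quality predictor
# (smaller difference = geometrically more similar).
diff = np.abs(pca_features(reference).mean(axis=0) - pca_features(distorted).mean(axis=0))
print(diff.sum())
```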
{"title":"PointPCA: point cloud objective quality assessment using PCA-based descriptors","authors":"Evangelos Alexiou, Xuemei Zhou, Irene Viola, Pablo Cesar","doi":"10.1186/s13640-024-00626-3","DOIUrl":"https://doi.org/10.1186/s13640-024-00626-3","url":null,"abstract":"<p>Point clouds denote a prominent solution for the representation of 3D photo-realistic content in immersive applications. Similarly to other imaging modalities, quality predictions for point cloud contents are vital for a wide range of applications, enabling trade-off optimizations between data quality and data size in every processing step from acquisition to rendering. In this work, we focus on use cases that consider human end-users consuming point cloud contents and, hence, we concentrate on visual quality metrics. In particular, we propose a set of perceptually relevant descriptors based on principal component analysis (PCA) decomposition, which is applied to both geometry and texture data for full-reference point cloud quality assessment. Statistical features are derived from these descriptors to characterize local shape and appearance properties for both a reference and a distorted point cloud. The extracted statistical features are subsequently compared to provide corresponding predictions of visual quality for the distorted point cloud. As part of our method, a learning-based approach is proposed to fuse these individual predictors to a unified perceptual score. We validate the accuracy of the individual predictors, as well as the unified quality scores obtained after regression against subjectively annotated datasets, showing that our metric outperforms state-of-the-art solutions. Insights regarding design decisions are provided through exploratory studies, evaluating the performance of our metric under different parameter configurations, attribute domains, color spaces, and regression models. A software implementation of the proposed metric is made available at the following link: https://github.com/cwi-dis/pointpca.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"79 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compressed point cloud classification with point-based edge sampling
Pub Date: 2024-08-07 | DOI: 10.1186/s13640-024-00637-0
Zhe Luo, Wenjing Jia, Stuart Perry
3D point cloud data, as an immersive, detailed data source, has been increasingly used in numerous applications. To deal with the computational and storage challenges of this data, it needs to be compressed before transmission, storage, and processing, especially in real-time systems. Instead of decoding the compressed data stream and subsequently conducting downstream tasks on the decompressed data, analyzing point clouds directly in their compressed domain has attracted great interest. In this paper, we dive into the realm of compressed point cloud classification (CPCC), aiming to achieve high point cloud classification accuracy in a bitrate-saving way by ensuring the bit stream contains a high degree of representative information of the point cloud. Edge information is one of the most important and representative attributes of the point cloud because it can display the outlines or main shapes. However, extracting edge points or information from point cloud models is challenging due to their irregularity and sparsity. To address this challenge, we adopt an advanced edge-sampling method that enhances existing state-of-the-art (SOTA) point cloud edge-sampling techniques based on attention mechanisms, and consequently develop a novel CPCC method, “CPCC-PES”, that focuses on the point cloud’s edge information. Results obtained on the benchmark ModelNet40 dataset show that our model achieves a better rate-accuracy trade-off than SOTA works. Specifically, our method achieves over 90% Top-1 Accuracy with a mere 0.08 bits-per-point (bpp), marking a remarkable reduction of over 96% in BD-bitrate compared with specialized codecs. This means that our method only consumes 20% of the bitrate of other SOTA works while maintaining comparable accuracy. Furthermore, we propose a new evaluation metric named BD-Top-1 Accuracy to evaluate the trade-off between bitrate and Top-1 Accuracy for future CPCC research.
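The snippet below is a simplified stand-in for edge sampling: points are scored by local surface variation and the highest-scoring fraction is kept; the attention-based edge sampling of CPCC-PES is considerably more elaborate, so treat this only as a sketch of the general idea.

```python
import numpy as np
from scipy.spatial import cKDTree

def edge_sample(points, keep_ratio=0.2, k=16):
    """Keep the points with the highest local surface variation (a crude edge proxy)."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    variation = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        eigvals = np.linalg.eigvalsh(np.cov(points[nbrs].T))   # ascending eigenvalues
        variation[i] = eigvals[0] / (eigvals.sum() + 1e-12)    # large near creases/corners
    n_keep = max(1, int(keep_ratio * len(points)))
    return points[np.argsort(variation)[-n_keep:]]

rng = np.random.default_rng(2)
points = rng.random((2000, 3))
points[:, 2] = np.round(points[:, 2])           # squash onto two parallel planes (toy shape)
print(edge_sample(points).shape)                # the retained "edge" subset
```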
{"title":"Compressed point cloud classification with point-based edge sampling","authors":"Zhe Luo, Wenjing Jia, Stuart Perry","doi":"10.1186/s13640-024-00637-0","DOIUrl":"https://doi.org/10.1186/s13640-024-00637-0","url":null,"abstract":"<p>3D point cloud data, as an immersive detailed data source, has been increasingly used in numerous applications. To deal with the computational and storage challenges of this data, it needs to be compressed before transmission, storage, and processing, especially in real-time systems. Instead of decoding the compressed data stream and subsequently conducting downstream tasks on the decompressed data, analyzing point clouds directly in their compressed domain has attracted great interest. In this paper, we dive into the realm of compressed point cloud classification (CPCC), aiming to achieve high point cloud classification accuracy in a bitrate-saving way by ensuring the bit stream contains a high degree of representative information of the point cloud. Edge information is one of the most important and representative attributes of the point cloud because it can display the outlines or main shapes. However, extracting edge points or information from point cloud models is challenging due to their irregularity and sparsity. To address this challenge, we adopt an advanced edge-sampling method that enhances existing state-of-the-art (SOTA) point cloud edge-sampling techniques based on attention mechanisms and consequently develop a novel CPCC method “CPCC-PES” that focuses on point cloud’s edge information. The result obtained on the benchmark ModelNet40 dataset shows that our model has superior rate-accuracy trade-off performance than SOTA works. Specifically, our method achieves over 90% Top-1 Accuracy with a mere 0.08 bits-per-point (bpp), marking a remarkable over 96% reduction in BD-bitrate compared with specialized codecs. This means that our method only consumes 20% of the bitrate of other SOTA works while maintaining comparable accuracy. Furthermore, we propose a new evaluation metric named BD-Top-1 Accuracy to evaluate the trade-off performance between bitrate and Top-1 Accuracy for future CPCC research.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"28 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of the use of box size priors for 6D plane segment tracking from point clouds with applications in cargo packing
Pub Date: 2024-08-06 | DOI: 10.1186/s13640-024-00636-1
Guillermo A. Camacho-Muñoz, Sandra Esperanza Nope Rodríguez, Humberto Loaiza-Correa, João Paulo Silva do Monte Lima, Rafael Alves Roberto
This paper addresses the problem of 6D pose tracking of plane segments from point clouds acquired from a mobile camera. This is motivated by manual packing operations, where an opportunity exists to enhance performance by aiding operators with instructions based on augmented reality. The approach uses point clouds as input, owing to their advantages for extracting geometric information relevant to estimating the 6D pose of rigid objects. The proposed algorithm begins with a RANSAC fitting stage on the raw point cloud. It then implements strategies to compute the 2D size and 6D pose of plane segments from geometric analysis of the fitted point cloud. Redundant detections are combined using a new quality factor that predicts point cloud mapping density and allows the selection of the most accurate detection. The algorithm is designed for dynamic scenes, employing a novel particle concept in the point cloud space to track detections’ validity over time. A variant of the algorithm uses box size priors (available in most packing operations) to filter out irrelevant detections. The impact of this prior knowledge is evaluated through an experimental design that compares the performance of a plane segment tracking system, considering variations in the tracking algorithm and camera speed (onboard the packing operator). The tracking algorithm varies at two levels: algorithm \(A_{wpk}\), which integrates prior knowledge of box sizes, and algorithm \(A_{woutpk}\), which assumes ignorance of box properties. Camera speed is evaluated at low and high speeds. Results indicate improvements in precision and F1-score associated with using the \(A_{wpk}\) algorithm, and consistent performance across both velocities. These results confirm that the performance of a tracking system in a real-life, complex scenario is enhanced by including prior knowledge of the elements in the scene. The proposed algorithm is limited to tracking plane segments of boxes fully supported on surfaces parallel to the ground plane and not stacked. Future works are proposed to include strategies to resolve this limitation.
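A minimal NumPy version of the kind of RANSAC plane-fitting stage the algorithm starts from is sketched below; the threshold, iteration count, and toy data are illustrative assumptions, and the subsequent 2D-size/6D-pose analysis is not shown.

```python
import numpy as np

def ransac_plane(points, n_iter=200, threshold=0.01, seed=0):
    """Fit a plane n.x + d = 0 (unit normal n) to 3D points via RANSAC."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.array([], dtype=int)
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                               # degenerate (collinear) sample
            continue
        normal = normal / norm
        d = -normal @ sample[0]
        inliers = np.flatnonzero(np.abs(points @ normal + d) < threshold)
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (normal, d), inliers
    return best_model, best_inliers

# Toy box face: a noisy plane at z = 0.5 plus scattered outliers.
rng = np.random.default_rng(3)
face = np.column_stack([rng.random((500, 2)), 0.5 + rng.normal(scale=0.002, size=500)])
clutter = rng.random((100, 3))
(normal, d), inliers = ransac_plane(np.vstack([face, clutter]))
print(normal, d, len(inliers))
```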
{"title":"Evaluation of the use of box size priors for 6D plane segment tracking from point clouds with applications in cargo packing","authors":"Guillermo A. Camacho-Muñoz, Sandra Esperanza Nope Rodríguez, Humberto Loaiza-Correa, João Paulo Silva do Monte Lima, Rafael Alves Roberto","doi":"10.1186/s13640-024-00636-1","DOIUrl":"https://doi.org/10.1186/s13640-024-00636-1","url":null,"abstract":"<p>This paper addresses the problem of 6D pose tracking of plane segments from point clouds acquired from a mobile camera. This is motivated by manual packing operations, where an opportunity exists to enhance performance, aiding operators with instructions based on augmented reality. The approach uses as input point clouds, by its advantages for extracting geometric information relevant to estimating the 6D pose of rigid objects. The proposed algorithm begins with a RANSAC fitting stage on the raw point cloud. It then implements strategies to compute the 2D size and 6D pose of plane segments from geometric analysis of the fitted point cloud. Redundant detections are combined using a new quality factor that predicts point cloud mapping density and allows the selection of the most accurate detection. The algorithm is designed for dynamic scenes, employing a novel particle concept in the point cloud space to track detections’ validity over time. A variant of the algorithm uses box size priors (available in most packing operations) to filter out irrelevant detections. The impact of this prior knowledge is evaluated through an experimental design that compares the performance of a plane segment tracking system, considering variations in the tracking algorithm and camera speed (onboard the packing operator). The tracking algorithm varies at two levels: algorithm (<span>(A_{wpk})</span>), which integrates prior knowledge of box sizes, and algorithm (<span>(A_{woutpk})</span>), which assumes ignorance of box properties. Camera speed is evaluated at low and high speeds. Results indicate increments in the precision and F1-score associated with using the <span>(A_{wpk})</span> algorithm and consistent performance across both velocities. These results confirm the enhancement of the performance of a tracking system in a real-life and complex scenario by including previous knowledge of the elements in the scene. The proposed algorithm is limited to tracking plane segments of boxes fully supported on surfaces parallel to the ground plane and not stacked. Future works are proposed to include strategies to resolve this limitation.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"19 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Remote expert viewing, laboratory tests or objective metrics: which one(s) to trust?
Pub Date: 2024-06-17 | DOI: 10.1186/s13640-024-00630-7
Mathias Wien, Joel Jung
We present a study on the validity of quality assessment in the context of the development of visual media coding schemes. The work is motivated by the need for reliable means for decision-taking in standardization efforts of MPEG and JVET, i.e., the adoption or rejection of coding tools during the development process of the coding standard. The study includes results considering three means: objective quality metrics, remote expert viewing, which is a method designed in the context of MPEG standardization, and formal laboratory visual evaluation. The focus of this work is on the comparison of pairs of coded video sequences, e.g., a proposed change and an anchor scheme at a given rate point. An aggregation of performance measurements across multiple rate points, such as the Bjøntegaard Delta rate, is out of the scope of this paper. The paper details the test setup for the subjective assessment methods and the objective quality metrics under consideration. The results of the three approaches are reviewed, analyzed, and compared with respect to their suitability for the decision-taking task. The study indicates that, subject to the chosen test content and test protocols, the results of remote expert viewing using a forced-choice scale can be considered more discriminatory than the results of naïve viewers in the laboratory tests. The results further show that, in general, the well-established quality metrics, such as PSNR, SSIM, or MS-SSIM, exhibit a high rate of correct decision-making when their results are compared with both types of viewing tests. Among the learning-based metrics, VMAF and AVQT appear to be most robust. For the development process of a coding standard, the selection of the most suitable means must be guided by the context, where a small number of carefully selected objective metrics, in combination with viewing tests for unclear cases, appears recommendable.
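The toy snippet below illustrates the pairwise decision-taking setting: for each (anchor, proposal) pair, an objective metric's preference is compared with a viewing-test preference and the rate of agreement is counted; PSNR, the synthetic images, and the hard-coded subjective outcome are placeholders, not the paper's test material or protocol.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(4)
correct, n_pairs = 0, 20
for _ in range(n_pairs):
    ref = rng.integers(0, 256, (64, 64)).astype(float)
    anchor = np.clip(ref + rng.normal(scale=6, size=ref.shape), 0, 255)    # anchor coding
    proposal = np.clip(ref + rng.normal(scale=4, size=ref.shape), 0, 255)  # proposed tool
    metric_prefers_proposal = psnr(ref, proposal) > psnr(ref, anchor)
    viewing_test_prefers_proposal = True   # stand-in for a forced-choice viewing-test outcome
    correct += int(metric_prefers_proposal == viewing_test_prefers_proposal)
print(f"correct decisions: {correct}/{n_pairs}")
```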
{"title":"Remote expert viewing, laboratory tests or objective metrics: which one(s) to trust?","authors":"Mathias Wien, Joel Jung","doi":"10.1186/s13640-024-00630-7","DOIUrl":"https://doi.org/10.1186/s13640-024-00630-7","url":null,"abstract":"<p>We present a study on the validity of quality assessment in the context of the development of visual media coding schemes. The work is motivated by the need for reliable means for decision-taking in standardization efforts of MPEG and JVET, i.e., the adoption or rejection of coding tools during the development process of the coding standard. The study includes results considering three means: objective quality metrics, remote expert viewing, which is a method designed in the context of MPEG standardization, and formal laboratory visual evaluation. The focus of this work is on the comparison of pairs of coded video sequences, e.g., a proposed change and an anchor scheme at a given rate point. An aggregation of performance measurements across multiple rate points, such as the Bjøntegaard Delta rate, is out of the scope of this paper. The paper details the test setup for the subjective assessment methods and the objective quality metrics under consideration. The results of the three approaches are reviewed, analyzed, and compared with respect to their suitability for the decision-taking task. The study indicates that, subject to the chosen test content and test protocols, the results of remote expert viewing using a forced-choice scale can be considered more discriminatory than the results of naïve viewers in the laboratory tests. The results further that, in general, the well-established quality metrics, such as PSNR, SSIM, or MS-SSIM, exhibit a high rate of correct decision-making when their results are compared with both types of viewing tests. Among the learning-based metrics, VMAF and AVQT appear to be most robust. For the development process of a coding standard, the selection of the most suitable means must be guided by the context, where a small number of carefully selected objective metrics, in combination with viewing tests for unclear cases, appears recommendable.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"135 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141550115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Impact of LiDAR point cloud compression on 3D object detection evaluated on the KITTI dataset
Pub Date: 2024-06-17 | DOI: 10.1186/s13640-024-00633-4
Nuno A. B. Martins, Luís A. da Silva Cruz, Fernando Lopes
The rapid growth in the amount of generated 3D data, particularly in the form of Light Detection And Ranging (LiDAR) point clouds (PCs), poses very significant challenges in terms of data storage, transmission, and processing. Point cloud (PC) representation of 3D visual information has proven to be a very flexible format, with applications ranging from multimedia immersive communication to machine vision tasks in the robotics and autonomous driving domains. In this paper, we investigate the performance of four reference 3D object detection techniques when the input PCs are compressed with varying levels of degradation. Compression is performed using two MPEG standard coders based on 2D projections and octree decomposition, as well as two coding methods based on Deep Learning (DL). For the DL coding methods, we used a Joint Photographic Experts Group (JPEG) reference PC coder, which we adapted to accept LiDAR PCs in both Cartesian and cylindrical coordinate systems. The detection performance of the four reference 3D object detection methods was evaluated using both pre-trained models and models specifically trained on degraded PCs reconstructed from compressed representations. It is shown that LiDAR PCs can be compressed down to 6 bits per point with no significant degradation in object detection precision. Furthermore, employing specifically trained detection models improves the detection capabilities even at compression rates as low as 2 bits per point. These results show that LiDAR PCs can be coded to enable efficient storage and transmission without significant object detection performance loss.
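As a rough picture of how coarser rate points degrade a LiDAR cloud, the sketch below uniformly quantizes coordinates at decreasing bit depths; the actual MPEG and learning-based codecs evaluated in the paper are far more sophisticated, and the bit depths here are per axis, purely for illustration.

```python
import numpy as np

def quantize_point_cloud(points, bits):
    """Uniformly quantize coordinates to `bits` per axis within the cloud's bounding box."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    levels = 2 ** bits - 1
    grid = np.round((points - lo) / (hi - lo + 1e-12) * levels)   # integer grid coordinates
    return np.unique(grid, axis=0) / levels * (hi - lo) + lo      # merge duplicates, dequantize

rng = np.random.default_rng(5)
sweep = rng.normal(scale=20.0, size=(10000, 3))                   # stand-in for a LiDAR sweep
for bits in (10, 8, 6, 4):
    rec = quantize_point_cloud(sweep, bits)
    print(bits, "bits per axis ->", len(rec), "points kept")
```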
{"title":"Impact of LiDAR point cloud compression on 3D object detection evaluated on the KITTI dataset","authors":"Nuno A. B. Martins, Luís A. da Silva Cruz, Fernando Lopes","doi":"10.1186/s13640-024-00633-4","DOIUrl":"https://doi.org/10.1186/s13640-024-00633-4","url":null,"abstract":"<p>The rapid growth on the amount of generated 3D data, particularly in the form of Light Detection And Ranging (LiDAR) point clouds (PCs), poses very significant challenges in terms of data storage, transmission, and processing. Point cloud (PC) representation of 3D visual information has shown to be a very flexible format with many applications ranging from multimedia immersive communication to machine vision tasks in the robotics and autonomous driving domains. In this paper, we investigate the performance of four reference 3D object detection techniques, when the input PCs are compressed with varying levels of degradation. Compression is performed using two MPEG standard coders based on 2D projections and octree decomposition, as well as two coding methods based on Deep Learning (DL). For the DL coding methods, we used a Joint Photographic Experts Group (JPEG) reference PC coder, that we adapted to accept LiDAR PCs in both Cartesian and cylindrical coordinate systems. The detection performance of the four reference 3D object detection methods was evaluated using both pre-trained models and models specifically trained using degraded PCs reconstructed from compressed representations. It is shown that LiDAR PCs can be compressed down to 6 bits per point with no significant degradation on the object detection precision. Furthermore, employing specifically trained detection models improves the detection capabilities even at compression rates as low as 2 bits per point. These results show that LiDAR PCs can be coded to enable efficient storage and transmission, without significant object detection performance loss.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"51 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141550114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive bridge model for compressed domain point cloud classification
Pub Date: 2024-06-08 | DOI: 10.1186/s13640-024-00631-6
Abdelrahman Seleem, André F. R. Guarda, Nuno M. M. Rodrigues, Fernando Pereira
The recent adoption of deep learning-based models for the processing and coding of multimedia signals has brought noticeable gains in performance, which have established deep learning-based solutions as the uncontested state-of-the-art both for computer vision tasks targeting machine consumption and, more recently, for coding applications targeting human visualization. Traditionally, applications requiring both coding and computer vision processing require first decoding the bitstream and then applying the computer vision methods to the decompressed multimedia signals. However, the adoption of deep learning-based solutions enables the use of compressed domain computer vision processing, with gains in performance and computational complexity over the decompressed domain approach. For point clouds (PCs), these gains have been demonstrated in the single available compressed domain computer vision processing solution, named Compressed Domain PC Classifier, which processes JPEG Pleno PC coding (PCC) compressed streams using a PC classifier largely compatible with the state-of-the-art spatial domain PointGrid classifier. However, the available Compressed Domain PC Classifier presents strong limitations by imposing a single, specific input size associated with specific JPEG Pleno PCC configurations; this limits the compression performance, as these configurations are not ideal for all PCs due to their different characteristics, notably density. To overcome these limitations, this paper proposes the first Adaptive Compressed Domain PC Classifier solution, which includes a novel adaptive bridge model that allows processing the JPEG Pleno PCC encoded bit streams using different coding configurations, now maximizing the compression efficiency. Experimental results show that the novel Adaptive Compressed Domain PC Classifier allows JPEG PCC to achieve better compression performance by not imposing a single, specific coding configuration for all PCs, regardless of their different characteristics. Moreover, the added adaptability achieves slightly better PC classification performance than the previous Compressed Domain PC Classifier, and substantially better PC classification performance (with a lower number of weights) than the PointGrid PC classifier working in the decompressed domain.
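One way to picture an adaptive bridge is as a layer that maps latent tensors of varying size onto the fixed input expected by a classifier; the PyTorch sketch below uses adaptive average pooling for this and is only an assumption about the general idea, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class AdaptiveBridge(nn.Module):
    """Maps variable-size 3D latent blocks to a fixed-size input for a downstream classifier."""
    def __init__(self, channels=32, target=8, n_classes=40):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(target)          # any DxHxW -> target^3
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * target ** 3, n_classes),
        )

    def forward(self, latent):
        return self.classifier(self.pool(latent))

bridge = AdaptiveBridge()
for size in (6, 10, 14):                          # latents from different coding configurations
    latent = torch.randn(1, 32, size, size, size)
    print(size, bridge(latent).shape)             # always (1, 40)
```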
{"title":"Adaptive bridge model for compressed domain point cloud classification","authors":"Abdelrahman Seleem, André F. R. Guarda, Nuno M. M. Rodrigues, Fernando Pereira","doi":"10.1186/s13640-024-00631-6","DOIUrl":"https://doi.org/10.1186/s13640-024-00631-6","url":null,"abstract":"<p>The recent adoption of deep learning-based models for the processing and coding of multimedia signals has brought noticeable gains in performance, which have established deep learning-based solutions as the uncontested state-of-the-art both for computer vision tasks, targeting machine consumption, as well as, more recently, coding applications, targeting human visualization. Traditionally, applications requiring both coding and computer vision processing require first decoding the bitstream and then applying the computer vision methods to the decompressed multimedia signals. However, the adoption of deep learning-based solutions enables the use of compressed domain computer vision processing, with gains in performance and computational complexity over the decompressed domain approach. For point clouds (PCs), these gains have been demonstrated in the single available compressed domain computer vision processing solution, named Compressed Domain PC Classifier, which processes JPEG Pleno PC coding (PCC) compressed streams using a PC classifier largely compatible with the state-of-the-art spatial domain PointGrid classifier. However, the available Compressed Domain PC Classifier presents strong limitations by imposing a single, specific input size which is associated to specific JPEG Pleno PCC configurations; this limits the compression performance as these configurations are not ideal for all PCs due to their different characteristics, notably density. To overcome these limitations, this paper proposes the first Adaptive Compressed Domain PC Classifier solution which includes a novel adaptive bridge model that allows to process the JPEG Pleno PCC encoded bit streams using different coding configurations, now maximizing the compression efficiency. Experimental results show that the novel Adaptive Compressed Domain PC Classifier allows JPEG PCC to achieve better compression performance by not imposing a single, specific coding configuration for all PCs, regardless of its different characteristics. Moreover, the added adaptability power can achieve slightly better PC classification performance than the previous Compressed Domain PC Classifier and largely better PC classification performance (and lower number of weights) than the PointGrid PC classifier working in the decompressed domain.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"15 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning-based light field imaging: an overview
Pub Date: 2024-05-30 | DOI: 10.1186/s13640-024-00628-1
Saeed Mahmoudpour, Carla Pagliari, Peter Schelkens
Conventional photography can only provide a two-dimensional image of the scene, whereas emerging imaging modalities such as light fields enable the representation of higher-dimensional visual information by capturing light rays from different directions. Light fields provide immersive experiences and a sense of presence in the scene, and can enhance different vision tasks. Hence, research into light field processing methods has become increasingly popular. It does, however, come at the cost of higher data volume and computational complexity. With the growing deployment of machine learning and deep architectures in image processing applications, a paradigm shift toward learning-based approaches has also been observed in the design of light field processing methods. Various learning-based approaches have been developed to process the high volume of light field data efficiently for different vision tasks while improving performance. Taking into account the diversity of light field vision tasks and the deployed learning-based frameworks, it is necessary to survey the scattered learning-based works in the domain to gain insight into the current trends and challenges. This paper aims to review the existing learning-based solutions for light field imaging and to summarize the most promising frameworks. Moreover, evaluation methods and available light field datasets are highlighted. Lastly, the review concludes with a brief outlook on future research directions.
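As a small illustration of the higher-dimensional representation mentioned above, a light field can be stored as a 4D (plus colour) array from which sub-aperture views and epipolar-plane images are simple slices; the array shape below is an arbitrary example, not tied to any particular dataset.

```python
import numpy as np

# Toy 4D light field: 9x9 angular positions, each a 32x32 RGB sub-aperture image.
rng = np.random.default_rng(6)
light_field = rng.random((9, 9, 32, 32, 3))     # (u, v, s, t, color)

center_view = light_field[4, 4]                 # one sub-aperture image (fixed angular position)
epi_slice = light_field[4, :, 16, :, :]         # epipolar-plane image: one angular, one spatial axis
print(center_view.shape, epi_slice.shape)
```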
{"title":"Learning-based light field imaging: an overview","authors":"Saeed Mahmoudpour, Carla Pagliari, Peter Schelkens","doi":"10.1186/s13640-024-00628-1","DOIUrl":"https://doi.org/10.1186/s13640-024-00628-1","url":null,"abstract":"<p>Conventional photography can only provide a two-dimensional image of the scene, whereas emerging imaging modalities such as light field enable the representation of higher dimensional visual information by capturing light rays from different directions. Light fields provide immersive experiences, a sense of presence in the scene, and can enhance different vision tasks. Hence, research into light field processing methods has become increasingly popular. It does, however, come at the cost of higher data volume and computational complexity. With the growing deployment of machine-learning and deep architectures in image processing applications, a paradigm shift toward learning-based approaches has also been observed in the design of light field processing methods. Various learning-based approaches are developed to process the high volume of light field data efficiently for different vision tasks while improving performance. Taking into account the diversity of light field vision tasks and the deployed learning-based frameworks, it is necessary to survey the scattered learning-based works in the domain to gain insight into the current trends and challenges. This paper aims to review the existing learning-based solutions for light field imaging and to summarize the most promising frameworks. Moreover, evaluation methods and available light field datasets are highlighted. Lastly, the review concludes with a brief outlook for future research directions.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"41 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141189137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-automated computer vision-based tracking of multiple industrial entities: a framework and dataset creation approach
Pub Date: 2024-03-22 | DOI: 10.1186/s13640-024-00623-6
This contribution presents the TOMIE framework (Tracking Of Multiple Industrial Entities), a framework for the continuous tracking of industrial entities (e.g., pallets, crates, barrels) over a camera network (six RGB cameras in this example). The framework makes use of multiple sensors, data pipelines, and data annotation procedures, and is described in detail in this contribution. With the vision of a fully automated tracking system for industrial entities in mind, it enables researchers to efficiently capture high-quality data in an industrial setting. Using this framework, an image dataset, the TOMIE dataset, is created, which at the same time is used to gauge the framework’s validity. This dataset contains annotation files for 112,860 frames and 640,936 entity instances captured by six cameras observing a large indoor space. It out-scales comparable datasets by a factor of four and is made up of scenarios drawn from industrial applications in the warehousing sector. Three tracking algorithms, namely ByteTrack, Bot-Sort, and SiamMOT, are applied to this dataset, serving as a proof-of-concept and providing tracking results that are comparable to the state of the art.
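The sketch below shows what a per-frame, per-camera annotation record for such a multi-camera tracking dataset could look like and how it might be aggregated; the field names and file layout are assumptions for illustration, not the published TOMIE format.

```python
import json
from collections import Counter

# Hypothetical records: one entry per entity instance visible in a given camera frame.
annotations = [
    {"camera": "cam_0", "frame": 17, "track_id": 3, "class": "pallet", "bbox": [120, 80, 220, 190]},
    {"camera": "cam_0", "frame": 18, "track_id": 3, "class": "pallet", "bbox": [124, 81, 224, 191]},
    {"camera": "cam_5", "frame": 17, "track_id": 9, "class": "barrel", "bbox": [400, 300, 450, 380]},
]

with open("annotations_sample.json", "w") as f:
    json.dump(annotations, f, indent=2)

with open("annotations_sample.json") as f:
    loaded = json.load(f)
print(Counter(a["class"] for a in loaded))      # instance counts per entity class
```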
{"title":"Semi-automated computer vision-based tracking of multiple industrial entities: a framework and dataset creation approach","authors":"","doi":"10.1186/s13640-024-00623-6","DOIUrl":"https://doi.org/10.1186/s13640-024-00623-6","url":null,"abstract":"<h3>Abstract</h3> <p>This contribution presents the TOMIE framework (Tracking Of Multiple Industrial Entities), a framework for the continuous tracking of industrial entities (e.g., pallets, crates, barrels) over a network of, in this example, six RGB cameras. This framework makes use of multiple sensors, data pipelines, and data annotation procedures, and is described in detail in this contribution. With the vision of a fully automated tracking system for industrial entities in mind, it enables researchers to efficiently capture high-quality data in an industrial setting. Using this framework, an image dataset, the TOMIE dataset, is created, which at the same time is used to gauge the framework’s validity. This dataset contains annotation files for 112,860 frames and 640,936 entity instances that are captured from a set of six cameras that perceive a large indoor space. This dataset out-scales comparable datasets by a factor of four and is made up of scenarios, drawn from industrial applications from the sector of warehousing. Three tracking algorithms, namely ByteTrack, Bot-Sort, and SiamMOT, are applied to this dataset, serving as a proof-of-concept and providing tracking results that are comparable to the state of the art.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"3 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140198850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast CU size decision and intra-prediction mode decision method for H.266/VVC
Pub Date: 2024-03-18 | DOI: 10.1186/s13640-024-00622-7
H.266/Versatile Video Coding (VVC) is the most recent video coding standard developed by the Joint Video Experts Team (JVET). It introduces the quad-tree with nested multi-type tree (QTMT) partitioning architecture, which improves the compression performance of H.266/VVC. Moreover, H.266/VVC contains a greater number of intra-prediction modes than H.265/High Efficiency Video Coding (HEVC), totalling 67. However, these features greatly increase the coding computational complexity. To cope with these issues, a fast intra-coding unit (CU) size decision method and a fast intra-prediction mode decision method are proposed in this paper. Specifically, trained Support Vector Machine (SVM) classifier models are utilized to determine the CU partition mode in the fast CU size decision scheme. Furthermore, the number of intra-prediction modes added to the rate-distortion optimization (RDO) mode set is reduced in a fast intra-prediction mode decision scheme based on an improved search step. Simulation results illustrate that the proposed overall algorithm can reduce encoding runtime by 55.24% with a negligible BDBR increase.
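A toy version of the SVM-based CU split decision is sketched below: simple block statistics serve as features for a binary split/no-split classifier; the actual features, CU sizes, labels, and training protocol of the proposed method are not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC

def block_features(block):
    """Cheap texture cues for a luma block: variance and mean absolute gradients."""
    gy, gx = np.gradient(block.astype(float))
    return [block.var(), np.abs(gx).mean(), np.abs(gy).mean()]

rng = np.random.default_rng(7)
features, labels = [], []
for _ in range(200):
    smooth = rng.random() < 0.5
    block = (np.full((32, 32), 128.0) + rng.normal(scale=2, size=(32, 32)) if smooth
             else rng.integers(0, 256, (32, 32)).astype(float))
    features.append(block_features(block))
    labels.append(0 if smooth else 1)            # 0 = do not split, 1 = split (toy ground truth)

clf = SVC(kernel="rbf").fit(features, labels)
print(clf.predict(features[:5]))
```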
{"title":"Fast CU size decision and intra-prediction mode decision method for H.266/VVC","authors":"","doi":"10.1186/s13640-024-00622-7","DOIUrl":"https://doi.org/10.1186/s13640-024-00622-7","url":null,"abstract":"<h3>Abstract</h3> <p>H.266/Versatile Video Coding (VVC) is the most recent video coding standard developed by the Joint Video Experts Team (JVET). The quad-tree with nested multi-type tree (QTMT) architecture that improves the compression performance of H.266/VVC is introduced. Moreover, H.266/VVC contains a greater number of intra-prediction modes than H.265/High Efficiency Video Coding (HEVC), totalling 67. However, these lead to extremely the coding computational complexity. To cope with the above issues, a fast intra-coding unit (CU) size decision method and a fast intra-prediction mode decision method are proposed in this paper. Specifically, the trained Support Vector Machine (SVM) classifier models are utilized for determining CU partition mode in a fast CU size decision scheme. Furthermore, the quantity of intra-prediction modes added to the RDO mode set decreases in a fast intra-prediction mode decision scheme based on the improved search step. Simulation results illustrate that the proposed overall algorithm can decrease 55.24% encoding runtime with negligible BDBR.</p>","PeriodicalId":49322,"journal":{"name":"Eurasip Journal on Image and Video Processing","volume":"123 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140172567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}