Accelerated Sign Hunter: A Sign-based Black-box Attack via Branch-Prune Strategy and Stabilized Hierarchical Search
S. Li, Guangji Huang, Xing Xu, Yang Yang, Fumin Shen
DOI: https://doi.org/10.1145/3512527.3531399
We propose the Accelerated Sign Hunter (ASH), a sign-based black-box attack under an l∞ constraint. The method searches for an approximate sign of the gradient of the loss w.r.t. the input image using few queries to the target model and crafts the adversarial example by updating the input image in that direction. It applies a Branch-Prune Strategy that infers unknown sign bits from the ones already checked, avoiding unnecessary queries, and adopts a Stabilized Hierarchical Search to achieve better performance within a limited query budget. We provide a theoretical proof that ASH halves the number of queries without lowering the attack success rate (SR) compared with the state-of-the-art sign-based black-box attack. Extensive experiments also demonstrate the superiority of ASH over other black-box attacks. In particular, on Inception-v3 for ImageNet, our method achieves an SR of 0.989 with an average of 338.56 queries, which is 1/4 fewer than the state-of-the-art sign-based attack needs to reach the same SR. Moreover, ASH works out of the box, as it has no hyperparameters that need tuning.
{"title":"Accelerated Sign Hunter: A Sign-based Black-box Attack via Branch-Prune Strategy and Stabilized Hierarchical Search","authors":"S. Li, Guangji Huang, Xing Xu, Yang Yang, Fumin Shen","doi":"10.1145/3512527.3531399","DOIUrl":"https://doi.org/10.1145/3512527.3531399","url":null,"abstract":"We propose the Accelerated Sign Hunter (ASH), a sign-based black-box attack under l∞ constraint. The proposed method searches an approximate gradient sign of loss w.r.t. the input image with few queries to the target model and crafts the adversarial example by updating the input image in this direction. It applies a Branch-Prune Strategy that infers the unknown sign bits according to the checked ones to avoid unnecessary queries. It also adopts a Stabilized Hierarchical Search to achieve better performance within a limited query budget. We provide a theoretical proof showing that the Accelerated Sign Hunter halves the queries without dropping the attack success rate (SR) compared with the state-of-the-art sign-based black-box attack. Extensive experiments also demonstrate the superiority of our ASH method over other black-box attacks. In particular on Inception-v3 for ImageNet, our method achieves the SR of 0.989 with an average queries of 338.56, which is 1/4 fewer than that of the state-of-the-art sign-based attack to achieve the same SR. Moreover, our ASH method is out-of-the-box since there are no hyperparameters that need to be tuned.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130729643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual-Channel Localization Networks for Moment Retrieval with Natural Language
Bolin Zhang, Bin Jiang, Chao Yang, Liang Pang
DOI: https://doi.org/10.1145/3512527.3531394
Given a natural language query, moment retrieval aims to localize the most relevant moment in an untrimmed video. Existing solutions can be roughly divided into two categories according to whether candidate moments are generated: i) moment-based approaches pre-cut the video into a set of candidate moments, perform multimodal fusion, and evaluate each candidate's matching score with the query; ii) clip-based approaches directly align video clips with the query and predict matching scores without generating candidate moments. Both frameworks have shortcomings: moment-based models suffer from heavy computation, while clip-based models are generally inferior to their moment-based counterparts in performance. To this end, we design an intuitive and efficient Dual-Channel Localization Network (DCLN) to balance computational cost and retrieval performance. To reduce computational cost, the proposed dual-channel structure captures the temporal relations of only the few video moments that share the same start or end boundary; the index of the start or end channel map represents the start or end time boundary of the corresponding moment. To improve performance, the network efficiently encodes temporal relations on the dual-channel map and learns discriminative features that distinguish the degree of matching between the natural language query and video moments. Extensive experiments on two standard benchmarks demonstrate the effectiveness of the proposed method.
{"title":"Dual-Channel Localization Networks for Moment Retrieval with Natural Language","authors":"Bolin Zhang, Bin Jiang, Chao Yang, Liang Pang","doi":"10.1145/3512527.3531394","DOIUrl":"https://doi.org/10.1145/3512527.3531394","url":null,"abstract":"According to the given natural language query, moment retrieval aims to localize the most relevant moment in an untrimmed video. The existing solutions for this problem can be roughly divided into two categories based on whether candidate moments are generated: i) Moment-based approach: It pre-cuts the video into a set of candidate moments, performs multimodal fusion, and evaluates matching scores with the query. ii) Clip-based approach: It directly aligns video clips and query with predicting matching scores without generating candidate moments. Both frameworks have respective shortcomings: the moment-based models suffer from heavy computations, while the performance of clip-based models is familiarly inferior to moment-based counterparts. To this end, we design an intuitive and efficient Dual-Channel Localization Network (DCLN) to balance computational cost and retrieval performance. For reducing computational cost, we capture the temporal relations of only a few video moments with the same start or end boundary in the proposed dual-channel structure. The start or end channel map index represents the corresponding video moment's start or end time boundary. For improving model performance, we apply the proposed dual-channel localization network to efficiently encode the temporal relations on the dual-channel map and learn discriminative features to distinguish the matching degree between natural language query and video moments. The extensive experiments on two standard benchmarks demonstrate the effectiveness of our proposed method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"264 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122468651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ICDAR'22: Intelligent Cross-Data Analysis and Retrieval
Minh-Son Dao, M. Riegler, Duc-Tien Dang-Nguyen, C. Gurrin, Yuta Nakashima, M. Dong
DOI: https://doi.org/10.1145/3512527.3531441
We have recently witnessed the rise of cross-data problems alongside multimodal data problems. A cross-modal retrieval system that uses a textual query to look for images, predicting the air-quality index from lifelogging images, predicting congestion from weather and tweet data, and predicting sleep quality from daily exercise and meals are some examples of this research direction. Although many investigations of multimodal data analytics have been conducted, little cross-data research (e.g., cross-modal, cross-domain, cross-platform) has been carried out. To promote research on intelligent cross-data analytics and retrieval and to help bring a smart, sustainable society to human beings, we introduce the special article collection on "Intelligent Cross-Data Analysis and Retrieval." This Research Topic welcomes contributors from diverse research domains and disciplines such as well-being, disaster prevention and mitigation, mobility, climate change, tourism, healthcare, and food computing.
{"title":"ICDAR'22: Intelligent Cross-Data Analysis and Retrieval","authors":"Minh-Son Dao, M. Riegler, Duc-Tien Dang-Nguyen, C. Gurrin, Yuta Nakashima, M. Dong","doi":"10.1145/3512527.3531441","DOIUrl":"https://doi.org/10.1145/3512527.3531441","url":null,"abstract":"We have witnessed the rise of cross-data against multimodal data problems recently. The cross-modal retrieval system uses a textual query to look for images; the air quality index can be predicted using lifelogging images; the congestion can be predicted using weather and tweets data; daily exercises and meals can help to predict the sleeping quality are some examples of this research direction. Although vast investigations focusing on multimodal data analytics have been developed, few cross-data (e.g., cross-modal data, cross-domain, cross-platform) research has been carried on. In order to promote intelligent cross-data analytics and retrieval research and to bring a smart, sustainable society to human beings, the specific article collection on \"Intelligent Cross-Data Analysis and Retrieval\" is introduced. This Research Topic welcomes those who come from diverse research domains and disciplines such as well-being, disaster prevention and mitigation, mobility, climate change, tourism, healthcare, and food computing","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133705913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motor Learning based on Presentation of a Tentative Goal
S. Sun, Yongqing Sun, Mitsuhiro Goto, Shigekuni Kondo, Dan Mikami, Susumu Yamamoto
DOI: https://doi.org/10.1145/3512527.3531413
This paper presents a motor learning method based on the presentation of a personalized target motion, which we call a tentative goal. While many prior studies have focused on helping users correct their motor-skill motions, most of them present the reference motion regardless of whether it is attainable, which makes it difficult for users to modify their motion appropriately when the gap between their motion and the reference is too large. This study aims to provide a tentative goal that maximizes performance within a certain amount of motion change. Achieving this requires predicting the performance of an arbitrary motion; however, building a general model for this is challenging because of the large variety of human motion. We therefore build an individual model that predicts performance from a small training dataset, implemented with our proposed data augmentation method. Experiments with basketball free-throw data demonstrate the effectiveness of the proposed method.
{"title":"Motor Learning based on Presentation of a Tentative Goal","authors":"S. Sun, Yongqing Sun, Mitsuhiro Goto, Shigekuni Kondo, Dan Mikami, Susumu Yamamoto","doi":"10.1145/3512527.3531413","DOIUrl":"https://doi.org/10.1145/3512527.3531413","url":null,"abstract":"This paper presents a motor learning method based on the presenting of a personalized target motion, which we call a tentative goal. While many prior studies have focused on helping users correct their motor skill motions, most of them present the reference motion to users regardless of whether the motion is attainable or not. This makes it difficult for users to appropriately modify their motion to the reference motion when the difference between their motion and the reference motion is too significant. This study aims to provide a tentative goal that maximizes performance within a certain amount of motion change. To achieve this, predicting the performance of any motion is necessary. However, it is challenging to estimate the performance of a tentative goal by building a general model because of the large variety of human motion. Therefore, we built an individual model that predicts performance from a small training dataset and implemented it using our proposed data augmentation method. Experiments with basketball free-throw data demonstrate the effectiveness of the proposed method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133078445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic Visual Recognition of Unexploded Ordnances Using Supervised Deep Learning
Georgios Begkas, Panagiotis Giannakeris, K. Ioannidis, Georgios Kalpakis, T. Tsikrika, S. Vrochidis, Y. Kompatsiaris
DOI: https://doi.org/10.1145/3512527.3531383
Unexploded ordnance (UXO) classification is a challenging task that is currently tackled with electromagnetic induction devices, which are expensive and may require physical presence in potentially hazardous environments. The limited availability of open UXO data has so far impeded progress on image-based UXO classification, which could offer a safe alternative at reduced cost. In addition, the existing sporadic efforts focus mainly on small-scale experiments using only a subset of common UXO categories. Our work aims to stimulate research interest in image-based UXO classification by curating a novel dataset of over 10,000 annotated images from eight major UXO categories. Through extensive experimentation with supervised deep learning, we uncover key insights into the challenging aspects of this task. Finally, we set the baseline on our benchmark by training state-of-the-art convolutional neural networks and a Vision Transformer that discriminate between highly overlapping UXO categories with 84.33% accuracy.
{"title":"Automatic Visual Recognition of Unexploded Ordnances Using Supervised Deep Learning","authors":"Georgios Begkas, Panagiotis Giannakeris, K. Ioannidis, Georgios Kalpakis, T. Tsikrika, S. Vrochidis, Y. Kompatsiaris","doi":"10.1145/3512527.3531383","DOIUrl":"https://doi.org/10.1145/3512527.3531383","url":null,"abstract":"Unexploded Ordnance (UXO) classification is a challenging task which is currently tackled using electromagnetic induction devices that are expensive and may require physical presence in potentially hazardous environments. The limited availability of open UXO data has, until now, impeded the progress of image-based UXO classification, which may offer a safe alternative at a reduced cost. In addition, the existing sporadic efforts focus mainly on small scale experiments using only a subset of common UXO categories. Our work aims to stimulate research interest in image-based UXO classification, with the curation of a novel dataset that consists of over 10000 annotated images from eight major UXO categories. Through extensive experimentation with supervised deep learning we uncover key insights into the challenging aspects of this task. Finally, we set the baseline on our novel benchmark by training state-of-the-art Convolutional Neural Networks and a Vision Transformer that are able to discriminate between highly overlapping UXO categories with 84.33% accuracy.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133080794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improve Image Captioning by Modeling Dynamic Scene Graph Extension
Minghao Geng, Qingjie Zhao
DOI: https://doi.org/10.1145/3512527.3531401
Recently, scene graph generation methods have been used in image captioning to encode objects and their relationships within the encoder-decoder framework, where the decoder selects part of the graph nodes as input for word inference. However, current methods attend to the scene graph using ambiguous language information and neglect the strong connections between scene graph nodes. In this paper, we propose a Scene Graph Extension (SGE) architecture that models dynamic scene graph extension using the partly generated sentence. Our model first uses the generated words and the previous attention results over scene graph nodes to form a partial scene graph, and then chooses objects or relationships closely connected to this partial graph to infer the next word. SGE is appealing in that it is pluggable into any scene-graph-based image captioning method. We conduct extensive experiments on the MSCOCO dataset. The results show that the proposed SGE significantly outperforms the baselines, achieving state-of-the-art performance under most metrics.
{"title":"Improve Image Captioning by Modeling Dynamic Scene Graph Extension","authors":"Minghao Geng, Qingjie Zhao","doi":"10.1145/3512527.3531401","DOIUrl":"https://doi.org/10.1145/3512527.3531401","url":null,"abstract":"Recently, scene graph generation methods have been used in image captioning to encode the objects and their relationships in the encoder-decoder framework, where the decoder selects part of the graph nodes as input for word inference. However, current methods attend to scene graph relying on ambiguous language information, neglecting the strong connections between scene graph nodes. In this paper, we propose a Scene Graph Extension (SGE) architecture to model the dynamic scene graph extension using the partly generated sentence. Our model first uses the generated words and previous attention results of scene graph nodes to make up a partial scene graph. Then we choose objects or relationships that has close connection with the generated graph to infer the next word. Our SGE is appealing in view that it is pluggable to any scene graph based image captioning method. We conduct the extensive experiments on MSCOCO dataset. The results shows that the proposed SGE significantly outperforms the baselines, resulting in a state-of-the-art performance under most metrics.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123311027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weakly Supervised Fine-grained Recognition based on Combined Learning for Small Data and Coarse Label
Anqi Hu, Zhengxing Sun, Qian Li
DOI: https://doi.org/10.1145/3512527.3531419
Learning with weak supervision has become one of the research trends in fine-grained image recognition. These methods aim to learn feature representations with less manual annotation cost or expert knowledge. Most existing weakly supervised methods rely on either incomplete or inexact annotation and struggle to perform well given such limited supervision. Using both kinds of annotation for training at the same time can therefore mine more relevant information while adding little annotation burden. In this paper, we propose a combined learning framework that uses coarse-grained large data and fine-grained small data for weakly supervised fine-grained recognition. Combined learning contains two key modules: 1) a discriminant module, which keeps the structural information of coarse and fine labels consistent via attention maps and part sampling, and 2) a cluster division strategy, which mines detailed differences between fine categories through feature subtraction. Experimental results show that our method outperforms weakly supervised methods and approaches the performance of fully supervised methods on the CUB-200-2011 and Stanford Cars datasets.
{"title":"Weakly Supervised Fine-grained Recognition based on Combined Learning for Small Data and Coarse Label","authors":"Anqi Hu, Zhengxing Sun, Qian Li","doi":"10.1145/3512527.3531419","DOIUrl":"https://doi.org/10.1145/3512527.3531419","url":null,"abstract":"Learning with weak supervision already becomes one of the research trends in fine-grained image recognition. These methods aim to learn feature representation in the case of less manual cost or expert knowledge. Most existing weakly supervised methods are based on incomplete annotation or inexact annotation, which is difficult to perform well limited by supervision information. Therefore, using these two kind of annotations for training at the same time could mine more relevance while the annotating burden will not increase much. In this paper, we propose a combined learning framework by coarse-grained large data and fine-grained small data for weakly supervised fine-grained recognition. Combined learning contains two significant modules: 1) a discriminant module, which maintains the structure information consistent between coarse label and fine label by attention map and part sampling, 2) a cluster division strategy, which mines the detail differences between fine categories by feature subtraction. Experiment results show that our method outperforms weakly supervised methods and achieves the performance close to fully supervised methods in CUB-200-2011 and Stanford Cars datasets.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114557119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phrase-level Prediction for Video Temporal Localization
Sizhe Li, C. Li, Minghang Zheng, Yang Liu
DOI: https://doi.org/10.1145/3512527.3531382
Video temporal localization aims to locate a period that semantically matches a natural language query in a given untrimmed video. We empirically observe that although existing approaches make steady progress on sentence localization, their performance on phrase localization is far from satisfactory. In principle, a phrase should be easier to localize, since fewer combinations of visual concepts need to be considered; this incapability indicates that existing models merely capture the sentence annotation bias in the benchmark and lack sufficient understanding of the intrinsic relationship between simple visual and language concepts, which calls their generalization and interpretability into question. This paper proposes a unified framework that handles both sentence- and phrase-level localization, namely the Phrase Level Prediction Net (PLPNet). Specifically, based on the hypothesis that similar phrases tend to focus on similar video cues while dissimilar ones should not, we build a contrastive mechanism that constrains phrase-level localization without requiring fine-grained phrase boundary annotations during training. Moreover, considering the flexibility of sentences and the wide discrepancy among phrases, we propose a clustering-based batch sampler to ensure that contrastive learning can be conducted efficiently. Extensive experiments demonstrate that our method surpasses state-of-the-art methods on phrase-level temporal localization while maintaining high performance on sentence localization and boosting the model's interpretability and generalization capability. Our code is available at https://github.com/sizhelee/PLPNet.
{"title":"Phrase-level Prediction for Video Temporal Localization","authors":"Sizhe Li, C. Li, Minghang Zheng, Yang Liu","doi":"10.1145/3512527.3531382","DOIUrl":"https://doi.org/10.1145/3512527.3531382","url":null,"abstract":"Video temporal localization aims to locate a period that semantically matches a natural language query in a given untrimmed video. We empirically observe that although existing approaches gain steady progress on sentence localization, the performance of phrase localization is far from satisfactory. In principle, the phrase should be easier to localize as fewer combinations of visual concepts need to be considered; such incapability indicates that the existing models only capture the sentence annotation bias in the benchmark but lack sufficient understanding of the intrinsic relationship between simple visual and language concepts, thus the model generalization and interpretability is questioned. This paper proposes a unified framework that can deal with both sentence and phrase-level localization, namely Phrase Level Prediction Net (PLPNet). Specifically, based on the hypothesis that similar phrases tend to focus on similar video cues, while dissimilar ones should not, we build a contrastive mechanism to restrain phrase-level localization without fine-grained phrase boundary annotation required in training. Moreover, considering the sentence's flexibility and wide discrepancy among phrases, we propose a clustering-based batch sampler to ensure that contrastive learning can be conducted efficiently. Extensive experiments demonstrate that our method surpasses state-of-the-art methods of phrase-level temporal localization while maintaining high performance in sentence localization and boosting the model's interpretability and generalization capability. Our code is available at https://github.com/sizhelee/PLPNet.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114735390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal-Consistent Visual Clue Attentive Network for Video-Based Person Re-Identification
Bingliang Jiao, Liying Gao, Peng Wang
DOI: https://doi.org/10.1145/3512527.3531362
Video-based person re-identification (ReID) aims to match video trajectories of pedestrians across multi-view cameras and has important applications in criminal investigation and intelligent surveillance. Compared with single-image re-identification, the abundant temporal information contained in video sequences allows pedestrian instances to be described more precisely and effectively. Most existing video-based person ReID algorithms exploit temporal information by fusing the diverse visual content captured in independent frames. However, these algorithms measure the salience of visual clues only within each single frame, inevitably introducing momentary interference caused by factors such as occlusion. In this work, we therefore introduce a Temporal-consistent Visual Clue Attentive Network (TVCAN), designed to capture pedestrian content that is consistently salient across frames. TVCAN consists of two major modules, the TCSA module and the TCCA module, which capture and emphasize consistently salient visual content along the spatial and channel dimensions, respectively. Extensive experiments verify the effectiveness of the designed modules, and TVCAN outperforms all compared state-of-the-art methods on three mainstream benchmarks.
{"title":"Temporal-Consistent Visual Clue Attentive Network for Video-Based Person Re-Identification","authors":"Bingliang Jiao, Liying Gao, Peng Wang","doi":"10.1145/3512527.3531362","DOIUrl":"https://doi.org/10.1145/3512527.3531362","url":null,"abstract":"Video-based person re-identification (ReID) aims to match video trajectories of pedestrians across multi-view cameras and has important applications in criminal investigation and intelligent surveillance. Compared with single image re-identification, the abundant temporal information contained in video sequences makes it describe pedestrian instances more precisely and effectively. Recently, most existing video-based person ReID algorithms have made use of temporal information by fusing diverse visual contents captured in independent frames. However, these algorithms only measure the salience of visual clues in each single frame, inevitably introducing momentary interference caused by factors like occlusion. Therefore, in this work, we introduce a Temporal-consistent Visual Clue Attentive Network (TVCAN), which is designed to capture temporal-consistently salient pedestrian contents among frames. Our TVCAN consists of two major modules, the TCSA module, and the TCCA module, which are responsible for capturing and emphasizing consistently salient visual contents from the spatial dimension and channel dimension, respectively. Through extensive experiments, the effectiveness of our designed modules has been verified. Additionally, our TVCAN outperforms all compared state-of-the-art methods on three mainstream benchmarks.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123917842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Impact of Dataset Splits on Classification Performance in Medical Videos
Markus Fox, Klaus Schoeffmann
DOI: https://doi.org/10.1145/3512527.3531424
The creation of datasets in medical imaging is a central research topic, especially with the advances of deep learning in the past decade. Publications of such datasets typically report baseline results for one or more deep neural networks using established performance metrics (e.g., F1-score, Jaccard index). Much subsequent work then tries to beat these baselines in order to compare different neural architectures. However, the reported metrics are almost meaningless when the underlying data does not conform to specific standards. To better understand what standards are needed, we reproduced and analyzed a study of four medical image classification datasets in laparoscopy. With automated frame extraction from surgical videos, we find that the resulting images are far too similar and produce high evaluation metrics by design. We demonstrate this similarity with a basic SIFT algorithm that achieves high evaluation metrics on the original data. We confirm our hypothesis by creating and evaluating a video-based dataset split from the original images: the original network evaluated on the video-based split performs worse than our basic SIFT algorithm on the original data.
{"title":"The Impact of Dataset Splits on Classification Performance in Medical Videos","authors":"Markus Fox, Klaus Schoeffmann","doi":"10.1145/3512527.3531424","DOIUrl":"https://doi.org/10.1145/3512527.3531424","url":null,"abstract":"The creation of datasets in medical imaging is a central topic of research, especially with the advances of deep learning in the past decade. Publications of such datasets typically report baseline results with one or more deep neural networks in the form of established performance metrics (e.g., F1-score, Jaccard, etc.). Then, much work is done trying to beat these baseline metrics to compare different neural architectures. However, these reported metrics are almost meaningless when the underlying data does not conform to specific standards. In order to better understand what standards we need, we have reproduced and analyzed a study of four medical image classification datasets in laparoscopy. With automated frame extraction of surgical videos, we find that the resulting images are way too similar and produce high evaluation metrics by design. We show this similarity with a basic SIFT algorithm that produces high evaluation metrics on the original data. We confirm our hypothesis by creating and evaluating a video-based dataset split from the original images. The original network evaluated on the video-based split performs worse than our basic SIFT algorithm on the original data.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121060847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}