Hybrid detection model for unauthorized use of doctor’s code in health insurance: Integrating rule-based screening and LLM reasoning
Qiwen Yuan, Jiajie Chen, Zhendong Shi
Pub Date: 2026-01-28. DOI: 10.1016/j.displa.2026.103359. Displays, Volume 93, Article 103359.
Unauthorized use of a doctor’s code is a high-risk and context-dependent issue in health insurance supervision. Traditional rule-based screening achieves high recall but often produces false positives in cases that appear anomalous yet are clinically legitimate, such as telemedicine encounters, refund-related re-settlements, and rapid outpatient–emergency transitions. These methods lack semantic understanding of medical context and rely heavily on manual auditing. We propose a hybrid detection framework that integrates rule-based temporal filtering with large language model (LLM)–based semantic reasoning. Time-threshold rules are first applied to extract suspected cases from real health-insurance claim data. Expert-derived legitimate scenario patterns are then embedded into structured prompts to guide the LLM in semantic plausibility assessment and false-positive reduction. For evaluation, we construct a 240-pair multi-scenario benchmark dataset from de-identified real claim records, covering both reasonable and suspicious situations. Zero-shot experiments with DeepSeek-R1-7B show that the framework achieves 75% accuracy and 87% precision in distinguishing reasonable from unauthorized cases. These results indicate that the proposed method can effectively reduce false alarms and alleviate manual audit workload, providing a practical and efficient solution for real-world health-insurance supervision.
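A minimal Python sketch of the two-stage pipeline described in this abstract: a time-threshold rule flags claim pairs that reuse the same doctor code at different facilities within a short window, and a structured prompt embeds expert-derived legitimate-scenario patterns for the LLM's plausibility check. The claim field names, the 30-minute threshold, and the prompt wording are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch of the two-stage idea: a time-threshold rule flags suspect
# claim pairs (stage 1, high recall), then an LLM prompt embedding
# legitimate-scenario patterns supports the plausibility check (stage 2).
# Field names, threshold, and prompt wording are assumptions.
from datetime import timedelta

LEGITIMATE_PATTERNS = [
    "telemedicine encounter billed while the doctor is physically at another site",
    "refund-related re-settlement that duplicates an earlier visit time",
    "rapid outpatient-to-emergency transition within the same hospital",
]

def rule_stage(claims, threshold=timedelta(minutes=30)):
    """Flag pairs of claims that use the same doctor code at different
    facilities within the time threshold."""
    claims = sorted(claims, key=lambda c: c["time"])
    suspects = []
    for a, b in zip(claims, claims[1:]):
        if (a["doctor_code"] == b["doctor_code"]
                and a["facility"] != b["facility"]
                and b["time"] - a["time"] < threshold):
            suspects.append((a, b))
    return suspects

def build_prompt(pair):
    """Structured prompt asking the LLM whether a flagged pair matches any
    clinically legitimate scenario."""
    a, b = pair
    patterns = "\n".join(f"- {p}" for p in LEGITIMATE_PATTERNS)
    return (
        "You are auditing health-insurance claims.\n"
        f"Claim A: {a}\nClaim B: {b}\n"
        "Known legitimate scenarios:\n" + patterns + "\n"
        "Answer REASONABLE or UNAUTHORIZED with a one-sentence justification."
    )
```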
{"title":"Hybrid detection model for unauthorized use of doctor’s code in health insurance: Integrating rule-based screening and LLM reasoning","authors":"Qiwen Yuan , Jiajie Chen , Zhendong Shi","doi":"10.1016/j.displa.2026.103359","DOIUrl":"10.1016/j.displa.2026.103359","url":null,"abstract":"<div><div>Unauthorized use of doctor’s code is a high-risk and context-dependent issue in health insurance supervision. Traditional rule-based screening achieves high recall but often produces false positives in cases that appear anomalous yet are clinically legitimate, such as telemedicine encounters, refund-related re-settlements, and rapid outpatient–emergency transitions. These methods lack semantic understanding of medical context and rely heavily on manual auditing. We propose a hybrid detection framework that integrates rule-based temporal filtering with large language model (LLM)–based semantic reasoning. Time-threshold rules are first applied to extract suspected cases from real health-insurance claim data. Expert-derived legitimate scenario patterns are then embedded into structured prompts to guide the LLM in semantic plausibility assessment and false-positive reduction. For evaluation, we construct a 240-pair multi-scenario benchmark dataset from de-identified real claim records, covering both reasonable and suspicious situations. Zero-shot experiments with DeepSeek-R1-7B show that the framework achieves 75% accuracy and 87% precision in distinguishing reasonable from unauthorized cases. These results indicate that the proposed method can effectively reduce false alarms and alleviate manual audit workload, providing a practical and efficient solution for real-world health-insurance supervision.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"93 ","pages":"Article 103359"},"PeriodicalIF":3.4,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146070883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Content-adaptive dual feature selection for infrared aerial video compressive sensing reconstruction
Hao Liu, Maoji Qiu, Rong Huang
Pub Date: 2026-01-27. DOI: 10.1016/j.displa.2026.103368. Displays, Volume 93, Article 103368.
For block compressive sensing (BCS) of natural videos, existing reconstruction algorithms typically utilize nonlocal self-similarity (NSS) to generate sparse residuals, thereby achieving favorable recovery performance by exploiting the statistical characteristics of key frames and non-key frames. However, when applied to multi-perspective infrared aerial videos rather than natural videos, these reconstruction algorithms usually yield poor recovery quality because of their inflexibility in selecting similar patches and their poor adaptability to dynamic scene changes. Due to the distribution properties of infrared aerial imagery, inter-frame and intra-frame similar patches should be selected adaptively so that an accurate dictionary matrix can be learned. Therefore, this paper proposes a content-adaptive dual feature selection mechanism. It first conducts a rough screening of inter-frame and intra-frame similar patches based on the correlation of observed measurement vectors across frames. This rough screening is followed by a fine screening stage, in which principal component analysis (PCA) projects the similar patch-group matrix into a low-dimensional space. Finally, the split Bregman iteration (SBI) is employed to solve the BCS reconstruction for infrared aerial video. Experimental results on both the HIT-UAV and M200-XT2DroneVehicle datasets demonstrate that the proposed algorithm achieves better recovery quality than state-of-the-art algorithms.
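As a rough illustration of the fine-screening step, the sketch below projects a group of rough-screened candidate patches into a low-dimensional PCA subspace and re-ranks them by distance to the reference patch there. Patch dimensionality, the number of principal components, and the number of retained neighbors are assumptions for illustration only.

```python
# Hedged sketch: candidate similar patches are projected into a PCA subspace
# and re-ranked by their distance to the reference patch in that subspace.
import numpy as np

def fine_screen(reference, candidates, n_components=8, keep=16):
    """reference: (d,) vector; candidates: (m, d) matrix of rough-screened patches."""
    group = np.vstack([reference[None, :], candidates])
    centered = group - group.mean(axis=0, keepdims=True)
    # PCA of the patch-group matrix via SVD.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T          # (m+1, n_components)
    ref_low, cand_low = projected[0], projected[1:]
    dist = np.linalg.norm(cand_low - ref_low, axis=1)
    order = np.argsort(dist)[:keep]
    return candidates[order], order
```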
{"title":"Content-adaptive dual feature selection for infrared aerial video compressive sensing reconstruction","authors":"Hao Liu , Maoji Qiu , Rong Huang","doi":"10.1016/j.displa.2026.103368","DOIUrl":"10.1016/j.displa.2026.103368","url":null,"abstract":"<div><div>For block compressive sensing (BCS) of natural videos, existing reconstruction algorithms typically utilize nonlocal self-similarity (NSS) to generate sparse residuals, thereby achieving favorable recovery performance by exploiting the statistical characteristics of key frames and non-key frames. However, when applied to multi-perspective infrared aerial videos rather than natural videos, these reconstruction algorithms usually result in poor recovery quality because of the inflexibility in selecting similar patches and poor adaptability to dynamic scene changes. Due to the distribution property of infrared aerial imagery, inter-frame and intra-frame similar patches should be selected adaptively so that an accurate dictionary matrix can be learned. Therefore, this paper proposes a content-adaptive dual feature selection mechanism. It first conducts a rough screening of inter-frame and intra-frame similar patches based on the correlation of observed measurement vectors across frames. Then, it is followed by a fine screening stage, where principal component analysis (PCA) is applied to project the similar patch-group matrix into a low-dimensional space. Finally, the split Bregman iteration (SBI) is employed to solve the BCS reconstruction for infrared aerial video. Experimental results on both HIT-UAV and M200-XT2DroneVehicle datasets demonstrate that the proposed algorithm achieves better recovery quality compared to state-of-the-art algorithms.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"93 ","pages":"Article 103368"},"PeriodicalIF":3.4,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146070880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Trans-MT: a 3D semi-supervised glioma segmentation model integrating transformer architecture and asymmetric data augmentation
Yuehui Liao, Yun Zheng, Yingjie Jiao, Na Tang, Yuhao Wang, Yu Hu, Yaning Feng, Ruofan Wang, Qun Jin, Xiaobo Lai, Panfei Li
Pub Date: 2026-01-27. DOI: 10.1016/j.displa.2026.103365. Displays, Volume 93, Article 103365.
Accurate glioma segmentation in magnetic resonance imaging (MRI) is crucial for effective diagnosis and treatment planning in neuro-oncology; however, this process is often time-consuming and heavily reliant on expert annotations. To address these limitations, we present Trans-MT, a 3D semi-supervised segmentation model that integrates a transformer-based architecture with asymmetric data augmentation, achieving high segmentation accuracy with limited labeled data. Trans-MT employs a teacher-student framework: the teacher model generates reliable pseudo-labels for unlabeled data, while the student model learns through supervised and consistency losses, guided by an uncertainty-aware mechanism to refine its predictions. The architecture of Trans-MT features a hybrid encoder, nnUFormer, which combines the robust capabilities of nn-UNet with transformers, enabling it to capture global contextual information essential for accurate tumor segmentation. This design enhances the model’s ability to detect intricate tumor structures within MRI scans, even with sparse annotations. Additionally, the model’s learning process is strengthened by asymmetric data augmentation, which enriches data diversity and robustness. We evaluated Trans-MT on the BraTS 2019, 2020, and 2021 datasets, where it demonstrated superior performance over several state-of-the-art semi-supervised models, particularly in segmenting challenging tumor sub-regions. The results confirm that Trans-MT significantly improves segmentation precision, making it a valuable advancement in brain tumor segmentation methodology and a practical solution for clinical settings with limited labeled data. Our code is available at https://github.com/smallboy-code/TransMT.
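A hedged sketch of the teacher-student training signal outlined above: an EMA teacher produces pseudo-labels on unlabeled volumes, an entropy-based uncertainty mask down-weights unreliable voxels, and the student is trained with a supervised loss plus a masked consistency loss. The EMA decay, uncertainty threshold, and loss weighting are placeholder values, not the paper's settings.

```python
# Hedged sketch of uncertainty-aware teacher-student consistency training.
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(teacher, student, ema_decay=0.99):
    # The teacher is an exponential moving average of the student's weights.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)

def semi_supervised_loss(student, teacher, labeled_x, labels, unlabeled_x,
                         threshold=0.1, consistency_weight=1.0):
    # Supervised term on labeled volumes.
    sup = F.cross_entropy(student(labeled_x), labels)
    # Teacher pseudo-labels and a simple entropy-based uncertainty mask.
    with torch.no_grad():
        t_prob = torch.softmax(teacher(unlabeled_x), dim=1)
        entropy = -(t_prob * torch.log(t_prob + 1e-8)).sum(dim=1)
        mask = (entropy < threshold).float()
    s_prob = torch.softmax(student(unlabeled_x), dim=1)
    consistency = ((s_prob - t_prob) ** 2).mean(dim=1)
    cons = (mask * consistency).sum() / (mask.sum() + 1e-8)
    return sup + consistency_weight * cons
```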
{"title":"Trans-MT: a 3D semi-supervised glioma segmentation model integrating transformer architecture and asymmetric data augmentation","authors":"Yuehui Liao , Yun Zheng , Yingjie Jiao , Na Tang , Yuhao Wang , Yu Hu , Yaning Feng , Ruofan Wang , Qun Jin , Xiaobo Lai , Panfei Li","doi":"10.1016/j.displa.2026.103365","DOIUrl":"10.1016/j.displa.2026.103365","url":null,"abstract":"<div><div>Accurate glioma segmentation in magnetic resonance imaging (MRI) is crucial for effective diagnosis and treatment planning in neuro-oncology; however, this process is often time-consuming and heavily reliant on expert annotations. To address these limitations, we present Trans-MT, a 3D semi-supervised segmentation model that integrates a transformer-based architecture with asymmetric data augmentation, achieving high segmentation accuracy with limited labeled data. Trans-MT employs a teacher-student framework: the teacher model generates reliable pseudo-labels for unlabeled data, while the student model learns through supervised and consistency losses, guided by an uncertainty-aware mechanism to refine its predictions. The architecture of Trans-MT features a hybrid encoder, nnUFormer, which combines the robust capabilities of nn-UNet with transformers, enabling it to capture global contextual information essential for accurate tumor segmentation. This design enhances the model’s ability to detect intricate tumor structures within MRI scans, even with sparse annotations. Additionally, the model’s learning process is strengthened by asymmetric data augmentation, which enriches data diversity and robustness. We evaluated Trans-MT on the BraTS 2019, 2020, and 2021 datasets, where it demonstrated superior performance over several state-of-the-art semi-supervised models, particularly in segmenting challenging tumor sub-regions. The results confirm that Trans-MT significantly improves segmentation precision, making it a valuable advancement in brain tumor segmentation methodology and a practical solution for clinical settings with limited labeled data. Our code is available<!--> <!-->at <span><span>https://github.com/smallboy-code/TransMT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"93 ","pages":"Article 103365"},"PeriodicalIF":3.4,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146090217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning video normality for anomaly detection via multi-scale spatiotemporal feature extraction and a feature memory module
Yongqing Huo, Wenke Jiang
Pub Date: 2026-01-22. DOI: 10.1016/j.displa.2026.103355. Displays, Volume 92, Article 103355.
Video anomaly detection (VAD) is critical for the automated identification of anomalous behaviors in surveillance systems, with applications in public safety, intelligent transportation, and healthcare. However, as application domains continue to expand, ensuring that a VAD algorithm maintains strong detection performance across diverse scenarios has become a primary focus of current research. To enhance detection robustness across various environments, we propose a novel autoencoder-based model in this paper. Compared with other algorithms, our method more effectively exploits multi-scale feature information within frames for learning the feature distribution. In the encoder, we construct a convolutional module with multiple kernel sizes and incorporate the designed Spatial-Channel Transformer Attention (SCTA) module to strengthen feature representation. In the decoder, we integrate a multi-scale feature reconstruction module with Self-Supervised Predictive Convolutional Attentive Blocks (SSPCAB) for more accurate next-frame prediction. Moreover, we introduce a dedicated memory module to capture and store the distribution of normal data patterns. Meanwhile, the architecture employs Conv-LSTM and a specially designed Temporal-Spatial Attention (TSA) module in skip connections to capture spatiotemporal dependencies across video frames. Benefiting from the design and integration of these modules, our method achieves superior detection performance on public datasets, including UCSD Ped2, CUHK Avenue, and ShanghaiTech. The experimental results demonstrate the effectiveness and versatility of our method in anomaly detection tasks.
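The memory module mentioned above can be illustrated with a small sketch in which encoder features query a bank of learned normal-pattern items by cosine similarity and read back a softmax-weighted combination; anomalous inputs then reconstruct poorly from the stored normality. The memory size and feature dimension are illustrative assumptions.

```python
# Hedged sketch of a feature memory module storing normal-pattern prototypes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMemory(nn.Module):
    def __init__(self, num_items=100, dim=256):
        super().__init__()
        self.items = nn.Parameter(torch.randn(num_items, dim))  # learned normal patterns

    def forward(self, queries):                                  # queries: (N, dim)
        sim = F.normalize(queries, dim=1) @ F.normalize(self.items, dim=1).t()
        weights = torch.softmax(sim, dim=1)      # addressing weights over memory items
        read = weights @ self.items              # (N, dim) read-out from memory
        return read, weights
```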
{"title":"Learning video normality for anomaly detection via multi-scale spatiotemporal feature extraction and a feature memory module","authors":"Yongqing Huo, Wenke Jiang","doi":"10.1016/j.displa.2026.103355","DOIUrl":"10.1016/j.displa.2026.103355","url":null,"abstract":"<div><div>Video anomaly detection (VAD) is critical for automated identification of anomalous behaviors in surveillance system, with applications in public safety, intelligent transportation and healthcare. However, with the continuous expansion of application domains, ensuring that VAD algorithm maintains excellent detection performance across diverse scenarios has become the primary focus of current research direction. To enhance the robustness of detection across various environments, we propose a novel autoencoder-based model in this paper. Compared with other algorithms, our method can more effectively exploit multi-scale feature information within frames for learning feature distribution. In the encoder, we construct the convolutional module with multiple kernel sizes and incorporate the designed Spatial-Channel Transformer Attention (SCTA) module to strengthen the feature representation. In the decoder, we integrate the multi-scale feature reconstruction module with Self-Supervised Predictive Convolutional Attentive Blocks (SSPCAB) for more accurate next-frame prediction. Moreover, we introduce a dedicated memory module to capture and store the distribution of normal data patterns. Meanwhile, the architecture employs the Conv-LSTM and a specially designed Temporal-Spatial Attention (TSA) module in skip connections to capture spatiotemporal dependencies across video frames. Benefiting from the design and integration of those modules, our proposed method achieves superior detection performance on public datasets, including UCSD Ped2, CUHK Avenue and ShanghaiTech. The experimental results demonstrate the effectiveness and versatility of our method in anomaly detection tasks.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103355"},"PeriodicalIF":3.4,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146077221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-attention-based mixture-of-experts framework for non-invasive prediction of MGMT promoter methylation in glioblastoma using multi-modal MRI
Yuehui Liao, Yun Zheng, Jingyu Zhu, Yu Chen, Feng Gao, Yaning Feng, Weiji Yang, Guang Yang, Xiaobo Lai, Panfei Li
Pub Date: 2026-01-21. DOI: 10.1016/j.displa.2026.103358. Displays, Volume 92, Article 103358.
Glioblastoma (GBM) is an aggressive brain tumor associated with poor prognosis and limited treatment options. The methylation status of the O6-methylguanine-DNA methyltransferase (MGMT) promoter is a critical biomarker for predicting the efficacy of temozolomide chemotherapy in GBM patients. However, current methods for determining MGMT promoter methylation status are invasive and costly, which hinders their widespread clinical application. In this study, we propose a novel non-invasive deep learning framework based on a Mixture-of-Experts (MoE) architecture for predicting MGMT promoter methylation status using multi-modal magnetic resonance imaging (MRI) data. Our MoE model incorporates modality-specific expert networks built on the ResNet18 architecture, with a self-attention-based gating mechanism that dynamically selects and integrates the most relevant features across MRI modalities (T1-weighted, contrast-enhanced T1, T2-weighted, and fluid-attenuated inversion recovery). We evaluate the proposed framework on the BraTS2021 and TCGA-GBM datasets, showing superior performance compared to conventional deep learning models in terms of accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). Furthermore, Grad-CAM visualizations provide enhanced interpretability by highlighting biologically relevant regions in the tumor and peritumoral areas that influence model predictions. The proposed framework represents a promising tool for integrating imaging biomarkers into precision oncology workflows, offering a scalable, cost-effective, and interpretable solution for non-invasive MGMT methylation prediction in GBM.
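A minimal sketch of the modality-specific mixture-of-experts idea: one expert per MRI modality produces an embedding, and a self-attention layer acts as the gate that weights and fuses the modality tokens before classification. The tiny placeholder experts (the paper uses ResNet18 backbones), dimensions, and gating form are assumptions.

```python
# Hedged sketch of modality-specific experts with a self-attention gate.
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    MODALITIES = ("t1", "t1ce", "t2", "flair")

    def __init__(self, expert_dim=128):
        super().__init__()
        # Tiny placeholder experts; the paper builds each expert on ResNet18.
        self.experts = nn.ModuleDict({
            m: nn.Sequential(
                nn.Conv3d(1, 8, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),
                nn.Flatten(),
                nn.Linear(8, expert_dim),
            )
            for m in self.MODALITIES
        })
        # Self-attention over the four modality embeddings acts as the gate.
        self.gate = nn.MultiheadAttention(expert_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(expert_dim, 2)     # methylated vs. unmethylated

    def forward(self, volumes):                        # dict of (B, 1, D, H, W) tensors
        tokens = torch.stack([self.experts[m](volumes[m]) for m in self.MODALITIES], dim=1)
        fused, gate_weights = self.gate(tokens, tokens, tokens)   # (B, 4, expert_dim)
        logits = self.classifier(fused.mean(dim=1))
        return logits, gate_weights
```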
{"title":"Self-attention-based mixture-of-experts framework for non-invasive prediction of MGMT promoter methylation in glioblastoma using multi-modal MRI","authors":"Yuehui Liao , Yun Zheng , Jingyu Zhu , Yu Chen , Feng Gao , Yaning Feng , Weiji Yang , Guang Yang , Xiaobo Lai , Panfei Li","doi":"10.1016/j.displa.2026.103358","DOIUrl":"10.1016/j.displa.2026.103358","url":null,"abstract":"<div><div>Glioblastoma (GBM) is an aggressive brain tumor associated with poor prognosis and limited treatment options. The methylation status of the O6-methylguanine-DNA methyltransferase (MGMT) promoter is a critical biomarker for predicting the efficacy of temozolomide chemotherapy in GBM patients. However, current methods for determining MGMT promoter methylation, including invasive and costly techniques, hinder their widespread clinical application. In this study, we propose a novel non-invasive deep learning framework based on a Mixture-of-Experts (MoE) architecture for predicting MGMT promoter methylation status using multi-modal magnetic resonance imaging (MRI) data. Our MoE model incorporates modality-specific expert networks built on the ResNet18 architecture, with a self-attention-based gating mechanism that dynamically selects and integrates the most relevant features across MRI modalities (T1-weighted, contrast-enhanced T1, T2-weighted, and fluid-attenuated inversion recovery). We evaluate the proposed framework on the BraTS2021 and TCGA-GBM datasets, showing superior performance compared to conventional deep learning models in terms of accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). Furthermore, Grad-CAM visualizations provide enhanced interpretability by highlighting biologically relevant regions in the tumor and peritumoral areas that influence model predictions. The proposed framework represents a promising tool for integrating imaging biomarkers into precision oncology workflows, offering a scalable, cost-effective, and interpretable solution for non-invasive MGMT methylation prediction in GBM.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103358"},"PeriodicalIF":3.4,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146077313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harnessing differentiable geometry and orientation attention for semi-supervised vessel segmentation with limited annotations
Yan Liu, Yan Yang, Yongquan Jiang, Xiaole Zhao, Liang Fan
Pub Date: 2026-01-21. DOI: 10.1016/j.displa.2026.103347. Displays, Volume 92, Article 103347.
The precise segmentation of vascular structures is vital for diagnosing retinal and coronary artery diseases. However, the complex morphology and large structural variability of blood vessels make manual annotation time-consuming and labeled data scarce, which in turn limits the scalability of supervised segmentation methods. We propose a semi-supervised segmentation framework named geometric orientational fusion attention network (GOFA-Net) that integrates differentiable geometric augmentation and orientation-aware attention to effectively leverage knowledge from limited annotations. GOFA-Net comprises three key complementary components: 1) a differentiable geometric augmentation strategy (DGAS) employs quaternion-based representations to diversify training samples while preserving prediction consistency between teacher and student models; 2) a multi-view fusion module (MVFM) orchestrates collaborative feature learning between quaternion and conventional convolutional streams to capture comprehensive spatial dependencies; and 3) a global orientational attention module (GOAM) enhances structural awareness through direction-sensitive geometric embeddings, specifically reinforcing the perception of vascular topology along horizontal and vertical orientations. Extensive validation on multiple retinal vessel datasets (DRIVE, STARE, CHASE_DB1, and HRF) and coronary angiography datasets (DCA1 and CHUAC) shows that GOFA-Net consistently outperforms state-of-the-art semi-supervised methods, achieving particularly notable gains in scenarios with limited annotations.
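A hedged sketch of consistency under differentiable geometric augmentation in the spirit of DGAS: the student sees a rotated image and its prediction is compared with the teacher's prediction rotated by the same transform. A plain 2D affine rotation stands in for the paper's quaternion-based parameterization, and the loss form is an assumption.

```python
# Hedged sketch: teacher-student consistency under a differentiable rotation.
import torch
import torch.nn.functional as F

def rotate(x, angle):
    """Differentiable rotation of a (B, C, H, W) tensor by `angle` radians."""
    cos, sin = torch.cos(angle), torch.sin(angle)
    theta = torch.stack([
        torch.stack([cos, -sin, torch.zeros_like(angle)], dim=-1),
        torch.stack([sin,  cos, torch.zeros_like(angle)], dim=-1),
    ], dim=-2)                                           # (B, 2, 3) affine matrices
    grid = F.affine_grid(theta, x.size(), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def geometric_consistency_loss(student, teacher, unlabeled_x):
    # Random per-sample rotation angle applied to the student's input.
    angle = torch.rand(unlabeled_x.size(0), device=unlabeled_x.device) * 2 * torch.pi
    student_pred = torch.sigmoid(student(rotate(unlabeled_x, angle)))
    with torch.no_grad():
        teacher_pred = torch.sigmoid(teacher(unlabeled_x))
    # Consistency: student's prediction on the rotated input should match the
    # teacher's prediction rotated by the same transform.
    return F.mse_loss(student_pred, rotate(teacher_pred, angle))
```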
{"title":"Harnessing differentiable geometry and orientation attention for semi-supervised vessel segmentation with limited annotations","authors":"Yan Liu , Yan Yang , Yongquan Jiang , Xiaole Zhao , Liang Fan","doi":"10.1016/j.displa.2026.103347","DOIUrl":"10.1016/j.displa.2026.103347","url":null,"abstract":"<div><div>The precise segmentation of vascular structures is vital for diagnosing retinal and coronary artery diseases. However, the complex morphology and large structural variability of blood vessels make manual annotation time-consuming and finite which in turn limits the scalability of supervised segmentation methods. We propose a semi-supervised segmentation framework named geometric orientational fusion attention network (GOFA-Net) that integrates differentiable geometric augmentation and orientation-aware attention to effectively leverage knowledge from limited annotations. GOFA-Net comprises three key complementary components: 1) a differentiable geometric augmentation strategy (DGAS) employs quaternion-based representations to diversify training samples while preserving prediction consistency between teacher and student models; 2) a multi-view fusion module (MVFM) orchestrates collaborative feature learning between quaternion and conventional convolutional streams to capture comprehensive spatial dependencies; and 3) a global orientational attention module (GOAM) enhances structural awareness through direction-sensitive geometric embeddings, specifically reinforcing the perception of vascular topology along horizontal and vertical orientations. Extensive validation on multiple retinal vessel datasets (DRIVE, STARE, CHASE_DB1, and HRF) and coronary angiography datasets (DCA1 and CHUAC) show that GOFA-Net consistently outperforms state-of-the-art semi-supervised methods, achieving particularly notable gains in scenarios with limited annotations.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103347"},"PeriodicalIF":3.4,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146037274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RAP-SORT: Advanced Multi-Object Tracking for complex scenarios
Shuming Zhang, Yuhang Zhu, Yanhui Sun, Weiyong Liu, Zhangjin Huang
Pub Date: 2026-01-21. DOI: 10.1016/j.displa.2026.103361. Displays, Volume 92, Article 103361.
Multi-Object Tracking (MOT) aims to detect and associate objects across frames while maintaining consistent IDs. While some approaches leverage both strong and weak cues alongside camera compensation to improve association, they struggle in scenarios involving high object density or nonlinear motion. To address these challenges, we propose RAP-SORT, a novel MOT framework that introduces four key innovations. First, the Robust Tracklet Confidence Modeling (RTCM) module models trajectory confidence by smoothing updates and applying second-order difference adjustments for low-confidence cases. Second, the Advanced Observation-Centric Recovery (AOCR) module facilitates trajectory recovery via linear interpolation and backtracking. Third, the Pseudo-Depth IoU (PDIoU) metric integrates height and depth cues into IoU calculations for enhanced spatial awareness. Finally, the Window Denoising (WD) module is tailored for the DanceTrack dataset, effectively mitigating the creation of new tracks caused by misdetections. RAP-SORT sets a new state-of-the-art on the DanceTrack and MOT20 benchmarks, achieving HOTA scores of 66.7 and 64.2, surpassing the previous best by 1.0 and 0.3, respectively, while also delivering competitive performance on MOT17. Code and models will be available soon at https://github.com/levi5611/RAP-SORT.
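One plausible reading of a pseudo-depth-aware IoU, sketched below: standard box IoU is modulated by agreement in box height and in a pseudo-depth cue taken from the bottom edge of the box (objects lower in the image treated as nearer). The exact combination rule is an assumption; the paper's PDIoU formula may differ.

```python
# Hedged sketch of a pseudo-depth-aware IoU for track-detection matching.
def pseudo_depth_iou(box_a, box_b, img_height=1080.0):
    """Boxes are (x1, y1, x2, y2) in pixels."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter + 1e-9)

    # Height agreement and a pseudo-depth cue from the bottom edge of the box.
    h_a, h_b = box_a[3] - box_a[1], box_b[3] - box_b[1]
    height_sim = min(h_a, h_b) / (max(h_a, h_b) + 1e-9)
    depth_a, depth_b = box_a[3] / img_height, box_b[3] / img_height
    depth_sim = 1.0 - abs(depth_a - depth_b)
    return iou * height_sim * depth_sim
```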
{"title":"RAP-SORT: Advanced Multi-Object Tracking for complex scenarios","authors":"Shuming Zhang , Yuhang Zhu , Yanhui Sun , Weiyong Liu , Zhangjin Huang","doi":"10.1016/j.displa.2026.103361","DOIUrl":"10.1016/j.displa.2026.103361","url":null,"abstract":"<div><div>Multi-Object Tracking (MOT) aims to detect and associate objects across frames while maintaining consistent IDs. While some approaches leverage both strong and weak cues alongside camera compensation to improve association, they struggle in scenarios involving high object density or nonlinear motion. To address these challenges, we propose RAP-SORT, a novel MOT framework that introduces four key innovations. First, the Robust Tracklet Confidence Modeling (RTCM) module models trajectory confidence by smoothing updates and applying second-order difference adjustments for low-confidence cases. Second, the Advanced Observation-Centric Recovery (AOCR) module facilitates trajectory recovery via linear interpolation and backtracking. Third, the Pseudo-Depth IoU (PDIoU) metric integrates height and depth cues into IoU calculations for enhanced spatial awareness. Finally, the Window Denoising (WD) module is tailored for the DanceTrack dataset, effectively mitigating the creation of new tracks caused by misdetections. RAP-SORT sets a new state-of-the-art on the DanceTrack and MOT20 benchmarks, achieving HOTA scores of 66.7 and 64.2, surpassing the previous best by 1.0 and 0.3, respectively, while also delivering competitive performance on MOT17. Code and models will be available soon at <span><span>https://github.com/levi5611/RAP-SORT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103361"},"PeriodicalIF":3.4,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146037275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ctp2Fic: From coarse-grained token pruning to fine-grained token clustering for LVLM inference acceleration
Yulong Lei, Zishuo Wang, Jinglin Xu, Yuxin Peng
Pub Date: 2026-01-21. DOI: 10.1016/j.displa.2026.103360. Displays, Volume 92, Article 103360.
Large Vision–Language Models (LVLMs) excel in multimodal tasks, but their high computational cost, driven by the large number of image tokens, severely limits inference efficiency. While existing training-free methods reduce token counts to accelerate inference, they often struggle to preserve model performance. This trade-off between efficiency and accuracy poses the key challenge in accelerating LVLM inference without retraining. In this paper, we analyze the rank of attention matrices across layers and discover that image token redundancy peaks in two specific LVLM layers: many tokens convey nearly identical information, yet still participate in subsequent computations. Leveraging this insight, we propose Ctp2Fic, a new two-stage coarse-to-fine token compression framework. Specifically, in the Coarse-grained Text-guided Pruning stage, we dynamically assign a weight to each visual token based on its semantic relevance to the input instruction and prune low-weight tokens that are unrelated to the task. During the Fine-grained Image-based Clustering stage, we apply a lightweight clustering algorithm to merge semantically similar tokens into compact, representative ones, thus further reducing the sequence length. Our framework requires no model fine-tuning and seamlessly integrates into existing LVLM inference pipelines. Extensive experiments demonstrate that Ctp2Fic outperforms state-of-the-art acceleration techniques in both inference speed and accuracy, achieving superior efficiency and performance without retraining.
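A small sketch of the coarse-to-fine compression pipeline: image tokens are scored by similarity to the instruction's text tokens and low-scoring ones are pruned, then the survivors are merged by a lightweight k-means-style clustering step. The scoring rule, keep ratio, and merge strategy are assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: coarse text-guided pruning followed by fine-grained clustering.
import torch
import torch.nn.functional as F

def compress_image_tokens(image_tokens, text_tokens, keep_ratio=0.5,
                          num_clusters=64, iters=5):
    # Stage 1: score each image token by its best cosine similarity to any
    # text token of the instruction, and keep only the top fraction.
    sim = F.normalize(image_tokens, dim=-1) @ F.normalize(text_tokens, dim=-1).t()
    scores = sim.max(dim=1).values
    keep = scores.topk(int(len(image_tokens) * keep_ratio)).indices
    kept = image_tokens[keep]

    # Stage 2: lightweight clustering; merge each cluster into its centroid.
    num_clusters = min(num_clusters, len(kept))
    centroids = kept[torch.randperm(len(kept))[:num_clusters]]
    for _ in range(iters):
        assign = torch.cdist(kept, centroids).argmin(dim=1)
        for c in range(num_clusters):
            members = kept[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    return centroids            # compact, representative visual tokens
```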
{"title":"Ctp2Fic: From coarse-grained token pruning to fine-grained token clustering for LVLM inference acceleration","authors":"Yulong Lei , Zishuo Wang , Jinglin Xu , Yuxin Peng","doi":"10.1016/j.displa.2026.103360","DOIUrl":"10.1016/j.displa.2026.103360","url":null,"abstract":"<div><div>Large Vision–Language Models (LVLMs) excel in multimodal tasks, but their high computational cost, driven by the large number of image tokens, severely limits inference efficiency. While existing training-free methods reduce token counts to accelerate inference, they often struggle to preserve model performance. This trade-off between efficiency and accuracy poses the key challenge in accelerating Large Vision–Language Model (LVLM) inference without retraining. In this paper, we analyze the rank of attention matrices across layers and discover that image token redundancy peaks in two specific VLM layers: many tokens convey nearly identical information, yet still participate in subsequent computations. Leveraging this insight, we propose Ctp2Fic, a new two-stage coarse-to-fine token compression framework. Specifically, in the Coarse-grained Text-guided Pruning stage, we dynamically assign a weight to each visual token based on its semantic relevance to the input instruction and prune low-weight tokens that are unrelated to the task. During the Fine-grained Image-based Clustering stage, we apply a lightweight clustering algorithm to merge semantically similar tokens into compact, representative ones, thus further reducing the sequence length. Our framework requires no model fine-tuning and seamlessly integrates into existing LVLM inference pipelines. Extensive experiments demonstrate that Ctp2Fic outperforms state-of-the-art acceleration techniques in both inference speed and accuracy, achieving superior efficiency and performance without retraining.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103360"},"PeriodicalIF":3.4,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146037276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MeAP: dual level memory strategy augmented transformer based visual object predictor
Shiliang Yan, Yinling Wang, Dandan Lu, Min Wang
Pub Date: 2026-01-20. DOI: 10.1016/j.displa.2026.103356. Displays, Volume 92, Article 103356.
Identifying and resolving persistent noise incursions within tracking sequences, especially occlusion, illumination variations, and fast motion, has garnered substantial attention for its role in enhancing the accuracy and robustness of visual object trackers. However, existing visual object trackers equipped with template updating mechanisms or calibration strategies rely heavily on time-consuming historical data to achieve optimal tracking performance, impeding their real-time tracking capabilities. To address these challenges, this paper introduces MeAP, a long-short-term, dual-level, memory-augmented, transformer-based visual object predictor. The key contributions of MeAP can be summarized as follows: 1) a noise model for specific incursion events, formulated from their effects and paired with corresponding template strategies, serves as the foundation for more efficient memory utilization; 2) a memory exploration scheme, built on an online mask-based feature extraction strategy and a transformer architecture, is introduced to mitigate the impact of noise incursion during memory vector construction; 3) a memory utilization scheme, combining basic target features with a dual-feature target mask predictor, exploits scene-edge features for mask-based feature extraction and jointly predicts the accurate location of the tracking target. Extensive experiments on the OTB100, NFS, VOT2021, and AVisT benchmarks demonstrate that MeAP, with its introduced modules, achieves tracking performance comparable to other state-of-the-art (SOTA) trackers while operating at an average speed of 31 frames per second (FPS) across the four benchmarks.
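A hedged sketch of a dual-level memory of the kind described above: a short-term FIFO buffer stores recent target features, a long-term slot is refreshed only from high-confidence frames, and both are fused with the current search feature by cross-attention. Capacities, the confidence gate, and the fusion layer are illustrative assumptions, not the authors' design.

```python
# Hedged sketch of a long-/short-term memory read-write cycle for tracking.
import collections
import torch
import torch.nn as nn

class DualLevelMemory(nn.Module):
    def __init__(self, dim=256, short_capacity=8, confidence_gate=0.8):
        super().__init__()
        self.short = collections.deque(maxlen=short_capacity)   # short-term FIFO
        self.long = None                                         # long-term slot
        self.confidence_gate = confidence_gate
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def update(self, target_feat, confidence):                   # target_feat: (B, dim)
        self.short.append(target_feat.detach())
        if confidence > self.confidence_gate:    # only trusted frames reach long-term memory
            self.long = target_feat.detach() if self.long is None \
                else 0.9 * self.long + 0.1 * target_feat.detach()

    def read(self, search_feat):                                  # search_feat: (B, N, dim)
        # Assumes update() has been called at least once before reading.
        items = list(self.short) + ([self.long] if self.long is not None else [])
        memory = torch.stack(items, dim=1)                        # (B, M, dim)
        fused, _ = self.fuse(search_feat, memory, memory)
        return fused
```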
{"title":"MeAP: dual level memory strategy augmented transformer based visual object predictor","authors":"Shiliang Yan, Yinling Wang, Dandan Lu, Min Wang","doi":"10.1016/j.displa.2026.103356","DOIUrl":"10.1016/j.displa.2026.103356","url":null,"abstract":"<div><div>The exploration and resolution of persistent noise incursions within the tracking sequences, especially the occlusion, illumination variations, and fast motion, have garnered substantial attention for their functional properties in enhancing the accuracy and robustness of visual object trackers. However, existing visual object trackers, equipped with template updating mechanisms or calibration strategy, heavily rely on time-consuming historical data to achieve optimal tracking performance, impeding their real-time tracking capabilities. To address these challenges, this paper introduces a long-short term dual level memory augmented transformer structure aided visual object predictor (MeAP). The key contributions of MeAP can be summarized as follows: 1) the formulation of a noise model for specific invasion events based on incursion effects and corresponding template strategies serving as the foundation for more efficient memory utilization; 2) The memory exploration scheme based online tracking mask-based feature extraction strategy and the transformer architecture is introduced to mitigate the impact of noise invasion during memory vector construction; 3) the memory utilization scheme based target basic feature and dual feature target mask predictor is provided to implement the scene-edge feature for mask-based feature extraction method and jointly predict the accurate location of the tracking target.. Extensive experiments conducted on OTB100, NFS, VOT2021, and AVisT benchmarks demonstrate that MeAP, with its introduced modules, achieves comparable tracking performances against other state-of-the-art (SOTA) trackers, and operates at an average speed of 31 frames per second (FPS) across 4 benchmarks.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103356"},"PeriodicalIF":3.4,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146077224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interactive feature pyramid network for small object detection in UAV aerial images
Jinhang Zhang, LiQiang Song, Min Gao, Wenzhao Li, Zhuang Wei
Pub Date: 2026-01-17. DOI: 10.1016/j.displa.2026.103352. Displays, Volume 92, Article 103352.
The high prevalence of small objects in aerial images presents a significant challenge for object detection tasks. In this paper, we propose the Interactive Feature Pyramid Network (IFPN) specifically for small object detection in aerial images. The IFPN architecture comprises an Interactive Channel-Wise Attention (ICA) module and an Interactive Spatial-Wise Attention (ISA) module. The ICA and ISA modules facilitate feature interaction across multiple layers, thereby mitigating semantic gaps and information loss inherent in traditional feature pyramids, and effectively capturing the detailed features essential for small objects. By incorporating global contextual information, IFPN enhances the model’s ability to discern the relationship between the target and its surrounding context, particularly in scenarios where small objects exhibit limited features, thereby significantly improving the accuracy of small object detection. Additionally, we propose an Attention Convolution Module (ACM) designed to furnish high-quality feature bases for IFPN during its early stages. Extensive experiments conducted on aerial image datasets demonstrate the effectiveness and superiority of IFPN in detecting small objects in aerial images.
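A minimal sketch of cross-level channel interaction in the spirit of the ICA module: globally pooled descriptors from all pyramid levels are concatenated and used to predict per-level channel weights, so each level is recalibrated with context from the others. Channel counts and the MLP shape are assumptions, not the published module.

```python
# Hedged sketch of interactive channel-wise attention across pyramid levels.
import torch
import torch.nn as nn

class InteractiveChannelAttention(nn.Module):
    def __init__(self, channels=256, num_levels=4, reduction=4):
        super().__init__()
        self.num_levels = num_levels
        self.mlp = nn.Sequential(
            nn.Linear(channels * num_levels, channels * num_levels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels * num_levels // reduction, channels * num_levels),
            nn.Sigmoid(),
        )

    def forward(self, features):                             # list of (B, C, Hi, Wi) maps
        pooled = [f.mean(dim=(2, 3)) for f in features]      # global descriptor per level
        joint = torch.cat(pooled, dim=1)                      # (B, C * num_levels)
        weights = self.mlp(joint).chunk(self.num_levels, dim=1)
        # Recalibrate each level's channels using context from all levels.
        return [f * w[:, :, None, None] for f, w in zip(features, weights)]
```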
{"title":"Interactive feature pyramid network for small object detection in UAV aerial images","authors":"Jinhang Zhang, LiQiang Song, Min Gao, Wenzhao Li, Zhuang Wei","doi":"10.1016/j.displa.2026.103352","DOIUrl":"10.1016/j.displa.2026.103352","url":null,"abstract":"<div><div>The high prevalence of small objects in aerial images presents a significant challenge for object detection tasks. In this paper, we propose the Interactive Feature Pyramid Network (IFPN) specifically for small object detection in aerial images. The IFPN architecture comprises an Interactive Channel-Wise Attention (ICA) module and an Interactive Spatial-Wise Attention (ISA) module. The ICA and ISA modules facilitate feature interaction across multiple layers, thereby mitigating semantic gaps and information loss inherent in traditional feature pyramids, and effectively capturing the detailed features essential for small objects. By incorporating global contextual information, IFPN enhances the model’s ability to discern the relationship between the target and its surrounding context, particularly in scenarios where small objects exhibit limited features, thereby significantly improving the accuracy of small object detection. Additionally, we propose an Attention Convolution Module (ACM) designed to furnish high-quality feature bases for IFPN during its early stages. Extensive experiments conducted on aerial image datasets attest to the effectiveness and sophistication of IFPN for detecting small objects within aerial images.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"92 ","pages":"Article 103352"},"PeriodicalIF":3.4,"publicationDate":"2026-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146037273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}